Making systems 3: Development
IX   Design
Chapter 46   Safety and security design

46.1 Introduction

This chapter covers how one goes about designing safety and security into a system.

Almost every system I have worked on was built, at least in part, on the assumption that the design is correct as long as the system performs the functions people want, and as long as people use the system as expected.

This assumption is wrong, and the systems built this way are not designed correctly.

A correct system design is one that performs its desired behaviors and that does not perform other, undesired behaviors. This second part is just as important as the first.

The process of designing a system, as I have discussed in previous chapters, involves determining the system’s purpose in terms of the needs the system should meet, and the functions by which it can do so. Defining what the system should not do is equally important.

Safety and security are two of the common labels for things that a system should not do. They are two different aspects of the same concept, as I will discuss. Both involve defining harms that the system should not cause. A harm is the mirror image of a need: it is something that stakeholders want for the system not to do.

In this chapter, I discuss what safety and security design is and the general approach for achieving it. This is only an overview, which illustrates how this aspect of design is related to the overall work of building a system. I strongly recommend that those who actually design or analyze system safety—which includes most of the people working on systems projects—pursue this topic with more specialized sources. (I have had the most success following the STPA methodology [Leveson11].)

46.2 Why

Complex systems are usually high-value systems; otherwise, why go to the effort to make them? These systems often also have real-world consequences when there are problems with them, whether malfunctions or just outages. The liabilities, both economic and moral, can be large when such systems misbehave.

This implies that the systems must minimize how and how often they misbehave. This does not happen accidentally; it requires deliberate design.

The least costly time to address how to keep a system from misbehaving is early in the project. The most expensive is after a system has been implemented, deployed, and actually malfunctions: not only is there the cost of the malfunction, but the pressure to address the problem is high (making it difficult to make careful design choices) and the effects of making changes are hard to fully work out.

Safety and security are not isolated parts of the system-building work. They are woven into every step of the work, from understanding a system’s purpose through design and implementation through deployment and operation.

46.3 What safety and security are

Building safety and security into a system is, first, defining harms that the system must avoid, and, second, designing features and behaviors in the system to eliminate, reduce, or mitigate those harms.

In general terms, a harm is an anti-need. The needs a system addresses are things that will produce value for stakeholders. Harms are things that will reduce value for stakeholders. Colloquially, a harm is something that people want not to happen.

The common conception is that safety is about avoiding injury to people, other living things, or property, and that security is about avoiding disclosure of or damage to information. These are all valid harms, but there are usually far more that must be considered.

Many harms are defined by external policy or regulation. In aviation, for example, there are defined maximum acceptable rates for collisions between aircraft, or accidents causing human injury or death. For spacecraft in Earth orbit, there are regulations related to collisions on orbit and damage caused by re-entering debris. Financial transactions must comply with security standards to protect payment information.

Some of those policies or regulations focus on the means of protection rather than on the harms to be prevented. Even so, the harms themselves are usually referenced, or can be deduced from the mechanisms.

Stakeholders are usually concerned about more than just the harms defined by external policy. For example, many aviation safety regulations are concerned with human injury or death, but an aircraft manufacturer is concerned about the operational cost of damaged aircraft, or the reputational harm that comes from an accident even if no one is injured.

Other harms are defined only by stakeholders. The value of intellectual property, for example, varies widely, and each stakeholder may have to define their specific needs. The kinds of property harm that matter depend on the kind of system and where it will be used. For example, one recent system I worked on defined five kinds of information with a different level of confidentiality need for each, ranging from publicly visible to restrictions on which parties could know about it.

Once the harms are defined, the next step is to use that information in just the same way stakeholder needs are used to build a good system. Safety and security affect the system concept; they are incorporated into system and component specifications, which lead in turn to aspects of designs; safety and security properties are verified as the system is implemented.

46.4 What they are not

Safety and security are not about jumping in during design or implementation and checking for certain classes of problems. I was once in a meeting at a major aircraft manufacturer where their staff, who had recently worked on one of their new major passenger aircraft, defined security as performing certain software checks on their flight systems. (Two days later I was flying home on one of those aircraft and was not reassured.)

Addressing security is not about adding an authentication and encryption protocol to a software system and declaring the system secure (though those can be a part of making a secure system).

Safety is not about just training operating staff on how to use part of a system, though again that can be part of a safety solution.

Safety and security are properties of a whole system. They are emergent properties that arise only when many different parts of the system are deliberately designed so that those parts combine properly.

A safe or secure system requires developing evidence that the behaviors and features designed into the system are sufficient to avoid a well-defined set of harms. Claims based on one design aspect, such as data encryption, provide no evidence on their own that the system is secure; only analysis that connects the mechanisms to the complete set of harms is meaningful.

Sidebar: Safety versus security

Safety and security are largely similar: both are about defining possible harms and avoiding them.

However:

  • Safety is about inadvertent accidents, where the harm is caused either by events within the system or unintentional external events.
  • Security is about harms caused by malicious actions by an intelligent actor (the attacker), whether that actor is inside or outside the system.
  • Security must assume that some attackers are sophisticated (but not omnipotent).
  • Safety can often address harms that may occur repeatedly, so one can define a safety objective that keeps the rate of some harm below an acceptable threshold.
  • Security, and some kinds of safety, generally address harms that are sufficiently serious that even one occurrence is too many.

In many cases it is difficult to distinguish whether some hazardous situation is a safety matter, a security matter, or a combination. While there are differences between how one analyzes and defends against a security problem and a safety problem, the two often end up being combined.

46.5 Objectives of safety and security design

When analyzing and designing a system for safety or security, there are a few objectives to the effort.

While the first objective is obvious, the other three may not be.

The system will change eventually. When it does, people who did not work on the original system design may be doing the work, or enough time may have passed that people no longer remember the details of how safety and security are designed into the system, especially the subtle details.

Providing these people with a record of why and how the system is safe will help them make changes without breaking safety. If there has been an accident, this record will help them track down what didn’t work as designed and provide them a foundation on which to design and implement improvements.

46.6 Analysis process

Stripping away the details of specific methodologies, safety or security analysis and design has five steps:

  1. Identify harms.
  2. Define accidents based on harms.
  3. Identify the hazards that can lead to those accidents.
  4. Address each hazard.
  5. Update design and specifications.
Sidebar: Safety and security terms

Harm: A kind of undesired effect; a kind of loss.

Accident: An undesired and unplanned event that results in harm or loss. (Some standards, such as ICAO Annex 13, use the term incident as the general term [ICAO20].)

Hazard: A combination of a system state or condition and environmental conditions that will lead to an accident. (Also: threat.)

Risk: An evaluation of the seriousness or importance of a hazard, based on its likelihood of leading to an accident and severity of the harm caused.

Actor: Some entity that can participate in a hazard or an accident.

Attacker: An actor that can create a hazard intentionally. (Also: threat actor.)
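The relationships among these terms can be made concrete as a small data model. This is only a sketch; the class names, fields, and the drone example values are my own illustration, not part of any standard.

```python
from dataclasses import dataclass, field


@dataclass
class Harm:
    """A kind of undesired effect or loss, ranked by seriousness."""
    description: str
    seriousness: int  # larger = more serious; the ranking is a stakeholder decision


@dataclass
class Accident:
    """An undesired, unplanned event that results in one or more harms."""
    description: str
    harms: list[Harm]


@dataclass
class Hazard:
    """A system state plus environment conditions that can lead to an accident."""
    description: str
    accident: Accident
    sub_hazards: list["Hazard"] = field(default_factory=list)


# Example values drawn from the chapter's drone traffic management system.
collision = Harm("Two or more drones authorized in overlapping airspace", 10)
midair = Accident("Mid-air collision between drones", [collision])
overlap = Hazard("Deconfliction service grants overlapping volumes", midair)
```

The point of the model is the direction of the links: hazards point at the accident they can cause, and accidents point at the harms they produce, mirroring the analysis steps below.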

46.6.1 Adversarial mindset

Before going into details, it is worth discussing the attitude that one must bring to the task of designing and analyzing safety or security.

Designing safety is about looking for any and every way that the system could cause harm. This is an uncomfortable and unfamiliar mindset for many people who are used to building systems, who are usually looking to show that what they are designing or implementing performs the functions it is supposed to, that is, for the presence of desired functions.

Safety and security are about developing evidence of the absence of harms.

One of the main ways to develop this evidence is to find all the scenarios where a harm could happen and trace out what could cause those scenarios. This requires being thorough and imaginative about ways things could go wrong.

Completeness matters for this exercise. For security, leaving even one way to attack a system is enough to let someone compromise it. For safety, eliminating one hazard may reduce the number of accidents, but even one accident that injures or kills people is usually too many. This puts a premium on meticulous analysis to find all these ways.

Safety and security analysis is also often an exercise in identifying and countering confirmation bias (Section 59.2). People generally look for evidence that the system works, not for evidence that it avoids doing what it should not.

In the Inquiry Board report for the failure of the first Ariane 5 launch, the committee wrote [ESA96, p. 6]:

An underlying theme in the development of Ariane 5 is the bias toward mitigation of random failure. […] The exception which occurred was not due to a random failure but a design error. The exception was detected, but inappropriately handled because the view had been taken that software should be considered correct until it is shown to be at fault. […] The Board is in favour of the opposite view, that software should be assumed to be faulty until applying the currently accepted best practice methods can demonstrate that it is correct.

I will discuss methodologies for how to organize the analysis and design activities so that they can be as complete and accurate as possible (Section 46.6.8).

46.6.2 Identify harms

Identifying harms is listing the harms that the system should avoid causing. Some of these harms are defined by regulation or by organizational policies, as noted earlier.

In some cases, defining the list of harms is not a technical decision within the authority of the team designing the system. Deciding which harms to include in safety or security analysis is fundamentally a business or social policy decision. It is part of how an organization determines how it should position its products compared to other products, and how it should respond to societal expectations. I have found it difficult to get company leadership to take responsibility for these decisions; in multiple cases the leadership was interested in questions of market fit and profitability but did not want to be bothered with setting company policy about product safety. Be prepared for this to be a difficult conversation inside some companies.

The majority of harms are defined implicitly by the system’s purpose. For example, in one recent system for air traffic management the primary purpose was to “deconflict drone flights so that no two drones would be operating in the same airspace volume at any given time”. The system provided deconfliction in order to minimize the number of mid-air collisions between drones. The harm for this system was therefore having “two or more drones authorized to operate in overlapping airspace volumes”.

In another system, related to autonomous road vehicles, there were several harms: injury or death to people, property damage outside the vehicle, damage to the vehicle, damage to property in the vehicle, and interfering with road usage.

Harms are usually ranked in order of seriousness. Injury or death of a person is typically more serious than a scratch on a vehicle, for example. Having some understanding of the relative seriousness of two harms becomes important when one must choose between avoiding one harm or the other—for example, choosing to risk scratching a vehicle in order to avoid colliding with a pedestrian. Sometimes the seriousness of a harm depends on its scale: injuring one person is less serious than injuring thousands in one event. Many ethical dilemmas arise when trying to rank harms.

46.6.3 Define accidents

The next step is to define accidents based on harms. Accidents are the events when harms occur. In the example air traffic management system the accident is simply that the harm occurs. In the autonomous road vehicle example, the list of accidents was long and included things like:

46.6.4 Identify hazards

The third step is to identify the hazards that can lead to each of the accidents, that is, the list of situations that can lead to the accidents. This list is typically long; there are usually many ways that an accident can happen.

In the autonomous road vehicle example, a vehicle might collide with road infrastructure:

46.6.5 Risk analysis and prioritization

Complex systems usually lead to a large number of hazards. Tackling all of them at once is not feasible; the team has only so many people who can work on them, and addressing one hazard (discussed in the next section) can interact with another hazard or create new hazards.

The better alternative is to tackle only a few hazards at a time, with a small enough team that they can maintain a common understanding of their work.

Most safety design methodologies include a risk analysis, where each hazard is given some kind of risk score based on how serious the harms are or how likely they are to occur. The team can then focus on the hazards with the highest risk scores first, working their way down to less potentially serious hazards over time.

Some hazards will not be worth addressing. A hazard that can lead to mildly annoying some users, for example, may not be worth much effort, or addressing it might be deferred to a later version. While the goal should be to build as good a system as possible, limits on time and resources mean it will not be perfect.
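As an illustration of this kind of prioritization, here is a minimal sketch. The hazards, the 1-to-5 severity and likelihood scales, and the deferral threshold are all invented for the example; real programs define their own scales.

```python
# Rank hazards by a simple risk score: severity x likelihood.
# Scales (1-5) and the deferral threshold are invented for illustration.
hazards = [
    ("Vehicle collides with pedestrian", 5, 2),
    ("Vehicle scratched in parking",     1, 4),
    ("Loss of lane keeping on ice",      4, 2),
    ("UI mildly annoys some users",      1, 1),
]

scored = sorted(
    ((sev * lik, desc) for desc, sev, lik in hazards), reverse=True
)
DEFER_BELOW = 2  # hazards scoring under this may wait for a later version
for score, desc in scored:
    status = "address now" if score >= DEFER_BELOW else "defer"
    print(f"{score:2d}  {status:11s} {desc}")
```

The team works down the sorted list a few hazards at a time, and anything under the threshold is explicitly recorded as deferred rather than silently dropped.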

46.6.6 Address hazards

The final step is to determine how to address the hazards. This can be done by (in decreasing order of preference): eliminating the hazard, making the resulting accident less likely or less serious, limiting the harm caused, or at least detecting the harm and recovering from it.

The best option is to make an accident impossible. If that is not feasible, then making the accident less likely or less serious is next best. If no limitation is feasible, then the minimum is to ensure that a harm can be detected and that there is a process for recovering from the problem.

Consider the hazard identified earlier for an autonomous road vehicle: a collision with road infrastructure because the road was sufficiently slippery that the vehicle could not follow a path that avoided the infrastructure. This hazard can be broken down into sub-hazards: the road being slippery because of ice, and the road being slippery because of water.

The first sub-hazard can be (almost) eliminated by mandating that the vehicle never operate in temperatures below 10 °C, or when there are reports of remaining ice in a region. If the vehicle does not operate when there might be ice, then the first sub-hazard is not possible.

This hazard is only almost eliminated because the mitigation strategy depends on weather and temperature information. If that information is wrong, then the prevention mechanism will not be accurate and the vehicle might encounter ice anyway. This might happen if weather sensors malfunction, if communication about weather forecasts is disrupted, or if there are no recent reports about road conditions. These create third-level hazards, such as: the road is slippery because of ice, and the temperature information sources have malfunctioned, leading to the vehicle not following its planned path.

This recursion continues until one can provide evidence that the remaining hazards are unlikely enough.

The second hazard can be made less likely by disallowing the vehicle from operating when the chances of rain are above some threshold. However, this does not eliminate the hazard: rain can occur without much warning, and water could be on the road for other reasons like dew in the morning or water flowing across a road from a roadside water leak. The rain-related sub-hazards might be eliminated by including precipitation sensors on the vehicle; water from other sources might be detected by sensors that could evaluate the quality of the road surface. Each of these mechanisms can result in residual hazards that must be addressed.
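The ice and water mitigations above can be sketched as an operational guard. This is only an illustration: the function name, thresholds, and inputs are assumptions. Note that missing information fails safe, reflecting the third-level hazards around malfunctioning information sources.

```python
from typing import Optional

MIN_TEMP_C = 10.0     # illustrative: no operation below this, per the ice mitigation
MAX_RAIN_PROB = 0.3   # illustrative threshold for likely water on the road


def may_operate(temp_c: Optional[float],
                ice_reported: Optional[bool],
                rain_prob: Optional[float]) -> bool:
    """Return True only when weather data is present and within limits.

    Missing data (None) fails safe: if we cannot rule out ice or rain,
    the vehicle does not operate.
    """
    if temp_c is None or ice_reported is None or rain_prob is None:
        return False  # malfunctioning information sources: fail safe
    if temp_c < MIN_TEMP_C or ice_reported:
        return False  # first sub-hazard: ice
    if rain_prob > MAX_RAIN_PROB:
        return False  # second sub-hazard: water on the road
    return True
```

The guard does not eliminate the residual hazards (on-vehicle sensing would still be needed), but it makes the mitigation, and its dependence on external information, explicit and testable.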

46.6.6.1 Conflicts

The safety or security objectives can conflict with each other. The first step is to find where the conflicts are; then one must choose which objective to satisfy.

Conflict is built into some hazards. Some situations require choosing between two potential harms, such as in the case listed earlier when an autonomous vehicle must collide with a human or with road infrastructure but cannot avoid both.

In some cases a means of addressing one hazard causes a different hazard or makes another one more likely. Leveson uses the example of doors on a passenger train. The doors should only open when the train is stopped at a station platform, so that people are not injured by falling from a moving train or by falling down to the track side from a stopped train. At the same time, the doors must open in an emergency in order to avoid injuring people by trapping them on board a damaged train car [Leveson11, p. 190]. An interlock that prevents the doors from opening addresses the first hazard but can cause the second hazard, and vice versa.
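The door example can be made concrete as interlock logic in which the emergency case deliberately overrides the normal interlock. This is a sketch only; its value is that the precedence choice between the two hazards is explicit in one place, rather than an accident of the code.

```python
def doors_may_open(stopped: bool, at_platform: bool, emergency: bool) -> bool:
    """Door interlock sketch for the passenger-train example.

    Normal case: open only when stopped at a platform (prevents falls).
    Emergency case: open regardless, so passengers are not trapped.
    The emergency override is a deliberate, documented design choice:
    trapping people in a damaged car is judged the worse harm.
    """
    if emergency:
        return True
    return stopped and at_platform
```

Documenting this choice next to the logic gives later maintainers the rationale they need before changing either branch.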

In the end, I know of no simple rule to resolve these conflicts. The choice can often be guided by the relative seriousness of the alternative harms. However, in many other cases the choice requires careful design judgment.

The way that conflicts are resolved should always be well documented. These choices are likely to be revisited as a system evolves, or when someone analyzes an accident to understand how to improve the system.

46.6.6.2 Alternate mitigations

Often there will be multiple ways to address a hazard, and one has to choose which way to use.

The choice can be influenced by:

Sometimes addressing one hazard will also address another hazard.

In some cases it makes sense to adopt two approaches that complement each other. That way if one of the approaches fails to work, the other will still keep the system safe or secure.

Again, documenting the choice and the rationale for why that choice was made is important for people who have to come later to understand what was done.

46.6.7 Update design and specifications

After identifying harms, hazards, and how to address them, these choices feed back into other work. As I will discuss in Section 46.8 below, this includes updating the concept, specification, and design of parts of the system.

46.6.8 Methodologies and completeness

Finding and fixing all the ways that a system can result in harm is one of the challenges in building a safe and secure system. Leaving one attack vector unidentified and unaddressed means the system is vulnerable to attack. Missing one major safety hazard means the system can result in an accident.

Of course, perfection is not generally possible, especially since some harms or threats might not yet be known. Identifying and addressing as many hazards as possible is still far better than doing nothing.

There are several methodologies for addressing safety and security harms. These methodologies all aim to provide a systematic process that will identify and resolve all significant hazards when used properly. In practice most methodologies have limitations that one should understand.

I have had the most success using the System-theoretic Process Analysis (STPA) methodology [Leveson11]. This methodology takes a top-down, control-theoretic approach to identifying and addressing hazards. It takes the viewpoint that safety and security are emergent properties of the system as a whole (Section 12.4), and examines whether the interactions among a set of components at one level of the system will produce those emergent properties. In doing so it can address all the hazards that can be identified by the other methodologies I will list, while also dealing with classes of hazards that those others cannot.

Before STPA, the fault tree analysis (FTA) and failure mode and effects analysis (FMEA) methodologies were most commonly used. Both these methodologies focus on failure rather than harm, and therefore cannot identify or address accidents like component interaction accidents (see [Leveson11, p. 8]) where every component functions per specification but the interaction between the components nonetheless causes an accident.

FTA is a top-down methodology that starts with an identified hazard and works out what events or conditions might lead to that hazard. It is a useful way to organize identifying how one function depends on another. In one project, I was analyzing the conditions under which an autonomous vehicle could fail to follow guidance commands. This failure could happen if the vehicle throttle failed, its brakes failed, the steering function failed, or the low-level control electronics failed. I could then take each of those subsystems and identify how they could fail: steering could fail if the ability to communicate with the steering column failed, or if the steering column failed mechanically, and so on. I could then use models of failure rates to estimate which failure modes were the largest contributors to system failure.

The FTA approach has the benefit that it organizes the process of discovering relationships among functions: function A in component B depends on function C in that same component and on function D in component E. It is useful for estimating the relative contribution to high-level failure rates, which can be used to guide where effort should be put in addressing problems. It has the downside that by itself it does not guide one toward identifying and addressing interaction problems rather than failures. The STPA approach in some sense builds on the FTA approach.
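A minimal fault-tree computation in this spirit might look like the following. The tree structure and failure probabilities are invented for illustration, and the OR-gate arithmetic assumes independent basic events, which a real analysis must justify.

```python
# Basic-event failure probabilities per mission (illustrative numbers).
p = {
    "throttle": 1e-5,
    "brakes": 2e-5,
    "steering_comms": 5e-6,
    "steering_mech": 1e-6,
    "control_electronics": 3e-6,
}


def p_or(*probs: float) -> float:
    """OR gate: the event occurs if any input occurs (assumes independence)."""
    q = 1.0
    for pr in probs:
        q *= (1.0 - pr)
    return 1.0 - q


# Steering fails if its communications OR its mechanism fails.
p_steering = p_or(p["steering_comms"], p["steering_mech"])

# Guidance-following fails if any subsystem fails.
p_top = p_or(p["throttle"], p["brakes"], p_steering, p["control_electronics"])

# Relative contribution of each subsystem guides where to put design effort.
for name, prob in [("throttle", p["throttle"]), ("brakes", p["brakes"]),
                   ("steering", p_steering),
                   ("control electronics", p["control_electronics"])]:
    print(f"{name:20s} {prob:.1e}  ({prob / p_top:5.1%} of top-event rate)")
```

With these invented numbers the brakes dominate the top-event rate, which is exactly the kind of signal used to decide where to spend failure-tolerance effort.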

FMEA is a bottom-up methodology for identifying the effects of a fault in one component. It is the basis for some safety lifecycle standards, including ISO 26262 [ISO26262]. The methodology focuses on one component, identifies ways that the component can fail, and then projects the effects of that failure outward into the system. The importance of addressing a component’s failure mechanism can then be based on the system effect.

The FMEA approach has the advantage of using specialist understanding of a component’s design to guide design effort. It is often used to set the amount of effort to be put into failure tolerance; for example, in automotive electronic systems an “ASIL D component” is one intended for the most failure-critical uses. The methodology is of little use for making system-level safety or security cases.
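The bookkeeping behind an FMEA worksheet can be sketched with the commonly used risk priority number (severity × occurrence × detection, each scored 1 to 10). The component and its scores here are invented for illustration.

```python
# FMEA rows for a hypothetical wheel-speed sensor: one row per failure mode.
# Severity, occurrence, and detection are scored 1-10 (higher = worse);
# RPN = S x O x D is a common way to rank which modes to address first.
rows = [
    # (failure mode,            S, O, D)
    ("Output stuck at zero",    7, 3, 2),
    ("Intermittent dropout",    6, 5, 6),
    ("Slow drift out of spec",  5, 4, 8),
]

ranked = sorted(rows, key=lambda r: r[1] * r[2] * r[3], reverse=True)
for mode, s, o, d in ranked:
    print(f"RPN {s * o * d:3d}  {mode}")
```

Note how the hard-to-detect modes float to the top even when their severity is lower; that is the bottom-up, component-centric view the methodology provides, and also why it cannot by itself make a system-level case.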

46.7 Common approaches

46.7.1 Identifying hazards

Identifying all the hazards in a system is key to making the system safe or secure, but doing so is something of an art. I learned how to do this by doing it: haphazardly and not very well early on, getting better over the years. That is not the best way to learn.

There are several methodologies that are used to identify hazards. It is worth learning about them, because each one has useful techniques. The Leveson book on STPA [Leveson11] in particular provides a good discussion of the overall task of building safe or secure systems.

However, no methodology ever substitutes for good judgment; methodology can inform judgment but the people building the system are still responsible for safety.

The basic technique I have learned for identifying hazards is as follows.

  1. I start with the harms and try to imagine all the ways those harms could happen.
  2. After this first pass at identifying hazards, I plot out how the system functions when it is behaving as desired, then enumerate all the lower-level functions that contribute to this desired emergent behavior.
  3. Next, I look at things that could interfere with these lower-level functions.

For identifying hazards, I often try to do this in a somewhat organized way. For example, with an autonomous road vehicle I go through one drive: getting into the vehicle, starting it up, getting it onto the road, driving along the road, and so on. In each phase of operation I try to find all of the ways each harm could happen. I also try to find all the operations that the vehicle will do; for example, a vehicle will undergo maintenance and it will remain parked somewhere for a period.

I also try to find similar systems or similar environments and discover what hazards have been observed in them. Battery problems in electric road vehicles, for example, are a well-known cause for accidents.

Note that, for complex systems, identifying all the significant hazards will likely not be complete after one short exercise. My experience is that a concentrated investigation produces many hazards in a short time, but that one discovers more hazards over time, at a rate that generally decreases over time. Continuing to look for potential hazards all through development and after a system is put into service is important.

How I come to understand nominal system behavior is best shown by example. Consider the autonomous road vehicle from before. The harm to be avoided is causing damage or injury by colliding with road infrastructure: roadside barriers or bridge piers, for example. To cause this harm, the vehicle must move in a way that departs its planned lane and intersects the infrastructure.

The vehicle’s normal behavior is to follow an overall path to its destination (which streets, for example). In the short term it develops motion plans for how it should move to follow the road lane. It then manipulates the wheels to exert forces on the vehicle to move it along that plan.

The vehicle has one or more control systems that take in inputs about the desired short-term path, the vehicle’s state, and its environment; compute what forces should be applied to keep the vehicle on its desired path; and output commands to steering, braking, and accelerator actuators to achieve those forces. The control system recomputes these values quite frequently.
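A toy version of such a loop, using a proportional controller on lateral lane error, illustrates the recompute-and-actuate structure. The gain and the vehicle-response constant are invented; real vehicle control involves full state estimation and dynamics models.

```python
def lane_keeping_step(lateral_error_m: float, gain: float = 0.5) -> float:
    """One control-loop iteration: map lane error to a steering command.

    Positive error means the vehicle is right of lane center, so the
    command steers left (negative). The gain is an illustrative constant.
    """
    return -gain * lateral_error_m


# Simulate a few recomputation cycles of the loop.
error = 1.0  # start 1 m right of lane center
for _ in range(5):
    command = lane_keeping_step(error)
    error += command * 0.4  # toy model of vehicle response per cycle
print(f"residual error: {error:.3f} m")
```

Each cycle the controller re-reads the state and issues a fresh command; the hazards discussed below arise when the inputs to this loop, or the model connecting commands to forces, are wrong.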

Each of these functions (inputs, decisions, and resulting forces) depends on lower-level functions.

Following one thread of these, the driving process takes in the vehicle’s state: position, attitude, and velocity. The state might also include the state of braking actuators, tire inflation, steering angle, or current accelerator setting. If any of these are wrong, it can lead the control algorithm to make a wrong decision and guide the vehicle outside the expected path.

The vehicle state information could be out of date (delayed), or missing, or incorrect. It could be missing because:

The state information could be wrong because (among other things):

One can follow this kind of reasoning to find many different ways there can be problems with keeping to a lane.
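Checks against missing, out-of-date, or implausible state can be sketched as a gate ahead of the control computation. The field names and limits here are assumptions for illustration; the point is that rejecting bad state forces an explicit degraded behavior instead of a silent wrong decision.

```python
MAX_AGE_S = 0.1        # illustrative: state older than 100 ms is stale
MAX_SPEED_MPS = 70.0   # illustrative plausibility bound


def state_usable(state, now: float) -> bool:
    """Reject missing, stale, or implausible vehicle state before control.

    Using bad state silently is the hazard; rejecting it forces the
    system into an explicit degraded/recovery behavior instead.
    """
    if state is None or "timestamp" not in state or "speed_mps" not in state:
        return False                              # missing
    if now - state["timestamp"] > MAX_AGE_S:
        return False                              # out of date (delayed)
    if not (0.0 <= state["speed_mps"] <= MAX_SPEED_MPS):
        return False                              # incorrect / implausible
    return True
```

Each rejection branch corresponds to one of the ways state information can go wrong listed above, which keeps the hazard analysis and the code traceable to each other.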

This procedure will produce a large number of hazards. I have found that it is necessary to take care in organizing the procedure and its results. One may well discover the same hazard multiple times.

I have found it helpful to organize hazards hierarchically, starting with whole-system hazards at the top and then recording lower-level hazards that can cause those to occur.
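One simple way to keep such a hierarchy is a tree rendered with stable outline identifiers, so that a hazard rediscovered later can be recognized as a duplicate of an existing entry. This bookkeeping sketch is my own, not part of any methodology.

```python
def outline(hazard, prefix="H"):
    """Render a hazard tree as numbered outline lines (H, H.1, H.1.2, ...)."""
    desc, children = hazard
    lines = [f"{prefix}: {desc}"]
    for i, child in enumerate(children, start=1):
        lines += outline(child, f"{prefix}.{i}")
    return lines


# Whole-system hazard at the top; causes recorded beneath it.
tree = ("Vehicle departs lane toward infrastructure", [
    ("Road more slippery than the controller's model assumes", [
        ("Ice on road while temperature data is wrong", []),
        ("Water on road from rain or other sources", []),
    ]),
    ("Vehicle state information stale or incorrect", []),
])

print("\n".join(outline(tree)))
```

In practice the identifiers also serve as stable references from design documents and mitigation records back to the hazard they address.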

46.7.2 Including implicit relations

Consider the example I have used of the autonomous road vehicle steering itself to follow a lane and thereby avoid collisions with roadside infrastructure. This is done through a classic control system that takes inputs about the vehicle’s location and motion, computes reactions, and sends commands to actuators in response.

When I first sketched this example, I thought of it in those terms; in particular, that the control output is a set of commands to steering and other actuators. When driving, I think in terms of how I need to manipulate the steering or brake pedal to “make the car go where I want”.

However, this isn’t actually quite correct. The steering or braking actuation is not the point of the control system; the point is to use them to create forces that will move the vehicle. If the vehicle were not in contact with the road, the control mechanism could actuate all it wanted and it would not affect how the vehicle moved. Moreover, the control mechanism has within it a model of how actuating steering or brakes will cause desired forces. That model might be implicit, encoded in experimentally-derived parameters within the control calculations, but it nonetheless has that model.

If one doesn’t do the safety analysis in terms of forces changing vehicle motion, then the analysis will miss important hazards: errors in the controller’s model, and cases where the environment does not match assumptions embedded in that model. For example, if the road surface is slippery, it will not provide the friction expected to turn the vehicle. This can lead to the vehicle not turning as quickly as needed, or to it skidding.

Recognizing that these hazards exist can lead to addressing them. I illustrated some of the possible mitigations above.

This is an example of leaving out essential causal relationships when trying to discover hazards. It is worth the time to trace out all the low-level causal steps involved in, for example, getting a vehicle to turn. Each of the steps is a point in the system where something can happen that interferes with keeping the system in its desired state.

46.7.3 Addressing hazards

Hazards, once identified, must be addressed. Section 46.6.6 presented the overall process. It also gave the preferred order for handling hazards: eliminate them, make them unlikely, limit the harm, or at least be able to detect them.

There are some techniques I have used in past systems to address safety or security hazards.

Reduce sources of disagreement. Some hazards arise because there are two sources of information or two sources of control, and they can disagree. Sometimes redundancy is worthwhile, if it can be used to detect or mask problems. However, when that is not the case, it can be better to design away potential disagreement.

As an example, I was working on a design for a small vertical-takeoff aircraft that would interact with services on the ground to maintain separation from other aircraft. Two separate pieces of information were to be sent from ground systems to the aircraft. They were not redundant; they were different kinds of messages, and the aircraft needed both in order to make correct guidance decisions. One engineer on the project wanted to send those messages over different communication paths. The two paths were quite different: one was a nearly direct transmission, while the other was routed through multiple systems before being delivered to the aircraft’s guidance system. Using the two paths, with their differences in message loss and message delay, would have increased the chance that the aircraft would not get one of the messages it needed. Moreover, the aircraft would have had to be designed to handle messages that were out of sync, one arriving much later than the other and containing older information, which increased the guidance software’s complexity. These problems were not necessary: the messages could all have been sent on one path, avoiding most of the problems of disagreement between messages.

Realism.

  • Design the system with realistic expectations of what different parts can actually do
  • Especially important when designing a part of a system that includes people: need to consider their cognitive load, their ability to understand a situation, their ability to detect off-nominal conditions
    • reference decreased situational awareness with digital systems

Contain problems.

  • Defense in depth
  • Convergence and self-stabilization

Be able to work with partial function.

Have a watchdog.

  • Safe mode layering
  • Something that watches normal behavior and kicks in when behavior is out of those bounds
  • Response to bring the system to a safe but simpler state, from which can recover
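The watchdog pattern in these notes can be sketched as follows; the monitored bound, mode names, and crawl speed are invented for illustration.

```python
SAFE_SPEED_MPS = 2.0  # illustrative crawl speed for the degraded mode


class Watchdog:
    """Watch normal behavior; force a simpler safe mode when out of bounds.

    The safe mode is deliberately simpler than normal operation, so it
    can be trusted even when the normal behavior has gone wrong.
    """

    def __init__(self, max_speed_mps: float):
        self.max_speed_mps = max_speed_mps
        self.mode = "normal"

    def check(self, speed_mps: float) -> str:
        if self.mode == "normal" and speed_mps > self.max_speed_mps:
            self.mode = "safe"  # out of bounds: drop to the safe layer
        return self.mode

    def commanded_speed(self, requested_mps: float) -> float:
        if self.mode == "safe":
            return min(requested_mps, SAFE_SPEED_MPS)
        return requested_mps
```

Note that the watchdog latches: once in the safe mode it stays there until an explicit recovery procedure runs, rather than flapping back to normal on the next in-bounds reading.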

Use redundancy when it helps.

  • Redundant defense
    • Redundancy that actually works, unlike Ariane 5 and A330; look for common mode problems
  • Truly different information sources
  • Root out common-cause failures
  • Rational ways to combine multiple information sources; avoid the side-stick problem, or at least make disagreement visible

Isolation.

  • Contain parts of the system to limit their ability to interact
  • Makes reasoning about interactions easier
  • Helps contain problems
  • Example: CPU scheduling; network utilization

Avoid runaway problems.

  • When some problems occur, the system has to use extra resources to deal with the problem and recover from it
  • The extra load can cause more problems to occur, leading to increasing problems and eventual collapse
  • Well known in communication systems, but other domains such as banking exhibit similar behaviors
  • Avoid designs that can have this runaway, or if it is unavoidable, have explicit checks and ways to damp down the runaway, or design in load-shedding
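Explicit load-shedding to damp such a runaway can be sketched as admission control on a work queue; the thresholds are invented for illustration.

```python
from collections import deque

QUEUE_LIMIT = 100  # illustrative: beyond this, recovery work itself overloads us
SHED_TO = 50       # illustrative: shed backlog down to this depth


def admit(queue: deque, request: str) -> bool:
    """Admission control: refuse new work rather than let backlog grow unboundedly.

    Refusing early keeps the system inside a load region where it can
    still make progress, instead of spending all its resources on recovery.
    """
    if len(queue) >= QUEUE_LIMIT:
        # Shed the oldest backlog, and still refuse the new request this
        # cycle so that total load actually falls.
        while len(queue) > SHED_TO:
            queue.popleft()
        return False
    queue.append(request)
    return True
```

The deliberate asymmetry (shed far below the limit before admitting again) provides hysteresis, so the system does not oscillate at the edge of overload.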

Extra margin.

  • Many problems happen because some resource runs out
  • Design extra resource in when possible

Make problems detectable.

  • When possible, explicitly detect when a misbehavior occurs
  • Log it so there’s information for fixing the system to avoid repeats
  • Explicitly design the system to know about problem states and have explicit recovery procedures

46.8 Using safety and security design

Position in work flow

Feedback loop

Use in fixing problems

Use in upgrades

46.9 Artifacts and documents

XXX address threat actors and how that biases security decisions

XXX address challenging every assumption and not leaving connections implicit