Making systems 3: Development
IX   Design
Chapter 46   Safety and security design

46.1 Introduction

This chapter covers how one goes about designing safety and security into a system.

Almost every system I have worked on was built, at least in part, on the assumption that the design is correct as long as the system performs the functions people want, and as long as people use the system as expected.

This assumption is wrong, and the systems built this way are not designed correctly.

A correct system design is one that performs its desired behaviors and that does not perform other, undesired behaviors. This second part is just as important as the first.

The process of designing a system, as I have discussed in previous chapters, involves determining the system’s purpose in terms of the needs the system should meet, and the functions by which it can do so. Defining what the system should not do is equally important.

Safety and security are two of the common labels for things that a system should not do. They are two different aspects of the same concept, as I will discuss. Both involve defining harms that the system should not cause. A harm is the mirror image of a need: it is something that stakeholders want for the system not to do.

In this chapter, I discuss what safety and security design is and the general approach for achieving it. This is only an overview, which illustrates how this aspect of design is related to the overall work of building a system. I strongly recommend that those who actually design or analyze system safety—which includes most of the people working on systems projects—pursue this topic with more specialized sources. (I have had the most success following the STPA methodology [Leveson11].)

46.2 Why

Complex systems are usually high-value systems; otherwise, why go to the effort to make them? These systems often also have real-world consequences when there are problems with them, whether malfunctions or just outages. The liabilities, both economic and moral, can be large when such systems misbehave.

This implies that the systems must minimize how and how often they misbehave. This does not happen accidentally; it requires deliberate design.

The least costly time to address how to keep a system from misbehaving is early in the project. The most expensive is after a system has been implemented, deployed, and actually malfunctions: not only is there the cost of the malfunction, but the pressure to address the problem is high (making it difficult to make careful design choices) and the effects of making changes are hard to fully work out.

Safety and security are not isolated parts of the system-building work. They are woven into every step of the work, from understanding a system’s purpose through design and implementation through deployment and operation.

46.3 What safety and security are

Building safety and security into a system is, first, defining harms that the system must avoid, and, second, designing features and behaviors in the system to eliminate, reduce, or mitigate those harms.

In general terms, a harm is an anti-need. The needs a system addresses are things that will produce value for stakeholders. Harms are things that will reduce value for stakeholders. Colloquially, a harm is something that people want not to happen.

The common conception is that safety is about avoiding injury to people, other living things, or property, and that security is about avoiding disclosure of or damage to information. These are all valid harms, but there are usually far more that must be considered.

Many harms are defined by external policy or regulation. In aviation, for example, there are defined maximum acceptable rates for collisions between aircraft, or accidents causing human injury or death. For spacecraft in Earth orbit, there are regulations related to collisions on orbit and damage caused by re-entering debris. Financial transactions must comply with security standards to protect payment information.

Some of those policies or regulations focus on the means of protection rather than on the harms to be prevented. Even so, the harms themselves are usually referenced, or can be deduced from the mechanisms.

Stakeholders are usually concerned about more than just the harms defined by external policy. For example, many aviation safety regulations are concerned with human injury or death, but an aircraft manufacturer is concerned about the operational cost of damaged aircraft, or the reputational harm that comes from an accident even if no one is injured.

Other harms are defined only by stakeholders. The value of intellectual property, for example, varies widely, and each stakeholder may have to define their specific needs. The kinds of property harm that matter depend on the kind of system and where it will be used. For example, one recent system I worked on defined five kinds of information with a different level of confidentiality need for each, ranging from publicly visible to restrictions on which parties could know about it.

Once the harms are defined, the next step is to use that information in just the same way stakeholder needs are used to build a good system. Safety and security affect the system concept; they are incorporated into system and component specifications, which lead in turn to aspects of designs; safety and security properties are verified as the system is implemented.

46.4 What they are not

Safety and security are not about jumping in during design or implementation and checking for certain classes of problems. I was once in a meeting at a major aircraft manufacturer where their staff, who had recently worked on one of their new major passenger aircraft, defined security as performing certain software checks on their flight systems. (Two days later I was flying home on one of those aircraft and was not reassured.)

Addressing security is not about adding an authentication and encryption protocol to a software system and declaring the system secure (though those can be a part of making a secure system).

Safety is not about just training operating staff on how to use part of a system, though again that can be part of a safety solution.

Safety and security are properties of a whole system. They are emergent properties that arise only when many different parts of the system are deliberately designed so that those parts combine properly.

A safe or secure system requires developing evidence that the behaviors and features designed into the system are sufficient to avoid a well-defined set of harms. Claims based on one design aspect, such as data encryption, provide no evidence on their own that the system is secure; only analysis that connects the mechanisms to the complete set of harms is meaningful.

Sidebar: Safety versus security

Safety and security are largely similar: both are about defining possible harms and avoiding them.

However:

  • Safety is about inadvertent accidents, where the harm is caused either by events within the system or unintentional external events.
  • Security is about harms caused by malicious actions by an intelligent actor (the attacker), whether that actor is inside or outside the system.
  • Security must assume that some attackers are sophisticated (but not omnipotent).
  • Safety can often address harms that may occur repeatedly, so one can define a safety objective that keeps the rate of some harm below an acceptable threshold.
  • Security, and some kinds of safety, generally address harms that are sufficiently serious that even one occurrence is too many.

In many cases it is difficult to distinguish whether some hazardous situation is a safety matter, a security matter, or a combination. While there are differences between how one analyzes and defends against a security problem and a safety problem, the two often end up being combined.

46.5 Objectives of safety and security design

When analyzing and designing a system for safety or security, there are a few objectives to the effort.

While the first objective is obvious, the other three may not be.

The system will change eventually. When it does, people who did not work on the original system design may be doing the work, or enough time may have passed that people no longer remember the details of how safety and security are designed into the system, especially the subtle details.

Providing these people with a record of why and how the system is safe will help them make changes without breaking safety. If there has been an accident, this record will help them track down what didn’t work as designed and provide them a foundation on which to design and implement improvements.

46.6 Analysis process

Stripping away the details of specific methodologies, safety or security analysis and design has five steps:

  1. Identify harms.
  2. Define accidents based on harms.
  3. Identify the hazards that can lead to those accidents.
  4. Address each hazard.
  5. Update design and specifications.
Sidebar: Safety and security terms

Harm: A kind of undesired effect; a kind of loss.

Accident: An undesired and unplanned event that results in harm or loss. (Some standards, such as ICAO Annex 13, use the term incident as the general term [ICAO20].)

Hazard: A combination of a system state or condition and environmental conditions that will lead to an accident. (Also: threat.)

Risk: An evaluation of the seriousness or importance of a hazard, based on its likelihood of leading to an accident and severity of the harm caused.

Actor: Some entity that can participate in a hazard or an accident.

Attacker: An actor that can create a hazard intentionally. (Also: threat actor.)
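The relationships among these terms can be made concrete as a small data model. This is only a sketch; the class names, fields, and the drone example values are my own illustration, not part of any standard.

```python
from dataclasses import dataclass, field


@dataclass
class Harm:
    """A kind of undesired effect or loss, ranked by seriousness."""
    description: str
    seriousness: int  # larger = more serious; the ranking is a stakeholder decision


@dataclass
class Accident:
    """An undesired, unplanned event that results in one or more harms."""
    description: str
    harms: list[Harm]


@dataclass
class Hazard:
    """A system state plus environment conditions that can lead to an accident."""
    description: str
    accident: Accident
    sub_hazards: list["Hazard"] = field(default_factory=list)


# Example values drawn from the chapter's drone traffic management system.
collision = Harm("Two or more drones authorized in overlapping airspace", 10)
midair = Accident("Mid-air collision between drones", [collision])
overlap = Hazard("Deconfliction service grants overlapping volumes", midair)
```

The point of the model is the direction of the links: hazards point at the accident they can cause, and accidents point at the harms they produce, mirroring the analysis steps below.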

46.6.1 Adversarial mindset

Before going into details, it is worth discussing the attitude that one must bring to the task of designing and analyzing safety or security.

Designing safety is about looking for any and every way that the system could cause harm. This is an uncomfortable and unfamiliar mindset for many people who are used to building systems, who are usually looking to show that what they are designing or implementing performs the functions it is supposed to, that is, for the presence of desired functions.

Safety and security are about developing evidence of the absence of harms.

One of the main ways to develop this evidence is to find all the scenarios where a harm could happen and trace out what could cause those scenarios. This requires being thorough and imaginative about ways things could go wrong.

Completeness matters for this exercise. For security, leaving even one way to attack a system is enough to let someone compromise it. For safety, eliminating one hazard may reduce the number of accidents, but even one accident that injures or kills people is usually too many. This puts a premium on meticulous analysis to find all these ways.

Safety and security analysis is also often an exercise in identifying and countering confirmation bias (Section 59.2). People generally look for evidence that the system works, not for evidence that it avoids doing what it should not.

In the Inquiry Board report for the failure of the first Ariane 5 launch, the committee wrote [ESA96, p. 6]:

An underlying theme in the development of Ariane 5 is the bias toward mitigation of random failure. […] The exception which occurred was not due to a random failure but a design error. The exception was detected, but inappropriately handled because the view had been taken that software should be considered correct until it is shown to be at fault. […] The Board is in favour of the opposite view, that software should be assumed to be faulty until applying the currently accepted best practice methods can demonstrate that it is correct.

I will discuss methodologies for how to organize the analysis and design activities so that they can be as complete and accurate as possible (Section 46.6.8).

46.6.2 Identify harms

Identifying harms is listing the harms that the system should avoid causing. Some of these harms are defined by regulation or by organizational policies, as noted earlier.

In some cases, defining the list of harms is not a technical decision within the authority of the team designing the system. Deciding which harms to include in safety or security analysis is fundamentally a business or social policy decision. It is part of how an organization determines how it should position its products compared to other products, and how it should respond to societal expectations. I have found it difficult to get company leadership to take responsibility for these decisions; in multiple cases the leadership was interested in questions of market fit and profitability but did not want to be bothered with setting company policy about product safety. Be prepared for this to be a difficult conversation inside some companies.

The majority of harms are defined implicitly by the system’s purpose. For example, in one recent system for air traffic management the primary purpose was to “deconflict drone flights so that no two drones would be operating in the same airspace volume at any given time”. The system provided deconfliction in order to minimize the number of mid-air collisions between drones. The harm for this system was therefore having “two or more drones authorized to operate in overlapping airspace volumes”.

In another system, related to autonomous road vehicles, there were several harms: injury or death to people, property damage outside the vehicle, damage to the vehicle, damage to property in the vehicle, and interfering with road usage.

Harms are usually ranked in order of seriousness. Injury or death of a person is typically more serious than a scratch on a vehicle, for example. Having some understanding of the relative seriousness of two harms becomes important when one must choose between avoiding one harm or the other—for example, choosing to risk scratching a vehicle in order to avoid colliding with a pedestrian. Sometimes the seriousness of a harm depends on its scale: injuring one person is less serious than injuring thousands in one event. Many ethical dilemmas arise when trying to rank harms.

46.6.3 Define accidents

The next step is to define accidents based on harms. Accidents are the events when harms occur. In the example air traffic management system the accident is simply that the harm occurs. In the autonomous road vehicle example, the list of accidents was long and included things like:

46.6.4 Identify hazards

The third step is to identify the hazards that can lead to each of the accidents, that is, the list of situations that can lead to the accidents. This list is typically long; there are usually many ways that an accident can happen.

In the autonomous road vehicle example, a vehicle might collide with road infrastructure:

46.6.5 Risk analysis and prioritization

Complex systems usually lead to a large number of hazards. Tackling all of them at once is not feasible; the team has only so many people who can work on them, and addressing one hazard (discussed in the next section) can interact with another hazard or create new hazards.

The better alternative is to tackle only a few hazards at a time, with a small enough team that they can maintain a common understanding of their work.

Most safety design methodologies include a risk analysis, where each hazard is given some kind of risk score based on how serious the harms are or how likely they are to occur. The team can then focus on the hazards with the highest risk scores first, working their way down to less potentially serious hazards over time.

Some hazards will not be worth addressing. A hazard that can lead to mildly annoying some users, for example, may not be worth much effort, or addressing it might be deferred to a later version. While the goal should be to build as good a system as possible, limits on time and resources mean it will not be perfect.
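As an illustration of this kind of prioritization, here is a minimal sketch. The hazards, the 1-to-5 severity and likelihood scales, and the deferral threshold are all invented for the example; real programs define their own scales.

```python
# Rank hazards by a simple risk score: severity x likelihood.
# Scales (1-5) and the deferral threshold are invented for illustration.
hazards = [
    ("Vehicle collides with pedestrian", 5, 2),
    ("Vehicle scratched in parking",     1, 4),
    ("Loss of lane keeping on ice",      4, 2),
    ("UI mildly annoys some users",      1, 1),
]

scored = sorted(
    ((sev * lik, desc) for desc, sev, lik in hazards), reverse=True
)
DEFER_BELOW = 2  # hazards scoring under this may wait for a later version
for score, desc in scored:
    status = "address now" if score >= DEFER_BELOW else "defer"
    print(f"{score:2d}  {status:11s} {desc}")
```

The team works down the sorted list a few hazards at a time, and anything under the threshold is explicitly recorded as deferred rather than silently dropped.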

46.6.6 Address hazards

The final step is to determine how to address the hazards. This can be done by (in decreasing order of preference): eliminating the hazard, making the resulting accident less likely or less serious, limiting the harm caused, or at least detecting the harm and recovering from it.

The best option is to make an accident impossible. If that is not feasible, then making the accident less likely or less serious is next best. If no limitation is feasible, then the minimum is to ensure that a harm can be detected and that there is a process for recovering from the problem.

Consider the hazard identified earlier for an autonomous road vehicle: a collision with road infrastructure because the road was sufficiently slippery that the vehicle could not follow a path that avoided the infrastructure. This hazard can be broken down into sub-hazards: the road being slippery because of ice, and the road being slippery because of water.

The first sub-hazard can be (almost) eliminated by mandating that the vehicle never operate in temperatures below 10 °C, or when there are reports of remaining ice in a region. If the vehicle does not operate when there might be ice, then the first sub-hazard is not possible.

This hazard is only almost eliminated because the mitigation strategy depends on weather and temperature information. If that information is wrong, then the prevention mechanism will not be accurate and the vehicle might encounter ice anyway. This might happen if weather sensors malfunction, if communication about weather forecasts is disrupted, or if there are no recent reports about road conditions. These create third-level hazards, such as: the road is slippery because of ice, and the temperature information sources have malfunctioned, leading to the vehicle not following its planned path.

This recursion continues until one can provide evidence that the remaining hazards are unlikely enough.

The second hazard can be made less likely by disallowing the vehicle from operating when the chances of rain are above some threshold. However, this does not eliminate the hazard: rain can occur without much warning, and water could be on the road for other reasons like dew in the morning or water flowing across a road from a roadside water leak. The rain-related sub-hazards might be eliminated by including precipitation sensors on the vehicle; water from other sources might be detected by sensors that could evaluate the quality of the road surface. Each of these mechanisms can result in residual hazards that must be addressed.
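The ice and water mitigations above can be sketched as an operational guard. This is only an illustration: the function name, thresholds, and inputs are assumptions. Note that missing information fails safe, reflecting the third-level hazards around malfunctioning information sources.

```python
from typing import Optional

MIN_TEMP_C = 10.0     # illustrative: no operation below this, per the ice mitigation
MAX_RAIN_PROB = 0.3   # illustrative threshold for likely water on the road


def may_operate(temp_c: Optional[float],
                ice_reported: Optional[bool],
                rain_prob: Optional[float]) -> bool:
    """Return True only when weather data is present and within limits.

    Missing data (None) fails safe: if we cannot rule out ice or rain,
    the vehicle does not operate.
    """
    if temp_c is None or ice_reported is None or rain_prob is None:
        return False  # malfunctioning information sources: fail safe
    if temp_c < MIN_TEMP_C or ice_reported:
        return False  # first sub-hazard: ice
    if rain_prob > MAX_RAIN_PROB:
        return False  # second sub-hazard: water on the road
    return True
```

The guard does not eliminate the residual hazards (on-vehicle sensing would still be needed), but it makes the mitigation, and its dependence on external information, explicit and testable.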

46.6.6.1 Conflicts

The safety or security objectives can conflict with each other. The first step is to find where the conflicts are; then one must choose which objective to satisfy.

Conflict is built into some hazards. Some situations require choosing between two potential harms, such as in the case listed earlier when an autonomous vehicle must collide with a human or with road infrastructure but cannot avoid both.

In some cases a means of addressing one hazard causes a different hazard or makes another one more likely. Leveson uses the example of doors on a passenger train. The doors should only open when the train is stopped at a station platform, so that people are not injured by falling from a moving train or by falling down to the track side from a stopped train. At the same time, the doors must open in an emergency in order to avoid injuring people by trapping them on board a damaged train car [Leveson11, p. 190]. An interlock that prevents the doors from opening addresses the first hazard but can cause the second hazard, and vice versa.
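The door example can be made concrete as interlock logic in which the emergency case deliberately overrides the normal interlock. This is a sketch only; its value is that the precedence choice between the two hazards is explicit in one place, rather than an accident of the code.

```python
def doors_may_open(stopped: bool, at_platform: bool, emergency: bool) -> bool:
    """Door interlock sketch for the passenger-train example.

    Normal case: open only when stopped at a platform (prevents falls).
    Emergency case: open regardless, so passengers are not trapped.
    The emergency override is a deliberate, documented design choice:
    trapping people in a damaged car is judged the worse harm.
    """
    if emergency:
        return True
    return stopped and at_platform
```

Documenting this choice next to the logic gives later maintainers the rationale they need before changing either branch.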

In the end, I know of no simple rule to resolve these conflicts. The choice can often be guided by the relative seriousness of the alternative harms. However, in many other cases the choice requires careful design judgment.

The way that conflicts are resolved should always be well documented. These choices are likely to be revisited as a system evolves, or when someone analyzes an accident to understand how to improve the system.

46.6.6.2 Alternate mitigations

Often there will be multiple ways to address a hazard, and one has to choose which way to use.

The choice can be influenced by:

Sometimes addressing one hazard will also address another hazard.

In some cases it makes sense to adopt two approaches that complement each other. That way if one of the approaches fails to work, the other will still keep the system safe or secure.

Again, documenting the choice and the rationale for why that choice was made is important for people who have to come later to understand what was done.

46.6.7 Update design and specifications

After identifying harms, hazards, and how to address them, these choices feed back into other work. As I will discuss in Section 46.8 below, this includes updating the concept, specification, and design of parts of the system.

46.6.8 Methodologies and completeness

Finding and fixing all the ways that a system can result in harm is one of the challenges in building a safe and secure system. Leaving one attack vector unidentified and unaddressed means the system is vulnerable to attack. Missing one major safety hazard means the system can result in an accident.

Of course, perfection is not generally possible, especially since some harms or threats might not yet be known. Identifying and addressing as many hazards as possible is still far better than doing nothing.

There are several methodologies for addressing safety and security harms. These methodologies all aim to provide a systematic process that will identify and resolve all significant hazards when used properly. In practice most methodologies have limitations that one should understand.

I have had the most success using the System-theoretic Process Analysis (STPA) methodology [Leveson11]. This methodology takes a top-down, control-theoretic approach to identifying and addressing hazards. It takes the viewpoint that safety and security are emergent properties of the system as a whole (Section 12.4), and examines whether the interactions among a set of components at one level of the system will produce those emergent properties. In doing so it can address all the hazards that can be identified by the other methodologies I will list, while also dealing with classes of hazards that those others cannot.

Before STPA, the fault tree analysis (FTA) and failure mode and effects analysis (FMEA) methodologies were most commonly used. Both these methodologies focus on failure rather than harm, and therefore cannot identify or address accidents like component interaction accidents (see [Leveson11, p. 8]) where every component functions per specification but the interaction between the components nonetheless causes an accident.

FTA is a top-down methodology that starts with an identified hazard and works out what events or conditions might lead to that hazard. It is a useful way to organize identifying how one function depends on another. In one project, I was analyzing the conditions under which an autonomous vehicle could fail to follow guidance commands. This failure could happen if the vehicle throttle failed, its brakes failed, the steering function failed, or the low-level control electronics failed. I could then take each of those subsystems and identify how they could fail: steering could fail if the ability to communicate with the steering column failed, or if the steering column failed mechanically, and so on. I could then use models of failure rates to estimate which failure modes were the largest contributors to system failure.

The FTA approach has the benefit that it organizes the process of discovering relationships among functions: function A in component B depends on function C in that same component and on function D in component E. It is useful for estimating the relative contribution to high-level failure rates, which can be used to guide where effort should be put in addressing problems. It has the downside that by itself it does not guide one toward identifying and addressing interaction problems rather than failures. The STPA approach in some sense builds on the FTA approach.
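A minimal fault-tree computation in this spirit might look like the following. The tree structure and failure probabilities are invented for illustration, and the OR-gate arithmetic assumes independent basic events, which a real analysis must justify.

```python
# Basic-event failure probabilities per mission (illustrative numbers).
p = {
    "throttle": 1e-5,
    "brakes": 2e-5,
    "steering_comms": 5e-6,
    "steering_mech": 1e-6,
    "control_electronics": 3e-6,
}


def p_or(*probs: float) -> float:
    """OR gate: the event occurs if any input occurs (assumes independence)."""
    q = 1.0
    for pr in probs:
        q *= (1.0 - pr)
    return 1.0 - q


# Steering fails if its communications OR its mechanism fails.
p_steering = p_or(p["steering_comms"], p["steering_mech"])

# Guidance-following fails if any subsystem fails.
p_top = p_or(p["throttle"], p["brakes"], p_steering, p["control_electronics"])

# Relative contribution of each subsystem guides where to put design effort.
for name, prob in [("throttle", p["throttle"]), ("brakes", p["brakes"]),
                   ("steering", p_steering),
                   ("control electronics", p["control_electronics"])]:
    print(f"{name:20s} {prob:.1e}  ({prob / p_top:5.1%} of top-event rate)")
```

With these invented numbers the brakes dominate the top-event rate, which is exactly the kind of signal used to decide where to spend failure-tolerance effort.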

FMEA is a bottom-up methodology for identifying the effects of a fault in one component. It is the basis for some safety lifecycle standards, including ISO 26262 [ISO26262]. The methodology focuses on one component, identifies ways that the component can fail, and then projects the effects of that failure outward into the system. The importance of addressing a component’s failure mechanism can then be based on the system effect.

The FMEA approach has the advantage of using specialist understanding of a component’s design to guide design effort. It is often used to set the amount of effort to be put into failure tolerance; for example, in automotive electronic systems an “ASIL D component” is one intended for the most failure-critical uses. The methodology is of little use for making system-level safety or security cases.
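The bookkeeping behind an FMEA worksheet can be sketched with the commonly used risk priority number (severity × occurrence × detection, each scored 1 to 10). The component and its scores here are invented for illustration.

```python
# FMEA rows for a hypothetical wheel-speed sensor: one row per failure mode.
# Severity, occurrence, and detection are scored 1-10 (higher = worse);
# RPN = S x O x D is a common way to rank which modes to address first.
rows = [
    # (failure mode,            S, O, D)
    ("Output stuck at zero",    7, 3, 2),
    ("Intermittent dropout",    6, 5, 6),
    ("Slow drift out of spec",  5, 4, 8),
]

ranked = sorted(rows, key=lambda r: r[1] * r[2] * r[3], reverse=True)
for mode, s, o, d in ranked:
    print(f"RPN {s * o * d:3d}  {mode}")
```

Note how the hard-to-detect modes float to the top even when their severity is lower; that is the bottom-up, component-centric view the methodology provides, and also why it cannot by itself make a system-level case.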

46.7 Common approaches

46.7.1 Identifying hazards

Identifying all the hazards in a system is key to making the system safe or secure, but doing so is something of an art. I learned how to do this by doing it: haphazardly and not very well early on, getting better over the years. That is not the best way to learn.

There are several methodologies that are used to identify hazards. It is worth learning about them, because each one has useful techniques. The Leveson book on STPA [Leveson11] in particular provides a good discussion of the overall task of building safe or secure systems.

However, no methodology ever substitutes for good judgment; methodology can inform judgment but the people building the system are still responsible for safety.

The basic technique I have learned for identifying hazards is as follows.

  1. I start with the harms and try to imagine all the ways those harms could happen.
  2. After this first pass at identifying hazards, I plot out how the system functions when it is behaving as desired, then enumerate all the lower-level functions that contribute to this desired emergent behavior.
  3. Next, I look at things that could interfere with these lower-level functions.

For identifying hazards, I often try to do this in a somewhat organized way. For example, with an autonomous road vehicle I go through one drive: getting into the vehicle, starting it up, getting it onto the road, driving along the road, and so on. In each phase of operation I try to find all of the ways each harm could happen. I also try to find all the operations that the vehicle will do; for example, a vehicle will undergo maintenance and it will remain parked somewhere for a period.

I also try to find similar systems or similar environments and discover what hazards have been observed in them. Battery problems in electric road vehicles, for example, are a well-known cause for accidents.

Note that, for complex systems, identifying all the significant hazards will likely not be complete after one short exercise. My experience is that a concentrated investigation produces many hazards in a short time, but that one discovers more hazards over time, at a rate that generally decreases over time. Continuing to look for potential hazards all through development and after a system is put into service is important.

How I come to understand nominal system behavior is best shown by example. Consider the autonomous road vehicle from before. The harm to be avoided is causing damage or injury by colliding with road infrastructure: roadside barriers or bridge piers, for example. To cause this harm, the vehicle must move in a way that departs its planned lane and intersects the infrastructure.

The vehicle’s normal behavior is to follow an overall path to its destination (which streets, for example). In the short term it develops motion plans for how it should move to follow the road lane. It then manipulates the wheels to exert forces on the vehicle to move it along that plan.

The vehicle has one or more control systems that take in inputs about the desired short-term path, the vehicle’s state, and its environment; compute what forces should be applied to keep the vehicle on its desired path; and output commands to steering, braking, and accelerator actuators to achieve those forces. The control system recomputes these values quite frequently.
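A toy version of such a loop, using a proportional controller on lateral lane error, illustrates the recompute-and-actuate structure. The gain and the vehicle-response constant are invented; real vehicle control involves full state estimation and dynamics models.

```python
def lane_keeping_step(lateral_error_m: float, gain: float = 0.5) -> float:
    """One control-loop iteration: map lane error to a steering command.

    Positive error means the vehicle is right of lane center, so the
    command steers left (negative). The gain is an illustrative constant.
    """
    return -gain * lateral_error_m


# Simulate a few recomputation cycles of the loop.
error = 1.0  # start 1 m right of lane center
for _ in range(5):
    command = lane_keeping_step(error)
    error += command * 0.4  # toy model of vehicle response per cycle
print(f"residual error: {error:.3f} m")
```

Each cycle the controller re-reads the state and issues a fresh command; the hazards discussed below arise when the inputs to this loop, or the model connecting commands to forces, are wrong.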

Each of these functions (inputs, decisions, and resulting forces) depends on lower-level functions.

Following one thread of these, the driving process takes in the vehicle’s state: position, attitude, and velocity. The state might also include the state of braking actuators, tire inflation, steering angle, or current accelerator setting. If any of these are wrong, it can lead the control algorithm to make a wrong decision and guide the vehicle outside the expected path.

The vehicle state information could be out of date (delayed), or missing, or incorrect. It could be missing because:

The state information could be wrong because (among other things):

One can follow this kind of reasoning to find many different ways there can be problems with keeping to a lane.
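Checks against missing, out-of-date, or implausible state can be sketched as a gate ahead of the control computation. The field names and limits here are assumptions for illustration; the point is that rejecting bad state forces an explicit degraded behavior instead of a silent wrong decision.

```python
MAX_AGE_S = 0.1        # illustrative: state older than 100 ms is stale
MAX_SPEED_MPS = 70.0   # illustrative plausibility bound


def state_usable(state, now: float) -> bool:
    """Reject missing, stale, or implausible vehicle state before control.

    Using bad state silently is the hazard; rejecting it forces the
    system into an explicit degraded/recovery behavior instead.
    """
    if state is None or "timestamp" not in state or "speed_mps" not in state:
        return False                              # missing
    if now - state["timestamp"] > MAX_AGE_S:
        return False                              # out of date (delayed)
    if not (0.0 <= state["speed_mps"] <= MAX_SPEED_MPS):
        return False                              # incorrect / implausible
    return True
```

Each rejection branch corresponds to one of the ways state information can go wrong listed above, which keeps the hazard analysis and the code traceable to each other.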

This procedure will produce a large number of hazards. I have found that it is necessary to take care in organizing the procedure and its results. One may well discover the same hazard multiple times.

I have found it helpful to organize hazards hierarchically, starting with whole-system hazards at the top and then recording lower-level hazards that can cause those to occur.
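One simple way to keep such a hierarchy is a tree rendered with stable outline identifiers, so that a hazard rediscovered later can be recognized as a duplicate of an existing entry. This bookkeeping sketch is my own, not part of any methodology.

```python
def outline(hazard, prefix="H"):
    """Render a hazard tree as numbered outline lines (H, H.1, H.1.2, ...)."""
    desc, children = hazard
    lines = [f"{prefix}: {desc}"]
    for i, child in enumerate(children, start=1):
        lines += outline(child, f"{prefix}.{i}")
    return lines


# Whole-system hazard at the top; causes recorded beneath it.
tree = ("Vehicle departs lane toward infrastructure", [
    ("Road more slippery than the controller's model assumes", [
        ("Ice on road while temperature data is wrong", []),
        ("Water on road from rain or other sources", []),
    ]),
    ("Vehicle state information stale or incorrect", []),
])

print("\n".join(outline(tree)))
```

In practice the identifiers also serve as stable references from design documents and mitigation records back to the hazard they address.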

46.7.2 Including implicit relations

Consider the example I have used of the autonomous road vehicle steering itself to follow a lane and thereby avoid collisions with roadside infrastructure. This is done through a classic control system that takes inputs about the vehicle’s location and motion, computes reactions, and sends commands to actuators in response.

When I first sketched this example, I thought of it in those terms; in particular, that the control output is a set of commands to steering and other actuators. When driving, I think in terms of how I need to manipulate the steering or brake pedal to “make the car go where I want”.

However, this isn’t actually quite correct. The steering or braking actuation is not the point of the control system; the point is to use them to create forces that will move the vehicle. If the vehicle were not in contact with the road, the control mechanism could actuate all it wanted and it would not affect how the vehicle moved. Moreover, the control mechanism has within it a model of how actuating steering or brakes will cause desired forces. That model might be implicit, encoded in experimentally-derived parameters within the control calculations, but it nonetheless has that model.

If one doesn’t do the safety analysis in terms of forces changing vehicle motion, then the analysis will miss important hazards: errors in the controller’s model, and cases where the environment does not match assumptions embedded in that model. For example, if the road surface is slippery, it will not provide the friction expected to turn the vehicle. This can lead to the vehicle not turning as quickly as needed, or to it skidding.

Recognizing that these hazards exist can lead to addressing them. I illustrated some of the possible mitigations above.

This is an example of leaving out essential causal relationships when trying to discover hazards. It is worth the time to trace out all the low-level causal steps involved in, for example, getting a vehicle to turn. Each of the steps is a point in the system where something can happen that interferes with keeping the system in its desired state.

46.7.3 Addressing hazards

Hazards, once identified, must be addressed. Section 46.6.6 presented the overall process. It also gave the preferred order for handling hazards: eliminate them, make them unlikely, limit the harm, or at least be able to detect them.

There are some techniques I have used in past systems to address safety or security hazards.

Reduce sources of disagreement. Some hazards arise because there are two sources of information or two sources of control, and they can disagree. Sometimes redundancy is worthwhile, if it can be used to detect or mask problems. However, when that is not the case, it can be better to design away potential disagreement.

As an example, I was working on a design for a small vertical-takeoff aircraft that would interact with services on the ground to maintain separation from other aircraft. Two separate pieces of information were to be sent from ground systems to the aircraft. They were not redundant; they were different kinds of messages, and the aircraft needed both in order to make correct guidance decisions. One engineer on the project wanted to send those messages over different communication paths. The two paths were quite different: one was a nearly direct transmission, while the other was routed through multiple systems before being delivered to the aircraft’s guidance system. Using the two paths, with their differences in message loss and message delay, would have increased the chance that the aircraft would not get one of the messages it needed. Moreover, the aircraft would have had to be designed to handle messages that were out of sync, one arriving much later than the other and containing older information, which increased the guidance software’s complexity. These problems were not necessary: the messages could all have been sent on one path, avoiding most of the problems of disagreement between messages.

Realism.

  • Design the system with realistic expectations of what different parts can actually do
  • Especially important when designing a part of a system that includes people: need to consider their cognitive load, their ability to understand a situation, their ability to detect off-nominal conditions
    • reference decreased situational awareness with digital systems

Contain problems.

  • Defense in depth
  • Convergence and self-stabilization

Be able to work with partial function.

Have a watchdog.

  • Safe mode layering
  • Something that watches normal behavior and kicks in when behavior is out of those bounds
  • Response to bring the system to a safe but simpler state, from which can recover
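The watchdog pattern in these notes can be sketched as follows; the monitored bound, mode names, and crawl speed are invented for illustration.

```python
SAFE_SPEED_MPS = 2.0  # illustrative crawl speed for the degraded mode


class Watchdog:
    """Watch normal behavior; force a simpler safe mode when out of bounds.

    The safe mode is deliberately simpler than normal operation, so it
    can be trusted even when the normal behavior has gone wrong.
    """

    def __init__(self, max_speed_mps: float):
        self.max_speed_mps = max_speed_mps
        self.mode = "normal"

    def check(self, speed_mps: float) -> str:
        if self.mode == "normal" and speed_mps > self.max_speed_mps:
            self.mode = "safe"  # out of bounds: drop to the safe layer
        return self.mode

    def commanded_speed(self, requested_mps: float) -> float:
        if self.mode == "safe":
            return min(requested_mps, SAFE_SPEED_MPS)
        return requested_mps
```

Note that the watchdog latches: once in the safe mode it stays there until an explicit recovery procedure runs, rather than flapping back to normal on the next in-bounds reading.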

Use redundancy when it helps.

  • Redundant defense
    • Redundancy that actually works, unlike Ariane 5 and A330; look for common mode problems
  • Truly different information sources
  • Root out common-cause failures
  • Rational ways to combine multiple information sources; avoid the side-stick problem, or at least make disagreement visible

Isolation.

  • Contain parts of the system to limit their ability to interact
  • Makes reasoning about interactions easier
  • Helps contain problems
  • Example: CPU scheduling; network utilization

Avoid runaway problems.

  • When some problems occur, the system has to use extra resources to deal with the problem and recover from it
  • The extra load can cause more problems to occur, leading to increasing problems and eventual collapse
  • Well known in communication systems, but other domains such as banking exhibit similar behaviors
  • Avoid designs that can have this runaway, or if it is unavoidable, have explicit checks and ways to damp down the runaway, or design in load-shedding
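Explicit load-shedding to damp such a runaway can be sketched as admission control on a work queue; the thresholds are invented for illustration.

```python
from collections import deque

QUEUE_LIMIT = 100  # illustrative: beyond this, recovery work itself overloads us
SHED_TO = 50       # illustrative: shed backlog down to this depth


def admit(queue: deque, request: str) -> bool:
    """Admission control: refuse new work rather than let backlog grow unboundedly.

    Refusing early keeps the system inside a load region where it can
    still make progress, instead of spending all its resources on recovery.
    """
    if len(queue) >= QUEUE_LIMIT:
        # Shed the oldest backlog, and still refuse the new request this
        # cycle so that total load actually falls.
        while len(queue) > SHED_TO:
            queue.popleft()
        return False
    queue.append(request)
    return True
```

The deliberate asymmetry (shed far below the limit before admitting again) provides hysteresis, so the system does not oscillate at the edge of overload.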

Extra margin.

  • Many problems happen because some resource runs out
  • Design extra resource in when possible

Make problems detectable.

  • When possible, explicitly detect when a misbehavior occurs
  • Log it so there’s information for fixing the system to avoid repeats
  • Explicitly design the system to know about problem states and have explicit recovery procedures

46.8 Using safety and security design

Position in work flow

Feedback loop

Use in fixing problems

Use in upgrades

46.9 Artifacts and documents

XXX address threat actors and how that biases security decisions

XXX address challenging every assumption and not leaving connections implicit