Making systems: Fundamentals
III   Systems
Chapter 14   Evidence of meeting purpose

14.1 Introduction

An implemented and operational system needs to meet its purpose (Chapter 9). After all, that purpose is the reason that resources have been spent on developing the system and using it. Meeting purpose means two things: that the system does all the things it is supposed to, and that it does not do things it is not supposed to.

One cannot assume that a system meets its purpose. Each system needs to be evaluated to determine whether it actually does or not, and if not, how and where it does not. The evaluations catch design and communication errors that occur when one party thinks they have specified what is needed, and another party does not understand what was meant or makes a mistake in translating the specification into practice.

How a system works changes over time as well, and regular re-evaluation catches cases where operational behavior diverges from what is needed for correct or safe operation. This includes wear and tear on the system that must be corrected with maintenance. It also includes changes in how the system is operated—from operator practice to management organization and environmental context.

In this work I talk about gathering evidence that a system meets its purpose.

Parts of a system’s purpose can be specified quantitatively or qualitatively. Quantitative purposes can lead to deterministic ways to check that the system meets the purpose. Complex quantitative purposes, however, aren’t necessarily so easily evaluated: computational complexity or the difficulty in actually measuring system behavior can lead to quantitative properties that cannot be easily or definitively evaluated.[1] For these complex quantitative problems, one must be satisfied with statistical evidence that indicates whether the property is likely true. Qualitative purposes are not amenable to proof of satisfaction at all. These purposes are evaluated by human judgment, which again leads to evidence but not proof of satisfaction.

14.2 Verification versus validation

Systems engineering processes often use the terms verification and validation (or just V&V). These are both special cases of the general need to gather evidence for and against whether a system meets its purpose or not. I presented an overview of the difference between the two terms in Section 6.6.

The system’s purpose is near the root of all the parts that make up the system (Chapter 15). All of the rest of the information about the system derives from this purpose. There is, thus, a chain of relationships starting from the purpose to every other piece of system information. This chain runs from purpose to concept, to specification, to design, and to implementation, and downward through multiple levels of subcomponents.

Verification is the process by which each relationship gets checked. For example, a component X might have a specification to provide some behavior B. The component’s design is that X will be implemented in some subcomponents, X1, X2, and X3. The design indicates how parts of B are allocated to the subcomponents, as behaviors B1, B2, and B3. Verifying the design of X means showing that the Bi together implement behavior B. The verification also means showing that all of the behaviors of the subcomponents of X implement all of the specified behaviors, and that the combination of subcomponent behaviors does not lead to any behaviors not in the specification.

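As a toy illustration of this kind of design verification, the sketch below (my own construction, not drawn from any particular system) uses sorting as a stand-in for behavior B: subcomponent behavior B1 splits the input, B2 sorts each part, and B3 merges the parts, and a final check confirms that the composed behaviors reproduce the specified behavior on sample inputs.

    # Toy sketch: behavior B ("sort a list") allocated to subcomponent
    # behaviors B1 (split), B2 (sort each part), and B3 (merge). The loop
    # at the end checks that the composition reproduces B on sample inputs.
    def b_spec(xs):                  # B: the specified behavior of X
        return sorted(xs)

    def b1_split(xs):                # B1: split the input in two
        mid = len(xs) // 2
        return xs[:mid], xs[mid:]

    def b2_sort_parts(parts):        # B2: sort each part independently
        return [sorted(p) for p in parts]

    def b3_merge(parts):             # B3: merge the two sorted parts
        left, right = parts
        out = []
        while left and right:
            out.append(left.pop(0) if left[0] <= right[0] else right.pop(0))
        return out + left + right

    def implemented_x(xs):           # X as designed: B1, then B2, then B3
        return b3_merge(b2_sort_parts(b1_split(xs)))

    for sample in ([], [3, 1, 2], [5, 5, 1], list(range(10, 0, -1))):
        assert implemented_x(sample) == b_spec(sample)

A check like this gives evidence that the Bi together implement B; showing that the combination adds no behavior outside the specification requires the kinds of negative evidence discussed later in this chapter.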

Theoretically, if each chain of relations is correct then the final implementation will meet the system’s purpose as expected. However, the odds are good that some specification will be ambiguous, or that some design element is not quite sufficient to meet a specification, or that one of many other little errors will occur.

Validation provides a cross-check to catch problems with the step-by-step verification process by checking some part of the system directly against the original customer needs, bypassing the chain of derivation. This can be done either by working with the customer, or by using the information gathered about the customer’s purpose for the system.

The recorded system purpose is validated by working with the customer to ensure the purpose is accurate. Other parts, such as the concept, designs, models, or implementations, can be validated either by demonstrating them to the customer or by comparing them against the system purpose.

14.3 When to evaluate a system

Checking whether a system meets its purpose is an ongoing need, starting from when the system is first conceived, through system design, implementation, and operation. In general, a system should be evaluated any time its purpose changes, or any time its design, implementation, or operation changes.

In practice, there are five times in a system’s life cycle when the system—whether in design, in implementation, or in operation—gets checked against its purpose.

  1. At each of the individual steps from initial concept, through specification, design, and implementation.
  2. At the time when the system is accepted for deployment.
  3. Periodically and regularly while the system is in operation, to monitor for drift.
  4. At each step when a change is requested, from concept through design and implementation.
  5. At the time when a changed system is accepted for deployment.

During development, systems are checked in two ways: step by step, and with a separate evaluation of the whole system when implementation is complete. The step-by-step checking occurs at each development step, including generating a concept for the system, generating a specification, designing, and then implementing the design. The expectation is that if each of these steps is correct, then the concept will follow the purpose, the specification will follow the concept, and so on, and the resulting implementation will properly meet the system’s purpose. (See the figure in Section 6.6.) In practice, something gets missed or misinterpreted at some step of development, and so the argument that each step is correct does not hold. Separately evaluating the implementation at the end, directly against the original statement of purpose, allows one to cross-check the step-by-step evaluation. It helps one find which step had a mistake and thus where to make corrections.

Evaluations are part of the process of working out components’ specifications and designs. The idea of safety- or security-guided design [Leveson11, Chapter 9][Horney17] is to start with safety or security objectives as part of a component’s purpose (or the system’s purpose), refine those objectives into parts of the component’s specification, and then use this to help guide design work. Using safety or security objectives means conducting analyses of specifications or designs to see if they address the objectives, and adjusting the specification or design until there is evidence that they do meet the objectives.

Any time the system’s purpose changes, the system must be re-evaluated in light of the change. This involves repeating steps in the life cycle shown above. Re-evaluation is easy early in initial design; the later in the life cycle a change comes, the more expensive re-evaluation gets. The scope of what parts of the system need to be re-evaluated can be limited by examining the structure of the system and how a change propagates from one component or relation to another.
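
One small sketch of such scoping, using hypothetical element names, records which pieces of system information derive from which and marks everything reachable from the changed element for re-evaluation:

    # Sketch: an edge A -> B in `derives_from` means "B derives from A".
    # Everything transitively derived from a changed element is a
    # candidate for re-evaluation; everything else can be left alone.
    def affected_by(changed, derives_from):
        affected, stack = set(), [changed]
        while stack:
            element = stack.pop()
            for dependent in derives_from.get(element, ()):
                if dependent not in affected:
                    affected.add(dependent)
                    stack.append(dependent)
        return affected

    derives_from = {
        "purpose":       ["concept"],
        "concept":       ["specification"],
        "specification": ["design_X", "design_Y"],
        "design_X":      ["impl_X1", "impl_X2"],
    }
    print(affected_by("design_X", derives_from))  # {'impl_X1', 'impl_X2'}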

A system should be evaluated regularly while in operation. In practice, systems drift over time from how they were originally designed and implemented. People who are part of the system, whether as operators, oversight, or management, can shift in their understanding of what they need to do, and often find shortcuts for their role as they adapt to what the system is like to work with. Mechanical parts of the system can wear, changing their behavior or increasing the chances of failures. The environment in which a system operates can change, perhaps with people moving near an installation that was previously isolated or maintenance budgets being cut. As a simple example, in one early software system I built, the software included a billing module that would create itemized invoices to be sent to insurance companies that were expected to reimburse for medical expenses. Over time, the people who should have been running the module and creating invoices forgot to do so as regularly as they should have, leading to revenue problems for the business. Leveson discusses several other examples [Leveson11, Chapter 12].

Finally, a system’s purpose usually changes over time. The users need new features, or some assumption about how they will use the system turns out to be wrong. Regulations or security needs may change. All of these lead to a need to change the system’s design and implementation. The team will recapitulate the development process to make these changes, including evaluating the updated concept, design, and implementation against the new purpose.

14.4 Kinds of evidence

There are two kinds of evidence: evidence that something happens and evidence that something doesn’t happen. Both are needed to evaluate whether a system meets its purpose.

The first kind of evidence is an indication that the system properly implements some desired property or behavior. This is what most people think of first: that the mass of system hardware is within some maximum amount, or that the system performs action X when condition Y occurs.

The other kind of evidence, sometimes called negative evidence, is an indication that the system does not do something. Safety and security evaluations are fundamentally about collecting this kind of evidence: that the system will not do some unsafe action or enter into some unsafe state. This kind of evidence is therefore vital to determining whether a system meets its objectives, but evidence of the negative is generally hard to establish. In practice, analytic methods are the only ways we currently have to establish the absence of a condition.

Bear in mind that, as the saying goes, absence of evidence is not evidence of absence; that is, no amount of testing that fails to find an undesired condition can establish with certainty that a realistic system is free of that undesired condition. Negative evidence through testing requires testing every possible scenario, which is infeasible for anything other than trivial behaviors. Testing a very large number of scenarios can potentially generate a statistical argument for the absence of an undesired condition, but only if the scenarios chosen can be proven to provide sufficient, unbiased coverage of all possible scenarios, including rare scenarios. I have never found an example of someone being able to construct an argument for the significance of the test scenarios in a real-world system. Kalra and Paddock [Kalra16] present an analysis for testing autonomous road vehicle safety, and show that it would require an infeasible number of driving miles to show the absence of unsafe behaviors—and they conclude that alternate means are needed to determine whether autonomous road vehicles are sufficiently safe.

Many undesirable behaviors or conditions cannot be completely eliminated from a system, and instead the standard is to show that the rate at which these behaviors occur is sufficiently rare. For example, aircraft are expected to experience failures at no more than some rate per flight hour in order to be certified for operation. These safety conditions lead to a need for evidence of statistical bounds on rate of occurrence at a given confidence level.[2] If these bounds are sufficiently loose, then a carefully designed test campaign can provide statistically significant evidence. However, statistical significance and confidence rely on the test scenarios either being selected without bias, or with a way to correct for selection bias. This means, for example, ensuring that there is no class of scenarios that are avoided in selection. It also means understanding the probability of rare but important scenarios occurring and accounting for that rarity in the number of scenarios tested or in the way scenarios are selected.
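
As a rough illustration of why a loose bound is testable and a tight one is not, the sketch below (my own arithmetic, assuming independent, unbiased test scenarios with no failures observed) computes how many failure-free scenarios are needed to support a claimed bound at a given confidence level:

    # From (1 - p)^N <= 1 - confidence: the smallest N such that zero
    # failures in N independent trials rules out a true failure rate of
    # p or more at the stated confidence level.
    import math

    def required_failure_free_trials(p, confidence):
        return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))

    print(required_failure_free_trials(1e-3, 0.95))  # about 3,000 scenarios
    print(required_failure_free_trials(1e-9, 0.95))  # about 3 billion scenarios

Bounds in the second range are typical of safety-critical systems, which is why analyses such as Kalra and Paddock’s conclude that testing alone cannot carry the argument.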

14.5 Methods of gathering evidence

There are three general methods for gathering evidence about systems satisfying their purpose:

Experimentation tests an operational system (or part of a system) to show evidence about some desired capability. This is the gold standard for gathering evidence that something works.

Experimentation is usually divided into two categories: testing and demonstration. Testing involves setting the system into a defined condition, providing it defined inputs, measuring the system’s response, and comparing that response to expectations. Tests are expected to be repeatable. Demonstration is more open-ended: the system is operated for a while, possibly by people, and not always in a fully scripted, repeatable way. Demonstrations can address some non-quantitative conditions, such as whether people like something or not.
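
A scripted test, in the sense above, can be as small as the following sketch, which uses a hypothetical thermostat controller as the component under test:

    # Sketch of a repeatable test: defined condition, defined input,
    # measured response, comparison against expectation.
    class Controller:
        """Toy component: commands heat on below the set point."""
        def __init__(self, set_point):
            self.set_point = set_point

        def command(self, measured_temp):
            return "HEAT_ON" if measured_temp < self.set_point else "HEAT_OFF"

    def test_heat_commanded_below_set_point():
        controller = Controller(set_point=20.0)  # defined condition
        response = controller.command(18.5)      # defined input
        assert response == "HEAT_ON"             # expected response

    test_heat_commanded_below_set_point()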

Inspection or review is a way to check a design or system for things that cannot be readily measured by experimentation. These methods use human expertise to check the system for specific conditions, and they can be useful for gathering negative evidence when other methods don’t apply. In the simplest form, inspection checks simple conditions that would be difficult to automate; for example, that a physical car has four wheels. For more complex reviews, humans observe the system and reason about what they see to determine whether it matches expected behavior.

Analysis can be used to collect both kinds of evidence. Indeed, it is generally the most useful way to gather negative evidence—which is often about thoroughness, and analytic methods are better at ensuring all possibilities have been examined. Analysis takes as input a model of the system, extracted from its design or its implementation. It then applies algorithms to prove or disprove statements about that design, such as whether there exists some sequence of state transitions that would cause a component to enter an undesired state. The evaluation is usually performed using automated computational tools, though it can sometimes be done by hand for analyses of modest complexity. I have used analytic methods occasionally, usually for foundational components or abstractions on which the system depends for its correct operation. The first time I used them, on the design of a synchronization mechanism in a multithreaded computing environment, the analysis caught a subtle flaw that would have occurred rarely but would have been difficult to detect. On another project, colleagues and I proved the correctness of the design of a family of distributed consensus algorithms—which helped us accurately implement the algorithms. The seL4 operating system kernel [Klein14] has a formally proven design and implementation, showing that the implementation provides key confidentiality, integrity, and isolation properties as well as functioning according to its specification.
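
The following toy sketch shows the flavor of such an analysis: exhaustively exploring a small, hypothetical state-transition model to confirm that no undesired state is reachable. Real analyses rely on model checkers or theorem provers rather than a hand-written search, but the underlying idea is the same.

    # Explore every state reachable from the initial state and report any
    # that fall in the undesired set; an empty result is negative evidence
    # that the modeled design cannot reach those states.
    from collections import deque

    def reachable_undesired(initial, transitions, undesired):
        seen, frontier, bad = {initial}, deque([initial]), set()
        while frontier:
            state = frontier.popleft()
            if state in undesired:
                bad.add(state)
            for nxt in transitions.get(state, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
        return bad

    # Hypothetical model: a lock that must never end up held by two threads.
    transitions = {
        "idle":      ["held_by_A", "held_by_B"],
        "held_by_A": ["idle"],
        "held_by_B": ["idle"],
    }
    print(reachable_undesired("idle", transitions, {"held_by_both"}))  # set()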

14.6 Completeness and minimality

Separate from these methods for gathering evidence, one also needs evidence of completeness and minimality.

When a system is believed to be complete, one doesn’t want only to show that one or a few purposes are met; eventually one needs to provide evidence that all purposes are met. This does require knowing what the purpose is, and then being able to provide evidence showing each part of it has been satisfied.

One also needs to show that the system as designed or implemented does not do things that don’t derive from and support the purpose. This includes showing that safety and security properties (of bad things not happening) are met. It also includes ensuring that people have not inserted features that the end users do not need or want, which would imply that development resources have been misspent and that the system can potentially do things the users will find undesirable.
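
One simple way to organize both checks is a traceability mapping between purposes and evidence; the sketch below, with hypothetical requirement and test names, flags purposes that have no supporting evidence and evidence that traces to no purpose:

    # Completeness: every requirement has at least one piece of evidence.
    # Minimality: every piece of evidence traces back to some requirement.
    requirements = {"R1", "R2", "R3"}
    evidence = {                      # evidence item -> requirements supported
        "test_login":   {"R1"},
        "test_logout":  {"R1"},
        "test_billing": {"R2"},
        "test_export":  set(),        # exercises a feature no requirement asks for
    }

    covered = set().union(*evidence.values())
    print("Purposes without evidence:", requirements - covered)
    print("Evidence without a purpose:",
          [name for name, reqs in evidence.items() if not reqs])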

Sidebar: Summary
  • An operational system should meet its purpose.
  • It takes evidence to show that this is true.
    • Including evidence that the system does not do things that are not in its purpose.
  • Gathering evidence is an activity that happens all through a project.
    • Especially important in guiding design.
  • Evidence comes from experimenting (testing), inspection or review, and analysis.
    • Analysis is needed to show the absence of something, but sometimes one can only gather statistical evidence.