Making systems

Richard Golding

Part I: Introduction and purpose
Chapter 1: Introduction
Chapter 2: What is “making a system”?
2.1 An example
2.2 Generalizing
2.3 The rest of the book
Chapter 3: Stories about building systems
3.1 Developing a spacecraft mission without engineering the system
3.2 Marketing and engineering collaboration
3.3 Missing implicit requirements
3.4 Building at a mismatch to purpose
3.5 Heavyweight, understaffed processes
3.6 Planning the transcontinental railroad
Part II: Systems background
Chapter 4: A model of systems work
Chapter 5: Elements of systems
5.1 Introduction
5.2 System purpose
5.3 System parts and views
5.4 Structure and emergence
5.5 Evidence
5.6 Using this model
Chapter 6: Elements of making a system
6.1 Introduction
6.2 Objective
6.3 Model
6.4 Using this model
Chapter 7: A well-functioning project
7.1 Project leadership
7.2 System-building tasks
7.3 Plan for building the system
7.4 The team
Part III: Systems
Chapter 8: Purpose
8.1 Introduction
8.2 Why purpose matters
8.3 Explicit purpose
8.4 Implicit purpose
8.5 Using purpose
Chapter 9: Component parts
9.1 Introduction
9.2 Definition of component
9.3 Divide and conquer: the component breakdown structure
9.4 Component characteristics
9.5 Downsides
9.6 Why components matter
Chapter 10: Structure and emergence
10.1 Introduction
10.2 Definition
10.3 Abstraction
10.4 Emergent system properties
10.5 Working with structure
Chapter 11: System views
11.1 Introduction
11.2 Technical views
11.3 Non-technical uses
Chapter 12: Evidence of meeting purpose
12.1 Introduction
12.2 When to evaluate a system
12.3 Kinds of evidence
12.4 Methods of gathering evidence
12.5 Completeness and minimality
Part IV: Making a system
Chapter 13: Approach
13.1 Purpose
13.2 Stakeholders and needs
13.3 Mapping needs to model
Chapter 14: Artifacts
14.1 Purpose
14.2 General principles
14.3 Kinds of artifacts
14.4 Managing artifacts
Chapter 15: Tools
15.1 Purpose
15.2 General considerations
15.3 Kinds of tools
15.4 Managing tools
Chapter 16: Teams
16.1 Purpose
16.2 Directory
16.3 Structure
16.4 Communication
16.5 Team organization and size
Chapter 17: Operations
17.1 Purpose
17.2 Operation model
17.3 Life cycle
17.4 Procedures
17.5 Plan
17.6 Tasking
17.7 Support
Part V: Life cycles and project phases
Chapter 18: Introduction to life cycle patterns
18.1 Introduction
18.2 Key ideas
18.3 Purpose
18.4 A model for patterns
18.5 Documenting life cycle patterns
18.6 Work steps and artifacts
18.7 Life cycle and teams
18.8 Life cycle and planning
18.9 Principles for a life cycle pattern
18.10 In upcoming chapters
Chapter 19: Example life cycle patterns
19.1 Introduction
19.2 Whole project life cycle
19.3 System development patterns
19.4 Post-development patterns
19.5 Comparisons and lessons learned
Chapter 20: Model life cycle patterns
20.1 Introduction
20.2 The example phases
20.3 Life cycle phases
Part VI: Specifications
Chapter 21: Specifications
21.1 Purpose
21.2 Specifications and systems
21.3 Example
21.4 Kinds of specifications
21.5 Using specifications in a system
21.6 Using specifications for a component
21.7 Specification artifacts
Chapter 22: Requirements
22.1 What are requirements?
22.2 A single requirement
22.3 Multiple requirements
22.4 Organizing requirements
22.5 Writing good requirements
22.6 Requirement derivation
22.7 Advanced requirements
22.8 Analyzing requirements
22.9 Validating requirements
22.10 Verification
22.11 Limitations of requirements
22.12 Working with requirements
22.13 Tools
Part VII: Design
Chapter 23: Design introduction
23.1 Purpose
23.2 How designs are used
23.3 Design artifacts
23.4 Developing designs
Chapter 24: Breakdown structure
24.1 What is the component breakdown for?
24.2 Basic concepts
24.3 Component identifiers
24.4 Viewing the breakdown structure
24.5 Context and relationships
24.6 Advice
24.7 Examples
Part VIII: Team organization
Chapter 25: Team introduction
Part IX: Project plan
Part X: Appendixes
Appendix A: From stakeholder need to model purposes
A.1 Introduction
A.2 Stakeholders
A.3 Model elements
Further reading and inspiration
Acknowledgments
Bibliography

Part I: Introduction and purpose

Chapter 1: Introduction

12 February 2024

This book is about the intersection of systems engineering, project management, and project leadership. It is about the building of systems that are complex or that require high assurance—ones where safety or security are critical to their correct operation. In these projects, these roles must work together to guide a team to build a system that works correctly.

The intersection is about how each of these disciplines contributes to the work of building a system. The intersection is where people maintain a holistic view of the project. It is where technical decisions about system structure interact with work planning; where project leadership sets the norms for how engineers communicate and check each other’s work. It is where competing concerns like cost versus rigor get negotiated. And it is where people take a long view of the work, addressing how to prepare for the work a year or more in the future.

I’ve worked with many people who were good at one of these disciplines, but didn’t understand how their part fit together with others to create a team that could build something complex while staying happy and efficient. I have worked with well-trained systems engineers who knew the tools of their craft, but did not know how or, more importantly, why to use them and how they fit together. I have worked with project managers who had experience with scheduling and risk management and other tools of their craft, but lacked the basic understanding of what was involved in the systems part of the work they were managing. I have also worked with engineers and managers who were tasked with assembling a team, but did not understand what it means to lead a team so that it becomes well-structured and effective.

In other words, they were all good at their individual disciplines but they lacked the basic understanding of how their discipline affects work in other disciplines, and how to work with people in the other disciplines to achieve what they set out to do.

I have been a leader, systems engineer, manager, or combinations thereof for several projects, and consulted with people filling those roles in other projects. The material in this book is based on what I learned filling those roles, and in trying to help people and teams build systems.

There are three parts to this book. In order to talk about building a system, one first needs a definition of what “a system” means. I present a model for thinking about systems at the high level, which provides structure and vocabulary for what follows. Next, I present a model for thinking about what building a system means. This model breaks down “building” into a few sub-topics, which provide a structure for discussing how to do the work. Finally, I discuss how one can approach these building topics in pragmatic ways.

This book is not a book on systems engineering or project management per se. Rather, it provides an overarching structure that organizes how the systems engineering and project management disciplines contribute to systems-building. While I reference material from these disciplines as needed, do not expect (for example) to learn the details of safety analyses here. I do discuss how those analyses fit with the other work needed for building a system, and provide some references to works by people who have specialized in those topics.

This book is for people who are building complex systems, or are learning how to do so. I provide a structure to help think about the problems of building systems, along with ways to evaluate different ways one can choose to solve problems for a specific project. I relate experience and advice where I have some.

This book is meant to be used in two different ways. First, I suggest reading through the first introductory part to understand the general structure for systems and systems-building. With more time, read Part III on the model of systems and Part IV on building systems. After doing so, you should have a general understanding of what I mean by building systems. Second, the rest of the book is written so that it can be read linearly to get all the details, but it is intended more to be dipped into as a reference when you are want to understand something specific in more detail.

Chapter 2: What is “making a system”?

30 November 2023

Systems engineering and systems work are about making a system work as a whole. By “system”, I mean a thing that has some purpose to do something, and that is made out of smaller parts that have to work together to achieve that purpose.

Systems engineering is the discipline that addresses systems as a whole. Systems generally are built out of many simpler parts. Systems engineering maintains the holistic view of what all those parts combine to make and do. The systems team figures out what the whole system should be and do to meet customer objectives, how the various parts must interconnect to achieve that purpose, and then show that the result meets objectives.

A systems engineer is one part of the team that designs and builds a system, and the system engineers’ role complements other roles. Most engineers on a team concentrate on the parts; the systems engineers assemble the parts into the whole. Product or mission managers interact with customers to find out what they desire; systems engineers assemble the evidence that the system as a whole meets those objectives.

A project to make a system usually is big enough to require many people—partly for the number of tasks to be done, and partly for the diverse skills needed to complete them. This leads to a need to manage the work and the team: how to organize the people, how to ensure that the right work gets done, and how to ensure enough resources are available to support the work.

Most systems work is not done when the first complete version has been implemented. The work continues as the system is deployed, as the system is in operation, and as the needs the system fills evolve. This means that the work of making a system is a long-term effort, usually lasting years and involving people who will never meet each other.

Because the systems effort deals with the complexity of the whole system, team members must work out how to communicate with their peers what the context is for each part of the system. This involves finding abstractions to explain and simplify the whole, defining patterns of interactions among system parts, and figuring out interfaces between parts that achieve those interactions.

In this book I present ways to think about the work of making and maintaining systems, including models for what the work is, and approaches that have been useful for building real systems.

2.1 An example

A customer wants a small cottage, built out of Lego™ bricks. Someone on the team works with the customer to get more information about the cottage they want. They would like the cottage to be white. They would like it to have a window.

The people building the cottage start with a pile of the bricks:

And they build a cottage:

There are three systems problems with this cottage.

The cottage is supposed to be white, but there is a red brick on the side.
There is a brick on the side that doesn’t fit in its place.
The cottage lacks a door.

These three problems illustrate three general kinds of systems problems.

A part has the right “shape”, performing the right functions (holding up part of the side wall), but has a wrong attribute (red instead of white).
A part doesn’t fit into the whole (doesn’t fill the gap, projects out from the side wall) and doesn’t fill needed functions (keeping the side wall free of openings).
A part is missing (a door). Note that the original sketch developed with the customer didn’t include a door—it only had an annotation about a window. People implicitly know that cottages need doors but builders may miss out on the door if it isn’t explicitly specified.

After some systems work, these problems can be identified and corrected. In this case, correcting the problems involved taking the cottage more than halfway apart and rebuilding it. The result meets what the customer wanted:

Later on, the customer might determine that they need a bigger cottage. There are options for how that enlargement could be done: extending vertically by adding a second floor, or extending horizontally by adding a wing. In this example using bricks, it is easy enough to try both options and see which one the customer prefers. The structure is simple enough that one can look at the cottage and see how it is built, and so work out how to add a few bricks to add a floor or a wing. A real building, however, is not so simple: one needs to consider the structural strength of the walls, the nature of the ground on which it is built, and the complexity of the structure. A real building has subsystems like plumbing, electrical wiring, and air conditioning to consider in making changes. Knowing how the wiring and plumbing have been modified in a real building after it was first built helps guide the choices for how to extend the building.

Extending the cottage illustrates three other aspects of systems work.

Systems change after they are first built. The changes can themselves be complex projects.
There are often choices about how to make a change, and making a decision requires knowing about the current state of the system—including why earlier decisions were made.
Changes can happen long after the original system was built, and after the people who designed and built it have left or after their memory of what they did has faded.

2.2 Generalizing

Systems work has five parts:

Understanding the purpose or objectives of the system.
Designing and building the system, including managing the many artifacts that record the design and implementation of the system.
Coordinating the team that builds the system, including ensuring good communication amongst them.
Ensuring that the component parts of the system fit together properly.
Checking that resulting system is fit for the intended purpose.

That is, there is both a technical aspect to the work and a human aspect.

2.2.1 Fitting parts together

Making sure that component parts fit requires tracking how the parts relate to each other. There are many kinds of relations:

Structural, where one part physically supports another;
Thermal, where one part generates heat and another moves it away;
Data, where one part generates information, another communicates it, and yet another uses the data;
Electrical, where one part supplies electrical power and another consumes it;
Traffic flow, where one part provides a physical path that other parts use to move from place to place; and so on.

Each of the component parts will have its own specification, and its own team that specialize in designing and building that part. The systems work is to ensure that each of those teams is well enough informed that it will design and build a part that relates properly to other parts. If one component part is consuming electrical power and producing heat, then it needs to be connected to another component part that can supply it with the electricity it needs, and the heat it generates needs to be moved away so that the component does not overheat and destroy itself. The systems work involves working out what parts supply energy and what parts remove the heat, and then coordinating negotiation between the teams designing all those parts to ensure that the electrical and thermal capacity is enough.

In the cottage example, the brick that is turned the wrong way fails to provide support to the row of bricks above it, and it fails to make a solid wall.

The systems team, in summary, is responsible for designing how all the individual parts in the system will interact with each other, and providing information to parts teams on the role their part needs to fill to make the system function as a whole.

2.2.2 Meeting objectives

Even if all the parts can supply power to each other and such, the aggregation of all the parts into a complete system still has to meet the customer’s objectives. The objectives will include things that the system is expected to do. The objectives will also include properties that the system should have: from simple things like color or mass, to complex ones like safety or profitability.

The team designing and implementing the system must be able to show the customer that the system they are building meets those objectives. If the customer has said, for example, that they expect a certain level of safety when the system is in operation, the team must show evidence that the system they are building will be at least that safe. If the customer has set a target for profitability, then the team must show why it will meet that objective.

In the cottage example, the incorrect design did not meet the customer’s objective of a white cottage: one of the walls had a red brick in it.

The customer will not spell out every objective explicitly. Part of systems work is to find all of the objectives, even the implicit ones. In the cottage example, the customer did not explicitly call out that the cottage needed a door—probably because everybody knows that useful buildings need a way to get in and out. The systems team must work with the customer to find and document all these objectives that are obvious—but then might get missed.

In summary, the systems team is responsible for documenting the customer’s purpose or objectives for the system, and for gathering the evidence that the system as designed or built meets those objectives.

2.2.3 Coordinating

Systems work involves mediating and coordinating design and implementation work among many teams, as well as with the customer and any other in-house stakeholders.

Showing developers how their part fits into the whole. There is a cooperative division of labor between those who work on the system and those who specialize in a part. The systems worker needs to track how parts will interact, and provide guidance to the part specialists on how to make their part fit. The parts specialist provides the in-depth knowledge and focus to build their part, and they need to work with the systems people when they find that they cannot meet a system specification or when they discover new facts about their part.

Handling changes. When a customer requests a change to system objectives, or when someone finds a problem with a design, someone has to work out how to change the system’s design. This effort is generally shared between those working on the system and those working on parts. The systems team identifies what parts might be affected by the change. Part Teams work out what changes, if any, are needed to their part of the system. Together they negotiate how the changes work together so that the revised system is complete and consistent.

Negotiating between developers. Different part specialist teams will develop different parts. The systems effort provides the teams with preliminary interface specifications for how their parts should interact. Sometimes one team cannot support the originally-specified interface, and someone needs to work out how to revise the interface so that it satisfies both parts teams. Systems work will also assign preliminary budgets to different parts—how much mass the part can have, how much RF energy it can emit, and so on. The systems effort makes sure that these budgets are feasible: that the mass allowances fit within the whole system’s total mass allowed, for example. When one part needs to exceed its budgeted amount or comes in well under its amount, there is a systems effort to work out how to reallocate budgets or to identify parts that must be redesigned to provide more margin.

Scheduling and management. While systems and parts teams concentrate on designing and building the technical system, others are responsible for scheduling the work. This includes finding a good order for teams to tackle the design, implementation, and verification of individual parts, so that teams that depend on each other don’t get stalled waiting for one another. This also includes tracking development processes, to ensure that work is being done accurately and that no work is being missed. The systems effort informs these management tasks by identifying how parts depend on each other and where risks are in the system.

Customer negotiation. The customer is not expected to provide a detailed specification of the system; that’s the job for the team that is designing and building the system. From the earliest interaction with the customer through each change and on to customer acceptance, there is systems work in recording the customer’s general objectives, translating that into more precise specifications of the system as a whole, and then working with the customer to validate that the specification actually meets the customer’s desires. A product manager or mission manager may have the primary responsibility for the customer interaction, but the systems team supports them by documenting objectives, developing preliminary system designs, and showing how those designs meet objectives.

2.2.4 Managing artifacts

While people often think of “the system” as its final implementation, from a systems engineering point of view there is much more to the system. The system does include its implementation as it is deployed, but it also includes the designs and the reasoning that leads to the designs. The system includes all of the evidence that the final product complies with regulation and with customer needs.

Consider three different events that normally occur after a system is in operation. First, customer needs change over time, and one needs to build a new version of the system that adds some new capability. The team that will modify the system needs to have the specifications, designs, and rationales that led to the original system so that they can design and build a revised version quickly and correctly. Second, the operational system experiences a failure. Besides determining root causes and repairing the system back to operation, there are likely to be inquiries into what went wrong and why—sometimes including criminal investigations. Maintaining the specification and design information provides key information to these investigations. Finally, a system needs to be checked regularly to ensure that it continues to be as designed. Components wear down and need to be replaced. People become accustomed to a procedure and take shortcuts. Management of the operational system changes organizational structure, negating some key assumption about how people work together to keep the system running correctly or safely.

The products that a systems engineering team has to produce, therefore, are more than just the final hardware, software, or operating procedures. The systems engineering team produces or manages all the artifacts that produce the final system. This includes specifications, designs, analyses, approvals, test systems, test cases and results, and many others. The systems engineers and project management jointly must produce and manage other artifacts like schedules and budgets.

The project team must treat all these artifacts with as much care as they do the final operational system artifacts.

2.2.5 Communicating

The systems team has a unique perspective on the work to build a system. Each part team has a detailed view of how work is progressing on their part, but it is not their job to maintain detailed situational awareness of how everything else in the system is progressing. Well-functioning parts teams will communicate regularly with each other, so everyone will have some information about how the overall work is going, but in practice they will not have all the information they need.

The systems team has the job of making sure that all the teams have all the information they need about the design for how parts connect together and about how design and implementation efforts are proceeding. When someone requests a change to the system, the system team is responsible for ensuring that everyone affected by the change is aware. As the design for one part solidifies, the systems team must ensure that all the teams building parts that interact with that part are aware of that part’s design.

When two parts teams have a disagreement about how to design an interface between their components, it is up to the systems team to coordinate the negotiation. This includes defining the decision criteria that will be used to evaluate different solutions.

Finally, a coherent system implementation depends on each parts team having a correct picture about what they are supposed to design and build. The systems team supports this by maintaining the baseline system design—the record of the approved decisions about the design that allow everyone to understand the same facts about the system design. This includes creating and maintaining the artifacts that reflect the decided-upon design, and communicating to all the teams as those design artifacts are updated and approved.

In summary, it is up to the systems team to maintain the official baseline design for the system, and to push information to every team that needs it.

2.3 The rest of the book

I present some basic models for how to think about systems work in the rest of this first part. After discussing a basic structure for organizing systems work, Part III goes into more detail about what systems are, while Part IV goes into detail about what working with systems is. Following those, a series of parts discuss models of system lifecycles and how to use them to organize the work being done, the tools involved, and issues related to managing a complex systems project.

Chapter 3: Stories about building systems

3.1 Developing a spacecraft mission without engineering the system

10 April 2024

The project. I worked on a NASA small spacecraft project. The project’s objective was to fly a technology demonstration mission to show how a large number of small, simple spacecraft could perform science missions. The mission objectives were to demonstrate performing coordinated science operations on multiple spacecraft, and to demonstrate that the collection of spacecraft could be operated by communicating between one spacecraft and ground systems, and the spacecraft then cross-linking commands and data to perform the science operations.

The problem. The mission had one set of explicit, written mission objectives to perform the technology demonstration. It also had a number of implicit, unwritten constraints placed on it, primarily to re-use particular spacecraft hardware and software designs.

Those two sets of objectives resulted in conflicts that made the mission infeasible. There were three key technical problems: power consumption was far in excess of what the spacecraft’s solar panels could generate; the legacy that could not communicate effectively over the distances involved; and the design had insufficient computing capability to accurately compute how to point spacecraft for cross-link communication.

Conflicts like these are not uncommon when first formulating a system-building project, and NASA processes are structured to catch and resolve them. The NASA Procedural Requirements (NPRs), a set of several volumes of required processes, require projects to formalize mission objectives and analyze whether a potential mission design is feasible. This work is checked at multiple formal reviews, most importantly the Preliminary Design Review (PDR).

At the PDR, expected project maturity is:

Program is in place and stable, addresses critical NASA needs, has adequately completed Formulation activities, and has an acceptable plan for Implementation that leads to mission success [italics mine]. Proposed projects are feasible with acceptable risk within Agency cost and schedule baselines. [NPR7120, Table 2-4, p. 30]

This project, however, failed at three of the necessary steps. First, the project did not perform top-down systems engineering, such as a proper documentation of mission objectives, a concept of operations, and a refinement of those into system-level and subsystem-level specifications. In particular the implicit and undocumented constraints were never documented as requirements; they were tacitly understood by the team and rarely analyzed. Those requirements that were gathered were developed by subsystem leads, and they were inconsistent and did not derive from the mission objectives. Second, individual team members did analyses that showed problems with the the ability of the radios, their antennas, and the ability to point the spacecraft in such a way that cross-link communications would work. The people involved repeatedly tried to find a solution in their individual domain of expertise to fix the problem, and the problems were never raised as a systemic problem. Finally, the PDR was the final check where these problems should have been brought to light as the refinement of mission objectives and the concept of operations would fail to show communication working. Instead, the team focused on making the review look good rather than addressing the purpose of the review.

Outcome. The project proceeded to build the hardware for multiple spacecraft, began developing the ground systems and developing the flight software. After several months, the project neared the end of its budget, and the spacecraft design was canceled. Something like two years’ worth of investment was lost, and the capability of performing a multi-spacecraft science mission was never demonstrated.

The agency later found some funds to develop a much simplified version of the flight software and relaxed the mission objectives substantially to only performing some minimal cross-link communications. A version of that mission was eventually flown.

Solutions. The project made four mistakes. Each one of them could have been corrected if the project had followed good practice and NASA required procedures.

First, the conflicting mission objectives and constraints should have been resolved early in the project. NASA has a formal sequence of tasks for defining a mission and its objectives, leading to a mission definition that is approved and signed by the mission’s funder. If the project had followed procedure, the implicit constraints would have been recorded as a part of this document. Documentation would have encouraged evaluation of the effects of those constraints.

Second, the project did not do normal systems engineering work. The systems engineering team should have documented the mission objectives, developed a concept of operations for the mission, and performed a top-down decomposition and refinement of the mission systems. In doing so, problems with conflicting objectives would have been apparent. The systems leadership would have been involved in analyses of the concept, and thus been aware of where there were problems.

Third, the team lacked effective communication channels that would have helped someone working one individual problem raise the issues they were finding up to systems and project leadership, so that the problems could be addressed as systems issues. For example, one person found that the flight computer would not be able to perform good-enough orbit propagation of multiple spacecraft so that one spacecraft would know how to point its antenna to communicate with another. A different person found problems with the ability of the radios to communicate at the ranges (and relative speeds) involved.

Finally, the PDR should have been the safety net to find problems and lead to their resolution. The NASA procedural requirements have a long list of the products to be ready at the PDR. More than 30 specifically the responsibility of systems engineering [NPR7123, Table G-6, p. 81], and the project overall has a similar number of products [NPR7120, Appendix I]; there is some overlap between these lists. The team took a checklist approach to these several products, putting together presentations for each topic in a way that highlighted progress in the individual topics but failing to address the underlying purpose: showing that there was a workable system design.

Had any of these mechanisms worked, the systems and project leadership would have detected that the conflicting mission objectives were infeasible and led the project to negotiate a solution.

Principles. This example is related to several principles for a well-functioning project.

Section 7.4.1—Principle: Document team structure; in particular, the authority of each team member
Section 7.1.3—Principle: Systems view of the system
Section 7.2.3—Principle: Follow the spirit, not just the letter
Section 7.3.6—Principle: Analyze for feasibility
Section 7.4.5—Principle: Define exceptional communication paths

3.2 Marketing and engineering collaboration

12 April 2024

The project. I worked at a startup company that was building a high-performance, scalable storage system. The ideas behind the system came from a university research project, which had developed a collection of technology components for secure, distributed storage systems.

The company had developed several proof-of-concept components and was transitioning into a phase where it was getting funding and establishing who its customers were. The company hired a small marketing team to work out what potential customers needed and to begin building awareness of the value that the new technology could bring.

The problem. The marketing team had experience with computer systems, but not with storage in particular. They could identify potential market segments, but they did not have the background needed to talk with potential customers about their specific needs.

The engineering team were similarly not trained at marketing. Some of the team members had, however, worked at companies that used large data storage systems and so had experience at being part of similar organizations.

Solutions. The marketing team set up a collaboration with some of the technical leads. This collaboration left each team in charge of their respective domains, with the technical leads helping the marketing team do their work and the marketing team providing guidance about customer needs to the engineering team.

One of the technical leads acted as a translator between the marketing and engineering teams, so that information flowed to each team in terms they understood. Technical leads joined the marketing team on customer visits, helping to translate between the customers‘ technical staff and the marketing team. The marketing team conducted focus group meetings, and some of the technical leads joined in the back room to help frame follow-up questions to the focus groups and to help interpret the results.

Outcome. The collaboration helped both teams. The marketing team got the technical support and education they needed. The engineering team got proper understanding of what customers needed, so that the system was aimed at actual customer needs.

Principles. This example is related to the following principles:

Section 7.1.1—Principle: Communication and translation
Section 7.2.1—Principle: Start with a purpose before doing work

3.3 Missing implicit requirements

13 April 2024

The project. This occurred at the startup I worked at that was building a scalable storage system.

The problem. The team had a focus on making the system highly available, to the point where we had an extensive infrastructure for monitoring input power to servers and providing backup power to each server. If the server room lost mains power, our servers would continue on for several minutes so that any data could be saved and the system would be ready for a clean restart when power came back on. We did a good job meeting that objective.

What we forgot is that people sometimes want to turn a system off. Sometimes there is an emergency, like a fire in a server room, and people want the system powered off right away. Sometimes preventing the destruction of the equipment is more important that losing a few minutes’ worth of data. We had no power switches in the system and no way to quickly power it down.

Outcome. In practice this wasn’t too serious a problem because emergencies don’t happen often, but it meant that the system couldn’t pass certain safety certifications.

Solutions. We made two mistakes that led to the problem.

The first mistake was that everyone on the team saw high availability as a key differentiator for the product, and so everyone put effort into it. This created a blind spot in how we thought about necessary features.

The second mistake was that we did not work through all of the use cases for the system and so implicit features, including power off. Building up a thorough list of use cases can serve as a way to catch blind spots like this, but the team did not build such a list.

Principles.

Section 7.2.6—Principle: Work against cognitive biases

3.4 Building at a mismatch to purpose

15 April 2024

The project. I consulted on a project to build a technology demonstration of a constellation of LEO spacecraft for the US DOD. This constellation was to perform persistent, world-wide observations using a number of different sensors. It was expected to operate autonomously for extended periods, with users world wide making requests for different kinds of operations. The constellation was expected to be extensible, with new kinds of software and spacecraft of new capabilities being added to the constellation over time.

One company organized the effort as the prime contractor. That company built a group of other companies of various sizes and capabilities as subcontractors. The team won a contract to develop the first parts of the system.

The problem. The constellation had to be able to autonomously schedule how its sensors would be used, and where major data processing activities would be done. For example, someone could send up a request for an image of a particular geographic region, to be taken as soon as possible. The spacecraft would then determine which available spacecraft would be passing over that region soon. Some of the applications required multiple spacecraft to cooperate: taking images from different angles at the same time, or persistently monitoring some region, handing off monitoring from one spacecraft to another over time, and performing real-time analysis on the images gathered on those spacecraft.

The prime contractor selected its team of other companies and wrote the contract proposal for the system before doing systems engineering work. This meant that neither a detailed concept for the system’s operation nor a high-level design had been done.

After the contract was awarded, the team had to rapidly produce a system design. This effort went poorly at first because the system’s concept had not been worked out, and different companies on the team had different understandings of how the system would be designed. The team had to deliver initial system concept of operations and requirements quickly after the contract was awarded. The requirements were developed by asking someone associated with each expected subsystem to write some requirements. Needless to say, the concept, high-level design, and requirements were all internally inconsistent.

After the team brought me on to help sort out part of the design problems, we began to do a top-down system design and establish real specifications for the components of the system. We were able to begin to work out general requirements for the autonomous scheduling components.

The project team had determined that they needed to use off-the-shelf software components as much as possible, because the project had a short deadline. One of the subcontractor companies was invited onto the team because they had been developing an autonomous spacecraft scheduling software product, and so the contract proposal was written to use that product.

However, as we began to work out the actual requirements for scheduling, it became apparent that the off-the-shelf scheduling product did not match the project’s requirements. The requirements indicated, for example, that the system needed to be able to schedule multiple spacecraft jointly; the product only handled scheduling each spacecraft independently. The system also had requirements for extensibility, adding new kinds of sensors, new kinds of observations, and new kinds of data processing over time. This suggested that strong modularity was needed to make extensibility safe, but the off-the-shelf product was not at all modular.

Outcome. The mismatch between the decision to use the off-the-shelf scheduling product and the system’s requirements led to both technical and contractual problems.

The technical problem was that the scheduling product could not be modified to work differently and thus meet the system requirements. The project did not have the budget, people, or time to do detailed design of a new scheduling package that would meet the need.

The contractual problem was that the subcontractor had joined the project specifically because they saw a market for their product and were expecting to use the mission to get flight heritage for it. When it became clear that their product did not do what the system needed, they discussed withdrawing from the project.

In the end, the customer decided not to continue the contract and the project was shut down.

Solutions. This project made three mistakes that, had they been avoided, could have changed the project’s outcome.

First, the team did not do the work of early stage systems engineering to work out a viable concept and high-level design before committing to partners and contracts. This would have made it clear what was needed of different system components. It would also have provided a sounder basis for the timelines and costs in their contract proposal.

Second, the team made design and implementation choices for some system components without understanding the purpose that those components needed to fill.

Finally, the team made commitments to using off-the-shelf designs without determining whether those designs would work for the system.

Principles. The solutions above are related to the following principles:

Section 7.1.3—Principle: Systems view of the system
Section 7.2.1—Principle: Start with a purpose before doing work
Section 7.2.2—Principle: Evaluate tools before adopting them

3.5 Heavyweight, understaffed processes

24 April 2024

The project. A colleague was an engineer working on an electronics-related subsystem at a large New Space company that was building a new launch vehicle.

The problem. The team in question was responsible for designing one of the avionics-related subsystems and acquiring or building the components. This required finding suppliers for some components and ordering the necessary parts.

The company had processes in place for both vendor qualification and parts ordering. They included centralized software tools to organize the workflow.

The vendor qualification process began with submitting a request into the tools. The request was then reviewed by a supplier management team; once they approved a supplier, the avionics team could start placing ordering requests to buy parts. The purchase request would similarly be routed to an acquisition team that would make the actual purchase from the supplier.

The intents of this process were, first, to take the work of qualifying potential vendors and managing purchases off the engineering team, and second, to ensure that the vendors were actually qualified and that parts orders were done correctly.

From the point of view of the engineers building the avionics, the processes were opaque and slow. They would put in a request, and not know if they had done so properly. Responses took a long time to come back. At one point, my colleague reverse engineered the vendor qualification process in order to figure out how to use it; the result was a revelation to other engineers.

It also appeared that the positions responsible for processing these requests were understaffed for the workload. In practice these people did not have the time to do proper reviews of the vendors most of the time.

Outcome. Having supply chain processes was a good thing: if it worked, it increased the likelihood that the acquired parts would meet performance and reliability requirements, that the vendors would deliver on schedule and cost, and that the cost of acquiring parts remained within budget.

However, getting vendors qualified to supply components and then getting the components took a long time, delaying the system’s implementation and then delaying testing and integration.

The suppliers and the parts did not get the intended scrutiny, which may have let problem suppliers or parts through.

The company acquired a reputation with its employees of being slow and difficult to work for.

Solutions. There are four things that could have been done to make these processes work as intended.

First, the processes should be documented in a way that everyone involved knows how the process works. In this situation, it seems that people playing different parts in the process knew something about their part, but they did not understand the whole process; if there was documentation, the people involved did not find it. The process documentation should inform all the people involved what all of the steps are, so they understand the work. It should make clear the intent of the process. It should also make clear what would make a request successful or not.

Second, the processes should be evaluated to ensure that every step adds value to the project, compared to not doing that step or doing the process another way.

Third, the supporting roles—in this case, those tasked with reviewing and approving requests—should be staffed at a level that allows them to meet demand.

Finally, the project should regularly check whether its processes are working well, and work out how to adjust when they are not working.

Principles. The following principles apply:

Section 7.1.4—Principle: The team is a system
Section 7.1.6—Principle: Keep it lightweight and actionable
Section 7.4.2—Principle: Plan on reorganizing the team as it grows

3.6 Planning the transcontinental railroad

24 April 2024

The project. The first transcontinental railroad to cross North America was built between 1862 and 1869. [Bain99] It involved two companies building the first rail route across the Rocky Mountains and the Sierra Nevada, one starting in the west and the other in the east. It was built with US government assistance in the form of land grants and bonds; the government set technical and performance standards that had to be met in order to get tranches of the assistance. The technical requirements included worst-case allowable grades and curvature. The performance requirements included establishing regular freight and passenger service to certain locations by given dates.

The problem. The companies building the railroad had limited capital available to build the system. They had enough to get started, but continuing to build depended on receiving government assistance and selling stock. Government assistance came once a new section of continuous railroad was accepted and placed into service. In addition, the two companies were in competition to build as much of the line as possible, since the amount of later income depended on how much each built.

This situation meant that the companies had to begin building their line before they could survey (that is, design) the entire route. They operated at some risk that they would build along a route that would lead to someplace in the mountains where the route was uneconomical—perhaps because of slopes, or necessary tuneling, or expensive bridges.

Because the building began before the route was finalized, the companies could not estimate the time and resources needed for construction beyond some rough guesses. The companies worked out a general bound on cost per mile before the work started, and government compensation was based on that bound. In practice the estimate was extravagantly generous for some parts of the work.

Solutions. The initial design risk was limited because there were known wagon routes. People had been traveling across the Great Plains and the mountains in wagons for several years. While the final route did not exactly follow the wagon routes, the early explorations ensured that there was some feasible route possible.

The companies built their lines in four phases: scouting, surveying, grading, and track-laying. (In some cases they built the minimal acceptable line with the expectation that the tracks would be upgraded in the future once there was steady income.) Scouting defined the general route, looking for ways around bottlenecks like canyons, rivers, or places where bridges or tunnels would be needed. Surveying then defined the specific route, putting stakes in the ground. The surveyed route was checked to ensure it met quality metrics, such as grade and curvature limits. After that, grading crews leveled the ground, dug cuts through hills, and tunneled where necessary. Finally, track-laying crews built bridges and culverts where needed, then laid down ballast, ties, and rail. After these phases, a section of track was ready for initial use.

Scouting ran far ahead of the other phases, sometimes up to a year ahead. Survey crews kept weeks or months ahead of grading crews. The grading and track-laying crews proceeded as fast as they could. All this work was subject to the weather: in many areas, work could not proceed during winter snows.

Outcome. The transcontinental railroad was successfully built, which opened up the first direct rail links from one coast of North America to the other. The early risk reduction—through knowledge of wagon routes—accurately showed that the project was feasible.

The companies were able to open up new sections of the line quickly enough to keep the construction funded. The companies received bonds and land grants quickly enough, and revenue began to arrive.

The approach of scouting and surveying worked. The scouting crews investigated several possible routes and found an acceptable one. While there were instances of tentatively selecting one route then changing for another—sometimes for internal political reasons rather than technical or economic reasons—no section of the route was changed after grading started. In later decades other routes were built, generally using tuneling technology that was not available for the first line. Many parts of the original line are still in regular use.

Principles. The transcontinental railroad project was an example of planning a project at multiple horizons, when the work of implementing begins before the design is complete, and where the plan and design is continuously refined.

Section 7.3.1—Principle: Prioritize work by risk or uncertainty
Section 7.3.3—Principle: Have a long-term plan
Section 7.3.6—Principle: Analyze for feasibility

Part II: Systems background

Chapter 4: A model of systems work

26 November 2023

This book is about the work involved in making a system—what a system is, and how to do a good job making one.

The organization of this book reflects these two aspects. I start by discussing a model of what a system is: what the purpose of a system is, what makes it up, how the pieces of a system can be organized (Chapter 5). The model of what a system is is a foundation for a model of how to build that system—the human elements of organizing and performing the work involved, and the tools that can organize and enable the work (Chapter 6).

The system is the objective of the work (Section 5.3). A system is built to achieve some purpose for some user(s). The system is made up of parts that work together to achieve that purpose. The system is the outcome of all the work that people do in making the system. The system goes through a life cycle (Section 6.3.5), from concept to design and implementation, to operation, updates, and eventually retirement. Most importantly, the system is not static: it will evolve rapidly as it proceeds from concept through design; once it is in operation, it will continue to evolve as its users’ needs evolve.

Some “systems” actually consist of multiple systems that must work together. Consider an electronic widget, consisting of hardware and software. The widget is what most people think of as “the system”, but one must also implement:

The manufacturing system for building the widgets, including supply chains, widget assembly, and unit tracking;
A distribution system for getting widgets to customers, including warehousing, delivery channels, and interfaces with sales systems; and
Repair and maintenance facilities to keep the widgets working, and possibly to recycle widgets when they are retired.

Moreover, many electronic widgets can be upgraded in the field with updated software. The updated software is a new version of part of the widget, but it needs a distribution mechanism for getting the new software versions to widgets in operation. In other words, the “system” is usually a lot more than just the widget.

The tools are things that are used in the process of designing, implementing, deploying, and maintaining the system. They are not part of the system itself; they are not delivered to users and play no part in system operation. This includes things like:

Configuration and artifact management for storing design and implementation artifacts;
Test equipment used to verify that the system meets its objectives;
Task and issue tracking;
Design tools, such as CAD systems; and
Software tools, such as build systems and checking systems.

The execution is about the actual work to make the system, using the tools. Execution includes:

Who the people are who do the work, and how they are organized;
The tasks they perform, and the goals or milestones they meet along the way;
How the tasks are chosen, ordered, and coordinated;
The plan for completing the work;
Who makes decisions, about what, and when; and
How information propagates through the team.

All three parts—system, tools, and execution—are themselves systems that deserve to be treated seriously as systems. The organization of the team, in particular, is a system that should be designed (not just allowed to self-organize by happenstance). The organization, including its roles, authorities, and responsibilities, should be documented just as any product system should be. An organization is put into operation, and should be monitored to see what is functioning well and what needs to be updated as an organization grows and matures.

The first part of this book is, therefore, organized around this way of thinking about making a system. The rest of the introduction part presents overviews of models for thinking about systems, tools, and project execution in order to provide a foundation for what follows. Other parts follow to provide greater detail.

Chapter 5: Elements of systems

6 December 2023

5.1 Introduction

Working with systems is about working with the whole of a thing. It is a bit ironic that to make the whole accessible to rational design, we need to talk about the parts that make up systems work.

That is one of the first points about systems. Most systems are too complex for a human mind to remember and understand as a whole at one time. To work on these systems, we must find ways to abstract and to subset the problem. This book discusses some of the techniques for slicing a system into understandable parts, along with ways to use those techniques and why to use them. In the end, however, everything in here deals with carefully-chosen subsets of a system.

This chapter covers some of the essential concepts and building blocks that are the foundation for the techniques discussed in the rest of this book.

The subjects for systems work can be divided into four groups:

The system’s purpose
The parts and views of the system
The structure and emergence in the system
Evidence that the system and its design meet its purpose

The first three subjects are connected by a reductive approach to explaining complex systems, in which the high-level purpose is explained by reducing it to simpler constituent parts and structure, and conversely expressing the purpose as emergent from these simpler parts. The final subject is about ensuring that the system does what it is supposed to do (and only that).

5.2 System purpose

Every system that is designed and built has a purpose. That is, someone has an expectation of the benefits that will come from building the system, and they believe that those benefits will outweigh the costs (in resources, time, or opportunities) that will be incurred building the system.

Every system must be designed and built to address its purpose, and no other purposes, at the lowest cost practically achievable. This point may seem uncontroversial on its surface, but I have observed that the majority of projects fail to work to this standard, and incur unnecessary costs, schedule slips, or missed customer opportunities. Every design choice must be weighed according to how well each option helps satisfy the purpose or not; if an option does not, it should not be chosen.

Making design decisions guided by a system’s purpose means that the team must understand what that purpose is. The purpose must be recorded in a way that all the team members can learn about it. It also needs to be accurate: based on the best information available about what the system’s users need, and as complete as can be achieved at the time. The record of the purpose should avoid leaving important parts implicit, expecting that people will know that systems of a particular kind should (for example) meet certain safety or profitability objectives; people who specialize in one area will know some of these implicit needs but not others. The purpose documentation should also include secondary objectives, such as meeting regulatory requirements or leaving space in the design for anticipated market changes.

The understanding of a system’s purpose and costs will shift over time, both as the world changes and as people learn more accurately what the value or cost will be. When the idea for the system Is first conceived, the purpose may be accurate for that time but the understanding of the cost is likely to be rough. As design and development progress, the understanding of cost improves, but the needs may change or a customer may realize they misunderstood some part of the value proposition.

A system’s purpose also changes over longer periods of time. People add new features to an existing product to expand the market segment to which it applies or to help it compete against similar products. The technology available for implementing a system changes, creating opportunities for a faster, cheaper, or otherwise better system.

Systems leadership have to balance the needs for a clear and complete statement of a system’s purpose with the fact that the understanding of purpose will change over time. The agile [Agile] and spiral [Spiral] management methodologies arose from this need for balance between opposing needs. Later chapters ! Unknown link ref address how systems engineering methodologies can help address this need.

Working in a way that is driven by system purpose requires discipline in the team and its leadership. Many junior- and mid-level engineers are excited about their specialist discipline, and want to get to designing and building as quickly as possible—after all, those are the activities they find fulfilling. I have observed team after team proceed to start building parts of a system that they are sure will be the right thing, without spending the effort to determine whether those parts are actually the right ones. Those design decisions may end up being correct many times, which leads to a false confidence in decisions taken this way (“I’m experienced; I’m almost always right!”). The flaw is that the wrong decisions can have a high cost, high enough to outweigh any benefit from the rapid, unstudied decision.

I have seen many teams say—rightly—that they need to make some design decision quickly, see whether it works, and then adjust the design based on what they learn. This line of reasoning is both a good idea and dangerous. If a team actually does the later steps of evaluating, learning from, and changing the design then this approach can result in good system design. (This is discussed more in later chapters XXX on prototyping.) However, most teams lack the leadership discipline to perform to this plan: once there is some design in place, pressures to keep moving forward drive teams to live with the bad initial design and accept complexity and errors. It requires discipline and commitment from the highest levels of an organization to take the time needed to learn from an early design and change what they are doing. The leadership must be prepared to push back against pressures to just live with a poor design and instead to require their team to take the time to learn and adjust, and to be clear with external parties, such as investors, that the plan is a necessary and positive way to realize a good product.

In Chapter ! Unknown link ref, we discuss techniques that can help to keep a system’s development grounded in its purpose, while adapting to changes in purpose and learning about the system’s design choices over time.

5.3 System parts and views

Systems are designed and built by people. The methods used to build them must account for two human issues. First, most systems today are too complex for one person to keep in mind all the parts at one time, leading to a need to work with subsets of the system at any given time. Second, most systems also require multiple people to design or build, either because of specialties or the total amount of work involved. This leads to the need to break the work up into parts for different people to work on.

There are two techniques used to address this need. First, systems are divided into component parts, typically in a hierarchical relationship: the system is divided into subsystems, which are in turn subdivided, until they reach component parts that are simple enough not to require further subdivision. Second, people approach the system through narrow views, each of which covers one aspect of the system but across multiple component parts—such as an electrical power view, an aerodynamics view, or a data communications view.

Dividing the system into component parts creates pieces that are small enough to reason about or work on in themselves. The description of the part must include its interfaces to other parts, so that the design or implementation can account for how it must behave in relation to other parts. However, the interface definitions abstract away the details of other parts, so that the person can concentrate their attention on just the one part.

Dividing up the system also allows different people to work on different parts, as long as both parts honor the interfaces between them. The division into parts, and the definition of interfaces, create divisions of responsibility and scope for communication for the different people. This is addressed further in the Teams section (Section 6.3.3).

The hierarchical breakdown of the system into components and subcomponents provides a way to identify all of the parts that make up the system, ensuring that all can be enumerated. It also defines a boundary to the system: the system is made up of the named parts, and no others.

Reasoning about views of a system provides a similar and complementary way of managing the complexity of reasoning about a system by focusing on one aspect across multiple parts, and abstracting away the other aspects. This allows different people to address different aspects, as long as the aspects do not interact too much. For example, specialist knowledge, such as about electrical system design, can be brought to bear without the same person needing to understand the aerodynamics of the aircraft in which the electronics will operate.

Sidebar: Non-reductive systems

This approach of defining a system in reductive terms—using parts and structure—is not a formal necessity of systems in general. Rather, this approach is used as a way for ordinary people to define, build, and check systems.

There are numerous examples of non-human processes that have developed complex systems that are not easily explained reductively. Many of these were developed using evolutionary methods, both biological and electronic. Others arise from other optimization and machine learning techniques.

Consider the circuit discussed by Thompson and Layzell [Thompson99]. This circuit was developed by evolving a design on an FPGA, so that the result would distinguish between inputs at two different frequencies. The resulting circuit design achieves its objective, but is not readily understandable by decomposing the design into individual elements on the FPGA—indeed, the presence of some cells that did not appear to be used directly appeared to be essential to the circuit’s function.

Similar results have been demonstrated in mechanical design, using generative design tools.

While most such designs of which I am aware focus on a single objective, such as signal recognition or handling mechanical forces, the techniques used can in principle work to optimize several aspects of a system at once. Imagine, for example, a structure that provides mechanical structure, thermal conduction, and electrical transmission combined in one. Such a design would generally not be susceptible to a reductive analysis because the structure entwines these aspects.

While these designs are not readily understood by decomposition, they still must be verified for conformance with their purpose. This starts with a clear definition of purpose, from which the fitness or objective function used in optimization can be derived. For critical systems or components, the objective function must not only specify what the desired behaviors are, but also the undesired behaviors and the behaviors when the system is outside its intended performance environment. In some methods, the objective “function” can be an adversarial neural network that must itself be trained based on the system’s purpose. The result of the generative or optimization method must also be verified against the purpose to check that the result is in fact correct—which can catch errors in building the objective function, or subtle dependencies on environment. (For example, the signal recognition circuit only worked well on the specific FPGA on which it was evolved; when moved to another FPGA of the same model, it was reported to work poorly. [Thompson99])

5.4 Structure and emergence

Decomposing a system into component parts is one part of the system’s design; the other part is how those components relate to each other. The relations between parts define the structure of the system. These relations include all the ways that components can interact with each other, at different levels of abstraction. At low levels, this might be interatomic forces at the molecular level; at medium levels, mechanical, RF, force, or energy transfers; at higher levels, information exchange, redundancy, or control.

The structure needs to lead to the system’s desired aggregate properties, such as performance, safety, reliability, or specific system functions like moving along the desired path or providing reliable electrical service.

The aggregate properties are emergent, and arise from the way the structure combines the properties of individual components.[1] The structure must be designed so that the system has the desired emergent properties and avoids undesired ones. For example, a simple reliable system has a reliability property that arises from the combination of two or more components that can perform the same function, along with the interaction patterns of each component receiving the same inputs, each component generating consistent outputs, how the two or more results are combined, and how each component responds to failure.

The structure must be designed to avoid unanticipated emergent properties, especially when those properties are undesirable. In a safe or secure system, for example, it is necessary to show that the system cannot be pushed into some state where it will perform an unsafe action or provide access to someone unauthorized. Avoiding unanticipated emergent properties is one of the hardest parts of correctly designing a complex system.

The structure must be well-designed for the system to meet its purpose, and for people to be able to understand, build, and modify it. In particular the structure needs to be:

Understandable by the people who design, implement, verify, and use the system;
Analyzable and verifiable so the system’s design and implementation can be confirmed to meet the system’s purpose; and
Modifiable as needs and technologies change.

There are good engineering practices that should be followed to achieve these aims, as we discuss in ! Unknown link ref.

Finally, the structure determines the interfaces that each component part must meet. Those interfaces in turn determine a component’s functions and capabilities, which guide the people working on the component, as discussed in the previous section.

5.5 Evidence

It is not enough to design and build the system; the team must also show that the system meets its purpose.

The team developing or maintaining the system must be able to show that the system complies with its purpose to customers, who need to know that the system will do what they expect; to investors, who need evidence that their investment is being used to create what they agreed to fund; and to regulators, especially for safety- or security-critical systems, who are charged with ensuring that systems function within the law.

The team also needs to ensure that pieces of the system meet the system’s purpose as they are developing or modifying those pieces. They must be able to judge alternative designs against how well they meet the purpose, and once built they must be able to check that the result conforms to purpose.

The process of showing that a system or a component part fulfills its purpose involves gathering evidence for and against that proposition, and combining the evidence in an argument to reach an overall conclusion about compliance. There are many kinds of evidence that can be gathered: results of testing, results of analyses, results of expert analysis, or results from performing a demonstration of the system. These individual elements of evidence are then combined to show the conclusion. The combination usually takes the form of an argument: a tree of logic propositions starting with the purpose and devolving hierarchically into many lower-level propositions that can be evaluated using evidence. The process must show that the structure of the argument is both correct and complete in order to justify the final conclusion.

Pragmatically, arguments about meeting purpose usually follow a common pattern, as shown below. The primary argument that the implementation meets the purpose consists of a chain of verification steps. The implementation complies with a design, which complies with a specification, which complies with an abstract specification, which complies with the original purpose. As long as each step is correct, then the end result should meet the original purpose—but at each step there is the possibility of misinterpretation or missing properties, or that the verification evidence at each step is not as complete as believed. In practice this approach leaves plenty of uncaught errors in the final implementation. To catch some of these errors in the chain of verification steps, common practice is to perform an independent validation, in which the final implementation is checked directly against the original purpose.

Some industries, particularly dealing with safety-critical automotive and aerospace systems, add an additional kind of evidence-based correctness argument. This is often called the safety case or security case, and consists of an explicit set of propositions, starting with the top level proposition “the system is adequately safe” (or secure) and showing why that conclusion is justified using a large hierarchy of propositions. The lowest-level propositions in the hierarchy consist of concrete evidence; intermediate propositions combine them to show that more abstract safety or security properties hold. ! Unknown link ref

Finally, evidence takes many forms, depending on what needs to be shown. Some correctness propositions can be supported by testing. These typically show positive properties: the system does X when Y condition holds. Some of these conditions are hard to test, and are better shown by analysis or human review of design or implementation. Negative conditions are harder to show: the system never does action X or never enters state Y, or does so at some very low rate. These require analytic evidence, and cannot in general be shown by testing.

We discuss matters of correctness, verification, validation, and the related arguments in ! Unknown link ref.

5.6 Using this model

Part III goes into greater detail about each part of this model.

This model of systems provides a foundation for organizing the work that needs to be done to build the system. As will be discussed in the next chapter , each component will require a number of tasks to design, build, verify, and integrate that component. The specific flow of work is based on the system’s life cycle (Section 17.3).

This model also provides a basis for defining the artifacts (Section 6.3.1, Chapter 14) that record the system’s design and implementation. ! Unknown link ref discusses what information is maintained for each part of the system.

Finally, the model of how systems are organized provides guidance for planning the work to build the system. XXX reference continuous system integration, risk-based planning, risk management, prototyping

[1] While there is extensive literature on the philosophical basis for emergent properties (see e.g. [OConnor21]), the case for emergence in human-designed systems is altogether simpler. Supervening properties, such as “safety” or “redundancy” or “performance” can be treated as real, and human engineering follows a model of physicalism in which higher-level emergent properties arise wholly from properties of the lower levels of the system—regardless of whether that interpretation is fundamentally true or not.

Chapter 6: Elements of making a system

29 March January 2024

6.1 Introduction

The previous chapter defined what a system is. In this chapter, I turn attention to how to make that system. “Making” includes the initial design and building of the system, as well as modifications after the initial version has been implemented.

Making the system is a human activity. Building a system correctly, so that it meets its purpose, requires a team of people to work together. Building systems of more than modest complexity will involve multiple people, usually including specialists who can work on one topic in depth and people who can manage the effort. It involves people with complementary skills, experiences, and perspectives. Such systems take time to build, and people will come and go on the team. Systems that have a long life that leads to upgrades or evolution will involve people making modifications who have no access to the people who started the work.

This chapter provides a model to organize and name the things involved in the making of a system—the activities, the actors, and what they work with. Later chapters provide details on each part of this model. This model includes both elements that are technical, such as the steps to design some component, and elements that are about managing the effort, such as organizing the team doing the work or planning the work. Note that this model does not attempt to cover all of managing a system project—there is much more to project management than what I cover here.

The model presented in this chapter only serves to name and organize. I do not recommend here different approaches one can take for each of the elements of the model; only attributes that good approaches should have. Later parts of this book address ways to achieve many of these things. For example, the team that is designing a system should have an organization (a desirable attribute), but I do not address which organizational structures one can choose from.

The assembly of all the parts involved in making a system is itself a system. In those terms, this chapter presents the purpose (Chapter 8) of the system-making system and a high-level concept for how to organize the high-level components (Chapter 9) in that system.

6.2 Objective

This model of making captures the activities and elements involved in executing the project to make or update a system.

The approach used for making the system should:

Build a good system. This means the system should have a clear design, be safe and secure, and be maintainable.
Be cost and time efficient. A system usually has a budget or a deadline; the work should proceed without wasting either time or money.
Keep the workers who build the system satisfied.
Satisfy the customers who will sponsor or use the system.
Position the organization building the system for future work, if appropriate.

6.3 Model

The making model has five main elements:

Artifacts: the things created that make up the system and its records
Tasks: the activities that are performed to make artifacts
Team: the people who perform tasks
Tools: things that the team uses in performing tasks
Operations: how the team manages the work to be done

6.3.1 Artifacts

The artifacts are the things that are created or maintained by the work to make the system.

The artifacts have three purposes. First, the artifacts include the system’s implementation—the things that will be released or manufactured and put in users’ hands. The artifacts should maintain the implementation accurately, and allow people to identify a consistent version of all the pieces for testing or release. Second, the artifacts are a communication channel among people in the team, both those in the team in the present and those who will work on the system later. These people need to understand both what the system is, in terms of its design and implementation, and why it is that way, in terms of purpose, concept, and rationales. Finally, the artifacts are a record that may be required for future customer acceptance, incident analysis, system certification, or legal proceedings. Those evaluating the system this way will need to understand the system’s design, the rationales for that design, and the results of verification.

The artifacts should be construed broadly. They include:

Records of the system’s purpose (Chapter 8).
Documents recording the system’s concept and design.
The system’s implementation.
Verification records.
Rationales for design choices.
Plans, defects, analyses, and activity logs.
Procedures and processes to guide work.
Information about the team and roles.

Artifacts other than the implementation are valuable for helping a team communicate. Accurate, written documentation of how parts of the system are expected to work together—their interfaces and the functions they expect of each other—are necessary for a team to divide work accurately.

Many engineers focus solely on the implementation artifacts, especially in startup organizations that are trying to move quickly, and do not produce documents recording purpose, design, or rationales. If the organization is successful and the system they are building enters service, at some point this other information will be required—as the team membership turns over, or as the complexity of the system grows, or as the team finds flaws that need to be corrected. The startups I have observed have all had to reconstruct such information after the fact; the reconstructed information is less accurate and costs more than it would have been if it had been recorded from the beginning.

Finally, the artifacts should be under some kind of configuration management. Artifacts will evolve as work progresses. One artifact may be a work in progress, meaning others may want to review or comment but that they should not count on the artifact’s contents being stable. An implementation artifact may reflect some design artifact; when the design artifact is revised, people must be able to see that the implementation reflects an older version of the design. When the implementation artifacts are packaged up and released, the resulting product needs to have consistent versions of all the implementation parts.

6.3.2 Tasks

These are the individual activities that team members perform. The tasks use and generate artifacts. I rely on the colloquial definition of “task” and do not try to formalize the term here.

Systems projects usually have vast numbers of tasks. These include tasks for designing, building, and verifying the system; they also include tasks for managing the project, reviewing and analyzing parts of the system, and approving designs and implementations.

There are usually far more tasks to be worked on than people to do them. Tasks also usually have dependencies: something needs to be designed before it is implemented, or one part of the system should be designed before another.

Tasks, in themselves, need to be known and tracked. People on the team need to know what they can be working on, and who is doing other tasks that might relate to their work. Managers need to be able to track what is being done, what tasks are having problems, and ensure that tasks are coordinated and completed.

Operations, discussed below, addresses questions of what tasks are needed and which ones should be performed in what order.

6.3.3 Team

These are the people who do the tasks. They are not an amorphous group of indistinguishable and interchangeable parts; each person will have their own abilities and specialties. Each person will also have their own authority, scope, and responsibilities.

The team should be organized. This means:

Everybody knows who is on the team.
Each person knows what they should work on—the scope of their responsibility—as well as what they should not work on.
Each person knows who has what responsibility and authority, so that they know who to talk with when they have a question or request.
Each person knows who they should inform of both technical and administrative matters, so that when one person is making a discovery or decision they know who they should tell.
Each person should know the process by which decisions are made, so that they know what decisions have been made or not.

In addition, the team needs to be staffed with enough of the right people to get work done. This means that people with management responsibility need to know who is on the team and their respective strengths, as well as the workload each one has and the overall plan for moving the project forward.

6.3.4 Tools

These are things that the team uses to get its tasks done. The tools are not part of the system being produced, though they are often systems in their own right. An end user of the system being produced will not use these tools, either directly or indirectly.

The tools include things like:

Tools for storing and tracking artifacts as they are created and updated.
Software build tools and hardware design tools (e.g. CAD systems).
Verification and analysis tools.
Testing infrastructure, including physical equipment, procedures, and analysis software.

6.3.5 Operations

Operations is about organizing the work that the team does. Its primary function is to ensure that the right tasks are done by the right people at the right time.

Operations sets up “a set of norms and actions that are shared with everyone” in the project [Johnson22, Chapter 2]. It gives people in the team a shared set of rules and procedures for doing their work, and it uses those procedures to manage a plan and tasks that coordinate that work. When people share a set of rules and procedures, they can each have confidence in how others are working and in the results that others produce.

There are two primary objectives for operations: making sure the work proceeds efficiently, and ensuring product quality. Operations has secondary objectives, including keeping the organization informed of progress and needs.

Ensuring the project runs efficiently implies several things.

Avoiding unnecessary work and rework.
Avoiding delays where some team members are blocked because some other part is not yet ready.
Ensuring resources are available when needed.
Organizing communication within the team so that needed information is shared.
Ensuring that the project can continue to operate into the future by providing for communication with future team members.

Ensuring quality means:

Making sure that design and implementation are firmly based in system purpose.
Ensuring that work is checked for meeting correctness and quality standards.
Ensuring that work involving multiple people is coordinated so that they do not work divergently.

I look at operations through the lens of the tasks that people on the team will do. Operations is about tracking what tasks need to be done, who is working on them, and how those tasks are going. Operations is, in a way, a feedback control system that keeps the flow of tasks running smoothly.

Operations is more than overseeing tasks, however. It is equally about guiding the team through its work, especially in how people should coordinate their efforts. This starts with setting out the guidelines for how work should get done: procedures and process. That leads to planning, which sets the longer-term direction for the project’s work and allows project management to check whether the work is proceeding well. Planning leads to managing the work being done at the moment. All of these depend on information that supports decisions that have to be made.

I use the following model to define the parts that make up operations. This model has a flow from a project life cycle, which is established early in the project and changes rarely, through parts that organize the work, onward to day-to-day tasking. I explain this model in more detail in Chapter 17.

Life cycle. This defines the overall patterns of actions that the team will perform as it does the project. It defines phases of work and how one phase should happen before another. A typical phase is made up of many tasks; it covers (for example) the the work designing some component. The life cycle also defines milestones, which provide planned times when checks on work are done in a phase.

A life cycle pattern says things like: “First work out purpose, then specifications, then design, then implementation. At the end of each of these phases, have a review with one person designated to approve moving forward.”

There are many different life cycle patterns, and usually an organization or a project will need to pick one—and then customize the life cycle to meet its specific needs. Sometimes the life cycle will be determined by external requirements; for example, NASA defines a common life cycle for all its projects [NPR7120].

Procedures. While the life cycle defines in general what to do, the procedures define how to do some tasks. They provide specific instructions for how to do particular actions or tasks. The instructions might take the form of a checklist, a flow chart, or a narrative.

People on the team need to know how to do things that require coordination. While team members should be able to do most of their work independently, at some point they will need to work together. The work will go more smoothly if everyone understands when they need to work together and how to do it.

There are also some tasks that are procedurally complex, even when only one person is involved. For these tasks it is helpful to have written down the steps to perform—which serve in effect as a checklist.

Procedures should be defined for tasks where getting the actions right is critical or where the task is complex. In the example below, checking a document artifact into a repository is simple, but needs to be done correctly. Performing a design review and approval has potentially many steps to go through: communicating the design to others for review, an approval decision by a designated team member, and changing the status of design documents to show that it has been released. When the life cycle defines a point in the project when something should be checked, such as during a review, procedures ensure that all the needed checks actually happen.

Documented procedures help the team perform tasks accurately, helping to make sure that steps aren’t missed. They also help the team do those tasks in compatible ways so that one person’s work can build on another’s.

I have seen teams that try to operate without some ground rules for working together. This can work quite well for teams up to three or four people, and when the artifacts they produce do not need high assurance (that is, when what they produce is not safety- or security-critical). On larger teams that have not written down their basic process rules, I have always seen failures to communicate or consult. These failures sometimes led to errors in the system that had to be corrected later once found. Sometimes they led to one person damaging another person’s work, requiring time and effort to recreate overwritten designs.

Documenting procedures also provide a way for the project to learn and improve. If some procedure is not working well, the team can identify which procedure is the problem and then change it. As long as team members then follow the revised procedure, the team’s ability to work should improve over time. Contrast this to not documenting a procedure: some people may have opinions on how to do it better, and they may start doing it the new way, but not everyone will know about the change, and people may forget it after a little while. This makes learning slower and less reliable.

Plan. The plan defines the overall intended path forward to a completed system, along with selected milestones along the way. It is a current best estimate of the general steps needed to move the project toward that goal.

A plan records the approach the team intends to take to build the system. It lays out the phases of work expected, in coarse to medium granularity. In doing so, it records decisions like the flow from specification to design to implementation to verification. It records when the team decides to investigate different ways to design some component, perhaps prototyping some of the ways. It documents expected dependencies and parallelism.

The plan is, therefore, a record of how parts of the life cycle pattern are applied to this specific project. Just as there are many patterns that a project can choose to use, there are many different ways to organize the project’s work. I discuss these choices in depth in Chapter 17.

A plan is not necessarily a schedule. A schedule is usually taken to mean a sequence of events with a high confidence of accuracy and completeness. A plan, on the other hand, reflects the uncertainties that come with developing a complex system. In the beginning, the plan can be specific about a few things in the near term but must be vague about the longer term until enough design has been completed to fill out later work. As a project progresses and more and more becomes known, the plan should converge to something like a schedule.

A plan is broader than a list of specific tasks. It consists of a number of work phases, and dependencies among them. This information then guides the specific tasks, as discussed in the section on tasking below.

Plans are used in prospect, in the moment, and in retrospect. They should provide guidance on what direction the work will likely go in the future, even when that direction has uncertainty. They are used in the present to track what is happening now. They provide history of what has been done, to understand how the team’s work compares to predictions and to provide accountability for everyone responsible for working on the project.

I have never encountered a project that had a single plan for the whole duration of the work. Plans have always been dynamic. Early in the project, we knew that we needed to develop a concept for the system but did not yet know enough to sketch out the work involved in building that concept. Later we had a general structure for the system, but there were technical questions to resolve; once resolved, we would know what we were building. Later in the project, we would find defects or we would get a change order, resulting in unanticipated work.

Tasking. This is the day-to-day definition of tasks to be done, their assignment to team members to perform, and tracking their progress.

Tasking involves continuous decision-making: the choice of which tasks should be performed next, or which tasks should be interrupted to deal with higher-priority tasks. These choices merge several streams of potential tasks: ones that derive from the nearest parts in the plan; ones made newly urgent by a change in what is known about the system; ones about fixing errors that have been discovered; and tasks related to new outside requests.

The team will need to keep track of both the potential tasks and the ones that have been assigned and are being worked on. This implies record-keeping artifacts.

The criteria for deciding about tasks should be encoded in procedures, as discussed above. The procedure for choosing tasks can be viewed as a control system that responds to project events to affect the set of tasks assigned for work, with the aim of making the project’s execution run efficient. “Efficiently” means meeting the goals set out above for operations: ensuring that the right work is done, that people aren’t blocked from getting work done, and that the work follows orders or dependencies needed for high-quality work.

How the tasking control system works depends on the development methodology used in the project. Agile development, for example, often focuses on making tasking decisions at regular intervals (for each “sprint”); other methodologies focus on making tasking decisions continuously.

Support. The decisions made during operations take into account several kinds of supporting information. These include:

The project’s budget. This accounts for the money and other resources made available to build the system and how much has been used to date. This can be combined with the estimates in the plan to determine whether the project has enough resources to complete, or to reach intermediate milestones.
Risk. This is a list of potential problems that could affect the project’s execution. These risks are things like “part X may not be delivered on time” or “regulator Y may object to a design decision”. Good management practice keeps track of these, checks in on them regularly, and works to either eliminate the risk or mitigate the outcome.
Uncertainty. This is a list of the technical uncertainties in the system—things like “are batteries of energy density X available” or “the control algorithm that maintains output within range Y is not yet determined”. These uncertainties can be addressed by including work in the plan to investigate the questions. Finding answers will allow the plan to become a more accurate estimate of the path forward, as well as leading to choices in the system design.

Sidebar: Resource-constrained projects

Traditional project planning approaches grew out of projects, such as building construction, that focus first on time and budget. This kind of project treats the completion date as the driving factor in organizing work, and assumes that in general as many workers can be brought in as are needed to complete the work quickly, and that parallelism between tasks is limited primarily by dependencies between tasks. For example, in building a house, one contractor typically brings in a team to frame the structure, while another brings in a team to add the electrical wiring or plumbing into the structure. Each of these teams can bring in as many people as needed to get the work done, and then those people go on to another construction project elsewhere when their part is done.

This model of project planning leads to tools organized around a graph of dependencies between tasks. These tools usually provide analyses like critical path analysis, which shows the longest path through the graph of tasks and therefore the hardest constraint on how quickly the work can be completed. Planning the project well often hinges on understanding the dependencies between tasks and the critical path through them.

Most complex technical system projects, on the other hand, do not fit this model well. Each person working on the project needs to understand the context of their work, and there is usually a substantial cost to add someone to the project—largely in them learning about how the project works and how the system is organized. The collection of trained people on the team constitutes a valuable resource that the organization tries to keep around to maintain the system or to work on similar systems.

This approach leads to a different approach to planning work. While dependencies are certainly important, there are often many tasks that any one person can work on (and it is common to expect some degree of multitasking). In this case, getting the order of operations precisely right is not as important. It is more important to ensure that everyone can stay busy and that any major dependencies are accounted for.

6.4 Using this model

This chapter has presented a model for thinking about the work involved in making a system. This model, in itself, does not prescribe any particular way of managing building a system; it only names the topics that need to be addressed and provides some objectives by which an approach can be judged.

In Part IV, I go into more depth about each of the elements in this model.

Those who manage a project will need to decide how they will go about organizing their work. As I noted earlier, how a project is organized and run is itself a system, and the techniques discussed in this book apply as much to designing and operating the project’s operations as they do to designing, building, and operating the system product. Chapter 5 and Part III discuss the model for what a system is.

Part V discusses how the work of building a system can be organized around the life cycle of a project. Chapter 18 introduces the idea of a life cycle. It also introduces the idea that a life cycle model provides a basis for working out the tasks that need to be done to build the system. Subsequent chapters discuss what each of the phases of a life cycle, along with the artifacts and activities that go into each one.

Part VIII discusses ways to organize the team that will do the work.

Part IX presents approaches for planning and organizing the tasks that need to be done.

Chapter 7: A well-functioning project

2 April 2024

I have been a part of many projects. These projects built a wide range of systems, including specialized small business record keeping, local government IT applications, low-level graphical user interface tools, large storage systems, spacecraft systems, and ground transportation.

Some of these projects went well. They produced systems that were useful for their customers. The systems held up over many years of use, working correctly and supporting their users in ways they needed. The projects proceeded (fairly) smoothly: no major unexpected flaws, teams that worked together well, completion within close to the expected time and resources.

Paraphrasing Tolstoy, all well-functioning projects are alike; each project that has problems has problems in its own way. [Tolstoy23] Though there are several ways for a project to go well, there are far more ways they can go wrong—and it takes deliberate effort to keep a project on the path that goes well.

I have watched many of these projects struggle through problem after problem, most of them self-inflicted. The causes have included poor team organization, lack of a coherent system design, lack of taking the time to think through designs, lack of design, internal organization politics, and many others. The struggles led to canceled projects, startups that had to get extra funding rounds and missed their market opportunity, and unsafe systems being used in public spaces—often consequences not just for the people building the system but for their funders and for society at large.

This book was inspired by observing these problems and finding ways to do better the next time.

So what does a project need to do to function well? To develop a useful, safe system, on a reasonable schedule and budget? To keep its team functioning at a sustainable pace, without internal disruptions? The rest of this book seeks to provide some answers.

My general principles fall into four categories:

The project or organization leadership;
The tasks for building the system;
The plans for building it; and
The team that builds it.

For each of these, I will list some principles I have found important to making a project that runs well, or to keep it running well.

7.1 Project leadership

I have watched many projects, especially in startup companies, try to create a team of the best specialists: executives who are skilled at fundraising and external relations; an HR person who has a track record at recruiting; someone with marketing skills and connections, and a few engineers who can build the key technical parts of the system. Most of the projects that have staffed only with such specialists have either failed or had serious problems with execution.

These projects had a gap at the center of the work. Everyone is responsible for some piece, but there is no one whose responsibility is to link the pieces together: to build either the team or the product as a coherent system. People in the team generally don’t really understand each others’ work. They have trouble finding how to work with each other. The executives don’t understand the work or the team, and issue instructions that don’t make sense. The team makes poor technical decisions because no one understands how the artifacts they are building must work together.

This gap leaves three needs unmet. First, there is communication and translation between the executive team and the engineering team. Second, there is organizing and running the engineering team. And third, there is maintaining a systems view of the team’s technical work.

7.1.1 Principle: Communication and translation

Have at least one person in the organization who can communicate with people in the executive team, marketing, and engineering, and translate among them.

The executive team is, in most organizations today, a collection of specialists in running the company as a whole: corporate activities, finance, legal, public relations, marketing. I have found this to hold equally for independent companies, especially startups, and for projects that are part of larger organizations. The details may differ but the roles are largely similar.

The engineering team is also mostly a collection of specialists in one area or another, according to the needs of the system being built. They will understand parts of the system, but few of them are tasked with making all the parts cohere so they work together. Most of the engineering team will have been contributing by having specific, deep skills.

The communication need is to represent these parties to each other. The executive team is responsible for setting the overall direction for the project. The engineering team needs this direction translated into actionable directions. The executive team also must be responsible for high-level safety and security decisions (e.g. what kinds of safety hazards the company will address in its system products). The executive team has the responsibility for these decisions, and those then need to be translated into the safety and security engineering processes. In the other direction, the engineering team needs to provide feedback to the executive and marketing teams on the feasibility and cost of different possible feature or market decisions the executive team could make.[1] The project management part of the engineering team also has the information about how work is progressing and can provide information about the time and people needed to reach different milestones.

7.1.2 Principle: Provide staff to run the engineering team’s operations

Designate at least one person to oversee how the team building the system operates. This person (or people) organize the team, and adjust how it operates as the team grows and the work progresses.

An organization is a system, and a team of more than a handful of people will not self-organize in a useful way. I will argue below that this system needs careful design to work well.

I consulted with a small startup that did not have someone responsible for organizing the engineering team. The startup had begun as a very few people, who were figuring out the basics of what their company could build. The co-founders did not create an organization below the executive level; instead, they expected that they could all just work together and figure it out. And, predictably, they did not figure it out once they added a few more people to the team and had to specialize.

Johnson [Johnson22] discusses how to organize a growing company, and I recommend her work to the reader. She presents many ideas about what to do to organize a company’s operations. While that book focuses more on the human-oriented parts of operations, such as hiring and performance evaluation, the ideas it presents provide a solid foundation for parts specifically about engineering, such as how to organize design and implementation verification (which are as much a human activity as a technical one).

An organization that is going to successfully build a complex system will need to designate someone as having the primary responsibility for creating and maintaining the team’s structure and patterns of behavior. Either that, or they need to get improbably lucky.

7.1.3 Principle: Systems view of the system

A team building a complex system must have at least one person who is responsible for the system as a whole, not just its parts.

A coherent, working system does not occur by chance. It requires deliberate effort for a collection of parts to work together, and for the collection to fulfill the purpose for a system.

This deliberate effort can be achieved, theoretically, by a group of uncoordinated specialists. However, this amounts to the Infinite Monkey Theorem, where enough workers and enough time can produce any system. For realistic systems, many more times the projected lifetime of the universe might be enough.

In reality, the majority of the engineering team is responsible for parts of the system, not the whole thing. It is not the job of these people to be responsible for the systems view of the whole; nor is it usually their training or experience.

Building a system requires coordination so that the parts work together. This can be achieved by designating one or a few people to be responsible for the coordination, or by having the parts-builders work by consensus. Work by consensus requires skills and time that few people have, unless the team has no more than perhaps five or six members.

Building a coherent system also requires having a way to measure coherence and satisfaction of system purpose. If a team is to work by consensus, all members of the team must have a consistent understanding of these criteria. If a smaller group is responsible for the system as a whole, then fewer people are required to share this understanding.

The shared understanding starts with the purpose for the system. The definition of the system’s purpose is outside the engineering team’s scope; it comes from the customer or their proxies by way of marketing roles (Section 5.2 and Chapter 8). The translation of information about customer needs into an actionable system purpose is the responsibility of a system role. This includes documenting the system purpose, developing a concept of the system, and writing down top-level system specifications. In doing so, the role works with the executive and marketing teams to confirm that the purpose and concept as developed match what the customer and organization actually intend.

The systems role is responsible for ensuring that the component parts of the system fit together into a coherent system. To meet this responsibility, the systems role is responsible for the design of the high-level decomposition of the system into parts, and how those parts are related—the functional and non-functional relationships (Section 5.4 and Chapter 10). While the systems role delegates the work to design and build the components, the role does check that the results match the specification of how the components interact. The systems role also guides the order of work, especially for how to plan integration.

7.1.4 Principle: The team is a system

A well-performing team is deliberately designed to have a structure that gives each member incentives and support to work together. The team’s leadership establishes the design, and monitors the team’s function to adapt the team structure when needed.

An effective team does not happen by accident. When a team is not given a structure and rules about how to work together, they will find ways to work. They will build up habits in response to a few specific early needs—and those habits will not make for a team that communicates well, cooperates well, or makes good systems decisions.

When medium to large teams try to self-organize, they react to problems they face immediately, and each person determines their response based on their own values and self-interest. The team members are not trained or incentivized to plan the team’s organization for future needs; instead, they find ways to work through individual problems as they come up. The team members in general do not have a view of the entire effort that will be needed to build the system, and so they find solutions based on their specific needs.

Team work exhibits variations of collective action problems. [Olson65] These problems occur when a group must work together; each member of the group must contribute in some way, and in return everyone in the group receives some benefit. The optimal strategy for an individual is often at odds with the optimal strategy for the common good. Many commonly-known cooperation problems, such as the tragedy of the commons or the prisoner’s dilemma, are kinds of collective action problems. (In fact an engineering team represents a particularly complex kind of collective action problem, because the contributions of different group members can combine non-monotonically: the value of one person’s contribution to the common good can be negated by another’s contribution.)

In other words, the natural tendency for a group is to form an organization that is reactive to immediate needs and to individual objectives, rather than the long-term objectives of the project as a whole.

Creating an effective team is, therefore, a deliberate act. It involves working out what the team needs to do as a whole, and then designing a structure for the team. That structure should address:

Who is responsible for what actions (designing some component, testing something);
How communications should flow in the team (both communication within the team as a whole and within sub-groups to coordinate building specific parts of the system);
Who makes what decisions, and how those decisions are made; and
How to identify and respond to exceptional conditions (problems in the team, technical problems).

Maintaining a team’s effectiveness is also a deliberate act: good project leadership monitors how the team is doing and adapts organization or processes when needed. The team organization, or its processes, or its role assignments may work well for a while, but not fit the team’s needs as well later. The project’s leadership may set up a team organization or process and then find it doesn’t work as well as expected.

The organization of a team can be evaluated against the objectives in Section 6.3.3: how well people know how they fit into the organization and how that affects the actions they take.

I discuss matters of designing a team in Chapter 16 and in Part VIII.

7.1.5 Principle: Team habits

A team with good habits and culture can get work done. A team with poor habits will not, except by unlikely random chance.

Whether a team follows procedures and processes also depends on whether following them is the norm for the team.

Teams follow habits. Establishing the habit at the beginning of a project is not difficult. Changing their habit later is quite hard and rarely successful. The leadership of a team has one opportunity to set up a team to follow a process without undue effort. When they squander that opportunity, a project has difficulty from then on. If people in a team do not have a de jure process to follow, they will work out ways to get things done, and those habits will be the default way they work. Those habits are likely to have been worked out in reaction to a few specific, immediate situations and won’t account for the indirect ways that one piece of work affects another, and thus will not meet the project’s needs well.

It is possible to change a team’s habits after the fact. However, it takes time (a lot of it) and effort. The transition from one way of working to another will take time, as people will follow their habits without thinking until new habits set in. People will need constant reminders and incentives to change their behavior. There will be a period when people are doing a mix of old and new, which can increase chaos for a while (and often creates extra work to clean up the differences). People will feel extra stress and often there will be a decrease in morale or civility in the team until they settle into the new norm.

Most of the projects I have worked on over the years have been about innovation. The people who start such projects do so because they are excited about what they can build, whether about the technical aspects or the market aspects. They are motivated to get moving as quickly as they can. They usually are trying to make a prototype or do a demonstration as soon as they can. They do not have excitement about the work of crafting a team; if they need that, they will get to it later when they have the prototype built, or when they have the next funding round…

This tendency is often exacerbated by the way some funders behave. They reward market opportunity and technical originality, which incentivizes a team to build the market case and technology demonstrations as quickly as possible. Funders rarely reward or even evaluate whether the project leadership has capability to form a well-functioning team. When a team’s ability to execute effectively and efficiently is not valued by the funder, they will not put the effort into crafting the team.

A project’s leadership must incentivize and model following processes in order to build a team’s habits. I am aware of a company that set out anti-corruption processes, including ethical standards and a hot line for reporting violations. The leadership did not, however, make it clear to the employees how these would be acted on, and there was no demonstration of the standards being enforced. The employees correctly realized that the leadership was not serious about enforcing the standards, and it led to significant internal theft.

7.1.6 Principle: Keep it lightweight and actionable

People will use processes that they can figure out how to follow and that clearly give them benefit. Don’t make processes more difficult than what the team can do.

People will generally follow prescribed practices and procedures as long as 1) they know about them; 2) understand them well enough to perform them; and 3) the practices have high value relative to the effort required.

The first aspect implies that processes and procedures are documented and organized in a way that team members can find them. This also implies that when people join the team, they are taught how to find and use them.

A practice or procedure must be both clearly written and actionable for people to understand it and use it. I have encountered “plans” or “procedures” on multiple projects that amounted to a list of aspirations, rather than a specific set of actions that someone could follow. In one example, a security incident response procedure said things like “we will contact the responsible parties”, without naming who the responsible parties are (or even better, listing them with contact information). Had there been an actual incident, vague statements like this one would have led to time spent figuring out who the responsible parties were, and likely coming up with a wrong answer when under the time pressure of trying to resolve a critical incident.

A process or procedure that requires too much time or effort will lead people to try to create workarounds, usually subverting the reason that a procedure was established. This is the problem of a procedure that people perceive as too “heavy”. Keeping procedures as simple as possible will help. At the same time, some work is simply complicated, perhaps needing several people involved because it affects all of them. When some work is necessarily complex, it is vital to clearly document the process so that everyone involved understands both their own role and what the others involved will be doing.

I will discuss these topics more in Chapter 17, and especially in Section 17.2.1.

7.2 System-building tasks

Most engineers understand the need to use good technical judgment as they build a part of a system, but it is just as important to follow good practices in how the team approaches the work.

7.2.1 Principle: Start with a purpose before doing work

Understand why something is being built—its purpose—before trying to design and build it.

This is one of the most important principles in this book, and it applies in a great many ways.

“Purpose” here means the objectives for some work, the need that is to be met by doing the work or the reasons that it is worthwhile to spend the time and resources involved.

If someone starts designing or building something without understanding the purpose of the work, it is unlikely that what they build will actually meet the need that caused them to start the effort. And even if they do meet the need, perhaps by focusing on the purpose part way through the work, they are likely to have spent time and resources in false directions.

When someone takes on a task, whether to build part of the system or to oversee team operations, it is that person’s responsibility to ensure that they accurately understand the purpose of the work. Ideally they will be told the purpose as part of the task, but the person is still responsible for confirming that they correctly understand the purpose. I have found that taking explicit steps to confirm understanding saves time and effort, even for small tasks.

At the same time, the person who defines a task is responsible for ensuring that there is a clear purpose to the work and communicating that purpose to whoever takes on the task. In other words, the purpose for work is involved in a communicative action.

This principle applies to building a whole system. As I discussed in Section 5.2, a system needs a purpose—a customer need, for example—that it will fulfill. This purpose originates with the customer, or whoever will use the system and the value that the system will provide them.

The principle also applies to building components of a system. Each component (Section 9.2) has some role in the system: functions, behaviors, or properties that it should have that contribute to the system as a whole meeting its high-level purpose.

Other work also should have purpose. Organizing the team, or maintaining the project plan, or reviewing a component design are all tasks that have purposes. Someone doing these tasks should understand why the organization or review is being done, and they should ensure that how they do the work addresses that purpose even if associated procedures don’t spell out every step involved.

I argue in an upcoming principle that successful projects perform checks to ensure that the work that is done correctly fulfills its purpose. Without a clearly-defined purpose, it isn’t possible to determine whether a design or implementation or plan is correct or accurate.

I discuss how purpose fits into a system-building project throughout the rest of this book. I address the purpose for a system in Chapter 8. Each chapter in Part IV, on how to make a system, discusses the purpose of steps in building a system. As I present more specific topics, such as specifications (Chapter 21) or designs (Chapter 23), I present the purpose for that aspect of system-building before talking about what it is or how it works.

7.2.2 Principle: Evaluate tools before adopting them

Investigate whether tools, procedures, methodologies, designs, or implementations fit the project’s purpose before adopting them.

Every complex system is different from others in some way. The differences may be technical, such as how some component must behave, or they may be operational, such as the kind of team, the organization hosting the project, or the customer’s needs.

Differences mean that things taken off the shelf may or may not address the project’s need. An off-the-shelf electronics board might be a good fit, or it might not be available within the time needed, or it may lack a key security feature, or it may have reliability features that the project’s design does not need (but that do not interfere with how the board will be used). Similarly, a development methodology might address the project’s need for moving quickly and being flexible, but it might not work for a project’s distributed team.

In many cases the off-the-shelf methodology or design can be used in many different ways. The team may need to make choices about which of those ways are helpful for this specific project. The team may need to adapt procedures or methodologies for the procedure to fit what this project needs.

A well-functioning project will evaluate something that can be adopted, whether it is a component design or a procedure or a tool, against what the project needs that thing to be. Something that might be adopted can be measured in terms of the benefits of using it, the costs of adapting it to meet the project’s needs, and the costs of using the thing without adapting it. If the benefit outweighs the costs, then the thing can be used. If the thing does not quite meet the project’s need but can be adapted, then an investigation will reveal how to adapt it.

Sometimes a project will be obliged to adopt a process or use a component that is not a good fit. In that case the thing should be evaluated so that the team has a clear-eyed understanding of what problems could arise, and they can work out mitigations to avoid the worst problems.

This principle has a serious risk: that it will become an excuse for the Not Invented Here syndrome. No projects have the time or resources to invent everything from scratch—especially when reinventions often lose sight of the experience that has gone into building existing procedures or components. A team has to balance using tools that are pretty good but not perfect against the cost of inventing from scratch.

The idea of satisficing applies. This is when one applies a solution that is good enough to satisfy a need, without attempting to find a perfect solution. Writing of adapting buildings:

The solutions are inelegant, incomplete, impermanent, inexpensive, and just barely good enough to work. The technical term for it, which arose from decision theory a few decades back, is “satisficing”. It is precisely how evolution and adaptation operate in nature.

Even after generations of satisficing, the result is never optimal or final. […] The advantage of ad hoc, make-do solutions is that they are such a modest investment, they make it easy to improve further or tweak back a bit. [Brand94, page 165]

7.2.3 Principle: Follow the spirit, not just the letter

When a project has adopted a procedure or tool, that procedure or tool has a purpose. When using it, keep the purpose in mind and make sure that purpose is met—not just following a procedure or using a tool blindly.

A well-functioning project does not adopt its procedures or methodologies on whims; it addresses them to purposes. In organizations like NASA, the procedure standards represent several decades of accumulated experience. While the procedure may not be written to make the purpose and experience clear, these reasons exist behind what has been written.

I worked on a NASA project that reached its Preliminary Design Review (PDR) milestone. The team followed the long NASA checklist for what should be presented at that review. Unfortunately, the team did not keep in mind what the PDR was actually for: ensuring that the early, conceptual design coheres as a system and showing that the system is ready to proceed to steps that will involve greater investment. Instead they developed material that checked each box on the agenda, without addressing the system as a whole. The reviewers could tell that the design did not make sense; moreover, the review failed to reveal the actual problems that the design had.

A team should document the reasons or purposes for which they adopt a procedure or a tool. Similarly, each person on a team should put effort into understanding why the team has adopted procedures and tools.

7.2.4 Principle: Document things so there is a future

Document both how things work and why they work so that people can understand the system when they work with it in the future.

It is easy to want to design or implement at full speed, keeping focused on the immediate goal: getting the thing built.

That goal misses the larger purpose of building something—that the built thing meets its purpose and specification, and that it continues to do so as the system evolves.

In practice, the initial design and implementation of a component involves much less effort than is spent on checking that implementation, integrating it with other components, fixing bugs, and making changes later. A project that is building a system to succeed in the long term optimizes for all these other tasks, not just the initial design or implementation.

All these later tasks involve understanding specification, design, or implementation of a component. Understanding means not just being able to see the design or implementation artifact, but also knowing why the component is what it is. This includes documenting the rationales that led to significant decisions about the component. It also includes providing people a guide to understanding the component’s design or implementation, especially if there are subtle aspects to the component that are easy to miss if one is looking just at a design document or an implementation.

When someone is the code for some component and asked to change some behavior, and that person isn’t the one who initially implemented that component (or they are the same person, but it was a while ago), they begin by building up a mental model of how the component works. Once they have that mental model, they can proceed to work out how to change it. They will think of different ways they could make changes, and evaluate them to see if the changes will have the effect they intend and that the changes will not have some other undesired effect.

Building up an accurate mental model involves working out constraints that led to the component’s design, major decisions about how the component is structured, and how different parts of the component work together to achieve its functions. This information is not encoded directly in software source code or mechanical drawings or circuit designs; all those things are the products of a process that works through all those other things on the way to producing the design or implementation artifact.

The person who is tasked with changing a component, and then building up a model of how that component works, can get information two ways: from documentation or by reverse-engineering it from the implementation artifacts. In practice it is usually best to do both. A circuit design is the truth about how an electrical component works, and so this is the most accurate way to learn about the implementation. However, a circuit design or software source code leaves out the rationale for why the design is the way it is. Having documentation about the design, about why the design is the way it is, and a guide to the implementation will help the person understand the component more accurately and more quickly.

Of course, having documentation only helps if that documentation is accurate. If the documentation doesn’t match how the component was actually implemented, then the documentation will lead someone astray when they try to learn how a component works.

There has been a saying in agile software development that “the code should be documentation”. This is usually interpreted as “the code should be the only documentation”, which is not what the people who developed agile methodologies intended.[2] The point in the agile methodology is that software code is necessarily documentation, and it should be written so that it is clear and readable so that others can read and understand the code.

I have experienced both the advantages of having good documentation and the disadvantages of having no or inaccurate documentation. Many years ago, I developed a multithreading package for a research system. That package included a peculiar thread-synchronization primitive tuned for that specific application; correct implementation depended on some unobvious code in one place. It took some time to analyze the design to identify that condition, and if I had not written it down I would not have remembered it correctly when I had to modify the package a year or two later. On the other side, on a personal project I was developing a responsive, single-page web application and developed a combination of JavaScript code running in the browser and Ruby code running in the server to achieve it. I did not document the design, and when I needed to improve it after a couple of years I had to reconstruct the design. I spent much more time than I would have liked on that reconstruction.

7.2.5 Principle: Build in checks

Make independent checks of all critical specifications, designs, and implementations a normal and expected part of project work. Define in advance who will do the checks and when they will do them.

Having one person check another’s work is a basic mechanism for maintaining quality, safety, and security in a system. It applies equally to technical work, such as verifying that a design matches specifications, and to project operations, such as checking that a procedure is working as intended or that team communication is flowing.

Note that this does not mean that developers can avoid writing unit tests or performing design analyses. They should be doing those, and independent checks should be done as well.

There are many advantages to performing reviews or checks:

Reviews catch errors, especially when someone with a different perspective can see something the person who did the work missed;
Checks in processes or organizations can identify team problems early, when they can be corrected more easily than if a problem were allowed to fester;
Reviews and checks provide a way to build shared operational and technical culture; and
Reviews provide an opportunity for people to learn from each other.

There are two significant disadvantages that can lead to a team skipping checks. First, checks take time and effort. When a team is pressed for time or short handed, it’s easy to let a check go by. Second, done poorly a review can feel like a lack of trust or like an attack on someone’s work.

Nonetheless, checks and reviews are important enough that a well-functioning project will find ways for checks to happen.

Having checks be a built-in norm for the team helps address the disadvantages. If everyone knows that checks are going to happen, the time and effort involved will be planned for. People will notice if checks are being skipped, and will ask why—helping to ensure that the checks actually do happen. Separately, when everyone’s work is checked, it becomes easier to convey the sense that no one is being singled out or is not trusted.

I discuss how checking can be built into a project’s life cycle patterns in Chapter 17.

7.2.6 Principle: Work against cognitive biases

Take deliberate, ongoing actions to avoid the negative effects of cognitive biases, such as confirmation bias or team echo chambers, and missing or incorrect information.

The work of building a system involves making many complex decisions. These decisions are based on the information that the person making the decision has, along with their skills, experience, and biases.

Incorrect decisions can be made when people work with beliefs or biases that are inaccurate. This leads to concepts or specifications that reflect the errors, and from there to designs and implementation that do not meet system needs. There are many terms for these various situations, including confirmation bias, echo chambers, or recency bias.

These errors arise from many different causes:

When someone has incorrect beliefs or biases, and follow those beliefs without question. For example, when a systems engineer has a belief about customer needs that is not a match to the actual customer needs.
When a team develops inaccurate shared beliefs. This can occur when team discussions progressively reinforce some information: one person raises an offhand question, someone else starts pondering that question, and as discussions continue the question becomes a fact the team accepts.
Incentives to pursue only some lines of investigation, leading to biased evidence for decision-making. I have observed this happen when someone had previous good experience with a particular design approach, and limited their investigation to similar approaches.
Missing information leading to biased decisions. For example, when a decision is to be made about how to re-design some component to add a new capability, and part of the rationale for how the original design was determined is missing. This can lead the re-designer only to consider the aspects of the rationale that were recorded and not consider the missing aspects. (See Leveson [Leveson00, §3.2, p. 17] for discussion.)
Continuing belief in information that has been found to be false, or decided against. This can occur, for example, when a project was considering multiple design options and has chosen one of them, but not everyone on the team understands or accepts that the decision has been made. This can also occur when a customer changes the system requirements and not everyone realizes that the requirements have changed.
The Dunning-Kruger effect [Kruger00], when someone is not aware of the limitations of what they know and proceed with false confidence to make design decisions.

These biases can lead to serious system flaws when incorrect decisions are made about high-level system design or safety and security functions.

There is no one method that will eliminate these problems. Indeed, many of these problems are a necessary flip side to cognitive behaviors that have positive outcomes, such as group agreement and pruning a search space when making decisions.

A well-functioning team takes deliberate and ongoing steps to reduce the problems that come from cognitive bias. These address the problems from two directions: prevention, by making complete information available, and reducing occurrence, by building into the project’s procedures methods to avoid or catch problems.

A project can reduce the chances of cognitive bias issues by maintaining complete written records of key information. Information about customer needs (and how those were determined) and rationales for design decisions are most important. Completeness in designs and verification records also helps. Sharing information that changes widely as well as documenting it in writing helps avoid team members working from outdated assumptions.

Reducing occurrences of erroneous bias involves finding ways to see around the bias into information that would have been ignored or dismissed. This almost always comes from finding a way to get perspective that sees a problem from a different perspective. Training team members to take deliberate steps that will try to falsify their hypotheses gives each team member their own improved perspective. Building in reviews where decision rationales are explained to people with different perspectives helps catch biased decisions before they cause errors. Designating someone to be a devil’s advocate in discussions about complex decisions makes it clear that the team is taking the possibility of bias seriously.

Continuous training for team members in their own disciplines and in related ones improves their skills, in addition to what they learn by experience. Greater knowledge and skills helps combat the kinds of cognitive bias related to the Dunning-Kruger effect. Training in related but different subjects improves their open-mindedness, giving the team members new perspectives to use in thinking through decisions.

Project leadership has an important role in avoiding problems that arise from bias. Good leadership models behaviors where the leader explicitly looks for falsifying evidence and alternative perspectives. The leadership has the ability to allocate effort to investigating decision alternatives and being the devil’s advocate in discussions. The leadership sets expectations for the rest of the team by inspecting decision rationales to ensure that steps have been taken to address possible biases.

7.3 Plan for building the system

Complex systems, with dense graphs of relationships between their parts, cannot be built without a plan. A project cannot get such a system built by following a random walk through the space of possible tasks. However, plans have often been over-done, trying to lay out a definite schedule where in fact there are unknowns and then having scheduling crises when something runs long or over budget. A middle ground that remains honest about what is known and what is not, that allows flexibility as the project moves forward, and that also guides the work in a consistent direction works better.

7.3.1 Principle: Prioritize work by risk or uncertainty

Put effort into work that carries risk or uncertainty as early as possible.

Common project management practices advocate paying attention to the critical path: the set of tasks that must be completed on time in order for the project as a whole to complete on time. If any one of these critical tasks runs late, the project as a whole will be late. Each task has some measure of slack, the amount that it can start early or run late without delaying the end of the project. If a task has no slack, meaning it must start and finish on time, it is part of the critical path. Most projects have at least one sequence of critical tasks from the start (or from the present) to the end of the project.

This definition of critical path is useful but overly simplistic. It is useful because it gives a way to identify work that can put the project at risk, and once identified that work can get extra attention to make sure it goes as planned. The definition is simplistic because, at least in the basic formulation, it assumes that the graph of tasks and the duration and dependencies of each task are all known.

The critical path method is a special case of the general principle of using risk and uncertainty to inform project planning. In general, what work could lead to the project being delayed, or to the project failing?

There are at least four kinds of risk or uncertainty to consider.

First, there is the risk that some external event will affect the project. A customer might change their needs. Regulation might change, affecting how the system must be designed. A supplier might go out of business and thus not deliver components. Weather might delay an essential testing operation. Some geopolitical event might happen that changes the ability to manufacture an essential part.

Second, there is uncertainty about how to build part of the system. At the beginning of a project, there is neither concept nor design for the system and so the time required to build it is uncertain. As the design begins to develop, there will be some parts of the system that have low technical risk because they involve well-understood problems, such as wheels for a road vehicle. There will be other parts that cannot be built using available designs, such as a spacecraft that needs low-mass, low-power radio subsystem that can communicate with another spacecraft. If the team can find or develop an appropriate radio, then the project can move forward—but if it can’t be, then the system design or the mission will require significant re-work. It may not even be possible to meet the customer needs within the time and budget they require.

Third, there is uncertainty about the time and effort required to build something. There may be a likely technical solution for some component, but the difficulty of constructing it may have hidden surprises. The time needed for a supplier to provide a purchased component might not be known until a contract is signed with them. The complexity of testing the integration of certain components and fixing bugs might not be understood.

Finally, there is schedule risk from a “long lead” task or sequence of tasks that will take a long time to complete.

A well-functioning project searches out risks and uncertainties like these and puts attention and effort on them. Deliberately spending effort addressing technical and schedule risks early in a project means that potential problems are addressed when it is cheapest to handle them. Consider finding out halfway through a project that there simply is no component available to fill some need. Addressing this might require a redesign of much of the system—but much effort has already been spent building parts of the system that now must be discarded. This is a waste of resources; more seriously, it presents a problem that all too often leads project management to decide to fudge the solution and build a system that does not work as needed.

This principle requires dedication to examining the state of the project thoroughly and without bias.

7.3.2 Principle: Prioritize integration

Integrate components as early as possible. When possible, integrate mockups or skeleton components before building out the component details.

There is common wisdom that the cost of fixing an error in a complex system generally increases over time, up to the release into production. While the hard evidence for this is lacking, I find general acceptance that this occurs, though with plenty of exceptions.[3] The idea of increasing cost over time has led to methods that successfully catch errors early, including concept, requirement, and design reviews, test-driven development, and automated checking tools.

Studies such as those reported by Leveson [Leveson11, Sections 2.1 and 2.5] suggest that the greatest cause of system failures now comes from design errors related to the interaction of separate components: the robustness of individual components is not the problem, but instead how components work together. This appears to be the case even with requirement and design reviews, which certainly catch many errors before they are implemented.

I have found two methods help reduce integration-related errors.

The first method is to use semi-formal, top-down design analysis methods in conjunction with design reviews. I recommend the STPA method that Leveson presents. [Leveson11] The Mars Polar Lander loss review called out the lack of such analyses as a significant contributor to the loss of the spacecraft. [JPL00, Section 5.2.1.1, p. 16]

The other method is to organize development around integration, so that the component interactions can be tested (not just analyzed) as soon as possible. This principle means focusing on how components will work together before implementing fully detailed components. This leads to building the system in increments, starting with a collection of stub or skeleton components that implement a few parts of the component behaviors and integrating them together into a partial system with limited capabilities. This partial system is then tested, with an emphasis on seeing if the interactions work correctly. Once problems with the integration are sorted out, another tranche of functionality can be added and tested. Along the way, one always has a partial system that runs.

Integration first has two benefits. First, if the component interactions do not work well, multiple components will be affected by a redesign. Detecting the problem before investing in all the details of the components means less re-work. Second, it is usually easier to test interactions with mockup or skeleton components than with “real” components. One can instrument the mockups to observe detailed states that are harder to observe in a complete implementation. One can also add fault injection points to make it easier to create off-nominal test scenarios.

This principle is not one to apply blindly, however. The purpose of integration-first development is to address uncertainty or risk that comes from potential component interaction problems. Some components may have their own internal technical risks, and sometimes it is more important to sort out that risk before addressing component interaction risks. Of course, the ideal would be to address both in parallel.

7.3.3 Principle: Have a long-term plan

Maintain a plan for how to get from the present to a completed system. Detail out the near future; have a concrete but less detailed plan in the medium term; and have a general approach beyond that. Evolve the plan as understanding about the work changes.

Consider the task of planning a route for walking from one place to another. If one has a map of roads or trails connecting the locations, one can search out a path by using a standard shortest-path graph algorithm, which evaluates various parts of paths in an orderly way until it finds a “best” path.

This is analogous to building a system with few unknowns. One can start by designing the system on paper and checking it out. This approach is a low-risk way to build a system, as long as one can be sure that all of the components can be built as designed and that their integration into a system will work as planned. This situation applies when building a system that has strong similarity to other systems, so that there is an existing body of knowledge about what works. This is the basis for repeatable engineering methods, as evaluated by standards such as CMMI. [CMMI] It is also the situation that led to the waterfall system development methodology.

What if there is no map? What if the terrain in between is unknown, and the distance is far enough that one can’t do something like climb a hill and look?

Most projects that are working to build an innovative complex system have a situation like this. At the beginning, there is no obvious path to follow to get to the desired system; indeed, there may not be any path that gets there if the desired system is not feasible.

The team working on the project needs a plan that will guide their work, giving it a general direction for the long term, some concrete plan for the medium term, and details in the short term. As the work progresses, some of the medium-term work will turn into specific, detailed tasks. Some of the tasks will provide information that fleshes out parts of the general, long-term work into more concrete medium-term work. Sometimes bug reports or change requests create new short-term tasks that change the medium- and long-term parts of the plan.

A plan like this benefits the team. It helps ensure that people get all the tasks done, without some getting missed. It conveys decisions about how work is prioritized, which helps the team work independently. It gives a basis for measuring progress and predicting whether milestones will be reached on time.

The act of maintaining the plan provides the opportunity to think about priorities (such as those in the previous principles) and the dependencies between parts of the work.

A flexible, evolving plan strikes a middle ground between a fixed schedule and a purely reactive tasking approach. A fixed schedule, of the kind often associated with the waterfall development methodology, often either becomes a fiction after a few weeks when unknowns intrude onto the planned perfection, or the schedule becomes flexible and takes effort to maintain without a discipline to doing the maintenance. A purely reactive approach, which can be seen as in Agile methodologies taken to an extreme, has the risk of the team wandering around chasing whatever immediate priority comes along, and then having execution difficulties when some work requires more planning than one sprint’s duration.

Of course real projects rarely take either extreme approach; in practice real projects adjust schedules over time. Having a discipline for maintaining a plan from the beginning helps the evolution proceed smoothly.

7.3.4 Principle: Set up intermediate internal milestones

Define regular internal milestones for showing a part of the system working in an integrated way.

Internal milestones that demonstrate some system function give the team a focus for their work in the medium term.

Each milestone demonstrates a set of system capabilities working, especially if those capabilities involve integrating functionality in multiple components. The milestones include a demonstration of the new capability working, in order to prove that the system is working and to give the team a concrete success to celebrate. Internal milestones like these put the team’s focus on a part of the system, leading to capabilities that are integrated together early. (This approach supports the principle of prioritizing integration, above.)

The functionality for each milestone should represent some significant amount of work. I have scheduled such milestones about two or three months apart. If a project is using Agile-style sprints, the milestone should include the effort from several sprints.

I have often focused these milestones on some high-level system function or on some pathway through the system. In the software effort on one multi-spacecraft project, the first milestone demonstrated that the basic software and communication frameworks functioned in a testing environment. The next milestone showed simple control loops in the flight software working; the milestone after that, collective guidance for the collection of spacecraft. Each milestone built on the work of the ones before it.

Of course, not all of the team need be involved in one of these milestones. Part of the team may be working in parallel on other functions. In the multi-spacecraft system example, other parts of the team were working on spacecraft hardware design, mission design, ground systems, and so on.

There is a risk in this approach: that the team takes too narrow a focus and fails to account for the larger system. Any focused effort, whether for an internal milestone or for something else, must be balanced by consideration of the whole system. In the project above, the systems engineering team kept working in parallel to the software teams in order to ensure that the software designs continued to meet mission needs and would integrate properly with the spacecraft hardware and ground systems.

7.3.5 Principle: Use prototyping safely

Use prototyping to validate a concept or determine if an approach is technically feasible. Never let a prototype escape and become treated as a part of the real system.

Building a prototype of a component or a part of the system is an excellent way to learn about how the component or part can be built, and how it will work. It is also a good way to check that a potential design will meet its needs.

Building a prototype is also one of the more dangerous activities that a team can do while building a system. The risk is that a prototype will appear to function in the way needed and will be treated as if it is an initial version of the “real” component, even though it is not.

A prototype has value when it can be developed quickly, at lower cost than its “real” counterpart. Taking shortcuts, implementing only some parts of functionality, not performing much verification—these are all positive approaches to building a prototype and negative for building a component to be used in the final, deployed system.

One example of what can happen comes from a colleague. He was tasked with building some sample software code that would show developers how one could construct a particular kind of application on a new operating system product. The sample code was intentionally simple; it illustrated a particular flow of activity that an application would need to do. It was not a full application in itself. He took some shortcuts in non-essential parts of the code, making the primary part of the application robust but (for example) making some helper functions non-reentrant because they were not an essential part of what was being illustrated. Unfortunately, after this code was published as part of a tutorial, people began blindly copying the helper functions—even though the example was labeled as illustrative only. This led to other organizations releasing buggy applications because they took the easiest and fastest route to building their application by just copying the helper functions.

I observed another example in an ambitious autonomous vehicle system. The company in question began development of their vehicle by building prototypes of several key systems, both hardware and software. In doing so they learned a lot about the problems they were trying to solve. The prototyping effort did what it should: it provided information about how the system should be designed as well as a platform for experimenting with algorithms (such as some of the control systems). Unfortunately, the company did not label or treat these artifacts as prototypes; they saw them as early versions of the real system. The prototypes allowed them to demonstrate vehicles that could perform some operations to investors. This led to increasing pressure to get more features implemented, and to correct problems they found with the vehicle operations as soon as possible. The prototypes had never been designed for reliability, safety, or security, and early safety analyses found significant flaws. Interestingly, the company did treat their hardware platforms as prototypes, and built a hardware platform that was designed to meet safety and security requirements to replace the early prototype boards.

These examples point to both the positive and negative sides of prototyping. To the positive, in both examples, developing a simplified version of the system in question helped people understand the problem at hand. The effort to develop the prototype went faster because the effort focused on only the essential element of what needed to be learned, and omitted aspects that would be needed for a production system. On the negative, in both cases the prototypes ended up being treated as production ready. The prototypes, having been built without the rigor needed for correct, safe, or secure function, led to flaws in the system products. These flaws increase the cost of building a working system, and the flaws tend to be discovered late in development when it is far more costly to correct them. (One startup company I worked with had to rebuild a third of its project when they realized how much they were spending to try to patch up the prototype-quality software they had written; they had to go through extra venture funding rounds to get their product released.) end missed

Using prototyping, thus, is a necessary and helpful part of building a complex system, but it must be done with discipline that keeps prototypes separate from the “real” system components.

Some project managers have talked with me about solving this by policy: they will have their team build a prototype but they will ensure that the prototype is not used for production, and they will put building a real component into the schedule. Unfortunately I have then seen this resolve fade away quickly as the project begins to run late or have funding issues or have an important demonstration coming soon. These imperatives have always, in my experience, taken precedence over system correctness and even over the longer-term cost and schedule to build a working system.

Prototypes are used more safely when they cannot be used in the real system. For example, people often construct storyboards or slides of the user interface for an application. These storyboards allow the developers and potential users to explore how the interface will work, but they cannot be made into an executable application. Similarly, building a software prototype using languages or tools that cannot be integrated fully into the production system helps keep that software from being used in production. Using prototype hardware that is similar but perhaps in a different form factor allows a team to see if a hardware design can work without risking the prototype being put into production.[4]

7.3.6 Principle: Analyze for feasibility

Analyze a system concept for feasibility before committing large amounts of resources to it.

I have worked on multiple projects that were, in retrospect, infeasible. Project A was trying to build a collection of cubesats to perform a demonstration of cross-link communication between the spacecraft. No radio or flight computer was available that could achieve communication between spacecraft except for a brief period at the start of the mission. Project B involved designing a commercial system for which no commercial business case existed—the system was fundamentally a public good that would not generate a commercial return on investment. A third Project C depended on multiple competing government contractors voluntarily developing a shared system architecture, when the rational behavior for all the contractors was to focus only on their own work. Yet another, Project D, depended on secure operating system technology that did not yet exist.

In all these cases, large amounts of money and effort were spent before the projects were canceled.

With hindsight, it is clear that the problems with all but one of these projects could have been detected early. In Project A, basic systems engineering could have created a mission concept of operations and modeled whether available radio and computing hardware was up to the task. The incentive for competing contractors in Project C not to collaborate was clear from the beginning, but the management overseeing the project chose to continue anyway. The missing technology in Project D was identified early but the customer insisted on proceeding.

Project B was the exception. It was defined as a two year limited-time exploration of the problem. At the beginning of the project, no one involved knew whether the system was feasible or if there was a business case. Over the course of the project we learned about the nature of the system, including that it produced a public good [5] rather than a private good, and thus it was not a sensible commercial product.

7.4 The team

A project’s people do the work of building the system. The team is itself a system made up of complex parts, and how effectively it works depends on how well it is organized and led. Supporting a team with the structure it needs, and in particular with the communication channels it needs, gives the team a fighting chance of working effectively and working through the difficult problems that will come along.

7.4.1 Principle: Document team structure

Define clear roles and responsibilities for each member of the team. Document and share that information so everyone has an accurate understanding.

As I noted earlier (Section 7.1.4), the team is itself a system. As a system, it has structure—who is on the team, what their roles and authority are, and how people should communicate (Section 6.3.3).

There are many ways projects can structure their teams. The specific choices depend on the nature of the project—the number of people, the range of disciplines involved, whether there is one organization or many.

In a well-functioning project, everyone on the team will have a common understanding of what that structure is. Each person will know who they should communicate with and when. Each person will know what their areas of responsibility and authority are, so that they know when they can make a decision and when they should work with someone else. They also will know who to go to for answers to questions about other parts of the system.

A shared understanding of team structure becomes most important when people find problems to address. If one person finds a problem with the design of a component, they will need to work with the people who are responsible for components sharing functional or non-functional relationships (Section 10.2). If there are interpersonal problems between two team members, the responsibility for escalating problem resolution should be clear.

Clear team structure enables delegation. In a project of more than trivial complexity, the work must be shared among multiple people. Sharing responsibility only works when both parties trust each other: that both will do their part of the work, that both will communicate what should be done and the progress that has been made, and that both will communicate when they find a problem with the planned work. This trust depends on a shared understanding of the rules about responsibility and communication.

7.4.2 Principle: Plan on reorganizing the team as it grows

Adapt the structure of the team as it grows, to reflect the increased coordination needed as the number of interactions increases.

A very small team, of up to around five people, needs little formal structure, because all the people can interact directly with all the others to coordinate the work. A large team needs formal structure, with defined scopes of responsibility and communication paths. In between, the team needs some degree of structure.

As a team grows, it will move gradually from the size where it needs little structure to needing more and more structure. It will reach points where it is outgrowing the structure it has had and needs to change to have a more formal structure. I have observed that teams need to change at around 5, 30, and 70 people.

In a well-functioning project, the leadership monitors the team’s performance to detect when the team is reaching a size where it needs a change in structure.

Some of the signs that a team needs to move to a more formal structure include:

Loss of clarity about who is responsible for what.
Problems with people having the information they need, not being informed when they should, or having trouble finding the right people to ask questions of.
Overloaded managers or leads who are tasked with coordinating work.

7.4.3 Principle: Have shared procedures

Document procedures that everyone on the team will use for important tasks.

Procedures define how people perform certain tasks (Section 6.3.5 and Section 17.4). These procedures should be documented and easy for everyone on the project to find. The team should have a cultural norm of following the procedures—not just the letter of the procedure, but the spirit of it as well.

People working together means one person does part of the work, then another builds on their work. For this to succeed, people need confidence that the work they build on has been done properly. Part of that assurance comes from having shared procedures and having a team norm that everyone is following those procedures.

Some procedures are simple lists of steps or checklists. For example, if a team is using a shared artifact repository like git, everyone needs to follow conventions about how to check in work, maintain branches, and baseline versions (such as by pulling to a main branch). If someone does not follow the procedures, then the state of the repository can become damaged.

Other procedures are more complex. Completing a Preliminary Design Review (PDR) in the NASA lifecycle (Section 19.2.1) means that the project is ready to commit money and resources to begin detailed design and, later, implementation. This is a check on the whole project, not just on the design of one part. Passing the review implies that many project artifacts are completed, at least to a preliminary level: cost and schedule baselines, security and export control plans, orbital collision and debris avoidance plans, specifications to at least three levels, technical concepts, operational concepts, and many others. If the project continues but some of these checks are not true, then the project is likely to have serious problems later. (This was the case on a NASA project I worked on.)

7.4.4 Principle: Define regular communication paths

Document regular times and media for team members to communicate with each other.

The work the team does is interconnected. A decision about one part of the system affects other parts, following the system structure relationships. The decisions are based on information that, in turn, comes in part from the other parts of the system. Others on the team are responsible for ensuring that the project is making progress, including detecting when something is not going as expected.

Regular communication ensures that this information is pushed to the people who need it. A well-functioning team knows when to share information (such as times when decisions are being made), and who to share it with (the people whose work it will affect). Such a team will also avoid pushing information to those who do not need it. This avoids inundating people with useless information and thereby obscuring information they do need.

To achieve this, make sure that the project’s operational procedures include defined points when team members are expected to communicate. This might include times like starting on the design for a component, when changes are proposed for an interface, and when a component’s design or implementation are ready for review and approval.

Other team members need regular communication for other purposes. Status updates provide information to update the project’s plan. Other communication ensures that the team is working well, helps project leadership keep a finger on the team’s productivity and satisfaction, and provides a way for everyone on the team to learn the project’s overall goals. Johnson discusses communication as a foundation for team functioning [Johnson22, Chapter 2] and how communicating feedback is essential for keeping team members working at their best. [Johnson22, Chapter 5]

7.4.5 Principle: Define exceptional communication paths

Define and document clear expectations about when and how someone will raise issues with others. Make this an essential part of the team’s cultural norms.

Delegation and sharing work is essential to a team that is building a complex system, and they are based on mutual trust. One part of that trust comes from each party doing their work well, following the project’s procedures and the team’s norms. The other part is being able to trust that people will communicate when there is a problem. (See Section 16.1 for more on this.)

There are many things that can go wrong. Someone can find an error in a specification or design. They can find that they don’t have the resources or skills to complete some task. People can have disagreements that they cannot resolve. A supplier can be late providing some component.

When these things happen in a well-functioning team, people will communicate—not keep the problem to themselves. The project’s operational procedures should make it clear how to handle some of these cases. For example, when someone finds a design error, they work with the person responsible for the design to find a solution, and they let others doing work that could be affected by the design change know. Ideally, they will ask for feedback from these other people to make proposed changes work for related parts of the system.

Communicating about exceptional situations only works if both the person raising an issue and the recipients can trust that the message will be heard, acted upon, and that all the parties involved will handle the matter respectfully. Much has been written about how to create an environment where this happens—see Johnson [Johnson22], for example—and I will not try to add to what others have written.

7.4.6 Principle: Provide independent resources for checks

Explicitly organize the team so that people have responsibilities for checking others’ work, including through reviews and by doing testing. Manage relationships in the team to keep the checking from being taken personally.

Building checks into the work plan is a principle listed above. The principle of doing checks requires having team members available to do those tasks. Having someone who did not do the design or implementation perform checks improves the odds that they will find a problem because they do not have implicit assumptions/biases of the designer or developer. This implies that a well-functioning team will be staffed to provide for independent checks, and that some team members know they will be responsible for checks.

It is easy to underestimate the effort required for reviews and tests. Doing a meaningful design review takes significant effort, because the reviewers need to actually understand the design—not just look for particular easy-to-find markers that might indicate a problem.

I have heard many opinions about how much of a team’s effort should be allocated to reviews and checks, anywhere from half the effort to a small fraction. My own experience has been that the teams where about one-third of total effort was allocated to reviews and testing had better outcomes than the teams with less effort available. The appropriate fraction of resources is likely dependent on many factors not yet appreciated.

Reviewers and testers can end up having an adversarial relationship with designers and implementers, and so the way reviewing and testing tasks are allocated requires some care. In one organization I worked with that had permanent testing teams separate from developer teams, the developers looked down on the testers and relations between the teams were sometimes difficult. While some tension is useful so that the work remains independent, careful management will monitor the relationships and work to ensure that the interactions between developer and checker do not become personal and that the skills required for both roles are honored.

[1] In one company, I worked closely with the marketing team. This included going to meetings with customers and participating in the back room during focus groups. Working together was fruitful: I could answer technical questions from the marketing team in real time, and I could hear what the customers were saying first hand. At the same time, I got a perspective from the marketing team of how the product we were designing needed to be positioned. This collaboration was one of the best examples I have of how a translation role improves a company’s work.

[2] Martin Fowler states: “[…] I feel the need to rebut a common misunderstanding. Such a principle is not saying that code is the only documentation. Although I’ve often heard this said of Extreme Programming – I’ve never heard the leaders of the Extreme Programming movement say this. Usually there is a need for further documentation to act as a supplement to the code.” [Fowler05]

[3] I have tried to find studies that back up the common claim that the effort required to fix errors increases by an order of magnitude with each phase of development. I found several on-line discussions that said that while this claim seemed to be more or less true, all citations appeared to lead back to one IBM-internal presentation that could not be verified. Many of the discussions ended with the claim that it was too hard to define “error” and “cost” in consistent ways, and so there might never be quantitative evidence.

[4] Spacecraft avionics development often involves building a “flat sat”: a collection of electronics boards laid out on a test bench. Sometimes these boards are development samples from a vendor, or early prototype boards before the team has finalized a design. The flat sat can be used to determine whether the boards can communicate as expected; they can have test probes attached in ways that production hardware might not allow; they can be connected to computer systems that emulate external inputs and outputs. Nobody ever expects the flat sat to get flown, because it’s just a collection of boards and cables; it doesn’t look like the real thing.

[5] A public good is something where the benefit accrues to everyone; it is not possible to exclude anyone, and providing the good to one person does not decrease the good available to others. [Hardin20, Section 2] Fire protection and lighthouses are examples. A private good is one where access to the good can be limited and providing it to one person means it is not available to another. Owned goods are an example.

Part III: Systems

In this part, I discuss the model presented in Chapter 5 that structures how one can think about the content of a system. This includes

The system’s purpose;
The components that make up the system;
How structured relations between components lead to emergent properties;
How subset views of the system can help understand the system’s content; and
How evidence can show whether the system does or does not meet its purpose.

Chapter 8: Purpose

17 August 2023

8.1 Introduction

Creating a system requires time, effort, and many other resources. The result of spending those resources should be worth the expenditure: the system should do something useful for someone.

This is another way of saying that the system should have a purpose, and that the purpose should be expressed in terms of what the system can do for the people or organizations that will depend on it. This definition of a system’s purpose means that it depends both on what the system does and who it does it for; both must be worked out to be able to accurately reason about a system’s purpose.

The list of who the system is for should be expansive, including everyone who has an interest in the system. This includes the system’s users, who will need to benefit directly from what the system does. It also includes the people or organizations who build and maintain the system and their investors, who will need to get benefit from the effort and resources they put into making the system. It includes others, such as regulators or industry groups, who represent the public interest in avoiding dangerous activities. This list amounts to the (often-abused) term stakeholder, interpreted broadly.

Each of these stakeholders will have a different interest in the system. The needs of each stakeholder must be discovered and recorded. Users derive benefit from the system’s explicit behaviors. Builders and funders derive benefit from compensation for the system, and in the longer term from the potential opportunity to evolve the system, provide it to others, or develop new systems. Regulators, industry groups, and the public derive benefit from how the system affects the world at large in terms like safety, fairness, or security. All these needs must be satisfied, and they cannot be satisfied reliably unless they are known.

8.2 Why purpose matters

Purpose provides a basis for decisions about whether something is worth doing, or to choose among different ways to do something. It guides the design and implementation: each part of the design can be judged on whether it adds to meeting the purpose or not. The sum can be judged on whether it meets all or enough of the purpose to justify building or deploying the system

This principle applies to parts of the system as well as to the system as a whole. Each part has a purpose that it needs to fulfill in order for the system to fulfill its purpose.

Purpose matters because of what happens when one does not give it enough consideration. I illustrate this with two examples, from among the dozens I have encountered.

Early in my career, I was tasked with building software that would be used by machine shop workers to process repair work orders and manage parts inventory. This system would be installed on minicomputer systems with terminals around the shop. I had what I thought was a clever idea for the user interface, based on the ideas of non-modal UIs that were beginning to enter the world in the early 1980s. The result met all of the functional requirements needed—and was completely unusable. I had focused on building something I thought surely would be good without doing the work to understand the needs of the shop workers who would use the system.

More recently, I worked with a startup that was building a software system to control a small vehicle. The software designer had decided that the foundational software infrastructure should provide an event loop mechanism, where the infrastructure would cycle at some frequency, and in each cycle would call functions to read sensor data, perform computations, and write commands to actuator devices. This is a common design pattern for this kind of system, and a reasonable starting point. However, when the designer was asked how they envisioned this being used to implement PID controller logic, it turned out that they had not ever considered what a controller would need and many necessary capabilities were missing. By the time the first version of the system was released for deployment, the vehicle had no control systems implemented.

The common thread in these examples is that in neither case did the person responsible work through the system’s purpose in order to ensure that what was built would be useful. Instead, the designs were based on an unvalidated belief about the right design, and the choices resulted in unusable implementations.

In both cases, a significant amount of time was spent building a system that did not work. In both cases the resulting system could potentially have been redesigned and reimplemented, but building the wrong thing had used up the available time and delivery deadlines were close by the time they were finished. In the case of the shop management system, the project subsequently failed as a result. In the vehicle control system, at the time of writing it remains to be seen if the team can get funding and time to correct the errors.

Both examples would probably have turned out better if effort had been put into a proper articulation of what the system needed to provide before anyone went into depth on design.

I discuss gathering information about purpose and documenting it in ! Unknown link ref.

8.2.1 Not monolithic or fixed

While it would be nice if purpose could be defined once and then remain fixed for the life of the system, this rarely happens.

First, a system’s purpose is rarely fully understood, especially in the beginning of a project to build the system. A team can begin by talking to potential stakeholders and finding out what they need, but inevitably someone will realize some important system behavior well after design or implementation are in progress. Not all of the stakeholders may be apparent at the beginning: for example, in one project I worked on, insurers turned out to be an important stakeholder, but we didn’t appreciate that for quite some time. A team must expect that their understanding of a system’s purpose will be rough at the start and become more accurate over time.

Second, purpose is not usually monolithic: there are many things that could be part of the system’s purpose, and usually people want many more things that are practical to build. The list of potential features usually has to be narrowed down from a long list of user or stakeholder wishes to a short list of the most important features—perhaps with a plan to add more capability over time. This means being able to separate the different features or properties and rank them by importance and achievability. A team must expect that items will be added and removed from a system’s agreed-upon purpose as time goes by.

Finally, needs change. If a project to build and deploy a system takes a few years, the world in which it is deployed will likely be different from the world when the project started. Available technology may change, or the user’s market may have shifted, or new regulation may come.

The result of these conditions is that a system’s purpose is not fixed, and the team building the system must be prepared for these changes. Being prepared means regularly checking for changes in stakeholder value and recording what is learned. It means using design and development processes that can adapt to these changes when they happen. And it means a management commitment to managing change honestly, pushing back on user requests when needed and supporting the development team when changes need to be made. It also means that an organization must be prepared to end a project when the system’s intended purpose no longer has enough value to its stakeholders.

! Unknown link ref discusses how to gather information about purpose, and how to work with that as the understanding changes.

XXX add references to prototyping and end user validation

8.2.2 Inconsistent or conflicting purposes

Having multiple stakeholders usually means that two or more stakeholders will have incompatible needs or desires. Even a single stakeholder may have conflicting desires.

This can cause two problems. First, conflicting needs make it hard to design a system that meets its purpose. Second, conflicting objectives make it harder to rank and choose among potential system objectives.

There is no simple recipe for handling such inconsistencies. One first has to recognize when an inconsistency or conflict exists, which requires understanding what all the stakeholders are saying and understanding the implications of that information. Then one has to work with the stakeholders to find a resolution—be that a negotiation that produces a compromise, or a realization by one party that their needs cannot be met. This can lead to difficult discussions, especially with customers: it is hard to tell a customer that current regulations make some feature they strongly desire illegal.

8.3 Explicit purpose

A system’s or component’s purpose can be separated into explicit and implicit parts. I use a simplified eVTOL aircraft as an example to explain these.

The explicit part is what stakeholders who will directly use the system say they need. This includes:

The functions or behaviors the users want. Example: the aircraft needs to carry up to four people over a distance of at least 100 km, with takeoff and landing at heliport-like facilities.
The use cases that give context to the functions. Example: the aircraft will be used as a taxi service to carry passengers between airports and city centers.
Important properties the users want. Example: the aircraft must be at least as safe as commuter-class aircraft.

The stakeholder can only rarely specify exactly what they want. They may have a general idea, but it often requires several discussion sessions for them to express the idea clearly. The team eliciting the purpose from them usually needs to employ active feedback techniques, providing the stakeholder with an interpretation of what the team thinks they have said in order to validate that they have correctly understood the needs.

See ! Unknown link ref for more about different kinds of stakeholders and projects, and what must be done to learn about each kind.

8.4 Implicit purpose

A system’s implicit purpose comes from stakeholders who are involved in the system but are not its direct users.

The organization developing the system has some reason for doing so. Example: an aircraft company designs the eVTOL aircraft in order to make a profit selling the aircraft and maintaining them.
A venture organization or bank providing funding expects a return on funds invested in the project.
Regional and local governments have an interest in promoting economic development in their region. Example: a regional government supports the company building the aircraft in order to provide jobs in its region, for tax revenue, and to build a reputation as a place to do similar work.
Government agencies and industry bodies have an interest in the public good. Example, in the US the FAA reviews and certifies aircraft designs for safety before they are allowed to fly in general use, based on regulations in 14 CFR Part 23 [14CFR23], and defines how the aircraft must integrate into the air traffic control system.

8.5 Using purpose

A system’s purpose must guide its design and development. This means that the purpose provides the standard on which design and management decisions can be made. There are several activities in system development that depend on purpose.

First, a project must actively gather and validate its understanding of the system’s purpose. This activity must be explicitly planned for, and sufficient time and resources provided. The resulting information should be validated with the customer and recorded in artifacts that can be referenced throughout the life of the system.

Second, the desired purpose is almost always more complex than what can be developed feasibly at first. The initial desires need to be ranked and pared down to what is essential.

Third, every project has a “go-no go” decision checkpoint, when the team decides whether to proceed with building a system or not. The fundamental question is whether a system can be built that meets all its important purposes, and this requires an analysis to determine whether that is feasible. Is it likely that a system can be built that meets the customer needs? And that will provide necessary compensation to the organization that builds it? Will other stakeholders agree to it? If the answer is no to any of these, then the team should not proceed further in building the system.

Next, purpose should guide design and implementation decisions. Each part of the system must play a role in meeting a stakeholder need, and the team should be able to articulate how it does so. If some part does not support the system purpose, it should not be built. If there is a choice to be made between different design or implementation approaches, the one that best meets the system’s purpose should be the choice. Moreover, the team must be able to explain how each of these choices were made. Chapters ! Unknown link ref present methods for ensuring this happens.

Finally, the system’s design and implementation should be checked against the decided purpose. ! Unknown link ref discusses system validation and acceptance.

Chapter 9: Component parts

30 January 2024

9.1 Introduction

In this book, I describe a system in terms of its parts and its structure. The system overall has a purpose, which can be described in terms of things it should do or properties it should have. The system meets this purpose by combining the parts together with the structure of how the parts interact. One should be able to show that the desired system behavior and properties follow from the combination of parts and structure.

In this chapter, I start by discussing components, the term I will use for parts. In the next chapter I will discuss structure, and how the combination of parts and structure leads to emergent properties that meet system needs.

Terms. I use the term “component” as a generic term for a part of the system. Some approaches use different terms, such as “element” or “item”. Other approaches use different terms depending on the level of encapsulation: system, subsystem, component, subcomponent, for example. I use the term “component” throughout, with “system” reserved for the system as a whole, and “subcomponent” used to denote a component that is part of another component.

9.2 Definition of component

A component is something that is part of a system and that people can think of as a unit. “Unit” implies some kind of singular aspect to the component: one purpose, one implementation, or one boundary, for example.

This definition implies that different people will see different things as unitary components, often depending on the level of abstraction one wants to work with. One person may think of “the electrical storage system” as a unitary component, while another person may think of battery cells and power regulator chips as components, and the electrical storage system as a collection of components. Both views are correct, and both are useful at different times or for different people.

The focus on unitary purpose or boundary is a way to address complexity in a system. The focus is meant to help humans organize and understand the system they are working with by taking a divide-and-conquer approach. It means that some people can focus their attention solely on the component, making sure that it is designed and implemented to meet its purpose while not having to think about the rest of the system. The focused attention on the component must be complemented by attention on the system structure that connects the component to others, as described in the next chapter.

There are three related principles that can help identify what is a component and what is not. (Some of this is based on principles presented by Parnas [Parnas72].) These are only guides, and there are exceptions to each of them.

Is there a singularity of purpose in the thing that might be a component?
Is there stronger coupling or connectedness within the thing than across a boundary to the outside?
Can the design or implementation of the thing be replaced by a different version without affecting other components in the system?

The goal of the first principle is to organize components around their purpose. If a thing has multiple purposes, that suggests that it might be divisible into smaller parts, each with a sharper focus, or that part of the thing might be better combined into something else with a similar purpose.

The second principle addresses how independent a thing is from other things. Independence can be viewed in terms of causal relationships with other components, as covered in the next chapter. The more tightly two things are related, the more they will have to be designed, implemented, and tested together; the less they are related, the more they can be worked on independently. If two things are strongly related, one should consider merging them into a single component; if they are loosely related, they can be more readily treated as separate components.

The final principle is also related to independence. If the design or implementation of a thing can be replaced with little or no effect on the design of the rest of the system, then that is evidence that the thing is independent and can be treated as a component. Having clear and narrow interfaces between the thing and the rest of the system is a sign that the component is independent. More broadly, replaceability is often an indication that something should be considered a separate component.

There is one additional indication that something should be treated as a component: when it is something that is usually sold or acquired as a unit. Electronic chips, antennas, motors, and batteries are all generally bought as units. Software packages are often acquired as units, whether bought or acquired as open source. A person hired as a contractor to fill a commonly-defined role can be seen as a component in a system.[1]

9.2.1 Component purpose

Every component has a purpose, which defines how that component contributes to the system as a whole. “Purpose” is a broad term, including behaviors that the component should have, properties it should exhibit, or functions it should provide. A component’s purpose is not necessarily defined precisely; sometimes, the purpose is a somewhat ambiguous prose statement of what a human wants the component to do or be. Turning that ambiguous statement of purpose into a precise and actionable definition is part of the engineering process. I discuss this in ! Unknown link ref.

Most human-designed components have a single primary purpose or property, possibly with multiple secondary purposes. Consider a battery: its primary purpose is to store electrical energy and make it available to other parts of the system. The battery may have a number of secondary purposes, such as providing mechanical structural rigidity, providing thermal mass to help maintain a constant temperature in the system, or contributing to the location of the system center of mass.

Each component has a number of properties that derive from its purpose: its state and behaviors, the inputs and outputs it can provide, and constraints on how it should be used. The documentation of these properties provides an unambiguous and precise specification of the component.

People working on the component need to have the purpose (and the specification that derives from it) available as they do their work. This information guides how they design the component, and how they verify that a design or implementation meets its needs. It is important that all of the purpose is available to them in one place so that they know they are considering everything they need to consider, without hidden surprises they couldn’t find.

9.2.2 Limits of the component approach

Components help human engineering and understanding—but when humans aren’t doing the design, there are limits on how the approach applies.

Consider a mechanical structure designed with a generative design tool. The tool can take in a specification of what the structure should achieve—forces, attachment points, and so on—and will find a design that optimizes for given criteria such as weight or cost. These structures often do not resemble ones people design because the tool can explore a more complex design space than a person can, and as a result often produce substantially better results than the human designs. Such designs can also potentially co-optimize multiple functions, such as a mechanical structure that includes channels for coolant flow within the structure or that meets RF reflection properties. While a person could make such a design, generative tools can do so at far lower cost.

As a second example, consider a neural network trained to recognize elements in a visual scene. The neural network is designed by performing a training process that uses a large number of examples of the kind of recognition the system should perform. The resulting network is typically much more accurate than a manually-designed algorithm. However, people cannot investigate the network itself and readily determine how the connections in the network lead to accurate (or inaccurate) image recognition. It is difficult to look at a specific connection in the network and explain how it affects the result, or how changing that setting will change recognition properties.

Both these examples are components that will be part of a larger system. As components, they have a defined purpose, from which a specification can be derived defining what the component should do. From there, automated methods take over to produce the design (for the mechanical part) or directly produce the implementation (for the visual recognition component). If these components were designed by people, we would expect that we could review and understand the component’s design as a check on its correctness. As machine-generated components, however, we only verify that the design or the implementation complies with its specification.

There is one significant difference between the two examples: how they can be verified for compliance with their specification. A mechanical component’s specification is generally complete: all of the conditions in which the component should function and the component’s behavior in each environment can be specified. This means that compliance can usually be checked using finite element analysis software tools, and example components can be built and subjected to their intended loads. Components implemented using neural network methods, on the other hand, usually are expected to function in a complex environment that is too large to fully enumerate. The training methods use a number of example cases, and induce from those examples an implementation that should properly generalize to all, or enough, real cases. The compliance of the component therefore cannot be completely verified, but must be done statistically.

9.3 Divide and conquer: the component breakdown structure

The component approach involves breaking down the system into unitary component parts, in order to make each part manageable by a person. However, as we have seen, different people use different levels of abstraction to understand the parts of the system.

In practice, people divide up a system first into major subsystems, and those into smaller components, and so on until the components are simple enough to deal with. This recursive division defines components at varying levels of abstraction: the electrical power system as a whole, with the power storage, power distribution, and power generation components as parts of the overall power system.

The following is an (intentionally) partial breakdown structure for a spacecraft, illustrating how the spacecraft as a whole (the “space segment” of the whole system) is organized into multiple trees of components.

This recursive division creates a tree-structured component breakdown structure of the parts of the system. The breakdown structure organizes components in a way that helps people find components, including both finding a specific component that they are looking for and discovering related components that they do not already know about. The structure also defines levels of abstraction that allow people working at different system levels to focus their attention.

The breakdown structure organizes components, but it does not define the system structure, which I will discuss in the next chapter. The system structure defines how components interact with each other, which generally crosses different parts of the breakdown structure tree.

The system and high-level components should be broken down into subcomponents that have a strong internal relatedness and weaker relationships between subcomponents. In doing so, the high-level component provides an abstraction of its subcomponents. This usually means breaking into subcomponents either by function or physical location. Most people think first of dividing by function: the electrical system, the hydraulic system, the communication system. Location is often more implicit. For example, a space flight mission is organized first by ground system, launch system, and flight system (physical locations) and then by function in each location.

Consistency in organizing the hierarchy around some principle, such as function or location, is important. I have worked on some projects that began by organizing by function, then decided that the resulting hierarchy was not balanced: the hierarchy went deep in some parts of the system and remained shallow in others. This led to the team reorganizing the breakdown to take esthetics such as balanced depth into consideration. While the resulting hierarchy was easier to draw, actually using the organization became more complex and error-prone. High-level components no longer provided an abstraction of a collection of subcomponents as a whole. Instead, the collection of related subcomponents was split between two or three high-level components; nowhere was the one abstraction of the whole set represented. Building specifications, tests, and project plans became harder because related things were no longer related in the hierarchy.

A system will not necessarily have a single optimal breakdown structure. When that happens, one must pick some approach and stick with it. Some systems will have lower-level components that contribute to multiple high-level functions. If the system is organized according to the high-level functions, then the low-level components could fit into multiple branches of the hierarchy. I will discuss this further in the next chapter , when I cover how one uses hierarchy to organize the structure of the system.

9.4 Component characteristics

Each system component is defined by a number of characteristics. These characteristics define an external view of the component: information about the component that can be observed without knowledge of how the component is designed internally. The characteristics constrain the component’s internal design, but should only include those aspects that will affect how the component fits with other components to make up the system.

There are six kinds of characteristics in the component model used in this book:

Form
State
Actions or behaviors
Interfaces
Other non-functional properties
Environment

Form. The “shape” of the component. The component does not typically change its form over time. For physical components, form is obvious: the geometry of the volume or area that the component occupies. Form might include the material of which a physical component is made. For electronic or data components, form is how it is packaged: a data file in some format, or a software component in the form of an executable application.

Examples include:

The geometry of an aircraft wing rib (which roughly matches the shape of the wing’s airfoil)
The length and diameter (or gauge) of copper wire carrying an electrical current
The library format used for a software package

State. This is the mutable “condition” of the component at a particular point in time. More formally, state is the information that is necessary and sufficient to encapsulate the past history of the component, so that any reaction that the component performs to some input is fully determined by the input and the state. State can be discrete (such as binary-encoded digital data) or continuous (such as the angle and angular momentum of a rigid body at a point in time).

Practical examples include:

The value of a digital memory register
The heat stored in a solid object
The angle of a control knob that is connected to a rheostat
The velocity and rotation rates of a rigid body moving in space
The wear on a bearing connecting a rotating object to a non-rotating one
The altitude of an aircraft

Actions or behaviors. These are the state changes that the component can perform. Some behaviors are reactive, meaning they are initiated by some input. Other behaviors are continuing, meaning that they continue to be performed without further input.

Examples of reactive behaviors:

Incrementing a memory value in response to a sequence of program instructions
Starting current flow in a circuit in response to closing a switch
Changing the temperature of a component as heat is applied to an interface
Changing the rotation speed of a rigid body in response to applied torque

Examples of continuing behaviors:

Steady change of the rotation angle of a rigid body at its rotation rate, given no external torque applied to change the rotation rate state

Interfaces. These are the ways in which a component is connected to other components in the external world. Inputs can be given to a component, and output can be received from it. Inputs and outputs create a causal relationship between actions in one component and another. Inputs trigger reactive behaviors in the component that receives the input. Outputs can be a result of a reactive behavior, or an observation of a continuing behavior. Outputs are the only way another component can observe information about a component.

Examples of inputs:

A digital message received over a network or I/O port
The changing of an electrical switch position from open to closed or vice versa
Heat received by contact with another component, or by radiation from another component
Force transmitted across a contact with another component
Movement of a user interface control by a user

Examples of outputs:

A digital message sent over a network or I/O port
Electrical current or voltage from a generating component
A visual observation of the angle of a rotating rigid object
An alarm sound played to alert a user
Ongoing sound and vibration that a person uses to monitor a component’s function

Non-functional properties. Components often have some properties that do not change over time (or change very slowly). These properties are not state per se, but they create important constraints on the component’s design and implementation and affect how the component should behave.

Some non-functional properties:

Mean time to some type of failure
Minimum or maximum capacity (in liters or amp-hours)

Environment. A component is also characterized by the expected environment in which the component will operate. This can be viewed formally as part of the component’s interface, but in practical terms it is useful to call it out separately. The environment specification typically includes information like the storage and operating temperature range, humidity, atmosphere, gravitation or acceleration, electronic signal environment, or radiation.

! Unknown link ref details more about components and how to specify them.

9.4.1 Characteristics and hierarchy

A high-level component provides an abstraction for the subcomponents that make it up. This implies that each of the characteristics of a high-level component—its form, state, behaviors, and so on—needs to be reflected in the subcomponents. For example, if the high-level component has some state A, then one or more of its subcomponents must have some state that, when aggregated, implements A. If the high level component has form B, then the subcomponents when put together must have that same shape.

Consider a radio communications component. The purpose of the component is to send and receive data packets with another radio somewhere else. The radio component has interfaces to communicate data with another local component, an interface to emit and receive RF signals, and other interfaces for control, configuration, power, and heat transfer. This example radio component, similar to those that might be used on a small spacecraft, has an antenna that is initially retracted but can be deployed on command.

The radio is built of a number of subcomponents. These subcomponents must implement the state of the radio overall, as well as all its interfaces. The diagram below shows a simplified possible implementation.

The set of subcomponents implements each of the interfaces named in the high-level radio component. Many of them are provided by the transceiver component, but the antenna handles the RF signal sending and receiving.

The state of the high-level radio is divided over the subcomponents. Again, much of the state is contained in the transceiver component, as it performs the data manipulation. The deployment state is a physical property of the antenna: it is either retracted or extended.

In the example implementation, however, there are multiple powered components—the sensor and actuator related to deploying the antenna in addition to the transceiver. This results in a more complex power state than defined in the higher-level radio component: some of the components could be powered on while others could be powered off, rather than a binary on/off overall state. During design, discrepancies like this should lead to improving the specification of the state of the high-level component.

9.5 Downsides

As I have noted, breaking a system into separate and independent components benefits the people who need to understand the components. This advantage generally outweighs other considerations, but there are downsides to this approach.

The first downside is that a reductive approach doesn’t allow for many kinds of system optimization. Having two separate components means that the two are not jointly optimized.

Software language compilers illustrate this. If each program statement is considered independent, the compiler translates each statement into a block of low-level machine code. However, optimizing compilers break this independence, and gain large speed improvements in the generated code. For example, a code optimizer can detect when two statements perform redundant computations and merge them. An optimizer can detect that a repeated computation (in a loop, for example) can be moved out of the loop and performed only once.

Software optimizers allow a developer to write understandable code, and it performs optimizations that can be proven to maintain correctness but that make the resulting machine code hard for a person to understand and verify. There remains the possibility of system optimizers that perform similar translations, but they are not generally available today.

The second downside is that breaking a system into many components creates an organizational problem: how does one name or find a particular part? A hierarchical component breakdown can help organize the pieces.

9.6 Why components matter

We split complex systems into component parts in order to make parts that are understandable by the people who have to work on the parts. The approach also makes it easier to manage parallel design, implementation, and verification of the parts. If one wants to acquire a component from an outside source, having a definition of what the component is helps the acquisition process.

Each of the people working on the system needs information to work on their parts. Defining a component provides a locus around which to organize the information related to a component. Having a model of what a component is provides a basis for designing artifacts that will contain the right information.

Different people will need to work at different levels of abstraction in the system. Organizing the components hierarchically provides these different levels of abstraction.

The people working on the system need to find pieces of the system, both when they are looking for information about a specific piece and when they are trying to learn what the pieces are. The hierarchical structure provides a way to name and find information about a component, and provides a structured index to help people browse and discover.

Finally, it is generally understood that the structure of a system is related to the structure of the team that builds it [Conway68]. I discuss this further in Chapter 16. XXX add ref to detailed team structure chapter

[1] With the obvious note that the person is not, in themselves, the component; the role they play is the component. The person still should be treated as a person, and not as a cog in a machine.

Chapter 10: Structure and emergence

3 November 2023

10.1 Introduction

Component parts of a system define the building blocks out of which a system can be built, but by themselves they do not create the complex, high-level behaviors that systems are built to exhibit. System behaviors and properties arise from how the component parts work together. How the components are connected, and how they interact over those connections, is the structure of the system.

In this chapter, I define what is meant by system structure and provide examples of how behavior can emerge from the combination of components and their interactions.

To build a system, one generally has to build a model of what the system is and does. This model will play essential roles in designing a system and analyzing its design. Enquiry into how to organize information about a system’s structure helps one develop a useful model, and so in this chapter I present an informal way to model a system’s structure.

10.2 Definition

The meaning of “system structure” has been debated, but I use the following definition, chosen for its engineering utility:

Structure is how each component part’s behavior relates to each other component part’s behavior.

More generally, the structure is the graph of how components affect each other.

Components can be related in two different ways:

A functional relationship, which defines how an event in one component can cause an event in another component, and
A non-functional relationship, which defines how events in two components are correlated (or not).

Functional relationship. The functional relationship is a relation from one component to another that maps how some output on an interface of one component can potentially be received on an interface of another component, and thereby cause a reaction in the receiving component. That is, the functional relation is a map of possible interactions that can be viewed as a directed graph, with components as nodes and directed edges showing how causality can flow between them.

Consider two electronic components connected by a signaling line, similar to those used in several serial communication standards. One component is able to send a signal on the line by changing the voltage relative to a common ground; the other component is able to observe the voltage and determine what signal was sent. By sending a sequence of different voltage levels, the sender can transmit a series of zero and one bits over the line to the receiver. The receiver can decode the bits into a message, perhaps containing a number, and act on the message it has received.

This functional relation is separate from and mostly independent of the component breakdown, defined in the previous chapter. The component breakdown is primarily about organizing the parts so they are understandable, and do not imply a causal relationship. Functional relationships show how components in different parts of the component hierarchy work together. The component breakdown is helpful for defining levels of abstraction; we deal with those in the next section.

Non-functional relationship. A non-functional relationship between two components indicates how their behaviors may be related in non-causal ways, such as two components being independent of each other or showing correlated behaviors. These effects do not depend on interaction between the components, but instead are based on inherent characteristics or history of each component.

Independence and correlation are typical non-functional relationships. These terms are defined in the usual statistical sense. Informally, two components are independent if the probability of some event occurring on both components is the same as the product of the probability of each event occurring on its own. Events on two components exhibit some degree of dependence if the probability of both occurring is different from the product of each event occurring on its own. For a positive correlation, when one event occurs the other is more likely to occur. For a negative correlation, when one occurs the other is less likely to occur. At the extremes, one event occurring means the other is certain to occur, or that the two events never occur together.

Many non-functional relationships are the result of common-cause events. This can occur when two otherwise-independent components A and B have functional relationships with a third component C. When an event occurs in C, it interacts with both A and B so that both change their states. After such an event, the states of A and B are no longer independent.

System reliability is often built on a foundation of failure independence. For example, data can be stored in two copies, so that if one copy fails the other remains available. A scheme like this fails when both copies fail, and so the copies are designed to be independent to minimize the chances of both failing together. Independence can be a result of using different technologies to store each copy, or using devices from different manufacturers. Two devices from the same manufacturing batch might share a common manufacturing defect, which would increase the probability that both will fail.

10.2.1 Examples of functional relationships

Here is a list of some kinds of functional relationships that I have encountered in systems I have worked on. The first few relationships are simple and primitive from an engineering point of view, while the later examples are built as abstractions on top of simpler relationships.

Mechanical force transmission. One component can apply force to another component, perhaps causing the second component to move.
Electrical energy transmission. One component can provide electrical energy to another component, which allows the second component to turn a motor or operate electronic circuits. The first component thus has a degree of control over the second: when it provides energy, the second component can operate; when the first does not, the second component does not operate.
Fluid movement. One component can send a fluid (perhaps a gas or liquid) to another. As with electrical energy, the receiving component can use that fluid to perform some actions, such as moving using hydraulic systems or transforming a fuel into kinetic energy in an engine.
Data movement. One component can send information to another, perhaps using an electric or optical connection between them. The second component might store that information, changing its internal state, or it might perform some computation action on the information, such as computing a control action based on sensor data. Data movement is typically an abstraction on top of a lower-level electrical or optical relationship (though other relationships, including force transmission, are sometimes used).
Control. One component can provide abstract control over another, providing actuation commands to the second component. The second component uses those commands to change its state or behavior. A control relationship is typically built as an abstraction on top of a data movement relationship.
Authorization. In many systems, a person or component needs to have permission in order to perform some action; if it lacks that permission, it will be prevented from performing the action. Authorization is typically granted by one component to another component. For example, an administrator can use a permissions management tool to record who is allowed to do what; when a user attempts to request another component to perform some action, that component checks with the permission system to verify that the action is allowed. In this way the permissions management tool provides control over the component that is doing the checking. This is a highly-abstracted relationship between the permissions tool, user, and other component—but there remains a causal relationship among them.

10.2.2 Examples of non-functional relationships

Non-functional relationships capture ways that components can behave in coordinated ways without a direct causal relationship between them. These are typically states or behaviors that occur because two components share some common state (or do not share such state).

The following examples all relate to the independence or dependence of different components that are being used redundantly to improve reliability.

Location independence. Essential IT systems are expected to survive disasters, such as a fire, storm, or earthquake destroying a building housing servers and storage. This leads to using two data centers located far apart geographically—so that they are unlikely to experience local disasters at the same time. This location independence relationship does not involve causality between the two data centers; it involves only their state of being in different locations in relation to expected kinds of disasters.

Batch independence. Two separate but otherwise identical components might have been produced at the same time on the same factory line. This can lead to those two components sharing a common manufacturing defect: contamination in some material used in their manufacture that is not caught during acceptance testing, for example. Common manufacturing defects can lead to common failure modes, which is a problem if the two components are supposed to provide redundancy for each other. As a concrete example, an electronics board used in a spacecraft might have the wrong solder used to make connections, which can lead to the development of tin whiskers that create shorts on the board. If two or more components are being used for redundancy, selecting components from different batches that are unlikely to share manufacturing defects can reduce the chances that both will fail at the same time.

Wear correlation. Mechanical components often experience some kind of wear as they are used. Rotating components such as a wheel, for example, have to interface to a non-rotating component. The interface between the two is designed to minimize friction using smooth finishes, lubricants, and bearings—but there will still be some small amount of friction that will wear down the surfaces over time, leading to increased friction. Eventually the friction will become great enough to interfere with correct operation. Installing multiple new mechanical components at the same time and using them in the same way leads to correlated probabilities of failure. Mixing components of different ages, or applying different usage patterns to them, can reduce this correlation and improve the effectiveness of redundant designs.

Common-cause failure modes. This is a catch-all for many kinds of correlated failures, when two components that are nominally independent in fact respond to common situations by failing—perhaps in different ways. Consider primary and backup control systems, where the backup has a different implementation than the primary and provides a simpler control capability. (In spacecraft systems, this backup is sometimes called “safe mode control”.) In most situations, if one of the two components has a problem, it is unlikely that both will fail; this results in increased reliability over having just one or the other. However, there can be some situations that will cause both the primary and backup to fail: some input condition that neither can handle, such as a rotation rate beyond what the system designers expected to be possible, or an environmental condition, such as a radiation burst that damages the electronics of both the primary and the backup.

10.3 Abstraction

An abstraction is a summary or reduced form of a more complex thing, usually focused on the essential or intrinsic aspects of the complex thing. The abstraction is separate from any concrete instantiation of it.[1]

People use abstraction to manage the complexity of a large system. In an airplane, people talk about “the electrical system” or “the powerplant”—things that are built out of thousands of subcomponents, but which are usefully thought of as whole things in themselves. While the component breakdown structure, in the previous chapter , is one example of abstracting the details of multiple components into one larger, abstract component (or subsystem), most complex systems have multiple, overlapping ways to abstract and simplify views of the system.

In general, abstracting structure is about taking a relation between two (or more) high-level components and breaking it down into relations between subcomponents. In the example below, two high-level components A and B have a functional relationship. A and B are both abstractions of a set of subcomponents. The relationship between A and B is an abstraction of the relationship between the A.2 and B.1 subcomponents.

As a concrete example, consider software on two microcontrollers that communicate over a serial line. The software on each breaks down into an application software component and a serial driver. The serial drivers communicate (over a serial cable) directly.

Non-functional relationships can follow a similar pattern. If two high-level components A and B exhibit some kind of correlated behavior without direct causation, and those high-level components decompose into lower-level components, then at least one of the subcomponents of A must have a corresponding non-functional relationship with a subcomponent of B.

10.3.1 Overlapping abstractions

Abstraction is not necessarily purely hierarchical: some high-level abstractions overlap. Two different people can look at the same component and need to work with different aspects of it, and see it as part of different high-level abstractions. This is common in systems of even moderate complexity.

Consider an aircraft with modern avionics and engine systems. The avionics provide many functions: flight deck displays, pilot inputs, navigation, radio communications, autopilot, among many others. The powerplant provides thrust to move the aircraft and electrical power to run other systems, but in a modern aircraft it also includes an engine controller (FADEC) that provides autonomous management of engine operations.

The avionics and powerplant have overlap. The flight deck display will display engine status: thrust, temperature, thrust reverser deployment, and alerts when there are engine problems. The pilot thrust levers are connected to avionics, but provide commands to the engine controller. The autopilot needs to know the capabilities of the engines and how to provide them with control settings.

This overlap leads to a question: is the engine display function part of avionics or part of the powerplant? The answer is that it is part of both, depending on who is looking at that part of the system.

Consider a specific avionics unit for general aviation aircraft: the Garmin G3X display [Garmin13]. It can connect to an engine interface adapter, which in turn connects to sensors or a digital engine controller on the engine. The display is a general-purpose component, which can provide a pilot with many different kinds of information; engine status is just one function. The G3X unit contains a configuration database that defines what engine information it will be receiving, how to display that information to the pilot, and the conditions when it should issue alerts. This database resides within the avionics display unit, implying that someone designing the avionics system will be concerned with it. However, the database is specific to the powerplant installed on the aircraft—changing an engine model requires changing the database—and so it is of concern to people designing the powerplant.

This pattern is common in systems that have multiple functions: some particular component will contribute to multiple high-level functions, and different people will see that component as part of one abstraction or another based on what functions they are working on. Models of the system must accommodate these overlaps.

When two abstractions overlap, shared components must support both abstractions by implementing behaviors and properties that accurately support both higher-level abstractions. In the G3X avionics example, the configuration database needs to address the configuration of the powerplant as well as the interface to support pilot information displays. This can add complexity to designing the shared component, since behavior that supports one abstraction must not interfere with behavior necessary to support the other.

10.3.2 Abstracting a relationship

Some relationships between high-level, abstract components are themselves abstractions.

Consider once again the example of two microcontrollers that communicate with each other, as in the earlier section, but this time they communicate using a wired Ethernet rather than a serial cable. At the abstract level, there is a functional relationship from A to B where A sends data to B.

The data communication relationship, however, is an abstraction. The microcontrollers communicate using an Ethernet, which might consist of a pair of cables and a switch. The cables and switch reify the abstract relationship, meaning they take the abstract and make it into something real.

The inputs to and outputs from the reified data communication link are the same (at the abstract level) as the high-level abstract relationship: data gets transferred from microcontroller A to B.

This is an example of a general pattern. Two components at a high level may have a functional relationship, and both the components and the relationship between them decompose into a number of subcomponents. The consistency between the high-level abstraction and the lower-level details must be maintained, of course, but there is nothing that requires that a high-level relationship can only decompose into lower-level relationships.

In fact this pattern continues recursively down to the lowest observable levels. In the example, microcontroller A passes data into the Ethernet cable as a set of low-level electrical signals. Those signals, in turn, are made up of yet lower-level electromagnetic behaviors of the atoms in the conductors that join the microcontroller to the cable.

10.3.3 Consistency

A high-level abstraction and a lower-level implementation of the abstraction need to be consistent with each other. Speaking broadly, the high- and low-levels are consistent with each other if the low level implements everything in the high level abstraction, and everything in the low level implementation is reflected in the abstraction—that is, that neither level adds or removes anything from the other.

Abstraction does imply simplification, however. The high-level abstraction of a distributed software component might have a “logging” relationship to a centralized monitoring system. The decomposition of that relationship might involve a logging subcomponent within the software that uses a network connection to send log records to a receiver component within the monitoring system. The high-level logging relationship focuses on the ability to reliably and securely send log information to the monitoring system. To be consistent, the lower-level details must provide a way to transfer that information—using the network to move the data, for example. The statement that the information is sent securely—which would need to be better defined at the high level—might be matched by state and behaviors of the endpoint software components to authenticate each other and encrypt data in transmission.

Continuing this example, the lower-level implementation would not be consistent with the high-level abstraction if the network communication mechanism provided a way to send information in the other direction, from the monitoring system to the distributed software component.

We can put this in somewhat more formal terms as follows.

Components: Within one abstraction-implementation hierarchy, each high-level abstract component decomposes into one or more lower-level components. (Note that if a component is not decomposed, it isn’t abstract.) All of the state and behavior of the abstract component is implemented within the states and behaviors of its subcomponents, and the relations between its subcomponents. All of the states and behaviors of the subcomponents are reflected in the abstract state and behaviors of the high-level component.
Functional and non-functional relations: If there is an (abstract) functional relation between two abstract components A and B, then there must be one or more functional relations between the subcomponents of A and the subcomponents of B that implement all of the high-level abstract functional relation. All of the functional relations between subcomponents of A and subcomponents of B must be reflected in abstract functional relations between A and B.

This definition of consistency means that an abstract component or relation has to reflect all of the states, behaviors, or interactions that the lower-level components or relations can have, so that the abstract things model all of what the lower levels will do, and it cannot add to what the lower-level parts do. In reverse, the lower-level components or relations must implement all of what the abstract components or relations do, without adding other behaviors or interactions.

10.4 Emergent system properties

Emergence is the complement of abstraction: it is how high-level properties or behaviors arise from the properties or behaviors of a collection of lower-level components and their interactions. Put another way, one designs the emergent properties in a system to make abstractions true. Previously, in Chapter 5, I introduced the idea that system properties and behaviors are emergent from the properties and behaviors of the components that make it up, combined with the way those components interact. This idea continues recursively through a system, where each high-level abstraction is achieved by designing how subcomponents work and interact.

Emergent behaviors or properties are usually things that cannot be sensibly talked about at lower levels: these are properties that the individual components do not have on their own, but that the aggregation does when the components are combined. In physics, concepts such as gas pressure are emergent: no individual gas molecule has meaningful pressure, but the collection of a large number of molecules in an enclosed space gives rise to measurable pressure. Similarly, the shape of a leaf is emergent. No cell making up the leaf in itself has a property of the shape of the leaf, but the aggregation of all the cells as well as how those cells interact as the leaf grows (that is, morphogenesis) leads to a consistent shape that can be perceived of the whole.

In engineered systems, properties such as safety or “correct behavior” are emergent from the design of components and their interactions [Leveson11]. Consider an automobile: it has a property that the driver must be able to control its speed. The driver’s ability to control arises from the driver’s ability to give commands to regulate speed and the vehicle’s correct response to those commands. The vehicle’s speed arises from a combination of motor behavior, brake behavior, wheel interfaces to the road surface, vehicle inertia, and external forces like wind or gravity. One can talk about the rotational rate of the motor, or the degree to which brakes are applied, but driver control over speed arises from the combination of all these things.

There is a rigorous discipline of systems theory that provides a foundation for this discussion ! Unknown link ref.

An emergent, high-level property is said to supervene the low-level properties of components. A change in the high-level property can only occur when there is a change to the low-level properties. This principle implies that one can in general design low-level properties in order to achieve a desired high-level property. It may be difficult to do this design, of course, but it is possible; properly-designed low-level properties do not necessarily create undesired emergent behavior.

Designing a system so that a desired property or behavior emerges from components involves placing constraints on how lower-level components behave and interact. This is a top-down approach to handling emergent behavior. Reliability properties, for example, are often met using redundant components; for those redundant components to provide reliability, they must be connected in a way where one component can provide service when another fails—a property arising from how the redundant components interact with other components. The redundant components must also exhibit a non-functional relation of some degree of failure independence. I will discuss several more examples in the coming sections.

It is generally more effective to work top down, from a desired emergent property of an abstraction to the components and relations that will make it up, than to work bottom up, starting with a set of component behaviors and hoping a desired abstract property will emerge. Component properties combine in unexpected ways, and determining whether they combine in a way that produces the desired result and at the same time avoids unintended consequences is most often a nearly-intractable problem. Working top down means determining the constraints that must apply to the components and structure that implement the abstraction; analyzing (or designing) the components to determine if they meet those constraints is a simpler and more tractable problem.

For example, the software components inside most operating systems cannot be evaluated for good evidence that they provide the operating system’s intended features in all usage scenarios—and practical experience with popular operating systems shows that most contain large numbers of undiscovered errors. Those operating systems were generally built from the bottom up, with new components being developed on their own following only a minimal goal of function, and then added to an existing system. Only a very few operating systems or software systems of comparable complexity have been analyzed to prove that they actually implement their stated function correctly, and those examples have all started with clear definitions of the abstract behavior and worked from there to design the lower-level components and structure. [Klein14]

10.4.1 Examples of emergent properties

Emergent properties can be simple or complex; what they share is that the combination of properties or behaviors from multiple components yields something of a nature that would not apply to the individual components. Here is a set of examples illustrating different kinds of emergent properties or behavior, ranging from the almost trivial that one might not ordinarily think about as emergent to the complex, and including both desired and undesired emergent behaviors.

10.4.1.1 Reliable data communication

Reliable communication happens when information is sent from one place to another, with the information received matching the information sent. “Reliable” is usually qualified: a maximum probability that any arbitrary bit or message that is received does not match what was sent, and qualifications on the environmental circumstances such as distance between sender and receiver, or the absence of deliberate interference.

At a high level, communication involves an information source and an information sink. The source and sink have a functional relation of sending information from one to the other.

At the lower level, communication involves a set of components. The information source and sink remain. The functional relationship between them is reified by a chain of components: a transmitter, a receiver, and the medium between transmitter and receiver. It also involves various encodings used in sending from transmitter to receiver over the intermediate medium. The components have functional relations from one to the next, for moving information along this chain of components. The transmitter and receiver have a non-functional relationship: agreement on the encodings to be used to move information over the medium between them.

Neither the transmitter or receiver in themselves move information reliably from source to sink. Instead, reliable transmission is a simple emergent property of combining all the lower-level components and their relations. The reliability comes from properly matching the designs of the transmitter and receiver, including how they encode signals for transmission and reception, so that they can achieve the desired reliability on the medium that connects them.

10.4.1.2 Door closing and latching

Consider a door, perhaps to a cabinet. The door can be open or closed. When open, it can be closed by a person acting to close it. If no one acts on the door, it might remain open or close on its own. When the door is closed, it remains closed until a person takes a specific action to open it. “Remaining closed” means that the door stays closed even when force up to some defined limit is applied to the door. These behaviors should occur reliably for at least some number of open-and-close cycles. They only need to hold reliably in some benign environment (no deforming forces, no corroding atmosphere, and so on).

This is an example of an emergent property of a high-level component that can be achieved by properly designing the subcomponents that make it up.

One possible implementation of the door that would meet this high-level property uses a latch to hold the door closed. When the door swings closed, the latch engages and keeps the door closed. The latch can be connected to a knob or lever that a person can use to release the latch, allowing the person to perform a two-part action to open the door (release the latch, apply force to the door to move it open).

The high-level door thus decomposes into three subcomponents: the basic door, a latch, and a knob. These three subcomponents, plus the door’s user, have four functional relationships:

Latch to door: the latch holds the door closed when engaged.
Knob to latch: the knob can be moved to disengage the latch.
User to door: apply force to open or close the door.
User to knob: apply force to turn the knob.

The high-level opening action that a user can apply to open the door decomposes into a sequence of lower-level actions: a turn action applied to the knob, an opening force applied to the door, probably followed by a release action on the knob. The high level closing action decomposes into, first, ensuring that the knob is released, then applying a closing force to the door.

The implementation admits states that do not directly map to the high-level states of the door. For example, the implementation allows the user to turn the knob and then take no further action. This leads to a state of the system where the door is in the closed position and the latch is disengaged. If the environment applies an opening force to the door, the door is not restrained and will swing open. A designer will have to work out what these intermediate states are, and determine whether they are acceptable or not. (In this case, the situation might be resolved by saying that the high-level “open” condition maps to any implementation state where the door position is not closed or the latch is disengaged. Handling intermediate implementation states is not always so simple.)

The knob and latch will have properties that, together, support the high-level property that the door will remain reliably closed through some number of open-and-close cycles. These properties likely involve constraints on the wear imposed on each of them each time the door opens or closes, and the amount of wear before they begin to be unreliable. Similarly, the property that a closed door stays closed when some amount of force is applied to the door decomposes into properties on the latch and knob to ensure they will hold the door in position.

The overall property of remaining closed is an emergent property of the design of the latch and knob. The latch by itself is not closed or open by itself; that is a property of the door that arises when the latch is engaged and the door is in a closed position.

10.4.1.3 Failure resilience

A failure resilient component is one that can mask one or more failures of its parts while continuing to provide correct behavior. This is one way to meet a goal that a component is reliable or available; the other way is to make the fundamental reliability of the component higher.

For a concrete example, consider a control system for an autonomous road vehicle. The control system takes in commands from a user or other outside system, then must provide correct, active control of the vehicle’s attitude and movement to travel on the commanded path. Typical acceptable failure rates are one in 10^-7 to 10^-9 operational hours. The vehicle should fail safely where possible when the control system fails, but I will leave that aside in this discussion.

Many systems achieve this level of failure resilience using redundancy and voting. In this approach, multiple independent processors run the control algorithm synchronously, each receiving the same sensor input and generating actuation output. The actuation output from each processor is fed to voting logic, which determines whether a majority of the processors are generating consistent output and if so applies that output to the plant being controlled. If one of the processors fails by stopping, or by generating different outputs, the voting logic masks out the presumed failure.

The combined components will generally perform the same operations as one single computing component by itself, but the combination will fail less frequently. This improvement is an emergent property of the combination. It depends on two non-functional relationships between the redundant components: that they all exhibit the same behavior, and that they generally fail independently.

For the example vehicle control system, I found that the approach of using three identical embedded computers was—based on reliability analysis, not measurement—likely to provide only a modest improvement to overall vehicle safety. The redundant computers were not fully independent: they ran the same code, they shared the same power source, and were subject to heat and vibration in the same environment, all of which increased the chances two or more computers would fail together. They had a greater degree of independence to matters like a cable vibrating out of its connector or dust shorting out traces on the boards. In other applications, such as spacecraft, there are more sources of independent failure, such as radiation upsets. For spacecraft and aircraft, the cost of unreliability is also higher than for a road vehicle, making this approach to redundancy worthwhile.

An incident involving an Airbus A330 landing on 14 June 2020 illustrates how lack of independence between supposedly-redundant computer systems leads to failure [TTSB21]. The Airbus A330 uses three flight control primary computers; on landing, these control the braking, thrust reversers, and spoilers that slow the aircraft on the runway. In this incident, there was an error in the flight control law implemented in all three flight computers. On touchdown, the flaw was triggered in one flight computer after another until all three had failed, leaving the pilots only manual control of the brakes. The pilots were able to apply manual braking to stop the aircraft before running off the end of the runway. The failure occurred because there was a design flaw common to all three flight computers, meaning that there was no redundancy in the face of the particular condition that occurred on that landing.

10.4.1.4 Undesired emergent properties

Components are usually designed and organized so that together they achieve the desired emergent system properties. However, the same design can exhibit other emergent properties that are undesirable.

Network congestion is a commonly-cited example of undesirable emergent behavior. In its simplest manifestation, when multiple streams of data meet and cross at some router in the network, the streams can overwhelm the router’s capacity to process and forward data. The router typically drops some packets in order to try to keep up, which causes some of the streams in turn to detect missing packets and retransmit them—causing even more traffic through the router. This was first observed in the Internet in October 1986, when a particular congested network link was moving about 0.1% of the data it normally could when not congested [Jacobson88].

This has led to congestion avoidance and congestion control mechanisms in Internet protocols, which aim to either keep stream data rates below the level when congestion starts or recover quickly when congestion does occur. The sender and receiver behaviors in the congestion control mechanisms, however, have been found to lead to behavior synchronization across multiple senders, leading to oscillating loads that repeatedly overwhelm a bottleneck, then back off, wasting resources for a while, until the cycle leads to another period of congestion [Zhang90].

These behaviors are similar to other situations where behavior is unstable, and once it starts to behave poorly it gets progressively worse. In many of these cases, congestion or overloaded conditions make it more difficult for mechanisms that would address the situation to work.

The lesson to draw from the possibility of undesirable emergent behavior is that system designs need to be analyzed to look for such negative behavior—not just analyzed to ensure that desired behaviors happen. This is related to a kind of confirmation bias ! Unknown link ref where one is motivated, usually unintentionally, to look for evidence that confirms what one wants or expects. It often requires deliberate effort to look for evidence of negative behavior.

10.4.1.5 Spacecraft imaging a ground location

The final example takes the basic principles in the previous, simpler examples and combines them into a realistically complex case.

Consider a spacecraft system that is intended to take images of ground locations and send those images to users on the ground. The system includes many different parts:

A mission control center, where users can determine locations they would like imaged, generate commands for the spacecraft, receive images, and view the results
One or more ground stations, with antennas and transceivers that can communicate with the spacecraft
The spacecraft, which in turn includes:
- A communications subsystem, for exchanging messages with ground stations
- A control subsystem, which receives and executes commands from the mission control center
- An orbit determination subsystem, which might include a GPS receiver and logic for determining where the spacecraft is and how it is moving in orbit
- An attitude control subsystem, which controls the direction that the spacecraft is pointing and rotating
- A power subsystem, which provides power to other spacecraft components
- A payload subsystem, including an imager and memory for storing images

The process of taking an image involves every one of these parts, as well as others omitted from the example to keep the list from getting too long to read. It includes:

A user determining what ground location to collect an image of;
Formulating commands to take that image, and sending them to the spacecraft;
Receiving the commands, decoding them, and executing them when the proper time comes:
- Turning the spacecraft to point at the ground location, based on knowing where the spacecraft is and how it is moving;
- Powering up the imager, taking a picture, copying it to storage, and turning the imager off
Packaging the image data along with related metadata for transmission to ground
Sending the data to the mission control center by:
- Determining when the spacecraft is coming in range of a ground station;
- Turning on the transceiver;
- Encoding the image in packets and sending those packets to ground;
- Turning the transceiver off when out of range of the ground station;
- Repeating this sequence until all of the image has been transmitted
Decoding the image in the control center and providing it to the user

If any one of those steps fails to happen properly, the system as a whole will fail to achieve its objective. At the same time, no one component involved in these steps achieves the system objective by itself. In other words, the system behavior of taking an image of a ground location is an emergent property of the system as a whole.

This example is typical of most system properties and behaviors, in that achieving the desired behavior involves many components working properly together. This implies that all these components have been designed to have their individual properties, and that the components have been wired together with the right functional and non-functional relations to work together.

This example also illustrates a common issue: that components depend on other components for their function. For example, the ability for the spacecraft to communicate with the ground depends on the spacecraft being able to determine when it is coming in range of a ground station. This means that the spacecraft must be able to tell where it is, which might rely on the GPS system. If there were to be a problem with the GPS constellation, the spacecraft would not be able to communicate correctly. This kind of dependency creates non-functional relationships—in this case, a non-functional relationship between ground station systems and spacecraft communications that communications will function only when the GPS constellation is working properly.

10.4.1.6 Safety and security

Leveson argues that safety is a fundamentally emergent property:

Safety, on the other hand, is clearly an emergent property of systems: Safety can be determined only in the context of the whole. Determining whether a plant is acceptably safe is not possible, for example, by examining a single valve in the plant. In fact, statements about the “safety of the valve” without information about the context in which that valve is used are meaningless. Safety is determined by the relationship between the valve and the other plant components. As another example, pilot procedures to execute a landing might be safe in one aircraft or in one set of circumstances but unsafe in another. [Leveson11, §3.3]

I argue that related properties, including security, are similarly emergent and must be understood, designed, and analyzed in terms of how components are related.

10.5 Working with structure

The notions of components, structure, and emergence form a foundation for the work to be done when designing and building a system. Upcoming chapters will define the tasks, artifacts, and processes involved in terms of this basic model of how systems can be organized.

For example, the design of a system consists of artifacts that document what the components are in the system, and the desired properties and relations that connect them. Verifying the design involves gathering evidence for and against whether the behaviors that emerge from the components and their relations match the desired system behaviors. A design can be evaluated based on properties of the graph of relations between components, and the graph of relations can guide investigations into whether subtle non-functional relations (such as expected component independence) will hold.

In addition, there are common design patterns of components and relations that provide guidance for implementing complex behaviors. These design patterns can be expressed in general terms of components and relations, making the patterns broadly applicable rather than specialized to a particular use case.

Sidebar: Emergence all the way down

I have taken a pragmatic approach to abstraction and emergence, focusing on the kinds of relations and abstractions one actually encounters in building most real systems. This means only drilling down into lower layers of abstraction as far as is needed, and not as far as it could go.

Consider data that is exchanged between two electronic components. Data is an abstract component that has no direct physical reality; it is an emergent property of lower-level components and relations. The data itself is dependent on mechanisms for observation and interpretation by people—including agreement between sender and receiver on what the data “mean”. The data are transmitted from one component to another using low-level electrical signals over wires; the signals are designed to move the data from one component to the other. The low-level electrical signals are themselves an emergent property of yet lower level atomic and electromagnetic behaviors in the transmitter, wires, and receiver. These may in turn be emergent properties of yet lower level structures and forces, some of which may not yet be understood.

It is intriguing to think about how far one can take this approach. Luckily we can usually stop at some practical level and take the rest for granted.

[1] Following the definition of “abstract” in the Merriam-Webster online dictionary.

Chapter 11: System views

13 November 2023

11.1 Introduction

Systems are too big for one person to understand all the facts at once. It’s necessary to focus on subsets to manage the scale.

At the same time, different people have different interests as they are working on a system. They need a particular kind of information about part of the system, but do not need to be distracted by other kinds of information.

These needs for subsetting lead to developing multiple views on a system. Each view defines a subset of the information on a system, with the subset defined to support a particular person’s needs and interests. Ideally, each person can do their work using one view or another, and when all the work has been done using many different views the work has addressed all of the system.

Some of these views have a technical focus, being about the function or properties of the system and its parts. These views support those who design, analyze, implement, or verify parts of the system. Other views are non-technical, supporting people who manage the project, organize the teams doing the work, handle scheduling, and similar tasks.

The view concept I am defining here is a general mechanism for subsetting information about the system. There are several architecture framework standards that define “view” and “viewpoint” concepts, including DODAF ! Unknown link ref and ISO 42010 ! Unknown link ref. The view concepts in those framework standards arise from ideas about the processes that should be used to build systems well, and are thus more specific than the general idea presented here. These standards focus on developing models of a system’s design, with subset views that are motivated by exploring the objectives that system stakeholders have in the system. The approach in these standards is one way to use the general idea of subsetting information about a system based on some focus; I will discuss this further in later chapters when I turn attention to how to build systems using the foundational concepts presented now.

11.2 Technical views

Technical views are ones that subset the contents of a system in a way useful to the designers, implementers, or verifiers of the system. These views focus on how a part of the system functions or is organized in some technical sense.

These views can focus in different ways, depending on the specific need:

On sets of related components;
On paths through components to achieve some function; or
On the dependencies of components or relations.

A view focused on a set of components is useful to someone responsible for a particular subsystem or abstraction. The view can collect all the components, at varying levels of abstraction, related to one part of the system. This might be defined as one or more subtrees in the component hierarchy (Section 9.3)—for example, all the components that make up an electrical power system for a spacecraft. This might also start from some other abstraction. Views like this can be used when working out how an abstraction is to be realized in concrete subcomponents (Section 10.3). It can also be useful for checking whether certain design properties hold, like total mass.

A view focused on a path through the system is useful for working out or checking how behaviors are realized. Such a view might start with an event in one component, then trace how one event causes events in adjacent components, onward until the high level behavior is complete. Views like this are useful when checking where a path might have gaps that need to be addressed. It is also useful for checking that a causal path among abstract components and relations is properly realized in concrete subcomponents.

Looking at a path can help reveal what conditions need to hold for each step in the path to occur properly. For example, in the spacecraft commanding example in the previous chapter, a ground pass has to happen successfully if a command message is to be received at the spacecraft. A successful ground pass requires a functioning and available ground station, accurate ground knowledge of where the spacecraft will be, knowledge in the spacecraft of where a ground station is and when it will be in range, and the ability to operate the communications subsystem.

The third kind of view focuses on trees or graphs of dependencies. This information is useful to someone who is verifying that some safety or security condition holds. It is also useful for revealing where there are unexpected vulnerabilities in a system. In particular, looking at the transitive closure of dependencies can reveal unexpected shared dependencies between two components. In the spacecraft commanding example above, a spacecraft’s ability to know when it should operate its transceiver for a ground pass might be based on the spacecraft knowing its location through GPS. This creates a dependency on a GPS receiver on board and the correct function of the GPS constellation. Further, it may require the spacecraft to maintain an attitude where GPS antennas can see the GPS constellation; this may conflict with other demands on spacecraft attitude (like pointing an antenna toward a ground station). Both the communications transceiver and GPS receiver may rely on a shared electrical power system.

These three kinds of views are not mutually exclusive. Often someone can benefit from starting in one view, such as a path through the system, and then use other views to explore or refine the system, such as checking on dependencies.

11.3 Non-technical uses

Some views are useful for managing project execution. As a manager or lead, I have been responsible for working out what tasks people need to do to develop the system to some milestone, along with potential dependencies among tasks and estimates of the time and resources needed. I have needed to understand the system in order to derive this information about tasks.

For example, I have often started with a high-level design for a part of a system, containing a few abstract components and relations and a few paths through them for performing key behaviors. I have used one or two paths through those components to sketch out milestones that the team can design and develop toward; at each milestone, the designs or implementations will be integrated to demonstrate some level of functionality. This management step uses views of a few paths through the system. After that, I have worked from the view of components and relations that feed into each milestone to work out a set of design and development tasks that will get each part ready for its milestone. These steps use information about the components and relations involved to work out both the individual tasks and how those tasks might depend on each other, leading to constraints in how the effort can be scheduled. I expand on these techniques in ! Unknown link ref.

Following paths through a system, as well as tracing through the ways that abstractions are decomposed, allows one to find gaps in the current understanding. These gaps represent uncertainty, which can lead to risk. Further, following paths through the system that lead to and from some uncertainty to other components or relations helps one work out how much other parts of the system may be affected by uncertainty. This allows one to judge the potential effects of changes that may arise from the uncertainty; the magnitude of the effects is part of determining how much developmental risk some gap poses. I discuss how to use this kind of analysis in ! Unknown link ref.

Sidebar: Specifying a view

The descriptions above may seem focused on extracting subsets of a defined system, but the view concept is intended more generally.

In set theory, subsets are often specified one of three ways: by listing the elements of the subset; by constructing the subset through combinations of set operations such as intersection and union; and by specifying a characteristic function—essentially, a description of a query on the set.

All of these have been useful to me at one time or another. While a system is being designed, the population of components and relations that make it up will be changing constantly. The path through components and relations to achieve some function will be steadily refined; in many cases, there may be two or more alternative designs for parts of the same path to compare. This case lends to a query-like formulation of views, which are updated as the system’s contents change. On the other hand, tasks to verify that a design or implementation are correct and complete benefit from being an unchanging snapshot. This way someone can step through each part of the system, verifying each piece and each integration, without having that work change as people make changes to the system.

Chapter 12: Evidence of meeting purpose

20 November 2023

12.1 Introduction

An implemented and operational system needs to meet its purpose (Chapter 8). After all, that purpose is the reason that resources have been spent on developing the system and using it. Meeting purpose means two things: that the system does all the things it is supposed to, and that it does not do things it is not supposed to.

One cannot assume that a system meets its purpose. Every system needs to be evaluated to determine whether it actually does or not, and if not, how and where it does not. The evaluations catch design and communication errors that occur when one party thinks they have specified what is needed, and another party does not understand what was meant or makes a mistake in translating the specification into practice.

How a system works changes over time as well, and regular re-evaluation catches cases where operational behavior diverges from what is needed for correct or safe operation. This includes wear and tear on the system that must be corrected with maintenance. It also includes changes in how the system is operated—from operator practice to management organization and environmental context.

In this work I talk about gathering evidence of a system meeting its purpose. Parts of a system’s purpose can be specified quantitatively or qualitatively. Quantitative purposes can lead to deterministic ways to check that the system meets the purpose. Complex quantitative purposes, however, aren’t necessarily so easily evaluated: computational complexity or the difficulty in actually measuring system behavior can lead to quantitative properties that cannot be easily or definitively evaluated.[1] For these complex quantitative problems, one must be satisfied with statistical evidence that indicates whether the property is likely true. Qualitative purposes are not amenable to proof of satisfaction or not. These purposes are evaluated by human judgment, which again leads to evidence but not proof of satisfaction.

Systems engineering processes often use the terms verification and validation (or just V&V). These are both special cases of the general need to gather evidence for and against whether a system meets its purpose or not. In this chapter I focus on the general matter of checking a system, and I will note in this chapter and in later chapters ! Unknown link ref when these specific uses of evaluating a system apply.

12.2 When to evaluate a system

Checking whether a system meets its purpose is an ongoing need. It should be done continuously, starting from when the system is first conceived, through system design, implementation, and operation. In general, a system should be evaluated any time its purpose changes, or any time its design, implementation, or operation changes.

In practice, there are five points in a system’s lifecycle when the system—whether in design, in implementation, or in operation—should be checked against its purpose.

At each of the individual steps from initial concept, through specification, design, and implementation.
At the time when the system is accepted for deployment.
Periodically and regularly while the system is in operation, to monitor for drift.
At each step when a change is requested, from concept through design and implementation.
At the time when a changed system is accepted for deployment.

During development, systems are checked in two ways: step by step, and a separate evaluation of the whole system when implementation is complete. The step by step checking occurs at each development step, including generating a concept for the system, generating a specification, designing, and then implementing the design. The expectation is that if each of these steps is correct, then the concept will follow the purpose, specification will follow concept, and so on, and the resulting implementation will properly meet the system’s purpose. In practice something gets missed or misinterpreted at some step of development, and so the argument that each step is correct does not hold. Separately evaluating the implementation at the end directly against the original statement of purpose allows one to cross-check the step-by-step evaluation. It helps one find which step had a mistake and thus where to make corrections.

Any time the system’s purpose changes, the system must be re-evaluated in light of the change. This involves repeating steps in the life cycle shown above. Re-evaluation is easy when early in initial design; the later in the life cycle, the more expensive re-evaluation gets. The scope of what parts of the system need to be re-evaluated can be limited by examining the structure of the system and how a change propagates from one component or relation to another.

A system should be evaluated regularly while in operation. In practice, systems drift over time from how they are originally designed and implemented. People who are part of the system, whether as operators, oversight, or management, can shift in their understanding of what they need to do, and often find shortcuts for their role as they adapt to how the system is to work with. Mechanical parts of the system can wear, changing their behavior or increasing the chances of failures. The environment in which a system operates can change, perhaps with people moving near an installation that was previously isolated or maintenance budgets being cut. As a simple example, in one early software system I built, the software included a billing module that would create itemized invoices to be sent to insurance companies that were expected to reimburse for medical expenses. Over time, the people who should have been running the module and creating invoices forgot to do it as regularly as it should have, leading to revenue problems for the business. Leveson discusses several other examples [Leveson11, Chapter 12].

Finally, a system’s purpose usually changes over time. The users need new features, or some assumption about how they will use the system will be found to be wrong. Regulations or security needs may change. All of these lead to a need to change the system’s design and implementation. The team will recapitulate the development process to make these changes, including evaluating the updated concept, design, and implementation against the new purpose.

12.3 Kinds of evidence

There are two kinds of evidence: positive evidence and negative evidence. Both are needed to evaluate whether a system meets its purpose.

Positive evidence is an indication that the system properly implements some desired property or behavior ! Unknown link ref. Positive evidence is what most people think of first: that the mass of system hardware is within some maximum amount, or that the system performs action X when condition Y occurs.

Negative evidence is an indication that the system does not do something it is not supposed to ! Unknown link ref. Safety and security evaluations are fundamentally about collecting this kind of evidence: that the system will not do some unsafe action or enter into some unsafe state. Negative evidence is therefore vital to determining whether a system meets its objectives, but negative evidence is generally much harder to establish than positive evidence. In practice, analytic methods are the only ways we currently have to establish the absence of a condition.

Bear in mind that, as the saying goes, absence of evidence is not evidence of absence; that is, no amount of testing that fails to find an undesired condition can establish with certainty that a realistic system is free of that undesired condition. Negative evidence through testing requires testing every possible scenario, which is infeasible for anything other than trivial behaviors. Testing a very large number of scenarios can potentially generate a statistical argument for the absence of an undesired condition, but only if the scenarios chosen can be proven to provide sufficient, unbiased coverage of all possible scenarios, including rare scenarios. I have never found an example of someone being able to construct an argument for the significance of the test scenarios in a real-world system. Kalra and Paddock [Kalra16] present an analysis for testing autonomous road vehicle safety, and show that it would require an infeasible number of driving miles to show the absence of unsafe behaviors—and they conclude that alternate means are needed to determine whether autonomous road vehicles are sufficiently safe.

Many undesirable behaviors or conditions cannot be completely eliminated from a system, and instead the standard is to show that the rate at which these behaviors occur is sufficiently rare. For example, aircraft are expected to experience failures at no more than some rate per flight hour in order to be certified for operation. These safety conditions lead to a need for evidence of statistical bounds on rate of occurrence at a given confidence level.[2] If these bounds are sufficiently loose, then a carefully-designed test campaign can provide statistically significant evidence. However, statistical significance and confidence rely on the test scenarios either being selected without bias, or with a way to correct for selection bias. This means, for example, ensuring that there is no class of scenarios that are avoided in selection. It also means understanding the probability of rare but important scenarios occurring and accounting for that rarity in the number of scenarios tested or in the way scenarios are selected.

12.4 Methods of gathering evidence

There are three general methods for gathering evidence about systems satisfying their purpose:

Experimentation,
Inspection or review, and
Analysis.

Experimentation tests an operational system (or part of a system) to show positive evidence about some desired capability. This is the gold standard for gathering positive evidence.

Experimentation is usually divided into two categories: testing and demonstration. Testing involves setting the system into a defined condition and providing it defined inputs, measuring the system’s response, and comparing that response to expectations. Tests are expected to be repeatable. Demonstration is more open-ended, where the system is operated for a while, possibly by people, and not always in a fully-scripted, repeatable way. Demonstrations can address some non-quantitative conditions, such as whether people like something or not.

Inspection or review is a way to check a design or system for things that cannot be readily measured by experimentation. These methods use human expertise to check the system for specific conditions. It is primarily used to gather positive evidence, but it can be useful for gathering negative evidence when other methods don’t apply. In the simplest form, inspection checks simple conditions that would be difficult to automate; for example, that a physical car has four wheels. For more complex reviews, humans observe and think about what they observe in the system to determine whether what they observe meets expected behavior.

Analysis can be used to collect both positive and negative evidence. Indeed it is generally the most useful way to gather negative evidence—which is often about thoroughness, and analytic methods are better at ensuring all possibilities have been examined. Analysis takes as input a model of the system, extracted from its design or its implementation. It then applies algorithms that work to prove or disprove statements about that design, such as whether there exists some sequence of state transitions that would cause a component to enter an undesired state. The evaluation is usually performed using automated computational tools, though it can sometimes be done by hand for analyses of modest complexity. I have used analytic methods occasionally, usually for foundational components or abstractions on which the system depends for its correct operation. The first time I used it, on the design of a synchronization mechanism in a multi-threading computing environment, it caught a subtle flaw that would have occurred rarely but would have been difficult to detect. On another project, colleagues and I proved the correctness of the design of a family of distributed consensus algorithms—which helped us accurately implement the algorithms. The SeL4 operating system kernel [Klein14] has been formally proven design and implementation, showing that its implementation provides key confidentiality, integrity, and isolation properties as well as functioning according to its specification.

12.5 Completeness and minimality

Separate from these methods for gathering evidence, one also needs evidence of completeness and minimality.

When a system is believed to be complete, one doesn’t want only to show that one or a few purposes are met; eventually one needs to provide evidence that all purposes are met. This does require knowing what the purpose is, and then being able to provide evidence showing each part of it has been satisfied.

One also needs to show that the system as designed or implemented does not do things that don’t derive from and support the purpose. This includes showing that safety and security properties (of bad things not happening) are met. It also includes ensuring that people have not inserted features that the end users do not need or want, which would imply that development resources have been mis-spent and that the system can potentially do things the users will find undesirable.

[1] For example, consider a system property that is equivalent to the halting problem ! Unknown link ref or first order logic satisfiability ! Unknown link ref. Both these problems are formally recursively enumerable ! Unknown link ref, meaning that if a property is true that can be found in a finite time, but if the property is false it may take an infinite amount of time to determine that it is.

[2] Estimating the total number of species in an environment is a similar problem. One way to generate an estimate is to look at the rate at which new species are discovered. When most species in an environment have been discovered, the rate at which new ones are found decreases and in the limit goes to zero when all species have been discovered. See the work of Wilson and Costello [Wilson05] as an example of performing such estimation. It has been argued that the rate of discovery of undesirable system behaviors should follow a similar model.

Part IV: Making a system

In this part, I discuss the model presented in Chapter 6 for how to go about building a system:

Starting with the approach to organizing the model, and what stakeholders need from the project;
The artifacts that are created and maintained during the work;
The tools used to do the work;
The team that builds the system; and
Planning and guiding the work itself.

Chapter 13: Approach

8 January 2024

Making a system is about the activities to build the system and the people who do that work. In Chapter 6, I laid out a basic model for these activities and what they involve. The model involves five elements (repeated from that chapter):

The model is organized around the tasks that are performed to build the system. The tasks generate artifacts, including design and implementation. The team is the people who do these tasks. The people use tools to do some of the tasks. And finally, the plan organizes the work.

This model provides a template for thinking about how to set up the processes and policies for a system-building project. That is, when it comes time to do a project, one can use this model to help guide the decisions about how the project will be run. In this book I do not specify how one should make these decisions—each project has its unique needs, and no one recommendation will be a good solution for every project. Instead, the model provides a framework for understanding what decisions need to be made, and in later chapters I provide menus of choices for different parts of the model.

All the pieces of running a project are themselves a system, whose purpose is in general to get the system built. In this part, I follow a general approach for designing any system in order to lay out a set of functions that each part of the model can have. In doing so, this lays out a framework for the criteria by which someone can judge potential designs for their project’s organization.

The approach, then, begins with working out the purpose of the system for running the project. The purpose in turn derives from the stakeholders who must be satisfied with the execution. In the rest of this chapter, I lay out a template list of stakeholders and the needs each of them might have. This set of needs will then provide guidance for what each component part of the model—artifacts, team, plan, and tools—should have in order to satisfy the stakeholders.

13.1 Purpose

The primary purpose of the system for running the project is:

Get the system built, accepted, in operation; maintain and evolve it.

There are also secondary objectives that different stakeholders will have, which we will discuss next. This includes, for example, needs of the organization hosting the team that does the work: the organization in most cases expects at least to be able to cover the cost of development. If the organization doesn’t believe that it can cover the cost, they may well decide not to pursue the project.

In the next step, I identify potential stakeholders. Following that I will identify potential needs each can have.

13.2 Stakeholders and needs

The first step in working out a system’s purpose is to identify the stakeholders who define the purpose (or put constraints on the project that are, in effect, part of the purpose).

I group stakeholders into five classes:

The customer for which the system is being built;
The team that builds the system;
The organization(s) of which the team members are part;
Funders who provide the investment to build the system; and
Regulators who oversee the system and its building.

Each of these are meant to be roles, rather than single entities. For example, when a system is built under contract for an organization who is paying for the work, that organization is both the customer (they will be using the system) and the funder (they are paying for the system-building).

13.2.1 Customer

The customer is the person or organization(s) that will use the system once it has been built and deployed. The system’s value in the world in the end derives from what the customer can do using the system.

The customer primarily cares about the system meeting some need they have. In addition, they care that the system:

Is reliable, safe, and secure.
Will be within their budget, both up front and over time during operation.
Can be certified or approved (if needed).
Can be adapted as their needs change.

Variations. The simplest kind of customer is when one organization contracts with another organization to build the system for the first organization. In this case, it is clear who needs to be satisfied with the system (the one paying for it).

Other times the customer is internal: when an organization determines that it needs some system for its own use. Who defines the purpose of the system is then usually clear—though sometimes it is unclear who defines the purpose, because there is not such a clear separation between the “customer” and the builders.

Finally, the more complex situation occurs when the customer is hypothetical. This occurs when an organization builds a system product in hopes of providing it to future (paying) customers. In this case, there is no one person or organization who can dictate the system’s purpose. Instead, the team designing the system must build up an idea of who potential customers are and what they might want.

I discuss the different kind of customers further in ! Unknown link ref.

Relation to broader management. Most organizations have someone or a team responsible for finding and working with customers. This might be a business development group, or a sales and marketing group. These people will be responsible for actually working with the customer, and they should stand in as a proxy for the customer during internal discussions. The systems aspects that I discuss here support the interface between the marketing or business development people and the people who build the system that is delivered to the customer.

13.2.2 Team

The team is the collection of all the people who do the work to design and build the system. This group includes developers and engineers, managers, contracting specialists, marketing, and everyone else who does tasks related to getting the system built.

Many of the things that the team needs are not directly related to building the particular system, but are aspects of the organization for which they work. An organization’s policies and management have the most effect on whether the team are satisfied, but there are aspects of systems work that can support (or hinder) the organization.

The people in the team need, in general:

Satisfaction in their work. This means that they should have work that uses their skills and provides a reasonable challenge. They should have confidence that their work will have a positive outcome.
Appropriate staffing. The team needs to have enough people to get the work done without being overstaffed, leaving some people underused.
Secure position. The people in the team need to feel secure in their position, meaning they understand how they fit into the project, what is expected of them, how they will be evaluated, and where authority lies. They also need to understand how they can raise problems to be addressed, and have confidence that they will be heard.
Fair compensation. The people need to believe that they are being compensated fairly for their time and effort (and in ways that involve more than money).
Confidence in the project. The people need to have confidence that the project is well run, and that both their work and the result of their work will accord with ethical standards.

Variations. The team can be as simple as one or two people, or it range to a large team of hundreds. The team can be all within one organization, or it can be spread over multiple organizations (such as when multiple organizations collaborate on a project). A team can also be viewed as including external vendors who provide parts of the system or essential services.

Relation to broader management. Most of a team’s needs are matters of project management and business operations, not of systems-building in itself. The organization defines its human resources policies, for example, which address matters of how people are evaluated or paid, and how they can report problems.

However, the organization of systems work can help to meet these needs. Accurate staffing depends on understanding the work to be done, which in turn depends on the system’s design. Well-defined job descriptions and processes help people understand how to get their job done, contributing to people feeling secure in their position.

13.2.3 Organization

The organization is the entity or entities for whom the people in the team work, and which provide a legal entity for the project. I use the term “organization” rather than “business” or “company” because there are many kinds of organizations that can run a project: a government, a consortium of other organizations, a non-profit organization, or an informal group of people can all run a project.

All organizations share one concern: the ability to deliver the system. This includes having the ability to communicate with the customer (or model potential customers) and the ability to hire and support the team doing the work.

Organizations also share a need to maintain their reputation. If an organization has a reputation for delivering good systems, on time and on budget, they will be more likely to be able to keep going.

Some organizations have additional needs, focused around how the project will position them to deliver other things to other people. An organization may need to show a profit—enough to fund the organization’s overhead and to deliver returns to funders. An organization may need to be able to sell the system to potential customers. And an organization may need the project to position the organization for future work, based on improving the organization’s capability and maintaining its reputation.

Variations. There are many different kinds of organizations. These include:

A for-profit business, which looks for a financial return on its investment.
A non-profit organization, which looks to cover its costs and see a return in some other way (such as a community benefit).
A government, which looks to have some public need met.
A group or consortium of organizations, such as a primary contractor with subcontractors.

Relation to broader management. Obviously, most of an organization’s needs are handled not by systems building, but by the organization’s management and operations. The systems project supports these needs, however. The organization needs to be able to estimate the cost and time involved in a project in order to ensure that it has the funding needed to complete the project. The organization’s reputation depends in no small part on its ability to execute the systems-building project, so things that helps the project move ahead efficiently and smoothly will be good for the organization.

13.2.4 Funders

Funders provide the capital or other resources needed to build the system.

A funder has one primary need: the return on their investment. The return may be monetary (profit from sales of the system) or it may be more intangible (a business ecosystem, regional economic development).

Some funders will have secondary needs, such as enhancing their reputation and positioning themselves for funding future projects.

Variations. Funders can be external to the organization building the system, providing investment in the expectation of a monetary return. Venture capital funding is one example of this kind of funder.

The customer can be a funder when the customer pays for building the system. This can be a commercial customer funding the project through a contract. This can also be a government organization providing a development contract. The expected return in these cases is primarily the system itself, and secondarily less tangible benefits like the development of capacity to build similar systems.

A project can also use internal funding. This occurs when an organization has the capital to develop a system itself. The organization generally expects a return on its investment either by improving the organization’s own capabilities, such as by building a tool that helps the organization run better, or by providing a product that the organization can sell for a monetary return.

Relation to broader management. While the organization has the primary responsibility for working with funders, a systems-building project can help meet the funders’ needs by building the system efficiently, using the investment well, and by producing a good system, which helps ensure that the expected return will occur.

13.2.5 Regulators

Regulators in general are people or organizations independent from the team and project. The regulators provide an external check on organizations and products to ensure they meet safety and security regulations, or that they provide legally-required public value.

Regulators need compliance with regulation in the system and in the work the team does to build the system. The regulator may verify that regulations have been met by inspecting the final system or by auditing records of the system’s creation. The regulator may block a system’s deployment until the system can be certified as meeting these requirements, as happens with aircraft. Alternatively, the regulator may depend on the team to know and follow the regulations and only check the system’s compliance when something goes wrong. The US automotive industry is an example of this.

The systems-building process, at minimum, supports regulators’ needs by knowing and following the regulations. This often can involve dialog with the regulatory organization to ensure that the team has all the information it needs, and to ask for clarification or guidance when the team is unsure about the regulation. The team also needs to maintain records that can be checked to show how it has complied with regulations. When the system requires certification before being deployed, the team usually needs to engage with the regulators to ensure the process goes smoothly.

Variations. A government organization is the obvious regulator. They have the charter to look after the public interest, especially when a project has incentives that would work against that interest.

Industry organizations can act as de facto regulators. A group of companies can come together to set voluntary standards for the systems they make. The groups that standardize the Internet ! Unknown link ref or WiFi ! Unknown link ref for interoperability are examples. These organizations do not have authority to penalize systems that do not comply, but a system that does not is not allowed to claim compatibility.

Finally, there are non-governmental organizations that set safety or security standards, often for a particular industry. ISO, SAE, and others provide safety standards ! Unknown link ref and companies have grown up around them to help other organizations comply with the standards. These organizations also have no authority to penalize non-compliant systems directly, but compliance is usually evidence used to show that government regulations are met, or provide a defense against lawsuits.

13.3 Mapping needs to model

The previous section introduced a set of stakeholders that have an interest in how the project operates, and a summary of each of their needs. The next step is to work out how the model for performing the project can support meeting those needs (see the diagram above). This involves mapping the stakeholder needs to each of the parts of the model (artifacts, team, tools, plan).

I developed this detailed mapping. Appendix A reports the details of each stakeholder and their needs, along with the full derivation from needs to the requirements for the pieces of the system-building model. The mapping has the form of tables of requirements or objectives, with each stakeholder need mapped to one or more objectives for each part of the system-building model. The result is that every stakeholder need is either supported by aspects of the system-building model, or is explicitly labeled as the responsibility of others outside the system-building project. The derivation also shows that every objective listed for the system-building model is justified by helping meet some stakeholder need.

The remaining chapters of this part of the text explain what each part of the model should be or do. These chapters are based on the derivation in the appendix.

Chapter 14: Artifacts

19 January 2024

14.1 Purpose

Artifacts are all the things created in the process of making a system. It starts with records of the purpose of the system, and the requirements it must fulfill. It includes the implementation of the system ready to deploy—such as hardware inventory in a stock room and software ready for installation. The artifacts include everything in between, including design, source code, verification records, rationales for decisions, records of reviews and approvals, and many, many more. The artifacts also include information used by the team to help do its job, such as information about who is on the team, processes to follow, and how the team operates.

The objectives for artifacts are documented in Section A.3.1.

The artifacts have three functions: as deliverables, as communication, and as a record of the project for auditing.

As deliverables, the implementation artifacts are the actual system to be deployed. It should be possible to take a set of implementation artifacts, assemble them (following instructions that are themselves artifacts) and have a working instance of the system. These artifacts are joined by things like records of regulatory approval and information associated with serial numbers or versions showing the history of the specific artifacts deployed in the system.

Most of the artifacts, however, are for communication: between people working on one task and another, between the customer and system designers, between those who implement and those who verify. Sometimes those people are working concurrently, such as when two people design two components that are expected to work together. Sometimes the communication is between someone who specifies attributes for a part of the system and someone who implements that parts. The communication is also between someone who made a design decision and someone who, years later, must understand that decision in order to make changes to the system.

Audit is a special case of communication. It is between the project and someone outside who will be checking the project’s work. In many cases the external party will have an adversarial role, looking to find mistakes or violations. Regulators, for example, may look through records to check that the team has followed processes that meet regulatory requirements.

Note that there are many ways to achieve the objectives laid out in this chapter. Each project will need to determine how to handle its own artifacts. The specific solution will depend on the complexity of the project, the size of the team, and requirements from the organization or industry. The appropriate solution may change over time: as a team grows, it may need more formal mechanisms.

I have seen a range of working approaches for handling artifacts. Two projects kept track of planning information on designated whiteboards. Others maintained plans in project management tools. (The whiteboard approach had a problem: one time someone erased the board. Luckily there was a recent picture of its contents.)

I have also been on projects that had an overly complicated solution. One project was a joint venture between multiple companies on multiple continents. That project used multiple repository tools for different kinds of information. There was a process for proposing design and implementation changes, but no one knew quite what it was or how to follow it. After a few years that joint venture fell apart, in part because the teams could not figure out how to work together.

Whatever solution you adopt, it is important that it fit your project and team. It should be capable enough to manage the kinds of artifacts your team will use, and simple enough for the team to use.

The objectives in this chapter can help you work out what capabilities your solution should handle.

14.2 General principles

The artifacts are meant to be shared, at least within the team and sometimes to people outside. The people using these artifacts will come and go, so supporting people who will use them in the future is as important as sharing in the moment.

This leads to some general principles about artifacts.

People should be able to find the artifacts they need. An artifact is not useful it the people who need it don’t know it exists, or if they don’t know how to find it. The artifacts should be organized in some way that helps people find them.

“Finding” has multiple aspects. It can mean that when they know something exists, they can get to that artifact conveniently. It can mean that they know that a general kind of thing probably exists, and they need to be able to navigate through to the artifacts of that kind. They may not know what is out there, and need to be able to browse or discover artifacts in order to learn about the system. Or it might mean that they need to have confidence that they can itemize all of a certain kind of artifact, without missing any.

People should have confidence that they have found the correct artifact. In the worst case, someone will look for a particular thing and find three or four potentially-relevant artifacts. Which, if any, of those should they believe? What if they disagree with each other?

This principle generally means, first, that any particular piece of information or artifact should be in one place. There should not be two different artifacts that appear to be authoritative sources for the same piece of information. It also means, second, that when there are legitimately multiple versions of an artifact, those versions should be clearly identified and that a user should see consistent versions of different artifacts unless they take explicit actions to see different versions.

The artifacts should be maintained securely. The system that the customer will ultimately use is based on many artifacts that the project maintains. If someone subverts or damages some of those artifacts, the resulting system will be compromised. If someone destroys some of the artifacts, some of the team’s work will be lost.

This argues at minimum for maintaining the integrity of the artifacts, meaning that the artifacts or the collection of them cannot be modified in an unauthorized way. (Good practice is that any change to an artifact can be traced reliably to the person who made the change.)

Some of the artifacts may need to be kept confidential, if they contains secret information. Almost every project has some information to be kept confidential, at minimum as part of maintaining the integrity of artifacts. (Login credentials, for example.)

14.3 Kinds of artifacts

This section lists the kinds of artifacts that the analysis in Appendix A showed contribute to meeting stakeholder needs. The artifacts are listed in the order in that analysis.

14.3.1 Purpose and constraints

The artifacts should include clear documentation of the customer’s purpose for the system. Every feature of the system should derive, directly or indirectly, from this purpose. If that purpose is not written down, the team is unlikely to accurately design to meet those needs—and is likely to add features that the customer does not want (so-called “gold plating”). These artifacts should be visible to most of the team in order to guide them as they design, build, or verify the system.

The customer’s non-functional constraints should be included. This includes the safety, security, and reliability they expect.

Constraints from other stakeholders should also be documented. The organization may place constraints on the project, such as expected profitability. Regulators can place many constraints that must be met to license or certify the system.

The understanding of the purpose or constraints will change over time. A customer will find they have needs they did not initially realize, or they will discover that whatever purpose was agreed with the team is not quite what they meant. An organization or regulators may change their constraints as time goes by.

There should be an explicit record of the changes requested or identified. If a change is accepted—and the project may choose to reject some changes—then it should lead to a new version of the purpose and constraints. It should be possible to determine whether other artifacts, such as requirements or design, are consistent with a particular version of the purpose and constraints.

The specific kinds of artifacts include:

A record of the customer’s purpose for the system.
- Source material that shows how this purpose was extracted from unstructured interactions with the customer.
- A record of what the customer wants to use the system for.
A record of the customer’s non-functional requirements, such as safety, security, or reliability.
- A record of how these requirements were determined.
A record of the constraints that the organization places on the system.
A record of the constraints that regulations place on the system. (More of this in Section 14.3.7 below.)
Records of how this information has changed over time.
- Records of requests to change part of the purpose.
- Records of decisions to accept or reject the change.

14.3.2 Team information

Maintaining information about the team helps the team work together.

I worked on one project where the management did not want to put together an org chart or a list of team members. I ended up talking to the wrong person about a particular technical subject—that person was happy to talk about it, but it turned out they were not actually on the part of the team working in that area. Their opinions turned out not quite to agree with those of the person actually in charge, but I hadn’t been able to find the person I should have been talking to.

This kind of confusion is more common than people expect, and it results in people getting the wrong information, or in people not getting information they should.

Information about the team is only valuable if it is accurate, however. The team should have someone responsible for keeping it up to date—meaning that ideally updating the information is a normal part of the processes (! unknown reference chap-plan) for bringing in a new team member or changing assignments.

The specific kinds of artifacts that will help include:

A roster of the people on the team. Ideally this includes ways to contact them, especially if the team is distributed.
The organization of the team: how the team is grouped and who reports to whom. This can be important when there are conflicts within the team, and people need to escalate some issue.
The roles, responsibilities, and authority of each team member, and conversely the people who fill each role. It should be clear who needs to be consulted for a decision, and who has the authority to make the decision, as well as who does not have that authority.

14.3.3 System artifacts

These artifacts are the system that is being built—the majority of the work of a project.

The system artifacts include:

Concepts for the system, which derive from the purpose and constraints, and can be used to learn about potential design approaches for the system ! Unknown link ref.
Specifications and designs for the system and its components ! Unknown link ref.
Rationales of the design and the choices it reflects, which will aid future team members when they need to learn how the system works—perhaps to fix an error, or to change the system.
Analyses of designs to check whether they match specifications and the customer purpose.
The implementation of components.

The exact set of these system artifacts depends on the process and life cycle ! Unknown link ref that the project uses. If the life cycle has some review milestone that a part of the system is supposed to meet, then there may be documents or analyses specific to that review.

That said, good system building practice involves some core kinds of artifacts: specifications, designs, and implementation.

These artifacts should include some items that are more about the system building process than about the deliverable system itself. These include:

Records of the major technical uncertainties in the system. Technical questions will need to be worked through, and may require special work to find solutions. This work needs to be accounted for in project planning. Maintaining a list of questions also ensures no major open question is left unaddressed before the system is finished.[1]
Records of reviews and approvals. Depending on the process that a project uses, designs or implementations will need to be checked before they go from being proposed works in progress to an accepted part of the system. Keeping track of who reviewed the artifact, what their conclusions were, and who approved the artifact helps the team keep track of the status of each artifact.

How the team maintains these artifacts can vary widely. Many software efforts use version control systems, which maintain versioned software artifacts in a repository server. Many hardware design tools either provide their own versioning repository, or are designed to work with a separate repository system. For hardware artifacts—not their design—one must work out where to store and how to track each physical artifact.

14.3.4 Verification artifacts

Verification artifacts support verifying that the system (or components in it) meet their intended purpose and specification, and that they are free of errors.

These artifacts include:

Designs for how different parts of the system should be verified: test procedures, symbolic analysis, or reviews. For tests, this includes how a test should work.
Test implementations. If software is involved, this can be software test cases. For hardware, this can include lab equipment setup, test jigs, as well as written instructions on how to perform a test.
Results of performing verifications. Note that results apply to a specific version of artifact that has been checked, and so the records need to maintain information that lets people see whether some specific version has been checked or not.

These constitute both a record of what parts of the system have been checked and found to meet their verification criteria.

Verification should be repeatable. The artifacts maintained for doing verification checks should be complete enough that different people can perform the checks in the same way. The instructions for performing checks should be clear. The test equipment should be maintained and people should have instructions on how to use it. Software test environments should be controlled so that when a test is run twice, it is in the same environment both times.

The verification results are generated by people performing checks, and used by people reviewing part of the system to ensure it has been checked before it is accepted as working. They may also be audited by regulators or other outsiders who will be checking whether the project has built the system properly.

14.3.5 Release, manufacture, and deployment

Releasing and deploying a system are complementary steps. Releasing involves taking implementation artifacts and making them available for manufacture or distribution.[2] Manufacturing the system follows if needed—involving producing and assembling hardware, or packaging software into a deployable form. Deployment takes the manufactured system and sets it up for a customer to operate.

The artifacts should include the procedures used to release, manufacture, and deploy the system. The release procedures define the sequence of steps involved in taking a version of the implementation artifacts, checking that they have been verified and meet the intent of a release (such as the features implemented or bugs fixed), and placing copies of those artifacts in a separate area as a release. The manufacturing procedures define how to take those released artifacts and manufacture products that are ready for deployment: assembling hardware according to a released hardware design, for example, and giving them serial numbers. The deployment procedures tell how to take those manufactured artifacts and install them so that they are a working customer system.

There are different variations on this flow of operations depending on whether one is releasing and deploying a whole system or an update, whether the artifacts are electronic (software or data) or physical (hardware components), and whether the system will be mass produced or not.

Hardware components will generally start with a release of a hardware design. That hardware design is the basis for manufacturing instances of the component. Whether it is a single unit made in house or many units produced in a dedicated facility, the manufacturing procedure determines how the hardware products are made. Before finishing manufacture, hardware components are typically given an identity, often recorded as a serial number, that identifies the specific component instance and associates it with records like which design release version was used, what subcomponent parts were used, date of manufacture, and so on. Then the part is placed in inventory from which it can be deployed.

Software components most often follow a different path. Being electronic information rather than physical, there is no “manufacturing” step. The release procedure gathers implemented software and creates a deployable package from it. The manufacturing procedure gives the package an identity (a release number) and signs it or otherwise sets up security protections. It can then be copied to a server that makes it available for distribution and deployment.

Deployment procedures take hardware from inventory and software from a distribution server and puts it into use for a customer. This could be as simple as letting customers know that a software update is available for download. It involve moving a number of physical components to a customer site, setting them up, and performing deployment checks to ensure that the installed system is working. It could be as complex as delivering a spacecraft to its launch provider, preparing it for launch, and having the spacecraft start up on orbit.

The whole process of producing deployed systems often generates a lot of records. Hardware devices have associated records about what specific design was used, what subcomponents were used, when and were it was manufactured, and then accumulate service records: when deployed, what defects were reported, what repairs made, how the device was disposed at end of life. Software has similar records: the identity of the software image, the versions it contains, how it was built, when it was made available for deployment, where it has been deployed, and its service history.

14.3.6 Project operations

Artifacts that support operations can be broken down in the same way that operations itself is (Section 6.3.5).

The project’s life cycle and procedures can be maintained in simple documents. Because these documents serve as a reference for team members, it is important that people be able to find easily the parts of the documentation they need for a particular situation: for example, if someone is setting up a design review for a particular component, they need to find the procedure for design reviews. The documents also need to support people reading through the life cycle or procedures to learn how the project operates in general. Having a good table of contents or index and accurate summaries can help them understand the breadth of operations before they need to learn about some specific procedure.

I have worked on several projects—especially including NASA projects—that develop complex “management plans” and “systems engineering management plans”. I have found that few people in the team actually use these documents. The management plans often follow a template that speaks to the team’s aspirations (“the team will do X”) but does not lay out the actual procedures (“do X by doing Y and Z”). The information in these plans is also often organized for a management reviewer, rather than for the people who need to follow the procedures. As a result, the documents sit unread after being approved and the team operates on shared lore about how to do one task or another, and the plans become increasingly out of date as the team’s practice diverges from the original intent.

Instead, the life cycle and procedure documentation should:

Be organized to be used primarily by the team members who will do the work.
Give clear and specific directions for procedures that a team member can follow to get a task done.
Be regularly updated as the project’s procedures change.

Beyond the life cycle and procedures, planning and tasking activities involve creating and maintaining records. These artifacts are often maintained using specialized tools, such as project planning tools and task management (or issue management) systems.

Operations also maintains records of supporting information, such as budgets, risk registers, and lists of technical uncertainty.

14.3.7 Regulatory artifacts

Working with regulators typically involves a lot of records. The team uses some of these to guide how it builds the system. Other records form a legally-binding record of what the project has done and how the team has interacted with the regulators.

First, the artifacts should include records of the regulations that the project must comply with. This might be as simple as references to publicly-available reference sources (such as web sites that make current government regulations available). It may also include documents that explain what these regulations mean. This information is only of value if it is accurate; this means it must be kept up to date as regulations change. (In some fields, it is worthwhile having someone who tracks likely upcoming regulatory changes so that the team can anticipate those as well as working to current regulations.)

The artifacts should also include records of the processes that the team needs to follow working with the regulators. For example, if the system must obtain a license before being deployed for use, then there will be a process for applying for that license. Again, this information must be kept up to date to be useful. The processes are often difficult to find or interpret, so it is helpful to maintain documents that explain the process as well as just a record of the process.

Second, systems that need licenses or certification will require applications to regulators. The application information should be maintained, including copies of any application forms (with dates!) and any supporting documents generated as part of putting the application together. For example, I helped one team apply for a license to operate a small spacecraft in low Earth orbit. The license application included an orbital debris assessment report, which was sent to the regulator as part of the application packet. The assessment report included information generated by a debris assessment tool [NASA19]. The database used by the assessment tool was an artifact to be maintained, along with the report itself.

Correspondence with regard to the applications also needs to be maintained. This should include any information that shows how the team took steps to follow the application processes.

Next, the project must keep records of licenses or certificates that have been issued.

Finally, the project will need to maintain evidence that the system it has built complies with regulation, whether a license application is involved or not. These take the form of a mapping from a table of regulatory requirements to the evidence of compliance with each of the requirements. The evidence can be complex: for example, showing that the probability of a particular hazard occurring being below a mandatory threshold.

14.4 Managing artifacts

Artifacts are the result of the team’s work, and thus they carry value to the team and its customers. They represent the system being built. They are used continuously to inform and manage the team. They are often used long after they are created, to audit the work and to guide modifications to the system.

The artifacts change over the duration of the project. An early design draft gets revised into a version used to build the corresponding component. Later, the design is revised for a second-generation component.

These conditions lead to three general principles for managing artifacts: security to protect integrity, organization so people can find the artifacts, and change management.

14.4.1 Security

The artifacts need to be managed in a way that preserves their value by maintaining their integrity. Losing or damaging an artifact results in a loss that could be anything from annoying (losing minutes from a status meeting) to fatal to the project (damaged implementation of a critical component). The artifacts should be protected against both accidental loss, such as a server breaking, and malicious loss. For data artifacts, this means using resilient storage systems with good cybersecurity. For physical artifacts, it means storing artifacts in storerooms that maintain a benign environment and that provide physical security.

Access to the artifacts should be limited to authorized people using access control mechanisms. These mechanisms reduce the risk of malicious damage by limiting who can get to the artifacts. For artifacts that need to be kept confidential, limiting access helps reduce knowledge leaking to unauthorized people.

14.4.2 Organization

A random jumble of artifacts is of little use to people on the team. The team members need for the artifacts to be organized in a way that allows them to find the ones they need accurately and quickly.

There are two kinds of “finding” that team members will do.

In the simple case, they will know what they need: the design document for some component, or the risks associated with the project, or widget serial number X. To find something specific, they need to know where to find artifacts and how those artifacts are organized. They can use that organization to get to the specific one.

The other case is when someone knows they have a need but does not know exactly what they are looking for. This might be someone who has recently joined the project, or someone who is working in an area they aren’t familiar with. These people will need to be able to see and learn how the artifacts are organized, and will need a guide to help them understand what is available.

Finally, there should be one logical place for each artifact, and artifacts should not be duplicated. (There might be copies for redundancy, but the people looking for one artifact should see those copies as if they were one thing.) Two people looking for the same information should not end up finding two different artifacts that cover the same topic and that have diverged from each other. This leads to people building incompatible components, sometimes in ways that are hard to detect but that lead to errors in the system.

14.4.3 Change management

As I have noted, artifacts change regularly over the course of a project. However artifacts are managed, they need to account for the effects of these changes.

Some artifacts, like records of task assignments and progress, change often but at any given time there is only one accurate copy of the information.

Most system artifacts, on the other hand, evolve in more complex ways. At any given time there may be multiple versions that are works in progress—containing incomplete changes that their creators don’t believe are ready to be used by others. Some of those in-progress versions may develop to become accepted versions, ready for others to use: a design that is ready to be implemented, or an implementation ready for integration testing. A version that has been used like this may later become obsolete as an updated version comes along.

This pattern of change calls for supporting versioning on this kind of artifact. Versioning means that one can find multiple versions of the artifact, and each artifact has an identifiable status so that someone can know whether they should be using that version to build other artifacts, or just looking at the version to understand it.

The dependencies of one artifact on another, such as a design leading to an implementation, and and implementation leading to verification test results, means that mutually consistent versioning is also important. When looking at an overall version of the system, it should be clear that (for example) the design for component X has been updated, the implementation for that component is in progress of being updated to match the design, and any verification results are from an older implementation that may no longer be accurate.

Most project life cycles and procedures define different statuses that an artifact version can have, along with procedures for how that version can change status. While the details differ, the statuses generally include some sort of work in progress, proposed, approved (or baselined), and superseded. The procedures generally say what has to happen for a version to move from one status to another, such as defining that a proposed design needs a review and approval step to be accepted as a baseline.

14.4.4 Implementing artifact management

There are many tools and processes in use today for managing artifacts. At the time of writing, no one tool works well for all kinds of artifacts, and so a project must stitch together its approach to managing artifacts out of multiple different tools.

Electronic artifacts. Software development uses version control systems to manage electronic files. There are many such systems, all of which provide a storage repository with a few common features:

Maintaining download a consistent view of all the files being managed.
Ability to change files locally, then commit them to the repository.
The ability to define branches of the repository, which store separate, internally consistent revisions of all the files. One branch is often used as the team’s shared working view of the files, while people create other branches to isolate works in progress.
The ability to merge revisions from one branch to another, so that updates from a working version can be combined into the team’s shared version.

Other industries use document control systems to manage collections of electronic files. These systems also provide a repository for a collection of files, but the generally focus on the management of documents rather than just versioning. They commonly include features like:

Support for review and approval processes.
Sophisticated search, using organized metadata or document analysis.
Strong access control and permissions on individual documents.

In addition, tools such as CAD systems or requirements management often include versioning and workflow features. These tools support creating different versions of an artifact, and defining a workflow for the procedure to be followed for approving a version as a baseline.

In practice the tools for managing artifacts do not often work together, requiring a project to (for example) select one tool for managing software artifacts, one for CAD system artifacts, another for structured systems engineering artifacts (such as requirements or specifications), and another for documents that do not fit neatly into these other categories.

Hardware artifacts. Many projects will create physical artifacts—mechanical components, electronic boards, manufacturing jigs, and testing equipment. These physical components need:

A place where they can be stored until needed. This might be a stock room for some components, or lab space for test equipment.
Information about the specific artifact, perhaps using a serial number to associate the information. This might include a part number, version information, and usage or manufacturing history.
An inventory of the artifacts, including where each one can be found.

[1] “I kept a seven-by-ten-inch black notebook divided into six sections, as follows: (1) Schedule, (2) Systems Briefings, (3) Experiments, (4) Flight Plan, (5) Miscellaneous, and (6) Open Items. Section 6 meant problems of which I became aware as we went along, and which were duly listed by number. As long as they remained unsolved, or open, I reviewed them periodically and bugged the appropriate people for solutions. As they were solved, they were closed, and I drew a line through that number. By the morning of launch, I had 138 items, and all 138 had been crossed out. If this process was a bit scary and time-consuming, it was also immensely satisfying. It was going to be one hell of a flight, if only I could figure out… Whip out the notebook and write it down before I forgot it.” —Michael Collins, writing about the preparation for the Gemini 10 flight [Collins74].

[2] The term “release” has different meanings in different contexts. The term here could be taken as “release to manufacturing” in those situations where “release” requires qualification.

Chapter 15: Tools

29 January 2024

15.1 Purpose

Tools are things that people use while designing and building the system. The tools are not part of the system itself; they are not delivered to an end user. Their purpose is to help the team do their job. Each project will have its own needs for tools, so this list is meant to inspire ideas rather than prescribe what may be needed for building any specific system. There are, however, some common principles for selecting and managing tools.

This chapter brings together information about many different kinds of tools, with references to the other parts of this volume that discuss details.

Please note: I do not recommend specific tools.

15.2 General considerations

There are a few general principles that apply to selecting tools generally.

First, most tools will be used for shared work. Tools should be evaluated on how well they help the team work together. Computer-based tools that manipulate shared data, such as CAD tools, should make it easy for multiple people to access the information concurrently. They should support the project’s approach to versioning information ! Unknown link ref. Physical tools should be accessible to those who need to use them. This is especially important to consider if people work in multiple physical locations.

Second, many tools require training to be used effectively and safely. The project must ensure that each person has been trained to use a tool safely before they are allowed to use it. That implies that tools should be evaluated on the quality of educational material available on how to use them.

Third, good tools are integrated so that they work together. Tools that can share information can provide greater value to the team than ones that cannot.

Next, tools should support the general life cycle and procedures the project uses. They should fit into the project’s procedures for managing artifacts, versioning them, and reviewing them.

Finally, tools should be secure. Good tools will support the project’s overall approach to security, including controlling access to information based on a person’s role in the project. This includes both electronic and physical security.

15.3 Kinds of tools

This section provides an overview of all the kinds of tools discussed elsewhere in this volume, with references to the sections that provide details. The overview can serve as a checklist for a team working out what tools they need.

15.3.1 Storing and managing artifacts

The tools for storing and managing artifacts are discussed in Section 14.4.4.

Electronic artifacts. Alternatives include:

Version control systems, typically used in software development systems.
Document control systems, used to manage general documents, typically including workflow and access control features.
Document control systems built into discipline-specific tools, such as CAD systems.

Hardware artifacts. These can use:

Storage facilities and stock rooms.
Inventory management systems.

15.3.2 Specification tools

As I will discuss in Part VI, the team will develop specifications for system components. A specification defines a component’s external interfaces—in systems terms, how the component is part of functional and non-functional relationships (Section 10.2).

There are several kinds of specifications (Section 21.4), including requirements, interface definitions, and models.

Requirements (Chapter 22). Requirements provide textual statements of things that are to be true about a system or component. Requirements can be managed using:

Spreadsheets, which are easy to use but provide little or no automation.
Dedicated requirements management tools, which directly organize requirements and the traces between them.
UML/SysML requirements tools that use the UML diagram standard to represent requirements and the connections between them.

I list a number of considerations for selecting requirements management tools in Section 22.13.

Interface definitions (! Unknown link ref). Interface definitions specify how one component can interact with others. These can be written using:

Textual Interface Control Documents, often using a template for each document. These provide flexibility but little support for navigation and automation.
Interface definition languages that are part of software or communication development tools.
Interface definitions based on industry-specific standards. For example, the SAE J1939 standard defines messages that on-board automotive components can exchange, and the standard provides message definitions in a spreadsheet.
Project-specific interface definition tools. Several projects I have worked on have developed their own tools, which integrated with other tools the projects used.

Models (! Unknown link ref). Mechanical, mathematical, electronic, behavioral, and other kinds of models are used as specifications. Relevant tools include:

Mechanical and electronic CAD systems.
UML and SysML tools for modeling behaviors.
Symbolic math packages and notebooks or workbooks for documenting and evaluating mathematical models.

15.3.3 Design tools

A project’s design phase works out a set of designs for the system and its components that satisfy the corresponding specifications (! Unknown link ref).

A design records the structure of each component—whether a high-level, composite component or a low-level component (Chapter 9). It also records analyses that lead to designs and rationales for how a design ended up as it did.

There are two kinds of design artifacts: the breakdown structure and the designs themselves. The model in Section 9.4 has six parts to a component design: form, state, actions or behaviors, interfaces, non-functional properties, and environment.

Breakdown structure. I recommend that the component designs be organized by the component breakdown structure (! Unknown link ref). This structure organizes the components into a hierarchical name space, giving each one a unique identifier and showing how one component is made out of others.

On most projects, I have used a spreadsheet to list all the components, the breakdown organization, and their names. This has worked well enough, and I am not aware of tools that explicitly support such organization.

Form (! Unknown link ref). The form represents the aspects of a component that do not change, or only very slowly. The design of physical components is generally handled using CAD tools. These tools use notations or drawing standards appropriate to each subject.

Mechanical designs using CAD systems.
Electronic board designs using CAD systems.
Electronic block diagrams and circuit diagrams using CAD systems.
Occasional designs that are not supported by CAD systems and are drawn by hand.

State, actions, behaviors (! Unknown link ref). This part of a design addresses the parts of a component that change readily.

UML and SysML tools provide diagramming notations for representing state machines and behavioral diagrams.
More sophisticated state and behavioral specifications, such as I/O Automata [Lynch89], typically are expressed in specialized, non-graphical languages.
Additional state information is typically recorded in text. This allows for state information that is not readily expressed in the formal notations a project uses.

Non-functional properties (! Unknown link ref). These properties change slowly and are not part of the component’s form.

Allocatable resources (or budgeted items) that are part of a component, such as mass, area, or energy, are often recorded in spreadsheets. A few system design tools can track how these resources have been allocated.
Most other properties, such as reliability measures, are recorded in text.

Environment. This is the environment in which the component is expected to operate, or in which it may be stored. This is usually recorded in text.

15.3.4 Analysis tools.

These tools help the design process by providing feedback on how well a particular design works. They also are used when verifying a proposed design.

Tools for thermal analysis, computational fluid dynamics, electromagnetics, mechanical structures, and electronic circuits. These tools are often integrated with CAD tools.
Software analysis tools that can evaluate designs for problems with complexity or security.
Theorem-proving tools that can evaluate system designs formally for safety or security properties.
Tools that coordinate review processes. Some of these integrate with CAD or software development tools, facilitating people performing manual analyses of a design.

15.3.5 Build tools.

These tools help translate designs or implementations into operable components that can be integrated into a running system, or used for testing.

The built artifacts will need to be stored and tracked, as discussed above .

Physical artifacts. The building of physical artifacts is, in effect, manufacturing one or a small number of those artifacts. These can be built in multiple ways.

Contract manufacturing provides a building service. Once a contract is negotiated, it takes as input a design for the component to be built and delivers a physical component built to that design. Contract management tools and procedures can help a team work with a contract provider.
Additive (3d printing) and subtractive (milling) manufacturing tools can take in CAD designs and produce a physical product. The physical manufacturing systems usually have associated software systems that translate CAD drawings into the forms needed by the manufacturing tool.
Manual physical build tools. This includes a workshop of tools to shape materials and join parts. It also includes guides, forms, or jigs used make particular parts.

In-house building will require maintaining a stock of the materials used in the components. This may include a stockroom of pre-acquired parts, such as metal or plastic stock and fasteners, or suppliers that can provide the needed material quickly.

The building process should be deterministic: if the team builds multiple instances of the same component, the components should all look and behave the same way. This places constraints on whatever tools and procedures are used to build the components.

Software artifacts. Software artifacts are built by translating source code into binary and packaging it in forms that can be installed on a target system.

Compilers and linkers perform the translation into binary forms
A build environment provides a server, perhaps virtual, for running the software build tools.
Release management tools package the binary software for distribution, and then typically signing the packages and copying them to a distribution server.

The software build process must be repeatable: if the same software is built twice, the result should be identical in behavior (differing only in things like version numbers, timestamps, or affected signatures). This usually means that the software build tools should be under configuration management so that identical tools will be used each time.

15.3.6 Testing tools

Testing involves taking a component, or collection of components, and subjecting it to some sequence of activities to verify that the component behaves as specified.

Testing occurs at two different times during system development: as people are building parts of the system and when a component or the system is being verified for final acceptance. These two uses lead to somewhat different needs in the tools for testing.

Tests need to be accurately reproducible: someone should be able to run a test one time on one component, then run the same test later on the same component and get the same result. Of course some component behaviors are not fully deterministic, but accounting for that, one should be able to count on passing a test meaning that the component really does meet the specification being tested. If a test fails, people need to be able to reproduce what happened to understand the flaw and to determine whether a fix works.

Reproducibility places constraints on testing tools. Physical tests will need to be done in consistent environments, using control and measurement tools that can be calibrated to ensure they are behaving consistently. Software tests similarly need to be run in controlled environments.

Hardware testing. Testing hardware components can range from measurements of single components to integration tests of subsystems or even the complete system. The tools available vary widely, depending on the kind of testing being done.

All hardware testing will involve:

Lab space where testing can be performed safely. Lab space is often a limited shared resource, so lab space scheduling tools can be helpful.
Inventory management tools that will allow testers to identify which components are to be used for tests.
Safety equipment to contain the hardware under test. This can range from safety goggles, to greater personal protective equipment, to containment enclosures or hoods.
Tools to help design test sequences.
Tools to help people follow a testing procedure correctly, including checking off completed steps and recording information at designated points in the test sequence.
Tools to record the test results.

Tools that support testing electronic components can include:

Power supplies, possibly with remote control to turn power circuits on and off.
Input or signal generators. In some cases these can be sophisticated software-based systems that generate input signals following the rules of a test scenario.
Tools to monitor the component’s internal state and output, such as test points or ports. This information might be shown live to a tester, or recorded for later analysis.

Tools for testing mechanical components include:

Tools for creating the testing environment, such as a shake table, vacuum chamber, wind tunnel, or Faraday cage.
Supplies of consumable materials used in tests, such as gases or liquids.
Test controllers that can take a scenario specification or procedure and control the testing environment and system under test accordingly. This may be a set of controls that a person following a script can manipulate or a complex control system that automatically manages the test components.
Measurement and recording tools to determine what happened during the test. These range from recording data taken from the system to video and audio records of the test.

Integrated system testing can go well beyond the tools listed here. Flight testing a new aircraft, for example, is far more complex than suggested by these tools. I leave the design and operation of such testing to others better versed in it.

Software testing. Software testing generally involves setting up a number of test cases or scenarios, running the software being tested, and recording the results. There are many different tools that can be used, and these depend on the kind of test being performed and the environment or language being used.

Categories of tools include:

Languages or applications for defining test cases, including the software configuration to be tested, the inputs and events applied, and the conditions that qualify as correct.
Tools for simulating larger systems that are not being tested. These range from simple component mockups or scaffolds to simulated environments. Integration testing can go as far as a software-in-the-loop simulation environment.
Tools to run software tests, both on-demand (during development) and in batches (during continuous testing and during acceptance verification).
Debugging and tracing tools that can gather information about what occurred during a test and help a person understand why a test failed.
Tools that collect and manage results.

15.4 Managing tools

Good tools can enhance a team’s performance. Poorly chosen or implemented tools can harm it. One must choose tools carefully and apply thought to how they are implemented and used.

A project’s tools are themselves systems, and should be treated with the same care as the system being built for a customer.

Each tool should have a purpose. Spending the time to work out who will benefit from a particular tool, both directly and indirectly, can provide useful guidance when choosing between options for that tool.

The engineering support tool industry has generated many products that can be used, meaning there are often many possibilities to choose from. While sometimes the team can cut a decision process short because they already have experience with one particular tool, in the other cases it is worth setting out some criteria for making the choice.

Factors that can influence the choice of tool include:

What is the specific need that the team has? How well does a tool meet that need?
Does the tool need to be shared among multiple people? If so, how does that work? Software systems are often deployed with a limited number of licenses, limiting who can use them concurrently. Physical tools have inherent limitations, and one may need to acquire enough of them to meet need.
How well does the tool integrate with other tools and the procedures the project uses? If the tool manipulates artifacts that need to be under configuration management, does that tool support that?
Does the tool implement features that conflict with those of other tools, or that conflict with the procedures the team uses? I have worked on several projects that adopted a tool for a specific purpose (requirements management and mechanical CAD, for example), and those tools had internal review and approval workflows and configuration management. These internal features made it harder for the team to follow a project-wide review and approval workflow that used information managed by multiple separate tools.
Are there safety or security needs that the tool should meet? If so, how well does it do so? And do the safety or security features integrate with the project’s safety and security measures?

Once a tool has been chosen, it will need to be purchased or built, and deployed for the team to use. This usually requires finding space for the tool, whether that is physical space in a lab or capacity on a compute server. The acquired tool will need to be deployed and integrated into the project’s systems: adding information about the tool to an inventory database, setting up a service schedule if needed, integrating software systems with the project’s security mechanisms.

Team members will need to learn how to use new tools. For some tools, this can amount to providing a written introduction or presentation on how the tool works. More complex tools will require more formal training. If there are safety or security risks in using the tool, the project should ensure that people are required to receive training before using the tool. It is common to track formally which people have gotten this kind of safety training.

Chapter 16: Teams

29 March 2024

Building a complex system involves a team of people to do the work. The people in the team will fill many different roles: developers, managers, customer and regulatory interfaces, support staff, among others.

In this chapter I discuss the issues to be addressed when deciding how a team should be organized, including its structure, roles, and communication.

16.1 Purpose

When many hands do the work, the team needs to be organized so that the work is coordinated. Each person needs to know their responsibilities, and how to find each other person they may need to interact with. In general the team needs clarity about each person’s responsibilities, about communications within the team, and about who is on the team. (See Section A.3.2 for details.)

The work must be coordinated so that different pieces of work are compatible, that all the pieces of the system are built, and that the work is done efficiently. For this to happen, people on the team will need to communicate with each other—and that means they need to know who they should be talking with. They need to know who is responsible to work on which pieces, so that they do not duplicate work. And when something goes wrong, they need to know who to work with to find a solution.

The ability to share work is key to a project being able to scale up to build a complex system. The people on the team need to be able to trust that others will follow the same rules they do: that they will share important information, that they will consult when needed, that they will limit their decisions to their scope of authority. When a team has this kind of trust, each person can do their portion of the work with confidence that the others are doing their own parts as well.

Sidebar: Delegation and micromanagement

Projects involving many people require sharing work. If someone doesn’t share work, then they will be overwhelmed, will take too long to get work done, and will be a single point failure in the project.

Delegating or sharing work implies a dynamic between the two people involved. Person A delegating the work defines the work that Person B, the delegatee, is to do. Person B does the work and periodically gives progress updates. Once the work is delegated, Person B can proceed independently and Person A can turn their attention to other things.

One way this can go wrong is if Person A doesn’t let Person B get on with the work independently, and instead tries to micromanage the work. Learning the habit of managing loosely takes time and effort—but it requires trust between the two people involved. That trust in turn depends on Person A having confidence that Person B will follow shared norms doing the work.

Another way this can go wrong is if Person B isn’t able to complete the work independently. If Person B finds a problem with the work, such as design error, that is beyond their scope, they can raise the issue to Person A and jointly resolve the problem. If Person B is unable to do the work, perhaps because they don’t understand the problem or find they lack a necessary skill, they can raise the issue and jointly handle the problem. If Person B tries to muddle through, however, they stand a good chance of not doing the work needed, leading to Person A needing to check their work in detail and possible redo the work.

In other words, sharing work requires having clear expectations of how to define delegated work and when to raise exceptions.

16.2 Directory

Two of the first things people on the team need to know is their own role and who else is on the team. Once they have that information, they can communicate with others to learn other things they need to know.

Consider the following scenarios.

Person A is working on some component. That component has an interface with another component, and so person A needs to coordinate how they implement their part of that interface with someone working on the other component.
Person B has finished a design for an update to a component. Project procedures say that they need to have the design reviewed and approved before moving on to implementing the design. Person B needs to find out who the reviewers and approver will be.
Person C discovers an ambiguity in the specification for a component, and they are concerned that this ambiguity may lead to a flaw in the designs that follow from the specification. Person C needs to find the people responsible for the specification so they can discuss the potential problem and find a resolution to the ambiguity.

For all these scenarios, the people need to determine who on the team is responsible for some part of the system beyond what they are working on themselves.

To meet this need, the project should maintain some kind of directory of people on the team. This should record:

Who each of the people are;
Where they are located or how they can be contacted;
The roles and authority each one fills, including what parts of the system them work on; and
How they fit into the team’s structure.

This information is generally fairly simple, but it must be kept current. If people come to believe that the directory is likely out of date they will not trust it.

16.3 Structure

A team of more than perhaps three or four is not an amorphous blob of anonymous people; it is organized so that each person has their own specialized roles, and authority is not duplicated. The team’s structure is the pattern of this organization.

The structure may arise spontaneously or deliberately, but teams that are able to deliver complex systems will have some degree of organization. A team grows because it makes use of specialized skills or because it has work that should be done in parallel. In both cases, avoiding duplicated work matters. If one person has a needed specialization, then they should be the ones doing the work that uses that skill. If the project needs parallelism, doing duplicate work fails to meet that need. Further, duplicating work often results in team members who believe their work is not valued, when that work duplicates something another person is doing.

At the same time, complex systems require people to collaborate. Sometimes one component will need people with multiple skills to get it designed and built.

A few aspects of team structure address part of these dynamics of collaboration and avoiding duplication: how people are grouped into sub-teams, and how decision authority—for both technical and management decisions—is distributed in the team.

Groups of people within the team will work closely together when they are working on closely-related components, or on different aspects of one component. Sometimes these groups form ad hoc, when the people involved find that their work is interdependent. Other times this grouping is worth establishing formally and maintaining for an extended time. This might happen when all the people working in one discipline, such as contract management or electronics design, are organized so that they share skills across the work for different components. This might also happen when the people who are working on one high-level component (that is, subsystem) work together to maintain the consistency of all the pieces that make up the high-level component. Note that one person might be part of more than one group.

Each group should have some reason to exist. People on the project should know how the groups are organized and the purposes for each. Generally speaking, each technical part of the system should be associated with exactly one group. People in the project should know who to talk to about any part of the system, and they should not get conflicting information about who to talk to or how some part of the system works.

Technical decision authority defines who has the final responsibility for ensuring that the design and implementation of some part of the system meets its specifications and objectives, including safety and quality of work specifications. The person who has that responsibility must have the corresponding authority to approve the design or implementation, based on verification checks. The verification checks should provide an independent view of the work that will catch errors that the designer or implementer could not see because they were directly involved in the work. The approver may delegate some or all of the reviewing and decision, but in the end one person must be responsible. (See sidebar below.)

There are several ways that technical decision authority can be assigned, as I discuss in ! Unknown link ref.

Management decision authority defines who makes decisions about work assignments and about resolving conflicts within the team. While this is largely a matter of project management, the design of the team’s structure affects how people will resolve management issues. This can include making decisions about who will be part of which sub-teams, and about staffing in general. Perhaps most importantly, the people with management decision authority also have responsibilities when conflicts or problems are reported.

Both technical and management authority are generally hierarchical. For most projects, there is one person or small group of people who have overall responsibility for the entire project. The authority that others have derives from delegation from this top-level authority. A project can choose different kinds of hierarchy, with deep or shallow chains of authority.

There are many ways projects can organize their teams. I discuss some of these ways in ! Unknown link ref. Whatever approach one chooses for a project, that approach should be evaluated against these needs.

Sidebar: Team structure and system structure

It is generally understood that the structure of a system is homomorphic to the structure of the organization that is building the system [Conway68]. This means that people must work to ensure that the structure of the team and the structure of the system are compatible, possibly by organizing the team around the system structure when possible. Doing so requires having an understanding of what the system structure is, and the hierarchical component breakdown ! Unknown link ref provides part of that understanding. In the other direction, the team’s organization will inevitably bias how the system is organized and built; being aware of the two organizations helps one to see unhelpful bias reflected in the system organization.

16.4 Communication

Following the model in Chapter 10, the component parts of the system are interconnected. When one person works on one component that has a relationship with another component, needs to ensure that the related components have compatible designs and implementations. Doing so means that the people working on each component need to communicate with each other.

People communicate when they want or need to. Creating an environment and procedures that help them realize when communication is needed is part of the art of organizing a team.

To design the procedures and team structure, one needs answers to two questions: when do people need to communicate, and with whom should they communicate?

I identified some scenarios above for events that create a need for people to communicate. There are, of course, a great many other cases, but these give an idea of the breadth of events that define when people will need to communicate.

There are four general times when people will need to communicate:

When they are looking for information that another person may have. For example, when someone finds they need to know how some component is going to behave.
When they have information that will affect someone else’s work. For example, when one person decides on a component design, and that component interacts with another component.
When they need a decision or action. For example, when someone has completed a proposed design and procedures indicate that the design should be reviewed and approved before moving to implementation, or when someone has a team problem that needs to be resolved at a higher level.
When a decision or action has results. For example, when reviews are done, or when action is being taken on a team problem.

Some of these times can be encoded in procedures that the team will be following ! Unknown link ref. Many others will occur in the moment, when someone realizes they need to know something or need to ensure that someone else knows something.

When a person finds that they do have a need to communicate, they then need to figure out who to communicate with. If they are looking for information about a part of the system, they should be able to use directory information (Section 16.2 above) to determine who should know about that part. If they need to push out technical information that affects other parts of the system, they can use the functional relations in the model (Section 10.2) to determine the affected parts, then use team directory information to find out who to talk with for each of those parts. If someone is asking for an action to be done, procedures can indicate the responsible role, and team information will direct them to the people filling that role.

16.5 Team organization and size

A team’s organization generally starts small and informal, as a very small group starting to investigate a customer’s need or a potential system project. As the project moves forward, the team grows and its needs for structure change. The team also changes as people join and leave, and as people move from role to role.

I have found that most teams go through phases as they grow—rather than showing smooth changes over time. These changes arise from the combination of complexity growth, development of group relationships, and the growth in understanding of the work ahead.

Small groups (of just a few people) have been observed to go through a development sequence [Tuckman65][Tuckman77]. These small groups begin as the group forms, and the people work out how they should relate to each other and how to get work done. As time goes by they develop into a cohesive group that gets work done and where people trust each other. (The studies do not discuss how this process can fail, leading to a group that does not cohere or disbands.)

The interpersonal complexity of a team grows with the size of the team. The number of potential connections between team members is $O(n^{2})$ in the size of the team. In my anecdotal experience, the amount of time spent on coordinating work within the team grows in line with the number of connections. If there is no structure to the team, at some point the amount of time and effort spent on communication will exceed the amount spent doing work building the system.

When a project starts, the nature of the system to be built is not well understood. The team has to go through a process of working out the purpose of the system, developing concepts, and eventually beginning to design. Along the way, the team gets increasing understanding of the work ahead.

In practice, the combination of these causes leads the team to change its organization over time. At the beginning, the initial exploration of what the project might be (working a purpose and finding some initial concepts) is typically a small group. This small group will go through a process of learning to work together, but typically the group can self-organize and does not need hierarchy for much. As the work progresses and a few more people join the project, they will initially try to fit into the self-organized small group. These additions will alter the interpersonal relationships, but at some point the complexity of using consensus will necessitate creating some initial structure. The team will settle into this structure. But as the team continues to grow, it will initially accommodate people into the structure but eventually reach another point where more structure is needed to manage complexity.

The message is that a project should expect its team organization to change over time. Almost every project I have been part of has been resistant to addressing a need for changing team structure, and has put off dealing with it until a crisis occurs. In every case this cost the organization time and money, needlessly setting back the project. A project’s leadership should be alert to the need to periodically reorganize the team so that this can be done before it causes problems.

Sidebar: Unitary decision authority

I worked on two projects that had problems building their systems because someone on the team got conflicting instructions on the objectives for some component they were supposed to be building.

In one case, a software developer was tasked with implementing a particular CPU scheduling algorithm in a real-time operating system kernel. This scheduling algorithm had been chosen in order to make certain system safety properties work, and to enable some high-level control features. The developer in question did not understand the assignment, and reached out to someone else—someone not authorized to make decisions about the CPU scheduling algorithm. The developer got advice from the other source and implemented a different scheduling algorithm. The other algorithm could not provide basic safety and control features the system needed. As this project was being executed on a cost-plus contract, the developer’s organization had to pay for someone to remove the work the developer had done and implement the correct algorithm.

In another case, one senior system architect (systems engineer) was responsible for a particular feature set of the system. The system architect was working with a pair of developers to work out a design for those features. A second senior system architect, who was not responsible for that part of the system, was having a conversation with the developers and instructed them to design the features in a particular way. This conflict in instructions to the developers led to confusion that took several days to detect and resolve.

Problems like these are instances of a common design flaw pattern: conflicting control. This is a common source of accidents in control systems [Leveson11, Section 4.5.3], and it applies just as much to the system of building a system.

The techniques for addressing a potential system hazard apply to the conflicting authority: first try to eliminate the conditions that can lead to a hazard, then make it unlikely to happen, reduce the likelihood of it causing a problem, and then try to limit the damage when it does happen.

The first line of defense is thus to organize the project so that conflicting decisions and authority do not occur, or make it unlikely. This is most easily done by having for each part of the system exactly one person authorized to make decisions, and making that information clearly available to everyone on the team. Note that this does not mean that only one person is allowed to design; rather, it means that one person has responsibility for the design. The responsible person can and should delegate the design effort as much as possible to the people actually doing the work, and the responsible person should focus on setting objectives for the design, guiding the design, and checking that the results are acceptable.

Theoretically, a team can avoid conflicting decisions or directions by having a few people operating in a way where they reach consensus before making decisions. In practice consensus algorithms work well enough for computer systems but people find it hard to work that way: communication happens informally, people are in a hurry, or someone has a good idea they get enthusiastic about and don’t wait to share it with others for agreement first.

The second line of defense is to have regular review points in the project when discrepancies can be caught.

Chapter 17: Operations

11 February 2024

Operations covers how people on the team organize the work of building the system.

I introduced the basic ideas of operations in Section 6.3.5. I model operations as five parts: life cycle, procedures, plan, tasking, and support. In this chapter I go into more detail about each of these parts. The material in this chapter is based in part on the needs analysis reported in Appendix A.

This chapter details out the model for operations in general, without recommending specific solutions.

This chapter is focused on the operations directly involved in building the system. This is a subset of the larger matter of organizational operations

17.1 Purpose

Operations is about organizing work, in the form of tasks. It is complementary to team and artifacts, which I discussed in previous chapters. Operations ensures that people know what tasks they should be doing, similar to knowing what they should be producing (artifacts) and who they do it with (team).

I leave “task” largely undefined, relying on its colloquial meaning. It should be taken to mean some unit of work to be completed.

Operations should organize the work so that:

The right tasks are done at the right time by the right person.
- Each task is performed by the person with the right skills to do it.
- This is accomplished with tasking based on the plan.
Everyone does their work in compatible ways.
- Each team member has a single common model for how to do their work.
- The rules are documented clearly and understandable by the whole team.
- People understand what steps will be coming up after they perform some specific task.
- These conditions lead to people having confidence in what others are doing, and they allow detection and correction of problems.
- This is accomplished with the life cycle and procedures.
The work is done efficiently.
- The team avoids work that is not actually relevant to the system being built.
- They minimize work that is a dead end, re-work due to quality problems, waiting time because of dependencies, and overhead for operations.
- Operations allows detection and correction of problems, including accountability and feedback on schedule.
- This is accomplished with procedures and plan, especially in how they account for uncertainty and risk.
The work is of high quality.
- The team thinks through needs before moving forward, allowing for controlled exploration or prototyping.
- Work is checked independently to catch flaws.
- Work and steps do not fall through the cracks.
- Work is coordinated so system parts fit together, and flaws are detected and fixed.
- This is accomplished with procedures and tasking.
The work meets deadlines and budgets.
- The project can project forward the time and resources required to reach milestones, allowing for uncertainty.
- The work is actually possible for the team to complete.
- Project management can detect and resolve potential problems by adapting the plan, changing system objectives, or getting more resources.
- Progress is visible to project management, funders, and the organization.
- This is accomplished with plan and tasking, especially flexibility in planning.
Adapts with need.
- The project gracefully handles requests for changes in purpose or regulation.
- The project gracefully handles learning more things about the system as work progresses.
- the operations support people making decisions that change the plan.
- This is accomplished by building points where things are checked into the life cycle, and using procedures to deal with change.
The project supports its customer and funder.
- The project’s execution fits with acquisition and funding processes.
- This is encoded in life cycle and possibly procedures.

Each project will work out its own approach to operations. The list above provides objectives against which an approach can be measured.

17.2 Operation model

The model operations in Section 6.3.5 has five parts:

The life cycle is the overall pattern of how the project works, with phases and milestones.

Procedures are the checklists or recipes for performing key tasks.

The plan is an evolving understanding of the path forward for the project.

Tasking is the assignment of tasks to people, and figuring out what tasks each person should do next.

Support maintains tools and information needed to do the other parts.

These are ordered by rate of change and at which decisions are made. The life cycle is established early in the project and changes slowly after that. Procedures change a bit more frequently, but not often. The plan is updated on a regular cadence, while tasking is continuous.

17.2.1 Making the model work

Some people will look at the life cycle and procedure parts of this operations model and say that it is “process”—a term that has acquired a negative connotation. Yes, the life cycle and procedures do define processes that are supposed to guide the team. Process, when done well, helps a team work more effectively and more happily. Done well process is simple: it is a guide for how to do common sequences of events, or tasks that are critical to be done a certain way. It provides a checklist to make sure things get done and aren’t missed. It encodes checks to make sure technical work is done correctly.

I have outlined the advantages that life cycle patterns and procedures can bring to operations in Section 17.1 above.

In my experience, the potential disadvantages, and the reasons people have come to dislike the idea of process, arise from three misuses of operations: making it too heavy, making it too complicated, and defining something the team is unwilling to use.

As an example, a colleague told me about a project they had worked on where getting approval to order a fairly simple part (for example, a cable) took multiple approvals and potentially weeks to complete (heavy process). Indeed, nobody was even sure exactly how to go through the process to get an approval to get the part ordered (complicated process). The processes were, presumably, put in place to ensure that only parts of sufficient quality were used and to manage the spending on parts acquisition. In practice the amount of money spent on people’s time far outweighed potential cost savings, and the amount of work required for people to review an order over and over meant that the reviewers did not have the time needed to perform meaningful quality checks.

A “heavy” life cycle or procedure is one that takes more effort or more time than is warranted for the value it provides. This works against the objective of doing work efficiently. Each part of a life cycle pattern or procedure should have a clear reason for being included. The effort and time involved should be compared to that reason, and the procedure or pattern should be redesigned if the comparison shows it is too heavy. To avoid this, each procedure and life cycle pattern should be scrutinized to eliminate any steps that are not actually needed.[1]

A complicated life cycle or procedure is one that involves many steps, often with complex conditions that have to be met before some step can proceed. In the example from my colleague, nobody on the team could figure out all the steps that needed to be done. This can be avoided by, first, ensuring that procedures are as simple as possible, and second, by documenting them and making that documentation easy for people on the team to find and understand.

Teams are generally willing to follow procedures, as long as a) they know what the procedures are; b) they understand the value of following them; and c) following procedures has been made a part of the team’s norm. This means that the life cycle patterns and procedures should be documented, and their purpose or objectives should be spelled out. Normalizing following the procedures, however, is not something that can be accomplished by just writing something down. This has to be practiced by the team from the start, with leadership setting examples. Involving the team in setting up the life cycle patterns and procedures can help people understand and buy into the process.

Bear in mind that when a project adopts a particular life cycle pattern, the project is making an implicit commitment about staffing. If the pattern indicates that certain reviews must happen before key events happen, like ordering an expensive piece of equipment or beginning a complex implementation effort, then the project must ensure that there are enough people with enough time to perform those reviews. If the project does not staff enough, people on the team will quickly learn the (correct) message that the project or its organization does not actually care about the reviews and will begin to work around the pattern.

How all of this is handled for a particular team depends a lot on the team’s size. It’s common for a project to start with simple life cycle and planning when it is small and the project is uncertain. The project will need to shift strategies at times as the team grows, as the work becomes more complex and interconnected.

For some projects, the life cycle will be determined by an external standard. NASA defines a family of life cycles for all its projects [NPR7120]. This flow is designed to match the key decision points where the project is either given funding and permission to continue, or the project is stopped. It defines a sequence of phases A through F, with phases A-C covering development, D covering integration and launch, E covering operations, and F covering mission close out. Specific kinds of projects or missions have tailored versions of this overall life cycle.

Many companies have similar in-house project life cycle standards that revolve around decision points for approving the project for development and ensuring a product is ready for commercial release.

17.3 Life cycle

A project’s life cycle is the set of general patterns of how work unfolds. They encode a few principles that the project has decided on. I introduce the idea of life cycle here, and discuss specific examples and guidelines for building a life cycle pattern in Chapter 18.

The life cycle patterns help team members understand how the work they are doing fits with other work. It provides guidance on what they should expect from work that others are doing that will lead into work they will do. It helps people work out who is doing work related to their own, and who to talk to about that work. It helps people understand what steps will be coming up after they perform one step.

The life cycle is not a schedule. It is only a set of patterns, and it should guide the team as they work out a plan and schedule tasks that achieve that plan.

The life cycle is not much connected to the specific system being built. A life cycle pattern can be more or less well suited to a project depending on attributes of the system being built—most especially how often there are irrevocable or expensive-to-reverse decisions.

The pattern generally consists of:

A set of phases that the work can go through.
A set of milestones or checkpoints associated with each phase.
Conditions that should be true before starting or ending the phase.
Dependencies between phases. In other words, the life cycle can be viewed as a directed graph of phases.

For example, a simple life cycle pattern might say that the project must start with a phase where it works out and documents the customer’s purpose for the system before proceeding on to other work. That purpose-determining phase would conclude with a milestone where the customer reviews and agrees on the team’s purpose documentation. The next phase would involve developing a general concept for the system. This phase would include review milestones, checking that the concept will meet the customer’s purpose and that it can likely meet the organization’s business objectives. After those reviews, there might be a milestone where the organization makes a go-no go decision about whether to proceed with the project.

The life cycle model is general. It is not meant to provide a diagraming model or formal semantics; rather, it is a technique for working out how the project will order its work. It has evolved from a combination of examining several different life cycle standards, observing how teams use Gantt charts for scheduling, and the common practice of sketching things out on a whiteboard.

The life cycle model is connected to the development methodology that a project chooses to follow. A project that uses an agile-style or spiral development methodology will use different patterns for some development steps then what a project following a waterfall methodology will use. I will discuss this further in Section 17.5 below.

There are two ways that one can view life cycle patterns. The first way is as a path to be followed: one must go here, then here, then here. The other is as a way to measure progress. Being in some phase means certain things are believed done, while other things are in progress and yet others will be done later. These two views are compatible, and it is useful to use both viewpoints.

The difference between the two comes when dealing with changes. If the work on some component is in phase X, what happens when an error is found in work from an earlier phase? Or when a request for a change in behavior arrives? And what if one chooses to build a component in multiple steps, creating a simple version first then adding capabilities over time?

This is where viewing the pattern as a measure of progress is helpful. Consider a component that is to go through specification, design, implementation, and verification phases. When the work is in implementation, the implication is that specifications and designs are complete and correct. If someone then finds a design problem, the expectation that design is complete is no longer true. This situation leads to those tasks needed to make true again the condition that the design is correct. Put another way, this “rewinds” the status of the work on that component into the design phase. People will then do the tasks needed to advance back to the implementation phase by correcting the design and performing a review of at least the design changes.

In addition, work does not actually happen perfectly linearly. While someone is working on the specifications for a component, they may well be thinking of design approaches. In the example above of rewinding to design when a flaw is discovered, some implementation work still exists while the design is reworked. Part of the implementation might continue if there is someone to do it and parts of the implementation are unlikely to be affected by the redesign.

A life cycle pattern can be coarse-grained or fine-grained. A coarse-grained pattern would have phases that apply to the whole project at once, and take weeks or months to complete. The NASA family of life cycles [NPR7120] is coarse-grained: it is organized around major mission events like approval to move from concept to design, or from fabrication to launch. Fine-grained patterns might apply to a single component at a time, such as a component being first specified, then designed, then implemented, then verified, as a sequence of four phases with review checkpoints at the transition between phases.

Some life cycle patterns have phases that can overlap and repeat. Consider a fine-grained life cycle pattern that applies to each component. The general pattern might be:

A project might apply this pattern to each component being built. When multiple components are being developed in parallel, multiple instances of this pattern will be proceeding at the same time, and different components may be at different points in their cycle.

Finally, the project’s life cycle patterns do not necessarily imply one-way linear progress. A project or the work on one part of the system can potentially move through a phase, progress to another, and later rewind back to the earlier phase.

Consider the situation mentioned earlier, where a design flaw is found or a feature change request arrives during implementation. The dashed line in the following diagram shows how work proceeds on this component. It proceeds through specification and design into implementation, with accompanying reviews ensuring that both the specification and design are complete. During implementation, the need for a design change arises, and work reverts back to the design phase. Once the redesign is done and reviewed, work goes back to proceeding with implementation.

A project should clearly document the life cycle patterns it will use and make them accessible to the whole team. While the patterns are used directly for planning, making them accessible to everyone ensures that everyone knows the rules to follow and reduces misunderstandings about what is acceptable to do.

17.4 Procedures

Procedures define how specifically to do actions or tasks defined in the life cycle. They often take the form of checklists or flow charts.

Procedures are related to the system being built, but are generally portable between similar projects.

Having clear procedures will:

Let new team members know how to do steps that they might not know how to do correctly on their own.
Help all team members do steps that aren’t performed often, where people haven’t learned the steps through frequent execution.
Ensure that key steps aren’t missed.
Build in ways to check quality.
Provide a way for the team to learn and improve: when a procedure isn’t working well, the team can revise the procedure.

Having common procedures for the whole team makes key work steps less matters of opinion and more based on shared fact. This can improve team effectiveness by removing a source of conflict between team members.

A project can realize these benefits only when the team members know what procedures have been defined, when they can find and understand the procedures, and when the team uses those procedures consistently.

Three steps help team members know what procedures are defined. First, the procedures should be defined in one place, with a way to browse the list of procedures as well as a way to find a specific procedure quickly. Second, the life cycle should indicate when one procedure or another is expected to be used. (For example, when the life cycle indicates that an artifact should be reviewed, it should reference the procedure for performing the review.) Finally, new team members should be shown how to find all this information for themselves.

Understanding and using procedures depends on the procedures being actionable: they should indicate the specific conditions where they apply, and provide a list of concrete steps for someone to perform. This is especially true for procedures that will be used when someone someone is under stress, such as in response to a safety or security accident. I have often seen “procedures” that say things like “contact the relevant people”—which is unhelpful. The procedure needs to list who the relevant people are (or at least their roles) so that a person in the middle of incident response can contact the correct people quickly.

17.5 Plan

The plan is a record of the current best understanding of the path forward for the project. It contains the foreseeable large steps involved in getting the system built and delivered, and getting it to external milestones along the way. It guides the work, as opposed to people working on tasks at random.

The plan:

Guides the work being done, showing which tasks should be done sooner and which can be deferred.
Helps people ensure that all needed work gets done, and that important work does not fall through the cracks.
Helps people find and address situations where some work might be blocked waiting for other work to complete, improving efficiency of the work.
Supports forecasting the progress the project will make and the resources involved, showing whether key milestones can be met and prompting re-planning if they cannot. This supports desires from the organization, customer, and funder to know how soon they will have a working system.
Communicates to the team how each thread of work fits into the whole.
Provides a framework for adapting project execution when changes are needed and as people learn more about the work ahead.

Plans versus schedules. I differentiate between a plan and a schedule.

A schedule is a “plan that indicates the time and sequence of each operation”.[2] In practice, a schedule is treated as an accurate and precise forecast of the tasks that a project will perform. People treat the timing information it provides as firm dates, and will count on things being done by those dates. Schedules are often part of contractual agreements.

Because people outside the project use a schedule to plan their own activities, a schedule is hard to change.

Schedules are appropriate when the work to be done can be characterized with sufficient certainty. In most construction projects, for example, once the building design is complete, the site has been checked for geologic problems, and permits have been obtained, the remaining steps to actually construct the building are generally well understood and the time and effort involved can be estimated with confidence. However, before the site has been inspected one might not be able to create an accurate schedule because there could be undiscovered problems in the ground (perhaps an unmapped spring or an unstable mud layer).

The plan, on the other hand, is not a detailed schedule. It is a general indication of the steps to be taken, along with as much information about time required for different steps as can be estimated. It will reflect varying degrees of certainty about the steps and timing, from fairly certain in the near term to highly uncertain later in the work. It provides guidance, but it does not represent a promise of dates or exact sequencing of events.

A plan is dynamic and constantly changing, as it is a reflection of where the project currently stands.

At the beginning of a project that requires innovation, the team is just beginning to work out what the system will be, and so they cannot build a schedule because there are too many unknowns. As the project works out the customer needs and basic concept, the flow of work becomes a little clearer but most of the work ahead is still unknown. People will continue to learn more and more about the system, and at each step there will be fewer unknowns and the certainty of plans can improve. Even so, the exact schedule is not known until the very end of the project—when there are no places left that could hide surprises.

To some degree, the difference between a schedule and a plan is an attitude. A schedule is something people treat as a contract, and so it does not accommodate uncertainty well. A plan is a flexible current best estimate that doesn’t promise much except to accurately reflect what is known, and avoids information that might appear accurate but in fact is not certain. A schedule is useful to someone writing a contract to get something done. A plan is about an honest accounting of where the project stands and where it is going, and thus more useful to the people building the system.

Plan contents. A plan gathers four types of information:

The set of work steps that can be foreseen to be needed.
Milestones, both internally-defined and those imposed from the outside.
Dependencies among the work steps, and between work steps and milestones.
Estimations of uncertainty about all of these. The chunks of work and milestones form an acyclic graph, with dependencies as edges between the work or milestones. The work can be annotated with estimates of resources required and time, to the degree those are known—and they should not be annotated if the information cannot be estimated with reasonable confidence.

In addition, some projects will give each work step a priority or deadline. Tasks that should be done soon should be scheduled early, perhaps to meet a deadline, to address uncertainty, or to account for a task that is expected to be lengthy.

There is no set format for recording a plan. I have used scheduling tools that use PERT charts and Gantt charts as user interfaces. I have used diagramming tools that help the user draw directed graphs. I have used graphs and time tables written on white boards. I have used tools meant for agile development, with task backlogs and upcoming iterations. All of these have had drawbacks—scheduling tools are not meant for constantly-changing plans; agile development tools are structured around that methodology; drawings on white boards and drawing tools are hard to update over time.

Making and updating the plan. The plan starts at the beginning of a project, and is continuously revised until the project ends.

Assembling an initial plan starts with knowing the status of the project and working out the destinations. At the beginning of a project, the status is that the project is largely undefined beyond a general notion of what customer problem the system may solve. The endpoint might be delivering a working system, or it might involve expecting to deliver a series of systems that grow over time.

Initial plan for a new project.

If the project is already in progress, one starts on the plan by working out what is currently completed and in work.

Example initial plan with milestones filled in.

The next step is to fill in major intermediate milestones and work steps. The project’s life cycle patterns should provide a guide to these. For a new project, the life cycle might indicate that the project should start with a phase to gather information about customer needs. As the first phases progress, the team will begin to develop a concept for the structure of the system. If the customer or funder has required some intermediate milestones, those can be laid in to the plan, along with very general work steps for getting ready for each of those milestones.

Example life cycle pattern for the overall project.

It is normal for the plan to have large work steps that amount to saying that somehow the team will get something completed or designed or whatever. In the example above, “implement system” is completely uncertain when the project starts. When one does not know how part of the system will be designed, or how to implement some component, or even how some part of the work should proceed, it is better to put in a work step that accurately reflects the uncertainty. Being accurate about what is known and not known prompts people to find answers to the unknowns, gradually leading the plan toward greater certainty.

The plan then grows according to the system design. As the team works out the components that will make up the system, each new component creates a stream of work to be done to specify, design, implement, and verify that component, as specified by the life cycle. All these add new work steps into the plan, along with dependencies from one step to the next.

Example pattern for developing a component (linearly).

The plan should be revised regularly. It will change whenever there is some change to the likely structure of the system and as each component proceeds through its specification and design work. Many components will require some investigation, such as a trade study or prototyping, before they can be designed. The plan will evolve as those investigations generate results.

Part way through building the system, the plan will typically become large and show significant parallelism. This is also normal and desirable, because it reflects the true state of development. Mid-project there usually are many things that people could be working on. The plan should reflect all these possibilities so that those managing the project know the true status of the work and can make decisions with accurate information.

Example plan in progress. Some steps are complete, some are in progress.

The life cycle patterns a project uses provide building blocks out of which people can construct parts of the plan, but they do not dictate the plan entirely. Maintaining the plan is not simply a mechanical process of adding a set of work steps each time someone adds a new component to the design. There are three more factors to consider, and these make maintaining the plan a task requiring some skill.

First, the various components will be integrated into the system. The steps to put the components together and then verify that they interact correctly adds more work steps.

Second, a component does not necessarily proceed linearly through specification to design to implementation. Often the design will require investigation, perhaps a trade study to compare possible alternatives. In many cases it is worth building a simple prototype of one or more of these alternatives to learn more about the component before settling on a design. This turns a design step into several steps. Sometimes the outcome of an investigation is that the whole approach to designing a set of components is wrong and design needs to be revisited at a higher level. (This is the rewinding discussed in the section on life cycle above.)

Third, many system development disciplines, such as agile or spiral development, do not proceed linearly with developing a component from start to finish in one go. They often focus on building a simple version of a component or of a collection of components first, and then adding features over time.

Each project will have its own style for addressing these factors, and this will be reflected in the specific work steps included in the plan. For example, when a project follows a spiral development methodology, the plan for developing a part of the system might have several internal milestones: first a simple version of the components that can do some minimal function, then another version or two with increasing function. There might be design, implementation, and verification steps for each component involved for each milestone.

A project should document what methodology it has chosen, so that team members know what to expect and so they can plan consistently.

Plan and tasking. The plan is used to guide tasking—the assignment of specific tasks to specific people (Section 17.6). The plan includes work steps that are in progress and ready to be executed. These are the sources of tasks that people can pick up and work on.

Most of the time, there will be more tasks that are ready to be worked on than there are people to do them. The plan organizes them and thereby helps the process of deciding what someone should do—whether a manager makes task assignment decisions or people pick tasks for themselves. If work steps in the plan include priorities, those will help guide task assignment decisions.

The plan and tasking together support accountability and measurement. They should allow someone to identify when a plan was changed, to see if the change was an improvement in retrospect. They should help identify when some tasks were completed faster or slower than expected, or completed with quality problems. This information can be used to improve forecasting and to identify tasks and procedures that should be restructured.

Plan and forecasting. Most projects will have deadlines they must meet. Customers want estimated delivery dates, so they can make preparations for steps they will take to put the system in operation. Funders may want intermediate milestones to show that their investment is on track. Others want to know the budget—money and time—required to get the system built, or to meet other internal milestones. The team will need to manage project execution in order to meet those deadlines.

One can look at this as a control problem. Forecasts using the plan provide the control input: based on the current plan, including its uncertainties, is the project likely to hit a deadline or not? The control outputs are to rearrange the work steps in the plan or to add and remove steps. Adding or removing steps often means adding or removing capabilities from the system, also known as adjusting the system to fit the time available.

Forecasting using the plan will always be imprecise because the plan reflects the actual uncertainty in the project. In some industries it is possible to estimate the time and effort required for work steps, within a reasonable error bound, once the system is well enough understood—for example, in many building construction projects. However, when building systems that do not have extensive comparable systems to work from, estimates will be unreliable for much of the project’s duration.

There are ways to manage a project’s plan to reduce uncertainty as quickly as possible. I discuss those in ! Unknown link ref.

17.6 Tasking

Tasking is about the day-to-day management of what tasks people are working on and what tasks are ready to be worked.

The choices of what tasks are ready is based on the plan, along with bugs that have been found, management tasks that need to be done right away, and ongoing tasks that do not show up in the plan.

Tasking builds on the plan. The plan should be accounting for which tasks need to be done sooner than others in order to meet deadlines or to avoid stalling because of a dependency between tasks.

The objectives for tasking are:

To keep everyone on the team productively busy;
To keep everyone on the team informed of what they can work on, and who will be working on related tasks;
To help detect when a task is not going as expected and lead to a response to address problems;
To focus on higher-priority tasks; and
When a task involves multiple people, they are all available together.

One can treat tasking as a decision or control process that works to meet those objectives. Other scheduling disciplines, such as job-shop scheduling ! Unknown link ref and CPU scheduling ! Unknown link ref, can provide useful ideas for how to make choices about who should work on what.

There are many different choices about when, who, and how much. Each project will need to define its own approach, usually following whatever development methodology the team has selected. The approach should be documented as a procedure that the team follows.

Decisions about tasking can happen at many different times. It can happen reactively, when one task is completed, when a task someone is working on is stalled waiting for something else to happen, or when some urgent new task arrives (such as a high-priority bug or an external request). It can also happen proactively or periodically, putting together a set of tasks for someone to do ahead of time.

Tasking can be done by different people as well. One person can have a scheduler role and make these decisions. A group can divide up tasks by discussing and reaching consensus. Each person can take on tasks when they are ready for more. Combinations of these also work.

Finally, tasking decisions can occur one task at a time, or they can focus on giving each person a queue of tasks to perform.

A large project will have a very large number of tasks—potential, in progress, and completed—to keep track of. Using a shared task tracking tool of some kind is vital. Without one, tasks will be forgotten, or there will be confusion about how they have been assigned. The tracking tool is another one of the tools that the project should maintain (Chapter 15).

Each task must be defined clearly enough that the person doing the work can properly understand what is to be done, and so that everyone can agree when the task is complete.

17.7 Support

The decisions made in planning and tasking need supporting information.

Risk and uncertainty affect choices of what should be done sooner or deferred. I have often chosen to prioritize work that will reduce risk or clarify uncertainty, in order to make the project more predictable down the road. Many projects maintain a risk register, which lists matters that could put the project at risk. These risks are often programmatic, such as the risk that a delayed delivery from a vendor will cause the project to miss a deadline. I have on some projects maintained a separate, informal list of the technical uncertainties yet to be worked out; for example, how should a particular subsystem work?

Project management will also need to manage budgets. Programmatic budgets, most often funding, affects how the project execution can proceed. Technical budgets, such as mass, power, or bandwidth, are aspects of the system being built. For both types of budgets, the amount of the resource that has been used and the amount left need to be tracked. The project will need to estimate how much more of them will be needed to finish the project. If there isn’t sufficient resource left, then the project management will have a decision to make—whether reallocating resources, reducing demand, or finding more resources.

Almost every project will need to report on how the work is progressing, relative to deadlines and available resources. The plan mechanisms should help people obtain and organize this information.

Sidebar: Development methodologies and operations

Each project will at some point choose a development methodology to follow. There are several popular methodologies, such as waterfall development, spiral development, or agile development, along with a great many variants of each.

The operations model I have presented is a mechanism that can support any of these methodologies. The methodologies affect the life cycle patterns, how the plan is structured, and how tasking is done.

Waterfall development is characterized by developing the system linearly, starting with a concept and working through design and implementation of each of the pieces, then integrating those pieces together to form the final system. The life cycle pattern for waterfall development will reflect this ordering, and plans will follow the life cycle pattern.

Spiral development is organized around a set of intermediate milestones. The system becomes a bit more complete at each of these milestones (or iterations). Each milestone adds some set of capabilities to the system and the system, or some part of it is integrated and operable at each. The life cycle pattern for spiral development will define the way each spiral or iteration proceeds. The plan will reflect how the team will reach each of the upcoming milestones.

Agile development is organized around short cycles (called sprints in some versions of the methodology). Each cycle typically lasts one to four weeks, and adds a small number of capabilities to the system. The system is expected to be integrated and operable at the end of each cycle. Unlike spiral development, the objectives for each cycle are typically decided at the beginning of the cycle based on the set of tasks that are ready to execute, and priorities for each task. This means that agile development is primarily about tasking, and it relies on a plan that defines what all the ready tasks are.

In practice, most projects end up using a combination of methods.

The cost or difficulty of changing a decision usually drives a project to combine methods. The easier it is to change a decision, meaning undoing the work of some tasks already completed, the more agile the methodology can be. The more costly it is, the more care that should be taken to ensure that changes downstream are unlikely.

The cost of change is significantly lower near the beginning of a project, when there is less work to be redone and when one change will not cause a cascade of changes to other work already completed. As work progresses, a particular change will become increasingly costly.

The cost of change also depends on the kind of work involved. Software and similar artifacts are malleable. The cost of changing a line of software source code or changing one line in a checklist is, in itself, tiny, though a change in the software may cause a cascade of changes in other parts of the system and may cost time and effort to verify. Changing a built-up aircraft airframe, on the hand, is costly in itself—in both materials and in effort.

These differences in the cost of change lead to differences in life cycle patterns and planning related to potentially-expensive decisions. For example, the NASA family of life cycles [NPR7120] follows a linear pattern in its early phases so that key aspects of the project can be worked out before the agency commits to large amounts of funding, especially for building aircraft or spacecraft hardware. Parts of some of these projects follow a more agile process after they have passed the Critical Design Review milestone Section 19.2.1.

[1] In many cases I have seen, steps get added to procedures because someone wants to make sure they have a voice in any decisions made. This is a legitimate concern, but blindly adding review or approval steps to a procedure often does not really solve the problem. In most cases, the need to have a voice or to check something can be met with ensuring that regular communication happens, along with providing the person doing the main work of the procedure with the tools to perform most checks themselves.

[2] Per Merriam-Webster Online Dictionary.

Part V: Life cycles and project phases

Chapter 18: Introduction to life cycle patterns

23 February 2024

18.1 Introduction

System building in general follows a common story.

A project to develop a new system begins when someone has an idea that people should make the system. At this initial moment, the system is largely undefined. There is a vague concept in a few minds, but all the details are uncertain.

The project then moves the system from this initial concept through to an operational system, and through the operational life and eventual disposal of the system. During development, the team will need to ensure steps are taken in order to produce a correct, safe system. Designs will be checked. Implementations will be tested. The system as a whole will be verified before being deployed into service. At the same time, the resources spent on building the system must be used efficiently, doing the work that needs to be done and avoiding the work that doesn’t need to be done.

Many projects continue system development beyond the first operational version, with ongoing development or problem fixes. Some projects include the steps to shut down and dispose of the system once it has completed its functions.

The life cycle is how a project organizes the way the team moves through this story. It is a pattern that defines the phases and steps in the work: what will come first, what will done before something else, and when checks will happen. It provides checklists to know when some step is ready to be done, and when it should wait for prerequisites. It provides checkpoints and milestones for reviewing the work, so that problems are found and dealt with in a timely way. It provides an overall checklist to ensure that all the work that needs to be done is in fact done.

Section 17.3 introduced the basic ideas for life cycle patterns. These include:

The life cycle usually includes multiple patterns for different parts of the work.
Work is organized as a set of phases or steps, with milestones or checkpoints included in some steps. (I use the terms phase and step interchangeably; I generally think of a “phase” being longer than a “step”.)
Phases or steps can be dependent on each other, with A depending on B meaning that work done in A will build on work done in B.
A phase can have conditions that should be met before it starts, and that should be true when it is complete.
Many phases and steps can be worked on in parallel.
A path from one step to another following dependencies does not necessarily mean that the steps must be performed serially.

Each project will use its own life cycle patterns. The patterns may incorporate a framework that is standard for the industry or the parent organization. Selecting and documenting the patterns is an essential part of starting up a project, and people in the project should review how well the patterns are working for them from time to time and may want to improve the patterns.

18.2 Key ideas

Almost all project life cycle patterns, for both whole systems and for components, follow a similar overall flow. Abstracting from the story in the introduction, there are phases:

Identifying purpose
Developing a concept
Refining concept into specification and design
Implementation
Verifying the result
Operating the system or component
Evolving it over time
Disposing of the system or component at end of life

For a whole system, this looks like:

Note that this flow starts with the system or component’s purpose. Good engineering always begins with having a clear understanding of what a thing is for. I have watched many engineers rush into designing and building a component without putting time into understanding what the component is going to be used for. By random chance their design has occasionally worked out to match what the component actually needed to do, but only rarely.

Understanding a system’s purpose or a component’s purpose also provides a way to bound the work. If one doesn’t know what a component is for, it is easy to keep working on a design without stopping because there isn’t a clear way to know when the design is good enough to be called done.

There are many points in this flow where one might add checks. At these times one can check on the correctness of the work. These checks improve system quality by building in the opportunity to discover and correct flaws before other work builds on the flawed work. Finding minor problems quickly usually means the cost of correction remains low.

This general pattern applies recursively. One can start by creating a specification and design for the system. The system design will decompose the system into high-level components (Section 5.3). The act of defining a set of component implies identifying a purpose for each one, then specifying and designing each high-level component. The design of a high-level component might in turn decompose into a set of lower-level components, which in turn need a purpose, then specification and design.

The overall flow shows a move from high uncertainty at the beginning to lower uncertainty as the work proceeds. I will address managing using uncertainty in ! Unknown link ref.

Finally, a project’s life cycle patterns will reflect the development methodology that the team has selected. Waterfall, spiral, and agile development all affect the contents of the patterns. I discuss this more in ! Unknown link ref.

The life cycle is provides a general set of patterns for how work should proceed, but it should not define exactly how each work step should be done. That is left to procedures (Section 17.4), which should provide step-by-step instructions for how to do key parts of the life cycle. For example, if a life cycle phase indicates that a design review and approval should occur before the end of a design phase, then there should be a corresponding procedures for design reviews. That procedure should indicate who should be involved in a review, what they should look for, how those people will communicate about the results, who is responsible for approving the design, and how they indicate approval.

The life cycle patterns are the basis for the project’s plan (Section 17.5). The patterns are a set of building blocks that people in the project can use to develop the plan. The plan, in turn, guides tasking: the selection of which tasks (as defined in the plan) people should be working on next.

18.3 Purpose

Life cycle patterns address problems that projects have. They can help the team have a predictable and reproducible flow to how work should be done, so that everyone shares the same understanding of how the team works.

There are six ways that life cycle patterns should help a project.

Quality of work. The team must build a system that addresses the customer’s purpose, and in doing so must meet quality, safety, security, and reliability objectives.
Efficiency. The project will be expected to deliver the final system as quickly as possible, at the lowest reasonable cost, while meeting the quality objectives. This means that the team needs to be kept busy doing useful work.
Team effectiveness. People on the team need to know how to work together. Building trust depends, in part, on having shared expectations of how each person will do their work.
Management support. Project management will need to plan and track the work in order to ensure the team meets deadlines and that they have sufficient resources to do the work.
Customer and regulatory support. The customer may have specific milestones they expect the project to meet in support of the customer’s acquisition processes. Regulators often have similar expectations if a system must be certified or licensed for operation.
Auditing support. The project’s work may be audited to check that the processes followed meet regulatory requirements, certification requirements, or as part of a legal review.

Gaining these benefits is not a result of using life cycle patterns per se; rather, it comes from using patterns that are designed to provide the benefits. For example, if the customer has an acquisition process that specifies certain milestones, then the top-level life cycle pattern for the project should incorporate those milestones. If the project is likely to have auditing requirements, then the patterns should include tasks to generate and maintain auditing records.

Quality of work. The purpose of a project’s approach to operations is, in the end, to produce a system for the customer that meets their objectives. This means it should do what they need, meet safety and security needs, and be sustainable as the system evolves in the future. In other words, the team’s work needs to produce a system with good quality.

Neither the life cycle patterns by themselves nor the plan that derives from them directly result in good product quality. System quality comes from all of the detailed work steps that everyone on the team performs. If they do their work well, and if mistakes they make are caught and corrected, then the system can turn out well. If some work is not done well, nothing in the life cycle patterns can prevent that.

However, the life cycle patterns can create an environment that will more likely lead to good quality. They can proactively make flaws less likely by ensuring that steps happen in order: identifying purpose and concept before design and implementation, for example. They can insert points in the work that encourage people to think through what they should design or implement. They can also avoid problems by providing a checklist for what should be complete at the end of a work step. They can ensure that when a system is delivered, that all the work needed to put it into operation is complete. They can build in checkpoints for reviews and verification to catch problems early. They also help project management organize the work so that it is complete, that is, so that no parts of the system or some work steps are overlooked.

Sometimes the value of a life cycle pattern will come from slowing down work. Most of the work done on a project is done by people who are focused on a particular part of the system; it is not their job to manage how the project goes as a whole. Their job is to get that one part designed and built, according to the specifications they have been given. If the specialists start building before the context for their work has been established, they are likely to design or implement something that does not meet system needs. I have been part of more than one project where the resulting rework caused the project to be canceled or required a company to get additional funding rounds to make up for the resources spent on the mistakes.

Efficiency. Most systems projects will be resource-bound, with more work than there are people on the team to do the work. In this kind of project, it is important to keep each person busy with useful work. This means that nobody on the team is blocked with no tasks they can usefully perform. It also means that almost all the tasks that people perform contribute to the final system—that there is little work that has to be thrown out and redone because it had flaws that made it unusable.[1]

As project management builds the project’s plan, using the life cycle patterns as building blocks, they must detect where there are dependencies between work steps and plan the work steps so that later steps are unlikely to get blocked. For example, if some part will require an unusually long time to specify and acquire from an outside vendor, then the management will need to ensure that work on that part starts early. The life cycle patterns provide part of the structure on which the plan is based, and provides a template for some of the dependencies.

Life cycle patterns can also help avoid unnecessary rework. This comes partly from the ways that the patterns help improve the quality of work. In particular, a good life cycle pattern can lead people to take the time to think through the purpose and specification of something before they jump into design and implementation unprepared, and then build something that does not meet the system’s needs.

Finally, the patterns can help bound the work to be done. When a project does not define the scope of work to be done, it is likely that someone will start working on something in excess or not related to the customer needs. Good patterns help avoid this by defining an orderly and thoughtful process for identifying what work needs to be done.

Team effectiveness. Members of an effective team respect and trust each other. Having shared norms and understandings for how work is done and how people communicate is important as part of the environment that allows the team to develop respect and trust.

A defined life cycle for a project addresses part of this by defining a common understanding of how work should be done. Good patterns define expectations of what will be done in different work steps. Everyone on the team can agree when a work step has been completed. Good patterns also create times when people know they are expected to communicate about some work step. This makes it easier for someone to trust that they will be consulted at appropriate points about work that might affect what they are doing, so that they do not need to create separate, ad hoc communication channels or try to micromanage something that is not their direct responsibility.

As I have noted elsewhere (Section 17.2.1), the life cycle patterns can only have this benefit if the team actually follows them.

Management support. The team, or designated parts of it, will be responsible for making a plan (Section 17.5) for the project’s work, then coordinating and tracking the resulting tasks. The life cycle patterns provide templates for the tasks that will go into the plan, and the key milestones that anchor the work. The life cycle sets the pattern for phases that the project will go through, such as initial conception, initial customer acceptance, concept exploration, implementation, and verification. The cycle also sets the pattern for milestones that gate the progression from one phase to another, such as a concept review, a design review (and approval), or an operational readiness review.

The plan will change from time to time, both in response to external change requests and as the project progresses and the team learns more about the work ahead. Sometimes the need for change occurs gradually, with an issue slowly manifesting itself but causing no acute problem that causes people to recognize there is a need for change. A good life cycle will build in times for people to step back to get perspective and detect when there is a slow-building problem to address. Review milestones are often a good time to plan for this.

Having life cycle patterns and corresponding procedures that apply when these changes occur will help the team adjust their work in an orderly way. It will help them ensure that steps don’t get missed as they work out how to change the plan (and the system being built).

Good life cycle patterns can help a project steadily decrease its uncertainty and risk as work proceeds. Most of the time, a project will start with high uncertainty about what the system will look like, and early project phases result in increasing understanding of what the system will need to be. This process will repeat at smaller scales: once the general breakdown of the system into major components is decided on, each of those components will start with high uncertainty about how it will be structured. The uncertainty about the major components will then gradually resolve, and so on. However, this occurs when the project is guided in a way that uncertainty is addressed systematically, not haphazardly.

Customer and regulatory support. Many customers will have a process they go through to decide whether to build a system and to track its development process. For US governmental customers, much of the process is encoded in law or regulation, such as the Federal Acquisition Regulation (FAR) [FAR] or Defense Federal Acquisition Regulation Supplement (DFARS) [DFARS]. The process governs matters like which design proposal is selected for contract, providing evidence of good progress, providing information that determines periodic contract payments, accepting the finished system, and determining whether the project should continue or be terminated.

These customers will expect deliverables from the project from time to time. The life cycle process must ensure that there are milestones when these are assembled and delivered. (It is then the job of project management to ensure that these milestones, and the tasks for preparing deliverables, can be completed by the time line that the customer requires.)

Whether the customer requires explicit intermediate deliverables or not, formally involving the customer may be important for keeping the project on track.

Similarly, regulatory bodies have processes by which a system that must be certified or licensed before operation can apply for that approval. Those processes will define activities that the team must perform, along with milestones and deadlines by which applications must be submitted or approvals received.

Auditing support. A project’s development practices may be audited for many reasons. Auditors may perform a review as part of an appraisal or certification against standards, such as CMMI ! Unknown link ref. They may review processes to ensure compliance with regulatory standards, especially for security-sensitive projects. The processes may also be audited as part of a legal review. These reviewers need to see both the entire definition of processes, including the life cycle patterns, as well as evidence of how well the team has followed these practices.

18.4 A model for patterns

Each project will have several life cycle patterns, each covering a different part of the work.

Each pattern is defined by its purpose, the circumstances in which it applies, the phases or steps involved, and the dependencies among the steps. It should also include rationale that explains why the pattern is structured the way it is. In the previous chapter I used the example of a simple pattern for building one component:

This pattern applies to building one low-level component where the purpose of the component is already known, and the component is straightforward to design and build in house. Similar but slightly different patterns might apply when the component has to be prototyped before deciding on a design, or when the component is being acquired from a supplier outside the project. This pattern would be used as one part of a larger pattern for building a higher-level component that includes this one.

Each phase of a pattern defines a way to move part of the work forward. It should have a defined purpose that defines what work should be achieved in that phase.

The details of the phase are defined by:

Preconditions: what conditions should be true when a phase is ready to start. A list of conditions for starting the phase, beyond the artifacts that should be ready.
Input artifacts that should be available at the beginning of the phase, and their maturity or completeness level.
Actions that are to be taken to begin the phase.
What work should be done in the phase, and the artifacts developed in the phase.
Milestones to be met during the phase.
Actions taken to end the phase.
Output artifacts that should be available at the end of the phase, and how complete each one should be.
Postconditions that define what should be true when the phase has completed.

Each action should also indicate who is responsible for performing that work. The responsibility will usually be defined as a role, not a specific individual. For example, a component design phase might involve three actions: design the component, review the design, and approve the design. The design action would be the responsibility of the component developer; the review action would be the responsibility of the developers responsible for components that interact with the one being designed, and the approval would be the responsibility of a systems engineer overseeing some higher-level component of which this one is part.

The rationale for this example design phase might say:

The purpose is to work out a design for the component that meets its specifications, including its relationships with other components in the system structure and its safety, security, reliability, and performance objectives.
A separate design step between specification and implementation gives time to think specifically about this component, and to document its design for future developers who may need to revise the design.
The review actions provide an independent check that should improve the likelihood of this component working correctly as part of the system by looking at the design from a point of view that has not been intimately involved in working out the design.

The actions defined for the phase should reference the procedures for doing those actions, when those procedures are defined. For the example design review action, the procedure might be:

The component developer notifies the review group that their review is needed;
Each reviewer in the review group acknowledges that they will provide the review;
Each reviewer checks the design against the relationship that component has with their own components, and against the specification for the component being reviewed;
Reviewers give feedback to the developer, indicating whether they are satisfied or not;
When all reviewers are satisfied, they inform the developer.
If one of the reviewers detects a more serious problem with the design that cannot be resolved by feedback to the developer, the reviewer should use another procedure to raise the issue up to a higher level.

The procedure might also name the tools to be used (an artifact repository for the design, a review workflow tool for the reviews).

18.5 Documenting life cycle patterns

A team needs clear documentation of the phases if they are to execute them properly. A team can’t be expected to guess at what they need to be doing, or how their work will be reviewed; it needs to be spelled out.

This documentation is assembled during the project preparation phase. The details are usually not completely worked out before any other work is begun; rather, “project preparation” more often proceeds in small increments, working out the rules shortly before the associated work begins.

Each life cycle pattern should have a purpose, and the steps or phases in the pattern should be checked that they can achieve that purpose (and that there is no extraneous work in the pattern).

A pattern should also have an explanation of when it applies and when it does not. For example, there may be multiple patterns for designing a component: one for a simple component that is built in house; one for a component that is outsourced to a supplier; one for a high-level component that is made up of several lower-level components; one for a component that requires investigation or prototyping before deciding on a conceptual approach to its design. All these patterns likely have a lot in common, but procuring an outsourced component will have contracting steps that an in-house component will not.

Someone using the documentation should be able to tell accurately whether they are using the correct version of the patterns. The life cycle patterns should be revised from time to time—as the team grows and as people find ways to improve how they work together. This means that the material that a user sees should indicate not just a revision number but have a clear indication of whether the version they are looking at is not longer current.

The form of the documentation is not as important as the content. It can be a written document. It can be made available electronically, with structured access and search capabilities (such as in a Wiki). Some companies offer tools that help define and document development processes or life cycle patterns, including definitions of phases. What matters is that each person who needs to use the documentation can do so conveniently and accurately.

18.6 Work steps and artifacts

Each phase or step has a number of artifacts that the team must develop. At the end of a phase, some of those artifacts need to be complete (allowing for future evolution), and others need to have reached some defined level of maturity. The work in a phase consists of the tasks that develop those artifacts.

I discussed artifacts in Chapter 14. The artifacts are the products of building the system, including the system being delivered as well as documentation of its design and rationale, records of actions taken during development, and information about how the project operates.

These artifacts are the inputs and outputs of the work specified in life cycle patterns (and the associated procedures). Using the component design step example, the work uses:

The purpose and rough concept developed for the component;
The specifications developed for the component;
Documentation of the relations between this component and others, both functional and non-functional; and
The purpose of the system and of higher-level components of which this one is part.

The design step produces:

A design document for the component;
Analyses of the design showing whether it will meet its specifications or not;
Records of the rationale for why the design was chosen; and
Records of review and approval of the design.

In general, every artifact involved in building the system should be a product of some work phase or step, and every input or output of work steps should be included in the set of artifacts the team will develop. Ideally, the life cycle patterns will be checked for consistency with the list of artifacts the project uses.

Different artifacts are developed at different times during the course of a project. A few artifacts should be worked out as the project is started—especially those recording the initial understanding of the system’s purpose and initial documentation of how the project will operate. These will be refined over time. Other artifacts are developed during the course of development, and the life cycle patterns indicate which ones are to be worked out before others. The artifacts will be in flux during development: the team learns about the system as it designs and develops it; the customer or mission needs often change over time; flaws get discovered in designs or implementations.

Many of the project’s artifacts support how people work together, and the life cycle patterns should reflect these communication needs. For example, one person may work out the protocol that two components need to use to communicate with each other. Two other people may design and implement the two components. The interface specification that the first person develops serves to communicate the details of the interface among all three people. The patterns should record that the design and implementation work steps depend on the work to develop the interface specification. Later, if one of the component developers identifies a flaw in the interface, the people involved can work through how to revise the interface—and the revised specification artifact records exactly how each person needs to update their work to match the change. The pattern helps to show how information about a change to the interface specification triggers rework on dependent artifacts.

A good life cycle pattern must have procedures to manage the change in artifacts, and how those changes affect other artifacts downstream from them. There are two separate problems these procedures must address:

Managing how changes are coordinated across multiple artifacts and through the team while a part of the system is in development
Ensuring that when a part of the system is complete, all the artifacts are consistent with each other

Different life cycle patterns approach this in different ways, which we will discuss in later chapters on different patterns. The most common approach is to maintain different versions of an artifact, with at most one version being designated as a baseline or approved version, and other versions designated as works in progress. Many configuration management tools have a way to designate a baseline version, and many software repository tools provide branching and approval mechanisms to track a stable version.

18.7 Life cycle and teams

What is the team size and background? How is it expected to change over time? A small team can often be a little less formal than a large team, because the small team (meaning no more than 5-10 people) can keep everyone informed through less formal communication. A large team is not able to rely on informal communication, so more explicit processes and communication mechanisms are important. Many teams start small when the project is first conceived, but grow large over time. A team that will grow will need to communicate more formally from the beginning than they otherwise might so that as they add people to the team, the larger team works smoothly.

Conversely, if the life cycle patterns indicate that some action will be performed by some person, does the team actually have the staff to do that work? When a project says that some work is to be done and then does not staff that function sufficiently, it sends a message to the team that they should not take the process as written seriously. This undermines the team’s trust. If the function is actually needed, either the team will find an ad hoc workaround or the function will not get done adequately. Either way, there will be a disconnect between what is written down and what actually happens.

18.8 Life cycle and planning

The life cycle patterns are just patterns that provide a guide to work that goes in the project’s plan. The plan is the actual definition of the tasks to be done. When the plan needs to be updated, the patterns provide a template for the work that goes into the plan.

Assembling the plan, however, takes into account many inputs, of which the pattern is only one. Planning involves deciding on the priority and deadlines for work, which is based on project deadlines, risk or uncertainty, and the project’s development methodology.

! unknown reference XXX discusses in detail how the plan is developed and maintained, including how the life cycle patterns get incorporated.

Consider the following example of how a pattern gets incorporated into the plan. This example shows how the pattern is only a template, and there are many decisions that will depend on other information.

This pattern defines what should happen when a customer requests a change. The basic pattern is that first someone on the team should evaluate the request; this may involve working with the customer to clarify the request, and with other engineers to estimate the scope and cost of the work. The project can then make a decision whether to accept the change or not. If the decision is to make the change, work to build, release, and deploy the update will follow. If not, there is another pattern for how to communicate with the customer that the change will not be made.

The activity starts when the project receives a change request. Based on this, the plan can be updated to include three tasks right away: the evaluation, review, and decision tasks.

At the same time, the planner must make decisions: who should each task be assigned to? What priority should the flow of tasks have? The pattern can indicate the roles involved in the tasks, such as there being a small team responsible for evaluating change requests and a customer representative from the marketing team, but it doesn’t determine which specific people. That’s for the planning and tasking efforts to determine. Similarly, the pattern does not specify how the work should be prioritized relative to other work the same people are doing. The planner incorporates information about how urgent the customer’s request might be and the importance of the customer into the decision. The project might have decided, for example, that there should be a queue of outstanding change requests and they should be evaluated in their order in the queue.

Determining who should be involved in a review of the evaluation might depend on the results of the evaluation. The pattern might indicate that the evaluation should be reviewed by engineers responsible for each high-level component that will be affected by the change. This means that the decision about who specifically will be tasked with the review can’t be made until the evaluation has worked out the scope of the change.

The decision to proceed with making an update will depend in part on whether the team has the time and resources to make the update. The team will need to determine whether adding the work to the plan will cause a problem with meeting deadlines that have been established already, or if it will overload a team that is already busy. This determination will involve analysis of the current plan—something that the life cycle pattern can help with only to the extent that the patterns can help with generating estimates of the work that would be involved.

When the project takes the decision to go ahead with developing an update for the request, the pattern shows that work steps follow to develop a change and then release and deploy the update. When the decision gets made, this will trigger the planning activity to add development and release work into the plan. These are high-level work steps with little detail. The planner will find patterns for these steps and populate those patterns into the plan.

Decisions about the work involved in development will depend on the development methodology that the team has selected to follow. If the update will involved extensive changes and the team is following a spiral-style methodology ! Unknown link ref, the development plan might consist of two or three development rounds. Each round would design and implement part of the changes, with a milestone at the end of each round showing how the partial changes have been integrated into the system.

Decisions about the release and deployment work will also incorporate policy decisions about how the team works. Will each change request result in a separate update release? Or will updates be bundled together into releases that combine several updates, perhaps on a schedule defined in advance?

18.9 Principles for a life cycle pattern

In this section I list some principles to consider when designing a workflow pattern.

The act of designing—or refining—a life cycle pattern is an opportunity to think deliberatively about how the team should get its work done. Life cycle patterns are the templates for the project’s plan, and so they should be designed to achieve the work that is needed to move the project forward well.

Designing the patterns ahead of time means having time to define good work patterns. The pattern does not have to be worked out under pressure, as a reaction to something unexpected happening in the project. It can be discussed among multiple team members to get different perspectives and to ensure everyone’s needs are met. Working in advance gives time to check that the steps in the pattern are consistent with each other. It means that there is time to think about what exceptional situations might happen and define what to do in those cases.

Note that if an organization already has an approach to life cycle patterns, whether documented or not, one should aim for continuity with that approach. Anyone already in the organization will know that approach to organizing work; making a major change would mean loss of the advantage of established team habits. On the other hand, if the current approach is not working well, then a new project is an opportunity to improve.

The life cycle patterns encode principles and methodology that encourages good work. Principles to consider include:

Know the purpose for something before developing it.
Build in time for and incentivize deliberative thinking.
Assign decision-making authority to an appropriate level based on the nature of the decision.
Build in ways to check work, and design them so they are a team norm and not prone to triggering defensive reactions.
Build in the longer term.
Think about exceptions that might happen, how to handle them, and when to change course.
Define the work so that everyone on the team can agree when a step has been completed.
Give a clear definition for each step of the quality considerations by which the work can be judged.
Make the pattern as light-weight as possible without compromising quality.

Purpose. I have mentioned this principle several times already, and I believe it is a basic principle of effective system-building. The life cycle patterns encode this principle for specific parts of the team’s work.

As with anything else that is designed, a pattern itself starts with a purpose. That purpose might be “build a simple component” or “build the whole system” or “handle a customer’s change request”. A good pattern addresses its purpose thoroughly, without trying to achieve other purposes.

The pattern that results should then ensure that team members follow this approach when building parts of the system. If the pattern is for handling a customer’s change request, for example, the pattern should address understanding and documenting what the customer wants changed (and why), before starting to work out whether to agree to the change or to begin implementing the change.

Time to think. Key parts of a complex system are best served by taking some time to properly understand the purpose or need of that part, and to look at options for how it can be designed or built. A project running at too fast a pace skips this thinking and uses the first thing that someone thinks of—though there may be subtle ramifications of that decisions that are not appreciated until the decision causes a problem later. Asking someone to how alternatives they considered and rewarding them to do so works to improve the quality of the system.

At the same time, people can take too long to make a decision or fixate on making it perfectly. The time spent on deliberation should be bounded to avoid this.

Decision-making authority. Bezos introduced the idea of reversible and irreversible decisions [Bezos16]. He wrote:

Some decisions are consequential and irreversible or nearly irreversible—one-way doors—and these decisions must be made methodically, carefully, slowly, with great deliberation and consultation. If you walk through and don’t like what you see on the other side, you can’t get back to where you were before. We can call these Type 1 decisions. But most decisions aren’t like that—they are changeable, reversible—they’re two-way doors. If you’ve made a suboptimal Type 2 decision, you don’t have to live with the consequences for that long. You can reopen the door and go back through. Type 2 decisions can and should be made quickly by high judgment individuals or small groups.

As organizations get larger, there seems to be a tendency to use the heavy-weight Type 1 decision-making process on most decisions, including many Type 2 decisions. The end result of this is slowness, unthoughtful risk aversion, failure to experiment sufficiently, and consequently diminished invention.

For engineering projects, many decisions fall in the middle ground between reversible and irreversible. Consider building an aircraft. As long as the designs are just drawings, the designs can be changed with low to moderate cost. Early in the design process changes can be quite low cost; as the design progresses and more and more interdependent components are designed, the cost of rework increases. Once the airframe has been machined and assembled, the cost of changing its basic design becomes high, possibly high enough in time or in money that it is in effect irreversible.

Good life cycle patterns will account for different costs of reversing decisions. They should both build in time for deliberation and consultation before making hard-to-reverse decisions and use lighter-weight decision-making for less risky decisions. Similarly, the patterns should ensure that the authority for hard-to-reverse decisions is assigned to someone with high-level responsibility in the project, while the authority for low-risk decisions should be placed as close to the work as possible.

Checking work. Checking that work has been done well is commonly understood to improve the quality of results ! Unknown link ref. It is essential for parts of a system that require high assurance—safety- or security-critical parts.

The key to checking is that they not be subject to implicit biases that the developer might have. This can be handled either by the developer doing analyses that force a stepping back from decisions (perhaps by encoding them mathematically) and that can be checked for accuracy by someone else, or by having an independent person review the work.

Either way, the developer’s pride in their work can feel threatened. Setting out life cycle patterns in which every part of the work is checked enables the project to make checks a norm. Designating in advance that checks will happen, and who will do them, helps depersonalize the effort and in the long term contribute both to quality work and team morale.

Building for the longer term. It is easy to solve an immediate problem at hand quickly and move on, leaving a problem for the future. Taking time to think about the problem (the principle of taking time for deliberative thinking, above) will help but is not sufficient.

It is likely that someone will revisit the work sometime in the future. They may need to understand the work in order to fix a flaw or make an upgrade. They may be auditing the work as part of a critical safety review. They will need to know the rationale for decisions that were made, and they will need to understand subtle aspects of the work. If this information has been documented, these people in the future will be able to do their work accurately and relatively quickly. If they have to deduce this information by looking at artifacts built in the work, they will have to spend time reverse-engineering the work and their accuracy is generally low.

Building into the pattern checks for documentation of rationale and explanations will accelerate future work.

Exceptions. Things often go not to plan. What then? Who needs to know? What needs to be done to respond?

Sometimes this is as simple as setting an expectation for the team. If a component’s specification is inconsistent or cannot be met, who gets informed, and how does the problem get corrected?

Sometimes the situation is time-critical. If a major piece of equipment catches fire, what is the response? What if an insecure component has been incorporated and deployed? What if a large part of the system has been built, and someone finds a fundamental flaw? The responses to situations like these are complex, and there often isn’t time in the moment to work out the details.

Good life cycle patterns include pre-planned responses to these exceptional situations. This might consist of references to procedures that should be followed, or it might reference a pattern used to respond to the situation.

Completeness. Can everyone on the team agree when a part of the work has been completed? The person assigned a task should understand their assignment, so that they can do their work independently. Others will check the work, or mentor the person doing the work—and they should have the same understanding of the assignment.

The definition of actions, as well as the list of outputs and post-conditions for a pattern, should be clear to everyone.

Quality considerations. As with completeness, the people assigned to work on tasks need to have a clear definition of what makes the results of their work acceptable, or what makes one way better than another. Sometimes this is simple: when objectives or specifications, which would be inputs to a work step, are met. Other times considerations of quality arise not from specifications but from things like coding standards. In those cases the quality considerations should be spelled out explicitly so the people doing the work know to use them.

Light-weight patterns. Good patterns are lightweight enough to get their job done, and not more. Working out the pattern in advance is an opportunity to work out what parts of the work are truly needed and which can be omitted or simplified. For example, a pattern should be adapted to the possible cost of making a wrong decision (see decision-making authority above). Patterns that involve easily-reversible decisions should include streamlined decision-making steps, pushing the decision authority to as low a level in the team as possible and involving as little work as possible. On the other hand, more difficult decisions should involve a pattern that calls for greater deliberation, more checking and consultation, and places decision-making authority higher in the team’s hierarchy.

Similarly, the patterns should be achievable by the team. If the team is small, it makes no sense to mandate complex work flows for which there isn’t the staff. Each decision about what to include in a pattern should be measured against what is possible for the team to perform.

18.10 In upcoming chapters

In the chapters that follow, I discuss life cycle patterns in more detail. This includes:

Example life cycle patterns currently in use (Chapter 19)
An idealized set of patterns (Chapter 20)

XXX add to this list as the part is developed

[1] Prototyping can be a grey area, on the boundary between useful and not useful work. I will argue in ! Unknown link ref that prototyping is useful, and indeed necessary, for reducing uncertainty about how a part of a system can be designed or implemented. However, a prototyping effort can produce less value than is justified by the effort involved if the prototyping goes on too long, or if it is not focused on learning rather than building. My guidance is that prototypes must not have a path to transition directly into a real component, and the prototype artifacts must be segregated from other system artifacts.

Chapter 19: Example life cycle patterns

29 February 2024

19.1 Introduction

In this chapter I survey some of the many different life cycle patterns in use.

The patterns have different scopes. Some cover the whole life of a system, from conception through retirement. Some are concerned only with developing a system. Others focus on more narrow parts of the work.

I group the patterns in this chapter into four sets, based on scope. The first group covers the whole life of a project, without much detail in the individual steps. The second dives into the development process. The third addresses post-development processes—for releasing and deploying a system; these patterns overlap with development processes. The fourth and final group is for patterns with a narrow focus on some specific detail of building a system.

Patterns with different scopes can potentially be combined. Most patterns that cover a system’s whole life, for example, define a “development phase” but do not detail what that is. One of the patterns for developing a system can be used for the details.

Each of the examples will include a comparison against the following baseline pattern for the whole life of a project.

The baseline phases are the same as in Section 18.2:

Project preparation. Work out and document the processes, rules, standards, and life cycle patterns that the team will use.
Concept development. Determine what the system needs to be or do in order to be useful to its users and to those who will operate it.
System development. Design, build, and verify the system.
Operational acceptance. Review and acceptance by the customer, indicating that the system is ready for operation.
System production. Build and deploy the system, using the artifacts from system development.
System operation. Support the system in operation, including fixing flaws and supporting problem resolution.
System evolution. Add capabilities to the system.
System disposal. Take the system out of operation and ensure that all artifacts are archived, destroyed, or recycled. Shut down the project.

19.2 Whole project life cycle

These patterns organize the overall flow of a project, from its inception through system retirement and project end. I have selected two examples: the NASA project life cycle, which is used in all NASA projects big and small, and the Rational Unified Process, which arose from a more theoretical understanding of how projects should work.

19.2.1 NASA project life cycle

The NASA life cycle has been refined through usage over several decades. It is defined in a set of NASA Procedural Requirement (NPR) documents. The NASA Space Flight Program and Project Management Requirements document [NPR7120] defines the phases of a NASA project.

The NASA life cycle model is designed to support missions—prototypically, a space flight mission that starts from a concept, builds a spacecraft, and flies the mission.

NASA space flight missions involve several irreversible decisions, and this is reflected in how the phases and decisions are organized. Obtaining Congressional funding for a major mission can take months or years. During development, constructing the physical spacecraft, signing contracts to acquire parts, and allocating time on a launch provider’s schedule are all expensive and time-consuming to reverse. Launching a spacecraft, placing it in a disposal orbit, and deactivating it are all irreversible. These decision points are reflected in where there are divisions between phases, and when there are designated decision points in the life cycle.

There are several life cycle patterns in that document, depending on the specific kind of program or project. I focus on the most general project life cycle [NPR7120, Fig. 2-5, p. 20], which is reproduced below:

The pattern includes seven phases. There is a Key Decision Point (KDP) between phases. Each decision point builds on reviews conducted during the preceding phase, and the project must get approval at each decision point to continue on to the next phase.

The key products for each phase are defined in Chapter 2 of the NPR and in Appendix I [NPR7120, Table I-4, p. 129].

Pre-Phase A (Concept studies). This phase occurs before the agency commits to a project. It develops a proposal for a mission, and builds evidence that the concept being proposed is both useful and feasible. A preliminary schedule and budget must be defined as well. If the project passes KDP A, it can begin to do design work.

Phase A (Concept and technology development). This phase takes the concept developed in the previous phase and develops requirements and a high-level system or mission architecture, including definitions of the major subsystems in the system. It can also involve developing technology that needs to be matured to make the mission feasible. This phase includes defining all the management plans and process definitions for the project.

Phase B (Preliminary design and technology completion). This phase develops the specifications and high-level designs for the entire mission, along with schedule and budget to build and complete the mission. Phase B is complete when the preliminary design is complete and consistent and feasible.

Phase C (Final design and fabrication). This phase involves completing detailed designs for the entire system, and building the components that will make up the system. Phase C is complete when all the pieces are ready to be integrated and tested as a complete system.

Phase D (Assembly, integration, test, launch, checkout). This phase begins with assembling the system components together, verifying that the integrated system works, and developing the final operational procedures for the mission. Once the system has been verified, operational and flight readiness reviews establish that the system is ready to be launched or flown. The phase ends with launching the spacecraft and verifying that it is functioning correctly in flight.

Phase E (Operations and sustainment). This phase covers performing the mission.

Phase F (Closeout). In this phase, any flight hardware is disposed of (for example, placed in a graveyard orbit or commanded to enter the atmosphere in order to destroy the spacecraft). Data deliverables are recorded and archived; final reviews of the project provide retrospectives and lessons learned.

This pattern of phases grew out of complex space flight missions, where expensive and intricate hardware systems had to be built. These missions often required extensive new technology development. The projects involved building intricate hardware systems that required extensive testing. The NASA procedures for such missions are therefore risk-averse, as is appropriate.

I have observed that many smaller, simpler space flight projects have not followed this sequence of phases as strictly as higher-complexity missions do. Many cubesat missions, where the hardware is relatively simple and more of the system complexity resides either in operations or in software, have blurred the distinctions between phases A through C. In these projects, software development has often begun before the Preliminary Design Review (PDR) in Phase B, and the teams have used continuous integration tools to begin verifying that the software components work together as they are developed rather than waiting for a formal integration activity in Phase D.

At the same time, I have observed some of these smaller space flight projects failing to develop the initial system concept and requirements adequately before committing to hardware and software designs. This has led to projects that failed to meet the mission needs—in one case, leading to project cancellation.

The phases in the NASA life cycle compares with the baseline model presented earlier as follows.

The NASA life cycle splits the system development activities across four phases. The NASA approach does this because it needs careful control of the design process, in particular so that agency management can make decisions whether to continue with a project or not at reasonable intervals. The NASA approach also places reviews throughout the design and fabrication in order to manage the risk that the system’s components will not integrate properly. Many NASA missions involve spacecraft or aircraft that can only be built once because of the size, complexity, and expense of the final product; this makes it hard to perform early integration testing on parts of the system and places more emphasis on design reviews to catch potential integration problems.

The NASA pattern is notable for some initial work on a mission concept starting before the project is officially signed off and started. There are two reasons for this. First, because all NASA missions have common processes, there is less unique work to do for each individual project. Second, NASA is continuously developing concepts for potential missions, and this exploratory work is generally done by teams that have an ongoing charter to develop mission concepts. For example, the concept for one mission I worked on was developed by the center’s Mission Design Center, which performed the initial studies until the concept was ready for an application for funding.

19.2.2 Unified Process

The Unified Process (UP) was a family of software development processes developed originally by Rational Software, and continued by IBM after they acquired Rational. Several variants followed in later years, each adapting the basic framework for more specific projects.

The UP was an attempt to create a framework for formally defining processes. It defined building blocks used to create a process definition: roles, work products, tasks, disciplines (categories of related tasks), and phases.

The framework led to the creation of tools to help people develop the processes. IBM Rational released Rational Method Composer, which was later renamed IBM Engineering Lifecycle Optimization – Method Composer [ELOMC]. A similar tool was included in the Eclipse Foundation’s process framework, which appears to have been discontinued [EPF]. These tools aimed to help people develop processes and then publish the process documentation in a way that would let people on a team explore the processes.

While the UP and its tools gained a lot of attention, their actual use appears to have been limited. I explored the composer tool in 2014, and found that it remarkably hard to use. It came with a complex set of templates, which were too detailed for project that I was working on. Another author wrote that “RUP became unwieldy and hard to understand and apply successfully due to the large amount of disparate content”, and that it “was often inappropriately instantiated as a waterfall” [Ambler23]. Certainly I found that the presentation and tools encouraged weighty, complex process definitions and that they led the process designer into waterfall development methodology.

The UP defined four phases: inception, elaboration, construction, and transition.

Inception. The inception phase concerns defining “what to build”, including identifying key system functionality. It produces the system objectives and a general technical approach for the system.
Elaboration. This phase is for defining the general system structure or architecture and the requirements for the system. The results of this phase should allow the customer to validate that the system is likely to meet their objectives. This phase may be short, if the system is well defined and or is an evolution of an existing system. If the system is complex or requires new technology, the elaboration phase may take a longer time.
Construction. This involves developing detailed component specifications, then building and testing (verifying) the components. This includes integrating the components together into the whole system and verifying the result. The result is a completed system that is ready to transition to operation. RUP focuses on constructing the system in short iterations.
Transition. This phase involves beta testing the system for final validation that the customer(s) agree that the system does what is needed, and deploying or releasing the final software product.

The UP does not directly address supporting production, system operation, or evolution; however, the expectation is that, for software products, there will be a series of regular releases (1.0, 1.1, 1.2, 2.0, …) that provide bug fixes and new features. Each release can follow the same sequence of phases while building on the artifacts developed for the previous release.

The four phases in UP compare with the simple model presented earlier as follows:

The Unified Process provides lessons for defining life cycle patterns: keep the patterns simple, make them accessible to the people who will use them, and put the emphasis on what they are for, not on tools and forms. The basic ideas in UP are good—carefully defining a life cycle, and building tools to help with the definition. I believe that these good ideas got lost because the effort became too focused on elaborate tools and model, losing focus on the purpose of life cycle patterns: to guide the team that actually does the work.

19.3 System development patterns

Some patterns focus only on the core work of developing a system. These patterns generally begin after the project has been started and the system’s purpose and initial concept are worked out. The patterns go up to the point when a system is evaluated for release and deployment. In between, the team has to work out the system’s design, build it, and verify that the implementation does what it is supposed to.

These examples all share the common basic sequence of specifying, designing, implementing, and verifying the system or its parts. Some of the examples include similar sequences of activity to evolve the system after release.

19.3.1 Systems V model

This pattern is used all over in systems engineering work. It is organized around a diagram in the shape of a large V. It is used in many texts on systems engineering ! Unknown link ref. It has also been used to organize standards, such as the ISO 26262 functional safety standard [ISO26262, Part 1, Figure 1].

In general, the left arm of the V is about defining what should be built. The right arm is about integrating and verifying the pieces of the system. Implementation happens in between the two. One follows a path from the upper left, down the left arm, and back up the right side to a completed system.

There is no one V model. There are many variants of the diagram, depending on the message that the author is trying to convey. Here are two variants that one often encounters.

The first variant focuses on the sequence of work for the system as a whole:

The second variant focuses on the hierarchical decomposition of the system into finer and finer components:

The key idea is that specifications, of the system or of a component, are matched by verification steps after that thing has been implemented.

In general this model conflates three ideas that should be kept separate.

Development follows a general flow of specification, then design, then implementation, then verification.
System development proceeds from the top down: start with the whole system, and recursively break that into components until one reaches something that can be implemented on its own.
Development follows a linear sequence from specification and design, through implementation of components, followed by bottom up integration of the components into a system (with verification along the way).

The first two ideas are reasonable. Having a purpose for something before designing and building it is a good idea. There are exceptions, such as when prototyping is needed in order to understand how to tackle design, but even that exception is merely an extension to the general flow. The second idea, of working top down, is necessary because at the beginning of a project one only knows what the system as a whole is supposed to do; working out the details comes next. Again there are exceptions, such as when it becomes clear early on that some components that are available off the shelf are likely useful—but again, that can be treated as an extension of the top down approach.

The third idea works poorly in practice. It is, in fact, an encoding of the waterfall development methodology into the life cycle pattern, and so the V model inherits all the problems that the waterfall methodology has.

In particular, the linear sequence orders work so that the most expensive development risk is pushed as late as possible, when it is the most expensive to find and fix problems. By integrating components bottom up, minor integration problems are discovered first, shortly after the low-level components have been implemented when it is cheapest to fix problems in those low level components. Higher-level integration problems are left until later, when complex assemblies of low-level components have been integrated together. These integration problems tend to be harder to find because the assemblies of components have complex behavior, and more expensive to fix because small changes in some of the components trigger other changes within those assemblies already integrated.

Development methodologies other than waterfall address these issues better, as I discuss in ! Unknown link ref.

19.3.2 Systems or software development life cycle (SDLC)

A number of life cycle definitions for system development, primarily of software systems (coverage is weak on irreversible steps that are not well defined going in)
Covers mainly system development, not so much the early and late phases related to project start up and close out
Defines a number of phases; anywhere from six to ten depending on the source

Plan
Design
Implement
Test
Deploy
Maintain

System investigation
Analysis
Design
(Implementation omitted)
Testing
Training and transition
Operations and maintenance
Evaluation

Initiation
System concept development
Planning
Requirements analysis
Design
Development
Integration and test
Implementation
Operations and maintenance
Disposition

19.4 Post-development patterns

19.4.1 DVT/EVT/PVT

Many electronics development organizations use a set of development and testing phases:

Engineering Validation and Testing (EVT)
Design Validation and Testing (DVT)
Production Validation and Testing (PVT)

EVT. The EVT phase is preceded by developing requirements for the hardware product. It is often also preceded by development of a proof of concept for the board.

During EVT, the team designs and builds working prototypes, often iterating from a first prototype through a few revisions as testing reveals problems with the prototype. The EVT phase ends when the team has a prototype whose design passes basic verification.

DVT. The DVT phase involves more rigorous testing of a small batch of the designed board. The design should be final enough that sample boards can be submitted for certification testing. The DVT phase ends when the sample boards pass verification and certification tests.

PVT. The PVT phase involves developing the mass manufacturing process for the board. This includes testing a production line, assembly techniques, and acceptance testing.

These three phases are all part of the system development phase of the prototype phase pattern I have presented.

This pattern addresses mass production of hardware in ways that our prototype pattern and the NASA pattern do not.

XXX references for this process

19.5 Comparisons and lessons learned

Focus needs to remain on the purpose of patterns: to guide people, meaning they have to be accessible and understandable. They can’t be legal documents with many layers of qualification.

The differences between the four phase patterns I have presented illustrate how a project must adapt its phases for the specific system, vendors, and customers involved.

There is one thing in common in all the patterns: they all put effort into defining the concept and objectives for the system first, before investing too much in developing the system.

The differences between the NASA approach and the other approaches illustrate how phases are structured differently when the system involves the development of expensive components. The cost, in time and money, of building the wrong software component is relatively low. The cost of building an airframe or rocket motor is far higher, and so it is worthwhile to spend more effort ensuring the airframe or motor design is right before beginning to build and test it.

The NASA and DVT approaches show how the need to interact with customers, funders, or suppliers can change the phases. The NASA approach is influenced by the US Government fiscal appropriation and acquisition mechanisms, which require programs to have multiple points where the government can assess progress and choose to continue or cancel a program. The DVT approach is influenced by the way a team needs to work with board and chip vendors to prototype a board, and get it ready for mass production.

The DVT and RUP XXX

Chapter 20: Model life cycle patterns

22 February 2024

20.1 Introduction

Projects generally proceed in a series of phases. Each phase has a different emphasis on what kind of work is done; the focus shifts as the project moves from one phase to another. Different life cycle patterns specify different sequences of phases, as we will see in later chapters on different patterns.

All project life cycles share a common general pattern, as shown below. The patterns differ in the details of how system development proceeds.

Some projects are only concerned with building a system; once the system has been implemented and tested, it goes into production or operation and is no longer the concern of the development team. Those projects only go through the first four phases. Most projects, on the other hand, have some level of involvement after the system is deployed and in operation, such as fixing bugs or enhancing the system. These projects involve all the phases.

The phases are:

Project preparation. Work out and document the processes, rules, standards, and life cycle that the team will use.
Concept development. Determine what the system needs to be or do in order to be useful to its users and to those who will operate it.
System development. Design, build, and verify the system.
Operational acceptance. Review and acceptance by the customer, indicating that the system is ready for operation.
System production. Build and deploy the system, using the artifacts from system development.
System operation. Support the system in operation, including fixing flaws and supporting problem resolution.
System evolution. Add capabilities to the system.
System disposal. Take the system out of operation and ensure that all artifacts are archived, destroyed, or recycled.

This is a minimal set of phases. Many projects will break up some of these phases into smaller ones.

20.2 The example phases

The phases defined earlier in this chapter provide a simple model that can be used to compare and contrast other phase structures, or that can serve as a basis for defining a custom life cycle pattern.

The phases are project preparation, concept development, system development, operational acceptance, system production, system operation, system evolution, and system disposal.

Project preparation. Starts when the idea for a project first comes up, which may be the same time that someone has an idea that starts concept development. There are several decisions taken and artifacts developed in this phase:

The life cycle pattern to be used for the project
Processes for each phase, especially for review and approval of work
Team management processes
Team organization, scopes of authority, and communication patterns
The tools and infrastructure to be used to support the work

The project preparation phase typically overlaps concept development and early parts of the system development phases. It is complete as long as each member of the team knows the rules for how to do their work.

Concept development. Starts with an idea for some customer need to address, and develops the customer objectives that the system should address. There are a few artifacts developed in this phase:

A document of the customer’s objectives
A document characterizing the customer. For visionary projects, which do not have a specific customer in mind, the customer characterization documents who the target customer set is
A document listing the regulatory requirements that apply to the project
A concept of operations, which provides a short statement of what the system will do. This often includes a set of key use cases
Optionally, a preliminary schedule for later phases The concept development phase ends when the customer and the project team are in agreement about what the system should do, as documented in approvals from the customer and project management.

System development. Starts with the customer objectives, and designs, implements, and verifies the system. This phase develops most of the system artifacts, including:

A component breakdown structure
Specification, design, and implementation of each component
Analyses of how the system, and key components, meet customer or regulatory objectives, including safety, security, and business objectives
Verification or test specifications for each component
Verification or test results for each component and for the system
Procedures and processes for producing components and assembling them into a working system The system development phase ends when the system has been implemented and verified.

Operational acceptance. The process for a customer reviewing the implemented system and the evidence collected during verification to ensure that the built system meets their needs, as well as regulatory requirements. The phase results in customer review outputs and the customer’s approval.

System production. This is the phase where the artifacts built up during system development are turned into one or more working, deployed systems, ready for operational use. If the system is to be mass produced, this is where many systems are built and made available for operation.

System operation. In this phase, the system is placed into operation. The team supports the operation by supporting problem identification, analysis, and fixes. The artifacts involved include:

Problem reports
Problem analyses (e.g. root cause analyses)
Updated versions of specification, design, implementation, and verification artifacts
Review and approval of the updates
Problem resolution instructions (such as instructions for installing a new component) Handling problems potentially continues as long as the system is in operation, ending either when the team is relieved of responsibility for fixing problems or when the system is taken out of operation.

System evolution. This phase occurs in parallel with system operation, and involves changing or adding to the system to make it better. The evolution phase is often planned in advance of the first version of a system going into operation—for example, when an organization releases a minimum viable product (MVP) initially and plans on quarterly improvements to the system after that. The artifacts include:

Change requests
Updated versions of specification, design, implementation, and verification artifacts
Review and approval of the updates
Instructions for deploying the updated system System evolution often continues as long as the system is in operation, or until the team’s management or funder determines that no further improvements are to be done.

System disposal. There are two parts to this phase: disposing of the operational system and disposing of all the system artifacts. The development team is sometimes responsible for defining how parts of the system should be taken out of operation and retired, such as taking hardware systems out of service and preparing them for recycling, or destroying any data that must not be preserved. The development team is also responsible for archiving all the system artifacts that were developed so that they can be re-used or audited in future.

XXX pulled from introduction

20.3 Life cycle phases

All project life cycles share a common general pattern, as shown below. The patterns differ in the details of how system development proceeds.

All projects begin with some kind of preparation. The team—or at least its few initial members—work out what rules, life cycle pattern, tools, processes, and standards should be used to design and build the system. This work often runs concurrently with the first technical phases.

An initial concept development phase begins the technical work. During concept development, the team works with the customer or mission to determine what the purpose and objectives of the system are. One must work out the objectives first so that the team spends development effort on building something actually relevant to what will be needed in the end.

The majority of work falls in the system development period. This typically includes activities or phases for:

Architecture development
Specification and requirements development
Specification and design validation
Prototyping and risk reduction
Detailed design
Implementation
Component verification
Integration and system verification

Development ends with accepting the system for operation. The system is not deployed and put into general operation until it is shown to be ready, and the customer or mission has given their approval showing that they have accepted the system from the development team.

Some projects end when the system has been handed over to the customer for operation. Other projects continue to support the system while it is in use: fixing defects and evolving the system as needs change. The project may continue through the end of system operation, when system components are disposed of and information about the system and its operation is archived.

Sidebar: Canceling a project

Projects get canceled all the time. Anecdotally, it seems that more projects are canceled than go to completion—this is a consequence of using competitive approaches to programs, and the net effects of competition are generally regarded as valuable.

Consider two examples, based on projects we have worked on.

In the first project, the team was writing a proposal for a US DoD spacecraft system. In the proposal-writing phase, the team has to establish the basic architectural and management approaches for the project, show they meet the department’s needs, and establish the price at which the team proposes to build the system. The team progressed through establishing the initial concept and architecture for the system, and we began evaluating the solution to see how good a job it would do for the customer and how much it would cost to build it.

We had a checkpoint milestone where we reviewed what we had found. At that review, it became clear that while our team had a decent solution for the needs, we did not have a great solution, and that other companies we expected to propose designs would likely have better solutions (because they had more experience in a couple of key technical areas). We made the decision not to pursue the proposal.

This was a good decision. Assembling a proposal is not a small task; we had a team of about 15 people working long hours. For US government projects, the proposer generally pays for the proposal development. Choosing to spend our team’s time and money on this project meant that the team couldn’t work on some other project. We judged that the opportunity cost was not matched by the probability of successfully winning the contract, so we freed up the team to work on a different system that did prove successful. If we had continued to work on the original proposal, we would have spent the budget available to develop proposals and could not have spent it on the proposal that succeeded.

In the second example, a different US DoD spacecraft program, the team was about two years into a multi-year contract. The team had performed excellently in a competitive first prototyping phase, and was the only team to be selected to move on to a second phase for building an initial working version. A key subcontractor on the team had staffing and management problems, and were not delivering results. Within the team we were struggling to fix the execution problems or find another way to build the necessary components, all the time keeping a large staff on payroll and running through budget. While the technological solutions for many system capabilities were probably sound, the team could not deliver. The customer observed the problem, and after working with the team to try to resolve the problems, went through the process to cancel the project.

This was also a good decision. In hindsight, the team lacked necessary capability in the subcontractor and in the project management team. If the project had been allowed to continue, it is unlikely that the team would have solved the problem and more money would have been spent without benefit in the end.

The take away from these examples is that there are many sound reasons for canceling a project. Sometimes the cancellation is designed in (as with competitive acquisition); other times it is because continuing to invest money, time, and the care of the team building the system has become unlikely to pay off.

For a more general discussion of US DoD project failures, see the report by Bogan et al. [Bogan17].

Part VI: Specifications

Chapter 21: Specifications

8 February 2023

21.1 Purpose

Specification is about recording how a component (or system) should behave or the structure that the component should present. It only documents how the component appears from the outside, as a black box; it does not specify how the component achieves these ends. A specification derives from the less-formal concept for the system or component.

XXX address specification vs requirement

XXX make sure this ties into the broader flow of phases

A specification provides a simplified and abstract view of a component. This abstract view allows one to reason about how the component will work with other components. Without the abstract view, one would have to analyze the details of a component’s implementation to determine whether it will interact properly with another. While that is possible, the work of figuring out how the component will behave only serves to reconstruct design information that was originally worked out when designing the component. The reconstructed information will not necessarily match the information used during design, and the effort is wasteful.

A good specification records the intent and assumptions that went into working out what the component is supposed to do. This information helps the component’s implementer and designer to check that they understand what they need to build, and to check that the specification matches the intent. These assumptions also help people understand how a component might need to change when part of the system is redesigned—to add a new feature, for example. A record of the intentions helps people who come along later to understand the system, and the particular component’s role in it.

Finally, a specification serves as a sort of contract between a component and the rest of the system in which it functions. The people building the component in question can proceed to work on their component with confidence that the result will likely integrate correctly into the system as long as they build to that contract. The people building other parts of the system can likewise proceed with reasonable confidence that when they go to use the component, it will do what they expect.

21.1.1 Good specification properties

A specification is used for several different tasks by different people over the course of a project. A good specification needs to be structured and contain the information needed to support these people.

Specifications should be clear and unambiguous. Each of the people who will read and use each specification need to come to the intended meaning of the specification.

They should be testable. Someone using the specification should be able to look at a design or implementation and determine whether it is compliant with the specification. That does not mean that determining compliance is easy; it only means possible. Sometimes the most that is possible is to build a body of evidence that a design is highly probably compliant. For a specification to be testable, however, the specification can’t contain statements like “approximately” or “fast” or “heavy”; it needs specific values that define what “approximately” (“+/- 10%”), fast (“at least 20 m/s”), or heavy (“greater than 5 kilograms”) mean so that compliance is not a matter of subjective judgment that can differ between two different people.

The specifications need to be organized. A specification is no good if the people who need to use it don’t know it exists or can’t find it. A specification is also not useful if the people who need it can’t tell whether it is currently applicable, outdated, or a speculative proposal. Specification should be kept in one place where everyone on the project can find all of them, and they should be maintained under configuration management.

A good specification is minimal. It addresses the needs for the system or component that have been identified in the concept work leading up to the specification, but it does not add other elements that are not relevant to the identified needs. (Note, however, that the process of developing a specification can often reveal needs that were missed in building up the concept and CONOPS. When those gaps are found, the concept and CONOPS need to be updated as well as addressing the gap in the specification.)

21.1.2 Specification versus documentation

Specification and documentation play different roles. Specification is a record of what something should be, while documentation is a record of what it has been designed and implemented to actually be. Specification deals with the black-box, external behavior, while documentation deals with the internals of the component. The documentation should connect decisions about the component’s internal structure to the external behavior or structure documented in the specification.

21.1.3 Specification needed to scale a project

A small project, implemented by a very small group of people over a short time and thereafter left alone, and that does not provide safety- or security-critical functions, does not necessarily need specification.

Unless all of those conditions hold, some level of specification is necessary in order to communicate between people and across time.

The communication includes:

Between those who design the system at large and those who design and implement a specific component. The specification allows different people to work on different parts of the system, which in turn allows a larger team to do the work, and allows some of the work to proceed in parallel.
Between those who design and implement a component and those who verify its correct behavior. The specification defines the “correct behavior” that verification checks.
Between the original designer and someone who must adapt part of the system later (for a new feature or to fix a problem)—bearing in mind that those may be the same people, but later in time when the details are likely to have been forgotten. The specification conveys a simplified, abstract version of what a component is supposed to be, so that the designer making changes can work out how the behavior should change and so that they can work out whether the changes will likely integrate correctly to address the new feature.
Between the system designers and implementers and those who later inspect the system as part of safety or security incident response. Accident response relies on being able to reconstruct both what happened in an incident and what was supposed to happen. This reconstruction is part of determining what changes need to be made to ensure that similar accidents do not recur, and it is also part of working out how to repair things after an incident (when that is possible).

Sidebar: The role of experience substituting for specification

Every specification is written in terms of some level of common knowledge: language, jargon, subject matter Have to strike a balance between what is assumed and what is explicitly recorded In small and fast-moving teams, temptation to rely on experience rather than writing down the needs, especially when the same person specifies and implements Works in the short term but not in the long term as people change or work becomes shared Is error-prone (example of hysteresis) Disadvantages early career engineers Can be okay if this is a transient condition, and specifications and assumptions are recorded before being forgotten

21.2 Specifications and systems

A specification defines the metaphorical shape that the component should have in order to fit into and support the system.

A specification treats the component as a black box: it considers only how the component should be seen from the outside, without determining how the component’s internals should be designed or implemented. One way to look at the specification is that it defines a contract between the system and the component: if the component behaves according to the specification, the system should work correctly as a whole.

A specification may define behaviors or attributes that in effect narrow the range of possible designs, possibly to only a single design. That situation in itself does not make a specification invalid. However, the specification should not include definitions that are not strictly needed to record needed external behaviors solely in order to constrain the design.

After a component has been specified, design of the internals of that component begins. The internal design often uses sub-components. The designers will develop specifications for the sub-components.

This process repeats recursively to lower and lower components, until one reaches components that have no further sub-components. The result is a tree (or possibly a DAG) consisting of alternating layers of specifications and designs. (This has been called the “layer cake model”.) The design of one component (or the system) responds to its specification. The specification for subcomponents depends on the design that has been selected for the component—the design determines both what subcomponents there are, and how they are to work together.

images. images/spec-layering.svg

21.3 Example

Some years ago, I worked on a rack-mounted computing system that had high reliability and uptime goals. A decision was taken to include a battery pack in each server assembly, so that if the mains power went out the servers would have enough time to record their state on storage before shutting down.

Consider the specification for the battery pack. It may seem simple—provide enough power to run the server assembly for some period of time—but the actual specification contains several subtle elements because its function is entwined with other system-wide reliability and safety behaviors.

Here are some of the system behaviors that affect the specifications for the battery pack:

Keep the server assembly running for long enough to save state to storage and perform an orderly shut down
The server assembly must be able to change its behavior when the battery pack does not have enough energy stored to safely shut down
A server assembly should function correctly for at least X years before needing replacement
When a server assembly is developing a condition that suggests a component will fail soon, the server should detect the likely failure and initiate a repair call
The server assembly must be safe and convenient for a customer to assemble (if needed) and install
The server assembly must not cause a fire, release toxic gasses, or cause other similar harms (which references a safety standard defining the harms to be avoided)
The server assembly must not be vulnerable to supply chain attacks, such as substituting unapproved components, in the period between manufacture and installation at a customer site
The server assembly must fit into a standard 19” computing rack (which references a mechanical standard specification)
The server assembly is intended to operate in a data center environment (which references an environmental specification)

These are rough objectives for the server assembly as a whole. These translate into specifications on the battery pack itself.

Addressing keeping the server assembly running:

The battery pack must provide power at X volts and Y current for at least Z minutes when mains power stops providing power to the server assembly
The voltage and current must be maintained within X of these reference values
The battery pack must provide connectors to deliver and receive electrical energy to and from other components in the server assembly

Addressing the server assembly changing its behavior:

A management system attached to the battery pack must be able to determine when the battery pack state of charge is sufficient to meet the voltage, current, and time need, and when the state of charge is not sufficient ⇒ The battery pack must provide information that allows accurate estimation of its state of charge (e.g. having an accurate relation between voltage and state of charge)
A management system must be able to account for temperature and battery age in making this determination ⇒ The battery must have predictable changes in behavior as temperature and age change

Addressing the server assembly lifetime:

The battery pack must not degrade its storage capacity more than X% over Y years
The battery pack must support a management system detecting when the storage capacity degradation is approaching a level when battery pack replacement will be needed

Addressing likely failure:

The battery pack must exhibit behaviors that allow a management system to detect degradation or incipient faults, such as failure of one battery cell or overheating
The battery pack must have identification that allows batteries from the same manufacturer and manufacturing lot to be tracked for common failure patterns

Addressing safe and convenient customer installation:

The battery pack should be shipped discharged
The battery pack should be safe to handle without protective equipment as long as its case or assembly is not damaged or modified
The battery pack should not make the server assembly too heavy for one person to lift safely
The battery pack, if separately installable, must not have error-prone cabling that a customer might hook up the wrong way

Addressing fire, toxic gasses, and similar safety issues:

The battery pack should be sealed against emissions up to at least X internal pressure
The battery pack should be able to handle up to X impact without taking damage that would lead to thermal or mechanical failures
The battery pack should provide sufficient information to a management system that the management system can disconnect the battery when a failure is beginning, and so that the management system can trigger external safety systems

Addressing supply chain attacks:

The battery pack must be handled only by trusted employees and agents of the company from manufacture to installation
The battery pack must include identifying information that is hard to falsify or replicate

Addressing fitting into a standard rack:

The battery pack must weigh no more than X kilograms
The battery pack must occupy a volume no greater than X by Y by Z millimeters
The battery pack must connect to the rest of the server assembly using X brackets

Addressing the environmental conditions:

The battery pack must function nominally in a given temperature, air pressure, humidity, air gas composition, and dust environment

These example objectives are not all of what would be needed for a server battery pack, but they illustrate several of the kinds of concerns that the battery pack’s designers will need to consider. These rough objectives must be turned into more precise specifications in order to guide the designers accurately. For example, some of the statements above use subjective words like “nominally” that need to be made precise. Other statements are too general and need to be decomposed into a set of more specific statements.

21.4 Kinds of specifications

“Specification” is a deliberately broad term, encompassing many different ways of recording what something should be or do (and why).

Many people assume that “specification” means “requirements”. While requirements are one kind of specification, they are not the only one—and requirements are not generally sufficient by themselves to record all the information needed about behavior or structure.

Kinds of specification include:

Requirements. These are short prose statements about what is expected of the component. These statements have the advantage that they have all the flexibility of written language. A reader only needs to know normal language and the associated technical jargon to understand the meaning of the statement. Requirements also have the advantage that they are easily organized and identified in an outline or table form, so that one can trace through each individual statement. Requirements inherit the disadvantages of written language that lead people to devise other notations or diagrams.
Interfaces. While a specification is in general a definition of the overall interface between a component and the system around it, in many domains part of the specification is gathered into an interface (often recorded in an Interface Control Document). A software application programming interface, which defines data structures, function calls, and communication steps used to communicate with a component, is one example. An environmental specification is another example that specifies the heat, vibration, atmosphere, electromagnetic radiation, dust, and other environmental conditions that a component must handle correctly.
Models. These are (semi)formal expressions about structure or behavior, written in a specialized notation or diagram. These expressions have the advantage of being able to express needs that are not easily expressed in prose, such as physical or temporal relations, or mathematical statements. These have the disadvantage that readers often have to learn new notations in order to understand what the expressions mean. Many notations do not lend themselves to easy enumeration of each individual condition that the component should comply with.

There are many kinds of models.

Mechanical models show the physical interface that the component should support. This can be as simple as a volume in which the component must fit. It might include connectors for how the component is attached to other components in the system. More advanced models might include the forces that may act on the component, and how the component must pass those forces from one place to another, or how a component may deform in response to forces.
Mathematical models document how a component behavior should respond to outside inputs, expressed as mathematical formulas. This can range from a simple linear response to an input to a complex and stateful impulse-response model. Many of these are expressed on a continuous time domain.

u(t)=K_{p}e(t)+K_{i}\int_{0}^{t}e(\tau)d\tau+K_{d}\frac{de(t)}{dt}

Example: control function for a PID controller XXX

Behavioral models document how the component will perform sequences of output actions in response to inputs. Many of these models are expressed as diagrams: state machines, sequence or activity diagrams, or timelines. Some models, such as state machines, express all the possible sequences of behavior that could happen; other models, such as sequence diagrams, express one or a few specific sequences without being exhaustive. Many of these models operate on a discrete time domain.

Example: TCP session opening state machine XXX

Domain-specific models. These express needs that are specific to some problem domain, and that are not easy to express in one of the other kinds of models. For example, Earth-orbiting spacecraft typically have an orbit specification, which can be used to determine where the spacecraft will be at various times in the future and what the conditions will be at those times. A material specification might include parts of the phase diagram for the properties of the material at different temperatures and pressures, indicating some conditions in which the material must be solid or gas. Most domain-specific models have an underlying mathematical definition, but the needs can be expressed more clearly using the domain-specific notation rather than only using the underlying formulas.

21.4.1 Combining multiple kinds of specification

In practice I have found that no one kind of specification meets all needs, and have used multiple kinds of specification together.

Generally, each kind of specification we use meets the good specification objectives of being clear and testable, as defined earlier.

Mixing multiple kinds of specification, however, requires care in organizing the specifications. Different kinds are often written and stored in different tools (a tabular tool for requirements; a CAD tool for mechanical drawings). This easily leads to a situation where a practitioner cannot find all of the specifications to which they need to be paying attention.

One way we have addressed this is to use a table of textual requirement statements as a primary specification, and include requirements like “the component shall comply with state machine X”, including a reference to the drawing of the state machine. Using a tool that makes all these forms accessible through one common user interface helps make this convenient for users. Using tools that can perform configuration management across all the different forms of specification also helps.

21.5 Using specifications in a system

We first look at how specifications are developed and used from the outside: from the perspective of those who are concerned with how a component fits into the system, and not with what the specification means for the design internal to a component.

A specification for a system derives from the objectives and CONOPS developed during the system concept development phase.

The system-level specification leads, in turn, to a system design and then recursively to the concepts and specifications for components in the system.

21.5.1 Building the specification

This is the first step in using specifications. The specification developer looks through all of the conceptual material assembled for the system or for a component, and organizes and formalizes it to make a specification.

In practice this does not happen all at once. People develop the various kinds of objectives that lead to the specification iteratively, and parts of the specification will be developed as the objectives and concept becomes clear. As people develop the specification, they will identify gaps in the concept, which will lead to improvements in the objectives and CONOPS and in turn lead to updates to the specification.

21.5.2 Evolving the specification

The needs that a system solves change over time. New capabilities get requested. Regulations evolve. Problems with the system are found and need to be fixed. All of these can lead to changes in the concept and thus to changes in the system specification.

The concept and design of components also changes, and for similar reasons. As well, a component may have a perfectly adequate design, but it may become outdated because subcomponents become unavailable. This leads to a redesign of a component, inducing new specifications for subcomponents.

It is important to follow an organized process when a specification changes. Many process standards recommend specific approaches; for example, ISO 26262 [ISO26262] specifies that any change to a system must begin with an impact analysis, which determines how a change to objectives or specification propagates through the design of the system, and downward through the hierarchy of components. Standards like that also specify that the specifications and designs be maintained under configuration control so that everyone can know whether a change is a work-in-progress proposal or has been committed to.

21.5.3 Validating the specification

The specification must reflect all of the needs identified in the concept from which it derives, and the specification must not add needs that do not appear in the concept and objectives. Before a specification can be declared complete, someone must go through all the material in the concept to check that the specification accurately reflects each of the identified needs or objectives.

A specification validation exercise can also help identify gaps in the objectives. Checking the specification often involves someone who was not part of developing the objectives and CONOPS; a fresh perspective can lead to asking questions about the objectives or the specifications that in turn lead to discoveries of topics that are missing.

21.5.4 System consistency

As the system design grows and more and more components are defined and specified, someone needs to check that the designs and specifications are all consistent. This is especially important for “long distance” dependencies: where the correct function of one component depends on the correct function of another component in a different part of the system. (More formally, when two components A and B depend on each other for correct function, and the lowest common parent of A and B in the component hierarchy is near the top of the hierarchy.)

21.5.5 Safety and security design

As we will discuss in future chapters ! Unknown link ref, the safety and security properties of a system must be designed top down, and they need to be defined early in system development, before too many low-level components are designed.

We advocate using the systems safety methodology ! Unknown link ref, which emphasizes starting with the accidents or losses that are to be avoided, and then the conditions that must be maintained in a system to achieve safe operation. (This is different from many safety methodologies, such as functional safety, which focus on safety in the face of failure conditions and do not address safety problems arising from design or component interactions.) The categories of losses come from the safety and security objectives defined in the concept development phase.

Some example conditions:

A spacecraft must not perform any propulsive maneuvers, emit radio frequency radiation, or perform any physical deployments within 1 km of the upper stage vehicle from which it was deployed. (This example is based on real launch vehicle safety requirements.)
A planetary lander must maintain engine thrust until it has touched down on the surface. (This example is based on the accident analysis of the Mars Polar Lander spacecraft; see ! Unknown link ref for details.)
An autonomous aircraft must not take operational commands from an unauthorized source.

Once these conditions are identified, systems engineers must determine how to address them in the design of the top-level system. They must then create derived specifications for each of the top-level components in the system, and show that if each of the components meets its specifications the overall system will exhibit safe or secure behavior by complying with the safety and security conditions. This process is repeated through at increasingly lower levels of the system.

21.5.6 Review and approval

A specification guides the design and implementation of parts of the system. Given the importance of this role, a specification—or an update to a specification—should be reviewed before being committed to. Each specification should be checked by the people whose work it affects: system designers, the designer of the component or system that contains the thing being specified, potential implementers, and those people who are working on components that will interface with or use the component being specified.

As with other system artifacts, a specification or specification update should be under configuration management so that each user can determine whether they are using the correct version or not, and whether the version they are using is a proposed or work in progress version, is the current approved (baselined) version, or a version that has become obsolete.

21.6 Using specifications for a component

We now turn our attention to those people and activities who use a component to design and implement a component; that is, who are concerned with how the internals of a component reflect its specification.

There are two tracks of activity that use a component’s specification:

One track follows the design and implementation of the component itself, which should result in a component that complies with the specification. The other track follows the design and implementation of verification methods, such as tests or static analyses. The tracks come together when the implementation gets checked by the various verification methods, resulting in a determination of whether the implementation is in fact compliant, or whether the design and implementation need to be fixed to bring it to compliance.

21.6.1 Learning about a component

A specification is an abstracted view of what a component should be. That makes it useful as a guide for someone who needs to learn about a component, before diving into the design or implementation of that component.

Someone who is learning about a component—or about the structure of the system across many components—needs to be able to find the relevant specifications. The specifications should be organized to support them:

Components should have names or identifiers
Specifications should be stored in a repository that allows people to find a specification by its component id, or be able to navigate through the structure of the system to discover information about components
Users should be able to start their search in a single place, well-known to all the team, without having to discover multiple tools that might or might not be readily apparent to new team members

21.6.2 Designing and implementing to specification

The general task of a designer or implementer is to create a component that complies with its specification. In practice, of course, this is a complex activity.

The designer needs to be able to clearly identify all of the behaviors or capabilities that the component must implement. This implies that the specification must be organized in a way that helps the designer find all of these, and in a way that can serve as a checklist for tracking which features have been satisfied and which have not yet.

As we will discuss further in upcoming chapters, the designer or implementer should be able to identify which aspects of the component have the highest design risk or are the most technically complex. The designer and implementer will often choose to focus on these hard aspects first, before dealing with aspects that are easy to solve. The hard aspects are often candidates for prototyping, in order to determine if a design approach is feasible and can meet the specification. (See XXX for more on prototyping and risk reduction.)

Complex systems and components can benefit from the combination of incremental development and continuous integration. Incremental development involves selecting a few parts of the component’s specification and implementing those, followed by testing. Once those aspects of the component appear sound, the developers perform a second iteration by selecting a few more aspects of the specification and adding them to the design and implementation. Continuous integration, in this context, involves performing integration testing of these partial designs and implementations in a skeleton of the rest of the system. The partial implementation of this component may use mockups of subcomponents, or interact with mockups of peer components in the system. We discuss incremental development and continuous integration more in XXX.

As people work through design and implementation, they are likely to find problems or gaps with the specification. The specification may be ambiguous in some part, or the specification may not define the behavior for some condition. The developers must be able to work with those who defined the specification to sort out these issues. The developers should check the specifications in depth, asking the specifiers questions to check their understanding or to confirm that there are issues. The developers then should work with the specifier to resolve the issues.

The developers should not make an assumption about a gap or ambiguity and move forward without confirming their assumption. The people who wrote the specification are responsible for ensuring that the specifications for different components are consistent and address large-scale safety or security concerns. The behaviors needed to support correct interaction are encoded in the specification. The developers are responsible for implementing components that correctly support these behaviors so that the resulting system works correctly. The developers do not necessarily have the big-picture perspective to make changes to these critical behaviors, and do not necessarily know who else needs to know about an assumption in how a component is defined. The developers need to work collaboratively with those responsible for the specifications so that all the pieces of the system remain consistent and correct, and so that everyone involved shares a common understanding of how the components and system are to function.

A component’s implementation will need to be verified against the component’s specification. People using continuous testing or test-driven development methods have had good results producing correct component implementations efficiently by testing an implementation in small increments as functionality gets added to it. This reduces the risk that the design or implementation has made some fundamental, early mistake that becomes increasingly expensive to correct as more functionality is implemented on top of the erroneous implementation. Performing continuous testing (or verification) requires having verification cases defined and implemented concurrent with the implementation of the corresponding functionality.

Finally, each component design and implementation will need to be reviewed and approved before being accepted as finished. Verifying that the design and implementation comply with the specification is a major part of the review process. The review activities will be much easier if the specification is well organized.

21.6.3 Evolving specifications

As mentioned earlier, a component’s specification will likely change when a system remains in use for a long time. Systems engineers will need to investigate the impact of making a change to a specification before committing to the change.

The component designers and implementers are part of the investigation process. While a systems engineer can look at what will change in how a component interacts with other parts of the system, the component designers and implementers are better positioned to evaluate the effect that a change in specification will have on implementation or verification.

To change a design and implementation in response to a change in specification, the developers need to correctly determine what has changed in the specification. Having a clear mechanism for showing what requirements have been removed, added, or changed, and for showing specifically how other parts of the specification have changed, makes this task possible. In particular, being able to accurately enumerate every change is important; the developer should not have to hunt for subtle changes that may be hidden.

The decisions that are encoded in a component’s design include how different parts of the component interact with and depend on each other. When a component’s design is to be changed in response to a change in specification, some parts of the design will be directly affected. For example, a decision to add a new input message to a component directly implies that new message reception and handling functions must be implemented. However, one change can affect other parts of the existing design, and the designer and implementer must find and address all of these effects. The example new input message, for example, might require changes to a database schema for storing additional information, or might affect response time behaviors that require changes to foundational concurrency control capabilities in the design. Having a clear record of how parts within one component are designed to depend on or affect each other reduces the effort involved in making this kind of change, and reduces the chances of an error stemming from some dependency being overlooked.

21.6.4 Verifying a component

The specification defines what a component should be or do; the design and implementation define how it is or does these things. Verification is the process of ensuring that the implementation produces behaviors that match the specification.

Every element of the specification should have a corresponding method for verifying compliance of the implementation. Different aspects of the specification will require different methods: some aspects can be verified by testing, such as showing that given some input A, the component responds with behavior B. Other aspects will require demonstration, such as showing that a physically representative user can see and reach control devices. Some aspects—especially safety and security—can only be verified by analysis or formal methods, such as showing that a component never enters performs some action identified as unsafe.

Verification methods involve design and implementation, similar to the design and implementation of the component itself.

Designing a verification method involves, first, determining how a specification property can be verified. (Sometimes a property is best verified using more than one approach in parallel.) once the approach—testing, review, demonstration, or analysis—has been determined, the next step is to design how that specific specification property will be checked. That can involve designing a set of test cases that cover the expected behaviors, or defining a test procedure to evaluate a mechanical component, or defining who will perform a review and what they will look for.

Implementing a verification method turns the design into a specific set of tools and actions that, when used, give a yes-or-no answer to whether the component is compliant.

The verification methods can have errors. Indeed, in some cases the verification of a property can be more complex than the component implementation it is checking. This means that the verification designs and implementation need careful scrutiny to ensure that they are, in fact, checking the specified properties and not something else.

The verification methods also must be complete: if some property is worth specifying, it is worth verifying. The verification designs and implementations need to be checked to ensure that they cover all of the specification. Explicitly recording which parts of the specification any particular verification method checks helps the task of checking completeness.

Finally, it is common for project management to track what portion of a component’s specification has been completed and verified. This can be organized by identifying each property in the specification, and tracking which verification methods check each one. As verifications are done, the project managers can determine which parts of the specification correspond to verification activities that passed.

21.7 Specification artifacts

Specification activities take as input the objectives ! Unknown link ref and CONOPS ! Unknown link ref artifacts that were generated during concept development.

The specifications themselves involve:

The component breakdown structure ! Unknown link ref, which organizes the components that make up the system and define the entities that will have specifications written for them
For each component, some kind of index or table of contents that organize all the elements of the specification for that component
Tables of requirements ! Unknown link ref, interface definitions ! Unknown link ref, and models ! Unknown link ref that make up the specification

The elements in the specification should include traces that show how each individual part of the specification derives from some part of the objectives or CONOPS, and conversely how each part of the objectives is reflected in the specification.

The specification artifacts should be maintained under configuration management. That means that there should be a common repository that everyone working on the system can use to retrieve (and potentially update) the artifacts. The repository should maintain separate versions of each artifact, and clearly identify which version is the current, baselined version that people should use, which versions are outdated, and which are works in progress.

The configuration management system should support people reviewing a specification, and must support recording when a particular version has been approved to be baselined.

Chapter 22: Requirements

1 April 2024

22.1 What are requirements?

Requirements are one kind of specification: they say something about a property that a component or system should have, or a behavior they should exhibit.

A requirement is a specification in the form of a single, declarative textual statement. In the simplest case, a requirement is a statements of the form:

For example,

The encabulator shall be colored green.

There are many nuances and variations on this basic form, but they are all extensions of this basic idea.

Requirements are written this way in order to maximize the simplicity and clarity of the specification.

Requirements are only one part of the specification for a component or system. They document specific facts about a system’s design, but they do not document the explanation of how that particular design came to be. They do not document the general purpose and scope of a particular component. They do not document complex interaction patterns. These other parts of a specification are documented in other design artifacts that complement requirements.

22.1.1 Why write requirements?

One of the jobs of systems engineering is to ensure that a user or consumer of some artifact (system or component) will be satisfied with the artifact once it is built and deployed.

The specifications for a system or component serve as a way to organize the information about what the user wants, and to organize the process of checking that the final result meets the user’s desires. The specification thus acts as a kind of implicit contract between the end user and the implementers: if the user agrees that the specification properly records their objectives, and the resulting system can be verified to meet the specification, then then the implementers have built something that satisfies what the user agreed to. (Whether the user is actually satisfied is a separate matter.)

XXX would a couple diagrams help here? A first one might show user → conceptual artifact, conceptual artifact → developer → concrete artifact; a second one might show systems and verification in the picture?

This means that there are three main uses for requirements (and the rest of specifications):

Encoding the user’s objectives in a written form, and allowing the user to validate that the specification matches what they want.
Guiding implementers as they work out the design for the artifact that will meet the user’s objectives.
Providing a checklist for verifying that the resulting artifact meets the specification, and thereby the user’s objectives.

A systems engineer is typically the keeper of the specifications, responsible for overseeing the writing, changing, and verification of requirements and other specifications.

Requirements—and all specifications—are therefore acts of communication between multiple groups of people with different roles in building the system.

Systems engineers are facilitators and interpreters in this communication between users and implementers. They are responsible for translating information received from users into specifications (including requirements), for explaining the specifications back to the users for validation. The information from the user is often unstructured and incomplete. It is up to the systems engineer to work with the user to clarify their objectives and ensure that the result accurately reflects the user’s intent. The systems engineer also works to ensure that the specifications are complete. This often involves identifying use cases that the user has not thought of themselves and working with the user to define what behavior the system should have in those other cases.

The systems engineer also facilitates the implementer’s work. The systems engineer develops specifications so that the implementer has a clear guide to what they need to design and build; this requires that the systems engineer provide translation or explanation when the specification does not use the same terms or concepts that the implementers do. The systems engineer is also responsible for ensuring that the final artifact meets the customer’s objectives by overseeing the verification of the implementation against requirements (and other specifications). This involves working with verifiers to ensure that verification methods match the requirements, and checking that all requirements have been verified before the system is declared done.

A systems engineer performs other tasks using requirements, such as checking consistency or completeness. We will discuss these tasks in a later section.

A good requirement must meet several objectives in order to provide accurate communication between all these parties:

A requirement must be unambiguous. Each of the parties must be able to read the text of a requirement and have the same understanding of what it means.
A requirement must have an unambiguous meaning about the system.Two people should be able to look at the system and read the requirement, and come to the same conclusion about how the system is supposed to behave if the system meets the requirement. Many people express this as “a requirement must be verifiable”.
A requirement must be accurate. A requirement that misleads a reader has no point.

These needs lead to conventions about how requirements are written and organized, as we will discuss later.

22.1.2 What are requirements about?

Requirements are a general-purpose way of writing down facts about what something is supposed to be (or not be).

Requirements can apply to just about anything. In a typical system project, they will be used to:

Record the objectives that end users have for the system as a whole
Record the regulations and policies that people governing the system
(or its market) impose on systems like this
Record the behaviors and properties that individual components need
to have on order to fit together into the system as a whole
Record the processes that the system’s implementers must follow
while building the system
Record elements of the contractual relationships between the
system’s implementer and vendors that supply parts or services
Record policies for the behaviors required of people working with the system

22.1.3 The context for requirements

Requirements don’t stand on their own.

Most requirements in a system will apply to particular components in the system. The component breakdown structure provides the list of components that requirements can be about.

Requirements are part of more general specifications for the system and its components. The specifications include

A definition of the scope and purpose of each component
The structure of how components connect together to make up the system
Analyses of the system design, such as fault trees or performance models
Specifications of behaviors that cross many different components

The requirements must be consistent with these other parts of a component’s specification.

In the end, requirements are satisfied by the implementation of the components in the system. Being able to trace the connection from a component’s requirements to the pieces of the implementation matters in order to be able to show that the requirements are satisfied.

22.2 A single requirement

A requirement itself is a single statement about something that should be true about something.

More formally, a requirement has three parts:

A subject, naming what thing the requirement applies to. The subject can be the whole system, a part of the system, or sets of parts.
A mode, naming the relationship between the subject and the properties. A word like “shall” or “must” is the most common mode, indicating that the property must be true or shall be true about the subject.
A property, encoding a predicate that is to be true about the subject. This can be just about anything, as long as everyone involved can agree on what it means for the property to be true. It can describe a behavior that the subject should provide, or some characteristic that the subject should maintain. The property is usually written as a verb followed by the object of the property, such as

The main winding of the encabulator shall be placed in panendermic semi-boloid slots of the stator

where “be placed” is the verb.

Some examples:

The spacecraft shall have a mass no more than 1 kilogram
- Subject: the spacecraft as a whole. The spacecraft is made up of many subcomponents; this requirement applies to all of them put together into the complete spacecraft system.
- Mode: shall—this is something that we require to be true.
- Property: have a mass no more than 1 kg. This is a measurable property of the spacecraft; once assembled, it can be put on a scale and checked.
The electrical power subsystem (EPS) must provide a minimum of 25 W in standby mode
- Subject: the electrical power subsystem. This is a named subsystem within a spacecraft.
- Mode: must—this is something that we require to be true.
- Property: provide a minimum wattage when in a particular mode. This is a conditional property: it only applies when the EPS is in standby mode. For this property to be meaningful, the design of the EPS must include a set of modes, and “standby” must be one of them.

22.2.1 Example

Consider an example of a statement of what the mission manager for a small spacecraft mission wants:

A spacecraft mission wants a small spacecraft that is expected to operate in low Earth orbit (LEO) for at least three years.

This sentence has a number of problems. It mixes statements together: the mission and the spacecraft, the operating environment and the lifetime. The sentence is not very precise: what is “low Earth orbit”? What does the spacecraft have to do to “operate”? It is unachievable: nobody can guarantee that a spacecraft will function for a particular duration as an absolute guarantee; what if there is an unusual solar flare that fries its electronics?

We can improve the example sentence a bit by splitting it into three requirements statements:

The mission shall operate a small spacecraft
The spacecraft shall operate nominally for at least three years with 95% probability
The spacecraft shall operate nominally in low Earth orbit between 200 and 400 km altitude

These requirements improve the original statement. First, it splits the original so that each requirement is about a single topic (and is written in the subject-mode-property form). Second, it improves the description of two of the requirements by making them more achievable (“95% probability”) and precise (altitude range given).

These three requirements in themselves are not sufficient. Before the requirements are done being written, for example, there will need to be a definition of what “operate nominally” means. Similarly, the “at least three years” requirement is not enough by itself: three years would be difficult or impossible to meet if the intended environment were the surface of Venus; it would be almost trivially easy in the intended environment were an air conditioned clean room. Adding more information about the environment is necessary to interpret the three-year condition—for example, what is the expected radiation environment at those altitudes?

The three example requirements are not sufficient in another way: they are high-level and provide the designer of, say, a battery subsystem no guidance about how the battery must be designed so that the spacecraft meets these requirements. The derivation or flowdown is the topic of an upcoming section.

22.2.2 Rationale

A well-written requirement is concise. As such, it makes a statement about what a component should do—but the text of the requirement does not capture why the component should do that.

Good requirements should include a rationale statement that documents the thinking that went into choosing to make the requirement. The rationale does not change the requirement; it only adds explanation. The rationale helps those who must come along later, after the requirements are written, to understand or evaluate the requirements. It helps educate other engineers about considerations that may not be obvious. It helps those who later need to revise requirements understand what constraints there may be on the requirement they are changing.

22.3 Multiple requirements

Requirements actually come in groups; they are practically never singular.

The meaning of a group of requirements is the logical and of all of them. If there are ten requirements, an implementation complies with the requirements if it complies with all ten of them individually.

There are two issues to watch out for when there are multiple requirements: contradictions and exclusivity.

Contradiction: Two requirements contradict if complying with one of them means that it is impossible to comply with the other, and vice versa. Every collection of requirements must be checked to ensure there are no contradictions. The section on consistency below discusses this further.

Exclusivity: If a collection includes a requirement

A must do X,

it is perfectly reasonable to also have another requirement

A must do Y.

Having both of them means that there are two things that A must do.

The question then arises: if component A also does Z, is that compliant or not? In some cases it is okay if A does Z (it has a feature that isn’t used) and sometimes it is not (if it is important that A only does X and Y and nothing else ever).

The answer is that having requirements about doing X and Y means that the requirements are silent on Z. If the requirements are silent on a topic, that topic is not considered important and it doesn’t matter for compliance. (If the topic is important, it needs to be included in the requirements.)

If it is important that A only does X and Y and nothing else, that needs to be stated explicitly. This can sometimes be written directly into one requirement:

The component must be colored one of red, green, or blue

This can also be written in a general negative form:

The component must not do any activity not listed in these requirements

Explicitly listing the allowed activities is preferable to a “must not” requirement—the negative form is convoluted and easy to misread.

22.4 Organizing requirements

Even a moderately-sized system will typically have thousands of requirements. Users need some kind of organization of all those requirements in order to find the requirements they will be working with.

There are three concepts to discuss: organizing by subject, organizing by sections, and hierarchical writing.

22.4.1 Levels of requirements

People use requirements for different purposes. This leads to fundamentally different kinds or requirements.

At the most abstract level, the general product or mission objectives capture what stakeholders want the system to do—its purpose. These almost always start as general, vague statements. The stakeholders, system engineers, and product managers refine these over time into a clearer definition of the system’s purpose. The exercise may or may not result in proper requirements statements, but it is worth treating the results as if they are requirements and showing how the top-level system requirements derive from these objectives.

Projects also have guiding objectives that do not specify the system directly, but instead define policy or standards that the system must adhere to. There are many kinds of policies, including:

Business policy: common rules for how the organization operates; expected profitability margins; expected customer base
Security policy: what kinds of hazards, threats, and attacks should be addressed in the system design; what should be protected; rules for prioritizing different attacks
Safety policy: what entities should be considered for safety; what hazards to consider; acceptable incident rates for different hazards; can include things like launch assurance standards
Regulatory policy: regulations and laws that the system must comply with
Development standards: standards that govern how the team designing and building the system should go about their work, including quality standards such as ISO 9001 or process standards such as NASA NPR 7150.2
Other standards: industry or government standards that will inform the design of the system, such as standards for hardware components or for human-machine interface design

It is helpful to organize the product/mission objectives and all the various policies and standards into separate collections, identified by the kind of policy or source of objectives. For example, one can maintain one collection for business policy and a separate one for the quality assurance standard being used to build a system.

The top-level requirements on the system as a whole are part of the formal or semi-formal definition of what the system is to do. These requirements say what the system is and does when looked at from the outside, as a black box. These requirements are best kept separate from the more vague product/mission objectives—the objectives represent desires, while the top-level requirements represent the commitments made for what the system will do. The derivation mapping from objectives to top-level requirements provides a place to record the rationale for why different decisions were made about the commitments in the system, and why the decision was made not to commit to supporting some desires, represented in objectives.

Requirements on lower-level components provide definitions of what the pieces that make up the system must do. These obviously have a different scope than the top-level requirements for the whole system.

22.4.2 Organizing by subject

The first concept is that requirements should be organized by their subject, following the component breakdown structure.

The system objectives are those requirements that apply to the system as a whole. These typically encode the CONOPS for the system, along with requirements derived from the process or design standards.

The rest of the requirements apply to specific components within the system. The component breakdown structure defines what the components are, and gives them names.

Organizing by component is important for proper verification, so that each requirement can be connected to the implementation artifacts that are expected to comply with the requirement, and so that the implementer of some component can properly determine all the requirements they need to adhere to.

22.4.3 Organizing by section

One single component or process/design standard can often have several hundred requirements. Users can find and work with all these requirements more easily if they are organized by topic as well as by subject.

This can be done by creating a set of topic sections within each component. Often these sections are the same for all components—sometimes empty when they are not relevant, but having the same organization across all components help people find what they are looking for.

There is no one recommended set of sections that will apply to every system. The choice of sections is affected by the kind of system or components being developed, as well as by process and design standards. For example, if an automotive project is following the ISO 26262 Functional Safety standards [ISO26262], the Safety Goals and/or Safety Requirements should be collected into one section.

As a starting point, we have used variations on the following set of sections in several projects:

Operational requirements. These relate to how the component turns on and off, and the major operational modes of the component.
Functional requirements. These record the behaviors or functions that the component is expected to provide (when it is turned on, if that applies)
Service or lifecycle requirements. These address how the component is to be manufactured, installed, maintained, and retired
Budget limitations. This includes data rates, mass, power consumption, RF emitted or received, heat emitted, or gas/vapor emissions
Environment. What the component must be or do in its environment. For example, vibration resistance, gas or dust resistance.
Discipline-specific sections. This includes sections for safety, security, and regulatory compliance.
A miscellaneous section for requirements that don’t fit into other categories.

It’s a good idea to work out one or a few section structures that work for your project, then use those sections consistently across all components.

Keep in mind that some requirements will always fit into multiple sections. For example, a requirement may both be about regulatory compliance and define a function the component is supposed to provide. Try to make consistent choices about which section a requirement goes in, but don’t try to make some perfect hierarchical section scheme that would let people avoid making such choices.

22.4.4 Hierarchical versus flat requirements

There are two general structures for organizing requirements on a particular topic:

As a single-level or flat collection of requirements, all at the same level
In a hierarchical or outline structure There are pros and cons of each approach. The choice of which you follow will depend on the style of working in your project, the capabilities of the tools you use, and the process requirements that apply to your organization.

The flat organization has all requirements within a section be at the same level. Each requirement is independent of the others and can be understood only by reading the text of the requirement.

The hierarchical organization places requirements into an outline, with general requirements and more specific sub-requirements. The sub-requirements must be read and understood in the context of their parent. The sub-requirements provide details, clarification, or limitations on the general parent.

Consider a set of requirements for security on a TCP/IP communication channel. The general requirement is that the communication channel should be authenticated and encrypted. In outline form, this looks like:

Communication channel X must implement security mechanisms
1. Communication channel X must require authentication before application data is exchanged
  1. The authentication protocol must mutually authenticate both parties to each other
  2. The identity being authenticated must be granted by the organization’s security management system
  3. The authentication protocol must be resistant to man-in-the-middle attacks
  4. The authentication protocol must support revocation of either party’s credentials within X minutes
2. Communication channel X must maintain integrity and confidentiality of the application data being exchanged
  1. The confidentiality protection must be resistant to traffic analysis

Consider requirement 1.1.1, requiring mutual authentication for the communication channel in question. The requirement for mutual authentication must be understood only to apply to communication channel X. There could well be another communication channel, called Y, that does not have the same authentication requirements.

Written in a flat style, the requirements might be expressed as:

Communication channel X must require authentication before application data is exchanged
The authentication protocol used for communication channel X must mutually authenticate both parties to each other
The authentication protocol used for communication channel X must use identities granted by the organization’s security management system
The authentication protocol used for communication channel X must be resistant to man-in-the-middle attacks
The authentication protocol used for communication channel X must support revocation of either party’s credentials within X minutes
The communication protocol used for communication channel X must maintain confidentiality and integrity of the application data being exchanged
The communication protocol used for communication channel X must provide confidentiality protection that is resistant to traffic analysis

Each of these statements can be read on their own; each statement includes all the necessary qualifications (“the protocol for communication channel X must…”) to identify the scope of its subject without having to refer to other statements.

There are pros and cons of each approach.

(Pro) Flat organization allows each requirement statement to be read and understood on its own, without reference to other requirements
(Con) Flat organization requires some requirement statements to include a lot of redundant text to identify the scope of the subject. The redundancy makes it more tedious to write the statements, and leads to errors where one of the qualifying clauses in the subject doesn’t get edited and becomes inconsistent with the others
(Pro) Hierarchical organization makes it easier for the reader to scan for a particular requirement, and makes the individual statements easier to understand
(Con) Hierarchical organization means that sub-requirements do not stand alone, but must be read in the context of their parent.

22.4.5 Requirement identifiers

Every requirement needs a unique identifier.

People use this identifier to refer to the requirement, including using it as a bookmark or link to reference the requirement in other documents. Software check ins to a repository often use the requirement identifier to indicate what functionality is being added to the repository. Task management systems use requirement identifiers to track the progress on implementing and verifying particular requirements. In general, the requirement identifier enables the integration of requirements management with other tools and tasks

The identifier must be stable. That is, once a requirement has been given an identifier, that identifier should not change. The text of the requirement can (and will) change, but the identifier remains a stable way to refer to the requirement in documents, email, and other messages without having to track down all the uses of the identifier and change them.

It is good practice for the identifier to convey some information about the requirement. At minimum, the identifier should make it clear what component or body of external requirements the identifier applies to. If one writes requirements hierarchically, then using the number of the requirement in the outline is a good identifier.

Having the identifier carry some information helps the user check that they are referencing the requirement they intended to reference. It also helps the reader to know generally what the writer is talking about, without going into a requirements management system to check.

For many projects, I have used the format <component id>:<hierarchical requirement number> as the identifier. For example, space.eps.panels:3.4.2 for a requirement applying to a spacecraft’s solar panels.

There are requirements management systems that use a universal, flat namespace for identifiers, such as REQ-82763. This is not a good identifier, because it makes it hard to check when one has accidentally mistyped or miscopied the identifier into another document. If one accidentally types REQ-82764 into another document, that other requirement could apply to a completely different component—and the mistake is obscured.

22.5 Writing good requirements

Requirements are a way of communicating between people on a project: between the customer and systems engineers, between those who look at how multiple systems work together and those who implement the pieces, between those who design and those who test. A good requirement is one understood equally well by all the people who use that requirement.

Writing good requirements takes practice, but the following guidelines will help in writing and reading requirements.

22.5.1 General form

Individual requirements have a general form:

The subject is often a component named in the component breakdown structure. It should be named explicitly:

The solar panels shall generate at minimum…

The rudder shall move between 10º left and 10º right

The majority of requirements use either the word “shall” or “must”, depending on the organization and industry. “Shall” indicates an assertion that the statement about the subject is to be true in the implemented system. “Must” expresses the obligation that the statement will be true in the system. In practice the two words mean the same thing when writing requirements.

The solar panels shall generate at minimum…

The flight computer must consume no more than X watts in any mode

The property is a predicate that should be verifiably true about the subject.

Writing the predicate is usually the complex part of writing a requirement. In some cases the predicate is simple:

The subject shall be painted green

The subject shall generate at most X watts of heat

In other cases, the predicate must have conditions added, saying when or under what conditions the predicate applies:

The subject shall generate at most X watts of heat while powered on.

Sometimes the requirement statement is easier to read if the condition clauses are presented in a different, natural order. However, the semantics remain the same: the clause is part of the property statement:

While powered on, the subject shall generate at most X watts of heat

22.5.2 Single topic

A requirement should specify a single property of the subject. The examples above all deal with a single property.

There are requirements that may have multiple things in their property statement that still deal only with a single property. For example:

The widget must be painted green, gray, or white

Formally, this requirement deals with a single property: what color the widget may be painted. The color is restricted to a set of three colors—but the property in question is the color.

Note that this requirement is slightly ambiguous: it is not clear whether the widget can be painted only one of those colors, or some mixture of them. This requirement could be improved by either rewriting it as:

The widget must be painted one of green, gray, or white

Or adding a second requirement:

The widget must be painted a single color

22.5.3 Clarity about subject

A good requirement must be clear about what thing it applies to. In general it is best to write down a proper name of the subject—the name of the relevant component in the breakdown structure, for example.

This rule makes for a lot of repetition in requirements. “The control system must X”, “The control system must Y”, “The control system must Z”, and so on. While it means a little more typing, using the component’s name in each requirement means that each requirement can be understood on its own.

22.5.4 Consistent language

Use consistent terms throughout requirements. Always call component X by one name; don’t change it from requirement to requirement. Always call some one function by the same name, so that it’s clear that all the relevant requirements really are talking about the same thing.

Having lists of names or terms helps those who write requirements to use consistent terms, and provides those who read requirements with definitions when they need to confirm what a term refers to. This means:

Define the component breakdown structure, and use the component names in that structure when writing requirements
Provide a list that defines the functions that a component must provide. Use the function names consistently when writing requirements, and provide a definition of the function that will help a requirement reader understand the requirement statement
Provide a glossary of terms with technical meanings. Use terms consistently with the glossary, and include definitions of terms that are clear enough to explain the terms to readers

22.5.5 Plain language

Requirements (and the rest of specifications) may be written by one or a few people, but they will be read by many people. The readers need to understand correctly what the requirements mean. Many of those readers will be learning about the system by reading requirements or other documents, so they won’t enter into reading the requirements with the same context that system engineers writing the requirements will have.

This means: don’t get fancy with requirements language. There are some ways that requirements will sound stilted, like the subject-mode-property form. There is some technical jargon that is needed to make the requirement precise. But don’t make the language more complex than it needs to be.

For any words or phrases that do not have a meaning that will be obvious to all your readers, help them out by defining how those words are being used in the specifications. Start with “must” versus “shall” and any other mode words (see Advanced Requirements below). Provide a glossary of the definitions of the rest of the words.

22.5.6 Negative requirements and “only”

Many organizations prohibit requirements that say “shall not”. Negative requirements have their place, but they are tricky to get right. The problems arise with exactly how broad or narrow the requirement actually is.

Consider a component implementation that could do one of three behaviors, A, B, or C.

If the component has a requirement “the component shall do A”, the implementation satisfies the requirement (it does A). That is because the requirement, as written, allows for the implementation to do other behaviors as well.

If the component has a requirement “the component shall only do A”, then the implementation does not satisfy the requirement because the implementation might do other things.

Now consider a requirement such as “the component shall not do D”. The implementation does satisfy the requirement, but not necessarily in a helpful way. Just because the component doesn’t do D, what should it do? Are behaviors A, B, and C all acceptable? What about behavior E?

In most cases it is clearer to name exactly the behaviors that are required, because that is unambiguous. One can write verification conditions to test exactly what is allowed.

Sometimes, however, one should write a negative requirement. If there is some behavior that really, truly must never happen, then writing a “shall not” requirement calls out that important condition, and a verification test can be designed to show that the system will not do the thing it isn’t supposed to. The negative requirement should usually be paired with a positive requirement that says what the system should do instead.

Safety and security properties often require stating a negative requirement, because these properties are fundamentally definitions of things that the system is to be designed not to do. I have not been able to imagine a way to write “a robot may not ingure a human being” [Asimov50] as a positive requirement.

Verifying negative requirements is more complex than verifying positive requirements. See Section 12.3.

22.5.7 Avoid “it”

Avoid the word “it” and other non-specific pronouns or modifiers (“they”, “those”, “them”, “its”). Repeat the name of a thing involved in the property, even if that seems repetitive and wordy. An example:

The control system must enter mode X when it is allowed

This is better written:

The control system must enter mode X when mode X is allowed

Because the “it” in the first example is ambiguous: the word could refer to the mode or to the control system.

22.5.8 Avoid impossibly high bars

There are things that we want a system to do. When writing a requirement, it is tempting to write something like

The spacecraft shall function nominally for at least three years on orbit

Unfortunately, this three-year required property of the spacecraft is virtually impossible to meet (unless, maybe, the “spacecraft” is a large, inert chunk of rock). A spacecraft has many parts, operates in a difficult environment, and is built by fallible humans.

The problem with this requirement is that it sets a bar that is so high that no real spacecraft can meet it. The requirement does not allow for any off-nominal operation. It doesn’t allow for a spacecraft to have a temporary fault and then recover. It doesn’t allow for debris to impact the spacecraft. In fact, this requirement is met only when the spacecraft is perfect for those three years. Any real spacecraft will fail verification if it has a requirement like this.

This kind of requirement needs to be modified to something more realistic. There are many ways to do that. The NASA Systems Engineering Handbook has the rule that a requirement should specify “tolerances for qualitative/performance values (e.g., less than, greater than or equal to, plus or minus, 3 sigma root sum squares)” [NASA16, Appendix C].

Three common ways are:

To place a probability on the requirement: function nominally with probability at least 95% for three years while on orbit
Provide an availability or reliability measure on the requirement: function nominally at least 99.3% of the time for three years on orbit
Provide a definition of “nominally”, either in other design documents or in additional requirements, that allows for the spacecraft occasionally to go into safe mode or otherwise be temporarily out of service

Of course, these are often combined.

22.5.9 Measurable conditions

The point of a requirement is that someone can determine whether an implementation complies with the statement in the requirement. Operationally, this means that a requirement can be verified (see the section on verification below).

One way to make a requirement measurable is to specify the condition quantitatively. For example, a spacecraft’s battery must be able to store at minimum X milliamp-hours. It’s not hard for a test engineer to see how to create a test to verify that the battery complies.

Other requirements, especially those that specify an action that should be taken under some condition, aren’t quantitative, but instead are measured by observing whether the required action is taken. The verification tests will involve either creating the condition under which the action is to occur or observing that the condition has occurred, and then observing that the required action has been taken. For this kind of requirement to be useful, a test engineer must be able to understand accurately the enabling condition and be able to create or detect that condition. The test engineer must also be able to understand the action that is supposed to occur, and detect that it has occurred. If the enabling condition or action can’t be detected, then the requirement is not readily measurable.

Requirements on low-level components are often easier to make measurable than requirements on high-level components. This is why high-level requirements are often verified by looking at requirements derived from the high-level requirement rather than by trying to construct a verification test directly on the high-level requirement.

22.5.10 Unverifiable conditions

When writing requirements for human-machine interaction or user interfaces, the underlying need is that a user can understand what the system is doing, and give it the right commands so that the system does what the user wants.

How would someone verify that the system as designed or implemented actually meets this objective? The statement is too vague actually to test.

There are multiple ways to address this issue.

First, one needs to break the objective up into a number of more-specific objectives. This often involves putting together a list of what it means to “understand what the system is doing”. This might involve:

Perceiving what operating mode the system is in
Observing fault conditions that will change operating behavior
Observing how much fuel or electrical charge the system has remaining
Interpreting the indications to understand the state of the system

And so on.

This breakdown is an improvement over the original desired objective, but the conditions are still not verifiable. As we will see in the later section on requirement derivation, these can be turned into high-level requirements that are broken down further, and the verification condition on these high-level requirements consists of, first, verifying all of the derived requirements, and then showing an argument that satisfying all the derived requirements shows that the high-level requirement is satisfied.

The derived requirements about “perceiving” or “observing” are themselves not verifiable: how does one verify that a person has observed, or can observe, some state of the system? This needs to be broken down into yet further, more specific requirements. For example,

Observing how much fuel the system has remaining

Is a process, consisting of a chain:

System has fuel → system can measure how much fuel → system transmits this information → an indicator shows the amount measured → a person can see the indicator → a person can accurately observe the indication XXX

If all these steps are satisfied and work correctly, then the person should be able to see the amount of fuel remaining.

Focus on the last two functions in the chain: that a person can see the indicator and that they can observe the indication. Seeing the indicator can be in turn broken down into further requirements, primarily on the physical structure around the person. For example, some of these might be:

The person observing must have an unobstructed line of sight to the indicator
The indicator must be visible in all the lighting conditions that the person will experience
The indicator must cover a large enough portion of the person’s field of view that they can see the details of what the indicator is showing

There is some prerequisite information needed to verify these examples. For example, what range of sizes will the users be? In order to check for unobstructed line of sight, one must know where the user’s head will be. What visual acuity or color perception abilities are required of the users? A colorblind user will not be able to perceive some color differences that might be used to convey necessary information. What expectations will a user bring to the task? If a user is socially conditioned that green means good and red means bad or stop, using different colors to indicate good or stop will be hard for a user to interpret.

How would one go about verifying these requirements? There are multiple techniques that will help—and usually the techniques must be used together to really check whether a requirement is satisfied. These techniques are a combination of analysis using models and real-world measurement.

Analysis of CAD models or physical models of the environment. If we know where the indicator will be placed, what else is around it, and where the user’s head can be, one can determine if something will obstruct the line of sight. Similarly, building a model of all of the anticipated lighting conditions can allow one to find the lighting situations that are likely to be difficult for perception, such as having the sun in a user’s eyes when they try to look at the indicator, or full darkness.
Standards. There are many standards for color, contrast, visual size of an indicator, and many other aspects of user interface components. Meeting a standard may be sufficient for determining whether the requirement is met.
Experimentation. Build a mockup of the environment where the indicator and user will be, put users in that mockup, and measure how well they do at perception tasks.

The experimental approaches are often the most expensive in time and money, but they are the gold standard for verifying a human interface requirement. Conforming to standards can help address expectations that users will bring to tasks.

In summary, there are several tools for addressing requirements that are too vague or complex to verify:

Writing a high-level requirement, breaking it down into multiple derived requirements, and showing that satisfying those derived requirements means that the high-level requirement is satisfied
Consider the chain of functions that leads from the basic states of the system or component to the desired outcomes, and use those functions to guide a breakdown into lower-level requirements
Determine how to perform verification on the lowest-level requirements, using a combination of analysis, standards, and experimentation

XXX revisit this section to bring it into line with the Leveson viewpoint on user interaction as control

22.5.11 Detail appropriate to the level

Requirements should be written as a description of what one sees in a component when looking at it from the outside—a black box view. A good requirement does not go into how the feature or behavior is implemented inside the black box.

Put another way, the requirements for a component are documentation of how the component fits into the system around it. If component A is part of a larger component B, the requirements on A document what the implementation of B needs for A to do its part correctly. If components C and D are peers, the requirements document what they will need from each other for both to do their job.

This matter connects directly to requirements derivation from component to subcomponent, which is discussed in the next section.

There are four reasons to follow this rule.

Requirements aren’t the only specification of the system. There are design documents whose job is to document how a component will be implemented internally.
Many requirements are written before a component’s internal implementation is understood. The requirements serve as a record that the component designer can come back to to make sure they have designed or built a component that meets the needs of the components or users that will interact with the component.
Things change. Components get redesigned. If a component’s features don’t change but its implementation does, the requirements defining the component shouldn’t change.
Saying what a component is supposed to do leaves room to document the thinking or the rationale that led from the external what to the internal design of how the component provides the whats. This helps others who come along much later to understand the system—in particular, it helps when a requirement needs to change and a new person has to work out what effects that change in requirement will have on an implementation.

It is tempting to skip right to the details of how a component is built. Don’t do it; provide other people the benefit of your understanding of the problem, not just the final design answer.

22.6 Requirement derivation

XXX revisit this to bring it in line with system model terms

No requirement stands entirely on its own. Almost all requirements have some reason that they have been included in a system, starting with: this requirement is necessary so that the system meets some objective. In lower-level components, the reason often is: this requirement is necessary so that this component provides some feature that other components depend on.

These are examples of requirement derivation. Derivation encodes the relationship between requirements.

Almost all requirements are derived from other requirements, and the requirements in a system must keep track of how one requirement leads to another, or how one is dependent upon another.

There are several kinds of relationships that people record. Some of these are:

A lower-level component has a function that a higher-level component will use
One requirement within a component implies other requirements on that same component
A component passes on a general requirement to its subcomponents
Two peer components will be interacting, and depend on each other’s functions

Let’s look at each of these kinds of derivation.

22.6.1 Subcomponents providing features for parent

A parent component has a requirement that the component provide some feature. The requirement in the parent specifies what the parent must do, but does not specify how to implement that feature. The design of the parent component, and later, the implementation, document how the parent component will satisfy that requirement.

When the designer decides on the implementation, they will decide (among other things) how the parent component will use subcomponents to implement the feature. These decisions create requirements on the subcomponents so that they provide the features that the parent component will use.

The reason for these requirements on subcomponents is that they are necessary to satisfy the requirement on the parent component. A derivation relationship between the parent requirement and the subcomponent requirements documents why the subcomponents have the requirements they do.

Consider a spacecraft example. The spacecraft as a whole has a requirement that it be able to point at a ground location, with some number of degrees of accuracy. To implement that feature, the spacecraft designer chooses to use the spacecraft’s attitude control system to point the spacecraft toward a ground location, and then slowly rotate the spacecraft as it passes over the ground location. The parent component—the spacecraft—has the high-level requirements for what it needs to do. The subcomponent—the attitude control system—must be able to slew accurately to an initial pointing vector, and then be able to slew slowly and accurately until the spacecraft is done with an observation. The slewing accuracy and speed are the derived requirements on the attitude control system.

The process continues recursively. The attitude control system designer decides to use reaction wheels as the primary attitude control mechanism. The requirements for slewing accuracy and speed create requirements on the reaction wheels for how quickly or slowly they can turn the spacecraft.

22.6.2 Internal derivation

Some components will have a requirement that specifies a very high-level capability the component must provide. For example, in a section on disposing of a component that is being discarded:

The component shall have a procedure for disposal that ensures that no confidential information is leaked to unauthorized parties

There are several ways this requirement could be met: destroying the retired component in house, crashing the component into the atmosphere or ground in a way that will assure the component is destroyed, or erasing the data on the component before giving the component to an outside entity for recycling.

Whatever the implementation decision is, it creates more requirements on the component, and those requirements derive from the decision on how to satisfy the requirement on protecting confidential information. If, for example, the implementation decision is to recycle a retired part, then this might lead to requirements like:

The component shall provide an interface by which an authorized user can command the erasure of all data stored in the component

The component shall provide a function that erases all data stored in the component

In some organizations, the practice is only to record derivation from one component to another. Sometimes that works out; in the example, the requirement for an erasure command could be on a command handling subcomponent, and the erasure requirement could be on a memory component. However, some components do not break down into subcomponents easily—for example, when the component is being implemented by an outside vendor. In other cases, it is simply clearer to document the implementation requirements for the component directly and then passing the requirements through to subcomponents, so that a user can see the totality of the functional interface to the component in one place rather than having to search through subcomponents for something they don’t know exists.

22.6.3 Pass through

External objectives and standards often impose general requirements on “all components of type X”, or the like. For example, an automobile might have a requirement that all electronic components function nominally across a temperature range of -40º C to +125º C. (See the section on Sets as subjects below for more on this.)

This requirement can be placed on the automobile as a whole; the requirement might read

All electronic components in the automobile shall function nominally across the temperature range of -40º C to +125º C

If the automobile includes engine, braking system, and entertainment systems as parts, the temperature range requirement can be passed down to those subcomponents:

All electronic components in the engine system shall function nominally across the temperature range of -40º C to +125º C

All electronic components in the braking system shall function nominally across the temperature range of -40º C to +125º C

The braking system controller unit shall function nominally across the temperature range of -40º C to +125º C

But the entertainment system, which is not safety critical and operates in the more benign environment of the passenger cabin, might have the requirement:

All electronic components in the entertainment system shall function nominally across the temperature range of -10º C to +50 C

In these examples, the general requirement is copied down into lower-level subcomponents until it reaches some component (such as the braking controller in the example) that does not have further subcomponents. Sometimes the requirement is copied verbatim, just changing the scope of the subject; other times, some component will have a variant on the general requirement.

This kind of derivation is sometimes referred to as allocating requirements to subcomponents.

22.6.4 Mutual dependency

Sometimes two components are peers of each other, and need to interact. A fuel tank provides fuel to an engine; a spacecraft communicates with a ground station to send telemetry and receive commands; client and server applications send messages to each other.

These interactions involve requirements on each of the components involved, showing how the components support each other. The fuel tank must send fuel; the engine must consume fuel. The spacecraft must be able to communicate with the ground station; the ground station must be able to communicate with the spacecraft.

This leads to pairs of requirements that record this mutual dependency. At a high level,

The spacecraft must be able to communicate with ground stations using protocol standard X

and

Ground stations must be able to communicate with the spacecraft using protocol standard X

These two requirements should show a two-way relationship with each other. (Formally, this introduces a cycle in the derivation graph.)

22.6.5 Using derivation

Derivation shows how requirements are related to each other.

Systems engineers use the record of these relationships for several tasks.

Determining whether a design of the system can satisfy its top-level objectives, and encoding how a system’s design satisfies those top-level objectives.
Determining whether a particular requirement is actually needed (and why).
Tracking down the effects of making a change in one piece of a system on the rest of the system.

A derivation relationship between requirements on two different components helps to document the implementation approach for meeting a higher-level requirement. When a designer looks at the high-level requirement, they can see what features are used to implement the high-level requirement. The lower level requirements and their rationale allow the designer to see the argument that the implementation will be sufficient to meet the high-level requirement. This makes the design rationale available to people who didn’t create the design in the first place, but need to understand it to evaluate it or to make changes.

The section on analyzing requirements, below, goes into more detail on how one can look at the requirement derivation relationships to evaluate completeness or sufficiency, to argue whether low-level features are actually necessary, and to trace out the effects of making a change in requirements.

22.6.6 Viewing derivation

There are two ways that a user should be able to see derivation relationships. First, when looking at any one requirement, the user should be able to see what requirements this one is derived from directly, and what requirements derive directly from this one.

Good requirement management tools will also provide a view of the graph that shows derivation graphically. Derivation relationships can be viewed as a graph, as a way to see multiple levels of derivation. The graph is typically mostly a tree or DAG, but there are legitimate reasons that the graph will sometimes have cycles (between peer components, for example).

Here is an example showing how a top-level requirement is the source for a number of other requirements.

22.7 Advanced requirements

All the requirements discussed so far are simple requirements. Simple requirements have a single, clearly specified subject component. Each simple requirement expresses one property about that subject that must be true.

Simple requirements are not sufficient to express every need that real systems encounter. There are two that we have seen many times: requirements on sets of components, and requirements for standards.

22.7.1 Sets as subjects

Consider a system where all code is expected to adhere to a published coding standard. The implied requirement does not apply to any single component; it applies to all of them that include software.

This expectation can be written as a top-level requirement on the system as a whole:

All subcomponents of <the system> that include software shall adhere to the XYZ Coding Standard.

The subject of this requirement is the set of all software components in the system. The property is that their implementation adheres to the named coding standard.

This kind of requirement is placed on the top-level system, and then each first-level subcomponent includes a derived requirement that propagates the requirement downward:

All subcomponents of component X that include software shall adhere to the XYZ Coding Standard.

On a component Y that has software as part of its implementation then has:

The software in component Y shall adhere to the XYZ Coding Standard.

If component Y has subcomponents, Y should also have a second requirement that continues to pass the requirement down to Y’s subcomponents.

This is an example of a general technique:

At some component, levy a requirement that is expected to apply to some subset of the subcomponents of that component
Pass the requirement down from component to subcomponent by giving the subcomponent a derived requirement that is nearly a copy of the original, except for limiting the set of components
When the derivation reaches a component to which the need applies, write a simplified requirement that names only that one component as its subject

22.7.2 Writing for standards

Many texts on requirements approach the subject from an assumption that there is one system being built: these are the requirements for System X. System X will be built in its entirety as specified; any and all requirements must be satisfied.

Writing standards is a different problem. A standard is specifying requirements on multiple hypothetical systems that may exist at some point. Those systems will not be identical, but the systems that adhere to the standards must adhere to the requirements in the standard.

Standards often provide options. The standard has a set of optional features. If the system chooses to implement those features, the features must conform to the standard. However, the system does not have to implement those features. This means that the system does not have to satisfy every requirement in the standard.

Some standards also present best practices. For some feature, it is recommended that the feature conforms to a part of the standard, but it is not absolutely required to do so.

The vocabulary of “shall” or “must” does not accommodate these situations well. The Internet Engineering Task Force (IETF) has defined a richer set of requirement modes. For example:

MAY. This word, or the adjective “OPTIONAL”, means that an item is truly optional. One vendor may choose to include the item because a particular marketplace requires it or because the vendor feels that it enhances the product while another vendor may omit the same item. An implementation which does not include a particular option MUST be prepared to interoperate with another implementation which does include the option, though perhaps with reduced functionality. In the same vein an implementation which does include a particular option MUST be prepared to interoperate with another implementation which does not include the option (except, of course, for the feature the option provides.) [BCP14]

The words used to indicate these more complex conditions must be defined just as carefully as “must” or “shall”, and must be used consistently.

22.8 Analyzing requirements

Many people think of requirements only as a contract for guiding implementation and a checklist for performing verification tests later. However, requirements—along with other specifications—are useful in themselves for helping build a design and making sure the design is good.

There are three kinds of analysis that systems engineers do on the requirements themselves:

Ensuring that the requirements (and specification) are complete
Ensuring that the design is minimal, meaning that the design only contains features that it actually needs and nothing extraneous
Ensuring that the requirements are consistent
Understanding the effects of making a change to one part of a system design

These are all analyses that should be done on the specifications of a system, including the requirements, and not delayed until implementation. Some of these tasks are easier to perform on the abstracted and simplified view of the system that specifications give. Performing these tasks before implementation will reduce the amount of re-implementation needed when one finds that the requirements aren’t sufficient or minimal.

22.8.1 Complete design

The expectation is that if a system is built to conform to its specification, including requirements, that the system will do the job that its users need and do it correctly. (Of course, this assumes that the top-level specifications are themselves a correct and complete record of the users’ objectives; we discuss this more in the section on validating requirements below.)

To meet this expectation, the system’s requirements need to be complete and correct. This means that when one looks at any given top-level requirement, one can trace out the features on other components that will be used to implement the requirement and argue that those features will combine correctly to produce the desired result.

There are two parts to this analysis:

At each level of derivation, are the derived requirements complete for their parent? Is there a rationale that shows how these derived requirements are necessary and sufficient to implement the parent?
Are the leaf requirements concrete, implementable, and verifiable?

Having tools that allow one to view parts of the derivation graph in visual, graphical form is invaluable to performing this analysis.

Consider an example. A UAV (drone) is supposed to receive and process commands from an operator on the ground. This leads to requirements:

These requirements are not complete, because they leave out a critical step: when a command is sent from the ground operator to the UAV, the message first goes to the transceiver. The receiver extracts the message, and then sends the message to the command and data handling component. The example omits the part about the transceiver and command handler passing information to each other. This means that one could build an aircraft that had a radio and had a flight computer, but the two would never talk to each other. Obviously, the UAV would not be acting on commands with that design.

This leads to a more complete set of requirements:

In the example, the communication between the transceiver and command handling components should be documented in some other specification for the UAV, perhaps an activity diagram showing how commands flow through components. The requirements then need to be checked against these other parts of the specification to make sure that all of the functions in each of the steps are reflected in the functions each component is required to implement.

Sometimes determining whether a set of requirements is complete or not will require further analyses. As a simple example, the maximum mass for an aircraft might be X kg. Making sure that the aircraft’s overall mass comes in under that limit means enumerating all the components in the aircraft that have mass, adding up their mass, and determining that the result is below X kg. For that analysis to be complete, it cannot leave out, say, the mass of the motors; all components must be considered.

As a more complex example, a system might have a maximum acceptable failure rate target. Being able to argue that the system is reliable enough involves performing a fault tree analysis, enumerating all the ways that failures in components can lead to system failures. The analysis cannot leave out components and be complete; nor can it leave out some failure modes of some of those components.

Checking whether the design is complete is not a simple task that can be performed just by inspecting the graph of requirements. The analysis is helped by being able to see the requirements, but it requires imagination and effort to actually check the result.

XXX Sidebar: relationship to Goal Structuring Notation and safety cases

22.8.2 Minimal design

Every feature and every requirement on a component should have a reason for being there.

At the top level, for the system as a whole, only features that address customer needs or business objectives should be included. At lower levels, the only requirements that should be placed on components should be ones that are actually needed to make the system work properly—meaning the system meets those top-level objectives.

22.8.2.1 Tracking the purpose for every requirement

The derivation relationships between requirements encode the reasons for a requirement to exist. This leads to a condition that should hold across all requirements:

Every requirement for a system and all its components should derive from one or more customer or business objectives

This is straightforward to check using the derivation graph: every requirement should derive from at least one parent requirement, and it should be possible to trace upward through the derivations to reach a customer or business objective.

Often while requirements are being developed, a requirement will be placed on some component without setting up the derivation. This requirement will not have a parent, and so the checking method will flag it. But what to do then?

In most cases, there was a good reason that someone wrote that component requirement. When one finds a requirement that is not documented as supporting some higher-level reason, it is worth exploring why that requirement is valuable. In some cases, the parent requirement(s) are present, and the requirement just needs to be linked to them. In other cases, the requirement can be a clue that there is some higher-level principle that the writer had in mind, and that higher-level principle should be added into the requirements higher up in the system.

For example, consider a data storage component where an engineer placed a requirement that all data be stored in an encrypted form. As written, that requirement doesn’t derive from any other requirement. But why did the engineer believe that encryption was necessary?

One answer is that encryption isn’t necessary. In that case the encryption requirement can be removed. Another answer is that the engineer wrote that requirement because they believed that the component would be storing confidential data that should be protected against disclosure. In that case, it is worth checking: does the system have requirements—or business objectives—about protecting confidential data? If not, then this exercise will have found a topic that has not been adequately addressed, and new requirements need to be added to make a correct specification. Those requirements should be added throughout the system, and the requirement we started with should show that it derives from those new features.

Many such requirements result from external standards that are supposed to be met, such as regulatory, safety, or security standards. Those standards should be included in the external objectives for the system, their requirements should flow down through the system to the components where the standards apply. This produces a record of how the system’s design complies with those standards.

22.8.2.2 Finding unnecessary requirements

Some requirements that show how they are derived from some parent requirement are still not actually necessary.

There is no simple, mechanical way to find these unnecessary requirements. However, the analysis used to determine whether a collection of requirements is complete is also useful for finding these unneeded requirements.

Consider this example:

The requirement about encryption is not actually needed for the system in question. That is because the connection between the transceiver and command handling components is physically contained within the UAV, and the physical encapsulation provides enough security to protect the messages passing between the two. The encryption requirement can be removed with no loss of capability.

However, in this example, the engineer who wrote the encryption requirement had a good idea but expressed it wrongly. The engineer understood that the integrity of communication between the two components was important; a command that was properly received but garbled in being sent to the command handling component could be a problem. The presence of the encryption requirement should be replaced by a less costly requirement, that the channel must protect the messages it carries against corruption.

22.8.3 Consistency

Consistency in a body of requirements is when the requirements don’t contradict each other. If requirements do contradict each other, the system as specified isn’t implementable and the specification needs to be fixed.

Broadly speaking, there are three kinds of consistency that one should check:

Consistency among requirements for one component
Consistency between requirements on either side of an interface between two or more components
Consistency between requirements on a higher-level component and the requirements on subcomponents that should show how the higher-level requirements will be implemented

As long as requirements are written as text, and not in a formal notation, consistency checking will be manual. It involves reading through each requirement, finding other requirements that address related topics, and checking that they are consistent with each other.

Some inconsistencies are fairly easy to detect. If one requirement says component X shall be blue and another says component X shall be red, it’s obvious—one must just read through all the requirements on component X and see that two requirements both deal with the color property and they say opposing things.

Other inconsistencies are harder to spot because they do not use the same language in the properties they are specifying. As an example, one requirement might say component X shall use encryption algorithm Y while another requirement says component X shall use protocol standard Z. If protocol standard Z allows encryption algorithm Y, this is fine. But if the standard does not allow that particular encryption algorithm (perhaps because the algorithm is outdated and no longer considered secure enough) then there is an inconsistency.

Another class of inconsistency comes from the states a component can take on. Elsewhere in the specification of a component, there should be a definition of the state machine that the component is supposed to follow. The requirements translate that state machine into individual actions that the component is expected to take in response to particular inputs. It is easy—especially when editing or updating the component’s specification—to have two requirements: when condition A occurs, component X must transition to state Y and when condition A occurs, component X must transition to state Z. The inconsistency can be more subtle, such as leaving out some transition, or using inconsistent definitions of the condition that causes the transition. This class of problem can be addressed by having a single, clear definition of the state machine the component is expected to follow, and then checking the requirements against the state machine.

Finally, another class of inconsistency that can be hard to detect has to do with timing. Two requirements can impose timing constraints that cannot both be satisfied. For example:

When event A happens on component X, event B must happen within 10 milliseconds

When event C happens on component X, event B must happen within 15 milliseconds

Component X must perform the events A, C, and B in that order

There is no way for component X to meet the timing requirements given the order that events must occur. Building a timing model of the component in question, and performing a timing feasibility analysis using that model, can help find this kind of inconsistency.

This is by no means an exhaustive list of the kinds of inconsistency one must look for.

22.8.4 Effects of changes

Systems change. This can happen because customer needs change, or because technology changes, or because someone has found a better design for part of the system. A good development process supports constant evolution and change of the design and implementation of a system.

Not every change that is proposed will be performed. When someone proposes a change, someone else will analyze the proposal to determine the effects of the change. Based on this analysis, people may decide to go ahead, postpone the change, or not make the change.

The analysis must accurately determine:

What components are affected by the change? Is it just one component, or do effects ripple through much of the system?
What things will need to change as a result? Component designs, specifications, safety analyses, regulatory filings?
How extensive or expensive will those changes be?

This analysis makes use of all the specifications in the system, but requirements are a major contributor. In particular, the derivation relationships help show how component features depend on each other, and thus help guide an analysis of how far some change will spread.

22.8.4.1 Effects of top-level changes

Top-level changes include adding a new feature to the system, removing a desired feature, or changing a standard or other external source of requirements.

If the change changes a top-level requirement, look at the derived requirements from that changed requirement and see if the derived requirements are still necessary and sufficient to satisfy the newly-changed requirement. If they are, then no further action is needed. If they are not, then the derived requirements must be revised, possibly adding or removing some of them. The process then needs to repeat with these changed derived requirements. If the change affects a requirement that supports a different top-level requirement, then one must check that the other top-level requirement is still satisfied by the changed derived requirements.

If the change adds a new top-level requirement, work out what derived requirements are necessary and sufficient to satisfy the new requirement. Look for lower-level requirements that already exist that can also support the new requirement. This may involve a change in design, not just requirements; this will cause more changes to propagate out.

If the change removes a top-level requirement, see if any lower-level derived requirements are no longer needed or can be relaxed. If so, work downwards to propagate the effects of those changes.

22.8.4.2 Effects of lower-level changes

Many more changes will come to lower-level components in the system. There are many reasons this can happen: because people have found that a design in process is infeasible or too costly; because a vendor’s part specification or availability has changed; or because someone has found a better design for some lower-level component.

Evaluating a lower-level change involves all the checks for a top-level change above, along with the need to see how the change will affect higher-level requirements. Will the change leave the higher-level requirement unsatisfied? Will this change make some other sibling requirement redundant (that is, the parent is satisfied without the sibling)?

Tracking down these effects is much easier if the derivation relationships among requirements are accurate.

22.8.4.3 Tools

Good tools help the process of evaluating changes. There are three features in particular to look for:

The ability to create an independent working version of the requirements, in order to try out changes before committing them to a baseline. The ability to see what has changed between the baseline and the working version and selectively merge changes into the baseline allow reviewers to understand the whole effects of the change and to accurately accept the changes.
A feature to mark some requirements as potentially changing and others as needing evaluation. This feature helps ensure that the change evaluation does not miss some important change.
The ability to record a rationale for a derivation relationship between requirements helps the people evaluating changes determine why a set of derived requirements was considered necessary and sufficient.

22.9 Validating requirements

XXX rewrite this to bring into line with introductory language on deriving verification

Validation is the process of determining whether a set of requirements accurately reflects the needs of the system. This can mean that the system will meet customer needs, or mission needs, or other external objectives.

It is important to keep validation separate from verification, which is discussed below. Validation is about seeing if the requirements (and the rest of the specification) is an accurate reflection of external needs. Verification is about seeing if the implementation is an accurate reflection of requirements. (Some software engineering texts focus validation on consistency, completeness, and similar properties. Systems engineering has generally kept those kinds of checks separate from validating customer or mission satisfaction.)

The validation process starts with checking the system objectives, business objectives, security and safety objectives, and regulatory objectives to see if they are an accurate reflection of the customer or mission needs. Presumably appropriate care has been taken while these objectives are being gathered and written down, but mission understandings or desires change over time and an independent check on the objectives will help avoid having problems be discovered late, when it is expensive to make changes.

At the top level, one should check:

Is there anything in the top level that does not come from what the stakeholders want?
Is there anything the stakeholders want that is missing?
Are all the regulatory needs addressed?

At lower levels, one is checking whether the derived requirements from a parent are necessary and sufficient. The analyses for complete and minimal design, discussed above, cover those checks.

There are many different ways to validate a system’s specifications. They generally fall into two groups: analysis and simulation.

XXX improve language: analysis as formal method vs review as informal

Validation by analysis involves people reviewing the requirements and using their judgment to check the specifications. This can involve performing joint reviews with stakeholders so that they can check the requirements.

Validation by simulation involves stakeholders somehow seeing a model of the system in action. There are many ways to do this. Stakeholders can be invited to define some scenarios that represent how they will use the system, and then try out those scenarios using a model of the system. Some ways we have done this include:

A desk or whiteboard exercise, where one or more engineers act out the system’s behavior and some stakeholders talk through what they would want to happen.
Building a simple simulation model of parts of the system and letting the stakeholders try out their scenarios with the simulation.
Building prototypes of parts of the system and having the stakeholders try out the parts of their scenarios using the prototype. This can be combined with a desk exercise for parts of the system that aren’t prototyped.

These validation exercises should be completed and the stakeholders should concur that the specifications are correct before one baselines the specifications, including requirements.

22.9.1 Connecting requirements and implementation artifacts

People must be able to navigate from a requirement to its associated implementation artifacts and vice versa. The people implementing a part of a system according to requirements need to be able to quickly and accurately find the requirements that they need to comply with. In the other direction, the people verifying requirements must be able to find the artifact or artifacts that implement a particular requirement.

The approach to organizing systems artifacts that I advocate here, which organizes many systems work around a hierarchical component breakdown structure, is designed to meet this need conveniently. The set of requirements that apply to some component are implicitly connected to other specifications and the implementation of that component because they are all organized by the same component names and identifiers.

One can also explicitly label artifacts with component identifiers or requirement ids. For example, verification test specifications are associated with specific requirements, so the test specification needs to be labeled with the requirement ids that it applies to.

22.10 Verification

Verification is the process of showing that the implementation of the system, or parts of it, complies with the requirements.

Verification involves gathering evidence that every requirement is satisfied by the implementation.

There are four general methods used to verify the implementation’s compliance:

Inspection: review of the implementation (source code, board designs) to manually check that the implementation complies
Test: define and perform tests on the implementation that check compliance. Many requirements require a number of tests in order to show that the implementation complies given all reasonable inputs or environment conditions
Demonstration: show the implementation working as defined
Analysis: perform a static analysis on the implementation to gather formal evidence that the implementation complies

Inspection is verification by having people review parts of the implementation to check that it complies with a requirement. The inspection review should be performed by people who did not implement that part of the system, so that the reviewers are not misguided by preconceptions (“I’m sure I implemented this correctly”).

Some inspections are particularly simple. Consider a high-level requirement that is the source for a few lower-level requirements. In many cases, the high-level requirement is satisfied when the lower-level derived requirements are all satisfied. In these cases inspection becomes a simple matter of checking that the derived requirements are all satisfied. The rationale associated with the derivation or with the high-level requirement should indicate when this situation applies.

Test and demonstration are similar. Testing is generally more exhaustive, and necessary lower-level components. A single electronic component, for example, might be operated across all the specified thermal, vibration, and atmospheric environments it must handle. Demonstration is less exhaustive, and used to verify top-level system objectives. A prototype spacecraft radio transceiver might demonstrate that it can communicate with ground stations from a similar orbit to where the final spacecraft system will operate.

Some requirements cannot effectively be verified by test or demonstration, and must be verified using analysis. This occurs when one is verifying a negative condition: the verification must show that the system will not perform some action or be in some condition at any time. Providing evidence of the absence of some condition is a long-standing scientific and engineering problem because proving the presence of some condition is relatively easy—demonstrate it happens in one case and that’s sufficient—but showing absence often requires exhaustive search. These verification problems often arise in safety and security requirements, where unsafe failures must be rare (e.g. no more than once in 10⁹ operating hours) or a system must resist a class of attacks (showing that no attack of that class will succeed).

Each requirement should have an associated verification specification. The specification should lay out what steps must be taken to determine whether the implementation is correct or not. A verification specification is often complex—many pages of documentation for a three-line requirement.

Verification status is a measure of how well the implementation matches the specification, including requirements. In practice this means how well a version of the implementation complies with a version of the specification, as both implementation and specification evolve over time. This means that, during design or implementation, there is no one single “verification status” that can be tracked: with each new update to the implementation, the verification status changes. Some practitioners and tools make the mistake of tracking verification status only in terms of requirements: which requirements have been satisfied by the implementation? This leads to project management errors when a change is made to the implementation that improves the implementation in one area but causes other parts of the system to go out of compliance—a common occurrence while in the middle of implementation using iterative approaches.

22.11 Limitations of requirements

Requirements have limitations. Writing a good specification for a system means understanding these limitations and addressing them in one way or another.

One limitation is that requirements are written in natural language. Human language is notoriously difficult for pinning down precise meanings, even within a single group of people. Specifications, including requirements, are used to communicate between different groups of people with different outlooks, experiences, and jargon. This makes it hard to write requirements that will be interpreted the same way by all of the people involved.

The limitation of natural language can be partly mitigated using a couple of techniques. One is to maintain a glossary that defines words or phrases that have specific meanings in the specification beyond common understanding. The second is through social cohesion: having enough people from different groups interacting and discussing the system so that they evolve a common understanding of the meanings of things.

Precision is another limitation. Some specifications can be clear and simple in mathematical notation, while they are hard to follow in prose. (Consider expressing Newton’s law of gravitation as an equation versus in prose.)

A third limitation comes from requirements being single statements. Sometimes the specification needs to encode a complex, multistep activity. Each of the steps might be encoded as an individual requirement, but it is awkward and hard to understand. Sometimes the better answer is to write part of the specification in a different form—a flowchart, a state machine, or a set of equations.

As a result, requirements are only one part of the total specification. They cannot do the entire job of recording the full specification of the artifact in question—but they are often the most flexible way to organize most of the specifications. Be prepared to supplement textual requirements with other kinds of specification to get the whole job done.

22.12 Working with requirements

This chapter has mostly covered what requirements are. This section touches on what one does with them and how they evolve over time.

Requirements will change continuously over the life of a project. The rate of change will be high at the project’s beginning, when the team is trying to sort out what the system should be. The rate of change will increase after the high-level system purpose is sorted out and as the design work proceeds in parallel on different components in the system. The rate will taper off as the design and implementation become more mature, with occasional bumps as people find problems with the specifications, or as stakeholders request changes. Ideally the rate will reach zero when the system is ready to go operational, but even while in use people will find changes they would like to make.

Detailed requirements are expensive to develop and maintain. They encapsulate the complexity of how all the parts of a system are interconnected. They require effort to develop in the first place, involving checking for consistency and feasibility across large parts of the system. Changes later involve even more effort, especially if the changes involve reorganizing specifications that have already been developed.

This leads to a tension: changes will always happen, especially with modern, flexible systems, but the cost incentivizes developing all the requirements at once and then freezing them to minimize the cost of change.

This tension is unavoidable, but there are things one can do to reduce the difficulty.

Don’t start finalizing requirements too early. Work with a sketch version of the main points while exploring the concept.
Develop detailed requirements for a part of the system hand in hand with prototyping the components that make up that part of the system. Most teams learn a lot from prototyping exercises; what they learn translates into design changes and in turn into requirements.
Start the processes for communicating about specification changes early in a team’s development, so that people have developed the habit of talking to each other about changes that might affect others before major, disruptive changes come along.
Use the right tools that help the team coordinate developing and changing the requirements.

22.12.1 Supporting the life cycle

The requirements for a system—and indeed all the specifications for the system—grow and evolve over time. The times and ways when requirements change depends on the development process a project is using. However, all these processes share some tasks in common.

Collaborative development. In some phases of developing the specifications and requirements for a system, there will be many unknowns and the possible specifications will be in constant flux. In periods like this, many people will be involved in writing down possible requirements, often collaboratively. In phases like this, what matters to people is the ability to quickly sketch out some requirements, and the ability to share and collaborate on these sketches.

Incremental change. At other times, when the requirements and specifications are more stable, there will be incremental changes to the requirements. When someone makes a request for a change to the system, a systems person will need to evaluate the effects of that change. The ability to trace out the implications of a change using derivation relationships helps make the analysis process accurate. As the systems person works out the effects of the change, they need to be able to create an independent working version of the requirements where their updates will not affect an official, baselined version of all the specifications.

Baseline. While the requirements and specifications will be in some degree of flux all the time, the people who use those requirements need stability. The most common approach is to designate a version of the requirements as the current stable version, and then control updates to that stable version. The stable version goes by different names in different fields: baseline, release, plan of record, committed version. For the purposes of this document, we use the term baseline.

A project should use a configuration management or version management process to maintain the baseline requirements. There are many tools that automate such processes. The key features needed are that

A user must always see the current baseline
A user must be able to know when the baseline has been changed, and see how that change affects them
No change is made to the baseline without review and approval
Working versions of specifications that have not yet been approved do not appear in the baseline

Review and approval. People will propose updates to the system’s design as a project moves forward. This occurs often at the beginning of a project, as the design goes from vague ideas to concrete specifications; it continues during the life of the project as stakeholders ask for changes, as engineers find problems or improvements with the current design; and it can continue after a system is released to operation, as people find problems in actual use. These changes will result in specific proposed updates to the requirements. The proposed updates need to be checked before they are accepted and applied to the baseline. Once applied to the baseline, everyone developing the system implementation will need to work to revise their part of the implementation to match, and verification steps will be required, and so on—thus it is important to control changes to the baseline to be sure that they are sound and within the project’s scope before committing to them.

Projects generally use a review and approval process to decide whether to apply an update to the baseline or not. In the review part, systems engineers check the updates to ensure they meet guidelines, including consistency, completeness, and minimality. People who will be affected by the update are asked to review the update, to evaluate whether it is technically correct from their point of view and whether the change is feasible. Project managers are asked to evaluate the update to determine whether the change is in scope and whether there are resources to accommodate the change. If all those parties agree, then the update is approved and someone creates a new requirements baseline that incorporates the changes.

Verification. The implementation of the system needs to be verified from time to time to ensure that what is being constructed complies with specifications. Verification can happen at many different times and with different scopes. As someone implements a feature into a component, verification tests can provide immediate feedback to the implementer. In software development, this is related to test-driven development. Regular verification activities can detect whether a change in the implementation in one place has had an unexpected consequence that causes something else to go out of compliance. This is sometimes called continuous integration testing. When a vendor supplies a prototype component, the prototype needs to be verified for acceptance testing. And when the system is believed to be complete, final verification checks are required before the system enters into operation.

22.12.2 Who works with requirements

Many people generate or use requirements during the lifetime of a project. These include:

The product or mission managers who define the system’s objectives and review and approve the top-level system requirements
The engineers who develop, revise, and maintain the requirements
Change management people who oversee the process to baseline and release consistent, committed-to versions of specifications
Designers and implementers who build components following the specifications
Engineers who design how different requirements will be verified
People who verify implementations against the requirements
Managers who track the progress of implementing requirements and the progress of verifying components
Vendor managers who oversee the process of writing requests for proposals, evaluate vendor offerings, and oversee the acceptance verification of artifacts from vendors
Vendors who design and build components according to requirements
Contract specialists who incorporate requirements into vendor contracts
Technical writers who develop operational and training manuals based on the requirements
System operators who use the requirements to understand how the system works in operation

22.13 Tools

The right tools make working with requirements much easier and more accurate. However, different requirements management tools are designed to support different styles of requirement writing and use, so you need to choose tools that match how you will write, organize, and use requirements.

Here are some questions that can help you evaluate requirements management tools.

Do the tools support the writing style and requirements organization strategy you use?
- If your project is using hierarchically-organized requirements, does the tool support that?
- Do the tools provide a way to organize requirements for one component into sections?
- Do the tools support organizing requirements in the way your project organizes them? If you organize requirements by component, do the tools support that?
Do the tools provide ways to trace top-level requirements from external sources, like policies or standards documents, that are the source for those requirements?
Do the tools support tracing the derivation relationships among requirements?
- Do the tools allow derivation among requirements within one component?
- Do the tools allow mutual relationships between two components, when those two components have mutual dependency?
- Do the tools provide ways to visualize the graph of derivation relationships?
- Do the tools allow one to record a rationale for derivation relationships, in addition to rationales for individual requirements?
Do the tools support the relationship between requirements and implementation artifacts or other documents?
- Do the tools provide a stable identifier for requirements?
Do the tools support reviews and baselining, or other versioning control mechanisms?
- Do the tools support the ability to create a non-baselined, work-in-progress version of some requirements?
- Do the tools provide a way to see the differences between a baselined version of requirements and a work-in-progress version?
- Do the tools enforce a review and approval process?
- Do the tools help notify those who are affected by a change and who thus should provide a review?
Do the requirements tools integrate with other tools that support verification?
- If the requirements change, do the tools make it clear what requirements must be verified anew?
- If the implementation evolves, do the tools allow for verification results to be associated with a version of the implementation and not just with the requirements?

People will use the requirements management tools to perform a number of tasks. You should evaluate how well requirements tools support these activities.

Writing new requirements: being able to quickly sketch out a set of requirements for some new component or some new function, and then fill in the details to complete a first draft of the new requirements. Ease of writing and editing is a key complaint about many requirements management tools.
Reviewing and updating requirements: when someone requests a change to the system design or objectives, being able to find what things need to change and then being able to revise the requirements conveniently.
Managing review, approval, and baselines: at some milestones, reviewers will need to see what requirements they should review, support them in the process of discussing the requirements, and tracking issues that are found until they are resolved. With some milestones, a version of the requirements will need to be approved (or not) and baselined.
Performing requirements analysis: people will need to perform the analyses listed in the Analyzing requirements section, including checking that the design is complete, minimal, and consistent.
Validation: people will need to review the requirements against the original mission or project objectives to see if the system will meet stakeholder needs. The reviewers are generally going to be independent reviewers who have not been using the requirements tools along the way. The tools need to support the reviewers determining whether the requirements are satisfactory, and tracking issues that are found.
Verification: some people will need to develop tests or analyses to perform verification. Other people will then need to perform verification on system implementation artifacts, and track the status of how well the implementation is complying with requirements.

Part VII: Design

Chapter 23: Design introduction

14 March 2023

23.1 Purpose

Previous chapters introduced how to work out what a system or a component should do, by determining what the objectives are for it and then turning those objectives into a specification.

The next step is to design the system or component that will fill those needs.

A design for a component provides a simplified model of how the component will achieve the behaviors, qualities, and structure laid out in its specification. The design is not the full details of how it will achieve those things, or a detailed implementation. The design is a plan for how the component will be built, at a high level; it records the high-level decisions about how the component will be implemented without actually being the implementation.

“Design” is an activity that lacks sharp boundaries from other development activities. On the one hand, it responds to the objectives and specifications that have been developed for the thing being built; on the other hand, the act of designing usually reveals gaps in the specifications that lead to feedback that causes people to update the specification. Specification and design proceed recursively as a system is built, where the act of designing one component leads to writing specifications for its subcomponents.

“Design” also lacks a distinct boundary with “implementation”. Indeed, the boundary between the two varies by convention in different disciplines.

In electrical engineering, a “board design” is a detailed mechanical design, choice of components, and layout of components and circuits—what we would classify as implementation. A higher-level design for a board would include things like the most important components (e.g. chip, SOC, discrete electronic components) that will be used, the kinds of interfaces, connectors, or cables that will interface the board to others, along with the record of a rough physical layout that shows that an implementation is likely possible. This higher-level design guides the board designer to create a board design that is ready to be assembled.
In software, a design typically consists of high-level decisions such as where the software will be deployed, the structure of how subcomponents communicate with each other and with outside components, the state and operations that are to be implemented in the component, and behaviors it will perform. This design leads to a software engineer writing the specific code that implements the design.
In architecture or mechanical engineering, a design records the general decisions about the shape or layout of the structure being built, how it connects to other components, and key design choices like materials or colors. This leads to detailed mechanical drawings that incorporate all the supporting details needed to build the structure.

23.1.1 Defining “design”

Given the diversity of ways the word “design” is used, we define what we mean in general by the term.

A design is:

A high-level record of the decisions made about the structure, state, and behaviors of a component or system, and how it meets its specification
A record of how the system or component connects to things outside itself, including physical connections, data or control, or energy transfer
A record of how the system or component breaks down into subcomponents, and the roles or specification of each
A guide to the reviewers and implementers that helps them understand the component’s structure and function
Simpler than the detailed implementation of the component, so that the design explains the implementation
An abstraction of the implementation that allows one to analyze the component (for safety, security, performance, or meeting objectives) at a reasonable level of effort, when doing the analysis using a detailed implementation would be too complex

A design is not:

Detailed enough directly to support assembling and operating the component or system
The specification, which records properties of a component without saying how it should be designed to provide those properties
The implementation, which is has enough details that the actual component or system can be assembled, integrated, tested, and deployed

In some projects we have used the term “design model” for the design, to emphasize that the design is a simplification and explanation of the most important aspects of the component’s implementation.

23.1.2 Contents of design

There are several kinds of information that should be recorded in a design.

The components that make up the system, with names or identifiers for each. This can be thought of as the catalog or table of contents for the system. Every component in the system should be included, whether it is built from scratch for this system or brought in from some supplier. Having a list and names allows people to identify what they are talking about, and it enables checklists for making sure everything is done.
The structure of how each component breaks down into subcomponents. This identifies how the pieces of the system are grouped, which helps people navigate through the design to find the parts in which they are interested. (This also often reflects how the teams building the system are organized.)
How each component interfaces and interacts with other components. This information provides a map to the other components that will be directly affected by this component’s behavior or structure, and vice versa. It provides the information needed to map out more indirect ways that other components can be affected or affect this component.
The high-level state, behavior, and structure of each component. This is the core of the plan for how the component will implement its objectives. The state information might include modes of operation, information the component stores, or physical configurations. The behavior is how that state can change: how the component can change from one operation mode to another, or how the component can change its physical configuration. The structure might be the shape of a mechanical component, the modularization of a software component, or the major chip components of an electronic component.
How each hazard or objective that has been identified for a component is addressed in the design.

All of this information should be annotated with a rationale for the decisions that led to the particular design.

23.1.3 Purposes of a design

Why should one take the deliberate and separate step of putting together a design for a system or component, rather than just implementing a component directly based on its specification?

For an exceptionally simple component, one can skip design and just implement the component—but the component must be truly simple, completely understandable from its implementation, involving no significant design choices, and with no future need to change the component, for this to pay off in the long run.

The value of an explicit design comes partly from its abstraction and simplification, and partly from being done mostly before putting together the detailed implementation.

The effort of design provides time to reflect on decisions before committing to designs.
A design effort allows incremental design that allows one to balance all of the important design constraints before going into detail on any one aspect of the design.
The design provides a guide and explanation for the component.
The design provides a place for people to write down the rationales behind the decisions they make.
The design supports analysis, which usually feeds back into the design before spending resources on implementation.
Designs for alternative approaches for implementing a component support comparison and decisions.

Time to reflect. This is perhaps the most important reason to take the time to build a design before implementing a component. Modern systems are deeply interconnected. The design choices for one component have effects not limited to that component, and the design choices must usually reflect the needs that many other components place on the one being designed. It takes time to find and understand all these interdependencies.

Many components can be designed in multiple different ways. It is often useful to spend some time developing multiple design approaches before settling on one of them. In many cases it is useful to have two or three design approaches, one of which imposes requirements on some subcomponent that are difficult to achieve. That difficulty may not reveal itself until people have proceeded into the specification and design of that subcomponent. Only then may one realize that an alternative design for the original component is better.

Finally, the design needs to support all of the component’s or system’s specification. Rushing through the design increases the likelihood that some essential requirement will get missed, leading to problems later when the component is integrated with others, or when the system goes into operation, and a subtle failure occurs.

Balanced and incremental design. Modern, complex systems involve many different kinds of constraints on components. A component may need to meet all of structural, safety, functional, security, reliability, environmental, maintainability, user interface, and budget constraints to meet its specification and thus to function correctly in the system as a whole.

We have found that focusing too much on any one of these aspects leads to an unbalanced design that does not meet some other aspect. This can lead to repeated partial design followed by redesign after redesign, each time focusing on a different aspect.

The alternative is to consider a little of each aspect at the same time, working to find a rough design that looks like it will be going in a feasible direction for all of these aspects. After there is a rough design, one can go into greater depth on individual aspects with lower risk that the dive into one area will result in not meeting constraints on another aspect.

As one example, reliability and safety often work against each other. The safer choice is often to shut down a component rather than trying to keep it in operation after a failure. Conversely, the redundancy needed to increase reliability increases the complexity of the component, leading to more conditions that could lead to a safety violation.

Guide and explanation. Multiple people will use a design over the course of a project. While one person may develop the first design, others will analyze it for safety or security; still others will review the design for completeness or correctness; one or more people will use it to implement the component; other people will use it to develop and perform verifications. Later, other people will use the design to understand a component that may need a bug fix or feature change.

In other words, the design is for communicating among many different people and over potentially long periods of time, when the people who originally made the design are no longer available to answer questions from their memory.

For all those people who work on the component later, the design provides a guide to understand how the component is organized.

All too often, an engineer is asked to figure out why some existing software component is not working as expected. There is no design, just the source code. The engineer has to try to extract the design from the source code in order to figure out where the component is not behaving as it should. Extracting the design takes time and effort that could be avoided if the design could just be consulted. An extracted design is rarely accurate: the source code does not have a record of where there are subtle, unobvious aspects of the design; nor does it record why the design is what it is. The result is greater cost and time required to update the component, and a higher risk of a change introducing more problems than it fixes.

Decision rationales. A good design includes an explanation of why particular decisions were made. This information helps those who review and analyze the design to determine whether good choices were made. More important, the rationale informs the people who later need to update or redesign the component.

It is common that any electronic board component that is in production more than a handful of years will run into a situation where some chip is no longer available. The manufacturer has stopped making the original chip X, but another manufacturer is making a chip Y that is supposed to be pin-compatible with chip X. Is it okay to substitute chip Y for chip X? That depends on what it was about chip X that led to it being the choice. If the choice was just on the basic chip function, the substitution is probably okay. However, if the choice was based on something unobvious like the chip X’s radiation tolerance resulting from a particular lithography technique, chip Y may not be an acceptable replacement. The only way to know that the radiation tolerance was a key part of the decision is if someone writes down that rationale.

Supporting analysis. Many key component properties, especially those related to safety, security, or reliability, are emergent from the design. It is increasingly evident that these properties are difficult to retrofit into a completed design: they involve the fundamental organization of elements of the design.

This leads to approaches of security-guided or safety-guided design. In these approaches, the security or safety properties are considered from the start and included in the design. As the design progresses from a rough sketch to something more detailed, it can be analyzed with progressively greater accuracy to determine whether these properties are being met.

This approach is relatively inexpensive and easy when it is being done as part of the original design effort. A safety analysis can determine what high-level aspects of a control loop are essential for safe operation; a security analysis can determine what information flow properties must be met to maintain security. These analyses help early pruning of potential design approaches that would not meet safety or security needs.

The alternative is to proceed without including safety or security considerations, then having to go back and work out control or data flow on a more complex design, then repeat parts of the design process while undoing earlier decisions. Repeating work like this takes more time and effort, and is more likely to result in an implementation that has safety or security flaws.

Alternative designs. In the early stages of designing a complex component, there are likely to be multiple different approaches for the component. The choice among the approaches is often not immediately evident. Which one uses chips that will be available on the needed schedule at the needed quantity? Which one uses a subcomponent that will require significant research to make work? Which one will require a significant up-front investment in acquiring long lead time parts? Which one will be acceptable to regulatory agencies? It may take quite some time and effort to find answers to these issues: prototyping a subcomponent, making legal arrangements with suppliers to find out about availability, and so on.

When there are these kinds of risks in the designs, it is helpful to explicitly keep multiple designs open during the investigations, and to delay investing in detailed implementation effort on any one design that would not be useful if that design turns out not to be feasible.

23.2 How designs are used

As noted above, a design enables communication among multiple people, across different times, and for different purposes.

Developing the initial design. One or more people take the objectives, CONOPS, and specification for a component and eventually produce one or more potential designs for that component.

Developing the design is not a single, monolithic activity. It almost always proceeds incrementally, evolving the design from a rough sketch through multiple ideas that turn out not to be quite right until reaching a design that looks like it will meet the component’s specification. The designers will need to try out multiple ideas along the way, meaning that what they document will need to evolve as they try different approaches.

The process of assembling a design can be characterized as working through each of the elements of the specification, while at the same time matching the specification against the possible building blocks for the component. As a simple example, this might involve matching a specification for an electrical energy storage system to store X mAh of energy against a catalog of available battery products.

Actual component specifications involve multiple aspects, some of which will work against each other. A realistic electrical energy storage system must meet performance specifications such as the amount of storage, maximum safe current, reliability constraints, and a number of constraints related to safety. This leads to the recommendation that a designer consider many specification aspects at once, but only at a high level, before going into greater detail.

In the end, the designers must either show that the design they have created fulfills the corresponding specification, or show that the specification is flawed in some way and feed that information back to the people responsible for the specification to get it changed.

Tracking alternative designs. There are usually many ways to design some component, with pros and cons to each. Early in design, there may be multiple promising approaches that require more investigation before a decision can be made among them.

This means that each of the alternatives needs to be documented, along with the investigations needed for each of them, until a decision can be made. It must also be clear to everyone working with the alternatives which one is which. When one alternative is selected, that choice must be clear to everyone working with the designs.

Evolving a design. Every design will evolve, both during the initial system development and over time as the system is used or upgraded or fixed. Any change to the design needs to be evaluated for its scope, its effects, and its correctness.

Evaluating scope and effects means determining what effects the change will have in addition to the specific change being considered. A change in one part of a component might affect some safety property of the component as a whole, for example. A change might also affect some behavior or structure that some other component depends upon, possibly indirectly across multiple intervening components. Substituting one chip for another in a board design might change the timing of some signal, which leads to a subtle change in the sequence of operations performed by software on another board, which in turn invalidates a monitor watching for faults.

Evaluating correctness involves checking that any analyses done on the previous design to show that safety, security, or other properties hold either continue to hold or that the analyses can be adjusted to show that the updated design still meets those criteria.

Analyses. Complex systems will have a number of properties they must exhibit to be correct. These include safety, reliability, and security properties; they also include meeting business objectives and other more mundane properties.

Safety- and security-guided design methods ! Unknown link ref involve incrementally building up these analyses as design progresses, so that a simple, preliminary analysis can provide input to an evolving design.

When a design is believed to be complete enough to select and baseline, it will need review to ensure that it meets all of its specification. Part of this review involves checking the analyses that show the design is compliant. The reviewers need to have the analysis in order to check it.

When a change is being made to a component’s design, the analyses provide a starting point for analyzing the effects of the changes to check that the safety, security, or other properties will continue to hold if the change is made.

Generate specifications for lower-level components. The choice of what subcomponents will be part of a component is a major part of the design effort. The choice of subcomponents means that the role each subcomponent will play has to be worked out; this amounts to developing a specification for each subcomponent.

A subcomponent’s specification is a reflection of the component design. The subcomponent will only work properly as a part of the component if it meets that specification. This leads to the layering principle discussed earlier ! Unknown link ref.

Navigating through the system. Many people will need to find things in the system over time—developers, reviewers, auditors, and many others. Virtually none of them will come in with a complete understanding of the system and its structure, so they will need a guide that helps them learn the structure of the system and to find where some behavior or feature is implemented.

The system design can support such users in three ways. First, the design can provide the breakdown structure, showing how the system is divided into components, those into subcomponents, and so on. The breakdown structure also groups related components together, so that a user can narrow down where they are looking. Second, the design can show how components are related to each other. If one component in one part of the system is providing feedback signals to a component in a different part of the system, making these relationships explicit provides a way for a user to trace out these interactions. And third, including explanations or rationales for why the design is the way it is helps educate the user about subtleties that are not going to be apparent from just reading about the structure, interactions, or behaviors.

Guiding project management. As the design progresses, there will be more components to design than there are people to work on them, and some components will be ready to implement or verify. Project management must make decisions about where to put effort.

Project managers will need information like how risky some potential component designs are, as opposed to those component designs that are fairly certain and thus reasonable to implement. They will need to know which component designs have significant uncertainty, and will benefit from investing resources to prototype a potential design.

These decisions benefit from information that can be gathered and maintained in the overall system design, such as:

How much design or implementation risk is there for this component so far?
Is it possible to estimate the effort, time, or cost involved in designing or implementing this component? What would make an estimate possible or more accurate?

Progress tracking. Project management needs to be able to track the development progress of different parts of the system, in order to determine whether a project is on track for completion or is having problems that need to be addressed.

Being able to name each of the components that need to be developed, and being able to determine the development progress on each of them, enables project tracking.

23.2.1 Design leading into implementation

As well as all the uses listed above, the developer uses the design as a guide for the implementation. The resulting implementation must be consistent with the design: having the same structure and behavior, including all the functions in the design, and including no functions not in the design.

The developer or implementer must be able to understand the design to build a component that matches the design. The developer must also be able to check that they understand the design properly, so that there is a way to catch misunderstandings. A good design uses consistent structure, terminology, and diagrams to aid understanding. It provides a glossary of terms that may have multiple meanings to define how they are used in the design.

Developers will find problems with the design as they proceed through implementation. They may find ambiguities, where the design is unclear or where the design does not address some important condition. The developer may find errors, where the design is inconsistent internally or with its specification. The developer may find that parts of the design aren’t feasible to implement. All of these problems need to be fed back to designers for clarification or correction.

When the design changes, the developer needs to be able to identify what parts of the design have changed so they can change the corresponding implementation. The change might come in response to feedback from the developer, or evolution of the design to address changing needs or broader system fixes. This can be supported by using tools that track design versions and highlight design changes between versions.

Finally, the developer must be incentivized to follow the design (or provide correction feedback) as they implement the component. This includes having the designer and independent people review the implementation to compare it to the design. If they find that the design and implementation are not consistent, they must decide on how to change the design, the implementation, or both in order to achieve consistency. The component implementation should not be accepted as complete until they match.

23.3 Design artifacts

The artifacts that record the design enable all the usage cases listed above. The key functions they need to fill include:

Organizing the design of the system and each component in a way that people can find the parts of they system they are looking for, and understand the context of those parts.
Recording the structure, state, behavior, and shape of each component so that people can build a corresponding implementation or analyze the design to check whether it meets needed properties, including safety and security.
Recording the rationale or explanations for why the design is what it is.

23.3.1 Supporting infrastructure

The designs for a system need to be available to everyone associated with the project, so that they can use the design to learn about the system and navigate through it.

An ideal solution provides a “single source of truth”: a user can go to one place and see all of the information about the system. The ideal solution also ensures that the user always sees a single consistent version of all the information. To the best of our knowledge, at present there are no systems that completely meet this ideal. However, there are ways to come close by integrating multiple tools and applying conventions to how they are used.

The infrastructure for maintaining designs needs to, at minimum:

Store all of the design information, organized according to the names or identifiers for each part of the system
Support multiple versions of the design, where one version is the current approved baseline, some versions are outdated and no longer useful except for historical reference, and some versions represent works in progress that are being developed or are alternatives, and that have not yet been completed and made part of the approved baseline. Note that software repository systems have evolved a number of useful conventions about how to maintain and name these versions; the version management capabilities for a design repository should be at least as capable as those.
Support finding the differences between versions. When a design changes, a reviewer or developer will need to understand what they need to look at to understand the change. Tools that expose all differences between versions reduce the chances that some change will be overlooked.
Support integration and tracing between specification and design artifacts. The design of a component is in response to the specification for that component, and the infrastructure should make it convenient for a user to see how the design maps to the specification and vice versa. Similarly, the specification of a subcomponent responds to the design of the component it is part of, and the infrastructure should help users understand that connection.
Support reviews and approvals. Each design will begin as a work in progress, and then progress until it becomes the approved baseline (or until the work in progress is abandoned). Being able to track what designs need to be reviewed by whom, track the feedback from those reviews and corresponding updates, and tracking the approval of a design is error-prone without tools that support the process.

23.3.2 The artifacts

The following sections list the key artifacts that should be part of a design. Later chapters will detail these artifacts.

23.3.2.1 Breakdown structure

The breakdown structure consists of the hierarchical relationship of system, components, and their subcomponents recursively. It gives a name or identifier to each component, and provides the index or table of contents to the parts that make up the system. See ! Unknown link ref.

23.3.2.2 Control structure and other large-scale behaviors

A complex system will have behaviors or structures that cross multiple parts of the system, and don’t neatly fit within a single hierarchy of components. There are two important examples of these behaviors to document.

The first example is behavior or activity sequences that show how different parts interact with each other. These are sometimes documented as UML or SysML activity diagrams, which show how control or data pass among components, and how different components take actions in response to those. The point of these patterns is to show how components work together, which informs the interfaces, actions, and states that the components involved in the activity must support.

The second example is the hierarchies of control that operate in the system. These document how one part of the system controls the functions in other parts, including how some components provide sense data to drive the control logic, and how the control logic in turn sends commands to other components to effect control actions. Documenting and analyzing these control systems is an essential part of some safety and security process methodologies, such as STPA [Leveson11].

23.3.2.3 Details of each component

Each component in the system should have its own design. This is the primary content about individual components, as opposed to how components work together.

A component’s design can be represented in many different ways. However, it is easiest for users if all designs follow the same general format so that they know how to find particular kinds of information within every design.

All designs should include:

An identifier and name or title for the component
The version of the design and its status (in progress, obsolete, baselined)
The position of that component in the breakdown structure (that is, what it is a subcomponent of)
A short description of the purpose of the component, to aid people coming later to understand the design
Information about the states, behaviors, interfaces, and environment for the component
The decomposition of the component into subcomponents, if any
A list of the artifacts that implement the component

Each component’s design should include rationale: the reasons why different design choices were made. This information helps those who must come along later to review or update the design.

In some cases, not all of this information can be represented in one way or in one tool. For example, for electronics designs the best way to represent some information will be in a CAD drawing that is maintained in a separate tool from the rest of the design information. In these cases, there should be unambiguous references from the main design to the CAD drawing and vice versa, and the versioning in the main design should be reflected in versioning in the CAD tool.

23.3.2.4 Safety, security, and other analyses

Part of the reason for developing a design—as a simplified model of what will be implemented—is to enable analysis of the design’s essentials. These analyses address whether the design will meet aspects of the component’s specification. These can include safety and security, as well as meeting business objectives, regulatory requirements, performance specifications, or resource budgets.

As we will discuss in the next section, it is recommended practice to develop these analyses incrementally in parallel with the design itself. In this way, a rough analysis of a rough design can provide quick, early feedback that will guide the design toward meeting its specified properties as it is developed in more detail.

These analyses become an important part of the record of a design once complete. They provide an extended rationale for why the design is the way it is. They may be needed to answer to external stakeholders, including regulators or courts of law, when it becomes necessary to provide evidence why the design is acceptable. The analyses also help people who must later evolve the designs to understand both the constraints on what they can change, and where they have freedom to make changes without invalidating the safety or other properties of the design.

23.4 Developing designs

As a matter of principle, the design for a system or component should be done after its objectives and specification are done, and before its implementation. Similarly, the design for the components in a system should proceed top down, starting with the system as a whole and proceeding to lower and lower level components. When the design of one component depends on the design of another, the two should be designed together.

These principles often lead people to conclude that systems should be built using a waterfall-like process, where everything is specified before design, everything designed before implementation, and so on.

Real projects are not so simple. We have never observed a project that actually used such a process, even when they tried to. This is because every complex system we have encountered is not fully and accurately knowable in advance. One can write a set of specifications that turn out to require some impossible component design. One might miss some important system objective when developing the initial system concept because the customer was not able to conceive of system operation until they could see part of the system in operation, or because the customer’s needs change. An initial design may be invalidated because a supplier discontinues an essential part. Some part of the system may require significant investigation or research before one can find a feasible way to approach its design.

All of these situations lead to cases where the specification, design, and implementation of the system does not proceed in a tidy one-way sequence through the waterfall stages. Instead, part of a component’s specification gets worked out, and some tentative design goes ahead using that part of the specification gets worked out. Or multiple possible design approaches are defined, and then someone proceeds to build simple prototype implementations of two or more of them to compare their feasibility. Or the design for a component must change, leading to a change in implementation. All of these may be happening in multiple parts of the system at once.

At the worst, all this change happening all over a system can lead to chaos where people working on different components are working to incompatible specifications or designs and building parts that will not integrate into a system. Project management may not be able to determine how much progress has actually been made on any part of the system, and thus be unable to detect when there are schedule or resource problems.

Therefore while the simple waterfall model, which organizes the work on a system, is not feasible, there is still a need to organize development work.

23.4.1 Applying general principles, flexibly

The principles we started with are good ideas in general, when used flexibly.

Develop specifications, then design. When one designs a component without first working out what the rest of the system needs that component to do, one usually ends up with a design that doesn’t actually meet needs (once those are worked out). When a specification gets developed, the people involved will tend to look at the effort that has already been spent on designing (and possibly implementing) the component and will try to adjust the specification to fit that sunk cost—after all, that work has already been done, why should it be discarded? Unfortunately this tends over time to produce safety and security problems, and to dramatically increase the cost of the system as people try to integrate the wrong component into the rest of the system.

It is better to explicitly defer some design decisions until the specification is firm—but not avoid doing any design. (Doing no design until specification is done is not possible when the design activity can reveal problems with a specification.) Do a minimal amount of design, bearing in mind the risk that design may need to change as the specification changes, as well as the risk that the specification may need to change as design reveals problems.

Instead:

Keep the design officially tentative as it is being developed, so that people continue to treat the design as a work in progress that they are willing to modify.
Limit the amount of design effort while a specification is incomplete, so that one can check whether the specification can lead to a feasible design but while keeping the design open as the specification is revised.
Focus design efforts on fulfilling parts of the specification that either appear to be fairly certain, or on those parts of the specification where there appears to be risk of specifying something that can’t be built.
Track how the design matches the specification as the two evolve, so that when the specification changes one can accurately determine what parts of the design are affected.

Develop design, then implement. Similar to the way design reflects specification, the implementation reflects design. Proceeding with implementing a component before it has been designed is not really possible: doing so means that design is done implicitly and is left unrecorded. This leads to components that fail to meet functional, safety, or security constraints because those constraints have not been properly considered and analyzed before committing effort to implementation.

At the same time, deferring all implementation until all design is complete is a recipe for an infeasible system. It is all too easy to create a design that involves impossible feats of implementation, from requiring metals that do not currently exist (“unobtainium”) to algorithms that have not been invented.

We have found that a middle ground often works well. As we will discuss in future chapters on implementation, we have used a software implementation approach that emphasizes continuous integration (by which we do not mean continuous testing) and skeleton building for implementation, where the implementation proceeds in many small iterations. Using this approach we can build a simplified implementation of the general structure of a component, focusing on those aspects where the design appears either to be relatively certain or where there is higher risk in the design that needs to be checked with a rough implementation.

We have also made a point of prototyping implementations of parts of a design in order to validate whether the design is feasible. We will also discuss prototyping in a future chapter.

There is a high risk with any implementation done before specification and design are solid, even when the implementation is done for good reasons (like prototyping to validate a design approach). The effort spent on implementing something is a sunk cost: it cannot be recovered. As the design evolves, there is a strong incentive to try to continue to reuse the implementation that has already been completed, as the incremental cost or time of modification is almost always perceived to be less than starting a new implementation from scratch. This leads to a sequence of incremental changes, each of which by themselves can be perceived as the lower-cost way of handling a sequence of design changes. However, it is often the case that after a few of these incremental changes, it will have become more cost-effective to have thrown away the initial prototype or implementation and started over with better information. This sequence of incremental changes also tends to result in an implementation that has many vestiges of implementations that are no longer applicable, but which continue to present a source of bugs, security flaws, or safety problems.

The cost of incrementalism is often apparent only in retrospect. It is also driven by basic business imperatives to minimize cost at each step, or to get features implemented as rapidly as possible. This is an example of an online optimization problem, which is often hard to solve well theoretically and even harder when human incentives are involved. The techniques used to solve similar online optimization problems (notably the ski rental problem ! Unknown link ref) apply. Limiting the amount of implementation effort that may be at risk for incrementalism by deferring as much implementation as possible until the design is solid helps avoid this situation.

Thus we:

Limit the implementation effort to building simple, low-effort skeletons or prototypes of a component implementation, rather than moving to building the implementation as a whole.
Deliberately make the initial skeletons or prototypes undeployable so that the temptation to use half-baked experimental implementation is reduced.
Focus the implementation effort on aspects of the design that either have low uncertainty, implying that the risk of having to re-implement is low, or where the design carries high technical uncertainty, implying that the value of a prototype feeding back information to the design is high.
Track the relationship between design and implementation, so that the implementation can accurately change as the design evolves.

Design top down and coordinate the design of interdependent components. Many aspects of a system’s design can only be developed effectively when they are developed from the top down, notably safety and security properties. That is because these properties apply to the system as a whole and are emergent from the designs of the components that make up the system. (See [Leveson11] for an in-depth discussion of this effect.)

However, designing from the top down creates risk, similar to the previous principles, that a high-level design may create unachievable specifications for lower-level components. There is also a risk that during high-level design the cost or time involved in developing some lower-level parts of the system is unknown. This can lead to effort being spent, unknowingly, on subcomponents that are simple to design and build while subcomponents that will take far longer to develop are left for later, leading to a drawn-out schedule.

Our recommendation for managing this risk is to sketch the design for multiple layers, creating a rough outline of a design for a component and some layers of its subcomponents, then proceeding to add detail to the high-level component and fleshing out the specification for its subcomponents. Proceeding incrementally in this way allows one to obtain some information about the feasibility and complexity of a particular design approach before committing all of one’s effort to the detail of the top-level component. This approach is similar to our recommended implementation approach of building skeletons or prototypes of components rather than immediately progressing to detailed implementation.

The same issues about the cost of incrementalism apply to top-down design as they do to implementation. It can be useful to make sketch designs that are not in the final form needed, to reduce the temptation to turn sketches that have been changed over and over directly into the design for a component.

23.4.2 Additional principles

Balance design work. We have found that focusing on one aspect of a component’s design to the exclusion of others often leads to dead-end designs, where a work in progress becomes too biased toward one aspect and is not readily evolved as other aspects begin to be considered. Focusing on primary features first, and leaving security or safety for later, is a common example of this pattern.

We have found it more useful to consider many different aspects of a component’s design at a high level, sketching out different rough possible designs and making simple comparisons as we learn about the design problem. This approach has the advantage of investing relatively less effort on detail design and analysis while the design has a higher degree of uncertainty, and focusing effort on those approaches that pass the first simple evaluations.

This approach to design has its pitfalls. Some components’ designs are constrained by particular aspects—such as a need for high performance or the ability to operate in an extreme environment. These aspects are sometimes called design drivers: they have a disproportionate effect on the final design. Recognizing when some aspect drives the design in this way, and putting more effort earlier into understanding these drivers, is part of the art of designing well.

Plan for updates. Nearly every design in a successful system will be updated as time goes by. Over time, the effort spent on these updates will dwarf the effort spent on the initial design. This means that if one is developing a system for the long run, the processes, tools, and artifacts used in the design effort should be organized in a way that supports those who will come along to learn about, evaluate, and redesign parts of the system—long after those who initially designed it have moved on.

This necessitates documenting more than just the structure of the implementation. For these people to understand a design, they need to know the thinking behind what the choices were and the subtle aspects that are not necessarily apparent from looking at the implementation. These people will need guidance for how components relate to each other. They will need to understand the analyses that determined whether the component’s design was sufficiently safe or secure. This documentation takes more effort than proceeding through a one-time design, building an implementation, and then moving on, but it provides a project with a future.

Making updates effective also involves creating a team structure and human processes that can handle updates. This involves giving the team a clear way to understand how design changes happen, and how to distinguish proposals or work in progress from a design they should work from, or how to determine what design applies to a specific deployed system. It also involves developing a team culture that incentivizes good design and good documentation, giving them enough time to document enough design that their successors can build on their work and avoiding creating unnecessary time pressures that disincentivize people from doing good design.

Use appropriate infrastructure. Finally, effective design relies on having the tools, processes, and standards that give people the tools to do design work. The key principles we recommend include:

All designs should be maintained in a single repository, so that everyone on a project knows where to access the information and everyone works from the same designs. This is often referred to as having a “single source of truth”. As noted earlier, this is not always possible when some tools will not integrate with others; in that case, each particular kind of design information should be maintained in one place and there should be clear linkages between a document in one repository and a related one in a different repository.
All design information should be versioned, so that everyone on the project can determine whether they are using the same version of a particular design artifact. Versions should be for more than just single artifacts; they should define a version of the entire system, so that the artifacts in one version are consistent with each other.
Design artifact versions should make it clear when a version is a proposal or work in progress, a version being reviewed, a version that has been baselined (meaning approved for others in the team to use), a version that applies to a deployed or released system, or a version that has been superseded. There should be a common standard for how these different kinds of versions are related to each other and labeled.
Design artifacts should use common formats or design standards. This includes a common format for the basic information about a component’s design, common formats for drawings (such as using SysML/UML, or CAD standards), or common formats for reporting analyses.

Chapter 24: Breakdown structure

9 June 2022

The component breakdown, or breakdown structure, is the way to name and organize all the components that make up a system.

24.1 What is the component breakdown for?

The component breakdown organizes and names all the pieces in the system. It serves three main purposes:

It provides a master list of all the components that need to be designed, built, and tested to make up the system
It provides a unique identifier for each component, so that every piece of the system can be referred to
It provides an organization for all the documents and artifacts that make up the system. In this role, it provides an index or table of contents that helps people find information about the system and its components

These purposes lead to a few objectives that a breakdown should meet.

The breakdown should be complete. It should include all the major components that people will be working on.
The breakdown should reflect how people naturally think about how components relate to each other. People should be able to navigate through the breakdown structure accurately to find information.
The breakdown structure should result in identifiers that are easy to use. The identifiers will be written down in many different places; they will be embedded in different artifacts to link them together. Identifiers that are too long or too easy to mis-copy will cause difficulty for users.

24.1.1 Component breakdown versus work breakdown

Some institutions, notably NASA [NPR7120][NASA18] and other parts of the US Federal government [DOD22], specify the use of a work breakdown structure (WBS) in project management and systems engineering. A WBS as used in those projects is different from a component breakdown structure as defined here.

A WBS is oriented toward project management, not systems engineering. It is focused on defining the work to be done (hence the name) rather than the items or components being built by the work. From the NASA WBS Handbook [NASA18, p. 35]:

The WBS is a project management tool. It provides a framework for specifying the technical aspects of the project by defining the project in terms of hierarchically-related, product-oriented elements for the total project scope of work. The WBS also provides the framework for schedule and budget development. As a common framework for cost, schedule, and technical management, the WBS elements serve as logical summary points for insight and assessment of measuring cost and schedule performance.

This difference in intent leads to two major differences in the contents of a WBS compared to a component breakdown. The first is that a WBS includes work items that are not product artifacts. The standard NASA WBS, for example, includes project management, systems engineering, and education and public outreach branches of the work breakdown tree [NASA18, p. 47]. Given that part of the goal of the WBS is to organize resources and budget for a project, that’s an appropriate choice. The other difference is that some people break a task for building a component down into multiple revisions or releases. For example, a “motor control software” component might have subitems “prototype”, “release 1”, and “release 2”, recording the phases of work done to develop that software package.

The component breakdown structure presented in this chapter is narrower in focus than a WBS. The component breakdown lists only the things that are being built. It must be complemented by other engineering and management artifacts to provide everything needed to run a project.

24.1.2 Component breakdown versus other views

The component breakdown is one of several views into the system’s design and specification. The component breakdown has only two purposes: listing all the components and giving them unique names, and providing a structure that people can use to navigate through the components to find one they are looking for.

The component breakdown is not for expressing other facts about components and relationships between them. There are other views and other breakdowns for representing that information—and for doing so in ways that are better suited to the specific information that needs to be explained. For example, a network or wiring diagram does a better job of illustrating how multiple hardware components are connected together. Mechanical drawings are a better way to show how components relate to each other physically. Data and control flow diagrams, perhaps realized as SysML activity and sequence diagrams, are better suited to expressing relationships between software components.

24.2 Basic concepts

When developing a component breakdown, the first question to be settled is: what is a component?

First, a component is something that people think of as a unit. Terms like “system”, “subsystem”, or “module” are all clues that people think of a thing as a unit. More generally, a component is something

That can have a defined boundary of what functions or parts are part of the component, and what is outside, and
That has a physical or virtual reality; it is something that needs to be built or acquired.

Components do not have to be atomic units. Systems have subsystems; components have subcomponents. For example, the electrical power system (EPS) in a spacecraft is a medium-level component in a typical breakdown structure. It is part of the spacecraft as a whole. It is made up of several subcomponents: power generation, power storage, power distribution, and power system control. Each of those subcomponents in turn have constituent components themselves: for example, power generation has solar cells, perhaps arrays that hold the cells, perhaps some other power generation mechanism.

This illustrates the general pattern for the breakdown structure. The structure is a tree, with the highest-level component being the system as a whole. The system as a whole is typically not just a vehicle or box; it is the entire mission or business on which a vehicle is part. Underneath the whole system come the major component systems. For a spacecraft mission, this might be the spacecraft, ground systems, launch systems, and related assembly and test systems. The next level of components are the major subsystems. The structure continues recursively until reaching components that are the smallest that are sensible to model using systems tools.

The recursive process of defining smaller and smaller components ends when there is a judgment that further subdivision won’t help the systems engineering process. In practice, for example, continuing the breakdown structure all the way to individual resistors and capacitors on a printed circuit board is too detailed to be useful for systems engineering tasks.

Some criteria I have used for deciding when to continue subdividing a component into subcomponents include:

Does the state of the component separate into multiple independently-acting parts, or not? If the state of the component can easily be modeled as a single state machine, then it probably does not need to be refined; if the natural modeling is as a collection of independent, interacting state machines, then it should be refined.
Is the component physically a single physical unit? Or is it fundamentally a collection of smaller physical components?
For software, is the component treated as a single unit? Or does this project assemble the software component out of other parts?
How large or complex are the tasks to design, build, or test the component? If the component can be designed and built by one person or one vendor, then there is less incentive to split it into subcomponents than if multiple people or organizations must work together to design and build the component.

Some examples:

An aircraft. This is a high-level component in a breakdown structure for a project that is about designing and building a new aircraft model. In that project, the aircraft will have components like body, wings, propulsion, fuel, communication, and so on. If, on the other hand, the project is to design and build an air traffic control system, each aircraft is likely best treated as a low-level component that is not further broken down because all the aircraft in the airspace have roughly similar functions and behaviors, and the air traffic control system is concerned primarily with sensing those aircraft and guiding their flight.
An aircraft’s body (fuselage) structure. This is usually a middle-level component in an aircraft system. It is part of the aircraft over all. It is made up of formers, stringers, trusses, and skin. Each one of those subcomponents is typically built separately and then joined together to make the aircraft’s structure. Each will have different mechanical constraints; there are likely to be different choices for the materials out of which each one is built. These are candidates for refinement of the fuselage as a whole into a collection of subcomponents.
Infrastructure software, like an operating system. An operating system is a complex body of software; some operating systems contain several million lines of code. If a project is building an operating system, then the operating system as a whole will clearly need to be broken down into many subsystems. However, for a project that uses the operating system but doesn’t build it, the operating system doesn’t necessarily need to be broken down further: the team will obtain the operating system as a whole from a vendor and use it in its entirety without the team needing to make modifications to the operating system itself. For an aircraft project, for example, the team may buy a real-time operating system from a vendor. The team may need to build or separately obtain some of the related software components, such as a board support package and device drivers. In this case, the OS would be one component, and the BSP and drivers would be separate subcomponents–but there is no need to model the operating system’s virtual memory management system.
A printed circuit board implementing a processing unit. This is an example that may or may not be worth breaking down into subcomponents. If the board (and the parts mounted on it) are treated as a single entity, acquired as a whole, and treated as a single unit for application software purposes, then the board need not be broken down further. On the other hand, many circuit boards implementing a processing unit today are complex systems: in one recent example, the board had a primary processor, a secondary control processor, a storage system, a cryptographic unit, and two different network communication units. While the board was built and delivered by a single vendor, separate software had to be written for each of those parts—and in some cases there was an option whether to populate some of the storage and communication parts or not. In that example, breaking the overall board into subcomponents was necessary.

24.2.1 Satisfying the objectives

Completeness. Completeness depends on the exercise of identifying all the components to run to completion. The hierarchical approach does not directly inhibit or support this objective. However, the hierarchical approach makes it easier to approach completeness iteratively: one can start with a high-level breakdown, and incrementally expand parts of the breakdown tree when one finds that some components need to be refined.
Supporting navigation. People generally talk about the structure of systems in a hierarchical way: system and subsystems and components and subcomponents and so on. This means that a hierarchical breakdown structure matches common usage (as long as the refinement into smaller and smaller components follows the common usage).
Usable identifiers. The hierarchical structure does support unique names for each component, as will be discussed later. The identifiers are usually not the most compact possible, because the identifiers reflect a path of names through the breakdown structure tree, similar to the way file systems and URLs organize hierarchical names. However, the hierarchical identifiers in practice have worked well as a readable and writable form of identifiers in other domains, including URLs.

24.2.2 Alternatives

The approach laid out here is fundamentally hierarchical, and reflects the way people usually approach breaking down a complex system—by a reductive approach that organizes parts into a hierarchy.

That is not the only approach to organizing the components. Mechanical and electrical engineering systems often use a more-or-less flat space of part numbers to identify components. The specifications for each part can have attributes, and the attributes allow one to search for a desired part.

A flat part number approach works well for low-level, physical components. A 100 ohm resistor can be used in many different components; there is little value in giving a different name for its use in one place on one board and a different name for a second place on that board, or on a different board. Similarly, when manufacturing many instances of a vehicle, using a part number to identify the part in an assembly works well.

I have generally not used a part number approach for higher-level systems activities, however, because the uses are not the same. During design, each component that systems engineering deals with is generally unique.

24.3 Component identifiers

A component’s identifier provides a unique way to refer to that component. It is like the address for a building: it allows one to find the component (or its specifications), but does not by itself convey much more information. The keys are that the identifier be unique, and that people can use the identifier to find what they are looking for.

The pathname is the long-standing practice for creating identifiers for elements in a hierarchy. This is familiar from file systems and URLs: the path /a/b/c/d refers to a file or object named “d”, which is contained in “c”, which is in turn contained in “b”, which is part of “a”, which is one of the top-level objects or folders in the system. While the object name “d” is not necessarily unique (there can be another object /a/f/d, for example), the path as a whole does give a unique identifier for the object or file.

This approach applies to the identifiers for components in a breakdown structure as well. The names in the path are typically separated by a slash (/) or period (.).

The names of each component in the tree can be abbreviations or short words describing the component. Both work well; the choice is primarily a matter of style. When there are commonly used abbreviations for some components, it is reasonable mix and match abbreviations and longer names. For example, a spacecraft’s computing system is often called the CDH (command and data handling); attitude control is the ACS (attitude control system); and the electrical system is the EPS (electrical power system).

Some examples from a fictitious spacecraft system:

Abbreviations	Short names
sc	spacecraft
sc.eps	spacecraft.power
sc.eps.batt	spacecraft.power.battery
sc.cdh.fp	spacecraft.cdh.flightprocessor

Long component identifiers can become a problem. Long identifiers are harder to type than shorter ones. Sometimes there are limits on how long an identifier can be; for example, if one is recording information about components in a spreadsheet and putting each different component on a different sheet, most spreadsheet packages have limit on how long a sheet name can be.

The length of an identifier is driven by how deeply the breakdown structure tree goes. The path name for a component six layers down in the hierarchy will be much longer than the path name for a component in the third layer. This suggests that one should try not to make the component hierarchy any deeper than it needs to be.

24.4 Viewing the breakdown structure

Many people find a visual representation of the breakdown structure helpful for understanding it. Here is a drawing of an incomplete breakdown structure for a simple spacecraft:

It is worth finding tools that can show this kind of visual representation of the breakdown structure.

24.5 Context and relationships

The breakdown structure provides the fundamental organization for most systems engineering artifacts. This means that the structure chosen for the breakdown will affect how most other parts of a specification are organized.

Each component named in the breakdown has a specification. The specification includes information like

The purpose and scope of the component
Its requirements
Its state and behaviors
Its interactions and interfaces with other components

When two components interact, the interface between them must name which components are involved. The specifications for each component must indicate what data or control they will be sending and receiving in the interaction.

The identifier for a component provides a way to express a reference between implementation and test artifacts, like source code or drawings, and the specifications to which they should comply.

The breakdown structure affects almost everyone working on the project. This includes:

Systems engineers develop the breakdown structure
Every engineer and developer users the breakdown structure to find information about the components they will work on or that they will interact with
Project managers use the component breakdown as a basis for defining development tasks and tracking their progress
Contract managers use the breakdown structure to identify deliverables that vendors might provide

24.6 Advice

24.6.1 Evolution

The understanding of the system evolves gradually from the initial concept to the time that a final product is delivered (if indeed there is a final product). At each step of this evolution, the understanding of what should be in the breakdown structure and how it should be organized will change.

Because the breakdown structure is central to many other processes and artifacts, a change to the breakdown structure will result in changes to potentially many other artifacts. The cost of the change grows as the size of the breakdown structure tree grows.

Don’t try to build an elaborate and complete breakdown structure too early. At the beginning, while still working out the basic concepts of the system and its structure, just sketch out the first level of the structure—and try out several potential structures until one appears to match the system’s objectives. Often the main structure will be suggested by common practice for similar projects: the automobile industry has a common, vernacular breakdown of cars and trucks into common subsystems, for example.

In general, it is best to keep a branch of the breakdown structure shallow as long as there is significant uncertainty about how that part of the system will be designed. In an aircraft, for example, the propulsion system should be left unrefined in the breakdown structure until the team has settled on the general approach to propulsion—will it use turbofans, turboprops, propfans, electric rotors, or some combination? The broad choice can typically be settled early in concept development by working out the concept of operations and determining what capabilities, performance, and physical layout will meet the aircraft’s operational needs. Once the general architecture has been decided, then one can refine the propulsion system by adding a layer of components for each engine or other major unit involved in propulsion.

24.6.2 Depth

The point of the breakdown structure is to help people find and refer to components. The breakdown structure should reflect common ideas of how a system breaks down into components, and should result in short, easy-to-use identifiers. The breakdown structure should focus on these capabilities and not be drafted into serving other purposes.

Consider the breakdown structure for all the sensors that provide information to an autonomous vehicle. One way to organize the sensors is to create a general “sensors” component, and then include all the sensors as children of the general sensors component. Another way is to break the sensors down first by general type (camera, lidar, radar, sonar, microphone), then by general location of the sensor on the vehicle (front, left, right, top, back), and then by the specific sensor unit. In this example, the first approach leads to a shallow and broad breakdown structure; the latter example leads to a narrow and deep structure.

In general, a shallow, broad breakdown structure will meet these objectives better than a narrow and deep structure. There are a few reasons for this.

The identifiers for sensors will be shorter. The shallow structure might name a particular camera “sensors.leftcam” while the deeper structure might name it “sensors.camera.left.camera1”.
The broader structure serves the purpose of helping people locate a component better. In the broad structure, a developer knows that all the sensors are under the “sensor” component; they can navigate to the sensor component and then find the particular sensor they are looking for in the list of sensors. In the deep structure, the developer must navigate to sensors, then to cameras, then to the location, and then find the sensor. If the developer is looking for a particular type of camera and doesn’t know where cameras of that type are located, the extra layers pose extra work.
The shallower structure handles evolutionary change more easily. If a project uses the deeper approach, then determines that the layering structure isn’t working as well as expected and the structure needs to change, that will lead to changing the identifiers for many components—which in turn leads to finding and updating identifiers in many other artifacts where the identifiers have been written down. In the shallow structure, there is no structure organizing the sensors and nothing to rearrange.

This leads to a general principle. The breakdown structure should be used only for providing a unique name, and not for embedding a taxonomy or search attributes. The tools that people use to navigate through the breakdown structure and its related artifacts, like specifications, should provide search mechanisms that let someone find a component by attributes. Embedding extraneous information, like a location attribute or model number or power requirement in the name will just make the names longer, harder to use, and less resilient to change.

24.6.3 Multiple fit

The hierarchical, tree-structured approach recommended here makes each component part of exactly one parent component. It does not accommodate components that have more than one natural affinity to parent groupings.

Consider a radio transceiver that is used to communicate between aircraft, such as the ADS-B systems used for collision avoidance. This transceiver could be categorized multiple ways. It is part of the aircraft, but it is also part of an air traffic management safety system. The transceiver within the aircraft is part of a communication system, but it is also a part of the flight control system and intimately connected with human interface components on the flight deck. The transceiver, in other words, is part of several different groupings of components, depending on who is looking and for what purpose.

There is a fundamental tension between simple organizing structures, like a tree, and the richer relationships that elements of a system have with each other. For an excellent discussion of this, see Alexander’s essay on trees as a structuring approach for cities [Alexander15]. In that essay, Alexander proposes that a lattice structure is a more appropriate model for organizing urban structures. In his account, a tree-oriented description of a city fails to account for the ways that a house can be both a place for a family to live as well as a node in a social network and a place of work; in each of these roles, the house is related to different buildings or locations in the city.

The systems engineering approach presented here addresses this problem by separating naming or identity from the complex relationships that each component actually has. The breakdown structure only tries to give a name to each thing, like the address for a building. The relationships, functions, requirements, and everything else that goes into defining a component are all left to other artifacts, such as the component’s specification and models of the components.

This means: don’t try to make the breakdown structure do too much. When a component fits into multiple categories, pick the one that seems most natural for most users and leave it at that. Other artifacts and tools will address greater complexity.

24.6.4 Not by function

The breakdown structure is for organizing components: things that are built and that can be seen or touched (possibly virtually).

There is sometimes a temptation to try to organize system functions into the breakdown hierarchy. Don’t do that. The breakdown of function—and of the allocation of function to component—is a separate task that needs to be addressed by a structure that focuses on how functions are organized.

A better approach is to maintain the component breakdown and a functional breakdown separately, and maintain an allocation mapping that shows how different subfunctions are achieved by different components. The functional breakdown is often better reflected in the structure of how specifications or requirements derive from each other. See the chapter on requirements for more on this.

24.6.5 Keep related things together

Some projects have proposed organizing components primarily by some fundamental, nonfunctional attribute. One project was considering separating hardware from electronics from software from operational procedures at the top level, and then organizing components within each of those categories by subsystem. Another project organized components first by the vendor organization that was to implement the component.

These approaches make it harder for people to use the breakdown structure to find things. Consider an electrical power controller on a spacecraft. This has an electronic component (the board and processor that runs the power control function) and a software component (that makes the decisions about what to power on and off, and to report information to a telemetry function). Someone working on the power controller will generally want to know about both aspects. Requiring them to look in two widely-separated parts of the breakdown structure is inconvenient, and (more seriously) it increases the chances that someone will miss a component that they need to know about to do their work.

As a general principle, it is better to group components by how people naturally think of them as being grouped. Keep functionally-related components close together in the breakdown structure so that people find everything they need about something by looking in one place.

As noted above, this doesn’t always work. The breakdown structure will not be perfect because not everything in a system naturally falls into a hierarchical organization. But the more that like things can be grouped, the easier it will be for people.

24.6.6 Generic and reusable components

There is one special case of a component fitting into multiple places in a breakdown structure that deserves special treatment: generic and reusable components.

Consider an operating system. There may be multiple processors within a system that may all run instances of the same operating system. It is useful to have one specification for that operating system: there’s one product that is acquired from a vendor, there is one master copy kept somewhere, and so on. At the same time, that operating system will be loaded onto many different processor components in different subsystems.

One way to address this is to have a part of the breakdown structure for generic components, and then put an instance of that component in the places where it is used. The specification of each instance component can refer to the specification for the generic, with those functions or requirements that are specific to the instance added. This is an example of using the class-instance model from object-oriented programming to solve the problem.

24.7 Examples

24.7.1 NASA Work Breakdown Structure

The NASA project management process and systems engineering standards use a common WBS structure across all NASA projects. The use of the WBS is codified in a Procedural Requirement document [NPR7120], with details in an accompanying handbook [NASA18].

The NASA WBS is used as a project management artifact to organize work tasks, resources and budget, and report progress. The hierarchy must “support cost and schedule allocation down to a work package level” [NPR7120, p. 113]. A “work package” means one task or work assignment that is tracked, budgeted, and assigned as a single unit.

A NASA project’s WBS tree is rooted in the official NASA project project authorization, with its associated project code.

The first level of elements is defined by NASA standards, and each element has a standard numbering. The standard elements for a space flight project are: [NPR7120, Fig. H-2, p. 113]:

Note how this organization mixes technical artifacts (payloads, spacecraft, ground systems) and management activities (project management, safety and mission assurance, public outreach).

The NASA WBS is intended to be one part of an overall project plan document. The project plan also contains information like:

The governance and authority for the project
Resourcing
Organization (team) structure
Schedule and cost plans
Safety and risk plans

24.7.2 MIL-STD-881 Work Breakdown Structure

This breakdown structure standard aims to provide a “consistent and visible framework” [DOD22] for communicating and contracting between a government program manager and contractors that perform the work. It addresses needs such as “performance, cost, schedule, risk, budget, and contractual” issues [DOD22, p. 1]. This kind of WBS is thus focused on supporting contractual relationships with suppliers.

The standard defines a number of different templates for different kinds of projects. It includes templates for aircraft systems, space systems, unmanned maritime systems, missiles, and several others.

The template for an aircraft system includes the following Level 2 items:

System integration, assembly, test, and checkout
Air vehicle
Payload/mission system
Ground/host segment
Aircraft system software release 1..n
Systems engineering
Program management
System test and evaluation
Training
Data
Peculiar support equipment (support equipment specific to this air vehicle)
Common support equipment
Operational/site activity by site
Contractor logistics support
Industrial facilities
Initial spares and repair parts

As should be clear from this example, this WBS template aims to address not just the design and building of a system but rather the operation of the entire program, including testing, deployment, and initial operation.

24.7.3 A simple spacecraft system

This is an example component breakdown for a simplified imaging spacecraft. The spacecraft uses solar panels to collect energy; it has a single imaging camera to collect mission data; it has a flight computer to run the system; an attitude control system to point the imager where needed; and a radio to communicate to ground. (The graphical version of this breakdown structure is included earlier in this chapter.)

Id	Title
space	Space segment
space.acs	Attitude control system
space.acs.control	Control logic
space.acs.sun	Sun sensor
space.acs.wheels	Reaction wheels
space.cdh	Command and data handling avionics
space.cdh.gps	GPS receiver
space.cdh.gps.ant	Antenna
space.cdh.main	Main processor
space.cdh.storage	Data storage
space.comm	Communications system
space.comm.ant	Antenna
space.comm.ant-tran	Cable
space.comm.trans	Transceiver
space.eps	Electrical power system
space.eps.battery	Battery
space.eps.controller	Power controller
space.eps.panels	Solar panels
space.eps.sep	Separation switch
space.harness	Harnesses
space.harness.canbus	Data CAN bus
space.harness.pl	Payload harness
space.harness.power	Power cabling
space.harness.radio	Radio harness
space.pl	Payloads
space.pl.imager	Imager payload
space.prop	Propulsion system
space.prop.lines	Fuel lines
space.prop.tank	Fuel tank
space.prop.tank.pressure	Pressurization system
space.prop.tank.sensor	Fuel pressure sensor
space.prop.thruster	Thruster
space.structure	Structure
space.thermal	Thermal management system
space.thermal.propheat	Prop tank heater
space.thermal.radiator	Thermal radiator

This example only goes four levels deep. The actual breakdown structure would likely include at least two more levels, to represent, for example, different parts of the flight control software or subcomponents of the radio transceiver.

The example includes an example of a component that could fit in multiple places in the structure: the propellant tank heater. This is part of the thermal management system—its function is to keep the fuel in the propellant tank within a certain temperature range—but it is also part of the propulsion system. In this example the choice was to categorize it as part of the thermal management system.

Part VIII: Team organization

Chapter 25: Team introduction

9 March 2024

Basic collective action problem is related to the equilibrium between contribution and resulting common good
- Depends on fungibility and homogeneity of contribution and good

Collective good is homogeneous: a successful system
- More specifically, good decisions, designs, and implementations that lead to the system as a whole working in a way that meets its purposes, which will in turn provide individual goods like pay, satisfaction, status

The cost is heterogeneous: each person should be doing different things to contribute to the good; this is about sharing out work, with different people doing different tasks
- Modeling contribution as effort (time spent) is not helpful, because if everyone does the same thing with their effort, the result will have zero value

Perception of what one person should do (is incentivized to do, is assigned to do) does not necessarily result in an optimal common good
- Value is related to satisfaction of the system structure constraints
- Value of one contribution can be low if someone else does something not consistent with it; the same contribution can have high value if the other does soemthing consistent

Free rider problem (not contributing full share for the good received) increases with size of group
- Detection of free rider becomes harder
- Perception of value of contribution decreases because it is a smaller share of contribution, and share of common good is lower
- Work (cost) in a team is transferrable, so if one member tries to be a free rider, someone else will have to pick up the work. This creates an incentive for enforcement

If contributions are not homogeneous and shared equitably, is there increased motivation to free ride? Is there greater motivation for the “greater” contributor to dictate?

Common interest of a team: pay, satisfaction in work, product
- Pay and satisfaction are private and excludable; product is neither

Organization interest is the product, which may or may not be an interest of the members
Within the team, common interest can be about ability to work on some part or make certain decisions
Antagonistic interest in that people want more scope or to do less work
A successful system product is a common good (public good); non-exclusivity of the result

Consider, for example, meetings that involve too many people, and accordingly cannot make decisions promptly or carefully. Everyone would like to have the meeting end quickly, but few if any will be willing to let their pet concern be dropped to make this possible. And though all of those participating presumably have an interest in reaching sound decisions, this all too often fails to happen. When the number of participants is large, the typical participant will know that his own efforts will probably not make much difference to the outcome, and that he will be affected by the meeting’s decision in much the same way no matter how much or how little effort he puts into studying the issues. […] The decisions of the meeting are thus public goods to the participants (and perhaps others), and the contribution that each participant will make toward achieving or improving these public goods will become smaller as the meeting becomes larger. It is for these reasons, among others, that organizations so often turn to the small group; committees, subcommittees, and small leadership groups are created, and once created they tend to play a crucial role. [Olson65, p. 53]

that needs careful design just as must as the system product

This is the “machine” that, when run, builds the system product

People form organization whether given or not
Required in order to divide up work and to share things

Know who is working on what, to avoid duplicated work and so direct people to the right person to talk to
Know how to escalate issues when needed
Know when a situation is exceptional and should be escalated, so that delegation can work
Know that escalated issues and side-channel reporting will be dealt with
Know how checks and approvals happen
Ensure that no work or system parts fall through the cracks

Part IX: Project plan

uncertainty
multi horizon planning
resource constrained project

purpose of plan

Part X: Appendixes

Appendix A: From stakeholder need to model purposes

8 January 2024

A.1 Introduction

In Chapter 13, I presented an approach for determining what features and capabilities should be supported in the project in order to do a good job of building a system, and meeting stakeholder needs. In this appendix, I present the detail of that derivation.

Bear in mind that this derivation results in a set of objectives for a project. It does not say how any particular project should meet these objectives; each project must decide those things in ways that meet the specific needs of that project and that system. The objectives can be seen as a set of considerations that each project should examine as they decide how to run the project.

The derivation only addresses matters that are related to the project’s approach to building a system. There are many other factors outside this scope: matters of project management, or of policy in the organization that hosts the project. Where appropriate I have made notes of these matters external to the system-building scope.

A.1.1 Stakeholders

The set of stakeholders is:

The customer for which the system is being built;
The team that builds the system;
The organization(s) of which the team members are part;
Funders who provide the investment to build the system; and
Regulators who oversee the system and its building.

I introduced each of these in Section 13.2.

A.1.2 Model elements

I introduced the model for making systems in Section 6.3. This model is organized around the tasks that need to be performed to build the system, and has the following elements:

Artifacts that are created by performing tasks, and represent the system and records about it;
The team that builds the system by performing tasks and making artifacts;
The tools that people on the team use in doing tasks; and
The plan that organizes what tasks need to be done, in what order, and using what resources.

In addition to these elements, I have include an element for matters external to the system-building project for matters that stakeholders need but that aren’t about building the system itself.

A.1.3 Derivation

The derivation maps stakeholder needs onto objectives for parts of the model.

The result is a set of objectives or capabilities that people should consider when working how how the project should operate.

I discuss each stakeholder in the sections that follow, along with tables of the needs or objectives of each. The objectives that support these stakeholder objectives are annotated with a right-pointing arrow: →.

A.2 Stakeholders

A.2.1 Customer

The customer (see Section 13.2.1) is a stakeholder who wants the system built because they are going to use the system. They may or may not be funding system development directly—if they are, then they are also a funder below.

model:2 Customer

2.1 Fill purpose
The project must deliver a system that meets the customer’s purpose

2.1.1 Know purpose
The project must know what the customer’s purpose for the system is

→ model.artifacts:2.1

→ model.plan:3.2.1

→ model.team:2.1.1.1

2.1.2 Build to purpose
The project must produce a system that meets the customer’s purpose

→ model.artifacts:1.1, 2.1, 4.2, 4.4, 4.5, 5.1, 5.2

→ model.plan:1.2, 2.1, 2.2, 2.3, 3.3, 3.3.2

→ model.team:2.2.1, 2.2.2, 2.5.1

→ model.tools:3.1, 3.2, 4.1

2.1.3 Know requirements
The project must know the customer’s reliability, safety, and security requirements

→ model.artifacts:2.1.2

→ model.plan:3.2.1

→ model.team:2.1.1.1

2.1.4 Meet requirements
The project must produce a system that meets the customer’s reliability, safety, and security requirements

→ model.artifacts:2.1.2, 4.5, 5.1, 5.2

→ model.plan:3.3, 3.3.5

→ model.team:2.2.1, 2.2.2

2.1.5 Free of errors
The project must produce a system that is free of errors

→ model.artifacts:4.5

→ model.plan:3.3.5

→ model.team:2.2.2

2.2 On time and budget
The project must deliver a system by the required deadline and within the needed budget

→ model.plan:1.2.5, 4.1, 4.2

2.2.1 Know budgets
The project must know the budgets and deadlines for delivering the system

→ model.plan:3.2.2, 4.1

2.2.2 Know consumption to date
The project must know the resources and time that has been used to date that count against budgets or deadlines

→ model.plan:4.1.1

2.2.3 Project forward usage
The project must be able to project the resources and time required to complete the system or meet other deadlines

→ model.plan:1.2

2.2.3.1 Uncertainty
The project must be able to estimate the uncertainty in any forward projections of resources or time

→ model.plan:1.2.1

2.2.4 Control execution
The project must be able to control execution to adjust resource and time consumption

→ model.plan:1.2.4

2.3 Certifications
The project must deliver a system that has appropriate certifications or approvals

2.3.1 Know regulations
The project must know the regulations or standards that apply to certification/approval

→ model.artifacts:8.1

→ model.plan:3.2.5

2.3.2 Follow process
The project must follow any processes required to get certification/approval

→ model:2.5.2

→ model.artifacts:8.2, 8.3

→ model.plan:3.3.1.1, 3.3.2.1, 3.3.3.1, 3.3.7

2.4 Release and deployment
The project must be capable of releasing a version of the system and deploying it to a customer

→ model.artifacts:1.1, 6.1

→ model.plan:3.4

→ model.team:2.5.1

→ model.tools:3.5, 4.3

2.5 Evolve system
The project must evolve the system in response to changes in customer or other needs

2.5.1 Receive requests for change
The project must be able to receive and process requests for change from the customer

→ model.plan:5.1, 5.3

→ model.team:2.1.1.2

2.5.2 Receive regulatory changes
The project must be able to receive and process changes in regulatory requirements

→ model.plan:5.2

→ model.team:2.3.1.2

2.5.3 Know purpose of change
The project must know the purpose of the change (and the change in system purpose that results)

→ model.artifacts:2.2

2.5.4 Build to meet change
The project must be able to produce a system that meets the changed purpose while maintaining the system’s other purposes and requirements

→ model.artifacts:1.1, 2.1, 2.2, 4.2.1

→ model.plan:1.2, 2.1, 2.2, 2.3, 3.3.6

→ model.team:2.1.1.2, 2.2.1, 2.2.2, 2.5.1

A.2.1.1 Filling purpose

A customer has some purpose for the system, meaning something they want to achieve by deploying and using the system. This is the problem that the customer wants solved, which is a higher-level concern that the specific features that the system will provide.

A customer may have additional requirements on the system. They likely have a need for a minimum level of reliability. They likely have needs related to safety and security of the system.

The project needs to build a system that can meet this purpose and the requirements.

The project can meet these needs by:

Learning and documenting the customer’s purpose and requirements, resulting in documents that the team can refer to while building the system.
Having processes defined that ensure that the team will follow a deliberate process as they design, build, and verify the system.
Having processes for checking the design and implementation to ensure that it meets the customer’s purpose, and for finding and fixing errors.
Having a team that has enough people with the right skills to do the work, and gives them clear assignments for what steps they are responsible for.
Having tools that help the team analyze and verify the design or implementation.
Keeping track of the work that needs to be done, including having a general plan to guide the work and tracking the work currently in progress.

A.2.1.2 On time and budget

The customer likely has a deadline by which they would like the system delivered. They likely also have a budget for how much they want to invest in acquiring the system. At minimum, customers generally want the result as soon as possible and for as low a price as possible.

To meet these needs, the project should:

Know what the budget and deadlines are for the work.
Have a plan that can be used to estimate resources and time needed to get to deadlines or finish the project. This plan should be visible to everyone on the project, so that they can understand how what they are doing fits into the overall work flow.
Account for degree of uncertainty in these estimates—which will be inaccurate early in the project, and become more accurate as time goes by.
Track how much resource has been used and the progress on tasks.
Regularly update the plan based on progress made and information learned.
Have processes in place to detect when there may be a budget or deadline problem, and to figure out how to resolve the problem.

A.2.1.3 Certifications

In many industries, some kind of certification or approval is necessary to operate the system. An aircraft, for example, needs a type certification from the local aviation authority as well as approval for a specific instance of the aircraft. Even if there is no overt certification required, there are often regulatory standards to be met.

The project must build the system in compliance with regulations. When certification is needed, the project must follow the process to get that certification.

To achieve this, the project should:

Know what regulations and certification process applies to their work, and make that information available to everyone on the project.
Have people who interface with regulators, to learn when regulations change, to ask for clarification or guidance when needed, and to work with the regulators during certification.
Account for regulatory requirements in the specifications for the system, equal with customer objectives and requirements.
Maintain design, analysis, and process records that can be checked to verify that the project is working to meet regulations.
Have a process for regularly checking that the project building a system that meets regulations, including verifying the design or implementation.
Have a process for working with the regulators to get certification, including having a process for resolving issues that the regulators identify that stand in the way of certification.

A.2.1.4 Release and deployment

The customer needs the system actually to be delivered and put into operation. The project must deliver the system, and provide or support its deployment.

To do this:

Maintain the tools used to maintain products for release, including a protected repository for software releases and stockrooms for hardware inventory.
Have a process defined for releasing and deploying a system.
Have people on the team who are trained and responsible for release and deployment.

A.2.1.5 Evolve system

The the system is successful, the customer often finds that it can be made even better with some changes. Or the customer’s needs may change, and they will want the system to adapt to meet their changed needs. The project should be able to maintain and evolve the system to support the customer’s changing needs.

A system may also need to change when regulations change.

The project can support an evolving system by:

Having people, tools, and a process for receiving requests for changes.
Having a process to decide (and prioritize) which requests to act on, and which to defer or reject.
Tracking which changes are being reviewed, which are being worked on, and which have been completed.
Maintaining rationales of decisions made during design and implementation, so that people working on a change can accurately understand how the system can be modified without causing errors.
Having processes and tools to maintain all the design, implementation, and verification artifacts as they are changed—and keeping changed artifacts consistent with each other and separate from versions without a particular change.

A.2.2 Team

The team (see Section 13.2.2) is the collection of stakeholders who build the system. These people need the things that skilled, technical workers generally need: satisfaction, security, confidence, compensation.

Meeting these needs is mostly outside the scope of systems-building itself. These needs are largely met by the project and organization management who create the environment in which the team works. Still, there are aspects of systems-building that can help (or hinder) meeting the team’s needs.

The analysis of a team’s needs presented here is somewhat idealistic. It focuses on skilled workers who are not readily interchangeable, whose value to a project derives in part from the knowledge they carry about the system being built. It assumes workers motivated largely by work satisfaction and have essential material needs met by their compensation. These assumptions lead to a particular balance of power between the team and the organizations that employ them. This would not apply to a team of interchangeable workers or workers whose material needs are not well met by their employment.

model:3 Team

3.1 Satisfaction in the work
The team must have work that challenges them and results in satisfaction in what they produce

3.1.1 Positive outcome of work
The team must have confidence that their work will have a positive outcome

→ model.external:1.1

→ model.plan:1.1, 1.2

3.1.2 Challenging work
The team must find that the project’s work challenges them and makes use of their skills while remaining achievable

→ model.external:1.1, 1.2

3.1.3 Avoid irrelevant work
The team must believe that they are not being asked to do irrelevant work as part of the project

→ model.artifacts:1.3, 1.3.1

→ model.external:1.2

→ model.plan:1.2.5

→ model.tools:1.1

3.2 Appropriate staffing
The team must be staffed with the right people to do the work

3.2.1 Sufficient staffing
The project must have sufficient staff, with the right skills, to build the system

→ model.external:1.3

→ model.plan:1.2.3, 6.1, 6.3

3.2.2 Not overstaffed
The project must not be overstaffed in a way that leaves some unable to make meaningful contributions

→ model.external:1.3

→ model.plan:1.2.3, 6.1, 6.4

3.3 Sufficient supporting resources
The project must provide the team with sufficient resources to do the work

→ model.tools:3.2, 3.3, 4.1, 4.2, 5.1

3.4 Secure position
The people in the team must feel secure in their position in the team

3.4.1 Understanding of fit
The team members must understand how they fit into the organization

→ model.external:1.5

→ model.team:1.1

3.4.2 Clear expectation
The team members must have a clear and correct understanding of their responsibilities in the project

→ model.plan:1.2.7, 2.4, 3.2.3, 6.2, 6.3

→ model.team:1.2

3.4.3 Fair evaluation
The team members must have an expectation that their work will be fairly evaluated

→ model.external:1.7

→ model.team:1.2.1

3.4.4 Clear lines of authority
The team members must have a clear understanding of the authority of others in the project

→ model.artifacts:3.1

→ model.plan:3.2.3, 6.2, 6.3

→ model.team:1.1.1, 1.1.2

3.4.5 Ability to raise issues
The team members must have the ability to raise issues about the team and about the system, without retribution

→ model.external:1.4

→ model.plan:2.4, 3.3.5

→ model.team:4.1

3.5 Fair compensation
The team must be fairly compensated for their time and effort

→ model.external:1.6

3.6 Belief in project
The team must be able to believe in the project, its purpose, and its leadership

3.6.1 Belief in objective
The team must have confidence that the organization is accurately working with the customer

→ model.plan:1.1

→ model.team:2.1.2

3.6.2 Ethics
The team members must believe that the system will be used in ways that accord with their ethical beliefs

→ model.artifacts:2.1.3

→ model.external:1.8

A.2.2.1 Satisfaction in the work

Team members are expected to need satisfaction arising from the work they are doing on the project.

The satisfaction comes in part from believing that the work they are doing will have some positive outcome. That outcome might be that they see the system deployed and having a positive effect on the world. It might be that they see their work acknowledged, publicly or privately, even if the system ultimately is not deployed. It could come from social standing among their peers improving because of their association with the work.

Skilled workers also want work that makes use of their skills—which leads to a sense that they, as a specific individual, are making a contribution to the work. Work that challenges them or from which they learn things contributes to that satisfaction.

Doing work that is seen as not relevant or not likely to have value decreases their satisfaction. If asked to do something that is not achievable, they will lose enthusiasm. If they are asked to do work that they perceive as irrelevant, they will feel a lack of their individual value.

To support team satisfaction, the project can:

Have a clear explanation for why the organization decided to go ahead and run the project to build the system.
Have a plan that gives team members context for how their work fits into the overall flow of work.
Use tools that automate as much repetitive or low-skill work as possible.

Other aspects are outside of the project’s scope.

The organization and project management need to keep the team informed on how the project is proceeding. This helps the team have confidence that the project will successfully finish the system, and it gives the team a way to get involved if there are problems in execution.
The project management needs to assign work to the people with the right skills to do the work.

A.2.2.2 Appropriate staffing

A team needs to have enough of the right people to do the work—but not too many people. Having enough people on the team who can do the work contributes to a team member’s sense that the project has a good chance of having a positive outcome.

Having too few people, or too many people who lack necessary skills, leads to an overworked and burnt out team.

Having too many people leads to team members who don’t have useful work to do. It can lead to people making up new work just to feel like they are contributing.

The “right” staffing level is dynamic. It changes over time as the project moves forward: a particular skill in designing electronics boards may be important for one period in the project, but once the necessary hardware has been designed and built, the need decreases. It changes over time as people change. As a team members learns new things, they may find that they should move on to a different project. Life events occur that change a person’s motivations and needs. The key is not to always have the perfect cohort working on the project, but to have a pretty good group and work to address changes as they happen. If the team has trust in their management that the management is able to address team composition, people will generally stay satisfied.

Ensuring appropriate staffing involves:

Having a reasonably accurate idea of the staffing needs, which comes from having an overall system design and a plan for building the system.
Ability to hire staff, or transfer them in from other projects. This includes having the ability to bring new staff members up to speed on the project.
Having an accurate record of who is on the team, and what role each person plays, in order to identify where there are gaps in staffing.

Much of this is outside the scope of the project itself. The organization holds the funds used to pay staff. It also provides the ability to hire and fire people.

A.2.2.3 Sufficient supporting resources

As with staffing, the team needs resources to do their work: a place to work and the tools to do the work, for example. They may need consumable resources as well. For example, a team might need a ready supply of liquid nitrogen in order to test a hardware component that is supposed to operate at low temperature.

If the team lacks these resources, they can’t do their work. This affects their satisfaction.

The project needs to have:

Facilities where the team can work. This includes both office space for desk work and lab space for building and testing.
Ability to make hardware components, including the machining used, the stock used for components, and ability to store and track the inventory of made parts.
Software build and test tools.

A.2.2.4 Secure position

Team members need to have a sense of security in their position. This means that they need to believe that they understand their position in the project and organization, believe that they will be treated fairly, and believe that issues they raise will be addressed. The opposite of this is when they have a sense of insecurity—because they do not understand what is expected or how they are evaluated, or because they believe that problems will not be resolved, even if they raise an issue.

The sense of security allows people to put their effort into their work, rather than spending their time and energy on worry. The sense also helps keep people on the team so that their knowledge of the system continues to benefit the project.

This comes from the project:

Having clear records of what role each person is expected to fill, and the ability for each person to find the people who have some role or responsibility.
Having clear reporting and decision authority, documented for all the team to know.
Having a process for raising issues about both technical and team problems, along with a process for tracking the issues as they are resolved.

The organization also needs to:

Document how the project fits into the organization’s overall structure.
Document (and follow) a policy for how each person will be evaluated.
Respond to raised issues promptly and without retribution.

A.2.2.5 Fair compensation

A technical worker needs to believe they are being fairly compensated for their time and effort. They need to be compensated well enough that they are not distracted by want. That compensation may be monetary, but it may take other forms as well.

Setting compensation policy is usually a responsibility of the organization, not the project.

A.2.2.6 Belief in project

Skilled workers often have choices about what project they work on. Many of them are motivated by a belief in the work they do: that it will help its users, or that it will result in some good for the world. If they come to believe that either or both is not true, they will be demotivated.

The project should:

Document and maintain the reasons why the organization and the team are building the system.
Maintain documentation of what the customer will use the system for.
Communicate regularly with all the team about how interactions with the customer are going.

The organization should also maintain an ethics policy that details:

What the organization considers ethical behavior by team members;
What the organization commits to as ethical behavior in the world; and
Procedures for raising and resolving ethical questions—including external reporting when appropriate.

A.2.3 Organization

The people in the team work for the organization, which provides a home for the project (see Section 13.2.3.) The organization is responsible for finding funding for the project and providing a legal entity for doing the work. The organization provides the business operations that make the project possible.

There is no one kind of “organization” that fits all situations. The organization might be anything from a single person, to a company, to a consortium of organizations, depending on the project. The organization might exist to return profit in exchange for the work, or it might be a non-profit or a governmental organization that looks for non-monetary benefits from the project. Some organizations exist only to build and deliver one system; others expect to deliver the system to many customers and to build more systems in the future.

Many of an organization’s needs are not to be met by the system-building project itself; they are met by how well the organization’s management and business operations. Nonetheless, how the system is built can help or hinder business management or operations.

The diversity of kinds of organizations means that the list of needs below has to be tailored for each project and each organization.

model:4 Organization

4.1 Ability to deliver
The organization must have the ability to deliver the working system to the customer

4.1.1 Ability to communicate with customer
The organization must be able to communicate with the customer

→ model.team:2.1.1

4.1.1.1 Conflict resolution
The organization must be able to negotiate and resolve conflicts between the team and the customer

→ model.team:2.1.1.3

4.1.2 Support for the team
The organization must have the infrastructure to support the team

4.1.2.1 Leadership
The organization must have leadership that can run the organization in a way that enables the team

→ model.external:1.4, 1.5, 1.7, 1.8

4.1.2.2 Infrastructure
The organization must have the ability to staff and finance the team

→ model.external:1.3, 3.1

4.1.2.3 Resources
The organization must have resources to hire the team and for them to operate

→ model.external:1.3

4.1.2.4 Workplace regulation
The organization must provide a workplace that meets regulation

→ model.external:1.9

4.2 Ability to sell
The organization must have the ability sell the system produced (when appropriate)

→ model.external:2.1

4.2.1 Articulate value
The organization must be able to articulate the value of the system product being sold

→ model.artifacts:2.1

4.2.2 Market
There must be a market for the system being sold

→ model.artifacts:2.1

→ model.external:2.2

→ model.plan:3.2.1

4.2.3 Sales and marketing team
The organization must have a sales or marketing capability, with the resources to do its job

→ model.external:2.3

4.3 Profit
The organization must get enough profit from the project to fund overhead and to support future projects

→ model.external:3.2

→ model.plan:1.2.5, 1.2.6, 3.2.6

4.4 Positioning for future work
The organization must be positioned for future projects and/or maintenance of this system

4.4.1 Reputation
The organization must have a reputation for being able to build systems well

→ model:2.1, 2.2, 2.5

4.4.2 Reusable capability
The organization must have capabilities in processes, teams, and tools that will apply to future projects

→ model.external:4.1

4.4.3 Ongoing improvement
The organization must be able to learn and improve its capabilities over time

→ model.external:4.2

A.2.3.1 Ability to deliver

The purpose for the organization pursuing a system-building project is to deliver a system to the customer. If the project does not deliver something, the organization will see little return on its investment and effort.

Of course, an organization might get a contract from a customer and get started, only for the customer to cancel the contract. (Hopefully the organization has taken this into account in its planning.) The organization still needs to have been able to deliver the system, even if the work was stopped.

The ability to deliver has two aspects: communication with the customer and support for the team, in addition to the general ability to build a system for the customer.

When the system being built has a specific customer, the organization needs to be able to talk to them, keep them updated on progress, and hear concerns or issues from the customer. When there is disagreement, the organization needs people who can negotiate and resolve issues.

The project can help this by maintaining the interface with the customer, including having people assigned to work with the customer, documenting what they learn from the customer, and negotiating with the customer as issues arise.

The project team can do little without the organization supporting them. The team needs leadership; it needs workspace and other infrastructure; it needs human resources and payroll and accounting support. The organization needs to:

Provide the resources (funding, space) that the team needs to do its work.
Provide supporting capabilities, such as hiring, problem resolution, and marketing.
Maintain a safe and regulation-compliant workplace.

A.2.3.2 Ability to sell

If the system is expected to be delivered to multiple customers over time, the organization needs to be able to find those customers, make the case to them that the system will benefit them, and work out a deal to deliver the system.

I have written this need in terms of selling, but the needs apply when something is being delivered not for monetary return. An open-source project that is freely available to users does not sell the system for money, but the way that project has value is for users to pick up, deploy, and use the system. The project may want to attract developers to build up an ecosystem of related products or services. Meeting these needs involves making potential users aware of the system and making the case that they will benefit from the system.

To be able to sell the system, the organization needs to:

Be able to say why the system is worth acquiring and using. The system-building project can help this by keeping records of the intended value proposition for the system.
Ensure that there is a market for the system. The project can assist by having that documentation of what the system is for, and by working with the organization’s marketing team while designing the system.
Have a team that can do sales and marketing.

A.2.3.3 Profit

The organization will be expecting to get some kind of return on its effort. That may be a monetary return, but a non-profit or government agency may look for a non-monetary return, such as a community benefit.

The project can support this in two ways. First, the organization can set business objectives for the project, such as expected profit. The project can keep records of these objectives, and take them into account in the system’s design. Second, the project can organize its work as efficiently as possible so that investment goes as far as possible (consistent with deadlines). The project’s management can monitor the time and money being spent and work out how to adjust the project if it looks likely that the project will not meet the organization’s expectations on return or profit.

A.2.3.4 Positioning for future work

Many organizations build multiple systems over their existence—whether this is building multiple bespoke systems for customers, or building multiple products that are delivered to many customers. The ability to continue to deliver profitable systems is a major part of a company’s stock performance: the stock price is determined by the market expectation of future returns to the investors.

An organization’s reputation affects its ability to attract customers and investment, as well as its ability to hire talented staff. The reputation depends in part on its ability to deliver good systems.

An organization can become more productive over time—and thus improve its reputation, its ability to deliver, and its profitability. This comes from learning and improving. If the organization builds up a staff that knows how to run a system-building project well, future projects can be executed more efficiently. Better tools will help the next projects. However, learning and improvement does not often happen by chance; it happens when an organization sets out to learn from its performance.

The project can:

Add to the organization’s reputation for building a system well, on time, and on budget.
Maintain a transparent and good working relationship with the customer, so that a happy customer will refer others for future projects.
Meeting its team’s needs, so that people on the team spread the word that this is a good organization to work for.

The organization should:

Take steps to build up reusable capability from one project to another.
Have the the commitment and processes to learn and improve over time.

A.2.4 Funders

The funders provide the investment that funds the team building the system (see Section 13.2.4.) The funder provides these resources in the expectation of some kind of return, be that monetary or not. A venture capital funder is most likely to look for a monetary return from future profits from the organization it is funding. A company providing internal funding more likely is looking the project to add to the company’s capabilities, which will in turn enable the company to increase its future profits. A government agency is likely looking for something that will benefit the public in some way.

As noted earlier, there are many different kinds of funders, from venture capital to company internal funding to customers paying for development.

model:5 Funders

5.1 Return on investment
The funder must get at least the expected return on its investment

→ model:4.2, 4.3

5.1.1 Visibility
The funder must have sufficient visibility into the organization’s behavior and progress to determine when the project is at risk of not providing a return on investment

→ model.external:4.3

→ model.plan:1.2, 4.1, 4.2

5.1.2 Influence
The funder must have influence on the organization or project in order to address performance that will jeopardize return on investment

→ model.external:4.3.1

5.2 Ability to attract future investment
The project must help the funder attract investment for future projects

→ model:5.1

A.2.4.1 Return on investment

Funders provide capital to run the project on the expectation that they will get some return on that investment.

In some cases, the return will come from profit realized in building the system (Section A.2.3.3) or from an increase in the value of the organization (Section A.2.3.4). In other cases the return will come from the value of the system after it is delivered and deployed (Section A.2.1.1, Section A.2.3.1).

The funders will also expect to be able to track the organization’s and project’s progress, and to raise issues when they find that there may be a problem that could jeopardize the funder’s return. The organization needs to have people whose responsibility includes interfacing with the funders.

The project can support the interface with funders by maintaining a realistic plan for the work, managing its budget, and keeping the organization informed of progress. The project should also have the processes in place to respond when the funders raise an issue that leads to a potential change to the system.

The project may also need to maintain accurate records and artifacts that allow the funder to audit the project—verifying that the information the funder has received is accurate and complete.

A.2.4.2 Ability to attract future investment

The funders get the capital they invest from somewhere. In many cases, the investment capital comes from their customers: individual and institutional investors for venture capital, legislatures and the public for government investors. The funders will keep their investor customers satisfied if they can show that their investments produce the expected returns, leading to a reputation for using capital wisely. At the same time, funders want to avoid bad press from projects that have problems, which can reflect on the funders’ ability to select organizations or projects.

The ways that the project can address this funder need are all included in the previous section, on the funder’s need for return on investment.

A.2.5 Regulators

Regulators (Section 13.2.5) provide an independent check on work to ensure that it meets regulations or standards, thus ensuring that some public good is maintained that the organization or project might not otherwise be incentivized to meet.

The interaction between the project and regulators depends on the countries involved and the nature of the project. Some industries require licensing or certification of some kinds of system: most aircraft, for example, must obtain type certification from the local civil aviation authority before that aircraft is allowed to fly or be sold. Spacecraft require a set of licenses for launch and communication. Other industries, such as consumer electronics or automobiles in the United States, depend on compliance with regulation but compliance is only checked after the fact.

I include voluntary standards as part of regulation. Non-governmental organizations set interoperability standards; the standards for USB (set by the USB Implementer’s Forum) and Wifi (set by the Institute of Electrical and Electronics Engineers 802.11 working group) are examples. Other organizations set safety standards that help to ensure consumer products are checked to be safe.

The regulators perform multiple tasks:

Setting regulations or standards.
Communicating those regulations to those who might be affected.
Performing compliance checks for certification, when appropriate.
Performing compliance checks when violation is suspected.
Working with the organizations that are being checked to identify and resolve issues.
Overseeing remedies, when appropriate.

model:6 Regulators

6.1 Compliance and certification
The regulator must be able to work with the project to ensure regulatory compliance and (when appropriate) certify the system

6.1.1 Available regulation
The regulator must make information about regulations available to the organization and possibly user

→ model.artifacts:8.1

→ model.plan:3.2.5

→ model.team:2.3.1.1, 2.3.1.2

6.1.2 Application
The project must apply to the regulator for certification and then follow the certification process

→ model.artifacts:8.2

→ model.plan:3.3.7

→ model.team:2.3.1.5

6.1.3 System auditability
The regulator must be able to audit that the system complies with regulations

→ model.artifacts:4.2, 4.4, 4.5, 8.4

→ model.plan:3.3.3.1

→ model.team:2.3.1.3

6.1.4 Process auditability
The regulator must be able to audit that the organization has followed required processes in building the system

→ model.artifacts:8.3

→ model.team:2.3.1.3

6.2 Monitoring
The regulator must be able to monitor the organization, project, and/or system for compliance with regulation

→ model.team:2.3.1.3

6.2.1 Accurate information available
The organization and/or user must make available to the regulator accurate and complete information about the system and organization behavior

→ model.artifacts:4.2, 4.4, 4.5, 8.3, 8.4

6.2.2 Notify regulator
The organization or user must proactively provide information to the regulator when a potential regulatory problem is detected, as required by regulation

→ model.plan:3.3.7.1

→ model.team:2.3.1.4

6.3 Problem resolution
The regulator must be able to work with the project and/or user to identify and resolve potential regulatory problems

6.3.1 Communicate with organization or user
The regulator must be able to communicate with the organization or user about potential regulatory problems

→ model.plan:7.1

→ model.team:2.3.1.3

6.3.2 Accurate information
The regulator must obtain cooperation and accurate information from the organization or user to investigate a potential regulatory problem

→ model.artifacts:4.2, 4.4, 4.5, 8.3, 8.4

6.3.3 Respond to remedy
The organization or user must be able to respond to a regulator’s remedy

→ model.plan:7.1

→ model.team:2.3.1.5

A.2.5.1 Compliance and certification

The regulator makes regulations (or standards), and makes them available to teams building affected systems.

The project responds to the regulations by designing and building the system so that it meets the regulations, maintaining records needed to show that the regulations have been met, and beginning a process for getting certifications or licenses when appropriate.

The project is responsible for:

Finding and documenting all the regulations that apply to the system, and communicating those to the team. The project should maintain artifacts documenting these regulations, and needs people assigned to tracking down this information. Frequently the team will need to ask the regulator for explanation or clarification as well.
Tracking regulations for change. When regulations change, the team needs people who will find the changes and oversee any changes to the system needed to comply with the new regulations.
Designing and building the system to meet the regulations, including documentation of how compliance has been verified. The team must include regulation as it formulates requirements and designs. The team should have a process that ensures that designs and implementation are checked against the regulatory requirements, and maintains records of the checks.
Maintaining records of the process that has been followed, so that the regulator can audit the process that has been followed (separate from the content of the system).
Applying for certification. This involves knowing how to apply, developing a working relationship with the regulator to perform the application, delivering information to the regulator as needed, and responding to issues raised by the regulator.

A.2.5.2 Monitoring

In some cases, the regulator must monitor the project’s work—for example, during aircraft certification, which is generally a joint effort between the aviation authority and the company building the aircraft. A regulator might also need to monitor the project after a violation has been found and the team is working on remedial action.

Accurate and timely information is paramount when this occurs. The project must maintain good records to be able to provide that information to the regulators. The information potentially covers everything about the project: the design and analyses of the system, its implementation, records of design rationales, and logs of the processes followed.

The team must also be prepared to notify the regulator proactively as situations arise. The team should have people who will watch for situations and communicate with the regulator.

A.2.5.3 Problem resolution

I have never observed a licensing or certification process to go with no problems. The processes and regulations are often complex, and unless a team has done the process several times before there will almost certainly be things the team needs to learn to get through the process.

This means that there will be problems to resolve. Sometimes the team will discover the problem and need guidance from the regulator. Other times the regulator will raise the issue.

The team can make smooth the problem resolution process by:

Documenting the process it is following, including notes on how the team understands the process and what it learns along the way (to help the next time).
Maintaining good communications with the regulators, in order to be able to ask them questions or receive notices from them. This implies having people on the team who have the ongoing responsibility to maintain the interface with the regulators.
Maintaining accurate information, as noted above.
Having and following a process to handle regulatory issues when they are found or reported, including working with the regulator to find a resolution and then implementing and verifying the resolution.

A.3 Model elements

All of the objectives in the previous section map to objectives related to artifacts, team, tools, and plan. Some of them also map to things other than the system-building that goes on in the project.

This section lays out tables of the objectives for each element of the model. Each objective is annotated with its parents; that is, the objectives that are the reason that this objective is included. These are annotated in the tables with an arrow pointing down and right: ↘. If one of the objectives has children, those are annotated with a right-pointing arrow: →.

A.3.1 Artifacts

model.artifacts:1 Artifact management

1.1 Store artifacts
The project must have a place to store artifacts

↘ model:2.1.2, 2.4, 2.5.4

↘ model.artifacts:2.1, 2.2, 3.1, 4.2, 4.4, 4.5, 5.1, 5.2, 6.1, 6.2, 7.1, 7.2, 7.3, 7.4, 8.1, 8.2, 8.3, 8.4, 9.1

→ model.tools:2.1, 3.2.1, 3.4

1.1.1 Consistent versioning
The artifact storage must be able to maintain versions of all artifacts that are consistent with each other

1.2 Finding artifacts
Team members must be able to find artifacts that they are looking for

↘ model.artifacts:2.1.1, 3.1.3, 4.2, 4.4, 5.2, 6.1, 8.1

1.2.1 Discovery
Team members must be able to learn about artifacts they need to know of that they didn’t previously know existed

1.3 Status
Anyone looking at an artifact must be able to determine the status of that artifact (work in progress, proposed, approved, complete, and its version)

↘ model:3.1.3

↘ model.artifacts:2.1, 2.2, 3.1, 4.2, 4.4, 4.5, 5.1, 5.2, 6.1, 6.2, 7.1, 7.2, 7.3, 7.4, 8.1, 8.2, 8.3, 8.4, 9.1

1.3.1 Support workflow

model.artifacts:2 Purpose-related

2.1 System purpose
The artifacts must include documentation of the customer’s purpose for the system

↘ model:2.1.1, 2.1.2, 2.5.4, 4.2.1, 4.2.2

↘ model.plan:3.3.1

↘ model.team:2.1.1.1

→ model.artifacts:1.1, 1.3

2.1.1 Discoverable purpose
The artifacts that document the system’s purpose must be readily and accurately discoverable by members of the project team

→ model.artifacts:1.2

2.1.2 System requirements
The artifacts must include documentation of the customer’s reliability, safety, and security requirements for the system

↘ model:2.1.3, 2.1.4

2.1.3 System usage
The documentation of the customer’s purpose must include accurate information about what the system will be used for

↘ model:3.6.2

2.2 Changes in purpose
The artifacts must include records of requests made for changes to the system’s purpose, including the status of that request and any artifacts resulting from an approved change

↘ model:2.5.3, 2.5.4

↘ model.plan:5.1, 5.2, 5.3

↘ model.team:2.1.1.2, 2.3.1.2

→ model.artifacts:1.1, 1.3

2.3 Reasons for building system
The artifacts must include documentation of why the team has chosen to build the system

↘ model.plan:1.1

model.artifacts:3 Team-related

3.1 Team structure
The artifacts must include documentation of the structure of the team

↘ model:3.4.4

↘ model.plan:3.2.3, 6.3, 6.4

↘ model.team:1.2

→ model.artifacts:1.1, 1.3

3.1.1 Team membership
The documentation of team structure must include accurate records of who is on the team

3.1.2 Roles and authority
The documentation of team structure must include accurate records of the roles and authority that each team member has

3.1.3 Navigability
Members of the team must be able to conveniently and accurately navigate the records of team structure

→ model.artifacts:1.2

model.artifacts:4 System-related

4.1 Technical uncertainty
The artifacts must include records about the uncertainties or risks identified for the system being built

↘ model.plan:8.2

4.2 Specification and design artifacts
The artifacts must include accurate records of the specification and design of the system components and structure

↘ model:2.1.2, 6.1.3, 6.2.1, 6.3.2

→ model.artifacts:1.1, 1.2, 1.3

4.2.1 Rationales
The design-related artifacts must include rationales for the design choices made

↘ model:2.5.4

4.4 Implementation artifacts
The artifacts must include the implementation of the system

↘ model:2.1.2, 6.1.3, 6.2.1, 6.3.2

→ model.artifacts:1.1, 1.2, 1.3

4.5 Analysis artifacts
The artifacts must include accurate analyses of the system component and structure design or implementation

↘ model:2.1.2, 2.1.4, 2.1.5, 6.1.3, 6.2.1, 6.3.2

→ model.artifacts:1.1, 1.3

model.artifacts:5 Verification-related

5.1 Verification tests
The artifacts must include implementations of tests used for verifying the system implementation

↘ model:2.1.2, 2.1.4

↘ model.artifacts:5.2

→ model.artifacts:1.1, 1.3

5.2 Verification results
The artifacts must include accurate results of performing verification tests, reviews, and analyses

↘ model:2.1.2, 2.1.4

→ model.artifacts:1.1, 1.2, 1.3, 5.1

model.artifacts:6 Release/deployment-related

6.1 Release/deployment procedures
The artifacts must include the procedures to be used to release or deploy the system

↘ model:2.4

→ model.artifacts:1.1, 1.2, 1.3

6.2 Release/deployment records
The artifacts must include records of each release and deployment made of the system

↘ model.plan:3.4

→ model.artifacts:1.1, 1.3

model.artifacts:7 Management-related

7.1 Budget records
The artifacts must include records tracking resource budgets

↘ model.plan:4.1

→ model.artifacts:1.1, 1.3

7.2 Roadmap and plan
The artifacts must include the plan for completing the system

↘ model.plan:1.2

→ model.artifacts:1.1, 1.3

7.3 Tasking
The artifacts must include records about the tasks currently being performed, or that are scheduled to be performed in the near future

↘ model.plan:2.1, 2.2

→ model.artifacts:1.1, 1.3

7.4 Project uncertainty
The artifacts must include records about the uncertainties or risks identified for project execution

↘ model.plan:8.1

→ model.artifacts:1.1, 1.3

model.artifacts:8 Regulatory-related

8.1 Regulations
The artifacts must include all the relevant regulations that the system must meet (or references to them)

↘ model:2.3.1, 6.1.1

↘ model.plan:5.2

↘ model.team:2.3.1.2

→ model.artifacts:1.1, 1.2, 1.3

8.2 Certification process
The artifacts must include information on the certification process

↘ model:2.3.2, 6.1.2

→ model.artifacts:1.1, 1.3

8.2.1 Application
The artifacts must include records of applications made for certification

8.2.2 Certifications
The artifacts must include records of certifications that have been granted or denied for the system

↘ model.plan:3.3.7

8.3 Regulatory process records
The artifacts must include records of the process being followed to meet regulation or obtain certification

↘ model:2.3.2, 6.1.4, 6.2.1, 6.3.2

→ model.artifacts:1.1, 1.3

8.4 Regulatory verification records
The artifacts must include records showing that the system has been verified against regulatory requirements

↘ model:6.1.3, 6.2.1, 6.3.2

↘ model.plan:3.3.3.1

→ model.artifacts:1.1, 1.3

model.artifacts:9 Audit

9.1 Approvals
The artifacts must include records of reviews and approvals of designs and implementations

→ model.artifacts:1.1, 1.3, 1.3.1

A.3.2 Team

model.team:1 Organization

1.1 General structure
Each team member must be able to find and understand the structure of the team

↘ model:3.4.1

1.1.1 Finding team members
Each team member must be able to find essential information about other team members

↘ model:3.4.4

1.1.2 Reporting
Each team member must be able to accurately determine the reporting structure of the team

↘ model:3.4.4

1.2 Roles and responsibilities
Each team member must be able to accurately find and understand their roles and responsibilities

↘ model:3.4.2

→ model.artifacts:3.1

1.2.1 Clear responsibilities
Each team member must be able to accurately find and determine the responsibilities on which their performance will be evaluated

↘ model:3.4.3

model.team:2 Capabilities

2.1 Customer-related

2.1.1 Customer interface
The team must have people whose responsibility is to work with the customer

↘ model:4.1.1

2.1.1.1 Learn and communicate the customer’s purpose
The team must have people whose responsibility is to work with the customer to identify the customer’s purpose and requirements for the system and to communicate that to the rest of the team

↘ model:2.1.1, 2.1.3

↘ model.plan:3.2.1

→ model.artifacts:2.1

2.1.1.2 Learn and process change requests
The team must have people whose responsibility is to receive requests for changes, document those requests, and drive the process to decide on and resolve the requests

↘ model:2.5.1, 2.5.4

→ model.artifacts:2.2

2.1.1.3 Ability to negotiate
The people on the team who interface with the customer must be able to raise issues to the customer and negotiate resolutions of issues or conflicts

↘ model:4.1.1.1

2.1.2 Internal communication
The team must have people whose responsibility includes regularly informing the rest of the team about the status of working with the customer

↘ model:3.6.1

2.2 Technical capabilities

2.2.1 Ability to build system
The team must have the skills needed to build a system that meets the customer’s purpose

↘ model:2.1.2, 2.1.4, 2.5.4

2.2.2 Ability to verify system
The team must have the skills needed to verify that the designed or implemented system meets the customer’s purpose, regulatory requirements, and other constraints

↘ model:2.1.2, 2.1.4, 2.1.5, 2.5.4

2.2.3 Track technical uncertainty
The team must have people whose responsibility includes identifying risk or uncertainty related to the system being built, documenting those uncertainties, and ensuring that the uncertainties are resolved

↘ model.plan:8.2

2.2.4 Ability to release/deploy system
The team must have the skills needed to release or deploy the system

↘ model.plan:3.4

2.3 Regulator interface

2.3.1 Regulator interface
The team must have people whose responsibility is to work with regulator(s)

2.3.1.1 Identify regulation
The team must have people whose responsibility includes identifying relevant regulation or certification requirements and documenting them

↘ model:6.1.1

↘ model.plan:3.2.5

2.3.1.2 Detect and handle regulatory changes
The team must have people whose responsibility includes detecting that relevant regulations have changed, documenting those changes, and driving changes to plans to address the changes

↘ model:2.5.2, 6.1.1

→ model.artifacts:2.2, 8.1

2.3.1.3 Handle regulator requests
The team must have people whose responsibility includes receiving and responding to requests for information from the regulator

↘ model:6.1.3, 6.1.4, 6.2, 6.3.1

2.3.1.4 Handle regulator notification
The team must have people whose responsibility includes notifying the regulator at identified events that affect certification or regulatory compliance

↘ model:6.2.2

2.3.1.5 Perform certification/approval process
The team must have people whose responsibility includes working with the regulator(s) to obtain certification or approval

↘ model:6.1.2, 6.3.3

↘ model.plan:3.3.7

2.3.2 Process oversight
The team must have people whose responsibility is to ensure that the project is following processes that will result in a system that meets regulations and/or obtain certification

↘ model.plan:3.3.1.1, 3.3.2.1, 3.3.3.1

2.4 Management capabilities

2.4.1 Track plan
The team must have people whose responsibility includes building and maintaining the project plan

↘ model.plan:1.2, 5.4.1

2.4.2 Detect and respond to problems
The team must have people whose responsibility includes detecting when the project may not meet deadlines and oversee the response

↘ model.plan:1.2.4

2.4.3 Prioritize and assign tasks
The team must have people whose responsibility includes determining which tasks should be performed next according to some prioritization, and assigning those tasks to team members

↘ model.plan:2.1, 2.2

2.4.4 Maintain team information
The team must have people whose responsibility includes maintaining information about the team

↘ model.plan:6.2, 6.3, 6.4

2.4.5 Track staffing levels
The team must have people whose responsibility is to detect when staffing levels need to change, and then ensure that needed changes are made

↘ model.plan:6.1

2.4.6 Track project uncertainty
The team must have people whose responsibility includes identifying risk or uncertainty related to project execution, documenting those uncertainties, and ensuring that the uncertainties are resolved

↘ model.plan:8.1

2.5 Support capabilities

2.5.1 Maintain tools
The team must have people whose responsibility includes maintaining the tools used for building the system

↘ model:2.1.2, 2.4, 2.5.4

model.team:4 Exceptions

4.1 Raise technical issues
Each team member must know how to raise technical issues when they find them

↘ model:3.4.5

model.team:5 Other

5.1 Tracking
The team must be able to track time spent building the system

↘ model.plan:4.1.1

A.3.3 Tools

model.tools:1 General

1.1 Automate simple tasks
Where possible, the project should use tools to automate simple or repetitive tasks

↘ model:3.1.3

model.tools:2 Artifact management

2.1 Digital artifact storage
The tools must provide for storing and managing digital artifacts

↘ model.artifacts:1.1

model.tools:3 Hardware support

3.1 Design tools
The project must have the design tools needed to design hardware components

↘ model:2.1.2

3.2 Manufacturing
The project must have the tools needed to manufacture hardware parts

↘ model:2.1.2, 3.3

→ model.tools:3.4

3.2.1 Stock storage
The tools must provide for maintaining stock materials used to build hardware components

↘ model.artifacts:1.1

3.3 Hardware test
The project must have tools to perform verification tests on hardware components

↘ model:3.3

↘ model.plan:3.3.3, 3.3.5

3.4 Inventory management
The project must have the facilities and tools to maintain hardware parts inventory and track its content

↘ model.artifacts:1.1

↘ model.tools:3.2, 3.5

3.5 Hardware deployment
The project must have tools for delivering and distributing hardware components

↘ model:2.4

→ model.tools:3.4

model.tools:4 Software support

4.1 Software build
The project must have tools to build software components

↘ model:2.1.2, 3.3

4.2 Software test
The project must have tools to perform verification tests on software components

↘ model:3.3

↘ model.plan:3.3.3, 3.3.5

4.3 Software release
The project must have tools for making, maintaining, and distributing software releases

↘ model:2.4

model.tools:5 Facilities

5.1 Team facilities
The project must have facilities in which the team can work to develop the system

↘ model:3.3

A.3.4 Plan

model.plan:1 General roadmap

1.1 Overall objective
The roadmap must include a clear statement of the objective(s) of the system-building effort

↘ model:3.1.1, 3.6.1

→ model.artifacts:2.3

→ model.plan:3.2.4

1.2 Plan to completion
The project must maintain a plan that shows an estimation of the time and effort required to complete the system

↘ model:2.1.2, 2.2.3, 2.5.4, 3.1.1, 5.1.1

→ model.artifacts:7.2

→ model.team:2.4.1

1.2.1 Reflect uncertainty
The plan must accurately reflect the degree of uncertainty (or risk) in what is known about the steps needed to complete the system

↘ model:2.2.3.1

→ model.plan:8.1, 8.2

1.2.2 Update plan
The plan must be updated as work is completed or uncertainty changes

↘ model.plan:3.3.6, 5.4.1

1.2.3 Include resource estimate
The plan must include estimates of the time and resources required to complete steps in the plan

↘ model:3.2.1, 3.2.2

1.2.4 Detect and handle deadline problems
The project must include processes and milestones that will detect when project deadlines will not be met, determine how to respond, and ensure the response is executed

↘ model:2.2.4

→ model.team:2.4.2

1.2.5 Only project-relevant work in plan
The project must include only work that is relevant to building the system, or managing and supporting that development, in the plan

↘ model:2.2, 3.1.3, 4.3

1.2.6 Efficient execution
The project must organize and plan the work in an efficient way, minimizing time to completion and/or cost without sacrificing quality, customer purpose, or requirements

↘ model:4.3

→ model.plan:2.2, 8.1, 8.2

1.2.7 Navigability
The project plan information must be accessible to team members in a way that allows them to understand how the work will proceed and how it will affect their assignments

↘ model:3.4.2

model.plan:2 Sequencing and prioritization

2.1 Track current effort
The project must include processes to track what tasks are currently being worked on or have been assigned to people to be worked on in the near future

↘ model:2.1.2, 2.5.4

→ model.artifacts:7.3

→ model.team:2.4.3

2.2 Assign next tasks
The project must include processes to determine what tasks should be worked on in the near future, and by whom

↘ model:2.1.2, 2.5.4

↘ model.plan:1.2.6

→ model.artifacts:7.3

→ model.team:2.4.3

2.3 Detect and handle tasking problems
The project must include processes and milestones to detect when there are problems performing one or more tasks, determine how to respond, and ensure the response is executed

↘ model:2.1.2, 2.5.4

2.4 Navigability
The scheduling information must be accessible to team members in a way that allows them to determine what work they should be doing, and who is doing work related to theirs

↘ model:3.4.2, 3.4.5

↘ model.plan:6.1

model.plan:3 Process and lifecycle

3.1 Defined lifecycle
The project must have a defined lifecycle that defines the processes people must follow to build and deploy the system

↘ model.plan:3.2, 3.3

3.2 Project startup
The project lifecycle must include the processes involved in starting the project

→ model.plan:3.1

3.2.1 Learn and verify customer purpose
The project lifecycle must include an initial step to learn the customer’s purpose for the system and ensure that the team properly understands that purpose

↘ model:2.1.1, 2.1.3, 4.2.2

→ model.team:2.1.1.1

3.2.2 Learn and verify available resources
The project lifecycle must include an initial step to determine the resources initially available for building the system

↘ model:2.2.1

↘ model.plan:4.1

3.2.3 Establish organization structure
The project lifecycle must include an initial step to decide on and document the structure of the team that will be working on the project

↘ model:3.4.2, 3.4.4

→ model.artifacts:3.1

3.2.4 Establish reasons for building the system
The project lifecycle must include an initial step to determine whether the team should build the system, and if so, why

↘ model.plan:1.1

3.2.5 Establish regulatory constraints
The project lifecycle must include an initial step to determine what regulations apply to the system, including the potential need for certification

↘ model:2.3.1, 6.1.1

→ model.team:2.3.1.1

3.2.6 Establish organization expectations
The project lifecycle must include an initial step to determine what expectations has of the project, including definition of constraints on the project and system

↘ model:4.3

→ model.external:3.2

3.3 System building
The project lifecycle must include the processes involved in building the system

↘ model:2.1.2, 2.1.4

↘ model.plan:5.4

→ model.plan:3.1

3.3.1 Design to purpose
The project lifecycle must provide processes and milestones that ensure that the system design accurately reflects the customer’s purpose for the system

→ model.artifacts:2.1

3.3.1.1 Design to meet regulation
The project lifecycle must provide processes and milestones that ensure the system design meets regulatory requirements

↘ model:2.3.2

→ model.team:2.3.2

3.3.1.2 Design for release/deployment
The project lifecycle must provide processes and milestones that ensure that the resulting system can be released or deployed

↘ model.plan:3.4

3.3.2 Build to purpose
The project lifecycle must provide processes and milestones that ensure that the built system accurately reflects the customer’s purpose for the system

↘ model:2.1.2

3.3.2.1 Build to meet regulation
The project lifecycle must provide processes and milestones that ensure the built system meets regulatory requirements

↘ model:2.3.2

→ model.team:2.3.2

3.3.2.2 Build for release/deployment
The project lifecycle must provide processes and milestones that ensure that the built system can be released or deployed

↘ model.plan:3.4

3.3.3 Verify against purpose
The project lifecycle must provide processes and milestones that verify that the system meets the customer’s purpose

→ model.tools:3.3, 4.2

3.3.3.1 Verify against regulation
The project lifecycle must provide processes and milestones that verify that the system meets regulation

↘ model:2.3.2, 6.1.3

→ model.artifacts:8.4

→ model.team:2.3.2

3.3.4 Verify no extraneous behavior
The project lifecycle must provide processes and milestones that verify that the system does not include functions or behavior that is outside the customer’s purpose

3.3.5 Identifying and fixing errors
The project lifecycle must provide processes and milestones that ensure error will be detected with high likelihood, and that detected errors will be fixed

↘ model:2.1.4, 2.1.5, 3.4.5

→ model.tools:3.3, 4.2

3.3.6 Adaptation
The project lifecycle must provide processes and milestones to adapt the plans and design as the team learns about the system or as uncertainties are resolved

↘ model:2.5.4

→ model.plan:1.2.2

3.3.7 Perform certification/approval
The project lifecycle must provide processes to result in certification or approval, if required for the system

↘ model:2.3.2, 6.1.2

→ model.artifacts:8.2.2

→ model.team:2.3.1.5

3.3.7.1 Regulatory notification
The project lifecycle must define events at which the project must notify regulators, and processes by which the notification information is gathered and delivered

↘ model:6.2.2

3.4 System release and deployment
The project lifecycle must include the processes involved in releasing and deploying the system

↘ model:2.4

→ model.artifacts:6.2

→ model.plan:3.3.1.2, 3.3.2.2

→ model.team:2.2.4

model.plan:4 Budgets

4.1 Resources
The budgets must include the amount of various resources allocated to the project

↘ model:2.2, 2.2.1, 5.1.1

→ model.artifacts:7.1

→ model.plan:3.2.2

4.1.1 Amount used
The budget must accurately record the amount of resources already used in the project

↘ model:2.2.2

→ model.team:5.1

4.1.2 Amount remaining
The budget must provide accurate measures of how much resource remains available

4.2 Deadlines
The budgets must include milestones or other deadlines that the project must meet

↘ model:2.2, 5.1.1

model.plan:5 Change handling

5.1 Receive change request
The project must have a process for receiving and documenting a request for change in purpose or features

↘ model:2.5.1

↘ model.external:4.3.1

↘ model.plan:7.1

→ model.artifacts:2.2

5.2 Receive regulatory change
The project must have a process for detecting and receiving changes in regulatory requirements

↘ model:2.5.2

→ model.artifacts:2.2, 8.1

5.3 Decision process
The project must have a process for determining whether to proceed with a change request or not

↘ model:2.5.1

↘ model.external:4.3.1

→ model.artifacts:2.2

5.4 System change
The project lifecycle must provide processes and milestones for building changes to the system

↘ model.external:4.3.1

↘ model.plan:7.1

→ model.plan:3.3

5.4.1 Plan change
The project lifecycle must provide processes for updating plans when change requests are accepted for building

→ model.plan:1.2.2

→ model.team:2.4.1

model.plan:6 Team-related

6.1 Determine need for staffing changes
The project lifecycle must provide processes and milestones for detecting when a change in staffing is appropriate, and the following through on the needed changes

↘ model:3.2.1, 3.2.2

→ model.external:1.3

→ model.plan:2.4

→ model.team:2.4.5

6.2 Maintain team information
The project lifecycle must provide processes and milestones for maintaining information about the structure, roles, responsibilities, and authority in the team

↘ model:3.4.2, 3.4.4

→ model.team:2.4.4

6.3 Adding team members
The project lifecycle must provide a process for adding new team members to the processes and tools, and educating them about the project

↘ model:3.2.1, 3.4.2, 3.4.4

↘ model.external:1.3.1

→ model.artifacts:3.1

→ model.team:2.4.4

6.4 Removing team members
The project lifecycle must provide a process and definitions of triggering events for removing a member from the team

↘ model:3.2.2

↘ model.external:1.3.2, 1.4

→ model.artifacts:3.1

→ model.team:2.4.4

model.plan:7 Regulatory-related

7.1 Receive and process regulatory issues
The project lifecycle must provide a process by which the project can receive notification of issues from the regulator, identify remedies, implement the remedies, and respond to the regulator

↘ model:6.3.1, 6.3.3

→ model.plan:5.1, 5.4

model.plan:8 Risk and uncertainty

8.1 Project uncertainty
The project lifecycle must track and manage uncertainties or risks related to project execution

↘ model.plan:1.2.1, 1.2.6

→ model.artifacts:7.4

→ model.team:2.4.6

8.2 Technical uncertainty
The project lifecycle must track and manage uncertainties or risks related to the system being built

↘ model.plan:1.2.1, 1.2.6

→ model.artifacts:4.1

→ model.team:2.2.3

8.2.1 Efficient burn-down
The project lifecycle must provide a process and milestones that lead to efficiently reducing technical uncertainty as the project progresses

A.3.5 External responsibilities

model.external:1 Team-related

1.1 Satisfaction
Project and organization management must take steps to provide team members with the information they need to give them confidence that the project is on track and will challenge the team members

↘ model:3.1.1, 3.1.2

1.2 Appropriate assignments
Project management must take steps to match team members with tasks that are needed and that challenge the team member

↘ model:3.1.2, 3.1.3

1.3 Appropriate staffing level
Project management must manage the makeup of the team to ensure that the project has the right people to do the work, including having processes to hire, contract, or let go of staff

↘ model:3.2.1, 3.2.2, 4.1.2.2, 4.1.2.3

↘ model.plan:6.1

1.3.1 Hiring
Project management and the hosting organization must be able to hire or move in staff when needed for the project

→ model.plan:6.3

1.3.2 Firing or transfer out
Project management and the hosting organization must be able to move out or let go staff who are no longer needed for the project

→ model.plan:6.4

1.4 Respond to team issues
Project management must respond appropriately to issues raised by team members, both technical issues and non-technical issues

↘ model:3.4.5, 4.1.2.1

→ model.plan:6.4

1.5 Fit to organization
The organization must provide the team with an understanding of how the project and the team members fit into the organization

↘ model:3.4.1, 4.1.2.1

1.6 Compensation
Project and organization management are responsible for setting compensation levels for team members at a level that is acceptable to both the staff and the project/organization

↘ model:3.5

1.7 Evaluation
Project and organization management are responsible for setting and documenting the evaluation process that will be used to judge each team member’s performance

↘ model:3.4.3, 4.1.2.1

1.8 Ethics policy
Project and organization management are responsible for setting and documenting an ethics policy that applies to all team and organization activities, along with mechanisms for reporting and resolving potential ethics violations

↘ model:3.6.2, 4.1.2.1

1.9 Workplace regulation
The organization management must provide a workplace that meets regulation

↘ model:4.1.2.4

model.external:2 Customer-related

2.1 Ability to sell
The organization must have the ability to sell the system produced (when appropriate)

↘ model:4.2

2.2 Ability to define market(s)
The organization must have the ability to define plausible market(s) for the system

↘ model:4.2.2

2.3 Sales and marketing capability
The organization must have a capability to perform sales and marketing of the system

↘ model:4.2.3

model.external:3 Resource-related

3.1 Sufficient resources
The organization and project management are responsible for obtaining funding and other resources sufficient to perform the project

↘ model:4.1.2.2

3.2 Define expected return
The organization must define the expected return on investment for the project to build the system

↘ model:4.3

↘ model.plan:3.2.6

model.external:4 Organization-related

4.1 Reusable capability
The organization must take steps to build up reusable capabilities that can apply to multiple projects

↘ model:4.4.2

4.2 Ongoing improvement
The organization must have the capability to learn and improve its capabilities over time

↘ model:4.4.3

4.3 Communication with funder
The organization must communicate with the funder in a way that gives the funder visibility into the organization’s behavior and progress on the project

↘ model:5.1.1

4.3.1 Receive and process funder concerns
The organization must have the capability to receive concerns from the funder, discuss the concerns, and take steps to address the concerns

↘ model:5.1.2

→ model.plan:5.1, 5.3, 5.4

Acknowledgments

The salt crew (Vanessa Kuroda, Peter Sachs, and Max Nova)
Zack Lovering
Dave Brown
Lawrence Turner
Olin Sibert
Alicia Noyola

Bibliography

[14CFR23]	“Part 23—Airworthiness standards: normal category airplanes”, in Title 14, Code of Federal Regulations, United States Government, December 2023, https://www.ecfr.gov/current/title-14/chapter-I/subchapter-C/part-23, accessed 14 February 2024.
[Agile]	Wikipedia contributors, “Agile software development”, in Wikipedia, the Free Encyclopedia, https://en.wikipedia.org/w/index.php?title=Agile_software_development&oldid=1198512141, accessed 14 February 2024.
[Alexander02]	Christopher Alexander, The Nature of Order, Berkeley, California: The Center for Environmental Structure, 2002.
[Alexander15]	Christopher Alexander, A City is not a Tree, Portland, Oregon: Sustasis Press, 2015.
[Alexander77]	Christopher Alexander, A Pattern Language, Oxford University Press, 1977.
[Ambler23]	Scott Ambler, “What happened to the Rational Unified Process (RUP)?”, https://scottambler.com/what-happened-to-rup/, accessed 29 February 2024.
[Asimov50]	Isaac Asimov, I, Robot, New York: Gnome Press, 1950.
[BCP14]	Scott Bradner, “Key words for use in RFCs to Indicate Requirement Levels”, Internet Engineering Task Force (IETF), Best Community Practice BCP 14, March 1997, https://www.ietf.org/rfc/bcp/bcp14.html.
[Bain99]	David Haward Bain, Empire Express: Building the First Transcontinental Railroad, New York: Viking, 1999.
[Bezos16]	Jeffrey P. Bezos, “2015 Letter to Shareholders”, Amazon.com, Inc., 2016, https://s2.q4cdn.com/299287126/files/doc_financials/annual/2015-Letter-to-Shareholders.PDF, accessed 22 February 2024.
[Bogan17]	Matthew R. Bogan, Thomas W. Kellermann, and Anthony S. Percy, “Failure is not an option: a root cause analysis of failed acquisition programs”, Naval Postgraduate School, Technical report NPS-AM-18-011, December 2017, https://nps.edu/documents/105938399/110483737/NPS-AM-18-011.pdf.
[Brand94]	Stewart Brand, How Buildings Learn: What Happens After They’re Built, New York: Penguin Books, 1994.
[CMMI]	ISACA, “What is CMMI?”, https://cmmiinstitute.com/cmmi/intro, accessed 24 March 2024.
[Collins74]	Michael Collins, Carrying the Fire: An Astronaut’s Journeys, New York: Farrar, Straus and Giroux, 1974.
[Conway68]	Melvin E. Conway, “How do committees invent?”, Datamation, vol. 14, no. 4, April 1968, pp. 28–31, http://www.melconway.com/Home/pdf/committees.pdf.
[DFARS]	“Defense Federal Acquisition Regulation Supplement”, General Services Administration, United States Government, January 2024, https://www.acquisition.gov/dfars, accessed 16 February 2024.
[DOD22]	“Work Breakdown Structures for Defense Materiel Items”, Department of Defense, United States Government, Standard Practice MIL-STD-881F, May 2022, https://cade.osd.mil/Content/cade/files/coplan/MIL-STD-881F_Final.pdf.
[ELOMC]	Engineering Lifecycle Optimization—Method Composer, IBM, version 7.6.2, 2023, https://www.ibm.com/docs/en/engineering-lifecycle-management-suite/lifecycle-optimization-method-composer/7.6.2, accessed 29 February 2024.
[EPF]	Eclipse Process Framework Project (archived), Eclipse Foundation, 2018?, https://projects.eclipse.org/projects/technology.epf, accessed 29 February 2024.
[FAR]	“Federal Acquisition Regulation”, General Services Administration, United States Government, January 2024, https://www.acquisition.gov/browse/index/far, accessed 16 February 2024.
[Fowler05]	Martin Fowler, “Code as Documentation”, 22 March 2005, https://martinfowler.com/bliki/CodeAsDocumentation.html, accessed 15 March 2024.
[Garmin13]	G3X Installation Manual, Garmin, 190-01115-01, Revision K, July 2013.
[Hardin20]	Russel Hardin, and Garrett Cullity, “The Free Rider Problem”, in The Stanford Encyclopedia of Philosophy, Edward N. Zalta, editor, Metaphysics Research Lab, Stanford University, 2020, https://plato.stanford.edu/archives/win2020/entries/free-rider/, accessed 28 March 2024.
[ISO26262]	“Road vehicles — Functional safety”, Geneva, Switzerland: International Organization for Standardization, Standard ISO 26262:2018, 2018.
[JPL00]	JPL Special Review Board, “Report on the Loss of the Mars Polar Lander and Deep Space 2 Missions”, Jet Propulsion Laboratory, Report JPL D-18709, March 2000, https://smd-cms.nasa.gov/wp-content/uploads/2023/07/3338_mpl_report_1.pdf.
[Jacobson88]	Van Jacobson, “Congestion Avoidance and Control”, Proc. SIGCOMM, vol. 18, no. 4, August 1988.
[Johnson22]	Clair Hughes Johnson, Scaling People: Tactics for Management and Company Building, South San Francisco, California: Stripe Press, 2022.
[Kalra16]	Nidhi Kalra, and Susan M. Paddock, “Driving to safety: How many miles of driving would it take to demonstrate autonomous vehicle reliability?”, Santa Monica, CA: RAND Corporation, Report RR-1478-RC, 2016, https://www.rand.org/pubs/research_reports/RR1478.html.
[Klein14]	Gerwin Klein, June Andronick, Kevin Elphinstone, Toby Murray, Thomas Sewell, Rafal Kolanski, and Gernot Heiser, “Comprehensive formal verification of an OS microkernel”, ACM Transactions on Computer Systems, vol. 32, no. 1, February 2014, https://trustworthy.systems/publications/nicta_full_text/7371.pdf.
[Kruger00]	Justin Kruger, and David Dunning, “Unskilled and unaware of it: how difficulties in recognizing one’s own incompetence lead to inflated self-assessments”, Journal of Personality and Social Psychology, vol. 77, no. 6, January 2000, pp. 1121–1134.
[Leveson00]	Nancy G. Leveson, “Intent specifications: an approach to building human-centered specifications”, IEEE Transactions on Software Engineering, vol. 26, no. 1, January 2000, http://sunnyday.mit.edu/papers/intent-tse.pdf.
[Leveson11]	Nancy G. Leveson, Engineering a safer world: systems thinking applied to safety, Engineering Systems, Cambridge, Massachusetts: MIT Press, 2011.
[Lynch89]	Nancy A. Lynch, and Mark R. Tuttle, “An introduction to input/output automata”, Cambridge, Massachusetts: Massachusetts Institute of Technology, Technical memo MIT/LCS/TM-373, 1989, https://www.markrtuttle.com/data/papers/lt89-cwi.pdf.
[NASA16]	“NASA Systems Engineering Handbook”, National Aeronautics and Astronautics Administration (NASA), Report NASA SP-2016-6105 Rev2, 2016.
[NASA18]	“NASA Work Breakdown Structure (WBS) Handbook”, National Aeronautics and Astronautics Administration (NASA), Handbook SP-2016-3404/REV1, 2018, https://essp.larc.nasa.gov/EVM-3/pdf_files/NASA_WBS_Handbook_20180000844.pdf.
[NASA19]	“Debris Assessment Software User’s Guide, Version 3.0”, National Aeronautics and Astronautics Administration (NASA), Report NASA TP-2019-220300, 2019.
[NPR7120]	“NASA Space Flight Program and Project Management Requirements”, National Aeronautics and Astronautics Administration (NASA), NASA Procedural Requirement NPR 7120.5F, 2021.
[NPR7123]	“NASA Systems Engineering Processes and Requirements”, National Aeronautics and Astronautics Administration (NASA), NASA Procedural Requirement NPR 7123.1D, 2023.
[OConnor21]	Timothy O’Connor, “Emergent Properties”, in The Stanford Encyclopedia of Philosophy, Edward N. Zalta, editor, Metaphysics Research Lab, Stanford University, 2021, https://plato.stanford.edu/archives/win2021/entries/properties-emergent/, accessed 13 February 2024.
[Olson65]	Mancur Olson, The Logic of Collective Action: Public Goods and the Theory of Groups, Harvard Economic Studies, Cambridge, Massachusetts: Harvard University Press, 1965.
[Parnas72]	D. L. Parnas, “On the criteria to be used in decomposing systems into modules”, Communications of the ACM, vol. 15, no. 12, December 1972, pp. 1053–1058.
[Spiral]	Wikipedia contributors, “Spiral model”, in Wikipedia, the Free Encyclopedia, https://en.wikipedia.org/w/index.php?title=Spiral_model&oldid=1068244887, accessed 14 February 2024.
[TTSB21]	“Major Transportation Occurrence—Final Report, China Airlines Flight CI202”, Taiwan Transportation Safety Board, Report TTSB-AOR-21-09-001, September 2021, https://www.ttsb.gov.tw/media/4936/ci-202-final-report_english.pdf.
[Thompson99]	Adrian Thompson, and Paul Layzell, “Analysis of unconventional evolved electronics”, Communications of the ACM, vol. 42, no. 4, April 1999, pp. 71–79.
[Tolstoy23]	Leo Tolstoy, Anna Karenina, Constance Garnett, translator, Project Gutenberg, 2023, https://www.gutenberg.org/ebooks/1399.
[Tuckman65]	Bruce W. Tuckman, “Developmental sequence in small groups”, Psychological Bulletin, vol. 63, no. 6, 1965, pp. 384–399.
[Tuckman77]	Bruce W. Tuckman, and Mary Ann C. Jensen, “Stages of small-group development revisited”, Group and Organization Studies, vol. 2, no. 4, 1977, pp. 419–427.
[Wilson05]	Simon P. Wilson, and Mark John Costello, “Predicting future discoveries of European marine species by using a non-homogeneous renewal process”, Journal of the Royal Statistical Society Series C: Applied Statistics, vol. 54, no. 5, November 2005, pp. 897–918, https://academic.oup.com/jrsssc/article/54/5/897/7113002.
[Zhang90]	Lixia Zhang, and David D. Clark, “Oscillating behavior of network traffic: a case study simulation”, Internetworking: Research and Experience, vol. 1, 1990, pp. 101–112.

Making systems

Table of contents

Part I: Introduction and purpose

Chapter 1: Introduction

Chapter 2: What is “making a system”?

2.1 An example

2.2 Generalizing

2.2.1 Fitting parts together

2.2.2 Meeting objectives

2.2.3 Coordinating

2.2.4 Managing artifacts

2.2.5 Communicating

2.3 The rest of the book

Chapter 3: Stories about building systems

3.1 Developing a spacecraft mission without engineering the system

3.2 Marketing and engineering collaboration

3.3 Missing implicit requirements

3.4 Building at a mismatch to purpose

3.5 Heavyweight, understaffed processes

3.6 Planning the transcontinental railroad

Part II: Systems background

Chapter 4: A model of systems work

Chapter 5: Elements of systems

5.1 Introduction

5.2 System purpose

5.3 System parts and views

5.4 Structure and emergence

5.5 Evidence

5.6 Using this model

Chapter 6: Elements of making a system

6.1 Introduction

6.2 Objective

6.3 Model

6.3.1 Artifacts

6.3.2 Tasks

6.3.3 Team

6.3.4 Tools

6.3.5 Operations

6.4 Using this model

Chapter 7: A well-functioning project

7.1 Project leadership

7.1.1 Principle: Communication and translation

7.1.2 Principle: Provide staff to run the engineering team’s operations

7.1.3 Principle: Systems view of the system

7.1.4 Principle: The team is a system

7.1.5 Principle: Team habits

7.1.6 Principle: Keep it lightweight and actionable

7.2 System-building tasks

7.2.1 Principle: Start with a purpose before doing work

7.2.2 Principle: Evaluate tools before adopting them

7.2.3 Principle: Follow the spirit, not just the letter

7.2.4 Principle: Document things so there is a future

7.2.5 Principle: Build in checks

7.2.6 Principle: Work against cognitive biases

7.3 Plan for building the system

7.3.1 Principle: Prioritize work by risk or uncertainty

7.3.2 Principle: Prioritize integration

7.3.3 Principle: Have a long-term plan

7.3.4 Principle: Set up intermediate internal milestones

7.3.5 Principle: Use prototyping safely

7.3.6 Principle: Analyze for feasibility

7.4 The team

7.4.1 Principle: Document team structure

7.4.2 Principle: Plan on reorganizing the team as it grows

7.4.3 Principle: Have shared procedures

7.4.4 Principle: Define regular communication paths

7.4.5 Principle: Define exceptional communication paths

7.4.6 Principle: Provide independent resources for checks

Part III: Systems

Chapter 8: Purpose

8.1 Introduction

8.2 Why purpose matters

8.2.1 Not monolithic or fixed

8.2.2 Inconsistent or conflicting purposes

8.3 Explicit purpose

8.4 Implicit purpose

8.5 Using purpose

Chapter 9: Component parts

9.1 Introduction

9.2 Definition of component