Making systems

Volume 1: Fundamentals
Richard Golding

Copyright ©2024 by Richard Golding

Release: 0.3-review

Table of contents

Chapter 1: Introduction

9 May 2024

This book began as many presentations and short documents that I put together for different projects over the years. Those presentations covered topics from basic requirements management to good distributed system design to how to plan and operate a project that was regularly in flux. A few of the documents were retrospectives about why a project had run into trouble or failed. Others were written in an attempt to head off a problem that I could see coming.

I have worked on many projects. Most of these have been about building a complex system, or one that required high assurance—a system where safety or security is critical to its correct operation. Some have gone well, but all have had problems. Sometimes those problems led to the project failing. More often they have cost the project time and money, or resulted in a system that was not as good as it should have been. In every case the problems have caused unnecessary effort and pain for the team working on the project.

This raised the question: what could be learned from these projects? How can future projects go better?

I began to sense that there were some common threads among all the education and advice I was putting together for these teams, and the problems they were having. With the help of some colleagues who were working on their own challenging projects, I began to sort through these impressions in order to articulate them clearly and gather them in one place.

I have found that many of these problems have come at the intersection of systems engineering, project management, and project leadership. Building a complex system effectively requires all three of these disciplines working together. Most of the problems I have seen have arisen from a breakdown in one or more of them: where there is capable project management, for example, but poor systems engineering, or vice versa.

The intersection is about how each of these disciplines contributes to the work of building a system. The intersection is where people maintain a holistic view of the project. It is where technical decisions about system structure interact with work planning; where project leadership sets the norms for how engineers communicate and check each other’s work. It is where competing concerns like cost versus rigor get negotiated. And it is where people take a long view of the work, addressing how to prepare for the work a year or more in the future.

I’ve worked with many people who were good at one of these disciplines, but didn’t understand how their part fit together with others to create a team that could build something complex while staying happy and efficient. I have worked with well-trained systems engineers who knew the tools of their craft, but did not know how or, more importantly, why to use them and how they fit together. I have worked with project managers who had experience with scheduling and risk management and other tools of their craft, but lacked the basic understanding of what was involved in the systems part of the work they were managing. I have also worked with engineers and managers who were tasked with assembling a team, but did not understand what it means to lead a team so that it becomes well-structured and effective.

In other words, they were all good at their individual disciplines but they lacked the basic understanding of how their discipline affects work in other disciplines, and how to work with people in the other disciplines to achieve what they set out to do.

And that brings me to the basic theme of this book: that making systems is a systems problem, an integration problem. The system that is being built will be made of many pieces that, in the end, will have to work well together. This requires at least some people having a holistic, systemic view of the thing being built. The team that builds the system is itself a system, and its parts—its people, roles, and disciplines—need to work together. The team is something to be engineered and managed, and it needs people who maintain a holistic view of how its parts work together.

This book is not a book on systems engineering or project management per se. Rather, it provides an overarching structure that organizes how the systems engineering, project management, and leadership disciplines contribute to systems-building. While I reference material from these disciplines as needed, do not expect (for example) to learn the details of safety analyses here. I do discuss how those analyses fit with the other work needed for building a system, and provide some references to works by people who have specialized in those topics.

This book is for people who are building complex systems, or are learning how to do so. I provide a structure to help think about the problems of building systems, along with ways to evaluate the different approaches one can take to solving those problems on a specific project. I relate experience and advice where I have some.

Using this book. This book covers two topics: the system being built, and how to go about building that system. These topics are intertwined, because the point of going to the effort of building a system is to end up with a well-functioning one.

The first two parts of this book are meant for everyone, and to be read first. They provide a general foundation for talking about making systems: a simplified but holistic view of the subject. Part I presents a short set of case studies to motivate what I’m talking about. Part II presents models for thinking about systems and the making of systems at a high level, along with recommended principles for both.

The two parts that follow provide more detailed discussions of what systems are (Part III) and what systems-building is (Part IV). These parts expand on the material in Part II, providing more structure for talking about each of their subjects. These parts are meant to be read after the foundational parts, but need not be read in order or all at once.

The remaining parts go into depth on concepts and tools that help with building complex systems. These include topics like project life cycles, system design, team organization, and planning. These parts use the language built up in the first parts. The later chapters are meant to be read as needed: when you find you need to know about a topic, dip into relevant chapters.

Chapter 2: A note of caution

15 August 2024

This work aims to help people understand how to do a better job of building complex systems. The strategy I use is to gather together in one place many things that people have already learned, but not necessarily understood as connected.

This strategy has good company. Many people over the years have worked to improve engineering and management practices. Many of those works have led to improved project performance and better systems—and every one of them I know about has had a downside. This work will be no exception.

I imagine the reader as I am writing this work, as if we are having a conversation. But in truth this is not a conversation; whoever reads this cannot ask clarifying questions, and I cannot respond with better explanations where the writing is unclear. This leaves me to wonder: will the reader read and understand what I meant to say?

Everything I am writing is based on my own experience, whether it comes directly from projects I have worked on or from what I have learned from others. This raises, as for any work, questions about biases in viewpoint and correctness. I have tried to question my viewpoints by checking them with others and comparing my conclusions to their experience, but that will always be an imperfect approach to the truth. So there are two other questions: is what I present here correct? And will it apply to new situations that I do not now imagine?

There is yet a third worry in writing a work like this. Will it come to be treated as the truth, unquestioned? Will someone treat it as dogma?

A work like this, which provides practical guides for doing complex projects, can prod people who already have some experience into new thinking about how they do their work, adding new perspectives or providing overviews that help them think about what they are already doing. It adds to what they already know, but does not serve as the only guide for how they work.

In the longer term, though, every work like this that I know about has come to be taken as a One True Approach, where decisions are justified by saying “that’s how it is written”—without the people involved actually understanding why that approach says what it says, and not thinking through how the guidance applies to the actual work they have in front of them.

There are two examples that illustrate this behavior.

NASA has an extensive set of processes and procedures, with extensive documentation. The NASA Systems Engineering Handbook [NASA16] is an accessible way to start exploring that process. The processes have evolved over several decades of experience in what leads space flight missions to fail or succeed, and the procedural requirements are full of small details. People will do well to understand and generally follow those procedures for similar projects.

However, this has led too many people within NASA and the associated space industry to follow these requirements blindly. A NASA project must go through a sequence of reviews and obtain corresponding approvals to continue (and get funding). I have watched projects treat those reviews as pro forma: they need a requirements review, so they arrange a requirements review, but nobody actually builds a useful (or even consistent) set of requirements to be reviewed. A preliminary design review has a checklist of criteria to be met, so presentation slides are prepared for each point, but the reasons behind those criteria aren’t really addressed. And a little while later the project is canceled because it isn’t making good engineering progress.

Similarly, the Unified Modeling Language (UML) [OMG17] brought together the experience of many software and system engineering practitioners to create a common notation for diagramming to describe complex systems. The notations provided a common language teams could use to express the structure and behavior of systems, so that everyone on a project could understand what a diagram meant. A common notation allowed organizations to build tools to generate and analyze these drawings. While not everyone follows the notation standard exactly, the standard has improved the ability of many engineers to document and understand systems. I certainly use many elements of UML diagramming regularly.

The UML has had a corresponding downside. Engineers who first learned how to think about systems using UML have had trouble thinking in other ways. In particular, there are some aspects of system specification and design that are not fundamentally graphical—but some people who have grown up on UML find themselves uncomfortable working with tabular or record-structured information in databases (such as for requirements). I worked with one engineer who was using the SysML dialect of UML, which does not include all of the diagram types in the main UML language. He needed a kind of diagram that is in UML but not in SysML, yet was shocked at the suggestion that he should simply use the UML diagram anyway, because it “wasn’t part of SysML”.

Generalizing from these, the problem is that when something provides a general, comprehensive guide about how to do complex work, this thing can come to be treated as the only answer. People who only learn from that one source can end up with a constricted understanding of how to do the work.

And so I hope that those who read this work will not take what is written here as the only word on the subject. A guide is not a substitute for learning and thought. I hope that readers will take what I have gathered here as inspiration to think about the work they will encounter in making complex systems, drawing on their own experience and the experience of others around them as well as what has been written. I hope that the points in this book help people keep in mind why something is being done, so that they can address the spirit of the need in addition to the rules of any particular procedure or methodology.

Finally, this last point has caused some reviewers to raise a concern about this kind of caution. Because this book insists that it does not have the complete answers to anything, and that people will have to think for themselves to do good systems work, it leaves the door open for anyone to justify to themselves any choice they care to make.

This is a valid concern. I have worked with lots of people who have made bad management and engineering decisions, and who were convinced they were right. In truth everyone makes poor decisions, and everyone has limited perspective. Every single project and every single person involved in building systems will face difficult decisions and will make some of them poorly.

In the end, I have decided that this is not something one can address with a book. I do not know who will read this work in future; I cannot know or address the specific problems they will have. I cannot have a conversation with each of you to try to sort through the actual problems and decisions you encounter. All I can do is provide you with one perspective, and hope that you will add it to your own in useful ways.

Part I: System stories

A set of case studies illustrating what can go right and wrong in a project to build a system.

Chapter 3: Making a simple system

30 April 2024

This book is about both what a well-built system is and how to make that happen. To begin, I’ll start with a simple story: building a small cottage model out of Lego™ bricks.

This story is made up, but it reflects some of the situations I have found in real projects I have worked on. It deliberately illustrates problems in a simplified and perhaps exaggerated way to make them clear. The simplifications include: a very small team, and one that doesn’t need to grow during the project; customer “needs” that are simple; and a project that does not need to consider real emergent properties like safety, security, or even mechanical strength.

3.1 The request

A customer wants a small cottage model, built out of Lego™ bricks. They would like the cottage to be white. They would like it to have a window. They have a base plate they would like it to fit on.

Someone on the team works with the customer to get this information and understand the needs. This results in a sketch, which the customer agrees reflects what they have asked for (Figure 3.1).

Figure 3.1: Sketch of customer needs

3.2 Building the cottage

The project gets its team together, and they begin discussing how to design and build the cottage. Based on the sketch concept, they decide to split the work: one person for each of four walls, and one person for the roof.

The team discusses some basic design parameters. They decide on the length and height of each of the walls, based on the size of the base plate and the rough ratio of the sides in the sketch. They also decide which wall will get the window.

Each person on the team then begins designing and building their part, based on the sizes they have agreed on. The result is a set of five assemblies (Figure 3.2).

Figure 3.2: Initial components

Right away there are some visible problems.

The team then try to integrate the assemblies together to make the cottage. The result is not good (Figure 3.3).

Figure 3.3: Initial integration, with problems

There are integration problems.

At this point, the team addresses some of these issues. They add roof supports to the front and back walls, and redesign all the walls to interlock at the corners.

The result is a structure that integrates all the components (Figure 3.4).

Figure 3.4: Initial integrated cottage

There are still problems with the integrated cottage.

The problems with the side wall come from one of the team members rushing to rebuild that wall after they were reminded that the cottage was to be all white and not have red stripes.

The missing door is a specification problem that came to light when the customer saw the completed cottage. The original sketch developed with the customer didn’t include a door—it only had an annotation about a window. People implicitly know that cottages need doors, but builders may leave the door out if it isn’t explicitly specified.

After some systems work, the team corrects the problems, fixing the side wall and adding a door. Correcting the problems involved taking the cottage more than halfway apart and rebuilding it. The result meets what the customer wanted (Figure 3.5).

Figure 3.5: Final integrated cottage

3.3 Retrospective on building

There were several problems that the team encountered building the cottage.

The team did not work with the customer to develop a thorough understanding of the customer’s needs. The team only had a minimal writeup of the needs, and that writeup left an important need implicit (the need for a door).

Next, the team did not develop a concept of the system (the cottage) and check that concept with the customer. For example, the team could have made a more realistic drawing of the cottage, and talked with the customer about how the cottage would be used. Checking a concept would have probably caught the missing implicit requirement for a door.

To their credit, the team decomposed the cottage into components (walls and roof), defined some dimensional requirements each would meet, and assigned someone to design each component. Unfortunately the team did not work out and document the interfaces between components. This meant that no one looked at how the walls would be joined (interlocking or not), and no one looked at how the roof would be supported on some walls.
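
To make “documenting an interface” concrete, here is a minimal sketch of what even a toy interface agreement between two wall builders could look like. The sketch is hypothetical: the component names, dimensions, and the choice of Python are my own invention, not something the story’s team produced. The point is only that an interface definition can be written down explicitly and checked before anyone starts snapping bricks together.

```python
from dataclasses import dataclass

@dataclass
class WallSpec:
    """One builder's declared plan for their wall (hypothetical values)."""
    name: str
    length_studs: int       # length along the base plate, in studs
    height_bricks: int      # wall height, in brick courses
    corner_interlock: bool  # offsets alternate courses to interlock at corners?
    roof_support: bool      # carries part of the roof?

def corner_interface_problems(a: WallSpec, b: WallSpec) -> list[str]:
    """Check that two adjoining walls agree on their shared interface."""
    problems = []
    if a.height_bricks != b.height_bricks:
        problems.append(f"{a.name} and {b.name} disagree on wall height")
    if a.corner_interlock != b.corner_interlock:
        problems.append(f"{a.name} and {b.name} disagree on corner interlocking")
    return problems

# Hypothetical example: the front wall and one side wall, as each builder planned them.
front = WallSpec("front wall", length_studs=16, height_bricks=9,
                 corner_interlock=True, roof_support=True)
side = WallSpec("side wall", length_studs=10, height_bricks=8,
                corner_interlock=False, roof_support=False)

for problem in corner_interface_problems(front, side):
    print(problem)
```

Even something this small would have forced the two builders to agree on wall height and corner interlocking before building, which is exactly the kind of mismatch the team discovered only at integration.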

One of the team members building a wall did not follow requirements about color—or perhaps the color requirement was missing or unclear.

Finally, the team members did not communicate with each other. Ideally, each one would have shared abstract designs for their component with the people building components connecting to that component. Sharing these designs would likely have caught that each team member had different understandings of how their components would be joined together.

The outcome was that the project took longer than it should have, because of rework that could have been avoided.

Of course, this story is simpler than building a real building would be. A real building has multiple internal systems, such as electrical, plumbing, or HVAC, that would create many more interfaces among components. A real building has to be designed to be mechanically sound; this requires systematic analysis to ensure that the building will stay up even in unusual events like storms or earthquakes. A real building also has safety concerns, like fire safety. Finally, building a real building is regulated in most places, requiring permits, inspections, and approvals from external authorities to ensure regulatory compliance.

3.4 Adding to the cottage

Some time passes, and the customer decides that they would like a larger model cottage, and they make a request to add on to the initial version. The team that built the original cottage has moved on to other projects.

A new team talks with the customer to learn what the customer wants. How much larger do they want the extended cottage to be? Should it be extended horizontally or vertically? The customer indicates that an extension adding between 50% and 100% of the original floor area would be sufficient, and the customer prefers a horizontal extension.

The team next has decisions to make about the overall design of the extension. They settle on an approach that matches the style of the original part and adds a little over half the floor area. They suggest to the customer that a window in the extension would be a good idea, and the customer agrees.

The new team does not have access to the team that made the original design decisions. They have to reverse engineer the design approach used by examining the cottage as built.

The original cottage was located toward the back of the base plate, and the team has decided that the extension should be at the back of the original. This implies that the team will have to move the original cottage forward. The team examine the original structure and determine that it can be moved on the base plate without problems.

The new team works together to design and build the extension. They have learned about the problems that the original team had, and so they manage the interfaces between walls and with the roof better. However, they don’t have access to the decisions that the original team made about interlocking the walls for strength, and so they build the extension as a separate unit.

Figure 3.6: Cottage with addition

3.5 Retrospective on addition

This illustrates a common scenario: changes are made to a system long after it was originally built. The changes can be complex projects on their own. The original team may be long gone, or they may no longer remember details that were not written down. Knowing the original design decisions and the rationale behind them affects how the changes are designed.

The changes do not just add features (new space): they add interfaces between the new parts and the original, and they can change interfaces within the original.

The team that built the original cottage did not document the design decisions they made. The team building the addition had to reverse engineer the design from the built cottage. The lack of information about the rationale for how walls were connected led to a different, less structurally sound approach for connecting the addition to the original structure.

The project to build the addition took longer than it would have if the team had not had to reverse engineer the design. The lack of design rationale led to a structural solution that is sufficient for plastic bricks but would not work in a real structure.

In this story, the new team did learn from the original experience that they should do systems-level work. They worked through the interfaces between new parts, and this led the new project to go more smoothly than the original. The lesson is that learning over time matters.

Once again, this story is a simplification of a real building project. A real building would have far more interfaces: electrical circuits and plumbing might need to be extended. The structure of a real extension would have to be integrated into the original structure.

This story did not show the value of designing the original to be expanded. In the example, the original cottage could have been placed forward on the base plate so there was space for a later addition. In a real building, by analogy, designing the electrical main panel to have space for additional circuits and enough capacity to add more usage would make an addition easier.

3.6 Principles

As I present these stories, I will link them to the principles in Chapter 8 that can provide solutions.

Project leadership. Some of the problems in this story relate to how the cottage-building project was led. The most relevant principle is Section 8.1.3—Principle: Systems view of the system. The original team’s work would have gone more smoothly if they had had someone responsible for ensuring that the system made sense as a system.

System-building tasks. Some of the problems related to how the original team went about its work—which resulted in problems with the final system product.

The team. This story does not illustrate many problems with the team itself. However, the team building the original cottage built each of the components in isolation, and did not discover that their parts would not integrate until the parts had been built.

Chapter 4: Stories about building systems

This chapter presents some case studies of how people have built complex systems.

4.1 Developing a spacecraft mission without engineering the system

10 April 2024

The project. I worked on a NASA small spacecraft project. The project’s objective was to fly a technology demonstration mission to show how a large number of small, simple spacecraft could perform science missions. The mission objectives were to demonstrate performing coordinated science operations on multiple spacecraft, and to demonstrate that the collection of spacecraft could be operated by communicating between one spacecraft and ground systems, and the spacecraft then cross-linking commands and data to perform the science operations.

The problem. The mission had one set of explicit, written mission objectives to perform the technology demonstration. It also had a number of implicit, unwritten constraints placed on it, primarily to re-use particular spacecraft hardware and software designs.

Those two sets of objectives resulted in conflicts that made the mission infeasible. There were three key technical problems: power consumption was far in excess of what the spacecraft’s solar panels could generate; the legacy radios could not communicate effectively over the distances involved; and the design had insufficient computing capability to accurately compute how to point the spacecraft for cross-link communication.

Conflicts like these are not uncommon when first formulating a system-building project, and NASA processes are structured to catch and resolve them. The NASA Procedural Requirements (NPRs), a set of several volumes of required processes, require projects to formalize mission objectives and analyze whether a potential mission design is feasible. This work is checked at multiple formal reviews, most importantly the Preliminary Design Review (PDR).

At the PDR, expected project maturity is:

Program is in place and stable, addresses critical NASA needs, has adequately completed Formulation activities, and has an acceptable plan for Implementation that leads to mission success [italics mine]. Proposed projects are feasible with acceptable risk within Agency cost and schedule baselines. [NPR7120, Table 2-4, p. 30]

This project, however, failed at three of the necessary steps. First, the project did not perform top-down systems engineering, such as proper documentation of mission objectives, a concept of operations, and a refinement of those into system-level and subsystem-level specifications. In particular the implicit and undocumented constraints were never documented as requirements; they were tacitly understood by the team and rarely analyzed. The requirements that were gathered were developed by subsystem leads; they were inconsistent and did not derive from the mission objectives. Second, individual team members did analyses that showed problems with the radios, their antennas, and the ability to point the spacecraft in such a way that cross-link communications would work. The people involved repeatedly tried to find a solution within their individual domains of expertise, and the problems were never raised up to be addressed as a systemic problem. Finally, the PDR was the last check where these problems should have been brought to light, since a genuine refinement of the mission objectives and the concept of operations would have failed to show communication working. Instead, the team focused on making the review look good rather than addressing the purpose of the review.

Outcome. The project proceeded to build the hardware for multiple spacecraft, and began developing the ground systems and the flight software. After several months, the project neared the end of its budget, and the spacecraft design was canceled. Something like two years’ worth of investment was lost, and the capability of performing a multi-spacecraft science mission was never demonstrated.

The agency later found some funds to develop a much simplified version of the flight software and relaxed the mission objectives substantially to only performing some minimal cross-link communications. A version of that mission was eventually flown.

Solutions. The project made four mistakes. Each one of them could have been corrected if the project had followed good practice and NASA's required procedures.

First, the conflicting mission objectives and constraints should have been resolved early in the project. NASA has a formal sequence of tasks for defining a mission and its objectives, leading to a mission definition that is approved and signed by the mission’s funder. If the project had followed procedure, the implicit constraints would have been recorded as a part of this document. Documentation would have encouraged evaluation of the effects of those constraints.

Second, the project did not do normal systems engineering work. The systems engineering team should have documented the mission objectives, developed a concept of operations for the mission, and performed a top-down decomposition and refinement of the mission systems. In doing so, problems with conflicting objectives would have been apparent. The systems leadership would have been involved in analyses of the concept, and thus been aware of where there were problems.

Third, the team lacked effective communication channels that would have helped someone working one individual problem raise the issues they were finding up to systems and project leadership, so that the problems could be addressed as systems issues. For example, one person found that the flight computer would not be able to perform good-enough orbit propagation of multiple spacecraft so that one spacecraft would know how to point its antenna to communicate with another. A different person found problems with the ability of the radios to communicate at the ranges (and relative speeds) involved.

Finally, the PDR should have been the safety net to find problems and lead to their resolution. The NASA procedural requirements have a long list of the products to be ready at the PDR. More than 30 are specifically the responsibility of systems engineering [NPR7123, Table G-6, p. 81], and the project overall has a similar number of products [NPR7120, Appendix I]; there is some overlap between these lists. The team took a checklist approach to these products, putting together presentations for each topic in a way that highlighted progress on the individual topics but failed to address the underlying purpose: showing that there was a workable system design.

Had any of these mechanisms worked, the systems and project leadership would have detected that the conflicting mission objectives were infeasible and led the project to negotiate a solution.

Principles. This example is related to several principles for a well-functioning project.

4.2 Marketing and engineering collaboration

12 April 2024

The project. I worked at a startup company that was building a high-performance, scalable storage system. The ideas behind the system came from a university research project, which had developed a collection of technology components for secure, distributed storage systems.

The company had developed several proof-of-concept components and was transitioning into a phase where it was getting funding and establishing who its customers were. The company hired a small marketing team to work out what potential customers needed and to begin building awareness of the value that the new technology could bring.

The problem. The marketing team had experience with computer systems, but not with storage in particular. They could identify potential market segments, but they did not have the background needed to talk with potential customers about their specific needs.

The engineering team were similarly not trained in marketing. Some of the team members had, however, worked at companies that used large data storage systems, and so had experience being part of organizations like the potential customers.

Solutions. The marketing team set up a collaboration with some of the technical leads. This collaboration left each team in charge of their respective domains, with the technical leads helping the marketing team do their work and the marketing team providing guidance about customer needs to the engineering team.

One of the technical leads acted as a translator between the marketing and engineering teams, so that information flowed to each team in terms they understood. Technical leads joined the marketing team on customer visits, helping to translate between the customers’ technical staff and the marketing team. The marketing team conducted focus group meetings, and some of the technical leads joined in the back room to help frame follow-up questions to the focus groups and to help interpret the results.

Outcome. The collaboration helped both teams. The marketing team got the technical support and education they needed. The engineering team got a proper understanding of what customers needed, so that the system was aimed at actual customer needs.

Principles. This example is related to the following principles:

4.3 Missing implicit requirements

13 April 2024

The project. This occurred at the startup I worked at that was building a scalable storage system.

The problem. The team had a focus on making the system highly available, to the point where we had an extensive infrastructure for monitoring input power to servers and providing backup power to each server. If the server room lost mains power, our servers would continue on for several minutes so that any data could be saved and the system would be ready for a clean restart when power came back on. We did a good job meeting that objective.

What we forgot is that people sometimes want to turn a system off. Sometimes there is an emergency, like a fire in a server room, and people want the system powered off right away. Sometimes preventing the destruction of the equipment is more important than losing a few minutes’ worth of data. We had no power switches in the system and no way to quickly power it down.

Outcome. In practice this wasn’t too serious a problem because emergencies don’t happen often, but it meant that the system couldn’t pass certain safety certifications.

Solutions. We made two mistakes that led to the problem.

The first mistake was that everyone on the team saw high availability as a key differentiator for the product, and so everyone put effort into it. This created a blind spot in how we thought about necessary features.

The second mistake was that we did not work through all of the use cases for the system, and so missed implicit features, including powering the system off. Building up a thorough list of use cases can serve as a way to catch blind spots like this, but the team did not build such a list.
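
As an illustration of the kind of checklist I mean, the sketch below shows a use-case list checked for coverage against requirements. Everything in it is invented for this example (the use-case wording, the requirement IDs, and the choice of Python); the real product had no such artifact. A check like this would have flagged the missing “power the system off” case.

```python
# Hypothetical use-case coverage check; the use cases and requirement IDs are invented.
use_cases = [
    "serve client reads and writes",
    "ride through a mains power outage",
    "restart cleanly after power returns",
    "power the system off quickly in an emergency",  # the blind spot in this story
]

# Which use cases each requirement claims to cover.
requirement_coverage = {
    "REQ-AVAIL-1": ["serve client reads and writes",
                    "ride through a mains power outage"],
    "REQ-AVAIL-2": ["restart cleanly after power returns"],
}

covered = {uc for claimed in requirement_coverage.values() for uc in claimed}
for uc in use_cases:
    if uc not in covered:
        print(f"No requirement covers use case: {uc!r}")
```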

Principles. This is related to one principle:

4.4 Building at a mismatch to purpose

15 April 2024

The project. I consulted on a project to build a technology demonstration of a constellation of LEO spacecraft for the US DOD. This constellation was to perform persistent, worldwide observations using a number of different sensors. It was expected to operate autonomously for extended periods, with users worldwide making requests for different kinds of operations. The constellation was expected to be extensible, with new kinds of software and spacecraft with new capabilities being added to the constellation over time.

One company organized the effort as the prime contractor. That company built a group of other companies of various sizes and capabilities as subcontractors. The team won a contract to develop the first parts of the system.

The problem. The constellation had to be able to autonomously schedule how its sensors would be used, and where major data processing activities would be done. For example, someone could send up a request for an image of a particular geographic region, to be taken as soon as possible. The spacecraft would then determine which available spacecraft would be passing over that region soon. Some of the applications required multiple spacecraft to cooperate: taking images from different angles at the same time, or persistently monitoring some region, handing off monitoring from one spacecraft to another over time, and performing real-time analysis on the images gathered on those spacecraft.

The prime contractor selected its team of other companies and wrote the contract proposal for the system before doing systems engineering work. This meant that neither a detailed concept for the system’s operation nor a high-level design had been done.

After the contract was awarded, the team had to rapidly produce a system design. This effort went poorly at first because the system’s concept had not been worked out, and different companies on the team had different understandings of how the system would be designed. The team had to deliver an initial system concept of operations and requirements quickly; the requirements were developed by asking someone associated with each expected subsystem to write some requirements. Needless to say, the concept, high-level design, and requirements were all internally inconsistent.

After the team brought me on to help sort out part of the design problems, we began to do a top-down system design and establish real specifications for the components of the system. We were able to begin to work out general requirements for the autonomous scheduling components.

The project team had determined that they needed to use off-the-shelf software components as much as possible, because the project had a short deadline. One of the subcontractor companies was invited onto the team because they had been developing an autonomous spacecraft scheduling software product, and so the contract proposal was written to use that product.

However, as we began to work out the actual requirements for scheduling, it became apparent that the off-the-shelf scheduling product did not match the project’s requirements. The requirements indicated, for example, that the system needed to be able to schedule multiple spacecraft jointly; the product only handled scheduling each spacecraft independently. The system also had requirements for extensibility, adding new kinds of sensors, new kinds of observations, and new kinds of data processing over time. This suggested that strong modularity was needed to make extensibility safe, but the off-the-shelf product was not at all modular.

Outcome. The mismatch between the decision to use the off-the-shelf scheduling product and the system’s requirements led to both technical and contractual problems.

The technical problem was that the scheduling product could not be modified to work differently and thus meet the system requirements. The project did not have the budget, people, or time to do detailed design of a new scheduling package that would meet the need.

The contractual problem was that the subcontractor had joined the project specifically because they saw a market for their product and were expecting to use the mission to get flight heritage for it. When it became clear that their product did not do what the system needed, they discussed withdrawing from the project.

In the end, the customer decided not to continue the contract and the project was shut down.

Solutions. This project made three mistakes that, had they been avoided, could have changed the project’s outcome.

First, the team did not do the work of early stage systems engineering to work out a viable concept and high-level design before committing to partners and contracts. This would have made it clear what was needed of different system components. It would also have provided a sounder basis for the timelines and costs in their contract proposal.

Second, the team made design and implementation choices for some system components without understanding the purpose that those components needed to fill.

Finally, the team made commitments to using off-the-shelf designs without determining whether those designs would work for the system.

Principles. The solutions above are related to the following principles:

4.5 The persistence of team habits

6 May 2024

The project. I consulted for a company that was working to build an autonomous driving system that could be retrofitted into certain existing road vehicles.

The company had started with veterans from a few other autonomous driving companies. They began their work by prototyping key parts of a self-driving system, to prove that they had a viable approach to solving what they saw as the key problems. This resulted in a vehicle that could perform some basic driving operations, though it was always tested with a safety driver on board.

The team focused only on what they saw as the most important problems in an autonomous driving system. They believed that it was important to demonstrate a few basic self-driving functions as rapidly as possible—in part because they believed that this would help them get funding, and in part because they believed that this would help them forge partnerships with other companies. They focused on a simplified set of capabilities, including sensing, guidance, and actuation mechanisms for driving on a road.

The problem. This focus meant that the team developed a culture, along with a few partially documented processes, oriented toward building a prototype-style product, even as they began to fit their system into multiple vehicles and test them on the road (with safety drivers). When they found a usage situation in their testing that their driving system did not handle as they felt it should, they added features for that situation to the sensing and guidance components and to the simulation tests they used on those components. In other words, the engineering work was driven largely reactively.

The team did not spend effort on analyzing whether the new features would interact correctly with existing features, relying on simulation testing to catch regressions. They did not develop a plan for features that they would need, and for how they would integrate other systems with the core functions they had already prototyped.

Some of the team members had some awareness that they needed to improve the safety of the driving system and the rigor with which the team designed and built the system. These team members, some of whom were individual engineers and some who were leaders, tried from time to time to define some basic individual processes—like defining requirements before design, or conducting design reviews. Their goal was always to move the team incrementally toward sound engineering practice.

None of these attempts worked: each time, a few people would try a new procedure, task, or tool, but a critical mass of the team would keep working the way they had been in order to keep adding new features in response to immediate needs.

Outcome. After nearly two years, the team had not changed its practices and continued to work as if they were building a prototype. The team in general did not define or work to requirements; they did not analyze the systems implications of potential new features before implementing them. The team was making little progress on developing a safety case for the system.

Solutions. The fundamental problem was a misalignment between the incentives that drove the team in the short term and long-term practices needed to build a safe and reliable system.

The team as a whole, from the leadership down, developed habits focused on developing a proof of concept that would let the company get additional funding, as well as attract good staff and help the company build partnerships. This was the right choice for the company in its early days, because a company that cannot get funding does not get to move on to the long term. This short-term focus drove the habits and culture of the early company.

Later, as the company got funding and built up a team to build the system, they needed to change their practices. Changing a team’s culture and habits is hard: the existing practices had, after all, worked for the team initially. The team’s habit of focusing on short-term results, in particular, defined how they organized all their work.

To become a company building a product that is viable in the long term, a team like this must make a deliberate change to its culture, habits, and practices. A disruptive change like this does not happen spontaneously: a team’s culture defines the stable environment in which people can do what they understand to be good work. This creates a disincentive to make a change that disrupts how everyone works together.

Deliberate and pervasive changes come from the team’s leadership. The leadership must first recognize that a change is needed and work out a plan for what to change, how quickly, and in what way. The leadership then have to explain the changes needed, create incentives that will overcome the disincentives to change, and hold people on the team accountable for making the changes.

Principles. This case reflects some more of the principles outlined in Chapter 8.

4.6 Heavyweight, understaffed processes

24 April 2024

The project. A colleague was an engineer working on an electronics-related subsystem at a large New Space company that was building a new launch vehicle.

The team in question was responsible for designing one of the avionics-related subsystems and acquiring or building the components. This required finding suppliers for some components and ordering the necessary parts.

The problem. The company had processes in place for both vendor qualification and parts ordering. They included centralized software tools to organize the workflow.

The vendor qualification process began with submitting a request into the tools. The request was then reviewed by a supplier management team; once they approved a supplier, the avionics team could start placing purchase requests to buy parts. The purchase request would similarly be routed to an acquisition team that would make the actual purchase from the supplier.

The intents of this process were, first, to take the work of qualifying potential vendors and managing purchases off the engineering team, and second, to ensure that the vendors were actually qualified and that parts orders were done correctly.

From the point of view of the engineers building the avionics, the processes were opaque and slow. They would put in a request, and not know if they had done so properly. Responses took a long time to come back. At one point, my colleague reverse engineered the vendor qualification process in order to figure out how to use it; the result was a revelation to other engineers.

It also appeared that the positions responsible for processing these requests were understaffed for the workload. In practice, the people in those positions rarely had the time to do proper reviews of the vendors.

Outcome. Having supply chain processes was a good thing: if they worked, they increased the likelihood that the acquired parts would meet performance and reliability requirements, that the vendors would deliver on schedule and cost, and that the cost of acquiring parts remained within budget.

However, getting vendors qualified to supply components and then getting the components took a long time, delaying the system’s implementation and then delaying testing and integration.

The suppliers and the parts did not get the intended scrutiny, which may have let problem suppliers or parts through.

The company acquired a reputation with its employees of being slow and difficult to work for.

Solutions. There are four things that could have been done to make these processes work as intended.

First, the processes should be documented in a way that everyone involved knows how the process works. In this situation, it seems that people playing different parts in the process knew something about their part, but they did not understand the whole process; if there was documentation, the people involved did not find it. The process documentation should inform all the people involved what all of the steps are, so they understand the work. It should make clear the intent of the process. It should also make clear what would make a request successful or not.

Second, the processes should be evaluated to ensure that every step adds value to the project, compared to not doing that step or doing the process another way.

Third, the supporting roles—in this case, those tasked with reviewing and approving requests—should be staffed at a level that allows them to meet demand.

Finally, the project should regularly check whether its processes are working well, and work out how to adjust when they are not working.

Principles. The following principles apply:

4.7 Planning the transcontinental railroad

24 April 2024

The project. The first transcontinental railroad to cross North America was built between 1862 and 1869 [Bain99]. It involved two companies building the first rail route across the Rocky Mountains and the Sierra Nevada, one starting in the west and the other in the east. It was built with US government assistance in the form of land grants and bonds; the government set technical and performance standards that had to be met in order to get tranches of the assistance. The technical requirements included worst-case allowable grades and curvature. The performance requirements included establishing regular freight and passenger service to certain locations by given dates.

The problem. The companies building the railroad had limited capital available to build the system. They had enough to get started, but continuing to build depended on receiving government assistance and selling stock. Government assistance came once a new section of continuous railroad was accepted and placed into service. In addition, the two companies were in competition to build as much of the line as possible, since the amount of later income depended on how much each built.

This situation meant that the companies had to begin building their line before they could survey (that is, design) the entire route. They operated at some risk that they would build along a route that would lead to someplace in the mountains where the route was uneconomical—perhaps because of slopes, or necessary tunneling, or expensive bridges.

Because the building began before the route was finalized, the companies could not estimate the time and resources needed for construction beyond some rough guesses. The companies worked out a general bound on cost per mile before the work started, and government compensation was based on that bound. In practice the estimate was extravagantly generous for some parts of the work.

Solutions. The initial design risk was limited because there were known wagon routes. People had been traveling across the Great Plains and the mountains in wagons for several years. While the final route did not exactly follow the wagon routes, the early explorations ensured that there was some feasible route possible.

The companies built their lines in four phases: scouting, surveying, grading, and track-laying. (In some cases they built the minimal acceptable line with the expectation that the tracks would be upgraded in the future once there was steady income.) Scouting defined the general route, looking for ways around bottlenecks like canyons, rivers, or places where bridges or tunnels would be needed. Surveying then defined the specific route, putting stakes in the ground. The surveyed route was checked to ensure it met quality metrics, such as grade and curvature limits. After that, grading crews leveled the ground, dug cuts through hills, and tunneled where necessary. Finally, track-laying crews built bridges and culverts where needed, then laid down ballast, ties, and rail. After these phases, a section of track was ready for initial use.

Scouting ran far ahead of the other phases, sometimes up to a year ahead. Survey crews kept weeks or months ahead of grading crews. The grading and track-laying crews proceeded as fast as they could. All this work was subject to the weather: in many areas, work could not proceed during winter snows.

Outcome. The transcontinental railroad was successfully built, which opened up the first direct rail links from one coast of North America to the other. The early risk reduction—through knowledge of wagon routes—accurately showed that the project was feasible.

The companies were able to open up new sections of the line quickly enough to keep the construction funded. The companies received bonds and land grants quickly enough, and revenue began to arrive.

The approach of scouting and surveying worked. The scouting crews investigated several possible routes and found an acceptable one. While there were instances of tentatively selecting one route then changing for another—sometimes for internal political reasons rather than technical or economic reasons—no section of the route was changed after grading started. In later decades other routes were built, generally using tunneling technology that was not available for the first line. Many parts of the original line are still in regular use.

Principles. The transcontinental railroad project was an example of planning a project at multiple horizons, where the work of implementation begins before the design is complete, and where the plan and the design are continuously refined.

Part II: Systems background

Foundational definitions used throughout the rest of this book, including:

Chapter 5: What making systems is

9 May 2024

This book is about the work involved in making a system—what a system is, and how to do a good job making one.

Part I presented a set of case studies that showed how system-building projects can go well—or not. This leads to two questions: How does one build a system well? And how does one avoid the problems?

To start finding answers to these questions, consider three aspects of making a system: what a system is; the activities involved in making it; and the people who do the activities that make the system.

A system. A system is “a regularly interacting or interdependent group of items forming a unified whole”.[1] Other definitions speak to a set of items or components that work together to fulfill a purpose.

This definition includes some of the key aspects of a system.

For artificially built systems, the system is the outcome of all the work that people do to make the system.

Having a purpose distinguishes a system built by people from systems in nature. A natural system often just exists, and any meaning or purpose to it is assigned after the fact by people. A human-built system, on the other hand, does something for someone. The purpose of a human-built system can be described in terms of what the system does for someone, and why it is worth the effort to make a system do that.

Most systems are not static: they will evolve rapidly as they proceed from concept through design; once they are in operation, they will continue to evolve as their users’ needs evolve.

The next chapter, Chapter 6, discusses more about what a system is.

Making a system. The work of making a system can be seen as a string of activities, the life cycle of the system. It begins with an idea. That idea might be a user’s need, or it might be an idea for a new way to do something that might fill an as-yet-unidentified user’s need. The work proceeds to translate that idea into designs and then into a working system. This work goes through a number of steps, such as developing a concept, specifying its pieces, designing and implementing them, integrating the parts and verifying the assembly. Once a system has been built, it can be placed into operation. A system that has been in operation may change over time: users’ needs change, or technology changes. Eventually, every system is retired and disposed of.

All these activities are done by a team of people who are building the system, and the point of spending the effort is for the system, once built and in operation, to fulfill its purpose.

Chapter 7 discusses more about how to make a system.

Who does the work. A team of people working together does all the activities involved in making the system. For complex systems, the team can get large and may involve people at different companies and with different skills.

The team of people is itself a system: a set of people whose purpose is to build the objective system, and who interact with each other through discussion, documentation, and artifacts. A team that is functioning well is able to focus its efforts on the purpose of the system it is building. The team is organized so that its members each have the information they need to do their part, and so that they communicate enough that the pieces of the system they create work together.

Key roles. A team that functions well, like any human-built system, does not happen by accident; it happens because someone takes the effort to design and implement it so that it works well.

In practice, there are three roles that do this work of organizing and running the team. These roles may be divided among team members in many different ways, but every team building a complex system needs the three roles filled somehow. The roles are systems engineering, project management, and project leadership.

The intersections. Having teased apart the ideas of system, system-building, and people, and the ideas of systems engineering, project management, and project leadership, the next step is to acknowledge that none of these things are in fact separate.

The objective of a project is to produce a system. The way to produce it is to do system-building work. The people in a team do that work. All three must fit together: the way that the work gets done determines whether the resulting system meets its purpose. How the team is organized, its culture and habits, govern how the people will do the work.

While systems engineering, project management, and project leadership are different roles and involve different skills, they work together. Leadership by itself gets nothing done; that comes from engineering and management. Leadership and management without systems engineering might produce a system but it probably won’t work. Leadership and engineering without management usually means a lot of engineers running around doing cool things but also wasting time and resources and not actually getting things done. Management and engineering without leadership isn’t able to make decisions or take responsibility.

The people filling each of these three roles also need to understand their counterparts’ roles. A systems engineer who designs something that would require more time or resources than the project has is not going to be effective. A project leader who does not understand the work the team does is not going to model good work practices. A project manager who does not understand the engineering is not going to build a plan or schedule that makes sense.

Systems work, in the end, is about making a coherent whole out of the parts at hand. The work of making a system is just as much a systems effort as the product itself. Only when the parts fit together does the work get done as it should.

Chapter 6: Elements of systems

21 May 2024

6.1 Introduction

Working with systems is about working with the whole of a thing. It is a bit ironic that to make the whole accessible to rational design, we need to talk about the parts that make up systems work.

That is one of the first points about systems. Most systems are too complex for a human mind to remember and understand as a whole at one time. To work on these systems, we must find ways to abstract and to subset the problem. This book discusses some of the techniques for slicing a system into understandable parts, along with ways to use those techniques and why to use them. In the end, however, everything in here deals with carefully-chosen subsets of a system.

This chapter covers some of the essential concepts and building blocks that are the foundation for the techniques discussed in the rest of this book.

The subjects for systems work can be divided into five groups: the system’s purpose, its boundary, its parts and the views people take of them, its structure and the properties that emerge from that structure, and the evidence that the system meets its purpose.

The first four subjects are connected by a reductive approach to explaining complex systems, in which the high-level purpose is explained by reducing it to simpler constituent parts and structure, and conversely expressing the purpose as emergent from these simpler parts. The final subject is about ensuring that the system does what it is supposed to do (and only that).

6.2 System purpose

Every system that is designed and built has a purpose. That is, someone has an expectation of the benefits that will come from building the system, and they believe that those benefits will outweigh the costs (in resources, time, or opportunities) that will be incurred building the system.

Every system must be designed and built to address its purpose, and no other purposes, at the lowest cost practically achievable. This point may seem uncontroversial on its surface, but I have observed that the majority of projects fail to work to this standard, and incur unnecessary costs, schedule slips, or missed customer opportunities. Every design choice must be weighed according to how well each option helps satisfy the purpose; if an option does not help, it should not be chosen.

Making design decisions guided by a system’s purpose means that the team must understand what that purpose is. The purpose must be recorded in a way that all the team members can learn about it. It also needs to be accurate: based on the best information available about what the system’s users need, and as complete as can be achieved at the time. The record of the purpose should avoid leaving important parts implicit, expecting that people will know that systems of a particular kind should (for example) meet certain safety or profitability objectives; people who specialize in one area will know some of these implicit needs but not others. The purpose documentation should also include secondary objectives, such as meeting regulatory requirements or leaving space in the design for anticipated market changes.

The understanding of a system’s purpose and costs will shift over time, both as the world changes and as people learn more accurately what the value or cost will be. When the idea for the system is first conceived, the purpose may be accurate for that time but the understanding of the cost is likely to be rough. As design and development progress, the understanding of cost improves, but the needs may change or a customer may realize they misunderstood some part of the value proposition.

A system’s purpose also changes over longer periods of time. People add new features to an existing product to expand the market segment to which it applies or to help it compete against similar products. The technology available for implementing a system changes, creating opportunities for a faster, cheaper, or otherwise better system.

Systems leadership has to balance the need for a clear and complete statement of a system’s purpose with the fact that the understanding of purpose will change over time. The agile [Agile] and spiral [Spiral] management methodologies arose from this need for balance between opposing needs. Later chapters address how systems engineering methodologies can help address this need.

Working in a way that is driven by system purpose requires discipline in the team and its leadership. Many junior- and mid-level engineers are excited about their specialist discipline, and want to get to designing and building as quickly as possible—after all, those are the activities they find fulfilling. I have observed team after team proceed to start building parts of a system that they are sure will be the right thing, without spending the effort to determine whether those parts are actually the right ones. Those design decisions may end up being correct many times, which leads to a false confidence in decisions taken this way (“I’m experienced; I’m almost always right!”). The flaw is that the wrong decisions can have a high cost, high enough to outweigh any benefit from the rapid, unstudied decision.

I have seen many teams say—rightly—that they need to make some design decision quickly, see whether it works, and then adjust the design based on what they learn. This line of reasoning is both a good idea and dangerous. If a team actually does the later steps of evaluating, learning from, and changing the design, then this approach can result in good system design. (This is discussed further in later chapters on prototyping.) However, most teams lack the leadership discipline to follow through on this plan: once there is some design in place, pressures to keep moving forward drive teams to live with the bad initial design and accept complexity and errors. It requires discipline and commitment from the highest levels of an organization to take the time needed to learn from an early design and change what they are doing. The leadership must be prepared to push back against pressures to just live with a poor design, to require their team to take the time to learn and adjust, and to be clear with external parties, such as investors, that this plan is a necessary and positive way to realize a good product.

In a later chapter, we discuss techniques that can help to keep a system’s development grounded in its purpose, while adapting to changes in purpose and learning about the system’s design choices over time.

6.3 System boundary

A system has a boundary that defines what is within the system and what is not. What the system does (its functions) and what it uses to do them (its components) are within the system.

The rest of the world is outside the system. The outside world includes the system’s environment: the part of the world with which the system interacts.

The boundary defines the interface between the system and its environment.

What is inside the system and where the boundary lies are within the control of the project building the system. The project must adapt its work to everything else outside the system boundary.

6.4 System parts and views

Systems are designed and built by people. The methods used to build them must account for two human issues. First, most systems today are too complex for one person to keep in mind all the parts at one time, leading to a need to work with subsets of the system at any given time. Second, most systems also require multiple people to design or build, either because of specialties or the total amount of work involved. This leads to the need to break the work up into parts for different people to work on.

There are two techniques used to address this need. First, systems are divided into component parts, typically in a hierarchical relationship: the system is divided into subsystems, which are in turn subdivided, until they reach component parts that are simple enough not to require further subdivision. Second, people approach the system through narrow views, each of which covers one aspect of the system but across multiple component parts—such as an electrical power view, an aerodynamics view, or a data communications view.

Dividing the system into component parts creates pieces that are small enough to reason about or work on in themselves. The description of the part must include its interfaces to other parts, so that the design or implementation can account for how it must behave in relation to other parts. However, the interface definitions abstract away the details of other parts, so that the person can concentrate their attention on just the one part.

Dividing up the system also allows different people to work on different parts, as long as both parts honor the interfaces between them. The division into parts, and the definition of interfaces, create divisions of responsibility and scope for communication for the different people. This is addressed further in the Teams section (Section 7.3.3).

The hierarchical breakdown of the system into components and subcomponents provides a way to identify all of the parts that make up the system, ensuring that all can be enumerated. It also defines a boundary to the system: the system is made up of the named parts, and no others.

Reasoning about views of a system provides a similar and complementary way of managing the complexity of reasoning about a system by focusing on one aspect across multiple parts, and abstracting away the other aspects. This allows different people to address different aspects, as long as the aspects do not interact too much. For example, specialist knowledge, such as about electrical system design, can be brought to bear without the same person needing to understand the aerodynamics of the aircraft in which the electronics will operate.
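
To make these two techniques concrete, here is a small sketch in Python (the component names and attributes are invented for illustration; this is not a recommended representation): a hierarchical decomposition into parts, and a view that cuts across the hierarchy by selecting a single aspect of every component.

    # Minimal sketch (invented component names): a hierarchical decomposition
    # into parts, and a view that selects one aspect (here, electrical power
    # draw) from every leaf component, across the whole hierarchy.

    system = {
        "aircraft": {
            "avionics":  {"flight computer": {"power_w": 120, "mass_kg": 2.1},
                          "radio":           {"power_w": 45,  "mass_kg": 0.8}},
            "actuation": {"aileron servo":   {"power_w": 60,  "mass_kg": 1.5}},
        }
    }

    def leaves(node, path=()):
        """Yield (path, attributes) for every leaf component in the hierarchy."""
        for name, child in node.items():
            if child and all(isinstance(v, dict) for v in child.values()):
                yield from leaves(child, path + (name,))
            else:
                yield path + (name,), child

    # An "electrical power" view: one aspect of the design, across all parts.
    power_view = {"/".join(path): attrs["power_w"] for path, attrs in leaves(system)}
    print(power_view)
    # {'aircraft/avionics/flight computer': 120, 'aircraft/avionics/radio': 45,
    #  'aircraft/actuation/aileron servo': 60}

The same hierarchy could yield a mass view or a data-communications view by selecting a different aspect, which is the essence of working on one aspect at a time.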

Sidebar: Non-reductive systems

This approach of defining a system in reductive terms—using parts and structure—is not a formal necessity of systems in general. Rather, this approach is used as a way for ordinary people to define, build, and check systems.

There are numerous examples of non-human processes that have developed complex systems that are not easily explained reductively. Many of these were developed using evolutionary methods, both biological and electronic. Others arise from other optimization and machine learning techniques. These generative design tools have been demonstrated in mechanical and electronic design.

Consider the circuit discussed by Thompson and Layzell [Thompson99]. This circuit was developed by evolving a design on an FPGA, so that the result would distinguish between inputs at two different frequencies. The resulting circuit design achieves its objective, but is not readily understandable by decomposing the design into individual elements on the FPGA—indeed, some cells that did not appear to be used directly were nonetheless essential to the circuit’s function. Further, the circuit only worked well on the specific FPGA chip on which it was evolved; when moved to another FPGA of the same model, it was reported to work poorly.

While these designs are not readily understood by decomposition, they still must be verified for conformance with their purpose. This starts with a clear definition of purpose, from which the fitness or objective function used in optimization can be derived. For critical systems or components, the objective function must not only specify what the desired behaviors are, but also the undesired behaviors and the behaviors when the system is outside its intended performance environment. In some methods, the objective “function” can be an adversarial neural network that must itself be trained based on the system’s purpose. The result of the generative or optimization method must also be verified against the purpose to check that the result is in fact correct—which can catch errors in building the objective function, or subtle dependencies on environment.
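
As an illustration of deriving an objective function from purpose, here is a sketch in Python (the case structures and weights are hypothetical choices of my own, not part of any particular tool): it rewards the desired behavior while explicitly penalizing undesired behaviors and poor behavior outside the intended operating environment.

    # Minimal sketch (hypothetical cases and weights): a fitness function for a
    # generative or evolutionary design method that scores desired behavior and
    # explicitly penalizes undesired and out-of-envelope behavior.

    def fitness(candidate, evaluate, nominal_cases, hazard_cases, off_nominal_cases):
        """Higher is better. `evaluate(candidate, case)` returns the candidate's
        observed output for one case; each case carries its expected output."""
        score = 0.0

        # Reward the desired behavior on nominal inputs.
        for case in nominal_cases:
            if evaluate(candidate, case) == case["expected"]:
                score += 1.0

        # Penalize undesired behaviors heavily, e.g. outputs that would be unsafe.
        for case in hazard_cases:
            if evaluate(candidate, case) in case["forbidden_outputs"]:
                score -= 10.0

        # Penalize poor behavior outside the intended operating environment
        # (temperature, supply voltage, a different chip, and so on).
        for case in off_nominal_cases:
            if evaluate(candidate, case) != case["expected"]:
                score -= 2.0

        return score

Even with such an objective function, the final candidate still needs to be verified directly against the purpose, as noted above, rather than trusted because it scored well during optimization.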

6.5 Structure and emergence

Decomposing a system into component parts is one part of the system’s design; the other part is how those components relate to each other. The relations between parts define the structure of the system. These relations include all the ways that components can interact with each other, at different levels of abstraction. At low levels, this might be interatomic forces at the molecular level; at medium levels, mechanical, RF, force, or energy transfers; at higher levels, information exchange, redundancy, or control.

The structure needs to lead to the system’s desired aggregate properties, such as performance, safety, reliability, or specific system functions like moving along the desired path or providing reliable electrical service.

The aggregate properties are emergent, and arise from the way the structure combines the properties of individual components.[1] The structure must be designed so that the system has the desired emergent properties and avoids undesired ones. For example, a simple redundant system has a reliability property that arises from combining two or more components that can perform the same function with a pattern of interaction: each component receives the same inputs and generates consistent outputs, the results are combined in a defined way, and each component responds to failure in a defined way.
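
To make the redundancy example concrete, here is a minimal sketch in Python (the interfaces are hypothetical, and real redundancy schemes involve much more, such as input agreement and fault detection): the reliability property arises from the structure of replication plus voting, not from any single component.

    # Minimal sketch (hypothetical interfaces): three replicas receive the same
    # input and a voter combines their outputs, masking one faulty or failed
    # replica -- reliability as an emergent property of the structure.

    from collections import Counter

    def vote(replicas, sensor_input):
        """Return the majority output of the replicas, or raise if there is none."""
        outputs = []
        for replica in replicas:
            try:
                outputs.append(replica(sensor_input))  # every replica sees the same input
            except Exception:
                continue                               # a crashed replica casts no vote
        if not outputs:
            raise RuntimeError("all replicas failed")
        value, count = Counter(outputs).most_common(1)[0]
        if count >= 2:                                 # two-out-of-three agreement
            return value
        raise RuntimeError("no majority: replicas disagree")

    # Usage: the same function computed three ways; the third replica is faulty.
    replicas = [lambda x: x * 2, lambda x: x * 2, lambda x: x * 2 + 1]
    print(vote(replicas, 21))   # -> 42, despite the faulty replica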

The structure must be designed to avoid unanticipated emergent properties, especially when those properties are undesirable. In a safe or secure system, for example, it is necessary to show that the system cannot be pushed into some state where it will perform an unsafe action or provide access to someone unauthorized. Avoiding unanticipated emergent properties is one of the hardest parts of correctly designing a complex system.

The structure must be well-designed for the system to meet its purpose, and for people to be able to understand, build, and modify it. In particular the structure needs to be:

There are good engineering practices that should be followed to achieve these aims, as we discuss in later chapters.

Finally, the structure determines the interfaces that each component part must meet. Those interfaces in turn determine a component’s functions and capabilities, which guide the people working on the component, as discussed in the previous section.

6.6 Evidence

It is not enough to design and build the system; the team must also show that the system meets its purpose.

The team developing or maintaining the system must be able to show that the system complies with its purpose to customers, who need to know that the system will do what they expect; to investors, who need evidence that their investment is being used to create what they agreed to fund; and to regulators, especially for safety- or security-critical systems, who are charged with ensuring that systems function within the law.

The team also needs to ensure that pieces of the system meet the system’s purpose as they are developing or modifying those pieces. They must be able to judge alternative designs against how well they meet the purpose, and once built they must be able to check that the result conforms to purpose.

The process of showing that a system or a component part fulfills its purpose involves gathering evidence for and against that proposition, and combining the evidence in an argument to reach an overall conclusion about compliance. There are many kinds of evidence that can be gathered: results of testing, results of analysis, expert review, or demonstrations of the system. These individual elements of evidence are then combined to show the conclusion. The combination usually takes the form of an argument: a tree of logical propositions starting with the purpose and decomposing hierarchically into many lower-level propositions that can be evaluated using evidence. The process must show that the structure of the argument is both correct and complete in order to justify the final conclusion.
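
As a small illustration of such an argument, here is a sketch in Python (the class and field names are illustrative assumptions, not a standard notation such as GSN): propositions form a tree whose leaves are supported directly by evidence, and the top-level claim holds only if every branch is supported.

    # Minimal sketch (illustrative names): an argument as a tree of propositions.
    # Leaf propositions are supported directly by evidence items; an intermediate
    # proposition is supported only if all of its sub-propositions are.

    from dataclasses import dataclass, field

    @dataclass
    class Evidence:
        description: str        # e.g. "test report", "timing analysis memo"
        passed: bool

    @dataclass
    class Proposition:
        claim: str
        evidence: list = field(default_factory=list)   # Evidence items (leaves)
        children: list = field(default_factory=list)   # sub-propositions

        def supported(self) -> bool:
            if not self.evidence and not self.children:
                return False                           # an unsupported claim
            return (all(e.passed for e in self.evidence)
                    and all(c.supported() for c in self.children))

    # Usage: a tiny argument that the system meets its purpose.
    argument = Proposition(
        claim="The system meets its purpose",
        children=[
            Proposition("Implementation complies with design",
                        evidence=[Evidence("integration test results", True)]),
            Proposition("Design complies with specification",
                        evidence=[Evidence("design review record", True)]),
            Proposition("Implementation validated against original purpose",
                        evidence=[Evidence("end-to-end validation report", True)]),
        ],
    )
    print(argument.supported())   # True only if every branch is supported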

Pragmatically, arguments about meeting purpose usually follow a common pattern, as shown below. The primary argument that the implementation meets the purpose consists of a chain of verification steps. The implementation complies with a design, which complies with a specification, which complies with an abstract specification, which complies with the original purpose. As long as each step is correct, then the end result should meet the original purpose—but at each step there is the possibility of misinterpretation or missing properties, or that the verification evidence at each step is not as complete as believed. In practice this approach leaves plenty of uncaught errors in the final implementation. To catch some of these errors in the chain of verification steps, common practice is to perform an independent validation, in which the final implementation is checked directly against the original purpose.

[Figure: the chain of verification steps from purpose through specification and design to implementation, with an independent validation of the implementation against the original purpose]

Some industries, particularly those dealing with safety-critical automotive and aerospace systems, add an additional kind of evidence-based correctness argument. This is often called the safety case or security case, and consists of an explicit set of propositions, starting with the top-level proposition “the system is adequately safe” (or secure) and showing why that conclusion is justified using a large hierarchy of propositions. The lowest-level propositions in the hierarchy consist of concrete evidence; intermediate propositions combine them to show that more abstract safety or security properties hold.

Finally, evidence takes many forms, depending on what needs to be shown. Some correctness propositions can be supported by testing. These typically show positive properties: the system does X when Y condition holds. Some of these conditions are hard to test, and are better shown by analysis or human review of design or implementation. Negative conditions are harder to show: the system never does action X or never enters state Y, or does so at some very low rate. These require analytic evidence, and cannot in general be shown by testing.

We discuss matters of correctness, verification, validation, and the related arguments in later chapters.

6.7 Using this model

The model in this chapter provides a way to think and talk about systems work. As a team begins a systems-building project, it will be gathering information or making decisions that can be organized using this model. The model can help guide people as they work through some part of the system. For example, the system’s purpose is reflected in the emergent behavior of the system, which in turn depends on the structure of how components interact. When the system is believed to be complete, the team should be able to verify that all of the relations indicated by this model are defined and correct. Later, as the system needs to evolve and the team makes changes to the system, this model helps them reason about what is affected by some change.

This model of systems provides a foundation for organizing the work that needs to be done to build the system. The next chapter presents a model for this work of building a system or component. The information about one component is represented in a set of artifacts, and there are tasks that make those artifacts. The structure of the artifacts, and thus of the tasks, is based on the model of systems and components in this chapter.

Part III goes into greater detail about each part of this model.

Chapter 7: Elements of making a system

29 March 2024

7.1 Introduction

The previous chapter defined what a system is. In this chapter, I turn attention to how to make that system. “Making” includes the initial design and building of the system, as well as modifications after the initial version has been implemented.

Making the system is a human activity. Building a system correctly, so that it meets its purpose, requires a team of people to work together. Building systems of more than modest complexity will involve multiple people, usually including specialists who can work on one topic in depth and people who can manage the effort. It involves people with complementary skills, experiences, and perspectives. Such systems take time to build, and people will come and go on the team. Systems that have a long life that leads to upgrades or evolution will involve people making modifications who have no access to the people who started the work.

This chapter provides a model to organize and name the things involved in the making of a system—the activities, the actors, and what they work with. Later chapters provide details on each part of this model. This model includes both elements that are technical, such as the steps to design some component, and elements that are about managing the effort, such as organizing the team doing the work or planning the work. Note that this model does not attempt to cover all of managing a system project—there is much more to project management than what I cover here.

The model presented in this chapter only serves to name and organize. I do not recommend particular approaches for each element of the model here; I only describe attributes that good approaches should have. Later parts of this book address ways to achieve many of these things. For example, the team that is designing a system should have an organization (a desirable attribute), but I do not address which organizational structures one can choose from.

The assembly of all the parts involved in making a system is itself a system. In those terms, this chapter presents the purpose (Chapter 9) of the system-making system and a concept for how to organize the high-level components (Chapter 11) of that system.

7.2 Objective

This model of making captures the activities and elements involved in executing the project to make or update a system.

The approach used for making the system should:

7.3 Model

The making model has five main elements:

  1. Artifacts: the things created that make up the system and its records
  2. Tasks: the activities that are performed to make artifacts
  3. Team: the people who perform tasks
  4. Tools: things that the team uses in performing tasks
  5. Operations: how the team manages the work to be done
[Figure: the five elements of the making model: artifacts, tasks, team, tools, and operations]

7.3.1 Artifacts

The artifacts are the things that are created or maintained by the work to make the system.

The artifacts have three purposes. First, the artifacts include the system’s implementation—the things that will be released or manufactured and put in users’ hands. The artifacts should maintain the implementation accurately, and allow people to identify a consistent version of all the pieces for testing or release. Second, the artifacts are a communication channel among people in the team, both those in the team in the present and those who will work on the system later. These people need to understand both what the system is, in terms of its design and implementation, and why it is that way, in terms of purpose, concept, and rationales. Finally, the artifacts are a record that may be required for future customer acceptance, incident analysis, system certification, or legal proceedings. Those evaluating the system this way will need to understand the system’s design, the rationales for that design, and the results of verification.

The artifacts should be construed broadly. They include:

Artifacts other than the implementation are valuable for helping a team communicate. Accurate, written documentation of how parts of the system are expected to work together—their interfaces and the functions they expect of each other—is necessary for a team to divide work accurately.

Many engineers focus solely on the implementation artifacts, especially in startup organizations that are trying to move quickly, and do not produce documents recording purpose, design, or rationales. If the organization is successful and the system it is building enters service, at some point this other information will be required—as the team membership turns over, or as the complexity of the system grows, or as the team finds flaws that need to be corrected. The startups I have observed have all had to reconstruct such information after the fact; the reconstructed information is less accurate and costs more than it would have if it had been recorded from the beginning.

Finally, the artifacts should be under some kind of configuration management. Artifacts will evolve as work progresses. One artifact may be a work in progress, meaning others may want to review or comment but that they should not count on the artifact’s contents being stable. An implementation artifact may reflect some design artifact; when the design artifact is revised, people must be able to see that the implementation reflects an older version of the design. When the implementation artifacts are packaged up and released, the resulting product needs to have consistent versions of all the implementation parts.
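
The sketch below illustrates the configuration-management point in Python (the artifact names and the versioning scheme are hypothetical and deliberately simplified): a release baseline pins one revision of each artifact, and a simple check flags an implementation artifact that reflects an older revision of its design.

    # Minimal sketch (hypothetical artifact names, simplified versioning): each
    # artifact has a current revision; implementation artifacts record which
    # revision of their design they reflect, and a release baseline pins one
    # revision of every artifact.

    artifacts = {
        "control-design": {"revision": 4},
        "flight-sw":      {"revision": 14, "implements": ("control-design", 3)},
        "user-manual":    {"revision": 2},
    }

    release_baseline = {"control-design": 4, "flight-sw": 14, "user-manual": 2}

    def stale_implementations(artifacts):
        """Implementation artifacts whose referenced design has since been revised."""
        for name, info in artifacts.items():
            if "implements" in info:
                design, built_against = info["implements"]
                current = artifacts[design]["revision"]
                if current > built_against:
                    yield name, design, built_against, current

    def baseline_is_current(artifacts, baseline):
        """True if the baseline pins the current revision of every artifact."""
        return all(artifacts[name]["revision"] == rev for name, rev in baseline.items())

    for name, design, old, new in stale_implementations(artifacts):
        print(f"{name} reflects {design} rev {old}; the design is now at rev {new}")
    print("baseline current:", baseline_is_current(artifacts, release_baseline))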

7.3.2 Tasks

These are the individual activities that team members perform. The tasks use and generate artifacts. I rely on the colloquial definition of “task” and do not try to formalize the term here.

Systems projects usually have vast numbers of tasks. These include tasks for designing, building, and verifying the system; they also include tasks for managing the project, reviewing and analyzing parts of the system, and approving designs and implementations.

There are usually far more tasks to be worked on than people to do them. Tasks also usually have dependencies: something needs to be designed before it is implemented, or one part of the system should be designed before another.

Tasks, in themselves, need to be known and tracked. People on the team need to know what they can be working on, and who is doing other tasks that might relate to their work. Managers need to be able to track what is being done, what tasks are having problems, and ensure that tasks are coordinated and completed.

Operations, discussed below, addresses questions of what tasks are needed and which ones should be performed in what order.

7.3.3 Team

These are the people who do the tasks. They are not an amorphous group of indistinguishable and interchangeable parts; each person will have their own abilities and specialties. Each person will also have their own authority, scope, and responsibilities.

The team should be organized. This means:

In addition, the team needs to be staffed with enough of the right people to get work done. This means that people with management responsibility need to know who is on the team and their respective strengths, as well as the workload each one has and the overall plan for moving the project forward.

7.3.4 Tools

These are things that the team uses to get its tasks done. The tools are not part of the system being produced, though they are often systems in their own right. An end user of the system being produced will not use these tools, either directly or indirectly.

The tools include things like:

7.3.5 Operations

Operations is about organizing the work that the team does. Its primary function is to ensure that the right tasks are done by the right people at the right time.

Operations sets up “a set of norms and actions that are shared with everyone” in the project [Johnson22, Chapter 2]. It gives people in the team a shared set of rules and procedures for doing their work, and it uses those procedures to manage a plan and tasks that coordinate that work. When people share a set of rules and procedures, they can each have confidence in how others are working and in the results that others produce.

There are two primary objectives for operations: making sure the work proceeds efficiently, and ensuring product quality. Operations has secondary objectives, including keeping the organization informed of progress and needs.

Ensuring the project runs efficiently implies several things.

Ensuring quality means:

[Figure: objectives for operations]

I look at operations through the lens of the tasks that people on the team will do. Operations is about tracking what tasks need to be done, who is working on them, and how those tasks are going. Operations is, in a way, a feedback control system that keeps the flow of tasks running smoothly.

Operations is more than overseeing tasks, however. It is equally about guiding the team through its work, especially in how people should coordinate their efforts. This starts with setting out the guidelines for how work should get done: procedures and process. That leads to planning, which sets the longer-term direction for the project’s work and allows project management to check whether the work is proceeding well. Planning leads to managing the work being done at the moment. All of these depend on information that supports decisions that have to be made.

I use the following model to define the parts that make up operations. This model has a flow from a project life cycle, which is established early in the project and changes rarely, through parts that organize the work, onward to day-to-day tasking. I explain this model in more detail in Chapter 20.

Life cycle. This defines the overall patterns of actions that the team will perform as it does the project. It defines phases of work and how one phase should happen before another. A typical phase is made up of many tasks; it covers (for example) the work of designing some component. The life cycle also defines milestones, which provide planned times when checks on the work in a phase are done.

A life cycle pattern says things like: “First work out purpose, then specifications, then design, then implementation. At the end of each of these phases, have a review with one person designated to approve moving forward.”

[Figure: an example life cycle pattern, with phases and milestone reviews]

There are many different life cycle patterns, and usually an organization or a project will need to pick one—and then customize the life cycle to meet its specific needs. Sometimes the life cycle will be determined by external requirements; for example, NASA defines a common life cycle for all its projects [NPR7120].
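
As a small illustration in Python (the phase names, milestones, and approvers are entirely invented, not a prescribed pattern), a customized life cycle can be written down as an ordered set of phases, each closed by a milestone review with a designated approver.

    # Minimal sketch (invented phases and approvers): a life cycle pattern as an
    # ordered list of phases, each closed by a milestone review that a designated
    # person must approve before the next phase starts.

    life_cycle = [
        {"phase": "purpose",        "milestone": "purpose review",         "approver": "product lead"},
        {"phase": "specification",  "milestone": "specification review",   "approver": "chief engineer"},
        {"phase": "design",         "milestone": "design review",          "approver": "chief engineer"},
        {"phase": "implementation", "milestone": "verification readiness", "approver": "project manager"},
    ]

    def next_phase(approved_milestones):
        """Return the first phase whose closing milestone has not yet been approved."""
        for entry in life_cycle:
            if entry["milestone"] not in approved_milestones:
                return entry["phase"]
        return None   # all phases complete

    print(next_phase({"purpose review"}))   # -> "specification"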

Procedures. While the life cycle defines in general what to do, the procedures define how to do some tasks. They provide specific instructions for how to do particular actions or tasks. The instructions might take the form of a checklist, a flow chart, or a narrative.

People on the team need to know how to do things that require coordination. While team members should be able to do most of their work independently, at some point they will need to work together. The work will go more smoothly if everyone understands when they need to work together and how to do it.

There are also some tasks that are procedurally complex, even when only one person is involved. For these tasks it is helpful to have written down the steps to perform—which serve in effect as a checklist.

Procedures should be defined for tasks where getting the actions right is critical or where the task is complex. In the example below, checking a document artifact into a repository is simple, but needs to be done correctly. Performing a design review and approval has potentially many steps to go through: communicating the design to others for review, an approval decision by a designated team member, and changing the status of the design documents to show that the design has been released. When the life cycle defines a point in the project when something should be checked, such as during a review, procedures ensure that all the needed checks actually happen.

[Figure: example procedures: checking a document into a repository, and performing a design review and approval]
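
As an illustration in Python (the steps are assumed from the design review example above, not taken from any particular organization’s process), a written procedure can be as simple as an ordered checklist that makes it obvious which steps remain before the review can be closed.

    # Minimal sketch (assumed steps): a procedure as an ordered checklist, so a
    # review cannot be closed out while required steps remain open.

    design_review_procedure = [
        "circulate the design to reviewers",
        "collect and resolve review comments",
        "obtain approval from the designated approver",
        "mark the design documents as released in the repository",
    ]

    def remaining_steps(procedure, completed):
        """Return the steps that still have to be done, in order."""
        return [step for step in procedure if step not in completed]

    completed = {"circulate the design to reviewers",
                 "collect and resolve review comments"}
    for step in remaining_steps(design_review_procedure, completed):
        print("still to do:", step)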

Documented procedures help the team perform tasks accurately, helping to make sure that steps aren’t missed. They also help the team do those tasks in compatible ways so that one person’s work can build on another’s.

I have seen teams that try to operate without some ground rules for working together. This can work quite well for teams up to three or four people, and when the artifacts they produce do not need high assurance (that is, when what they produce is not safety- or security-critical). On larger teams that have not written down their basic process rules, I have always seen failures to communicate or consult. These failures sometimes led to errors in the system that had to be corrected later once found. Sometimes they led to one person damaging another person’s work, requiring time and effort to recreate overwritten designs.

Documented procedures also provide a way for the project to learn and improve. If some procedure is not working well, the team can identify which procedure is the problem and then change it. As long as team members then follow the revised procedure, the team’s ability to work should improve over time. Contrast this to not documenting a procedure: some people may have opinions on how to do it better, and they may start doing it the new way, but not everyone will know about the change, and people may forget it after a little while. This makes learning slower and less reliable.

Plan. The plan defines the overall intended path forward to a completed system, along with selected milestones along the way. It is a current best estimate of the general steps needed to move the project toward that goal.

A plan records the approach the team intends to take to build the system. It lays out the phases of work expected, in coarse to medium granularity. In doing so, it records decisions like the flow from specification to design to implementation to verification. It records when the team decides to investigate different ways to design some component, perhaps prototyping some of the ways. It documents expected dependencies and parallelism.

[Figure: an example plan, showing phases of work and the dependencies among them]

The plan is, therefore, a record of how parts of the life cycle pattern are applied to this specific project. Just as there are many patterns that a project can choose to use, there are many different ways to organize the project’s work. I discuss these choices in depth in Chapter 20.

A plan is not necessarily a schedule. A schedule is usually taken to mean a sequence of events with a high confidence of accuracy and completeness. A plan, on the other hand, reflects the uncertainties that come with developing a complex system. In the beginning, the plan can be specific about a few things in the near term but must be vague about the longer term until enough design has been completed to fill out later work. As a project progresses and more and more becomes known, the plan should converge to something like a schedule.

A plan is broader than a list of specific tasks. It consists of a number of work phases, and dependencies among them. This information then guides the specific tasks, as discussed in the section on tasking below.
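
The sketch below illustrates the distinction in Python (the phases and duration ranges are hypothetical): a plan records coarse phases, their dependencies, and deliberately wide duration estimates that narrow as the design firms up, rather than the fixed dates of a schedule.

    # Minimal sketch (hypothetical phases and estimates): a plan as coarse phases
    # with dependencies and duration *ranges*, not the fixed dates of a schedule.

    plan = {
        "concept":         {"depends_on": [],          "weeks": (3, 5)},
        "prototype pump":  {"depends_on": ["concept"], "weeks": (4, 10)},   # trade study still open
        "detailed design": {"depends_on": ["concept"], "weeks": (8, 20)},
        "integration":     {"depends_on": ["prototype pump", "detailed design"],
                            "weeks": (6, 16)},
    }

    def finish_range(plan, phase):
        """Crude roll-up of optimistic and pessimistic finish times, in weeks from start."""
        deps = plan[phase]["depends_on"]
        lo_start = max((finish_range(plan, d)[0] for d in deps), default=0)
        hi_start = max((finish_range(plan, d)[1] for d in deps), default=0)
        lo, hi = plan[phase]["weeks"]
        return lo_start + lo, hi_start + hi

    print(finish_range(plan, "integration"))   # -> (17, 41): a plan, not a schedule

As open questions are resolved, the ranges tighten and the plan converges toward something like a schedule.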

Plans are used in prospect, in the moment, and in retrospect. They should provide guidance on what direction the work will likely go in the future, even when that direction has uncertainty. They are used in the present to track what is happening now. They provide history of what has been done, to understand how the team’s work compares to predictions and to provide accountability for everyone responsible for working on the project.

I have never encountered a project that had a single plan for the whole duration of the work. Plans have always been dynamic. Early in the project, we knew that we needed to develop a concept for the system but did not yet know enough to sketch out the work involved in building that concept. Later we had a general structure for the system, but there were technical questions to resolve; once resolved, we would know what we were building. Later in the project, we would find defects or we would get a change order, resulting in unanticipated work.

Tasking. This is the day-to-day definition of tasks to be done, their assignment to team members to perform, and tracking their progress.

Tasking involves continuous decision-making: the choice of which tasks should be performed next, or which tasks should be interrupted to deal with higher-priority tasks. These choices merge several streams of potential tasks: ones that derive from the nearest parts in the plan; ones made newly urgent by a change in what is known about the system; ones about fixing errors that have been discovered; and tasks related to new outside requests.

[Figure: streams of potential tasks merging into tasking decisions]

The team will need to keep track of both the potential tasks and the ones that have been assigned and are being worked on. This implies record-keeping artifacts.

The criteria for deciding about tasks should be encoded in procedures, as discussed above. The procedure for choosing tasks can be viewed as a control system that responds to project events by adjusting the set of tasks assigned for work, with the aim of making the project’s execution run efficiently. “Efficiently” means meeting the goals set out above for operations: ensuring that the right work is done, that people aren’t blocked from getting work done, and that the work follows the ordering and dependencies needed for high-quality work.
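
Here is a minimal sketch of one such tasking decision in Python (the stream names and the priority rule are assumptions for illustration): candidate tasks from several streams are merged, and the most urgent unassigned tasks go to whoever is free.

    # Minimal sketch (illustrative streams and priority rule): one tasking decision
    # that merges streams of candidate tasks and assigns the highest-priority ones
    # to the people who are free.

    import heapq

    def next_assignments(streams, free_people):
        """streams: {stream name: [(priority, task), ...]}, lower number = more urgent.
        Returns (person, task) pairs for this tasking decision."""
        candidates = []
        for stream, tasks in streams.items():
            for priority, task in tasks:
                heapq.heappush(candidates, (priority, task, stream))
        assignments = []
        for person in free_people:
            if not candidates:
                break
            priority, task, stream = heapq.heappop(candidates)
            assignments.append((person, f"{task} [{stream}]"))
        return assignments

    streams = {
        "plan":          [(3, "draft pump interface specification")],
        "defects":       [(1, "fix overheating fault seen in bench test")],
        "change orders": [(2, "assess customer change request")],
    }
    print(next_assignments(streams, ["Ana", "Ben"]))
    # [('Ana', 'fix overheating fault seen in bench test [defects]'),
    #  ('Ben', 'assess customer change request [change orders]')]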

How the tasking control system works depends on the development methodology used in the project. Agile development, for example, often focuses on making tasking decisions at regular intervals (for each “sprint”); other methodologies focus on making tasking decisions continuously.

Support. The decisions made during operations take into account several kinds of supporting information. These include:

Sidebar: Resource-constrained projects

Traditional project planning approaches grew out of projects, such as building construction, that focus first on time and budget. This kind of project treats the completion date as the driving factor in organizing work, and assumes both that as many workers can generally be brought in as are needed to complete the work quickly and that parallelism between tasks is limited primarily by dependencies between tasks. For example, in building a house, one contractor typically brings in a team to frame the structure, while another brings in a team to add the electrical wiring or plumbing. Each of these teams can bring in as many people as needed to get the work done, and then those people go on to another construction project elsewhere when their part is done.

This model of project planning leads to tools organized around a graph of dependencies between tasks. These tools usually provide analyses like critical path analysis, which shows the longest path through the graph of tasks and therefore the hardest constraint on how quickly the work can be completed. Planning the project well often hinges on understanding the dependencies between tasks and the critical path through them.
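
For readers who have not used it, here is critical path analysis in miniature, in Python (the tasks and durations are made up): the critical path is the longest chain of dependent work, and it bounds how quickly the project can finish no matter how many workers are added.

    # Minimal sketch (made-up tasks and durations): the critical path through a
    # task dependency graph is the longest chain of dependent work.

    tasks = {
        "foundation": {"days": 10, "depends_on": []},
        "framing":    {"days": 15, "depends_on": ["foundation"]},
        "wiring":     {"days": 7,  "depends_on": ["framing"]},
        "plumbing":   {"days": 9,  "depends_on": ["framing"]},
        "drywall":    {"days": 5,  "depends_on": ["wiring", "plumbing"]},
    }

    def critical_path(tasks, task):
        """Return (total days, path) for the longest chain ending at `task`."""
        best_days, best_path = 0, []
        for dep in tasks[task]["depends_on"]:
            days, path = critical_path(tasks, dep)
            if days > best_days:
                best_days, best_path = days, path
        return best_days + tasks[task]["days"], best_path + [task]

    print(critical_path(tasks, "drywall"))
    # -> (39, ['foundation', 'framing', 'plumbing', 'drywall'])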

Most complex technical system projects, on the other hand, do not fit this model well. Each person working on the project needs to understand the context of their work, and there is usually a substantial cost to add someone to the project—largely in them learning about how the project works and how the system is organized. The collection of trained people on the team constitutes a valuable resource that the organization tries to keep around to maintain the system or to work on similar systems.

This situation leads to a different approach to planning work. While dependencies are certainly important, there are often many tasks that any one person can work on (and it is common to expect some degree of multitasking). In this case, getting the order of operations precisely right is not as important. It is more important to ensure that everyone can stay busy and that any major dependencies are accounted for.

7.4 Using this model

This chapter has presented a model for thinking about the work involved in making a system. This model, in itself, does not prescribe any particular way of managing building a system; it only names the topics that need to be addressed and provides some objectives by which an approach can be judged.

In Part IV, I go into more depth about each of the elements in this model.

Those who manage a project will need to decide how they will go about organizing their work. As I noted earlier, how a project is organized and run is itself a system, and the techniques discussed in this book apply as much to designing and operating the project’s operations as they do to designing, building, and operating the system product. Chapter 6 and Part III discuss the model for what a system is.

Part V discusses how the work of building a system can be organized around the life cycle of a project. Chapter 21 introduces the idea of a life cycle. It also introduces the idea that a life cycle model provides a basis for working out the tasks that need to be done to build the system. Subsequent chapters discuss each of the phases of a life cycle, along with the artifacts and activities that go into each one.

Part XI discusses ways to organize the team that will do the work.

Part XIII presents approaches for planning and organizing the tasks that need to be done.

Chapter 8: Principles for a well-functioning project

3 May 2024

I have been a part of many projects. These projects built a wide range of systems, including specialized small business record keeping, local government IT applications, low-level graphical user interface tools, large storage systems, spacecraft systems, and ground transportation.

Some of these projects went well. They produced systems that were useful for their customers. The systems held up over many years of use, working correctly and supporting their users in ways they needed. The projects proceeded (fairly) smoothly: no major unexpected flaws, teams that worked together well, completion within close to the expected time and resources.

Paraphrasing Tolstoy, all well-functioning projects are alike; each project that has problems has problems in its own way [Tolstoy23]. Though there are several ways for a project to go well, there are far more ways they can go wrong—and it takes deliberate effort to keep a project on the path that goes well.

I have watched many of these projects struggle through problem after problem, most of them self-inflicted. The causes have included poor team organization, lack of a coherent system design, lack of taking the time to think through designs, lack of design, internal organization politics, and many others. The struggles led to canceled projects, startups that had to raise extra funding rounds and missed their market opportunity, and unsafe systems being used in public spaces—often with consequences not just for the people building the system but for their funders and for society at large.

This book was inspired by observing these problems and finding ways to do better the next time.

So what does a project need to do to function well? To develop a useful, safe system, on a reasonable schedule and budget? To keep its team functioning at a sustainable pace, without internal disruptions? The rest of this book seeks to provide some answers.

My general principles fall into four categories:

  1. The project or organization leadership;
  2. The tasks for building the system;
  3. The plans for building it; and
  4. The team that builds it.

For each of these, I will list some principles I have found important to making a project run well, or to keeping it running well.

8.1 Project leadership

I have watched many projects, especially in startup companies, try to create a team of the best specialists: executives who are skilled at fundraising and external relations; an HR person who has a track record at recruiting; someone with marketing skills and connections; and a few engineers who can build the key technical parts of the system. Most of the projects staffed only with such specialists have either failed or had serious problems with execution.

These projects had a gap at the center of the work. Everyone is responsible for some piece, but there is no one whose responsibility is to link the pieces together: to build either the team or the product as a coherent system. People in the team generally don’t really understand each others’ work. They have trouble finding how to work with each other. The executives don’t understand the work or the team, and issue instructions that don’t make sense. The team makes poor technical decisions because no one understands how the artifacts they are building must work together.

This gap leaves three needs unmet. First, there is communication and translation between the executive team and the engineering team. Second, there is organizing and running the engineering team. And third, there is maintaining a systems view of the team’s technical work.

8.1.1 Principle: Communication and translation

Have at least one person in the organization who can communicate with people in the executive team, marketing, and engineering, and translate among them.

The executive team is, in most organizations today, a collection of specialists in running the company as a whole: corporate activities, finance, legal, public relations, marketing. I have found this to hold equally for independent companies, especially startups, and for projects that are part of larger organizations. The details may differ but the roles are largely similar.

The engineering team is also mostly a collection of specialists in one area or another, according to the needs of the system being built. They will understand parts of the system, but few of them are tasked with making all the parts cohere so they work together. Most of the people on the engineering team contribute through specific, deep skills.

The communication need is to represent these parties to each other. The executive team is responsible for setting the overall direction for the project. The engineering team needs this direction translated into actionable directions. The executive team is also responsible for high-level safety and security decisions (e.g. what kinds of safety hazards the company will address in its system products), and those decisions then need to be translated into the safety and security engineering processes. In the other direction, the engineering team needs to provide feedback to the executive and marketing teams on the feasibility and cost of different possible feature or market decisions the executive team could make.[1] The project management part of the engineering team also has the information about how work is progressing and can provide information about the time and people needed to reach different milestones.

8.1.2 Principle: Provide staff to run the engineering team’s operations

Designate at least one person to oversee how the team building the system operates. This person (or people) organize the team, and adjust how it operates as the team grows and the work progresses.

An organization is a system, and a team of more than a handful of people will not self-organize in a useful way. I will argue below that this system needs careful design to work well.

I consulted with a small startup that did not have someone responsible for organizing the engineering team. The startup had begun with a very few people, who were figuring out the basics of what their company could build. The co-founders did not create an organization below the executive level; instead, they expected that they could all just work together and figure it out. And, predictably, they did not figure it out once they added a few more people to the team and had to specialize.

Johnson [Johnson22] discusses how to organize a growing company, and I recommend her work to the reader. She presents many ideas about what to do to organize a company’s operations. While that book focuses more on the human-oriented parts of operations, such as hiring and performance evaluation, the ideas it presents provide a solid foundation for parts specifically about engineering, such as how to organize design and implementation verification (which are as much a human activity as a technical one).

An organization that is going to successfully build a complex system will need to designate someone as having the primary responsibility for creating and maintaining the team’s structure and patterns of behavior. Either that, or they need to get improbably lucky.

8.1.3 Principle: Systems view of the system

A team building a complex system must have at least one person who is responsible for the system as a whole, not just its parts.

A coherent, working system does not occur by chance. It requires deliberate effort for a collection of parts to work together, and for the collection to fulfill the purpose of the system.

This deliberate effort can be achieved, theoretically, by a group of uncoordinated specialists. However, this amounts to the Infinite Monkey Theorem, where enough workers and enough time can produce any system. For realistic systems, the time required might be many times the projected lifetime of the universe.

In reality, the majority of the engineering team is responsible for parts of the system, not the whole thing. It is not the job of these people to be responsible for the systems view of the whole; nor is it usually their training or experience.

Building a system requires coordination so that the parts work together. This can be achieved by designating one or a few people to be responsible for the coordination, or by having the parts-builders work by consensus. Work by consensus requires skills and time that few people have, unless the team has no more than perhaps five or six members.

Building a coherent system also requires having a way to measure coherence and satisfaction of system purpose. If a team is to work by consensus, all members of the team must have a consistent understanding of these criteria. If a smaller group is responsible for the system as a whole, then fewer people are required to share this understanding.

The shared understanding starts with the purpose for the system. The definition of the system’s purpose is outside the engineering team’s scope; it comes from the customer or their proxies by way of marketing roles (Section 6.2 and Chapter 9). The translation of information about customer needs into an actionable system purpose is the responsibility of a system role. This includes documenting the system purpose, developing a concept of the system, and writing down top-level system specifications. In doing so, the role works with the executive and marketing teams to confirm that the purpose and concept as developed match what the customer and organization actually intend.

The systems role is responsible for ensuring that the component parts of the system fit together into a coherent system. To meet this responsibility, the systems role is responsible for the design of the high-level decomposition of the system into parts, and how those parts are related—the functional and non-functional relationships (Section 6.5 and Chapter 12). While the systems role delegates the work to design and build the components, the role does check that the results match the specification of how the components interact. The systems role also guides the order of work, especially for how to plan integration.

8.1.4 Principle: The team is a system

A well-performing team is deliberately designed to have a structure that gives each member incentives and support to work together. The team’s leadership establishes the design, and monitors the team’s function to adapt the team structure when needed.

An effective team does not happen by accident. When a team is not given a structure and rules about how to work together, they will find ways to work. They will build up habits in response to a few specific early needs—and those habits will not make for a team that communicates well, cooperates well, or makes good systems decisions.

When medium to large teams try to self-organize, they react to problems they face immediately, and each person determines their response based on their own values and self-interest. The team members are not trained or incentivized to plan the team’s organization for future needs; instead, they find ways to work through individual problems as they come up. The team members in general do not have a view of the entire effort that will be needed to build the system, and so they find solutions based on their specific needs.

Team work exhibits variations of collective action problems. [Olson65] These problems occur when a group must work together; each member of the group must contribute in some way, and in return everyone in the group receives some benefit. The optimal strategy for an individual is often at odds with the optimal strategy for the common good. Many commonly-known cooperation problems, such as the tragedy of the commons or the prisoner’s dilemma, are kinds of collective action problems. (In fact an engineering team represents a particularly complex kind of collective action problem, because the contributions of different group members can combine non-monotonically: the value of one person’s contribution to the common good can be negated by another’s contribution.)

In other words, the natural tendency for a group is to form an organization that is reactive to immediate needs and to individual objectives, rather than the long-term objectives of the project as a whole.

Creating an effective team is, therefore, a deliberate act. It involves working out what the team needs to do as a whole, and then designing a structure for the team. That structure should address:

Maintaining a team’s effectiveness is also a deliberate act: good project leadership monitors how the team is doing and adapts organization or processes when needed. The team organization, or its processes, or its role assignments may work well for a while, but not fit the team’s needs as well later. The project’s leadership may set up a team organization or process and then find it doesn’t work as well as expected.

The organization of a team can be evaluated against the objectives in Section 7.3.3: how well people know how they fit into the organization and how that affects the actions they take.

I discuss matters of designing a team in Chapter 19 and in Part XI.

8.1.5 Principle: Team habits

A team with good habits and culture can get work done. A team with poor habits will not, except by unlikely random chance.

Whether a team follows procedures and processes depends on whether following them is the norm for the team.

Teams follow habits. Habits and norms provide stability to team members: when they know what to expect, they can get on with their work. This creates an incentive to keep following habits and not change them.

Establishing good habits at the beginning of a project is not difficult. Changing habits later is quite hard and rarely successful. The leadership of a team has one opportunity to set up a team to follow a process without undue effort. When they squander that opportunity, the project has difficulty from then on. If people in a team do not have a de jure process to follow, they will work out ways to get things done, and those habits will become the default way they work. Those habits are likely to have been worked out in reaction to a few specific, immediate situations; they won’t account for the indirect ways that one piece of work affects another, and thus will not meet the project’s needs well.

It is possible to change a team’s habits after the fact. However, it takes a lot of time and effort. The transition from one way of working to another is slow, as people follow their old habits without thinking until new ones set in. People will need constant reminders and incentives to change their behavior. There will be a period when people are doing a mix of old and new, which can increase chaos for a while (and often creates extra work to clean up the differences). People will feel extra stress, and often there will be a decrease in morale or civility in the team until they settle into the new norm.

Most of the projects I have worked on over the years have been about innovation. The people who start such projects do so because they are excited about what they can build, whether the technical aspects or the market aspects. They are motivated to get moving as quickly as they can, and they usually try to make a prototype or do a demonstration as soon as possible. They are not excited about the work of crafting a team; if they need that, they will get to it later when they have the prototype built, or when they have the next funding round…

This tendency is often exacerbated by the way some funders behave. They reward market opportunity and technical originality, which incentivizes a team to build the market case and technology demonstrations as quickly as possible. Funders rarely reward or even evaluate whether the project leadership has the capability to form a well-functioning team. When the funder does not value a team’s ability to execute effectively and efficiently, the leadership will not put the effort into crafting the team.

A project’s leadership must incentivize and model following processes in order to build a team’s habits. I am aware of a company that set out anti-corruption processes, including ethical standards and a hotline for reporting violations. The leadership did not, however, make it clear to the employees how reports would be acted on, and there was no demonstration of the standards being enforced. The employees correctly concluded that the leadership was not serious about enforcing the standards, and this led to significant internal theft.

8.1.6 Principle: Keep it lightweight and actionable

People will use processes that they can figure out how to follow and that clearly give them benefit. Don’t make processes more demanding than the team can handle.

People will generally follow prescribed practices and procedures as long as 1) they know about them; 2) they understand them well enough to perform them; and 3) the practices have high value relative to the effort required.

The first aspect implies that processes and procedures are documented and organized in a way that team members can find them. This also implies that when people join the team, they are taught how to find and use them.

A practice or procedure must be both clearly written and actionable for people to understand it and use it. I have encountered “plans” or “procedures” on multiple projects that amounted to a list of aspirations, rather than a specific set of actions that someone could follow. In one example, a security incident response procedure said things like “we will contact the responsible parties”, without naming who the responsible parties are (or even better, listing them with contact information). Had there been an actual incident, vague statements like this one would have led to time spent figuring out who the responsible parties were, and likely coming up with a wrong answer when under the time pressure of trying to resolve a critical incident.

A process or procedure that requires too much time or effort will lead people to try to create workarounds, usually subverting the reason that a procedure was established. This is the problem of a procedure that people perceive as too “heavy”. Keeping procedures as simple as possible will help. At the same time, some work is simply complicated, perhaps needing several people involved because it affects all of them. When some work is necessarily complex, it is vital to clearly document the process so that everyone involved understands both their own role and what the others involved will be doing.

I will discuss these topics more in Chapter 20, and especially in Section 20.8.3.

8.2 System-building tasks

Most engineers understand the need to use good technical judgment as they build a part of a system, but it is just as important to follow good practices in how the team approaches the work.

8.2.1 Principle: Start with a purpose before doing work

Understand why something is being built—its purpose—before trying to design and build it.

This is one of the most important principles in this book, and it applies in a great many ways.

“Purpose” here means the objectives for some work, the need that is to be met by doing the work or the reasons that it is worthwhile to spend the time and resources involved.

If someone starts designing or building something without understanding the purpose of the work, it is unlikely that what they build will actually meet the need that caused them to start the effort. And even if they do meet the need, perhaps by focusing on the purpose part way through the work, they are likely to have spent time and resources in false directions.

When someone takes on a task, whether to build part of the system or to oversee team operations, it is that person’s responsibility to ensure that they accurately understand the purpose of the work. Ideally they will be told the purpose as part of the task, but the person is still responsible for confirming that they correctly understand the purpose. I have found that taking explicit steps to confirm understanding saves time and effort, even for small tasks.

At the same time, the person who defines a task is responsible for ensuring that there is a clear purpose to the work and for communicating that purpose to whoever takes on the task. In other words, establishing the purpose for work is an act of communication between the person defining the task and the person doing it.

This principle applies to building a whole system. As I discussed in Section 6.2, a system needs a purpose—a customer need, for example—that it will fulfill. This purpose originates with the customer, or whoever will use the system, and with the value that the system will provide them.

The principle also applies to building components of a system. Each component (Section 11.2) has some role in the system: functions, behaviors, or properties that it should have that contribute to the system as a whole meeting its high-level purpose.

Other work also should have a purpose. Organizing the team, maintaining the project plan, and reviewing a component design are all tasks that have purposes. Someone doing these tasks should understand why the work is being done, and they should ensure that how they do the work addresses that purpose even if the associated procedures don’t spell out every step involved.

I argue in an upcoming principle that successful projects perform checks to ensure that completed work correctly fulfills its purpose. Without a clearly-defined purpose, it isn’t possible to determine whether a design or implementation or plan is correct or accurate.

I discuss how purpose fits into a system-building project throughout the rest of this book. I address the purpose for a system in Chapter 9. Each chapter in Part IV, on how to make a system, discusses the purpose of steps in building a system. As I present more specific topics, such as specifications (Chapter 33) or designs (Chapter 35), I present the purpose for that aspect of system-building before talking about what it is or how it works.

8.2.2 Principle: Evaluate tools before adopting them

Investigate whether tools, procedures, methodologies, designs, or implementations fit the project’s purpose before adopting them.

Every complex system is different from others in some way. The differences may be technical, such as how some component must behave, or they may be operational, such as the kind of team, the organization hosting the project, or the customer’s needs.

Differences mean that things taken off the shelf may or may not address the project’s need. An off-the-shelf electronics board might be a good fit, or it might not be available within the time needed, or it may lack a key security feature, or it may have reliability features that the project’s design does not need (but that do not interfere with how the board will be used). Similarly, a development methodology might address the project’s need for moving quickly and being flexible, but it might not work for a project’s distributed team.

In many cases an off-the-shelf methodology or design can be used in many different ways. The team may need to make choices about which of those ways are helpful for this specific project, and may need to adapt a procedure or methodology to fit what this project needs.

A well-functioning project will evaluate something that can be adopted, whether it is a component design or a procedure or a tool, against what the project needs that thing to be. Something that might be adopted can be measured in terms of the benefits of using it, the costs of adapting it to meet the project’s needs, and the costs of using the thing without adapting it. If the benefit outweighs the costs, then the thing can be used. If the thing does not quite meet the project’s need but can be adapted, then an investigation will reveal how to adapt it.

Sometimes a project will be obliged to adopt a process or use a component that is not a good fit. In that case the thing should be evaluated so that the team has a clear-eyed understanding of what problems could arise, and they can work out mitigations to avoid the worst problems.

This principle has a serious risk: that it will become an excuse for the Not Invented Here syndrome. No project has the time or resources to invent everything from scratch—especially when reinventions often lose sight of the experience that has gone into existing procedures or components. A team has to balance using tools that are pretty good but not perfect against the cost of inventing from scratch.

The idea of satisficing applies here: adopting a solution that is good enough to satisfy a need, without attempting to find a perfect one. Writing of adapting buildings, Brand observes:

The solutions are inelegant, incomplete, impermanent, inexpensive, and just barely good enough to work. The technical term for it, which arose from decision theory a few decades back, is “satisficing”. It is precisely how evolution and adaptation operate in nature.

Even after generations of satisficing, the result is never optimal or final. […] The advantage of ad hoc, make-do solutions is that they are such a modest investment, they make it easy to improve further or tweak back a bit. [Brand94, page 165]

8.2.3 Principle: Take care with build-versus-buy decisions

Carefully evaluate each choice of whether to design or build something within the project, or acquire it from outside. Be particularly careful about the team’s ability to accurately make this evaluation.

Projects often have choices about whether to design and build something themselves, or to acquire it from somewhere outside the project.

Too often, the choice is made without deliberation. When the wrong option is chosen, the cost can be significant: spending resources to acquire something that doesn’t work well, or to build something that is not very good.

There are reasons to choose to build something inside the project. These include:

There are also reasons to acquire a component.

Sometimes there are overriding concerns in making the decision. If the team does not have someone with the skill to develop the component, it will have to be acquired. If no outside organization offers a component that fits, it will have to be built. If the time to build is too long, the component will have to be acquired; if the lead time to acquire it is too long, it will have to be built.

Other times the decision depends on the costs and benefits of each option.

Two cost considerations are often overlooked. First, a custom-built component can be made a perfect match for the system’s needs, while an acquired component may have to be adapted or may have unneeded features (which can become a liability). The cost of adaptation has to be considered in addition to the cost of acquisition. Second, a custom-built component carries an opportunity cost as well as the direct cost of building it. If a custom-built component is not essential to the system purpose or the related business purpose, then the resources used to build it might be better spent on something more central to the purpose.
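As a simple illustration of weighing these overlooked costs, here is a small Python sketch; every figure and cost category in it is an invented assumption, not a recommended costing model.

# Hypothetical build-versus-buy comparison. Every figure is invented for
# illustration; a real evaluation would use the project's own estimates.
buy = {
    "acquisition": 40_000,   # purchase or license price
    "adaptation":  25_000,   # effort to adapt it to this system's needs
    "integration": 10_000,
}
build = {
    "design_and_build": 90_000,
    "verification":     30_000,
    # Opportunity cost: the same engineers could have spent this effort
    # on work more central to the system's purpose.
    "opportunity":      35_000,
}
print("buy:", sum(buy.values()), "build:", sum(build.values()))

In this made-up case the acquired component comes out cheaper even after paying to adapt it; different assumptions can flip the answer, which is why the evaluation needs to be deliberate.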

Teams, and individual team members, need to consider their ability to make an objective build-versus-buy decision. I have observed many people who chose to build something new not for sound technical or business reasons, but because they were excited about building that thing. I have seen other cases where someone decided to acquire a component because they were not interested in the effort required to design and build it well. Worse, too often the Dunning-Kruger effect [Kruger00] applies: the person making the decision is not aware of whether they have the knowledge to make an accurate decision, or of how their biases are driving it.

8.2.4 Principle: Follow the spirit, not just the letter

When a project has adopted a procedure or tool, that procedure or tool has a purpose. When using it, keep the purpose in mind and make sure that purpose is met—not just following a procedure or using a tool blindly.

A well-functioning project does not adopt its procedures or methodologies on a whim; it ties them to purposes. In organizations like NASA, the procedure standards represent several decades of accumulated experience. While a procedure may not be written in a way that makes the purpose and experience clear, those reasons exist behind what has been written.

I worked on a NASA project that reached its Preliminary Design Review (PDR) milestone. The team followed the long NASA checklist for what should be presented at that review. Unfortunately, the team did not keep in mind what the PDR was actually for: ensuring that the early, conceptual design coheres as a system and showing that the system is ready to proceed to steps that will involve greater investment. Instead they developed material that checked each box on the agenda, without addressing the system as a whole. The reviewers could tell that the design did not make sense; moreover, the review failed to reveal the actual problems that the design had.

A team should document the reasons or purposes for which they adopt a procedure or a tool. Similarly, each person on a team should put effort into understanding why the team has adopted procedures and tools.

8.2.5 Principle: Document things so there is a future

Document both how things work and why they work so that people can understand the system when they work with it in the future.

It is easy to want to design or implement at full speed, keeping focused on the immediate goal: getting the thing built.

That goal misses the larger purpose of building something—that the built thing meets its purpose and specification, and that it continues to do so as the system evolves.

In practice, the initial design and implementation of a component involves much less effort than is spent on checking that implementation, integrating it with other components, fixing bugs, and making changes later. A project that is building a system to succeed in the long term optimizes for all these other tasks, not just the initial design or implementation.

All these later tasks involve understanding specification, design, or implementation of a component. Understanding means not just being able to see the design or implementation artifact, but also knowing why the component is what it is. This includes documenting the rationales that led to significant decisions about the component. It also includes providing people a guide to understanding the component’s design or implementation, especially if there are subtle aspects to the component that are easy to miss if one is looking just at a design document or an implementation.

When someone is handed the code for some component and asked to change some behavior, and that person isn’t the one who initially implemented the component (or they are the same person, but it was a while ago), they begin by building up a mental model of how the component works. Once they have that mental model, they can work out how to change it. They will think of different ways they could make the change, and evaluate them to see whether the changes will have the effect they intend and will not have some other undesired effect.

Building up an accurate mental model involves working out the constraints that led to the component’s design, the major decisions about how the component is structured, and how different parts of the component work together to achieve its functions. This information is not encoded directly in software source code or mechanical drawings or circuit designs; those artifacts are the products of a process that worked through the constraints and decisions on the way to producing them.

The person who is tasked with changing a component, and thus with building up a model of how that component works, can get information in two ways: from documentation or by reverse-engineering the implementation artifacts. In practice it is usually best to do both. A circuit design is the truth about how an electrical component works, and so it is the most accurate way to learn about the implementation. However, a circuit design or software source code leaves out the rationale for why the design is the way it is. Having documentation about the design, about why the design is the way it is, and a guide to the implementation will help the person understand the component more accurately and more quickly.

Of course, having documentation only helps if that documentation is accurate. If the documentation doesn’t match how the component was actually implemented, then the documentation will lead someone astray when they try to learn how a component works.

There is a saying in agile software development that “the code should be documentation”. This is usually interpreted as “the code should be the only documentation”, which is not what the people who developed agile methodologies intended.[2] The point in the agile methodology is that software code is necessarily documentation, and so it should be written clearly enough that others can read and understand it.

I have experienced both the advantages of having good documentation and the disadvantages of having no or inaccurate documentation. Many years ago, I developed a multithreading package for a research system. That package included a peculiar thread-synchronization primitive tuned for that specific application; correct operation depended on some unobvious code in one place. It took some time to analyze the design to identify that constraint, and if I had not written it down I would not have remembered it correctly when I had to modify the package a year or two later. On the other side, on a personal project I built a responsive, single-page web application using a combination of JavaScript code running in the browser and Ruby code running on the server. I did not document the design, and when I needed to improve it after a couple of years I had to reconstruct the design. I spent much more time than I would have liked on that reconstruction.
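A minimal sketch of the kind of rationale note that saves this reconstruction effort; the component and the constraint described in it are hypothetical, not the code from either project above.

import threading

class HandoffGate:
    """Hypothetical one-shot handoff primitive (illustration only).

    Rationale (the part the code alone does not convey): release() may be
    called from an interrupt-style callback, so it must never block. That is
    why this class uses an Event instead of a Condition whose lock would be
    held across the notify; an earlier (hypothetical) design that blocked in
    release() could deadlock under rare timing.
    """

    def __init__(self):
        self._ready = threading.Event()

    def release(self):
        # Non-blocking by design; see the class docstring for the rationale.
        self._ready.set()

    def wait(self, timeout=None):
        # Returns True once release() has been called (or False on timeout).
        return self._ready.wait(timeout)

The docstring records a decision that the code cannot express on its own: why release() must not block. That is exactly the information a future maintainer would otherwise have to reverse-engineer.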

8.2.6 Principle: Build in checks

Make independent checks of all critical specifications, designs, and implementations a normal and expected part of project work. Define in advance who will do the checks and when they will do them.

Having one person check another’s work is a basic mechanism for maintaining quality, safety, and security in a system. It applies equally to technical work, such as verifying that a design matches specifications, and to project operations, such as checking that a procedure is working as intended or that team communication is flowing.

Note that this does not mean that developers can avoid writing unit tests or performing design analyses. They should be doing those, and independent checks should be done as well.

There are many advantages to performing reviews or checks:

There are two significant disadvantages that can lead to a team skipping checks. First, checks take time and effort. When a team is pressed for time or short-handed, it’s easy to let a check go by. Second, a review done poorly can feel like a lack of trust or an attack on someone’s work.

Nonetheless, checks and reviews are important enough that a well-functioning project will find ways for checks to happen.

Having checks be a built-in norm for the team helps address the disadvantages. If everyone knows that checks are going to happen, the time and effort involved will be planned for. People will notice if checks are being skipped, and will ask why—helping to ensure that the checks actually do happen. Separately, when everyone’s work is checked, it becomes easier to convey the sense that no one is being singled out or is not trusted.

I discuss how checking can be built into a project’s life cycle patterns in Chapter 20.

8.2.7 Principle: Work against cognitive biases

Take deliberate, ongoing actions to avoid the negative effects of cognitive biases, such as confirmation bias or team echo chambers, and missing or incorrect information.

The work of building a system involves making many complex decisions. These decisions are based on the information that the person making the decision has, along with their skills, experience, and biases.

Incorrect decisions can be made when people work from beliefs or biases that are inaccurate. This leads to concepts or specifications that reflect the errors, and from there to designs and implementations that do not meet system needs. There are many terms for these situations, including confirmation bias, echo chambers, and recency bias.

These errors arise from many different causes:

These biases can lead to serious system flaws when incorrect decisions are made about high-level system design or safety and security functions.

There is no one method that will eliminate these problems. Indeed, many of these problems are a necessary flip side to cognitive behaviors that have positive outcomes, such as group agreement and pruning a search space when making decisions.

A well-functioning team takes deliberate and ongoing steps to reduce the problems that come from cognitive bias. These address the problems from two directions: prevention, by making complete information available, and reducing occurrence, by building into the project’s procedures methods to avoid or catch problems.

A project can reduce the chances of cognitive bias issues by maintaining complete written records of key information. Information about customer needs (and how those were determined) and rationales for design decisions are most important. Completeness in designs and verification records also helps. Sharing changes to information widely, as well as documenting them in writing, helps keep team members from working from outdated assumptions.

Reducing occurrences of erroneous bias involves finding ways to see around the bias to information that would otherwise be ignored or dismissed. This almost always comes from finding a way to look at a problem from a different perspective. Training team members to take deliberate steps to falsify their own hypotheses gives each team member an improved perspective. Building in reviews where decision rationales are explained to people with different perspectives helps catch biased decisions before they cause errors. Designating someone to be a devil’s advocate in discussions about complex decisions makes it clear that the team takes the possibility of bias seriously.

Continuous training for team members, in their own disciplines and in related ones, improves their skills beyond what they learn by experience. Greater knowledge and skills help combat the kinds of cognitive bias related to the Dunning-Kruger effect. Training in related but different subjects improves open-mindedness, giving team members new perspectives to use in thinking through decisions.

Project leadership has an important role in avoiding problems that arise from bias. Good leadership models behaviors where the leader explicitly looks for falsifying evidence and alternative perspectives. The leadership has the ability to allocate effort to investigating decision alternatives and being the devil’s advocate in discussions. The leadership sets expectations for the rest of the team by inspecting decision rationales to ensure that steps have been taken to address possible biases.

8.3 Plan for building the system

Complex systems, with dense graphs of relationships between their parts, cannot be built without a plan. A project cannot get such a system built by following a random walk through the space of possible tasks. However, plans are often overdone: they try to lay out a definite schedule where in fact there are unknowns, and then scheduling crises follow when something runs long or over budget. A middle ground works better: a plan that remains honest about what is known and what is not, that allows flexibility as the project moves forward, and that still guides the work in a consistent direction.

8.3.1 Principle: Prioritize work by risk or uncertainty

Put effort into work that carries risk or uncertainty as early as possible.

Common project management practices advocate paying attention to the critical path: the set of tasks that must be completed on time in order for the project as a whole to complete on time. If any one of these critical tasks runs late, the project as a whole will be late. Each task has some measure of slack: the amount its start or finish can slip without delaying the end of the project. If a task has no slack, meaning it must start and finish on time, it is part of the critical path. Most projects have at least one sequence of critical tasks from the start (or from the present) to the end of the project.

This definition of critical path is useful but overly simplistic. It is useful because it gives a way to identify work that can put the project at risk, and once identified that work can get extra attention to make sure it goes as planned. The definition is simplistic because, at least in the basic formulation, it assumes that the graph of tasks and the duration and dependencies of each task are all known.
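As a concrete sketch of that basic formulation, the following Python example computes earliest finish times, latest finish times, and slack for a toy task network; the tasks, durations, and dependencies are invented, and the calculation assumes they are all known, which is exactly the simplification discussed above.

# Toy critical-path calculation with invented tasks and durations (days).
tasks = {
    # name: (duration, [predecessors])
    "spec":      (5,  []),
    "design":    (10, ["spec"]),
    "build_hw":  (20, ["design"]),
    "build_sw":  (15, ["design"]),
    "integrate": (10, ["build_hw", "build_sw"]),
}
order = ["spec", "design", "build_hw", "build_sw", "integrate"]  # topological order

# Forward pass: earliest finish time for each task.
earliest_finish = {}
for name in order:
    dur, preds = tasks[name]
    start = max((earliest_finish[p] for p in preds), default=0)
    earliest_finish[name] = start + dur

project_end = max(earliest_finish.values())

# Backward pass: latest finish time that does not delay the project.
latest_finish = {name: project_end for name in order}
for name in reversed(order):
    dur, preds = tasks[name]
    latest_start = latest_finish[name] - dur
    for p in preds:
        latest_finish[p] = min(latest_finish[p], latest_start)

for name in order:
    slack = latest_finish[name] - earliest_finish[name]
    print(name, "slack:", slack)   # zero-slack tasks form the critical path

In this toy network the zero-slack chain spec, design, build_hw, integrate forms the critical path, while build_sw carries five days of slack.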

The critical path method is a special case of the general principle of using risk and uncertainty to inform project planning. In general, what work could lead to the project being delayed, or to the project failing?

There are at least four kinds of risk or uncertainty to consider.

First, there is the risk that some external event will affect the project. A customer might change their needs. Regulation might change, affecting how the system must be designed. A supplier might go out of business and thus not deliver components. Weather might delay an essential testing operation. Some geopolitical event might happen that changes the ability to manufacture an essential part.

Second, there is uncertainty about how to build parts of the system. At the beginning of a project, there is neither concept nor design for the system, and so the time required to build it is uncertain. As the design begins to develop, some parts of the system will have low technical risk because they involve well-understood problems, such as wheels for a road vehicle. Other parts cannot be built using available designs, such as a spacecraft that needs a low-mass, low-power radio subsystem that can communicate with another spacecraft. If the team can find or develop an appropriate radio, then the project can move forward—but if it cannot, then the system design or the mission will require significant re-work. It may not even be possible to meet the customer needs within the time and budget they require.

Third, there is uncertainty about the time and effort required to build something. There may be a likely technical solution for some component, but the difficulty of constructing it may have hidden surprises. The time needed for a supplier to provide a purchased component might not be known until a contract is signed with them. The complexity of testing the integration of certain components and fixing bugs might not be understood.

Finally, there is schedule risk from a “long lead” task or sequence of tasks that will take a long time to complete.

A well-functioning project searches out risks and uncertainties like these and puts attention and effort on them. Deliberately spending effort addressing technical and schedule risks early in a project means that potential problems are addressed when it is cheapest to handle them. Consider finding out halfway through a project that there simply is no component available to fill some need. Addressing this might require a redesign of much of the system—but much effort has already been spent building parts of the system that now must be discarded. This is a waste of resources; more seriously, it presents a problem that all too often leads project management to decide to fudge the solution and build a system that does not work as needed.

This principle requires dedication to examining the state of the project thoroughly and without bias.

8.3.2 Principle: Prioritize integration

Integrate components as early as possible. When possible, integrate mockups or skeleton components before building out the component details.

There is common wisdom that the cost of fixing an error in a complex system generally increases over time, up to the release into production. While the hard evidence for this is lacking, I find general acceptance that this occurs, though with plenty of exceptions.[3] The idea of increasing cost over time has led to methods that successfully catch errors early, including concept, requirement, and design reviews, test-driven development, and automated checking tools.

Studies such as those reported by Leveson [Leveson11, Sections 2.1 and 2.5] suggest that the greatest cause of system failures now comes from design errors related to the interaction of separate components: the robustness of individual components is not the problem, but instead how components work together. This appears to be the case even with requirement and design reviews, which certainly catch many errors before they are implemented.

I have found two methods help reduce integration-related errors.

The first method is to use semi-formal, top-down design analysis methods in conjunction with design reviews. I recommend the STPA method that Leveson presents. [Leveson11] The Mars Polar Lander loss review called out the lack of such analyses as a significant contributor to the loss of the spacecraft [JPL00, Section 5.2.1.1, p. 16].

The other method is to organize development around integration, so that the component interactions can be tested (not just analyzed) as soon as possible. This principle means focusing on how components will work together before implementing fully detailed components. This leads to building the system in increments, starting with a collection of stub or skeleton components that implement a few parts of the component behaviors and integrating them together into a partial system with limited capabilities. This partial system is then tested, with an emphasis on seeing if the interactions work correctly. Once problems with the integration are sorted out, another tranche of functionality can be added and tested. Along the way, one always has a partial system that runs.

Integration-first development has two benefits. First, if the component interactions do not work well, a redesign will affect multiple components; detecting the problem before investing in all the details of those components means less re-work. Second, it is usually easier to test interactions with mockup or skeleton components than with “real” components. One can instrument the mockups to observe detailed states that are harder to observe in a complete implementation. One can also add fault injection points to make it easier to create off-nominal test scenarios.
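A minimal sketch, with invented component names, of what a skeleton component with instrumentation and a fault-injection point might look like during an integration-first increment.

import random

class SkeletonTelemetryStore:
    """Hypothetical stub standing in for a real telemetry-storage component.

    It implements just enough of the interface for integration testing,
    records every interaction so the test harness can observe the exchange,
    and exposes a fault-injection knob for off-nominal scenarios.
    """

    def __init__(self, fault_rate=0.0):
        self.fault_rate = fault_rate   # probability that a write is rejected
        self.calls = []                # instrumentation: observed interactions

    def write(self, record):
        self.calls.append(("write", record))
        if random.random() < self.fault_rate:
            raise IOError("injected fault: storage unavailable")
        return True

# Integration test sketch: wire the stub to the producer component (real or
# also stubbed) and check that the interaction handles faults.
store = SkeletonTelemetryStore(fault_rate=1.0)   # always fail, to test recovery
try:
    store.write({"sensor": "temp", "value": 21.5})
except IOError:
    pass  # the producer's retry or fallback path would be exercised here
assert store.calls[0][0] == "write"

Because the stub records every call and exposes a fault_rate knob, off-nominal integration scenarios can be exercised long before the real component exists.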

This principle is not one to apply blindly, however. The purpose of integration-first development is to address uncertainty or risk that comes from potential component interaction problems. Some components may have their own internal technical risks, and sometimes it is more important to sort out that risk before addressing component interaction risks. Of course, the ideal would be to address both in parallel.

8.3.3 Principle: Have a long-term plan

Maintain a plan for how to get from the present to a completed system. Detail out the near future; have a concrete but less detailed plan in the medium term; and have a general approach beyond that. Evolve the plan as understanding about the work changes.

Consider the task of planning a route for walking from one place to another. If one has a map of roads or trails connecting the locations, one can search out a path by using a standard shortest-path graph algorithm, which evaluates various parts of paths in an orderly way until it finds a “best” path.

This is analogous to building a system with few unknowns. One can start by designing the system on paper and checking the design. This approach is a low-risk way to build a system, as long as one can be sure that all of the components can be built as designed and that their integration into a system will work as planned. This situation applies when building a system that has strong similarity to other systems, so that there is an existing body of knowledge about what works. It is the basis for repeatable engineering methods, as evaluated by standards such as CMMI. [CMMI] It is also the situation that led to the waterfall system development methodology.

What if there is no map? What if the terrain in between is unknown, and the distance is far enough that one can’t do something like climb a hill and look?

Most projects that are working to build an innovative complex system have a situation like this. At the beginning, there is no obvious path to follow to get to the desired system; indeed, there may not be any path that gets there if the desired system is not feasible.

The team working on the project needs a plan that will guide their work, giving it a general direction for the long term, some concrete plan for the medium term, and details in the short term. As the work progresses, some of the medium-term work will turn into specific, detailed tasks. Some of the tasks will provide information that fleshes out parts of the general, long-term work into more concrete medium-term work. Sometimes bug reports or change requests create new short-term tasks that change the medium- and long-term parts of the plan.
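One way to picture the shape of such a plan is as a rolling structure with three horizons of detail; the sketch below uses invented tasks and is only an illustration of the differing levels of detail, not a planning tool.

# Invented example of a plan with three horizons of detail.
plan = {
    "short_term": [   # specific, estimated tasks for the next few weeks
        {"task": "implement command parser stub", "owner": "A", "days": 4},
        {"task": "review thermal interface spec",  "owner": "B", "days": 2},
    ],
    "medium_term": [  # concrete but less detailed work items
        "integrate command path with simulator",
        "select radio supplier and start procurement",
    ],
    "long_term": [    # general direction only; detail is added as work progresses
        "flight software qualification",
        "system-level verification campaign",
    ],
}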

A plan like this benefits the team. It helps ensure that people get all the tasks done, without some getting missed. It conveys decisions about how work is prioritized, which helps the team work independently. It gives a basis for measuring progress and predicting whether milestones will be reached on time.

The act of maintaining the plan provides the opportunity to think about priorities (such as those in the previous principles) and the dependencies between parts of the work.

A flexible, evolving plan strikes a middle ground between a fixed schedule and a purely reactive tasking approach. A fixed schedule, of the kind often associated with the waterfall development methodology, either becomes a fiction after a few weeks when unknowns intrude on the planned perfection, or it becomes flexible anyway and then takes effort to maintain without any discipline for doing that maintenance. A purely reactive approach, which can be seen in Agile methodologies taken to an extreme, risks the team wandering around chasing whatever immediate priority comes along, and then having execution difficulties when some work requires more planning than one sprint’s duration.

Of course real projects rarely take either extreme approach; in practice real projects adjust schedules over time. Having a discipline for maintaining a plan from the beginning helps the evolution proceed smoothly.

8.3.4 Principle: Set up intermediate internal milestones

Define regular internal milestones for showing a part of the system working in an integrated way.

Internal milestones that demonstrate some system function give the team a focus for their work in the medium term.

Each milestone demonstrates a set of system capabilities working, especially if those capabilities involve integrating functionality in multiple components. The milestones include a demonstration of the new capability working, in order to prove that the system is working and to give the team a concrete success to celebrate. Internal milestones like these put the team’s focus on a part of the system, leading to capabilities that are integrated together early. (This approach supports the principle of prioritizing integration, above.)

The functionality for each milestone should represent some significant amount of work. I have scheduled such milestones about two or three months apart. If a project is using Agile-style sprints, the milestone should include the effort from several sprints.

I have often focused these milestones on some high-level system function or on some pathway through the system. In the software effort on one multi-spacecraft project, the first milestone demonstrated that the basic software and communication frameworks functioned in a testing environment. The next milestone showed simple control loops in the flight software working; the milestone after that, collective guidance for the collection of spacecraft. Each milestone built on the work of the ones before it.

Of course, not all of the team need be involved in one of these milestones. Part of the team may be working in parallel on other functions. In the multi-spacecraft system example, other parts of the team were working on spacecraft hardware design, mission design, ground systems, and so on.

There is a risk in this approach: that the team takes too narrow a focus and fails to account for the larger system. Any focused effort, whether for an internal milestone or for something else, must be balanced by consideration of the whole system. In the project above, the systems engineering team kept working in parallel to the software teams in order to ensure that the software designs continued to meet mission needs and would integrate properly with the spacecraft hardware and ground systems.

8.3.5 Principle: Use prototyping safely

Use prototyping to validate a concept or determine if an approach is technically feasible. Never let a prototype escape and become treated as a part of the real system.

Building a prototype of a component or a part of the system is an excellent way to learn about how the component or part can be built, and how it will work. It is also a good way to check that a potential design will meet its needs.

Building a prototype is also one of the more dangerous activities that a team can do while building a system. The risk is that a prototype will appear to function in the way needed and will be treated as if it is an initial version of the “real” component, even though it is not.

A prototype has value when it can be developed quickly, at lower cost than its “real” counterpart. Taking shortcuts, implementing only some of the functionality, not performing much verification—these are all positive approaches when building a prototype and negative ones when building a component to be used in the final, deployed system.

One example of what can happen comes from a colleague. He was tasked with building some sample software code that would show developers how one could construct a particular kind of application on a new operating system product. The sample code was intentionally simple; it illustrated a particular flow of activity that an application would need to do. It was not a full application in itself. He took some shortcuts in non-essential parts of the code, making the primary part of the application robust but (for example) making some helper functions non-reentrant because they were not an essential part of what was being illustrated. Unfortunately, after this code was published as part of a tutorial, people began blindly copying the helper functions—even though the example was labeled as illustrative only. This led to other organizations releasing buggy applications because they took the easiest and fastest route to building their application by just copying the helper functions.

I observed another example in an ambitious autonomous vehicle system. The company in question began development of their vehicle by building prototypes of several key systems, both hardware and software. In doing so they learned a lot about the problems they were trying to solve. The prototyping effort did what it should: it provided information about how the system should be designed as well as a platform for experimenting with algorithms (such as some of the control systems). Unfortunately, the company did not label or treat these artifacts as prototypes; they saw them as early versions of the real system. The prototypes allowed them to demonstrate vehicles that could perform some operations to investors. This led to increasing pressure to get more features implemented, and to correct problems they found with the vehicle operations as soon as possible. The prototypes had never been designed for reliability, safety, or security, and early safety analyses found significant flaws. Interestingly, the company did treat their hardware platforms as prototypes, and built a hardware platform that was designed to meet safety and security requirements to replace the early prototype boards.

These examples point to both the positive and negative sides of prototyping. On the positive side, in both examples, developing a simplified version of the system in question helped people understand the problem at hand. The effort to develop the prototype went faster because it focused on only the essential elements of what needed to be learned, and omitted aspects that would be needed for a production system. On the negative side, in both cases the prototypes ended up being treated as production-ready. The prototypes, having been built without the rigor needed for correct, safe, or secure function, led to flaws in the system products. These flaws increase the cost of building a working system, and they tend to be discovered late in development when it is far more costly to correct them. (One startup company I worked with had to rebuild a third of its project when they realized how much they were spending to try to patch up the prototype-quality software they had written; they had to go through extra venture funding rounds to get their product released.)

Prototyping, then, is a necessary and helpful part of building a complex system, but it must be done with a discipline that keeps prototypes separate from the “real” system components.

Some project managers have talked with me about solving this by policy: they will have their team build a prototype, but they will ensure that the prototype is not used for production, and they will put building a real component into the schedule. Unfortunately, I have then seen this resolve fade away quickly as the project begins to run late, to have funding issues, or to face an important demonstration coming soon. These imperatives have always, in my experience, taken precedence over system correctness and even over the longer-term cost and schedule to build a working system.

Prototypes are used more safely when they cannot be used in the real system. For example, people often construct storyboards or slides of the user interface for an application. These storyboards allow the developers and potential users to explore how the interface will work, but they cannot be made into an executable application. Similarly, building a software prototype using languages or tools that cannot be integrated fully into the production system helps keep that software from being used in production. Using prototype hardware that is similar but perhaps in a different form factor allows a team to see if a hardware design can work without risking the prototype being put into production.[4]

8.3.6 Principle: Analyze for feasibility

Analyze a system concept for feasibility before committing large amounts of resources to it.

I have worked on multiple projects that were, in retrospect, infeasible. Project A was trying to build a collection of cubesats to demonstrate cross-link communication between the spacecraft; no radio or flight computer was available that could achieve communication between the spacecraft except for a brief period at the start of the mission. Project B involved designing a commercial system for which no commercial business case existed—the system was fundamentally a public good that would not generate a commercial return on investment. A third, Project C, depended on multiple competing government contractors voluntarily developing a shared system architecture, when the rational behavior for all the contractors was to focus only on their own work. Yet another, Project D, depended on secure operating system technology that did not yet exist.

In all these cases, large amounts of money and effort were spent before the projects were canceled.

With hindsight, it is clear that the problems with all but one of these projects could have been detected early. In Project A, basic systems engineering could have created a mission concept of operations and modeled whether available radio and computing hardware was up to the task. The incentive for competing contractors in Project C not to collaborate was clear from the beginning, but the management overseeing the project chose to continue anyway. The missing technology in Project D was identified early but the customer insisted on proceeding.

Project B was the exception. It was defined as a two-year, limited-time exploration of the problem. At the beginning of the project, no one involved knew whether the system was feasible or whether there was a business case. Over the course of the project we learned about the nature of the system, including that it produced a public good [5] rather than a private good, and thus was not a sensible commercial product.

8.4 The team

A project’s people do the work of building the system. The team is itself a system made up of complex parts, and how effectively it works depends on how well it is organized and led. Supporting a team with the structure it needs, and in particular with the communication channels it needs, gives the team a fighting chance of working effectively and working through the difficult problems that will come along.

8.4.1 Principle: Document team structure

Define clear roles and responsibilities for each member of the team. Document and share that information so everyone has an accurate understanding.

As I noted earlier (Section 8.1.4), the team is itself a system. As a system, it has structure—who is on the team, what their roles and authority are, and how people should communicate (Section 7.3.3).

There are many ways projects can structure their teams. The specific choices depend on the nature of the project—the number of people, the range of disciplines involved, whether there is one organization or many.

In a well-functioning project, everyone on the team will have a common understanding of what that structure is. Each person will know who they should communicate with and when. Each person will know what their areas of responsibility and authority are, so that they know when they can make a decision and when they should work with someone else. They also will know who to go to for answers to questions about other parts of the system.

A shared understanding of team structure becomes most important when people find problems to address. If one person finds a problem with the design of a component, they will need to work with the people who are responsible for components sharing functional or non-functional relationships (Section 12.2). If there are interpersonal problems between two team members, the responsibility for escalating problem resolution should be clear.

Clear team structure enables delegation. In a project of more than trivial complexity, the work must be shared among multiple people. Sharing responsibility only works when both parties trust each other: that both will do their part of the work, that both will communicate what should be done and the progress that has been made, and that both will communicate when they find a problem with the planned work. This trust depends on a shared understanding of the rules about responsibility and communication.

8.4.2 Principle: Plan on reorganizing the team as it grows

Adapt the structure of the team as it grows, to reflect the increased coordination needed as the number of interactions increases.

A very small team, of up to around five people, needs little formal structure, because all the people can interact directly with all the others to coordinate the work. A large team needs formal structure, with defined scopes of responsibility and communication paths. In between, the team needs some degree of structure.

As a team grows, it will move gradually from the size where it needs little structure to needing more and more. It will reach points where it outgrows the structure it has and needs to adopt a more formal one. I have observed that teams need to change structure at around 5, 30, and 70 people.
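One way to see why structure becomes necessary is to count the potential person-to-person communication paths, n(n-1)/2; the short sketch below applies that arithmetic to the team sizes mentioned above.

def pairwise_paths(n):
    """Number of distinct person-to-person communication paths in a team of n."""
    return n * (n - 1) // 2

for n in (5, 30, 70):
    print(n, "people ->", pairwise_paths(n), "possible paths")
# 5 -> 10, 30 -> 435, 70 -> 2415: direct everyone-to-everyone coordination
# stops scaling long before the larger sizes.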

In a well-functioning project, the leadership monitors the team’s performance to detect when the team is reaching a size where it needs a change in structure.

Some of the signs that a team needs to move to a more formal structure include:

8.4.3 Principle: Have shared procedures

Document procedures that everyone on the team will use for important tasks.

Procedures define how people perform certain tasks (Section 7.3.5 and Section 20.4). These procedures should be documented and easy for everyone on the project to find. The team should have a cultural norm of following the procedures—not just the letter of the procedure, but the spirit of it as well.

When people work together, one person does part of the work and another builds on it. For this to succeed, people need confidence that the work they build on has been done properly. Part of that assurance comes from having shared procedures and a team norm that everyone follows those procedures.

Some procedures are simple lists of steps or checklists. For example, if a team is using a shared artifact repository like git, everyone needs to follow conventions about how to check in work, maintain branches, and baseline versions (such as by merging to a main branch). If someone does not follow the procedures, the state of the repository can become damaged.

Other procedures are more complex. Completing a Preliminary Design Review (PDR) in the NASA life cycle (Section 23.2.1) means that the project is ready to commit money and resources to begin detailed design and, later, implementation. This is a check on the whole project, not just on the design of one part. Passing the review implies that many project artifacts are completed, at least to a preliminary level: cost and schedule baselines, security and export control plans, orbital collision and debris avoidance plans, specifications to at least three levels, technical concepts, operational concepts, and many others. If the project continues but some of these conditions are not met, then the project is likely to have serious problems later. (This was the case on a NASA project I worked on.)

8.4.4 Principle: Define regular communication paths

Document regular times and media for team members to communicate with each other.

The work the team does is interconnected. A decision about one part of the system affects other parts, following the system structure relationships. The decisions are based on information that, in turn, comes in part from the other parts of the system. Others on the team are responsible for ensuring that the project is making progress, including detecting when something is not going as expected.

Regular communication ensures that this information is pushed to the people who need it. A well-functioning team knows when to share information (such as times when decisions are being made), and who to share it with (the people whose work it will affect). Such a team will also avoid pushing information to those who do not need it. This avoids inundating people with useless information and thereby obscuring information they do need.

To achieve this, make sure that the project’s operational procedures include defined points when team members are expected to communicate. These might include starting the design for a component, proposing changes to an interface, and presenting a component’s design or implementation for review and approval.

Other team members need regular communication for other purposes. Status updates provide information to update the project’s plan. Other communication ensures that the team is working well, helps project leadership keep a finger on the team’s productivity and satisfaction, and provides a way for everyone on the team to learn the project’s overall goals. Johnson discusses communication as a foundation for team functioning [Johnson22, Chapter 2] and how communicating feedback is essential for keeping team members working at their best [Johnson22, Chapter 5].

8.4.5 Principle: Define exceptional communication paths

Define and document clear expectations about when and how someone will raise issues with others. Make this an essential part of the team’s cultural norms.

Delegation and sharing work are essential to a team that is building a complex system, and both are based on mutual trust. One part of that trust comes from each party doing their work well, following the project’s procedures and the team’s norms. The other part is being able to trust that people will communicate when there is a problem. (See Section 19.1 for more on this.)

There are many things that can go wrong. Someone can find an error in a specification or design. They can find that they don’t have the resources or skills to complete some task. People can have disagreements that they cannot resolve. A supplier can be late providing some component.

When these things happen in a well-functioning team, people will communicate rather than keep the problem to themselves. The project’s operational procedures should make it clear how to handle some of these cases. For example, when someone finds a design error, they work with the person responsible for the design to find a solution, and they let others know whose work could be affected by the design change. Ideally, they will ask for feedback from these other people to make the proposed changes work for related parts of the system.

Communicating about exceptional situations only works if both the person raising an issue and the recipients can trust that the message will be heard and acted upon, and that all the parties involved will handle the matter respectfully. Much has been written about how to create an environment where this happens—see Johnson [Johnson22], for example—and I will not try to add to what others have written.

8.4.6 Principle: Train team in communication skills

Communication is only effective when information passes accurately among the participants, and when everything that needs to be communicated gets heard. Effective communication is a skill that can be learned.

There are many ways communication can go wrong. One person can say something and the other person can understand something different. Something can be said that causes the hearer to have an emotional reaction that interferes with hearing and understanding. Two people can be trying to exchange multiple pieces of information, but things interfere and some key information doesn’t get shared. Someone can have something important to say, but withhold the information out of fear of an inappropriate reaction from the person who needs to hear it.

In safety-critical environments, such as air traffic control, pilots and controllers talk using a pre-defined vocabulary, follow pre-arranged patterns for who can talk when, and each party always reads back key information to confirm correct understanding [JO711065]. These rules have been developed over the years to ensure that each party can speak when they need to, that everyone involved will understand what is said in the same way, and that key information is checked.

A well-functioning team has a shared culture of communication practices. These practices include many of the principles found in ATC communication, such as careful definition of terms and reading back or paraphrasing to confirm what has been heard (sometimes called active listening). In addition, people will have uncomfortable things to say and hear while working to build a system, and the team’s communication practices will have to handle messages that could trigger emotional reactions without breaking trust within the team. The communication practices should also encourage regular communication to happen as a matter of course, rather than relying on people remembering to talk to each other.

There is a lot of useful information available in books, courses, and classes on how to improve communication within a team.

8.4.7 Principle: Provide independent resources for checks

Explicitly organize the team so that people have responsibilities for checking others’ work, including through reviews and by doing testing. Manage relationships in the team to keep the checking from being taken personally.

Building checks into the work plan is a principle listed above. The principle of doing checks requires having team members available to do those tasks. Having someone who did not do the design or implementation perform checks improves the odds that they will find a problem, because they do not share the implicit assumptions and biases of the designer or developer. This implies that a well-functioning team will be staffed to provide for independent checks, and that some team members know they will be responsible for checks.

It is easy to underestimate the effort required for reviews and tests. Doing a meaningful design review takes significant effort, because the reviewers need to actually understand the design—not just look for particular easy-to-find markers that might indicate a problem.

I have heard many opinions about how much of a team’s effort should be allocated to reviews and checks, anywhere from half the effort to a small fraction. My own experience has been that the teams where about one-third of total effort was allocated to reviews and testing had better outcomes than the teams with less effort available. The appropriate fraction of resources is likely dependent on many factors not yet appreciated.

Reviewers and testers can end up having an adversarial relationship with designers and implementers, and so the way reviewing and testing tasks are allocated requires some care. In one organization I worked with that had permanent testing teams separate from developer teams, the developers looked down on the testers and relations between the teams were sometimes difficult. While some tension is useful so that the work remains independent, careful management will monitor the relationships and work to ensure that the interactions between developer and checker do not become personal and that the skills required for both roles are honored.

Part III: Systems

A detailed model of what systems are. This includes

XXX add synthesis

Chapter 9: Purpose

17 August 2023

9.1 Introduction

Creating a system requires time, effort, and many other resources. The result of spending those resources should be worth the expenditure: the system should do something useful for someone.

This is another way of saying that the system should have a purpose, and that the purpose should be expressed in terms of what the system can do for the people or organizations that will depend on it. This definition of a system’s purpose means that it depends both on what the system does and who it does it for; both must be worked out to be able to accurately reason about a system’s purpose.

The list of who the system is for should be expansive, including everyone who has an interest in the system. This includes the system’s users, who will need to benefit directly from what the system does. It also includes the people or organizations who build and maintain the system and their investors, who will need to get benefit from the effort and resources they put into making the system. It includes others, such as regulators or industry groups, who represent the public interest in avoiding dangerous activities. This list amounts to the (often-abused) term stakeholder, interpreted broadly.

Each of these stakeholders will have a different interest in the system. The needs of each stakeholder must be discovered and recorded. Users derive benefit from the system’s explicit behaviors. Builders and funders derive benefit from compensation for the system, and in the longer term from the potential opportunity to evolve the system, provide it to others, or develop new systems. Regulators, industry groups, and the public derive benefit from how the system affects the world at large in terms like safety, fairness, or security. All these needs must be satisfied, and they cannot be satisfied reliably unless they are known.

9.2 Why purpose matters

Purpose provides a basis for deciding whether something is worth doing, or for choosing among different ways to do something. It guides the design and implementation: each part of the design can be judged on whether it adds to meeting the purpose or not. The sum can be judged on whether it meets all or enough of the purpose to justify building or deploying the system.

This principle applies to parts of the system as well as to the system as a whole. Each part has a purpose that it needs to fulfill in order for the system to fulfill its purpose.

Purpose matters because of what happens when one does not give it enough consideration. I illustrate this with two examples, from among the dozens I have encountered.

Early in my career, I was tasked with building software that would be used by machine shop workers to process repair work orders and manage parts inventory. This system would be installed on minicomputer systems with terminals around the shop. I had what I thought was a clever idea for the user interface, based on the ideas of non-modal UIs that were beginning to enter the world in the early 1980s. The result met all of the functional requirements needed—and was completely unusable. I had focused on building something I thought surely would be good without doing the work to understand the needs of the shop workers who would use the system.

More recently, I worked with a startup that was building a software system to control a small vehicle. The software designer had decided that the foundational software infrastructure should provide an event loop mechanism, where the infrastructure would cycle at some frequency, and in each cycle would call functions to read sensor data, perform computations, and write commands to actuator devices. This is a common design pattern for this kind of system, and a reasonable starting point. However, when the designer was asked how they envisioned this being used to implement PID controller logic, it turned out that they had never considered what a controller would need, and many necessary capabilities were missing. By the time the first version of the system was released for deployment, the vehicle had no control systems implemented.
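
As a rough illustration of the gap, the following sketch (in Python, with hypothetical names; it is not the startup’s actual design) shows a minimal fixed-rate event loop and what a PID controller needs from it: state that persists between cycles and the elapsed time since the previous cycle. Whether the infrastructure provides such capabilities is exactly the kind of question that working through the component’s purpose would have raised.

    # Minimal sketch (hypothetical names, not the startup's design) of a fixed-rate
    # event loop and the hooks a PID controller needs from it: state that persists
    # between cycles and the elapsed time (dt) since the previous cycle.
    import time

    class PidController:
        def __init__(self, kp, ki, kd, setpoint):
            self.kp, self.ki, self.kd = kp, ki, kd
            self.setpoint = setpoint
            self.integral = 0.0          # controller state that must survive between cycles
            self.prev_error = None

        def update(self, measurement, dt):
            error = self.setpoint - measurement
            self.integral += error * dt
            derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
            self.prev_error = error
            return self.kp * error + self.ki * self.integral + self.kd * derivative

    def run_loop(read_sensor, write_actuator, controller, period_s=0.01):
        """Cycle at a fixed rate: read sensors, compute, write actuator commands."""
        last = time.monotonic()
        while True:
            now = time.monotonic()
            dt = now - last              # controllers need the real elapsed time
            last = now
            measurement = read_sensor()                   # 1. read sensor data
            command = controller.update(measurement, dt)  # 2. perform computations
            write_actuator(command)                       # 3. write actuator command
            time.sleep(max(0.0, period_s - (time.monotonic() - now)))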

The common thread in these examples is that in neither case did the person responsible work through the system’s purpose in order to ensure that what was built would be useful. Instead, the designs were based on an unvalidated belief about the right design, and the choices resulted in unusable implementations.

In both cases, a significant amount of time was spent building a system that did not work. In both cases the resulting system could potentially have been redesigned and reimplemented, but building the wrong thing had used up the available time and delivery deadlines were close by the time they were finished. In the case of the shop management system, the project subsequently failed as a result. In the vehicle control system, at the time of writing it remains to be seen if the team can get funding and time to correct the errors.

Both examples would probably have turned out better if effort had been put into a proper articulation of what the system needed to provide before anyone went into depth on design.

I discuss gathering information about purpose and documenting it in Chapter 31.

9.2.1 Not monolithic or fixed

While it would be nice if purpose could be defined once and then remain fixed for the life of the system, this rarely happens.

First, a system’s purpose is rarely fully understood, especially at the beginning of a project to build the system. A team can begin by talking to potential stakeholders and finding out what they need, but inevitably someone will recognize an important system behavior well after design or implementation is in progress. Not all of the stakeholders may be apparent at the beginning: for example, in one project I worked on, insurers turned out to be an important stakeholder, but we didn’t appreciate that for quite some time. A team must expect that their understanding of a system’s purpose will be rough at the start and become more accurate over time.

Second, purpose is not usually monolithic: there are many things that could be part of the system’s purpose, and usually people want many more things than are practical to build. The list of potential features usually has to be narrowed down from a long list of user or stakeholder wishes to a short list of the most important features—perhaps with a plan to add more capability over time. This means being able to separate the different features or properties and rank them by importance and achievability. A team must expect that items will be added and removed from a system’s agreed-upon purpose as time goes by.

Finally, needs change. If a project to build and deploy a system takes a few years, the world in which it is deployed will likely be different from the world when the project started. Available technology may change, or the user’s market may have shifted, or new regulation may come.

The result of these conditions is that a system’s purpose is not fixed, and the team building the system must be prepared for these changes. Being prepared means regularly checking for changes in stakeholder value and recording what is learned. It means using design and development processes that can adapt to these changes when they happen. And it means a management commitment to managing change honestly, pushing back on user requests when needed and supporting the development team when changes need to be made. It also means that an organization must be prepared to end a project when the system’s intended purpose no longer has enough value to its stakeholders.

Chapter 31 discusses how to gather information about purpose, and how to work with that as the understanding changes.

XXX add references to prototyping and end user validation

9.2.2 Inconsistent or conflicting purposes

Having multiple stakeholders usually means that two or more stakeholders will have incompatible needs or desires. Even a single stakeholder may have conflicting desires.

This can cause two problems. First, conflicting needs make it hard to design a system that meets its purpose. Second, conflicting objectives make it harder to rank and choose among potential system objectives.

There is no simple recipe for handling such inconsistencies. One first has to recognize when an inconsistency or conflict exists, which requires understanding what all the stakeholders are saying and understanding the implications of that information. Then one has to work with the stakeholders to find a resolution—be that a negotiation that produces a compromise, or a realization by one party that their needs cannot be met. This can lead to difficult discussions, especially with customers: it is hard to tell a customer that current regulations make some feature they strongly desire illegal.

9.3 Explicit purpose

A system’s or component’s purpose can be separated into explicit and implicit parts. I use a simplified eVTOL aircraft as an example to explain these.

The explicit part is what stakeholders who will directly use the system say they need. This includes:

The stakeholder can only rarely specify exactly what they want. They may have a general idea, but it often requires several discussion sessions for them to express the idea clearly. The team eliciting the purpose from them usually needs to employ active feedback techniques, providing the stakeholder with an interpretation of what the team thinks they have said in order to validate that they have correctly understood the needs.

See ! Unknown link ref for more about different kinds of stakeholders and projects, and what must be done to learn about each kind.

9.4 Implicit purpose

A system’s implicit purpose comes from stakeholders who are involved in the system but are not its direct users.

9.5 Using purpose

A system’s purpose must guide its design and development. This means that the purpose provides the standard on which design and management decisions can be made. There are several activities in system development that depend on purpose.

undisplayed image

First, a project must actively gather and validate its understanding of the system’s purpose. This activity must be explicitly planned for, and sufficient time and resources provided. The resulting information should be validated with the customer and recorded in artifacts that can be referenced throughout the life of the system.

Second, the desired purpose is almost always more complex than what can be developed feasibly at first. The initial desires need to be ranked and pared down to what is essential.

Third, every project has a “go/no-go” decision checkpoint, when the team decides whether to proceed with building a system or not. The fundamental question is whether a system can be built that meets all its important purposes, and this requires an analysis to determine whether that is feasible. Is it likely that a system can be built that meets the customer needs? And that will provide necessary compensation to the organization that builds it? Will other stakeholders agree to it? If the answer is no to any of these, then the team should not proceed further in building the system.

Next, purpose should guide design and implementation decisions. Each part of the system must play a role in meeting a stakeholder need, and the team should be able to articulate how it does so. If some part does not support the system purpose, it should not be built. If there is a choice to be made between different design or implementation approaches, the one that best meets the system’s purpose should be chosen. Moreover, the team must be able to explain how each of these choices was made. Chapters ! Unknown link ref present methods for ensuring this happens.

Finally, the system’s design and implementation should be checked against the decided purpose. ! Unknown link ref discusses system validation and acceptance.

Sidebar: Summary

Chapter 10: System scope

20 May 2024

10.1 Introduction

A system’s purpose defines why it exists—the reasons it might be built.

What the system is comes next. This is a high-level view of what the system is and will do—and not how it does those things.

The definition of what a system is starts by defining the boundary between the system and the rest of the world. There are things that are part of the system, which I will call the system’s scope. The rest of the world provides an environment in which the system operates. Interactions between the system and its environment take place at the boundary between them.

undisplayed image

The things within the system’s scope are what is being built, and thus under the control of the team making the system. This includes the functions, behaviors, and qualities of the system that are visible from outside the system. These are interactions between the system and its environment across the boundary between the two. These interactions should fulfill the system’s purpose.

What is in the environment is not under the builders’ control. The team building the system should understand these things, but they can’t be changed.

The environment includes the things that interact with or use the system. This includes things that go in and out of the system, physical places where the system operates, and the ambient environment (atmosphere, electromagnetic forces, dust, vibration, or radiation). The environment also includes those who will use the system, and thus define the purpose for the system to exist.

A caution: the system’s scope covers what the system does, and does not address how the system does that. Matters of how the system is designed to meet its scope are separate.

Sidebar: Keeping specification and design separate

One often reads statements like “good practice is to keep specification separate from design” or “requirements should not address the how, only the what”.

Why is this good practice?

The separation comes from the difference in the tasks involved in working out what something should do and how it should do it. Working out a system’s purpose or scope is a matter of working with a customer, real or potential, to learn about needs in the world outside the system-building project. Everything relating to the purpose or scope should come from other people and organizations—the team may choose which needs they will try to meet, but they cannot in general act as if the customer wants something different from what they actually do. The design, on the other hand, is about figuring out what kind of system design will meet those needs. All the decisions about the system’s design are within the control of the team, as long as those decisions end up supporting the customer’s needs.

In other words:

Purpose and scope come from the outside, not from within the project.

Design and implementation are for the project team to work out.

When design decisions are mixed into scope or specifications, it is often a sign that the team has skipped over some of the deliberative steps of working out why some design is the best choice and jumped directly to a conclusion. This also can impose false constraints on the team: I have seen people avoid looking at design alternatives because they believe that some design decision came from a customer and can’t be changed.

Mixing scope or specification and design also can cause problems later, when the system is modified. Someone working out how to change a system needs to know why certain design decisions were made in order to understand the effects of changing the design. When specification and design are mixed, people often don’t record the rationale behind the design decision and the people who later need to understand the rationale have to guess.

10.2 Why scope and boundary matter

Building a system starts with working out the system’s scope. All of the specifications of the system are a refinement of the scope, and all of the design follows from that.

At some point early in a project, one will want to know how big the effort to build the system will be. This depends on knowing what the system will be.

Knowing what is in the environment—and thus not changeable by the team building the system—defines constraints on building the system.[1]

Finally, defining a system’s scope, boundary, and environment provides a way to check that the team understands the customer’s purpose properly by asking the customer to review the scope and boundary.

10.3 Content

The what of a system is the root of all the design of the system and its pieces. As discussed in the next chapter, the model of the whole system is the root of a hierarchy of component parts that define how the system is made. That chapter provides a model for how to define each component, including the system as a whole.

The team will use documentation of the system’s scope and boundary over and over as they build the system, meaning that the information should be organized in artifacts that people can readily find and understand. (See Chapter 17 for discussion of what this implies.)

The system’s scope includes a few kinds of information: a concept, objectives, constraints, assumptions, and environment.[2]

A concept for the system provides a description of what the system will do. The concept is generally narrative, telling stories about the system. The concept should include major usage scenarios, for how the system’s customer will interact with the system and how the system will interact with its environment. I have often used a few diagrams to illustrate the concept. People looking at the concept should come away with an understanding of generally what the system will do and, equally important, what is not in the system’s scope.

Objectives (or goals) are a more organized way to present similar information. This takes the form of a list of the things people want out of the system: its behaviors or functions, and its properties. These will be general statements, and the process of developing a specification of the system will refine these into something more precise.

Constraints list limitations on acceptable system designs. The constraints do not establish what the system does, but only constrain how well it does those things. Many constraints relate to safety or security. For example, the system might need to meet some safety standard. Initial constraints will be general, and they will be refined as the system’s specification and design are worked out. Many constraints lead to analyses that work out in greater detail what these constraints imply.

The assumptions record information that affects how the system can be designed but that might otherwise be forgotten or overlooked. (This parallels the choice between making objectives explicit and leaving them implicit.) The assumptions are often organized as a list, and they guide later design decisions.

Finally, the environment lists information about the world in which the system will operate. The environment constrains how the system can operate or how it must be designed: to accommodate a certain level of vibration, for example, or that cellular radio coverage will be variable over a region where a vehicle will operate.

10.4 Using scope and boundary

The scope and boundary are a realization of the system’s purpose. The record of the scope should be traceable to the purpose. The team uses the scope and these traces to check that the definition of scope meets all of the system’s purpose, and that there aren’t significant parts of the scope that are not based on some part of the purpose.

The act of defining the system’s scope helps reveal the details of the system’s purpose and constraints. Discussions with a customer or other stakeholder are usually informal and incomplete. The discussions result in notes and drawings, but they are rarely directly usable for working out the system’s specification. The tasks of working through records of those discussions and organizing a model of the system’s purpose will reveal ambiguities in what the customer has said, or gaps or inconsistencies. The team can then work with the customer to resolve those issues so that the definition of scope is more complete.

The team will use the definition of the system’s scope to document top-level specifications for the system, which then inform the system’s design and its decomposition into component parts.

As the project moves forward, the team will work out the design for high-level system properties such as safety, security, or reliability. The tasks that build the designs for these emergent properties (Section 12.4) begin with the definitions of what safety or security the system is expected to provide. Those definitions are part of the scope.

Sidebar: Summary

Chapter 11: Component parts

20 May 2024

11.1 Introduction

In this book, I describe a system in terms of its parts and its structure. The system overall has a purpose, which can be described in terms of things it should do or properties it should have. The system meets this purpose by combining the parts together with the structure of how the parts interact. One should be able to show that the desired system behavior and properties follow from the combination of parts and structure.

In this chapter, I start by discussing components, the term I will use for parts. In the next chapter I will discuss structure, and how the combination of parts and structure leads to emergent properties that meet system needs.

Terms. I use the term “component” as a generic term for a part of the system. Some approaches use different terms, such as “element” or “item”. Other approaches use different terms depending on the level of encapsulation: system, subsystem, component, subcomponent, for example. I use the term “component” throughout, with “system” reserved for the system as a whole, and “subcomponent” used to denote a component that is part of another component.

11.2 Definition of component

A component is something that is part of a system and that people can think of as a unit. “Unit” implies some kind of singular aspect to the component: one purpose, one implementation, or one boundary, for example.

The unitary nature of a component means that the world can be divided into that which is within the component, that which is at the boundary, and that which is outside the component—its environment and the rest of the world.

This definition implies that different people will see different things as unitary components, often depending on the level of abstraction one wants to work with. One person may think of “the electrical storage system” as a unitary component, while another person may think of battery cells and power regulator chips as components, and the electrical storage system as a collection of components. Both views are correct, and both are useful at different times or for different people.

The focus on unitary purpose or boundary is a way to address complexity in a system. The focus is meant to help humans organize and understand the system they are working with by taking a divide-and-conquer approach. It means that some people can focus their attention solely on the component, making sure that it is designed and implemented to meet its purpose while not having to think about the rest of the system. The focused attention on the component must be complemented by attention on the system structure that connects the component to others, as described in the next chapter.

There are three related principles that can help identify what is a component and what is not. (Some of this is based on principles presented by Parnas [Parnas72].) These are only guides, and there are exceptions to each of them.

The goal of the first principle is to organize components around their purpose. If a thing has multiple purposes, that suggests that it might be divisible into smaller parts, each with a sharper focus, or that part of the thing might be better combined into something else with a similar purpose. On the other hand, if there is some feature that is implemented by more than one component, then those components are candidates to merge together. This is particularly true when those components contribute to some important emergent property (Section 12.4).

The second principle addresses how independent a thing is from other things. Independence can be viewed in terms of causal relationships with other components, as covered in the next chapter. The more tightly two things are related, the more they will have to be designed, implemented, and tested together; the less they are related, the more they can be worked on independently. If two things are strongly related, one should consider merging them into a single component; if they are loosely related, they can be more readily treated as separate components.

The final principle is also related to independence. If the design or implementation of a thing can be replaced with little or no effect on the design of the rest of the system, then that is evidence that the thing is independent and can be treated as a component. Having clear and narrow interfaces between the thing and the rest of the system is a sign that the component is independent. More broadly, replaceability is often an indication that something should be considered a separate component.

There is one additional indication that something should be treated as a component: when it is something that is usually sold or acquired as a unit. Electronic chips, antennas, motors, and batteries are all generally bought as units. Software packages are often acquired as units, whether bought or acquired as open source. A person hired as a contractor to fill a commonly-defined role can be seen as a component in a system.[1]

11.2.1 Component purpose

Every component has a purpose, which defines how that component contributes to the system as a whole. “Purpose” is a broad term, including behaviors that the component should have, properties it should exhibit, or functions it should provide. A component’s purpose is not necessarily defined precisely; sometimes, the purpose is a somewhat ambiguous prose statement of what a human wants the component to do or be. Turning that ambiguous statement of purpose into a precise and actionable definition is part of the engineering process. I discuss this in ! Unknown link ref.

I discussed the purpose of the whole system in Chapter 9. The system purpose is the purpose for the top-level component in a hierarchy, which represents the whole system.

Most human-designed components have a single primary purpose or property, possibly with multiple secondary purposes. Consider a battery: its primary purpose is to store electrical energy and make it available to other parts of the system. The battery may have a number of secondary purposes, such as providing mechanical structural rigidity, providing thermal mass to help maintain a constant temperature in the system, or contributing to the location of the system center of mass.

Each component has a number of properties that derive from its purpose: its state and behaviors, the inputs and outputs it can provide, and constraints on how it should be used. The documentation of these properties provides an unambiguous and precise specification of the component.

People working on the component need to have the purpose (and the specification that derives from it) available as they do their work. This information guides how they design the component, and how they verify that a design or implementation meets its needs. It is important that all of the purpose is available to them in one place so that they know they are considering everything they need to consider, without hidden surprises they couldn’t find.

11.2.2 Limits of the component approach

Components help human engineering and understanding—but when humans aren’t doing the design, there are limits on how the approach applies.

Consider a mechanical structure designed with a generative design tool. The tool can take in a specification of what the structure should achieve—forces, attachment points, and so on—and will find a design that optimizes for given criteria such as weight or cost. These structures often do not resemble ones people design because the tool can explore a more complex design space than a person can, and as a result it often produces substantially better results than human designs. Such designs can also potentially co-optimize multiple functions, such as a mechanical structure that includes channels for coolant flow within the structure or that meets RF reflection properties. While a person could make such a design, generative tools can do so at far lower cost.

As a second example, consider a neural network trained to recognize elements in a visual scene. The neural network is designed by performing a training process that uses a large number of examples of the kind of recognition the system should perform. The resulting network is typically much more accurate than a manually-designed algorithm. However, it is difficult to investigate the network itself to determine how the connections in the network lead to accurate (or inaccurate) image recognition. It is difficult to look at a specific connection in the network and explain how it affects the result, or how changing that setting will change recognition properties.

Both these examples are components that will be part of a larger system. As components, they have a defined purpose, from which a specification can be derived defining what the component should do. From there, automated methods take over to produce the design (for the mechanical part) or directly produce the implementation (for the visual recognition component). If these components were designed by people, we would expect that we could review and understand the component’s design as a check on its correctness. For machine-generated components, however, we can only verify that the design or the implementation complies with its specification.

There is one significant difference between the two examples: how they can be verified for compliance with their specification. A mechanical component’s specification is generally complete: all of the conditions in which the component should function and the component’s behavior in each environment can be specified. This means that compliance can usually be checked using finite element analysis software tools, and example components can be built and subjected to their intended loads. Components implemented using neural network methods, on the other hand, usually are expected to function in a complex environment that is too large to fully enumerate. The training methods use a number of example cases, and induce from those examples an implementation that should properly generalize to all, or enough, real cases. The component’s compliance therefore cannot be completely verified; it must be checked statistically.
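
As a sketch of what statistical checking can look like (one possible approach, not a prescription), the following measures a recognition component’s error rate on a test sample and attaches a simple Hoeffding-style confidence bound; the sample size and failure count are illustrative.

    # Sketch of one way to check compliance statistically: measure the error rate
    # on a test sample and attach a simple Hoeffding-style confidence bound.
    # The sample size and failure count here are illustrative.
    import math

    def error_bound(num_samples: int, confidence: float = 0.95) -> float:
        """Half-width of a (confidence)-level bound on the true error rate."""
        delta = 1.0 - confidence
        return math.sqrt(math.log(2.0 / delta) / (2.0 * num_samples))

    failures, samples = 37, 10_000
    measured_error = failures / samples                 # 0.0037
    print(f"measured error {measured_error:.4f} "
          f"+/- {error_bound(samples):.4f} at 95% confidence")   # +/- 0.0136

The bound shrinks only with the square root of the number of test cases, which is one reason statistical verification of such components requires large, carefully chosen test sets.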

11.3 Divide and conquer: the component breakdown structure

The component approach involves breaking down the system into unitary component parts, in order to make each part manageable by a person. However, as we have seen, different people use different levels of abstraction to understand the parts of the system.

In practice, people divide up a system first into major subsystems, and those into smaller components, and so on until the components are simple enough to deal with. This recursive division defines components at varying levels of abstraction: the electrical power system as a whole, with the power storage, power distribution, and power generation components as parts of the overall power system.

The following is an (intentionally) partial breakdown structure for a spacecraft, illustrating how the spacecraft as a whole (the “space segment” of the whole system) is organized into multiple trees of components.

undisplayed image

This recursive division creates a tree-structured component breakdown structure of the parts of the system. The breakdown structure organizes components in a way that helps people find components, including both finding a specific component that they are looking for and discovering related components that they do not already know about. The structure also defines levels of abstraction that allow people working at different system levels to focus their attention.

The breakdown structure organizes components, but it does not define the system structure, which I will discuss in the next chapter. The system structure defines how components interact with each other, which generally crosses different parts of the breakdown structure tree.

The system and high-level components should be broken down into subcomponents that have a strong internal relatedness and weaker relationships between subcomponents, as I discussed earlier. In doing so, the high-level component provides an abstraction of its subcomponents. This usually means breaking into subcomponents either by function or physical location. Most people think first of dividing by function: the electrical system, the hydraulic system, the communication system. Location is often more implicit. For example, a space flight mission is organized first by ground system, launch system, and flight system (physical locations) and then by function in each location.

A system will not necessarily have a single optimal breakdown structure. When there is no single best structure, one must pick one approach and stick with it. Some systems will have lower-level components that contribute to multiple high-level functions. If the system is organized according to the high-level functions, then the low-level components could fit into multiple branches of the hierarchy. I will discuss this further in the next chapter, when I cover how one uses hierarchy to organize the structure of the system.

Keeping components together that are functionally related is important. Part of the purpose of the hierarchy is to help designers and implementers: the hierarchy should guide them toward the information they need and should not hide or lead them away from that information. I have worked on some projects where the team decided to consider the esthetics of the hierarchy and tried to balance the depth of branches. While the resulting hierarchy was easier to draw, actually using the organization became more complex and error-prone. High-level components no longer provided an abstraction of a collection of subcomponents as a whole. Instead, the collection of related subcomponents was split between two or three high-level components; nowhere was the one abstraction of the whole set represented. Building specifications, tests, and project plans became harder because related things were no longer related in the hierarchy.

11.4 Component characteristics

Each system component is defined by a number of characteristics. These characteristics define an external view of the component: information about the component that can be observed without knowledge of how the component is designed internally. The characteristics constrain the component’s internal design, but should only include those aspects that will affect how the component fits with other components to make up the system.

There are six kinds of characteristics in the component model used in this book:

Form. The “shape” of the component. The component does not typically change its form over time. For physical components, form is obvious: the geometry of the volume or area that the component occupies. Form might include the material of which a physical component is made. For electronic or data components, form is how it is packaged: a data file in some format, or a software component in the form of an executable application.

Examples include:

State. This is the mutable “condition” of the component at a particular point in time. More formally, state is the information that is necessary and sufficient to encapsulate the past history of the component, so that any reaction that the component performs to some input is fully determined by the input and the state. State can be discrete (such as binary-encoded digital data) or continuous (such as the angle and angular momentum of a rigid body at a point in time).

Practical examples include:

Actions or behaviors. These are the state changes that the component can perform. Some behaviors are reactive, meaning they are initiated by some input. Other behaviors are continuing, meaning that they continue to be performed without further input.

Examples of reactive behaviors:

Examples of continuing behaviors:

Interfaces. These are the ways in which a component is connected to other components in the external world, and are the only way to observe the component’s behavior from outside. Inputs can be given to a component, and output can be received from it. Inputs and outputs create a causal relationship between actions in one component and another. Inputs trigger reactive behaviors in the component that receives the input. Outputs can be a result of a reactive behavior, or an observation of a continuing behavior. Outputs are the only way another component can observe information about a component.

Examples of inputs:

Examples of outputs:

Non-functional properties. Components often have some properties that do not change over time (or change very slowly). These properties are not state per se, but they create important constraints on the component’s design and implementation and affect how the component should behave.

Some non-functional properties:

Environment. A component is also characterized by the expected environment in which the component will operate. This can be viewed formally as part of the component’s interface, but in practical terms it is useful to call it out separately. The environment specification typically includes information like the storage and operating temperature range, humidity, atmosphere, gravitation or acceleration, electronic signal environment, or radiation.
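
To make the six kinds of characteristics concrete, here is a small sketch (in Python) of how a component’s external view could be captured as structured data. The field names and the battery values are purely illustrative, not a standard.

    from __future__ import annotations   # lets the dict[...] annotations work on older Pythons
    from dataclasses import dataclass, field

    @dataclass
    class ComponentSpec:
        """External view of a component: the six kinds of characteristics."""
        name: str
        form: str                                                      # shape, packaging, material
        states: dict[str, str] = field(default_factory=dict)          # state name -> description
        behaviors: dict[str, str] = field(default_factory=dict)       # behavior -> reactive/continuing
        interfaces: dict[str, str] = field(default_factory=dict)      # interface -> direction
        non_functional: dict[str, str] = field(default_factory=dict)  # property -> requirement
        environment: dict[str, str] = field(default_factory=dict)     # condition -> expected range

    battery = ComponentSpec(
        name="battery",
        form="prismatic cell pack, 200 x 150 x 40 mm",
        states={"charge": "0-100% state of charge", "temperature": "cell temperature"},
        behaviors={"discharge": "reactive (on load)", "self-discharge": "continuing"},
        interfaces={"power terminals": "output", "charge input": "input", "thermal path": "output"},
        non_functional={"mass": "<= 2.5 kg", "capacity": ">= 500 Wh"},
        environment={"operating temperature": "-10 to +45 C"},
    )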

! Unknown link ref details more about components and how to specify them.

11.4.1 Characteristics and hierarchy

A high-level component provides an abstraction for the subcomponents that make it up. This implies that each of the characteristics of a high-level component—its form, state, behaviors, and so on—needs to be reflected in the subcomponents. For example, if the high-level component has some state A, then one or more of its subcomponents must have some state that, when aggregated, implements A. If the high-level component has form B, then the subcomponents, when put together, must have that same shape.

Consider a radio communications component. The purpose of the component is to send and receive data packets with another radio somewhere else. The radio component has interfaces to communicate data with another local component, an interface to emit and receive RF signals, and other interfaces for control, configuration, power, and heat transfer. This example radio component, similar to those that might be used on a small spacecraft, has an antenna that is initially retracted but can be deployed on command.

undisplayed image

The radio is built of a number of subcomponents. These subcomponents must implement the state of the radio overall, as well as all its interfaces. The diagram below shows a simplified possible implementation.

undisplayed image

The set of subcomponents implements each of the interfaces named in the high-level radio component. Many of them are provided by the transceiver component, but the antenna handles the RF signal sending and receiving.

The state of the high-level radio is divided over the subcomponents. Again, much of the state is contained in the transceiver component, as it performs the data manipulation. The deployment state is a physical property of the antenna: it is either retracted or extended.

In the example implementation, however, there are multiple powered components—the sensor and actuator related to deploying the antenna in addition to the transceiver. This results in a more complex power state than defined in the higher-level radio component: some of the components could be powered on while others could be powered off, rather than a binary on/off overall state. During design, discrepancies like this should lead to improving the specification of the state of the high-level component.
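
A small sketch (with illustrative names and rules, not a specification) shows the aggregation issue: once the transceiver, deployment sensor, and deployment actuator each have their own power state, a simple on/off state for the radio as a whole no longer captures every possible condition.

    # Sketch of the aggregation issue described above: once the transceiver,
    # deployment sensor, and deployment actuator each have their own power state,
    # the radio's overall power state is no longer a simple on/off.
    SUBCOMPONENTS = ("transceiver", "deploy_sensor", "deploy_actuator")

    def radio_power_state(powered: dict) -> str:
        """Derive an abstract radio power state from subcomponent power states."""
        on = [name for name in SUBCOMPONENTS if powered.get(name, False)]
        if len(on) == len(SUBCOMPONENTS):
            return "on"
        if not on:
            return "off"
        # A condition the original two-state model of the radio cannot express:
        return "partially powered: " + ", ".join(on)

    print(radio_power_state({"transceiver": True,
                             "deploy_sensor": False,
                             "deploy_actuator": False}))
    # prints: partially powered: transceiver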

11.5 Downsides

As I have noted, breaking a system into separate and independent components benefits the people who need to understand the components. This advantage generally outweighs other considerations, but there are downsides to this approach.

The first downside is that a reductive approach doesn’t allow for many kinds of system optimization. Having two separate components means that the two are not jointly optimized.

Software language compilers illustrate this. If each program statement is considered independent, the compiler translates each statement into a block of low-level machine code. However, optimizing compilers break this independence, and gain large speed improvements in the generated code. For example, a code optimizer can detect when two statements perform redundant computations and merge them. An optimizer can detect that a repeated computation (in a loop, for example) can be moved out of the loop and performed only once.
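
A hand-written before-and-after sketch in Python illustrates the kind of transformation involved; an optimizing compiler performs the equivalent on the machine code it generates.

    # Hand-written before-and-after sketch of the kind of change an optimizer makes;
    # a compiler performs the equivalent on generated machine code.
    import math

    def total_unoptimized(values, scale):
        total = 0.0
        for v in values:
            total += v * math.sqrt(scale)   # loop-invariant work repeated every iteration
        return total

    def total_optimized(values, scale):
        k = math.sqrt(scale)                # computed once, outside the loop
        total = 0.0
        for v in values:
            total += v * k
        return total

Treating each statement as independent would forbid the second form; the optimizer must first prove that the two forms behave identically.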

Software optimizers allow a developer to write understandable code while the optimizer performs transformations that can be proven to maintain correctness but that make the resulting machine code hard for a person to understand and verify. There remains the possibility of system optimizers that perform similar translations, but they are not generally available today.

The second downside is that breaking a system into many components creates an organizational problem: how does one name or find a particular part? A hierarchical component breakdown can help organize the pieces.

11.6 Why components matter

We split complex systems into component parts in order to make parts that are understandable by the people who have to work on the parts. The approach also makes it easier to manage parallel design, implementation, and verification of the parts. If one wants to acquire a component from an outside source, having a definition of what the component is helps the acquisition process.

Each of the people working on the system needs information to work on their parts. Defining a component provides a locus around which to organize the information related to a component. Having a model of what a component is provides a basis for designing artifacts that will contain the right information.

Different people will need to work at different levels of abstraction in the system. Organizing the components hierarchically provides these different levels of abstraction.

The people working on the system need to find pieces of the system, both when they are looking for information about a specific piece and when they are trying to learn what the pieces are. The hierarchical structure provides a way to name and find information about a component, and provides a structured index to help people browse and discover.

Finally, it is generally understood that the structure of a system is related to the structure of the team that builds it [Conway68]. I discuss this further in Chapter 19. XXX add ref to detailed team structure chapter

Sidebar: Summary

Chapter 12: Structure and emergence

1 July 2024

12.1 Introduction

Component parts of a system define the building blocks out of which a system can be built, but by themselves they do not create the complex, high-level behaviors that systems are built to exhibit. System behaviors and properties arise from how the component parts work together. How the components are connected, and how they interact over those connections, is the structure of the system.

In this chapter, I define what is meant by system structure and provide examples of how behavior can emerge from the combination of components and their interactions.

To build a system, one generally has to build a model of what the system is and does. This model will play essential roles in designing a system and analyzing its design. Enquiry into how to organize information about a system’s structure helps one develop a useful model, and so in this chapter I present an informal way to model a system’s structure.

12.2 Definition

The meaning of “system structure” has been debated, but I use the following definition, chosen for its engineering utility:

Structure is how each component part’s behavior relates to each other component part’s behavior.

This structure can be expressed as the graph of how components affect each other.

Components can be related in two different ways:

Functional relationship. The functional relationship is a relation from one component to another that maps how some output on an interface of one component can potentially be received on an interface of another component, and thereby cause a reaction in the receiving component. That is, the functional relation is a map of possible interactions that can be viewed as a directed graph, with components as nodes and directed edges showing how causation can flow between them.

undisplayed image

Consider two electronic components connected by a signaling line, similar to those used in several serial communication standards. One component is able to send a signal on the line by changing the voltage relative to a common ground; the other component is able to observe the voltage and determine what signal was sent. By sending a sequence of different voltage levels, the sender can transmit a series of zero and one bits over the line to the receiver. The receiver can decode the bits into a message, perhaps containing a number, and act on the message it has received.
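
The directed-graph view can be captured in a simple data structure. The following sketch (with hypothetical component names) records which components’ outputs can reach which other components, and checks whether causation can propagate from one component to another.

    # Minimal sketch of the functional relationship as a directed graph:
    # components are nodes; an edge A -> B means an output of A can be received
    # by B and cause a reaction there. The component names are hypothetical.
    functional_relationships = {
        "sensor":     ["controller"],             # sensor output can affect the controller
        "controller": ["actuator", "telemetry"],  # controller output affects both
        "actuator":   [],
        "telemetry":  [],
    }

    def can_affect(graph, source, target, seen=None):
        """Return True if behavior in `source` can causally propagate to `target`."""
        if source == target:
            return True
        seen = set() if seen is None else seen
        seen.add(source)
        return any(can_affect(graph, nxt, target, seen)
                   for nxt in graph.get(source, ()) if nxt not in seen)

    print(can_affect(functional_relationships, "sensor", "actuator"))   # True
    print(can_affect(functional_relationships, "actuator", "sensor"))   # False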

This functional relation is separate from and mostly independent of the component breakdown, defined in the previous chapter. The component breakdown is primarily about organizing the parts so they are identifiable, and it does not imply a causal relationship. Functional relationships show how components in different parts of the component hierarchy work together. The component breakdown is helpful for defining levels of abstraction; we deal with those in the next section.

Non-functional relationship. A non-functional relationship between two components indicates how their behaviors may be related in non-causal ways, such as two components being independent of each other or showing correlated behaviors. These effects do not depend on interaction between the components, but instead are based on inherent characteristics or history of each component.

undisplayed image

Independence and correlation are typical non-functional relationships. These terms are defined in the usual statistical sense. Informally, two components are independent if the probability of some event occurring on both components is the same as the product of the probability of each event occurring on its own. Events on two components exhibit some degree of dependence if the probability of both occurring is different from the product of each event occurring on its own. For a positive correlation, when one event occurs the other is more likely to occur. For a negative correlation, when one occurs the other is less likely to occur. At the extremes, one event occurring means the other is certain to occur, or that the two events never occur together.

Many non-functional relationships are the result of common-cause events. This can occur when two otherwise-independent components A and B have functional relationships with a third component C. When an event occurs in C, it interacts with both A and B so that both change their states. After such an event, the states of A and B are no longer independent.

undisplayed image

System reliability is often built on a foundation of failure independence. For example, data can be stored in two copies, so that if one copy fails the other remains available. A scheme like this fails when both copies fail, and so the copies are designed to be independent to minimize the chances of both failing together. Independence can be a result of using different technologies to store each copy, or using devices from different manufacturers. Two devices from the same manufacturing batch might share a common manufacturing defect, which would increase the probability that both will fail.
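
A small worked example (with illustrative probabilities) shows why independence matters. If each copy fails in a given year with probability 0.01 and the failures are independent, both copies fail together with probability 0.0001; if a shared defect means the second copy fails half the time the first one does, both fail together with probability 0.005—fifty times more often.

    # Worked example with illustrative numbers: probability of losing both copies.
    p_fail = 0.01                          # chance one copy fails in a year

    p_both_independent = p_fail * p_fail   # 0.0001 when the copies fail independently

    # If a shared manufacturing defect means the second copy fails half the time
    # the first one does, the failures are positively correlated:
    p_both_correlated = p_fail * 0.5       # 0.005 -- fifty times more likely

    print(p_both_independent, p_both_correlated)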

12.2.1 Examples of functional relationships

Here is a list of some kinds of functional relationships that I have encountered in systems I have worked on. The first few relationships are simple and primitive from an engineering point of view, while the later examples are built as abstractions on top of simpler relationships.

12.2.2 Examples of non-functional relationships

Non-functional relationships capture ways that components can behave in coordinated ways without a direct causal relationship between them. These are typically states or behaviors that occur because two components share some common state (or do not share such state).

The following examples all relate to the independence or dependence of different components that are being used redundantly to improve reliability.

12.3 Abstraction

An abstraction is a summary or reduced form of a more complex thing, usually focused on the essential or intrinsic aspects of the complex thing. The abstraction is separate from any concrete instantiation of it.[1]

People use abstraction to manage the complexity of a large system. In an airplane, people talk about “the electrical system” or “the powerplant”—things that are built out of thousands of subcomponents, but which are usefully thought of as whole things in themselves. While the component breakdown structure, in the previous chapter, is one example of abstracting the details of multiple components into one larger, abstract component (or subsystem), most complex systems have multiple, overlapping ways to abstract and simplify views of the system.

12.3.1 Kinds of abstraction

Because abstraction is a powerful tool, it is used in multiple ways in understanding complex systems. Here are three of those ways.

First, abstraction is used to understand component structure. It shows how one complex, abstract component is made up of a number of subcomponents. The focus is on the hierarchical containment or decomposition of components. The high-level component has a single set of functions or properties it provides in the system, and the abstraction shows how this set is provided by functions or properties in subcomponents. Note that the high-level component need not have a unitary physical realization; instead, it may be realized in many physical subcomponents spread throughout the system’s physical space.

Second, abstraction is used to show how the objectives for a system or component are realized in lower levels. The focus is on how an objective (or purpose) is decomposed into objectives for lower-level parts of the system. It can be expressed as the tracing of a high-level objective to lower-level objectives. An abstracted objective need not be constrained to follow the system’s component hierarchy. High-level objectives might be decomposed into lower-level objectives within the same component. High-level objectives might also be decomposed into objectives for multiple different components, some of which are not close together in the component hierarchy.

Finally, abstraction is used to reason about the chain of how an abstract property or objective is mapped through layers of specification to a design or implementation that realizes that objective. This is similar to means-end hierarchies, which have been used to reason about how products are selected. Leveson uses a five-layer approach, starting with system purpose, mapping that to system design principles, then black box behavior, then physical and logical function, and finally physical realization [Leveson00, §4.2.1]. This use helps people reason about the specification and design process as well as about the structure of the system.

12.3.2 Abstraction of objectives or component structure

In general, abstracting structure is about taking a relation between two (or more) high-level components and breaking it down into relations between subcomponents. In the example below, two high-level components A and B have a functional relationship. A and B are both abstractions of a set of subcomponents. The relationship between A and B is an abstraction of the relationship between the A.2 and B.1 subcomponents.

undisplayed image

As a concrete example, consider software on two microcontrollers that communicate over a serial line. The software on each breaks down into an application software component and a serial driver. The serial drivers communicate (over a serial cable) directly.

undisplayed image

Non-functional relationships can follow a similar pattern. If two high-level components A and B exhibit some kind of correlated behavior without direct causation, and those high-level components decompose into lower-level components, then at least one of the subcomponents of A must have a corresponding non-functional relationship with a subcomponent of B.

12.3.3 Overlapping abstractions

Abstraction is not necessarily purely hierarchical: some high-level abstractions overlap. Two different people can look at the same component and need to work with different aspects of it, and see it as part of different high-level abstractions. This is common in systems of even moderate complexity.

Consider an aircraft with modern avionics and engine systems. The avionics provide many functions: flight deck displays, pilot inputs, navigation, radio communications, autopilot, among many others. The powerplant provides thrust to move the aircraft and electrical power to run other systems, but in a modern aircraft it also includes an engine controller (FADEC) that provides autonomous management of engine operations.

undisplayed image

The avionics and powerplant overlap. The flight deck display shows engine status: thrust, temperature, thrust reverser deployment, and alerts when there are engine problems. The pilot thrust levers are connected to the avionics, but provide commands to the engine controller. The autopilot needs to know the capabilities of the engines and how to provide them with control settings.

This overlap leads to a question: is the engine display function part of avionics or part of the powerplant? The answer is that it is part of both, depending on who is looking at that part of the system.

Consider a specific avionics unit for general aviation aircraft: the Garmin G3X display [Garmin13]. It can connect to an engine interface adapter, which in turn connects to sensors or a digital engine controller on the engine. The display is a general-purpose component, which can provide a pilot with many different kinds of information; engine status is just one function. The G3X unit contains a configuration database that defines what engine information it will be receiving, how to display that information to the pilot, and the conditions when it should issue alerts. This database resides within the avionics display unit, implying that someone designing the avionics system will be concerned with it. However, the database is specific to the powerplant installed on the aircraft—changing an engine model requires changing the database—and so it is of concern to people designing the powerplant.

undisplayed image

This pattern is common in systems that have multiple functions: some particular component will contribute to multiple high-level functions, and different people will see that component as part of one abstraction or another based on what functions they are working on. Models of the system must accommodate these overlaps.

When two abstractions overlap, shared components must implement behaviors and properties that accurately support both higher-level abstractions. In the G3X avionics example, the configuration database needs to address the configuration of the powerplant as well as the interface that supports pilot information displays. This can add complexity to designing the shared component, since behavior that supports one abstraction must not interfere with behavior necessary to support the other.

12.3.4 Abstracting a relationship

Some relationships between high-level, abstract components are themselves abstractions.

Consider once again the example of two microcontrollers that communicate with each other, as in the earlier section, but this time they communicate using a wired Ethernet rather than a serial cable. At the abstract level, there is a functional relationship from A to B where A sends data to B.

undisplayed image

The data communication relationship, however, is an abstraction. The microcontrollers communicate using an Ethernet, which might consist of a pair of cables and a switch. The cables and switch reify the abstract relationship, meaning they take the abstract and make it into something real.

undisplayed image

The inputs to and outputs from the reified data communication link are the same (at the abstract level) as the high-level abstract relationship: data gets transferred from microcontroller A to B.

This is an example of a general pattern. Two components at a high level may have a functional relationship, and both the components and the relationship between them decompose into a number of subcomponents. The consistency between the high-level abstraction and the lower-level details must be maintained, of course, but nothing requires that a high-level relationship decompose only into lower-level relationships; as in this example, it may decompose into lower-level components as well.

In fact this pattern continues recursively down to the lowest observable levels. In the example, microcontroller A passes data into the Ethernet cable as a set of low-level electrical signals. Those signals, in turn, are made up of yet lower-level electromagnetic behaviors of the atoms in the conductors that join the microcontroller to the cable.

12.3.5 Consistency

A high-level abstraction and a lower-level implementation of the abstraction need to be consistent with each other. Speaking broadly, the high and low levels are consistent with each other if the low level implements everything in the high-level abstraction, and everything in the low-level implementation is reflected in the abstraction—that is, neither level adds anything to or removes anything from the other.

Abstraction does imply simplification, however. The high-level abstraction of a distributed software component might have a “logging” relationship to a centralized monitoring system. The decomposition of that relationship might involve a logging subcomponent within the software that uses a network connection to send log records to a receiver component within the monitoring system. The high-level logging relationship focuses on the ability to reliably and securely send log information to the monitoring system. To be consistent, the lower-level details must provide a way to transfer that information—using the network to move the data, for example. The statement that the information is sent securely—which would need to be better defined at the high level—might be matched by state and behaviors of the endpoint software components to authenticate each other and encrypt data in transmission.

Continuing this example, the lower-level implementation would not be consistent with the high-level abstraction if the network communication mechanism provided a way to send information in the other direction, from the monitoring system to the distributed software component.

We can put this in somewhat more formal terms as follows.
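One way to sketch a formalization, simplified to a single abstraction mapping rather than the richer set of mappings a full treatment would need: let L be the set of states, behaviors, and interactions that the lower-level components and relations can exhibit, let H be the corresponding set for the abstraction, and let α be the mapping from lower-level elements to abstract ones. Consistency then amounts to requiring

    \alpha(L) = H, \qquad \text{that is,} \qquad
    \forall \ell \in L,\ \alpha(\ell) \in H
    \quad \text{and} \quad
    \forall h \in H,\ \exists \ell \in L : \alpha(\ell) = h .

The first condition says the abstraction reflects everything the lower level can do; the second says the abstraction includes nothing the lower level cannot do.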

This definition of consistency means that an abstract component or relation has to reflect all of the states, behaviors, or interactions that the lower-level components or relations can have, so that the abstraction models all of what the lower levels will do and adds nothing to what the lower-level parts do. In reverse, the lower-level components or relations must implement all of what the abstract components or relations do, without adding other behaviors or interactions.

12.4 Emergent system properties

Emergence is the complement of abstraction: it is how high-level properties or behaviors arise from the properties or behaviors of a collection of lower-level components and their interactions. Put another way, one designs the emergent properties in a system to make abstractions true. Previously, in Chapter 6, I introduced the idea that system properties and behaviors are emergent from the properties and behaviors of the components that make it up, combined with the way those components interact. This idea continues recursively through a system, where each high-level abstraction is achieved by designing how subcomponents work and interact.

Emergent behaviors or properties are usually things that cannot be sensibly talked about at lower levels: these are properties that the individual components do not have on their own, but that the aggregation does when the components are combined. In physics, concepts such as gas pressure are emergent: no individual gas molecule has meaningful pressure, but the collection of a large number of molecules in an enclosed space gives rise to measurable pressure. Similarly, the shape of a leaf is emergent. No cell making up the leaf in itself has a property of the shape of the leaf, but the aggregation of all the cells as well as how those cells interact as the leaf grows (that is, morphogenesis) leads to a consistent shape that can be perceived in the whole.

In engineered systems, properties such as safety or “correct behavior” are emergent from the design of components and their interactions [Leveson11]. Consider an automobile: it has a property that the driver must be able to control its speed. The driver’s ability to control arises from the driver’s ability to give commands to regulate speed and the vehicle’s correct response to those commands. The vehicle’s speed arises from a combination of motor behavior, brake behavior, wheel interfaces to the road surface, vehicle inertia, and external forces like wind or gravity. One can talk about the rotational rate of the motor, or the degree to which brakes are applied, but driver control over speed arises from the combination of all these things.

An emergent, high-level property is said to supervene on the low-level properties of components. A change in the high-level property can only occur when there is a change to the low-level properties. This principle implies that one can in general design low-level properties in order to achieve a desired high-level property. It may be difficult to do this design, of course, but it is possible; properly-designed low-level properties need not create undesired emergent behavior.

Designing a system so that a desired property or behavior emerges from components involves placing constraints on how lower-level components behave and interact. This is a top-down approach to handling emergent behavior. Reliability properties, for example, are often met using redundant components; for those redundant components to provide reliability, they must be connected in a way where one component can provide service when another fails—a property arising from how the redundant components interact with other components. The redundant components must also exhibit a non-functional relation of some degree of failure independence. I will discuss several more examples in the coming sections.

It is generally more effective to work top down, from a desired emergent property of an abstraction to the components and relations that will make it up, than to work bottom up, starting with a set of component behaviors and hoping a desired abstract property will emerge. Component properties combine in unexpected ways, and determining whether they combine in a way that produces the desired result and at the same time avoids unintended consequences is most often a nearly-intractable problem. Working top down means determining the constraints that must apply to the components and structure that implement the abstraction; analyzing (or designing) the components to determine if they meet those constraints is a simpler and more tractable problem.

For example, the software components inside most operating systems cannot be evaluated for good evidence that they provide the operating system’s intended features in all usage scenarios—and practical experience with popular operating systems shows that most contain large numbers of undiscovered errors. Those operating systems were generally built from the bottom up, with new components being developed on their own following only a minimal goal of function, and then added to an existing system. Only a very few operating systems or software systems of comparable complexity have been analyzed to prove that they actually implement their stated function correctly, and those examples have all started with clear definitions of the abstract behavior and worked from there to design the lower-level components and structure [Klein14].

12.4.1 Examples of emergent properties

Emergent properties can be simple or complex; what they share is that the combination of properties or behaviors from multiple components yields something of a nature that would not apply to the individual components. Here is a set of examples illustrating different kinds of emergent properties or behavior, ranging from the almost trivial that one might not ordinarily think about as emergent to the complex, and including both desired and undesired emergent behaviors.

12.4.1.1 Reliable data communication

Reliable communication happens when information is sent from one place to another, with the information received matching the information sent. “Reliable” is usually qualified: a maximum probability that any arbitrary bit or message that is received does not match what was sent, and qualifications on the environmental circumstances such as distance between sender and receiver, or the absence of deliberate interference.

At a high level, communication involves an information source and an information sink. The source and sink have a functional relation of sending information from one to the other.

At the lower level, communication involves a set of components. The information source and sink remain. The functional relationship between them is reified by a chain of components: a transmitter, a receiver, and the medium between transmitter and receiver. It also involves various encodings used in sending from transmitter to receiver over the intermediate medium. The components have functional relations from one to the next, for moving information along this chain of components. The transmitter and receiver have a non-functional relationship: agreement on the encodings to be used to move information over the medium between them.

undisplayed image

Neither the transmitter nor the receiver by itself moves information reliably from source to sink. Instead, reliable transmission is a simple emergent property of combining all the lower-level components and their relations. The reliability comes from properly matching the designs of the transmitter and receiver, including how they encode signals for transmission and reception, so that they can achieve the desired reliability on the medium that connects them.
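To make the emergence concrete, here is a toy sketch in Python (not a model of any particular protocol): the transmitter appends a check value using an encoding the receiver agrees on, the receiver discards corrupted frames, and retransmission continues until a clean frame arrives. Reliable delivery emerges from the matched pair plus the agreed encoding; no single piece provides it alone.

    import random

    # A toy sketch of reliability emerging from matched transmitter and receiver
    # behavior plus an agreed encoding; not a model of any real protocol.
    random.seed(0)   # make the example repeatable

    def checksum(payload: bytes) -> int:
        return sum(payload) % 256                      # the agreed encoding

    def transmit(payload: bytes) -> bytes:
        return payload + bytes([checksum(payload)])    # append the check byte

    def noisy_medium(frame: bytes, error_rate: float) -> bytes:
        # Corrupt each byte with some probability, standing in for a noisy channel.
        return bytes(b ^ 0xFF if random.random() < error_rate else b for b in frame)

    def receive(frame: bytes):
        payload, check = frame[:-1], frame[-1]
        return payload if checksum(payload) == check else None   # reject corruption

    def send_reliably(payload: bytes, error_rate=0.02, max_attempts=50):
        """Retransmit until the receiver accepts a frame (or give up)."""
        for _ in range(max_attempts):
            delivered = receive(noisy_medium(transmit(payload), error_rate))
            if delivered is not None:
                return delivered
        raise RuntimeError("link failed to deliver the data")

    assert send_reliably(b"sensor reading") == b"sensor reading"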

12.4.1.2 Door closing and latching

Consider a door, perhaps to a cabinet. The door can be open or closed. When open, it can be closed by a person acting to close it. If no one acts on the door, it might remain open or close on its own. When the door is closed, it remains closed until a person takes a specific action to open it. “Remaining closed” means that the door stays closed even when force up to some defined limit is applied to the door. These behaviors should occur reliably for at least some number of open-and-close cycles. They only need to hold reliably in some benign environment (no deforming forces, no corroding atmosphere, and so on).

This is an example of an emergent property of a high-level component that can be achieved by properly designing the subcomponents that make it up.

undisplayed image

One possible implementation of the door that would meet this high-level property uses a latch to hold the door closed. When the door swings closed, the latch engages and keeps the door closed. The latch can be connected to a knob or lever that a person can use to release the latch, allowing the person to perform a two-part action to open the door (release the latch, apply force to the door to move it open).

The high-level door thus decomposes into three subcomponents: the basic door, a latch, and a knob. These three subcomponents, plus the door’s user, have four functional relationships:

  1. Latch to door: the latch holds the door closed when engaged.
  2. Knob to latch: the knob can be moved to disengage the latch.
  3. User to door: apply force to open or close the door.
  4. User to knob: apply force to turn the knob.
undisplayed image

The high-level opening action that a user can apply to open the door decomposes into a sequence of lower-level actions: a turn action applied to the knob, an opening force applied to the door, probably followed by a release action on the knob. The high level closing action decomposes into, first, ensuring that the knob is released, then applying a closing force to the door.

The implementation admits states that do not directly map to the high-level states of the door. For example, the implementation allows the user to turn the knob and then take no further action. This leads to a state of the system where the door is in the closed position and the latch is disengaged. If the environment applies an opening force to the door, the door is not restrained and will swing open. A designer will have to work out what these intermediate states are, and determine whether they are acceptable or not. (In this case, the situation might be resolved by saying that the high-level “open” condition maps to any implementation state where the door position is not closed or the latch is disengaged. Handling intermediate implementation states is not always so simple.)
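The mapping between implementation states and abstract states can be written down explicitly, which is often a useful design check. A small sketch, following the resolution just described (the representation is hypothetical):

    from itertools import product

    # A small sketch of mapping low-level door states to the high-level
    # open/closed abstraction: the door counts as closed only when it is in the
    # closed position AND the latch is engaged; every other combination maps
    # to open.

    def abstract_state(in_closed_position: bool, latch_engaged: bool) -> str:
        return "closed" if (in_closed_position and latch_engaged) else "open"

    for position, latch in product([True, False], repeat=2):
        print(f"closed position={position!s:5}  latch engaged={latch!s:5}"
              f"  ->  {abstract_state(position, latch)}")
    # The (closed position, latch disengaged) row is the intermediate state
    # discussed above; this mapping classifies it as open.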

The knob and latch will have properties that, together, support the high-level property that the door will remain reliably closed through some number of open-and-close cycles. These properties likely involve constraints on the wear imposed on each of them each time the door opens or closes, and the amount of wear before they begin to be unreliable. Similarly, the property that a closed door stays closed when some amount of force is applied to the door decomposes into properties on the latch and knob to ensure they will hold the door in position.

The overall property of remaining closed is an emergent property of the design of the latch and knob. The latch by itself is neither closed nor open; that is a property of the door that arises when the latch is engaged and the door is in a closed position.

12.4.1.3 Failure resilience

A failure resilient component is one that can mask one or more failures of its parts while continuing to provide correct behavior. This is one way to meet a goal that a component is reliable or available; the other way is to make the fundamental reliability of the component higher.

For a concrete example, consider a control system for an autonomous road vehicle. The control system takes in commands from a user or other outside system, then must provide correct, active control of the vehicle’s attitude and movement to travel on the commanded path. Typical acceptable failure rates are one in 10⁷ to 10⁹ operational hours. The vehicle should fail safely where possible when the control system fails, but I will leave that aside in this discussion.

Many systems achieve this level of failure resilience using redundancy and voting. In this approach, multiple independent processors run the control algorithm synchronously, each receiving the same sensor input and generating actuation output. The actuation output from each processor is fed to voting logic, which determines whether a majority of the processors are generating consistent output and if so applies that output to the plant being controlled. If one of the processors fails by stopping, or by generating different outputs, the voting logic masks out the presumed failure.

undisplayed image

The combined components will generally perform the same operations as one single computing component by itself, but the combination will fail less frequently. This improvement is an emergent property of the combination. It depends on two non-functional relationships between the redundant components: that they all exhibit the same behavior, and that they generally fail independently.
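A minimal sketch of the voting logic, with hypothetical output values (real voters also handle timing, value tolerances, and fault latching):

    from collections import Counter

    # A minimal sketch of majority voting over redundant channels. A channel
    # that has stopped is represented by None; values are hypothetical.

    def vote(outputs, required_majority=2):
        """Return the majority output, or None when no majority exists."""
        valid = [out for out in outputs if out is not None]
        if not valid:
            return None
        value, count = Counter(valid).most_common(1)[0]
        return value if count >= required_majority else None

    assert vote([12.5, 12.5, 12.5]) == 12.5    # all channels agree
    assert vote([12.5, 99.0, 12.5]) == 12.5    # one faulty channel is masked
    assert vote([12.5, None, 12.5]) == 12.5    # one silent channel is masked
    assert vote([12.5, 99.0, 7.0]) is None     # no majority: declare failure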

For the example vehicle control system, I found that the approach of using three identical embedded computers was (based on reliability analysis, not measurement) likely to provide only a modest improvement to overall vehicle safety. The redundant computers were not fully independent: they ran the same code, they shared the same power source, and they were subject to heat and vibration in the same environment, all of which increased the chances that two or more computers would fail together. They had a greater degree of independence with respect to matters like a cable vibrating out of its connector or dust shorting out traces on the boards. In other applications, such as spacecraft, there are more sources of independent failure, such as radiation upsets. For spacecraft and aircraft, the cost of unreliability is also higher than for a road vehicle, making this approach to redundancy worthwhile.
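A rough way to see why shared code, power, and environment matter so much is the standard beta-factor treatment of common-cause failure (a generic sketch, not the analysis from that project). Let p be the probability that one computer fails during some exposure period, and β the fraction of failures that strike all channels together. A two-out-of-three voting arrangement then fails with probability of roughly

    P_{\text{independent}} \approx 3p^{2}, \qquad
    P_{\text{common cause}} \approx 3\bigl((1-\beta)p\bigr)^{2} + \beta p .

For small p the βp term dominates, so the achievable improvement over a single computer is limited to roughly a factor of 1/β. With p = 10⁻⁴ and β = 0.05, for example, the arrangement fails with probability of about 5 × 10⁻⁶: much better than a single computer at 10⁻⁴, but far from the 3 × 10⁻⁸ that full independence would give.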

An incident involving an Airbus A330 landing on 14 June 2020 illustrates how lack of independence between supposedly-redundant computer systems leads to failure [TTSB21]. The Airbus A330 uses three flight control primary computers; on landing, these control the braking, thrust reversers, and spoilers that slow the aircraft on the runway. In this incident, there was an error in the flight control law implemented in all three flight computers. On touchdown, the flaw was triggered in one flight computer after another until all three had failed, leaving the pilots only manual control of the brakes. The pilots were able to apply manual braking to stop the aircraft before running off the end of the runway. The failure occurred because there was a design flaw common to all three flight computers, meaning that there was no redundancy in the face of the particular condition that occurred on that landing.

12.4.1.4 Undesired emergent properties

Components are usually designed and organized so that together they achieve the desired emergent system properties. However, the same design can exhibit other emergent properties that are undesirable.

Network congestion is a commonly-cited example of undesirable emergent behavior. In its simplest manifestation, when multiple streams of data meet and cross at some router in the network, the streams can overwhelm the router’s capacity to process and forward data. The router typically drops some packets in order to try to keep up, which causes some of the streams in turn to detect missing packets and retransmit them—causing even more traffic through the router. This was first observed in the Internet in October 1986, when a particular congested network link was moving about 0.1% of the data it normally could when not congested [Jacobson88].

This has led to congestion avoidance and congestion control mechanisms in Internet protocols, which aim to either keep stream data rates below the level when congestion starts or recover quickly when congestion does occur. The sender and receiver behaviors in the congestion control mechanisms, however, have been found to lead to behavior synchronization across multiple senders, leading to oscillating loads that repeatedly overwhelm a bottleneck, then back off, wasting resources for a while, until the cycle leads to another period of congestion [Zhang90].
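The heart of those mechanisms is the additive-increase/multiplicative-decrease rule described in [Jacobson88]. A simplified sketch (real implementations add slow start, round-trip timing, and much else):

    # A simplified sketch of additive-increase / multiplicative-decrease (AIMD):
    # grow the sending window gradually while the network accepts traffic, and
    # cut it sharply when packet loss signals congestion.

    def aimd_step(window: float, loss_detected: bool,
                  increase: float = 1.0, decrease_factor: float = 0.5,
                  minimum: float = 1.0) -> float:
        if loss_detected:
            return max(minimum, window * decrease_factor)   # back off multiplicatively
        return window + increase                            # probe for capacity additively

    # Toy run against a fixed bottleneck: the window traces the familiar
    # sawtooth. When many senders run this loop against the same bottleneck,
    # their sawtooths can synchronize, producing the oscillation noted above.
    window, capacity, trace = 1.0, 20.0, []
    for _ in range(40):
        window = aimd_step(window, loss_detected=(window > capacity))
        trace.append(round(window, 1))
    print(trace)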

These behaviors are similar to other unstable situations, where once a system starts to behave poorly it gets progressively worse. In many of these cases, congestion or overload makes it more difficult for the mechanisms that would address the situation to work.

The lesson to draw from the possibility of undesirable emergent behavior is that system designs need to be analyzed to look for such negative behavior—not just analyzed to ensure that desired behaviors happen. This is related to a kind of confirmation bias, where one is motivated, usually unintentionally, to look for evidence that confirms what one wants or expects. It often requires deliberate effort to look for evidence of negative behavior.

12.4.1.5 Spacecraft imaging a ground location

The final example takes the basic principles in the previous, simpler examples and combines them into a realistically complex case.

Consider a spacecraft system that is intended to take images of ground locations and send those images to users on the ground. The system includes many different parts:

The process of taking an image involves every one of these parts, as well as others omitted from the example to keep the list from getting too long to read. It includes:

If any one of those steps fails to happen properly, the system as a whole will fail to achieve its objective. At the same time, no one component involved in these steps achieves the system objective by itself. In other words, the system behavior of taking an image of a ground location is an emergent property of the system as a whole.

This example is typical of most system properties and behaviors, in that achieving the desired behavior involves many components working properly together. This implies that all these components have been designed to have their individual properties, and that the components have been wired together with the right functional and non-functional relations to work together.

This example also illustrates a common issue: that components depend on other components for their function. For example, the ability for the spacecraft to communicate with the ground depends on the spacecraft being able to determine when it is coming in range of a ground station. This means that the spacecraft must be able to tell where it is, which might rely on the GPS system. If there were to be a problem with the GPS constellation, the spacecraft would not be able to communicate correctly. This kind of dependency creates non-functional relationships—in this case, a non-functional relationship between spacecraft-to-ground communications and the GPS constellation: communications will function only when the GPS constellation is working properly.

12.4.1.6 Safety and security

Leveson argues that safety is a fundamentally emergent property:

Safety, on the other hand, is clearly an emergent property of systems: Safety can be determined only in the context of the whole. Determining whether a plant is acceptably safe is not possible, for example, by examining a single valve in the plant. In fact, statements about the “safety of the valve” without information about the context in which that valve is used are meaningless. Safety is determined by the relationship between the valve and the other plant components. As another example, pilot procedures to execute a landing might be safe in one aircraft or in one set of circumstances but unsafe in another. [Leveson11, §3.3]

I argue that related properties, including security, are similarly emergent and must be understood, designed, and analyzed in terms of how components are related.

12.5 Working with structure

The notions of components, structure, and emergence form a foundation for the work to be done when designing and building a system. Upcoming chapters will define the tasks, artifacts, and processes involved in terms of this basic model of how systems can be organized.

For example, the design of a system consists of artifacts that document what the components are in the system, and the desired properties and relations that connect them. Verifying the design involves gathering evidence for and against whether the behaviors that emerge from the components and their relations match the desired system behaviors. A design can be evaluated based on properties of the graph of relations between components, and the graph of relations can guide investigations into whether subtle non-functional relations (such as expected component independence) will hold.

In addition, there are common design patterns of components and relations that provide guidance for implementing complex behaviors. These design patterns can be expressed in general terms of components and relations, making the patterns broadly applicable rather than specialized to a particular use case.

Sidebar: Emergence all the way down

I have taken a pragmatic approach to abstraction and emergence, focusing on the kinds of relations and abstractions one actually encounters in building most real systems. This means only drilling down into lower layers of abstraction as far as is needed, and not as far as it could go.

Consider data that is exchanged between two electronic components. Data is an abstract component that has no direct physical reality; it is an emergent property of lower-level components and relations. The data itself is dependent on mechanisms for observation and interpretation by people—including agreement between sender and receiver on what the data “mean”. The data are transmitted from one component to another using low-level electrical signals over wires; the signals are designed to move the data from one component to the other. The low-level electrical signals are themselves an emergent property of yet lower level atomic and electromagnetic behaviors in the transmitter, wires, and receiver. These may in turn be emergent properties of yet lower level structures and forces, some of which may not yet be understood.

It is intriguing to think about how far one can take this approach. Luckily we can usually stop at some practical level and take the rest for granted.

Sidebar: Summary

Chapter 13: System views

20 May 2024

13.1 Introduction

Systems are too big for one person to understand all the facts at once. It’s necessary to focus on subsets to manage the scale.

At the same time, different people have different interests as they are working on a system. They need a particular kind of information about part of the system, but do not need to be distracted by other kinds of information.

These needs for subsetting lead to developing multiple views on a system. Each view defines a subset of the information on a system, with the subset defined to support a particular person’s needs and interests. Ideally, each person can do their work using one view or another, and when all the work has been done using many different views the work has addressed all of the system.

Some of these views have a technical focus, being about the function or properties of the system and its parts. These views support those who design, analyze, implement, or verify parts of the system. Other views are non-technical, supporting people who manage the project, organize the teams doing the work, handle scheduling, and similar tasks.

Views highlight some information and hide other information in order to help someone perform a task. If the view shows too much information, then the person using the view will have trouble finding the specific pieces of information they need. They may, indeed, be distracted by irrelevant information. On the other hand, if the view is hiding information that the person needs, they are likely to work with the incomplete information they have and infer that the system does not include the missing information.[1]

The view concept I am defining here is a general mechanism for subsetting information about the system. There are several architecture framework standards that define “view” and “viewpoint” concepts, including DODAF [DOD10] and ISO 42010 [ISO42010]. The view concepts in those framework standards arise from ideas about the processes that should be used to build systems well, and are thus more specific than the general idea presented here. These standards focus on developing models of a system’s design, with subset views that are motivated by exploring the objectives that system stakeholders have in the system. The approach in these standards is one way to use the general idea of subsetting information about a system based on some focus; I will discuss this further in later chapters when I turn attention to how to build systems using the foundational concepts presented now.

13.2 Technical views

Technical views are ones that subset the contents of a system in a way useful to the designers, implementers, or verifiers of the system. These views focus on how a part of the system functions or is organized in some technical sense.

These views can focus in different ways, depending on the specific need:

A view focused on a set of components is useful to someone responsible for a particular subsystem or abstraction. The view can collect all the components, at varying levels of abstraction, related to one part of the system. This might be defined as one or more subtrees in the component hierarchy (Section 11.3)—for example, all the components that make up an electrical power system for a spacecraft. It might also start from some other abstraction. Views like this can be used when working out how an abstraction is to be realized in concrete subcomponents (Section 12.3). They can also be useful for checking whether certain design properties hold, like total mass.

undisplayed image

A view focused on a path through the system is useful for working out or checking how behaviors are realized. Such a view might start with an event in one component, then trace how one event causes events in adjacent components, onward until the high-level behavior is complete. Views like this are useful when checking where a path might have gaps that need to be addressed. They are also useful for checking that a causal path among abstract components and relations is properly realized in concrete subcomponents.

Looking at a path can help reveal what conditions need to hold for each step in the path to occur properly. For example, in the spacecraft commanding example in the previous chapter, a ground pass has to happen successfully if a command message is to be received at the spacecraft. A successful ground pass requires a functioning and available ground station, accurate ground knowledge of where the spacecraft will be, knowledge in the spacecraft of where a ground station is and when it will be in range, and the ability to operate the communications subsystem.

undisplayed image

The third kind of view focuses on trees or graphs of dependencies. This information is useful to someone who is verifying that some safety or security condition holds. It is also useful for revealing where there are unexpected vulnerabilities in a system. In particular, looking at the transitive closure of dependencies can reveal unexpected shared dependencies between two components. In the spacecraft commanding example above, a spacecraft’s ability to know when it should operate its transceiver for a ground pass might be based on the spacecraft knowing its location through GPS. This creates a dependency on a GPS receiver on board and the correct function of the GPS constellation. Further, it may require the spacecraft to maintain an attitude where GPS antennas can see the GPS constellation; this may conflict with other demands on spacecraft attitude (like pointing an antenna toward a ground station). Both the communications transceiver and GPS receiver may rely on a shared electrical power system.

undisplayed image
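The shared-dependency check described above amounts to computing the transitive closure of depends-on edges and intersecting the results for two functions. A small sketch, with hypothetical component names drawn from the spacecraft example:

    # A small sketch of a dependency view: compute the transitive closure of
    # depends-on edges and look for components that two functions share.
    # Component names are hypothetical.

    depends_on = {
        "commanding":      {"transceiver", "pass_scheduling"},
        "pass_scheduling": {"gps_receiver", "attitude_control"},
        "transceiver":     {"electrical_power"},
        "gps_receiver":    {"electrical_power", "gps_constellation"},
    }

    def closure(component, graph):
        """All components reachable from `component` along depends-on edges."""
        seen, stack = set(), [component]
        while stack:
            for dep in graph.get(stack.pop(), set()):
                if dep not in seen:
                    seen.add(dep)
                    stack.append(dep)
        return seen

    shared = closure("transceiver", depends_on) & closure("gps_receiver", depends_on)
    print(shared)   # {'electrical_power'}: both rely on the shared power system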

These three kinds of views are not mutually exclusive. Often someone can benefit from starting in one view, such as a path through the system, and then use other views to explore or refine the system, such as checking on dependencies.

13.3 Non-technical uses

Some views are useful for managing project execution. As a manager or lead, I have been responsible for working out what tasks people need to do to develop the system to some milestone, along with potential dependencies among tasks and estimates of the time and resources needed. I have needed to understand the system in order to derive this information about tasks.

For example, I have often started with a high-level design for a part of a system, containing a few abstract components and relations and a few paths through them for performing key behaviors. I have used one or two paths through those components to sketch out milestones that the team can design and develop toward; at each milestone, the designs or implementations will be integrated to demonstrate some level of functionality. This management step uses views of a few paths through the system. After that, I have worked from the view of components and relations that feed into each milestone to work out a set of design and development tasks that will get each part ready for its milestone. These steps use information about the components and relations involved to work out both the individual tasks and how those tasks might depend on each other, leading to constraints on how the effort can be scheduled. I expand on these techniques later in this work.

Following paths through a system, as well as tracing through the ways that abstractions are decomposed, allows one to find gaps in the current understanding. These gaps represent uncertainty, which can lead to risk. Further, following the paths that lead from some uncertainty to other components or relations helps one work out how much of the rest of the system may be affected by that uncertainty. This allows one to judge the potential effects of changes that may arise from the uncertainty; the magnitude of the effects is part of determining how much developmental risk some gap poses. I discuss how to use this kind of analysis later in this work.

Sidebar: Specifying a view

The descriptions above may seem focused on extracting subsets of a defined system, but the view concept is intended more generally.

In set theory, subsets are often specified one of three ways: by listing the elements of the subset; by constructing the subset through combinations of set operations such as intersection and union; and by specifying a characteristic function—essentially, a description of a query on the set.

All of these have been useful to me at one time or another. While a system is being designed, the population of components and relations that make it up will be changing constantly. The path through components and relations to achieve some function will be steadily refined; in many cases, there may be two or more alternative designs for parts of the same path to compare. This case lends itself to a query-like formulation of views, which are updated as the system’s contents change. On the other hand, tasks to verify that a design or implementation is correct and complete benefit from an unchanging snapshot. This way someone can step through each part of the system, verifying each piece and each integration, without having that work change as people make changes to the system.
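As a concrete sketch of the three styles, over a hypothetical component catalog:

    # A sketch of the three ways of specifying a view, over a hypothetical
    # catalog of components tagged with a "domain" attribute.

    catalog = {
        "solar_array":  {"domain": "electrical"},
        "battery":      {"domain": "electrical"},
        "pdu":          {"domain": "electrical"},
        "transceiver":  {"domain": "communications"},
        "gps_receiver": {"domain": "navigation"},
    }

    # 1. Enumeration: a fixed snapshot, useful for verification work that
    #    must not shift while the work is being done.
    eps_view = {"solar_array", "battery", "pdu"}

    # 2. Set operations: combine existing views.
    radio_nav_view = ({"transceiver"} | {"gps_receiver"}) - eps_view

    # 3. Characteristic function: a query, re-evaluated as the catalog changes
    #    during design.
    def view(catalog, predicate):
        return {name for name, attrs in catalog.items() if predicate(attrs)}

    assert view(catalog, lambda a: a["domain"] == "electrical") == eps_view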

Sidebar: Summary

Chapter 14: Evidence of meeting purpose

20 May 2024

14.1 Introduction

An implemented and operational system needs to meet its purpose (Chapter 9). After all, that purpose is the reason that resources have been spent on developing the system and using it. Meeting purpose means two things: that the system does all the things it is supposed to, and that it does not do things it is not supposed to.

One cannot assume that a system meets its purpose. Each system needs to be evaluated to determine whether it actually does or not, and if not, how and where it does not. The evaluations catch design and communication errors that occur when one party thinks they have specified what is needed, and another party does not understand what was meant or makes a mistake in translating the specification into practice.

How a system works changes over time as well, and regular re-evaluation catches cases where operational behavior diverges from what is needed for correct or safe operation. This includes wear and tear on the system that must be corrected with maintenance. It also includes changes in how the system is operated—from operator practice to management organization and environmental context.

In this work I talk about gathering evidence of a system meeting its purpose.

Parts of a system’s purpose can be specified quantitatively or qualitatively. Quantitative purposes can lead to deterministic ways to check that the system meets the purpose. Complex quantitative purposes, however, aren’t necessarily so easily evaluated: computational complexity or the difficulty in actually measuring system behavior can lead to quantitative properties that cannot be easily or definitively evaluated.[1] For these complex quantitative problems, one must be satisfied with statistical evidence that indicates whether the property is likely true. Qualitative purposes are not amenable to proof of satisfaction. These purposes are evaluated by human judgment, which again leads to evidence but not proof of satisfaction.

Systems engineering processes often use the terms verification and validation (or just V&V). These are both special cases of the general need to gather evidence for and against whether a system meets its purpose or not. In this chapter I focus on the general matter of checking a system, and I will note, here and in later chapters, when these specific uses of evaluating a system apply.

14.2 When to evaluate a system

Checking whether a system meets its purpose is an ongoing need, starting from when the system is first conceived, through system design, implementation, and operation. In general, a system should be evaluated any time its purpose changes, or any time its design, implementation, or operation changes.

In practice, there are five times in a system’s life cycle when the system—whether in design, in implementation, or in operation—gets checked against its purpose.

  1. At each of the individual steps from initial concept, through specification, design, and implementation.
  2. At the time when the system is accepted for deployment.
  3. Periodically and regularly while the system is in operation, to monitor for drift.
  4. At each step when a change is requested, from concept through design and implementation.
  5. At the time when a changed system is accepted for deployment.
undisplayed image

During development, systems are checked in two ways: step by step, and a separate evaluation of the whole system when implementation is complete. The step by step checking occurs at each development step, including generating a concept for the system, generating a specification, designing, and then implementing the design. The expectation is that if each of these steps is correct, then the concept will follow the purpose, specification will follow concept, and so on, and the resulting implementation will properly meet the system’s purpose. (See the figure in Section 6.6.) In practice something gets missed or misinterpreted at some step of development, and so the argument that each step is correct does not hold. Separately evaluating the implementation at the end directly against the original statement of purpose allows one to cross-check the step-by-step evaluation. It helps one find which step had a mistake and thus where to make corrections.

Evaluations are part of the process of working out components’ specifications and designs. The idea of safety- or security-guided design [Leveson11, Chapter 9][Horney17] is to start with safety or security objectives as part of a component’s purpose (or the system’s purpose), refine those objectives into parts of the component’s specification, and then use this to help guide design work. Using safety or security objectives means conducting analyses of specifications or designs to see if they address the objectives, and adjusting the specification or design until there is evidence that they do meet the objectives.

Any time the system’s purpose changes, the system must be re-evaluated in light of the change. This involves repeating steps in the life cycle shown above. Re-evaluation is easy early in initial design; the later in the life cycle a change comes, the more expensive re-evaluation gets. The scope of what parts of the system need to be re-evaluated can be limited by examining the structure of the system and how a change propagates from one component or relation to another.

A system should be evaluated regularly while in operation. In practice, systems drift over time from how they are originally designed and implemented. People who are part of the system, whether as operators, oversight, or management, can shift in their understanding of what they need to do, and often find shortcuts for their role as they adapt to the system they work with. Mechanical parts of the system can wear, changing their behavior or increasing the chances of failures. The environment in which a system operates can change, perhaps with people moving near an installation that was previously isolated or maintenance budgets being cut. As a simple example, one early software system I built included a billing module that would create itemized invoices to be sent to insurance companies that were expected to reimburse for medical expenses. Over time, the people who should have been running the module and creating invoices did so less regularly than they should have, leading to revenue problems for the business. Leveson discusses several other examples [Leveson11, Chapter 12].

Finally, a system’s purpose usually changes over time. The users need new features, or some assumption about how they will use the system will be found to be wrong. Regulations or security needs may change. All of these lead to a need to change the system’s design and implementation. The team will recapitulate the development process to make these changes, including evaluating the updated concept, design, and implementation against the new purpose.

14.3 Kinds of evidence

There are two kinds of evidence: positive evidence and negative evidence. Both are needed to evaluate whether a system meets its purpose.

Positive evidence is an indication that the system properly implements some desired property or behavior. Positive evidence is what most people think of first: that the mass of system hardware is within some maximum amount, or that the system performs action X when condition Y occurs.

Negative evidence is an indication that the system does not do something it is not supposed to. Safety and security evaluations are fundamentally about collecting this kind of evidence: that the system will not do some unsafe action or enter into some unsafe state. Negative evidence is therefore vital to determining whether a system meets its objectives, but negative evidence is generally much harder to establish than positive evidence. In practice, analytic methods are the only ways we currently have to establish the absence of a condition.

Bear in mind that, as the saying goes, absence of evidence is not evidence of absence; that is, no amount of testing that fails to find an undesired condition can establish with certainty that a realistic system is free of that undesired condition. Negative evidence through testing requires testing every possible scenario, which is infeasible for anything other than trivial behaviors. Testing a very large number of scenarios can potentially generate a statistical argument for the absence of an undesired condition, but only if the scenarios chosen can be proven to provide sufficient, unbiased coverage of all possible scenarios, including rare scenarios. I have never found an example of someone being able to construct an argument for the significance of the test scenarios in a real-world system. Kalra and Paddock [Kalra16] present an analysis for testing autonomous road vehicle safety, and show that it would require an infeasible number of driving miles to show the absence of unsafe behaviors—and they conclude that alternate means are needed to determine whether autonomous road vehicles are sufficiently safe.

Many undesirable behaviors or conditions cannot be completely eliminated from a system, and instead the standard is to show that the rate at which these behaviors occur is sufficiently rare. For example, aircraft are expected to experience failures at no more than some rate per flight hour in order to be certified for operation. These safety conditions lead to a need for evidence of statistical bounds on rate of occurrence at a given confidence level.[2] If these bounds are sufficiently loose, then a carefully-designed test campaign can provide statistically significant evidence. However, statistical significance and confidence rely on the test scenarios either being selected without bias, or with a way to correct for selection bias. This means, for example, ensuring that there is no class of scenarios that are avoided in selection. It also means understanding the probability of rare but important scenarios occurring and accounting for that rarity in the number of scenarios tested or in the way scenarios are selected.
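To give a sense of scale, here is a standard exponential-model calculation (a generic illustration, not tied to any particular system). If a test campaign observes zero failures over T operating hours, the one-sided upper bound on the failure rate λ at confidence level C is

    \lambda \;\le\; \frac{-\ln(1 - C)}{T} .

Demonstrating λ ≤ 10⁻⁹ per hour at 95% confidence, with −ln(0.05) ≈ 3.0, would therefore require on the order of 3 × 10⁹ failure-free test hours, which is why analysis rather than testing alone has to carry most of the burden for claims at these rates.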

14.4 Methods of gathering evidence

There are three general methods for gathering evidence about systems satisfying their purpose:

Experimentation tests an operational system (or part of a system) to show positive evidence about some desired capability. This is the gold standard for gathering positive evidence.

Experimentation is usually divided into two categories: testing and demonstration. Testing involves setting the system into a defined condition and providing it defined inputs, measuring the system’s response, and comparing that response to expectations. Tests are expected to be repeatable. Demonstration is more open-ended, where the system is operated for a while, possibly by people, and not always in a fully-scripted, repeatable way. Demonstrations can address some non-quantitative conditions, such as whether people like something or not.

Inspection or review is a way to check a design or system for things that cannot be readily measured by experimentation. These methods use human expertise to check the system for specific conditions. They are primarily used to gather positive evidence, but they can be useful for gathering negative evidence when other methods don’t apply. In the simplest form, inspection checks simple conditions that would be difficult to automate; for example, that a physical car has four wheels. For more complex reviews, humans observe the system and think about what they observe to determine whether it meets expected behavior.

Analysis can be used to collect both positive and negative evidence. Indeed it is generally the most useful way to gather negative evidence—which is often about thoroughness, and analytic methods are better at ensuring all possibilities have been examined. Analysis takes as input a model of the system, extracted from its design or its implementation. It then applies algorithms that work to prove or disprove statements about that design, such as whether there exists some sequence of state transitions that would cause a component to enter an undesired state. The evaluation is usually performed using automated computational tools, though it can sometimes be done by hand for analyses of modest complexity. I have used analytic methods occasionally, usually for foundational components or abstractions on which the system depends for its correct operation. The first time I used it, on the design of a synchronization mechanism in a multi-threading computing environment, it caught a subtle flaw that would have occurred rarely but would have been difficult to detect. On another project, colleagues and I proved the correctness of the design of a family of distributed consensus algorithms—which helped us accurately implement the algorithms. The seL4 operating system kernel [Klein14] has a formally proven design and implementation, showing that its implementation provides key confidentiality, integrity, and isolation properties as well as functioning according to its specification.
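The flavor of such an analysis can be sketched in a few lines: enumerate every state a model can reach and check that none of them is undesired. The model below is a toy, hypothetical lock example, and real tools handle vastly larger state spaces, but the principle of exhaustiveness is the same.

    from collections import deque

    # A toy sketch of exhaustive state-space analysis: explore every reachable
    # state of a transition model and report any that violate a stated condition.

    transitions = {
        "locked":   {"request": "checking"},
        "checking": {"grant": "unlocked", "deny": "locked"},
        "unlocked": {"release": "locked"},
    }
    undesired = {"unlocked_without_check"}   # states the design must never reach

    def reachable(initial, transitions):
        seen, queue = {initial}, deque([initial])
        while queue:
            state = queue.popleft()
            for next_state in transitions.get(state, {}).values():
                if next_state not in seen:
                    seen.add(next_state)
                    queue.append(next_state)
        return seen

    bad = reachable("locked", transitions) & undesired
    print("no undesired state reachable" if not bad else f"reachable: {bad}")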

14.5 Completeness and minimality

Separate from these methods for gathering evidence, one also needs evidence of completeness and minimality.

When a system is believed to be complete, one doesn’t want only to show that one or a few purposes are met; eventually one needs to provide evidence that all purposes are met. This does require knowing what the purpose is, and then being able to provide evidence showing each part of it has been satisfied.

One also needs to show that the system as designed or implemented does not do things that don’t derive from and support the purpose. This includes showing that safety and security properties (of bad things not happening) are met. It also includes ensuring that people have not inserted features that the end users do not need or want, which would imply that development resources have been mis-spent and that the system can potentially do things the users will find undesirable.

Sidebar: Summary

Chapter 15: Synthesis

14 October 2024

15.1 Introduction

The previous chapters have covered a model for understanding what a system is. The coming chapters are about how to build that system.

This chapter presents where these two arcs meet: the web of artifacts that make up a system and record what it is—the system artifacts. This web is a reification of the abstract model in the previous chapters; the making of this web is the work of making the system.

The system artifacts are a (directed) graph. The nodes in the graph are artifacts that record some aspect of the system, such as a component’s implementation or a requirement or a verification result. The edges in the graph record relationships between nodes, such as a “part-of” relationship between two components, or a “satisfies” relationship between a specification and a design.

The graph contains all the information about the system, as discussed in the previous chapters. It starts with one or more project ideas. These ideas lead to stakeholders and to purposes and constraints. These flow to nodes making up the concept, and from there onwards to implementations and verification evidence.

The development and maintenance or evolution phases of a system-building project are about creating and maintaining this structure. The graph does not spring into existence fully formed; nor is it built in one sweep from the top (idea) to the bottom (implementation). The work starts at the top of the graph with the ideas and stakeholders, with the rest of the structure unknown at the start. People explore downward in multiple threads. Some threads progress deeper than others at any given time. Some threads explore in one direction, find that it is unhelpful, and start over in a different direction. Most nodes start out simple, and are added to and corrected many times as the team explores the graph.

The business of making a system well is, then, about starting from the basic idea and, first, exploring and building the structure of system artifacts efficiently; and second, building a structure with good form. Efficiently means using as little time and other resources as possible to build the structure; this implies using resources concurrently and minimizing the amount of work that is re-done (because of false starts or poor workmanship). Good form means that the structure accurately reflects stakeholder objectives and constraints; the system meets those while avoiding other features that are not about customer needs; and the graph accurately reflects the derivation from objectives to implementation. A good system artifact graph will contain all the information people need, but not too much more.

In the next part, I will discuss how one goes about making a system, which amounts to organizing how the team builds this structure without losing track of what they are doing.

15.2 Structure of artifacts

The graph of system artifacts contains all of the information about a system—its purpose, structure, implementation, and why all of that is the way it is (the rationale).

The graph has many types of nodes, including:

There are many different kinds of edges between these nodes, showing the relations between the artifacts. Some of the most common are:

The graph in Figure 15.1 illustrates a small part of the upper levels of an example system artifacts graph. The illustration is a simplification; the specification node in the graph, for example, is actually a collection of requirement trees, each of which has derivation relationships to components above and below. Similarly, a design is actually a collection of many things; the illustration leaves out the relationships between the subcomponents within one higher-level component. In practice the graph for even a fairly simple system will have at least tens of thousands of nodes; for complex systems, orders of magnitude more.

This graph of artifacts is what the project builds, bit by bit, over the course of development.

undisplayed image
Figure 15.1: Top portion of the system artifact graph
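The graph itself can be recorded quite plainly. The sketch below gives the flavor; the node and edge kinds are hypothetical examples, not a prescribed schema:

    # A sketch of recording the system artifact graph. Node and edge kinds here
    # are hypothetical examples, not a prescribed schema.

    nodes = {
        "idea-1":        {"kind": "idea"},
        "stakeholder-1": {"kind": "stakeholder"},
        "objective-1":   {"kind": "objective"},
        "req-eps-12":    {"kind": "requirement"},
        "design-eps":    {"kind": "design"},
        "impl-pdu":      {"kind": "implementation"},
        "test-pdu-7":    {"kind": "verification evidence"},
    }

    edges = [
        ("stakeholder-1", "holds",     "objective-1"),
        ("objective-1",   "derives",   "req-eps-12"),
        ("design-eps",    "satisfies", "req-eps-12"),
        ("impl-pdu",      "part-of",   "design-eps"),
        ("test-pdu-7",    "verifies",  "impl-pdu"),
    ]

    def follow(node, relation):
        """Trace one kind of edge out of a node, e.g. a derivation chain."""
        return [dst for src, rel, dst in edges if src == node and rel == relation]

    print(follow("objective-1", "derives"))   # ['req-eps-12']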

15.2.1 Layering

Looking at the whole artifact structure, as opposed to just particular kinds of information, reveals problems with narrower views.

Most projects and almost all tools organize requirements into a tree. Each component in the breakdown structure has its own set of requirements. These start with the system objectives and constraints, and flow down from one component to another along the component breakdown structure.

In isolation, this structure is missing essential information about the mapping of requirements on component C to the requirements on its children C1, C2, …, Cn. The set of Ci implicitly encodes how C is decomposed into subcomponents. By itself, this leaves out the roles of each subcomponent within C and the relationships between the subcomponents. The requirements on Ci derive only in part from the requirements on C; they depend equally on the role that Ci plays within the larger component.

Concretely, consider a spacecraft electrical power system (EPS). The EPS has many requirements, such as providing some minimum wattage to other spacecraft subsystems over the course of the average orbit, being able to passivate (permanently shut down) the EPS, controlling which subsystems are powered on, and so on. The EPS is built from several components: solar cells that generate electricity, a battery to store the generated energy, a power distribution unit (PDU) that switches where electricity is going, and others.

undisplayed image

The passivation requirement, which comes from regulation requiring all energy sources in a spacecraft to be fully and permanently de-energized when the mission ends, must be implemented by those subcomponents. This means that the battery must be drained to zero charge and the EPS must be placed in a mode in which it will never afterward provide any power to any other spacecraft subsystem. This includes the battery never being recharged. Figure 15.2 shows a set of requirements that encode this.

undisplayed image
Figure 15.2: Requirements flowdown for spacecraft electrical power system passivation

One way to meet the passivation requirement would be to add two features: fuses between the solar panels and the PDU and between the battery and PDU, which when blown would permanently disconnect them; and a mechanism to drain the battery, perhaps by connecting it to a circuit with a resistor that will dissipate any stored energy.

In the requirements flow-down above, there are design decisions hidden in the requirements tree: where and how these functions will be implemented. The disconnection fuses might be implemented as part of the PDU or as part of the solar panels and battery; the design decision was to assign these functions to the solar panels and battery rather than having them within the PDU. Similarly, the function to drain the battery could be implemented in the battery or in something else attached to the PDU. The hidden decision was to assign that function to the battery.

In fact, requirements (or specifications) and design (the component breakdown, and the way high-level components are decomposed) are interleaved: they form alternating layers, and one cannot be correctly understood without the other. A design records how a component will be made up of subcomponents, and defines the roles and functions that are allocated to each of the subcomponents. The roles and functions define the purpose for the subcomponent. The requirements placed on the subcomponent are derived from the requirements of the higher-level component as transformed through the purpose of the subcomponent.
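One way to keep these layers explicit is to record, with each derived requirement, both its parent requirement and the design decision that allocated the function. The fragment below is a hypothetical sketch of the EPS passivation flow-down; the identifiers and wording are invented for illustration and are not taken from Figure 15.2.

    # Hypothetical flow-down records for the EPS passivation example.
    # Each child requirement carries both its parent and the design
    # decision (the allocation of a role) that justifies the derivation.
    eps_passivation_flowdown = [
        {
            "id": "EPS-REQ-40",
            "text": "The EPS shall be capable of permanent passivation at end of mission.",
            "level": "EPS",
            "derived_from": ["SYS-REQ-12"],      # the regulatory passivation requirement
            "design_decision": None,
        },
        {
            "id": "BAT-REQ-7",
            "text": "The battery shall provide a means to drain stored energy to zero charge.",
            "level": "Battery",
            "derived_from": ["EPS-REQ-40"],
            "design_decision": "Allocate the energy-drain function to the battery "
                               "rather than to a unit attached to the PDU.",
        },
        {
            "id": "BAT-REQ-8",
            "text": "The battery shall include a fuse that, when blown, permanently "
                    "disconnects it from the PDU.",
            "level": "Battery",
            "derived_from": ["EPS-REQ-40"],
            "design_decision": "Place disconnection fuses in the battery and solar "
                               "panels rather than inside the PDU.",
        },
    ]

Recording the allocation decision alongside each derived requirement makes the alternating specification and design layers visible, rather than leaving them implied by the shape of the requirements tree.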

undisplayed image

15.3 Exploration and development

Development of the system is a process of exploration to find and create the contents of the system artifacts graph. This can be thought of as similar to the way the original North American transcontinental railroad was developed (Section 4.7), but with many more choices.

In an idealized situation, where there were no significant uncertainties and where the team had system design patterns to follow, the exploration could start with the idea, proceed to work out the stakeholders and their needs, and then the team could work their way downward step by step until they reached the implementation and verification artifacts at the bottom edge of the graph. I am not aware of any project where this ever happened.

Unfortunately, that kind of steady, unidirectional progress is what too many people expect when they think about the whole mass of information that a project develops. Knowing instinctively that such development doesn’t actually happen, they turn away from organizing the project’s work and focus on building the system implementation artifacts in whatever way they can manage. This is especially true of startups and of small projects.

What gets lost is that the system artifact graph is the end product of building a system, not the process for building it. The state of the artifact graph at any moment is a record of what the project has done to that time. The graph is not the plan; the plan is its own artifact (Section 20.5).

The process of making all the parts of the system artifact graph—how the graph grows and changes—is always complex. The process is non-linear and non-monotonic: people will sketch some node in the graph, explore around it, and come back to improve the sketch based on what they have learned. Some people may start working from the top (stakeholders and their needs) downward, but others may have ideas about how parts of the middle of the graph might be structured, and work from the middle out. When there are choices for how some component might be structured, the team explores multiple tentative options, creating multiple parts of the graph, before selecting one of the options and removing the other tentative parts.

The challenge for a team is to develop the graph efficiently and correctly, while allowing for the necessary complexity of working in non-linear ways.

There are two goals in developing the graph efficiently, without wasting resources. The first goal is to minimize errors and rework. If someone develops a component design that does not meet its specification, or an implementation with bugs, then at least some of that effort has been wasted and more effort will be spent fixing the problems. Rework also happens when two people duplicate work that only one of them needed to do. The second goal is to keep the whole team working, without anyone stalling while waiting for someone else to complete something. The time someone spends waiting without being able to do useful work on the project is time lost. Organizing the work to maximize potential concurrency can help achieve this second goal.

There are other techniques that keep a project moving quickly, such as assigning work to the person best able to do it. These are in addition to the business of exploring to find the system artifact graph; they are covered in a later chapter.

While a system is being developed, parts of the system artifact graph will be uncertain. Some parts will still be empty, waiting for someone to start working them out. Some nodes will be incomplete, needing more detail or to be checked for correctness. There may be components that have been specified but for which no feasible design yet exists. The overall process of developing a system is one of driving these uncertainties down to zero (which only happens when the last implementation completes verification and the system is done).

Any time someone is working on something that depends on something that is uncertain, there is a risk that the work they do will have to be redone. The greater the uncertainty, the greater the likelihood of rework. This suggests that development should proceed cautiously from highly certain parts to less certain ones, not moving onward until uncertainty has been worked down. Unfortunately this does not work: high-level decisions are often tentative, depending on whether something at a lower level works out. For example, one might decide that the electrical power system from the previous section should include a battery, solar panels, and a power distribution unit. Whether this decision works out will depend on whether suitable battery and solar panel components are available. If they aren’t, or if the available components are too heavy or don’t produce enough power, then that initial design would have to be revisited. At the same time, the decision to use a battery and solar panels for a spacecraft operating in Earth orbit is probably feasible—that is, there is only moderate uncertainty—and so it is reasonable to move forward tentatively with that approach. As I will discuss later, I would then plan to investigate whether appropriate solar panels and batteries are available as next steps.

The final structure of the system artifacts graph should be correct. Some of what makes it correct:

The process of exploring and developing the system artifact structure will necessarily involve evaluating alternatives for one part or another. A good final structure will come from having explored enough alternatives to have confidence that good choices were made (even if they were not formally optimal). Information about what alternatives were considered and why the result was chosen should be included in the rationales in the structure.

Finally, a good final system artifact structure is complete to a reasonable level of detail, but does not go overboard. It should contain enough information that people can learn what they need from looking at the graph and reading about its nodes, but without including details that won’t help them understand both what the system is and why it is that way.

15.4 Views

The system artifact graph quickly becomes large and intricately connected. For even modest systems, it becomes more than one person can hold in their mind.

People need tools to help them see the parts of the graph that are relevant to their work, while keeping parts that are not useful to them out of their way. I discussed the idea of views in Chapter 13: the ability to focus on subsets of the system or its associated artifact graph.

There are some common ways people need to be able to view subsets of the information. When they are working on one component, they need to see all the information about that component and all its context (relations, derivations, neighbors, parents). When they are working on the structure—components and relations—they need structural information. When they are working on an emergent behavior that crosses many components, such as a safety or security property, they need to see the affected components and their relations, along with information about how component behaviors at different levels combine to create the emergent behavior. When they are verifying some aspect of part of the system, they need to see the specification they are verifying against, along with structural information to help them work out how to verify that aspect—perhaps by building a test that uses component interfaces.
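One way to picture a view is as a filter over the artifact graph: a rule that selects the nodes relevant to a task, plus the relations among them. The sketch below builds on the hypothetical ArtifactGraph structure shown earlier in this chapter; the function names and the crude keyword matching are illustrative only, not a recommended implementation.

    def component_view(graph: ArtifactGraph, component_id: str) -> ArtifactGraph:
        """Everything directly related to one component, plus the component itself."""
        keep = {component_id}
        for e in graph.edges:
            if e.source == component_id:
                keep.add(e.target)
            if e.target == component_id:
                keep.add(e.source)
        view = ArtifactGraph()
        view.nodes = {i: graph.nodes[i] for i in keep}
        view.edges = [e for e in graph.edges if e.source in keep and e.target in keep]
        return view

    def concern_view(graph: ArtifactGraph, keyword: str) -> ArtifactGraph:
        """Artifacts touching a cross-cutting concern such as 'safety' or 'security'.
        Keyword matching on the summary is a crude stand-in for real tagging."""
        keep = {i for i, a in graph.nodes.items() if keyword in a.summary}
        view = ArtifactGraph()
        view.nodes = {i: graph.nodes[i] for i in keep}
        view.edges = [e for e in graph.edges if e.source in keep and e.target in keep]
        return view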

I have found that I often use three kinds of views:

15.5 Relation to making a system

The preceding chapters have presented an abstract model for what a system is. This model provides tools for thinking about systems: their purpose, scope, components, and structure.

The coming chapters cover how to build the system. They present a model organized around the tasks people do, in which a team uses tools to create artifacts, and the work is organized by how the project operates.

The system artifacts graph is where these two threads meet. The system artifacts are a reification of a system’s abstract model; they record all the information about a system and include the final implementations that are eventually manufactured and deployed. The system artifacts, therefore, must accurately represent the information in the model. The system artifacts are the object of all the work building the system. The nature of the system artifacts graph constrains how the team organizes itself and its work. In particular, the life cycle and planning that a team adopts are statements of how the team will explore and develop the system artifacts.

Sidebar: Summary

Part IV: Making a system

A detailed model for how to go about building a system:

Chapter 16: Approach

21 May 2024

Making a system is about the activities to build the system and the people who do that work. In Chapter 7, I laid out a basic model for these activities and what they involve. The model involves five elements (repeated from that chapter):

undisplayed image

The model is organized around the tasks that are performed to build the system. The tasks generate artifacts, including design and implementation. The team is the people who do these tasks. The people use tools to do some of the tasks. And finally, the plan organizes the work.

This model provides a template for thinking about how to set up the processes and policies for a system-building project. That is, when it comes time to do a project, one can use this model to help guide the decisions about how the project will be run. In this book I do not specify how one should make these decisions—each project has its unique needs, and no one recommendation will be a good solution for every project. Instead, the model provides a framework for understanding what decisions need to be made, and in later chapters I provide menus of choices for different parts of the model.

All the pieces of running a project are themselves a system, whose purpose is in general to get the system built. In this part, I follow a general approach for designing any system in order to lay out a set of functions that each part of the model can have. Doing so lays out a framework of criteria by which someone can judge potential designs for their project’s organization.

undisplayed image

The approach, then, begins with working out the purpose of the system for running the project. The purpose in turn derives from the stakeholders who must be satisfied with the execution. In the rest of this chapter, I lay out a template list of stakeholders and the needs each of them might have. This set of needs will then provide guidance for what each component part of the model—artifacts, team, plan, and tools—should have in order to satisfy the stakeholders.

16.1 Purpose

The primary purpose of the system that is the project is:

Get the system built, accepted, in operation; maintain and evolve it.

There are also secondary objectives that different stakeholders will have, which we will discuss next. These include, for example, needs of the organization hosting the team that does the work: the organization in most cases expects at least to be able to cover the cost of development. If the organization doesn’t believe that it can cover the cost, it may well decide not to pursue the project.

In the next step, I identify potential stakeholders. Following that, I will identify the potential needs each can have, the different forms each stakeholder can take, and how each stakeholder relates to the organization that runs the project.

16.2 Stakeholders and needs

The first step in working out a system’s purpose is to identify the stakeholders who define the purpose (or put constraints on the project that are, in effect, part of the purpose).

I group stakeholders into five classes:

  1. The customer for which the system is being built;
  2. The team that builds the system;
  3. The organization(s) of which the team members are part;
  4. Funders who provide the investment to build the system; and
  5. Regulators who oversee the system and its building.

Each of these is meant to be a role, rather than a single entity. For example, when a system is built under contract for an organization that is paying for the work, that organization is both the customer (they will be using the system) and the funder (they are paying for the system-building).

16.2.1 Customer

The customer is the person or organization(s) that will use the system once it has been built and deployed. The system’s value in the world ultimately derives from what the customer can do using the system.

The customer primarily cares about the system meeting some need they have. In addition, they care that the system:

Variations. The simplest customer relationship is when one organization contracts with another to build the system for it. In this case, it is clear who needs to be satisfied with the system (the one paying for it).

Other times the customer is internal: an organization determines that it needs some system for its own use. Who defines the purpose of the system is then usually clear—though sometimes it is not, because there is no clean separation between the “customer” and the builders.

Finally, the most complex situation occurs when the customer is hypothetical. This occurs when an organization builds a system product in hopes of providing it to future (paying) customers. In this case, there is no one person or organization who can dictate the system’s purpose. Instead, the team designing the system must build up an idea of who potential customers are and what they might want.

I discuss the different kinds of customers further in Section 32.6.

Relation to broader organization. Most organizations have someone or a team responsible for finding and working with customers. This might be a business development group, or a sales and marketing group. These people will be responsible for actually working with the customer, and they should stand in as a proxy for the customer during internal discussions. The systems aspects that I discuss here support the interface between the marketing or business development people and the people who build the system that is delivered to the customer.

16.2.2 Team

The team is the collection of all the people who do the work to design and build the system. This group includes developers and engineers, managers, contracting specialists, marketing, and everyone else who does tasks related to getting the system built.

Many of the things that the team needs are not directly related to building the particular system, but are aspects of the organization for which they work. An organization’s policies and management have the most effect on whether the team is satisfied, but there are aspects of the systems work that can support (or hinder) meeting the team’s needs.

The people in the team need, in general:

This list is based on the analysis documented in Section A.2.2.

Variations. The team can be as simple as one or two people, or it can range up to hundreds. The team can be all within one organization, or it can be spread over multiple organizations (such as when multiple organizations collaborate on a project). A team can also be viewed as including external vendors who provide parts of the system or essential services.

Relation to broader organization. Most of a team’s needs are matters of project management and business operations, not of systems-building in itself. The organization defines its human resources policies, for example, which address matters of how people are evaluated or paid, and how they can report problems.

However, the organization of systems work can help to meet these needs. Accurate staffing depends on understanding the work to be done, which in turn depends on the system’s design. Well-defined job descriptions and processes help people understand how to get their job done, contributing to people feeling secure in their position.

16.2.3 Organization

The organization is the entity or entities for whom the people in the team work, and which provide a legal entity for the project. I use the term “organization” rather than “business” or “company” because there are many kinds of organizations that can run a project: a government, a consortium of other organizations, a non-profit organization, or an informal group of people can all run a project.

All organizations share one concern: the ability to deliver the system. This includes having the ability to communicate with the customer (or model potential customers) and the ability to hire and support the team doing the work.

Organizations also share a need to maintain their reputation. If an organization has a reputation for delivering good systems, on time and on budget, it will be more likely to be able to keep going.

Some organizations have additional needs, focused around how the project will position them to deliver other things to other people. An organization may need to show a profit—enough to fund the organization’s overhead and to deliver returns to funders. An organization may need to be able to sell the system to potential customers. And an organization may need the project to position the organization for future work, based on improving the organization’s capability and maintaining its reputation.

Variations. There are many different kinds of organizations. These include:

Relation to broader organization. Obviously, most of an organization’s needs are addressed not by the team building a system, but by the organization’s management and operations. The systems project supports these needs, however. The organization needs to be able to estimate the cost and time involved in a project in order to ensure that it has the funding needed to complete the project. The organization’s reputation depends in no small part on its ability to execute the systems-building project, so things that help the project move ahead efficiently and smoothly will be good for the organization.

16.2.4 Funders

Funders provide the capital or other resources needed to build the system.

A funder has one primary need: the return on their investment. The return may be monetary (profit from sales of the system) or it may be more intangible (a business ecosystem, regional economic development).

Some funders will have secondary needs, such as enhancing their reputation and positioning themselves for funding future projects.

Variations. Funders can be external to the organization building the system, providing investment in the expectation of a monetary return. Venture capital funding is one example of this kind of funder.

The customer can be a funder when the customer pays for building the system. This can be a commercial customer funding the project through a contract. This can also be a government organization providing a development contract. The expected return in these cases is primarily the system itself, and secondarily less tangible benefits like the development of capacity to build similar systems.

A project can also use internal funding. This occurs when an organization has the capital to develop a system itself. The organization generally expects a return on its investment either by improving the organization’s own capabilities, such as by building a tool that helps the organization run better, or by providing a product that the organization can sell for a monetary return.

Relation to broader organization. While the organization has the primary responsibility for working with funders, a systems-building project can help meet the funders’ needs by building the system efficiently, using the investment well, and by producing a good system, which helps ensure that the expected return will occur.

16.2.5 Regulators

Regulators in general are people or organizations independent from the team and project. The regulators provide an external check on organizations and products to ensure they meet safety and security regulations, or that they provide legally-required public value.

Regulators need compliance with regulation in the system and in the work the team does to build the system. The regulator may verify that regulations have been met by inspecting the final system or by auditing records of the system’s creation. The regulator may block a system’s deployment until the system can be certified as meeting these requirements, as happens with aircraft. Alternatively, the regulator may depend on the team to know and follow the regulations and only check the system’s compliance when something goes wrong. The US automotive industry is an example of this.

The systems-building process, at minimum, supports regulators’ needs by ensuring the team knows and follows the regulations. This often involves dialog with the regulatory organization to ensure that the team has all the information it needs, and to ask for clarification or guidance when the team is unsure about a regulation. The team also needs to maintain records that can be checked to show how it has complied with regulations. When the system requires certification before being deployed, the team usually needs to engage with the regulators to ensure the process goes smoothly.

Variations. A government organization is the obvious regulator. They have the charter to look after the public interest, especially when a project has incentives that would work against that interest.

Industry organizations can act as de facto regulators. A group of companies can come together to set voluntary standards for the systems they make. The groups that coordinate Internet naming (the Internet Corporation for Assigned Names and Numbers, ICANN) or that standardize WiFi for interoperability (the IEEE Standards Association and the Wi-Fi Alliance) are examples. These organizations do not have authority to penalize systems that do not comply, but a non-compliant system is not allowed to claim compatibility.

Finally, there are non-governmental organizations that set safety or security standards, often for a particular industry. ISO, SAE, and others provide safety standards (such as [ISO26262] or [ARP4754]), and companies have grown up around them to help other organizations comply with the standards. These organizations also have no authority to penalize non-compliant systems directly, but compliance is usually used as evidence that government regulations are met, or as a defense against lawsuits.

16.3 Mapping needs to model

The previous section introduced a set of stakeholders that have an interest in how the project operates, and a summary of each of their needs. The next step is to work out how the model for performing the project can support meeting those needs (see the diagram above). This involves mapping the stakeholder needs to each of the parts of the model (artifacts, team, tools, plan).

I developed this detailed mapping. Appendix A reports the details of each stakeholder and their needs, along with the full derivation from needs to the requirements for the pieces of the system-building model. The mapping has the form of tables of requirements or objectives, with each stakeholder need mapped to one or more objectives for each part of the system-building model. The result is that every stakeholder need is either supported by aspects of the system-building model, or is explicitly labeled as the responsibility of others outside the system-building project. The derivation also shows that every objective listed for the system-building model is justified by helping meet some stakeholder need.

The remaining chapters of this part of the text explain what each part of the model should be or do. These chapters are based on the derivation in the appendix.

Sidebar: Summary

Chapter 17: Artifacts

25 May 2024

17.1 Purpose

Artifacts are all the things created in the process of making a system. They start with records of the purpose of the system and the requirements it must fulfill. They include the implementation of the system ready to deploy—such as hardware inventory in a stock room and software ready for installation. The artifacts include everything in between: design, source code, verification records, rationales for decisions, records of reviews and approvals, and many, many more. The artifacts also include information used by the team to help do its job, such as information about who is on the team, processes to follow, and how the team operates.

The objectives for artifacts are documented in Section A.3.1.

The artifacts have three functions: as deliverables, as communication, and as a record of the project for auditing.

As deliverables, the implementation artifacts are the actual system to be deployed. It should be possible to take a set of implementation artifacts, assemble them (following instructions that are themselves artifacts) and have a working instance of the system. These artifacts are joined by things like records of regulatory approval and information associated with serial numbers or versions showing the history of the specific artifacts deployed in the system.

Most of the artifacts, however, are for communication: between people working on one task and another, between the customer and system designers, between those who implement and those who verify. Sometimes those people are working concurrently, such as when two people design two components that are expected to work together. Sometimes the communication is between someone who specifies attributes for a part of the system and someone who implements that part. The communication is also between someone who made a design decision and someone who, years later, must understand that decision in order to make changes to the system.

Audit is a special case of communication. It is between the project and someone outside who will be checking the project’s work. In many cases the external party will have an adversarial role, looking to find mistakes or violations. Regulators, for example, may look through records to check that the team has followed processes that meet regulatory requirements.

Note that there are many ways to achieve the objectives laid out in this chapter. Each project will need to determine how to handle its own artifacts. The specific solution will depend on the complexity of the project, the size of the team, and requirements from the organization or industry. The appropriate solution may change over time: as a team grows, it may need more formal mechanisms.

I have seen a range of working approaches for handling artifacts. Two projects kept track of planning information on designated whiteboards. Others maintained plans in project management tools. (The whiteboard approach had a problem: one time someone erased the board. Luckily there was a recent picture of its contents.)

I have also been on projects that had an overly complicated solution. One project was a joint venture between multiple companies on multiple continents. That project used multiple repository tools for different kinds of information. There was a process for proposing design and implementation changes, but no one knew quite what it was or how to follow it. After a few years that joint venture fell apart, in part because the teams could not figure out how to work together.

Whatever solution you adopt, it is important that it fit your project and team. It should be capable enough to manage the kinds of artifacts your team will use, and simple enough for the team to use.

The objectives in this chapter can help you work out what capabilities your solution should handle.

17.2 General principles

The artifacts are meant to be shared, at least within the team and sometimes to people outside. The people using these artifacts will come and go, so supporting people who will use them in the future is as important as sharing in the moment.

This leads to some general principles about artifacts.

People should be able to find the artifacts they need. An artifact is not useful if the people who need it don’t know it exists, or if they don’t know how to find it. The artifacts should be organized in some way that helps people find them.

“Finding” has multiple aspects. It can mean that when they know something exists, they can get to that artifact conveniently. It can mean that they know that a general kind of thing probably exists, and they need to be able to navigate through to the artifacts of that kind. They may not know what is out there, and need to be able to browse or discover artifacts in order to learn about the system. Or it might mean that they need to have confidence that they can itemize all of a certain kind of artifact, without missing any.

People should have confidence that they have found the correct artifact. In the worst case, someone will look for a particular thing and find three or four potentially-relevant artifacts. Which, if any, of those should they believe? What if they disagree with each other?

This principle generally means, first, that any particular piece of information or artifact should be in one place. There should not be two different artifacts that appear to be authoritative sources for the same piece of information. It also means, second, that when there are legitimately multiple versions of an artifact, those versions should be clearly identified and that a user should see consistent versions of different artifacts unless they take explicit actions to see different versions.

The artifacts should be maintained securely. The system that the customer will ultimately use is based on many artifacts that the project maintains. If someone subverts or damages some of those artifacts, the resulting system will be compromised. If someone destroys some of the artifacts, some of the team’s work will be lost.

This argues at minimum for maintaining the integrity of the artifacts, meaning that the artifacts or the collection of them cannot be modified in an unauthorized way. (Good practice is that any change to an artifact can be traced reliably to the person who made the change.)

Some of the artifacts may need to be kept confidential, if they contain secret information. Almost every project has some information to be kept confidential, at minimum as part of maintaining the integrity of artifacts. (Login credentials, for example.)

17.3 Kinds of artifacts

This section lists the kinds of artifacts that the analysis in Appendix A showed contribute to meeting stakeholder needs. The artifacts are listed in the order used in that analysis.

17.3.1 Purpose and constraints

These artifacts include clear documentation of the customer’s purpose for the system. Every feature of the system derives, directly or indirectly, from this purpose. If that purpose is not written down, the team is unlikely to accurately design to meet those needs—and is likely to add features that the customer does not want (so-called “gold plating”). These artifacts should be visible to most of the team in order to guide them as they design, build, or verify the system.

The customer’s non-functional constraints should be included. This includes the safety, security, and reliability they expect.

Constraints from other stakeholders should also be documented. The organization may place constraints on the project, such as expected profitability. Regulators can place many constraints that must be met to license or certify the system.

The understanding of the purpose or constraints will change over time. A customer will find they have needs they did not initially realize, or they will discover that whatever purpose was agreed with the team is not quite what they meant. An organization or regulators may change their constraints as time goes by.

There should be an explicit record of the changes requested or identified. If a change is accepted—and the project may choose to reject some changes—then it should lead to a new version of the purpose and constraints. It should be possible to determine whether other artifacts, such as requirements or design, are consistent with a particular version of the purpose and constraints.

The specific kinds of artifacts include:

17.3.2 Team information

Maintaining information about the team helps the team work together.

I worked on one project where the management did not want to put together an organization chart or a list of team members. I ended up talking to the wrong person about a particular technical subject—that person was happy to talk about it, but it turned out they were not actually on the part of the team working in that area. Their opinions turned out not quite to agree with those of the person actually in charge, but I hadn’t been able to find the person I should have been talking to.

This kind of confusion is more common than people expect, and it results in people getting the wrong information, or in people not getting information they should.

Information about the team is only valuable if it is accurate, however. The team should have someone responsible for keeping it up to date—meaning that ideally updating the information is a normal part of the processes (Chapter 44) for bringing in a new team member or changing assignments.

The specific kinds of artifacts that will help include:

17.3.3 System artifacts

These artifacts are the system that is being built—the majority of the work of a project.

The system artifacts include:

The exact set of these system artifacts depends on the process and life cycle (Section 20.3) that the project uses. If the life cycle has some review milestone that a part of the system is supposed to meet, then there may be documents or analyses specific to that review.

That said, good system building practice involves some core kinds of artifacts: specifications, designs, and implementation.

The artifacts should include some items that are more about the system building process than about the deliverable system itself. These include:

How the team maintains these artifacts can vary widely. Many software efforts use version control systems, which maintain versioned software artifacts in a repository server. Many hardware design tools either provide their own versioning repository, or are designed to work with a separate repository system. For hardware artifacts—not their design—one must work out where to store and how to track each physical artifact.

17.3.4 Verification artifacts

Verification artifacts support verifying that the system (or components in it) meet their intended purpose and specification, and that they are free of errors.

These artifacts include:

These constitute a record of which parts of the system have been checked and found to meet their verification criteria.

Verification should be repeatable. The artifacts maintained for doing verification checks should be complete enough that different people can perform the checks in the same way. The instructions for performing checks should be clear. The test equipment should be maintained and people should have instructions on how to use it. Software test environments should be controlled so that when a test is run twice, it is in the same environment both times.

The verification results are generated by people performing checks, and used by people reviewing part of the system to ensure it has been checked before it is accepted as working. They may also be audited by regulators or other outsiders who will be checking whether the project has built the system properly.
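A simple way to support both repeatability and later review is to record each verification run with enough information to reproduce it. The record below is a hypothetical sketch; the field names and values are invented examples, not a required format.

    # Hypothetical record of one verification run. Enough detail is kept
    # that a different person could repeat the check, and an auditor can
    # later see what was verified, against what, by whom.
    verification_record = {
        "procedure_id": "VER-EPS-021",            # which written procedure was followed
        "procedure_version": "1.3",
        "item_under_test": "PDU",                 # component or subsystem checked
        "item_version": "rev C, serial 0007",
        "requirements_checked": ["EPS-REQ-40", "PDU-REQ-15"],
        "environment": {
            "test_bench": "EPS flatsat #2",
            "software_image": "eps-fsw 2.4.1",
            "calibration_date": "2024-04-02",
        },
        "performed_by": "J. Doe",
        "date": "2024-04-18",
        "result": "pass",                          # pass / fail / pass-with-deviation
        "evidence": ["logs/ver-eps-021-run3.log"],
    }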

17.3.5 Release, manufacture, and deployment

Releasing and deploying a system are complementary steps. Releasing involves taking implementation artifacts and making them available for manufacture or distribution.[2] Manufacturing the system follows if needed—involving producing and assembling hardware, or packaging software into a deployable form. Deployment takes the manufactured system and sets it up for a customer to operate.

undisplayed image

The artifacts should include the procedures used to release, manufacture, and deploy the system. The release procedures define the sequence of steps involved in taking a version of the implementation artifacts, checking that they have been verified and meet the intent of a release (such as the features implemented or bugs fixed), and placing copies of those artifacts in a separate area as a release. The manufacturing procedures define how to take those released artifacts and manufacture products that are ready for deployment: assembling hardware according to a released hardware design, for example, and giving them serial numbers. The deployment procedures tell how to take those manufactured artifacts and install them so that they are a working customer system.

There are different variations on this flow of operations depending on whether one is releasing and deploying a whole system or an update, whether the artifacts are electronic (software or data) or physical (hardware components), and whether the system will be mass produced or not.

Hardware components will generally start with a release of a hardware design. That hardware design is the basis for manufacturing instances of the component. Whether it is a single unit made in house or many units produced in a dedicated facility, the manufacturing procedure determines how the hardware products are made. Before finishing manufacture, hardware components are typically given an identity, often recorded as a serial number, that identifies the specific component instance and associates it with records like which design release version was used, what subcomponent parts were used, date of manufacture, and so on. Then the part is placed in inventory from which it can be deployed.

Software components most often follow a different path. Being electronic information rather than physical, software has no physical manufacturing step. The release procedure gathers implemented software and creates a deployable package from it. The manufacturing procedure gives the package an identity (a release number) and signs it or otherwise sets up security protections. It can then be copied to a server that makes it available for distribution and deployment.

Deployment procedures take hardware from inventory and software from a distribution server and put them into use for a customer. This could be as simple as letting customers know that a software update is available for download. It could involve moving a number of physical components to a customer site, setting them up, and performing deployment checks to ensure that the installed system is working. It could be as complex as delivering a spacecraft to its launch provider, preparing it for launch, and having the spacecraft start up on orbit.

The whole process of producing deployed systems often generates a lot of records. Hardware devices have associated records about what specific design was used, what subcomponents were used, and when and where each unit was manufactured, and then accumulate service records: when deployed, what defects were reported, what repairs were made, how the device was disposed of at end of life. Software has similar records: the identity of the software image, the versions it contains, how it was built, when it was made available for deployment, where it has been deployed, and its service history.
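As a purely illustrative example of this kind of record keeping, a project might maintain a per-unit record along the following lines; the fields and values shown are invented, and a real project would define its own set.

    # Hypothetical as-built and service record for one hardware unit.
    unit_record = {
        "serial_number": "PDU-0007",
        "design_release": "PDU rev C",             # which released design was built
        "subcomponents": ["relay board SN 0113", "housing lot 42"],
        "manufactured": {"date": "2024-03-11", "site": "in-house shop"},
        "deployed": {"date": "2024-05-02", "customer": "Flight unit 1"},
        "service_history": [
            {"date": "2024-06-10", "event": "defect report DR-88: relay chatter"},
            {"date": "2024-06-14", "event": "repair: relay replaced, retest passed"},
        ],
        "disposition": None,                        # filled in at end of life
    }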

17.3.6 Project operations

Artifacts that support operations can be broken down in the same way that operations itself is (Section 7.3.5 and Chapter 20).

The project’s life cycle and procedures can be maintained in simple documents. Because these documents serve as a reference for team members, it is important that people be able to find easily the parts of the documentation they need for a particular situation: for example, if someone is setting up a design review for a particular component, they need to find the procedure for design reviews. The documents also need to support people reading through the life cycle or procedures to learn how the project operates in general. Having a good table of contents or index and accurate summaries can help them understand the breadth of operations before they need to learn about some specific procedure.

I have worked on several projects—especially including NASA projects—that develop complex “management plans” and “systems engineering management plans”. I have found that few people in the team actually use these documents. The management plans often follow a template that speaks to the team’s aspirations (“the team will do X”) but does not lay out the actual procedures (“do X by doing Y and Z”). The information in these plans is also often organized for a management reviewer, rather than for the people who need to follow the procedures. As a result, the documents sit unread after being approved and the team operates on shared lore about how to do one task or another, and the plans become increasingly out of date as the team’s practice diverges from the original intent.

Instead, the life cycle and procedure documentation should:

Beyond the life cycle and procedures, planning and tasking activities involve creating and maintaining records. These artifacts are often maintained using specialized tools, such as project planning tools and task management (or issue management) systems.

Operations also maintains records of supporting information, such as budgets, risk registers, and lists of technical uncertainty.

17.3.7 Regulatory artifacts

Working with regulators typically involves a lot of records. The team uses some of these to guide how it builds the system. Other records form a legally-binding record of what the project has done and how the team has interacted with the regulators.

First, the artifacts should include records of the regulations that the project must comply with. This might be as simple as references to publicly-available reference sources (such as web sites that make current government regulations available). It may also include documents that explain what these regulations mean. This information is only of value if it is accurate; this means it must be kept up to date as regulations change. (In some fields, it is worthwhile having someone who tracks likely upcoming regulatory changes so that the team can anticipate those as well as working to current regulations.)

The artifacts should also include records of the processes that the team needs to follow working with the regulators. For example, if the system must obtain a license before being deployed for use, then there will be a process for applying for that license. Again, this information must be kept up to date to be useful. The processes are often difficult to find or interpret, so it is helpful to maintain documents that explain the process as well as just a record of the process.

Second, systems that need licenses or certification will require applications to regulators. The application information should be maintained, including copies of any application forms (with dates!) and any supporting documents generated as part of putting the application together. For example, I helped one team apply for a license to operate a small spacecraft in low Earth orbit. The license application included an orbital debris assessment report, which was sent to the regulator as part of the application packet. The assessment report included information generated by a debris assessment tool [NASA19]. The database used by the assessment tool was an artifact to be maintained, along with the report itself.

Correspondence with regard to the applications also needs to be maintained. This should include any information that shows how the team took steps to follow the application processes.

Next, the project must keep records of licenses or certificates that have been issued.

Finally, the project will need to maintain evidence that the system it has built complies with regulation, whether a license application is involved or not. This evidence takes the form of a mapping from a table of regulatory requirements to the evidence of compliance with each of the requirements. The evidence can be complex: for example, showing that the probability of a particular hazard occurring is below a mandatory threshold.
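Such a mapping is often maintained as a simple table or matrix. The fragment below is a hypothetical sketch of that structure; the regulation descriptions and evidence entries are invented for illustration.

    # Hypothetical compliance matrix: each regulatory requirement maps to
    # the evidence the project will point to during licensing or audit.
    compliance_matrix = [
        {
            "regulation": "Orbital debris mitigation: passivation at end of mission",
            "applies_to": "EPS",
            "evidence": [
                "EPS-REQ-40 and its verification results",
                "Orbital debris assessment report, section 5",
            ],
            "status": "complete",
        },
        {
            "regulation": "Casualty risk from reentry below mandated threshold",
            "applies_to": "Spacecraft structure",
            "evidence": ["Debris assessment tool output, run of 2024-04-02"],
            "status": "in review",
        },
    ]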

17.4 Managing artifacts

Artifacts are the result of the team’s work, and thus they carry value to the team and its customers. They represent the system being built. They are used continuously to inform and manage the team. They are often used long after they are created, to audit the work and to guide modifications to the system.

The artifacts change over the duration of the project. An early design draft gets revised into a version used to build the corresponding component. Later, the design is revised for a second-generation component.

These conditions lead to three general principles for managing artifacts: security to protect integrity, organization so people can find the artifacts, and change management.

17.4.1 Security

The artifacts need to be managed in a way that preserves their value by maintaining their integrity. Losing or damaging an artifact results in a loss that could be anything from annoying (losing minutes from a status meeting) to fatal to the project (damaged implementation of a critical component). The artifacts should be protected against both accidental loss, such as a server breaking, and malicious loss. For data artifacts, this means using resilient storage systems with good cybersecurity. For physical artifacts, it means storing artifacts in storerooms that maintain a benign environment and that provide physical security.

Access to the artifacts should be limited to authorized people using access control mechanisms. These mechanisms reduce the risk of malicious damage by limiting who can get to the artifacts. For artifacts that need to be kept confidential, limiting access helps reduce knowledge leaking to unauthorized people.

17.4.2 Organization

A random jumble of artifacts is of little use to people on the team. The team members need the artifacts to be organized in a way that allows them to find the ones they need accurately and quickly.

There are two kinds of “finding” that team members will do.

In the simple case, they will know what they need: the design document for some component, or the risks associated with the project, or widget serial number X. To find something specific, they need to know where to find artifacts and how those artifacts are organized. They can use that organization to get to the specific one.

The other case is when someone knows they have a need but does not know exactly what they are looking for. This might be someone who has recently joined the project, or someone who is working in an area they aren’t familiar with. These people will need to be able to see and learn how the artifacts are organized, and will need a guide to help them understand what is available.

Finally, there should be one logical place for each artifact, and artifacts should not be duplicated. (There might be copies for redundancy, but the people looking for one artifact should see those copies as if they were one thing.) Two people looking for the same information should not end up finding two different artifacts that cover the same topic and that have diverged from each other. This leads to people building incompatible components, sometimes in ways that are hard to detect but that lead to errors in the system.

17.4.3 Change management

As I have noted, artifacts change regularly over the course of a project. However the artifacts are managed, the approach needs to account for the effects of these changes.

Some artifacts, like records of task assignments and progress, change often but at any given time there is only one accurate copy of the information.

Most system artifacts, on the other hand, evolve in more complex ways. At any given time there may be multiple versions that are works in progress—containing incomplete changes that their creators don’t believe are ready to be used by others. Some of those in-progress versions may develop to become accepted versions, ready for others to use: a design that is ready to be implemented, or an implementation ready for integration testing. A version that has been used like this may later become obsolete as an updated version comes along.

This pattern of change calls for supporting versioning on this kind of artifact. Versioning means that one can find multiple versions of the artifact, and each artifact has an identifiable status so that someone can know whether they should be using that version to build other artifacts, or just looking at the version to understand it.

The dependencies of one artifact on another, such as a design leading to an implementation and an implementation leading to verification test results, mean that mutually consistent versioning is also important. When looking at an overall version of the system, it should be clear that (for example) the design for component X has been updated, the implementation for that component is in the process of being updated to match the design, and any verification results are from an older implementation that may no longer be accurate.

Most project life cycles and procedures define different statuses that an artifact version can have, along with procedures for how that version can change status. While the details differ, the statuses generally include some sort of work in progress, proposed, approved (or baselined), and superseded. The procedures generally say what has to happen for a version to move from one status to another, such as defining that a proposed design needs a review and approval step to be accepted as a baseline.
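As a minimal sketch, assuming the four generic statuses named above, the workflow might be expressed as a set of allowed transitions; a real project would define its own statuses and the review steps attached to each transition.

    # Generic artifact-version workflow: statuses and the transitions the
    # project's procedures allow. The status names follow the generic
    # statuses described above; the transition rules are illustrative.
    ALLOWED_TRANSITIONS = {
        "in-progress": {"proposed"},
        "proposed":    {"approved", "in-progress"},   # approval, or rework after review
        "approved":    {"superseded"},                # replaced by a newer baseline
        "superseded":  set(),
    }

    def change_status(current: str, new: str) -> str:
        """Apply a status change, enforcing the project's workflow rules."""
        if new not in ALLOWED_TRANSITIONS[current]:
            raise ValueError(f"cannot move a version from {current} to {new}")
        return new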

17.4.4 Implementing artifact management

There are many tools and processes in use today for managing artifacts. At the time of writing, no one tool works well for all kinds of artifacts, and so a project must stitch together its approach to managing artifacts out of multiple different tools.

Electronic artifacts. Software development uses version control systems to manage electronic files. There are many such systems, all of which provide a storage repository with a few common features:

Other industries use document control systems to manage collections of electronic files. These systems also provide a repository for a collection of files, but they generally focus on the management of documents rather than just versioning. They commonly include features like:

In addition, tools such as CAD systems or requirements management tools often include versioning and workflow features. These tools support creating different versions of an artifact, and defining a workflow for the procedure to be followed for approving a version as a baseline.

In practice the tools for managing artifacts do not often work together, requiring a project to (for example) select one tool for managing software artifacts, one for CAD system artifacts, another for structured systems engineering artifacts (such as requirements or specifications), and another for documents that do not fit neatly into these other categories.

Hardware artifacts. Many projects will create physical artifacts—mechanical components, electronic boards, manufacturing jigs, and testing equipment. These physical components need:

Sidebar: Summary

Chapter 18: Tools

25 May 2024

18.1 Purpose

Tools are things that people use while designing and building the system. The tools are not part of the system itself; they are not delivered to an end user. Their purpose is to help the team do their job. Each project will have its own needs for tools, so this list is meant to inspire ideas rather than prescribe what may be needed for building any specific system. There are, however, some common principles for selecting and managing tools.

This chapter brings together information about many different kinds of tools, with references to the other parts of this volume that discuss details.

Please note: I do not recommend specific tools.

18.2 General considerations

There are a few general principles that apply to selecting tools.

First, most tools will be used for shared work. Tools should be evaluated on how well they help the team work together. Computer-based tools that manipulate shared data, such as CAD tools, should make it easy for multiple people to access the information concurrently. They should support the project’s approach to versioning information (Section 17.4.3). Physical tools should be accessible to those who need to use them. This is especially important to consider if people work in multiple physical locations.

Second, many tools require training to be used effectively and safely. The project must ensure that each person has been trained to use a tool safely before they are allowed to use it. That implies that tools should be evaluated on the quality of educational material available on how to use them.

Third, good tools are integrated so that they work together. Tools that can share information can provide greater value to the team than ones that cannot.

Next, tools should support the general life cycle and procedures the project uses. They should fit into the project’s procedures for managing artifacts, versioning them, and reviewing them.

Finally, tools should be secure. Good tools will support the project’s overall approach to security, including controlling access to information based on a person’s role in the project. This includes both electronic and physical security.

18.3 Kinds of tools

This section provides an overview of all the kinds of tools discussed elsewhere in this volume, with references to the sections that provide details. The overview can serve as a checklist for a team working out what tools they need.

18.3.1 Storing and managing artifacts

The tools for storing and managing artifacts are discussed in Section 17.4.4.

Electronic artifacts. Alternatives include:

Hardware artifacts. These can use:

18.3.2 Specification tools

As I will discuss in Part VII, the team will develop specifications for system components. A specification defines a component’s external interfaces—in systems terms, how the component is part of functional and non-functional relationships (Section 12.2).

There are several kinds of specifications (Section 33.4), including requirements, interface definitions, and models.

Requirements (Chapter 34). Requirements provide textual statements of things that are to be true about a system or component. Requirements can be managed using:

I list a number of considerations for selecting requirements management tools in Section 34.13.

Interface definitions. Interface definitions specify how one component can interact with others. These can be written using:

Models. Mechanical, mathematical, electronic, behavioral, and other kinds of models are used as specifications. Relevant tools include:

18.3.3 Design tools

A project’s design phase works out a set of designs for the system and its components that satisfy the corresponding specifications.

A design records the structure of each component—whether a high-level, composite component or a low-level component (Chapter 11). It also records analyses that lead to designs and rationales for how a design ended up as it did.

There are two kinds of design artifacts: the breakdown structure and the designs themselves. The model in Section 11.4 has six parts to a component design: form, state, actions or behaviors, interfaces, non-functional properties, and environment.

Breakdown structure (Chapter 36). I recommend that the component designs be organized by the component breakdown structure. This structure organizes the components into a hierarchical name space, giving each one a unique identifier and showing how one component is made out of others.

On most projects, I have used a spreadsheet to list all the components, the breakdown organization, and their names. This has worked well enough, and I am not aware of tools that explicitly support such organization.
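As an illustration of what such a spreadsheet captures, the sketch below keeps the breakdown as flat rows (identifier, parent, name), checks that identifiers are unique and that every parent exists, and prints the hierarchy. This is only a sketch; the component identifiers and names are made up for the example.

# Each row: (identifier, parent identifier or None, name). Values are illustrative only.
ROWS = [
    ("SYS",        None,     "Complete system"),
    ("SYS.AV",     "SYS",    "Avionics subsystem"),
    ("SYS.AV.PS",  "SYS.AV", "Power supply board"),
    ("SYS.AV.FC",  "SYS.AV", "Flight computer"),
    ("SYS.GSE",    "SYS",    "Ground support equipment"),
]

def check_breakdown(rows):
    """Check that identifiers are unique and every non-root component has a known parent."""
    ids = [ident for ident, _, _ in rows]
    assert len(ids) == len(set(ids)), "component identifiers must be unique"
    known = set(ids)
    for ident, parent, _ in rows:
        assert parent is None or parent in known, f"unknown parent for {ident}"

def print_tree(rows, parent=None, depth=0):
    """Print the breakdown as an indented hierarchy."""
    for ident, par, name in rows:
        if par == parent:
            print("  " * depth + f"{ident}  {name}")
            print_tree(rows, ident, depth + 1)

check_breakdown(ROWS)
print_tree(ROWS)

Keeping the breakdown in a flat, checkable form like this makes it easy to give each component a unique identifier and to spot gaps or duplicates as the structure grows.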

Form. The form represents the aspects of a component that do not change, or change only very slowly. The design of physical components is generally handled using CAD tools. These tools use notations or drawing standards appropriate to each subject.

State, actions, behaviors. This part of a design addresses the parts of a component that change readily.

Non-functional properties. These properties change slowly and are not part of the component’s form.

Environment. This is the environment in which the component is expected to operate, or in which it may be stored. This is usually recorded in text.

18.3.4 Analysis tools

These tools help the design process by providing feedback on how well a particular design works. They also are used when verifying a proposed design.

18.3.5 Build tools

These tools help translate designs or implementations into operable components that can be integrated into a running system, or used for testing.

The built artifacts will need to be stored and tracked, as discussed above.

Physical artifacts. The building of physical artifacts is, in effect, manufacturing one or a small number of those artifacts. These can be built in multiple ways.

In-house building will require maintaining a stock of the materials used in the components. This may include a stockroom of pre-acquired parts, such as metal or plastic stock and fasteners, or suppliers that can provide the needed material quickly.

The building process should be deterministic: if the team builds multiple instances of the same component, the components should all look and behave the same way. This places constraints on whatever tools and procedures are used to build the components.

Software artifacts. Software artifacts are built by translating source code into binary and packaging it in forms that can be installed on a target system.

The software build process must be repeatable: if the same software is built twice, the result should be identical in behavior (differing only in details such as version numbers, timestamps, or signatures that depend on them). This usually means that the software build tools should be under configuration management so that identical tools will be used each time.
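One simple way to put build tools under configuration management is to record the expected tool versions alongside the project and check them before each build. The following is a minimal sketch of such a check; the tool names, version strings, and commands are hypothetical examples, not a recommendation of specific tools.

import subprocess
import sys

# Pinned tool versions recorded with the project (hypothetical tools and versions).
PINNED = {
    "gcc": "12.3.0",
    "cmake": "3.27.1",
}

def installed_version(tool):
    """Return the first line of `tool --version`, or None if the tool is not available."""
    try:
        result = subprocess.run([tool, "--version"], capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.splitlines()[0] if result.stdout else None

def check_pins(pins):
    """Report any tool whose installed version does not match the pinned version."""
    ok = True
    for tool, expected in pins.items():
        found = installed_version(tool)
        if found is None or expected not in found:
            print(f"MISMATCH: {tool} expected {expected}, found {found}")
            ok = False
    return ok

if __name__ == "__main__":
    sys.exit(0 if check_pins(PINNED) else 1)

A real project would extend this idea to libraries, build scripts, and the environment the build runs in, but the principle is the same: the record of what tools were used is kept and checked, not assumed.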

18.3.6 Testing tools

Testing involves taking a component, or collection of components, and subjecting it to some sequence of activities to verify that the component behaves as specified.

Testing occurs at two different times during system development: as people are building parts of the system and when a component or the system is being verified for final acceptance. These two uses lead to somewhat different needs in the tools for testing.

Tests need to be accurately reproducible: someone should be able to run a test one time on one component, then run the same test later on the same component and get the same result. Of course some component behaviors are not fully deterministic, but accounting for that, a passing test should mean that the component really does meet the specification being tested. If a test fails, people need to be able to reproduce what happened to understand the flaw and to determine whether a fix works.

Reproducibility places constraints on testing tools. Physical tests will need to be done in consistent environments, using control and measurement tools that can be calibrated to ensure they are behaving consistently. Software tests similarly need to be run in controlled environments.
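For software, one common way to handle behavior that is not fully deterministic is to fix and record the source of randomness so that a failing run can be replayed exactly. The sketch below is a minimal, hypothetical example: the component under test (jittered_retry_delay) and the seed are illustrative, not taken from any particular project.

import random
import unittest

def jittered_retry_delay(attempt, rng):
    """Example component behavior: exponential backoff with random jitter."""
    return (2 ** attempt) + rng.uniform(0.0, 1.0)

class RetryDelayTest(unittest.TestCase):
    SEED = 20240625  # recorded with the test so a failing run can be replayed exactly

    def test_delays_increase(self):
        rng = random.Random(self.SEED)  # seeded generator makes the run repeatable
        delays = [jittered_retry_delay(attempt, rng) for attempt in range(5)]
        self.assertEqual(delays, sorted(delays))  # backoff should never shrink

if __name__ == "__main__":
    unittest.main()

Recording the seed, along with the rest of the test environment, is what makes a later re-run of a failed test meaningful rather than a roll of the dice.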

Hardware testing. Testing hardware components can range from measurements of single components to integration tests of subsystems or even the complete system. The tools available vary widely, depending on the kind of testing being done.

All hardware testing will involve:

Tools that support testing electronic components can include:

Tools for testing mechanical components include:

Integrated system testing can go well beyond the tools listed here. Flight testing a new aircraft, for example, is far more complex than suggested by these tools. I leave the design and operation of such testing to others better versed in it.

Software testing. Software testing generally involves setting up a number of test cases or scenarios, running the software being tested, and recording the results. There are many different tools that can be used, and these depend on the kind of test being performed and the environment or language being used.

Categories of tools include:

18.3.7 Operations tools

The team uses other kinds of information to manage its operations: information about the team itself, about procedures and plans, and information that supports decision-making.

Team information (Section 17.3.2). This information is organized around the roster of who is on the team, along with their roles and authority.

This information links to other tools, some of which are often outside a project’s scope. These include:

These relationships get updated whenever someone joins the team, leaves the team, or their role changes. Using tools that guide people through the procedures for these updates will make the changes more accurate.

Life cycle and procedures (Chapter 20). Teams follow a project life cycle and procedures to do their work. These consist of steps that people should follow to get specific tasks done.

Workflow management tools exist to help guide people through these procedures. These tools can help by:

Plans and tasking (Section 20.5 and Section 20.6). The project maintains plans for how the system-building work will move forward and the work currently in progress. The plan records the work that the project will be doing, at varying levels of confidence and detail, while tasking tracks the specific work that people have been assigned. This information is used both to make sure that the team do the work that is needed, without important tasks getting forgotten, and to forecast the time and resources needed to move forward.

Maintaining plans and tasking is an exercise in managing a lot of detail. Many tools are available to help with these.

In practice, many of the tools available have been designed for projects other than systems-building, and do not support systems projects well. Many project scheduling tools are based on methodologies worked out for predictable work like building construction, where the tasks can be known fairly accurately in advance. These tools are often organized around a Gantt chart of the work, prompting their users to estimate durations and make task assignments early in the project. This works poorly in systems projects that have significant uncertainty early on, and where the degree of certainty (or predictability) improves unevenly as time goes by. The result is often a false sense of confidence in the project’s schedule early on, and a lot of effort spent trying to keep the schedule up to date as work moves forward.

It is worth spending effort working out how a project will manage its planning and tasking, and ensuring that any tools chosen will support that approach.

Support. Project operations maintains other kinds of information as well, for which tools are sometimes available. These include:

18.4 Managing tools

Good tools can enhance a team’s performance. Poorly chosen or implemented tools can harm it. One must choose tools carefully and apply thought to how they are implemented and used.

A project’s tools are themselves systems, and should be treated with the same care as the system being built for a customer.

Each tool should have a purpose. Spending the time to work out who will benefit from a particular tool, both directly and indirectly, can provide useful guidance when choosing between options for that tool.

The engineering support tool industry has generated many products, so there are often many possibilities to choose from. While the team can sometimes cut a decision process short because they already have experience with one particular tool, in other cases it is worth setting out some criteria for making the choice.

Factors that can influence the choice of tool include:

Once a tool has been chosen, it will need to be purchased or built, and deployed for the team to use. This usually requires finding space for the tool, whether that is physical space in a lab or capacity on a compute server. The acquired tool will need to be deployed and integrated into the project’s systems: adding information about the tool to an inventory database, setting up a service schedule if needed, integrating software systems with the project’s security mechanisms.

Team members will need to learn how to use new tools. For some tools, this can amount to providing a written introduction or presentation on how the tool works. More complex tools will require more formal training. If there are safety or security risks in using a tool, the project should ensure that people are required to receive training before using it. It is common to track formally which people have received this kind of safety training.

Sidebar: Summary

Chapter 19: Teams

26 June 2024

Building a complex system involves a team of people to do the work. The people in the team fill many different roles: developers, managers, customer and regulatory interfaces, support staff, among others.

A team of more than perhaps three or four is not an amorphous blob of anonymous people; it is organized so that each person has a role. The way a team is organized may arise spontaneously or deliberately, but it will end up with an organization. A well-functioning project will design its team organization and take deliberate actions to maintain its good function.

In this chapter I discuss the issues to be addressed when deciding how a team should be organized, including its structure, roles, and communication.

19.1 Purpose

Building a complex system requires many people to share the work. One person cannot do all of the work: they will be overwhelmed, it will take too long to complete the system, and the project will likely require skills no one person has.

In the model of making systems (Chapter 16), the team consists of a group of people who do the tasks that create the artifacts that make up the system. The team are informed by the project’s operations—plans, procedures, life cycle—and use tools to do their tasks.

The team is a social entity. The people in the team work together and interact constantly. How well they get along with each other influences how well they get work done.

A team is, however, less than a complete society. The team’s social structure is relevant only to the work they do on the shared project. The team structure does not define how the people on the team organize the rest of their lives: that falls to community and family. This means that the social structures in a team are simpler than those of a complete society.[1]

It is generally understood that the structure of a system is homomorphic to the structure of the organization that is building the system [Conway68]. This means that people must work to ensure that the structure of the team and the structure of the system are compatible, for example by organizing the team around the system structure when possible. Doing so requires having an understanding of what the system structure is, and the hierarchical component breakdown (Chapter 36) provides part of that understanding. In the other direction, the team’s organization will inevitably bias how the system is organized and built; being aware of the two organizations helps one to see unhelpful bias reflected in the system organization.

One can look at the purpose and needs of the team from the point of view of the people in the team and of the customers, organization, and funders who want to see the system built (see Section 16.2 for a discussion of these stakeholders).

Members of the team (Section 16.2.2, Section A.2.2) generally look for satisfaction in their work, enough help to get the work done, and a working environment that gives them a secure sense of how to do their work. Team members generally want team cohesion, in which people have developed the bonds and trust that allow them to work together without friction. That is, they are motivated first by how the project affects them. The needs of the stakeholders are a secondary concern, mainly in how meeting those needs contributes to satisfaction and compensation.

Other stakeholders (Section 16.2.1 through Section 16.2.5) look to the team to build the system efficiently and accurately. They are motivated by the value that having the system completed will bring and by the cost of building it. The needs of the team members are secondary, in the ways that the well-being of the team contributes to the cost or benefit of building the system.

Meeting the stakeholder needs involves:

An effective team balances these two classes of need—those of the people on the team and those of external stakeholders. The needs can be in conflict, as when building a system efficiently and rapidly means that someone on the team has to do a task that they don’t enjoy. More often, both classes of need can be met through the way the team and its culture are organized. A team member’s satisfaction increases when they have confidence that their work is contributing to the project’s success, which comes in part from assigning tasks to the most appropriate people, avoiding duplication and rework, and ensuring that people communicate well. In general, when a team is able to use resources—people, tools, funding—effectively, the team members’ confidence in the project will increase.

19.2 Model of teams

The following is a model for reasoning about teams. I will use this model in a later section (Section 19.3) to discuss how a team’s structure and culture can be understood, and how that can be used to manage a team.

The model begins with people. Teams are fundamentally social structures, made up of a group of people, each of whom has their own skills and experience. These people share the work of building the system and of the needed supporting activities.

The role of the team in a project is to do the work of building the system. The work can be understood in terms of time-limited tasks and ongoing roles. A task is a particular piece of the work, with an intended result and a limited duration. A role is an ongoing assignment of responsibility, which leads to performing tasks within the scope of that responsibility.

Consider a team from the point of view of one team member. That team member has tasks to do, and roles for which they are responsible. They need to know what tasks they should be doing, what roles they are responsible for, and what they are not responsible for (so that they can refer instead to others who do have the appropriate role). As they do their tasks, they need input: how their task fits with other tasks, including ones that other people are doing, and how parts of the system are supposed to work. They will have questions to ask of others. In the course of doing a task, they will make decisions—about concept, about design, about implementation. These decisions will in turn affect others. From time to time they will find problems, both technical and social, and will need to identify who to work with to resolve them.

At the same time, the team member sees themselves as part of the group. They will need to understand the team’s culture and norms. They will want their social needs met, developing trustful relationships with others they work with. The personal relations that someone has with others on the team influence who they choose to work with and who they will avoid, and influence how well they work together when they need to.

How someone works in a team can be expressed in terms of the team’s basic structure. The elements of this structure include:

The objectives of the team and other stakeholders are emergent properties that arise from the low-level interactions among people on the team, following the structure.

Some of these elements deal with separating people from each other, while others deal with uniting them [Durkheim33, Chapter 3, pp. 115-122]. Authority and division of labor are about how each person has their own role, and they are expected to refrain from exceeding those bounds. Communication, groups, and trust, on the other hand, are about how people are joined together to achieve more than they could individually. A team needs both to function well: the ability to work in a group depends in part on each person knowing their role.

19.2.1 Communication

The communication elements of the model describe how people on the team share information about the project.

The work of building the system will be divided up amongst the team members. When one person, for example, designs one component, they will need to communicate with the people designing related components (using the model of relations in Chapter 12). Similarly, a team member who is handling planning and tasking (Section 20.2) will communicate with many other team members to track progress and status.

There are four general times when people will need to communicate:

  1. When they are looking for information that another person may have. For example, when someone finds they need to know how some component is going to behave.
  2. When they have information that will affect someone else’s work. For example, when one person decides on a component design, and that component interacts with another component.
  3. When they need a decision or action. For example, when someone has completed a proposed design and procedures indicate that the design should be reviewed and approved before moving to implementation, or when someone has a team problem that needs to be resolved at a higher level.
  4. When a decision or action has results. For example, when reviews are done, or when action is being taken on a team problem.

Communication can push information from where it is generated or known to people who need that information. Communication can also pull information from someone who has it, by asking them a question.

Communication can happen interactively or asynchronously. Interactive communication happens when two people are communicating directly with each other. Asynchronous communication happens when one person makes information available and another finds that information later. Documentation is a way for one person to communicate with another over long periods of time.

Communication happens when a decision or action is needed, or when a decision or action has produced a result that others need to know about.

Communication patterns can thus be characterized by:

These communication patterns are encoded in team culture, in procedures that people use to do tasks, and in how people are organized into groups.

19.2.2 Groups

Many people like to work together: interacting regularly, sharing work, building social bonds. Working in a group is helpful when people are working on closely-related tasks or have closely-related roles. How much group interaction someone wants depends on the person; some people are gregarious and gravitate toward groups while others reserve their interactions for fewer, more trusted people.

People can come together as a group when doing tasks together, or closely-related tasks requiring lots of interaction. They can do so spontaneously based on the work, or because a group is organized deliberately. People can also come together based on shared interests, experience, or work discipline.

A group is more than just people who communicate a lot. A group generally gives its members some sense of identity and shared purpose.

One person can be part of multiple groups. It is common, for example, for one person to be part of one group that has been deliberately organized to work on a collection of components, while being part of a second deliberately-organized group based on work discipline, as well as being part of ad hoc, informal groups based on social interactions.

Groups can promote trust. When the people in a group behave respectfully toward each other and demonstrate behavior in line with team norms, the high level of interaction within a group provides a way for the group members to establish trust. When trust develops within a group, it can also promote feelings of trust for people outside the group: if person A recommends person B to person C, and C trusts A, then C is more likely to assume that B is trustworthy.

Groups can also promote distrust. If two people within one group don’t get along, they can create a rift among more people. A group also runs the risk of in-group identity turning into out-group dislike, expressing itself as groups working in silos because they lack trust for people in the out-group.

Sometimes people need to form a group with people they don’t get on with. This happens when there is a need for them to work together that overrides their relations with each other.

Groups can be characterized by:

19.2.3 Trust

Trust is a condition describing part of the relations between people in the team.

Trust arises from social norms and respect. By norms, I mean standards of behavior both for interaction between people and for technical work, to which everyone on the team is expected to conform. By respect, I mean each one believing that the others have worth or value, and acting accordingly.[2] Trust is the confidence that others will follow the team’s norms, and act and communicate with respect.

Trust starts with one person learning from experience that they can trust another person. Trust arises from demonstrated behavior. People may enter into a working relationship with a predisposition to trust the other person, but that is different from demonstrated reasons for trust. A team culture that incentivizes people to behave in trustworthy ways can create that predisposition, as when someone learns that a person they trust also trusts a third person. A team, however, cannot meaningfully incentivize trusting someone; the team can only incentivize someone behaving in a way that can earn someone else’s trust.

Because trust comes out of experience working together, not everyone in a large team will know everyone else well enough to have a trusting relationship. In those cases, trust operates at a level of groups rather than individuals: person A believes that the people in group C are trustworthy based on reputation and team cultural norms. This is a weaker form of trust but just as essential for a well-functioning team.

Ideally, trust is reciprocal but it does not have to be.

When person A trusts person B, the two of them can work together more effectively compared to when they do not trust each other. A can share work with B and expect that B will follow the team’s norms about doing accurate work and communicating well. B can expect that A will delegate a task and then respect B enough to avoid micromanaging them. As long as the trust remains, both A and B have less anxiety about the work being done, both are more productive, and both get greater satisfaction than they would otherwise.

Lack of trust leads to the opposite results. If A assumes that B will not behave in ways that accord with team norms, then A will believe that they need to check on B’s work more often. A and B will share less information with each other and will be less willing to share work. Poorer communication will lead to errors in the work, and result in more work and greater anxiety for both parties.

A breakdown of trust can happen between groups as well as between individuals. When a team has a breakdown of trust, they do not communicate. Factions within the team stop coordinating their work, hiding information from each other. I was part of one large multi-company software project with teams at several sites; the teams would try to undermine each other in order to get their version of some software component accepted into the system. After a few years the project ended and the product languished. As another example, specific failures on the Boeing CST-100 Starliner crew capsule have been blamed in part on team mistrust:

Neither team trusted one another, however. When the ground software team would visit their colleagues in Texas, and vice versa, the interactions were limited. The two teams ended up operating mostly in silos, not really sharing their work with one another. The Florida software team came to believe that the Texas team working on flight software had fallen behind but didn’t want to acknowledge it. (A Boeing spokesperson denied there was any such friction.) —Eric Berger in Ars Technica [Berger24].

Trust can be characterized by:

19.2.4 Authority and responsibility

While the previous model elements—communication, groups, and trust—are about people uniting to work together, the next two elements are about how people are different from each other.

In effective teams, each person does the right work. They know what is expected of them, and what is beyond the scope of their authority.

Authority and responsibility deal with how the project’s work is split among the team members; that is, the role that each person has.

I treat authority and responsibility as two parts of the same thing. Authority is the right to make decisions or do work on some topic. Responsibility is the obligation to do that work, and to do it well. The two go together: responsibility without authority is perverse, while authority without responsibility means bad decisions.

A role is associated with some scope of work. The scope defines what subjects the person is responsible for. The scope can be defined many ways as long as its meaning is clear enough that everyone will interpret it the same way. Scope for technical work might be based on system component (“person A is responsible for the design of component X”). It might be based on discipline: “person B is responsible for all security analyses”. It can also be based on a procedure (“person C is responsible for making orders from vendor Y”), or on operational work (“person D maintains the plan for meeting the Z milestone”).

The scope defines the right to make decisions or take actions. If one person has the role of designing component X, they are responsible for ensuring that component X is well designed and they have the authority to work out what that design is.

Conversely, if some topic is outside someone’s identified role, they must refrain from making decisions or taking responsibility for that topic.

A role is different from a task. A role is a long-term, ongoing responsibility to do work associated with a scope. That work may include tasks that fall within that scope, but a task has limited duration and a concrete intended result. The person who has a role is often responsible for doing work that is not part of a specific task. For example, the person responsible for design of some component will handle the task to create the design or tasks to correct errors in the design, but they are also responsible for answering questions about that component from other people on the team.

The goal is that every element of the work has someone who is responsible for it, that every person has something they are responsible for, and that it is clear to everyone who is responsible for what.
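To make this concrete, the sketch below checks a role assignment table for the three conditions just described: work with no owner, people with nothing assigned, and scopes held by more than one person (which must be deliberate rather than accidental). The people and scopes reuse the illustrative examples above; a real project would draw them from its roster and breakdown structure.

# Illustrative work elements and assignments, echoing the examples in the text.
WORK_ELEMENTS = {"design component X", "security analyses", "vendor Y orders", "Z milestone plan"}

ASSIGNMENTS = {
    "person A": {"design component X"},
    "person B": {"security analyses"},
    "person C": {"vendor Y orders", "Z milestone plan"},
    "person D": set(),   # on the roster but with nothing assigned yet
}

def coverage_gaps(work_elements, assignments):
    """Return unowned work, people with no scope, and scopes owned by more than one person."""
    covered = set().union(*assignments.values())
    unowned = work_elements - covered
    idle = {person for person, scope in assignments.items() if not scope}
    shared = {element: [p for p, scope in assignments.items() if element in scope]
              for element in work_elements}
    shared = {element: owners for element, owners in shared.items() if len(owners) > 1}
    return unowned, idle, shared

unowned, idle, shared = coverage_gaps(WORK_ELEMENTS, ASSIGNMENTS)
print("work with no owner:", unowned)
print("people with no assigned scope:", idle)
print("scopes shared by more than one person:", shared)

Whether this lives in a script, a spreadsheet, or simply in a reviewer’s head matters less than the discipline of checking for the gaps and overlaps regularly.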

Communication. When someone has authority to make decisions, at some point they need to communicate those decisions (or their effects) to others who will use that information in making their own decisions and taking their own actions.

Sharing roles. More than one person may take on a role. For example, a role that involves providing support to a customer may have more work than one person can do. When people share a role, they have a responsibility to coordinate their work so they give consistent answers or make consistent decisions. That they share the role should be clear to all the people involved and to people who may need to work with them.

Inadvertent overlap in roles can lead to errors. If two people both believe they have the authority to do a certain bit of work, but they are not aware they are sharing the role, they can make conflicting decisions, tell others conflicting information, or produce conflicting artifacts. Each of these situations can lead to errors in the system being built or in the way the team operates.

Delegation. Authority and responsibility can be delegated. Delegation means that one person confers some part of their role to someone else, possibly for a defined period or with restrictions on the kind of authority granted. The delegation might transfer the role from one person to another, so that the first person no longer fills the role (perhaps temporarily). Alternately, the role may be shared with the other person, in which case both people are responsible for the work and for coordinating with each other. A delegated role might also be rescinded.

One way to use delegation is for one person to have the overall role for some system component, and for that person to delegate responsibility for specification to someone skilled in specification, delegate design to a designer, and so on. The person with overall responsibility for the component typically reserves the authority to review and approve work that has been delegated to others.

Sidebar: Delegation and micromanagement

Projects involving many people require sharing work. If someone doesn’t share work, then they will be overwhelmed, will take too long to get work done, and will be a single point of failure in the project.

Delegating or sharing work implies a dynamic between the two people involved. Person A, who delegates the work, defines the work that Person B, the delegatee, is to do. Person B does the work and periodically gives progress updates. Once the work is delegated, Person B can proceed independently and Person A can turn their attention to other things.

One way this can go wrong is if Person A doesn’t let Person B get on with the work independently, and instead tries to micromanage the work. Learning the habit of managing loosely takes time and effort, and it requires trust between the two people involved. That trust in turn depends on Person A having confidence that Person B will follow shared norms in doing the work.

Another way this can go wrong is if Person B isn’t able to complete the work independently. If Person B finds a problem with the work, such as a design error, that is beyond their scope, they can raise the issue to Person A and jointly resolve the problem. If Person B is unable to do the work, perhaps because they don’t understand the problem or find they lack a necessary skill, they can raise the issue and jointly handle the problem. If Person B tries to muddle through, however, they stand a good chance of not doing the work needed, leading to Person A needing to check their work in detail and possibly redo the work.

In other words, sharing work requires having clear expectations of how to define delegated work and when to raise exceptions.

Resilience. A well-functioning team is able to handle problems when they arise. A team’s resilience depends in part on how authority is structured within the team.

There are several kinds of problems a team will encounter:

There are common patterns in how many of these problems can be planned for. Providing redundancy in how authority is organized is at the core: planning in advance for someone to take over important roles when needed, building in checks of work, and assigning roles that create alternative communication paths to resolve problems. All of these in turn depend on communication so that someone can take over a role or check work.

Formally, these kinds of structures add nuance to the definitions of scope for the roles that need to be resilient. For example, three kinds of roles are defined to catch and resolve technical mistakes: the role to do the work, the role to check it, and the role to ensure that the check is done. These imply a limitation on the authority of the first role to make any arbitrary decision about the work, because the work must be checked by someone else. It adds a responsibility to ensure that the work is reviewable (for example, adequately documented) and that the relevant artifacts are communicated to reviewers. Similarly, having someone who can take over a role implies that someone who is a backup for the work is responsible for keeping current and stepping in when needed—and also refraining from acting on the role when the regular person is doing their job.

19.2.5 Division of labor

Division of labor is the principle that people do different kinds of work, meaning they have different authority and responsibility. This is desirable because different people have different skills and experience, and because work should not be duplicated unnecessarily.

Division of labor in systems-building is different from the classical usage of the term. The original usage was about a serial production system or assembly line, where one person does one step, hands the result to someone else who does a second step, and so on until the product is complete. (Smith, for example, uses the example of making pins [Smith22, Book I, Part 1].) The argument is that a worker’s specialization improves their productivity, and that avoiding the cost of switching from one task to another eliminates wasted time.

Division of labor is directly related to roles as discussed in the previous section. The roles define the units of labor to be divided among the team members.

Systems work divides labor in more ways than just serial production. Work can be divided by component, with a hierarchical structure from system to lowest-level component. It can be divided into supporting roles, such as planning and team management, versus system-building roles. Not all roles need full-time attention, leading to one person taking on multiple roles. Some roles are associated with specific procedures, such as coordinating purchasing.

Someone in the team has the role of deciding how roles are assigned. This might be one person for a small team, or the role might be divided up and distributed to multiple people. These people should follow well-understood norms and procedures for making the decisions about who is assigned what role, including communicating those decisions to everyone affected. The way roles are assigned should take advantage of the ways people differ: in their likes, their skills, their experience, and their desired growth.

The way work or roles are divided affects how people grow their skills and experience. If people are assigned work based only on their current skills, they will not grow. Giving people tasks that stretch them can lead to improved skills, but can also lead to them doing the work badly and learning bad habits. Learning works best when someone being stretched can get mentorship from someone with relevant skills or experience.

19.3 Using the model of teams

The high-level objectives such as efficient and accurate system-building or team cohesion are properties that emerge from the details of how people on the team interact. The structure and norms of the team can be designed and managed to promote these objectives, and the model above provides a way to think about the structure.

Note that these properties emerge from how people actually behave, not from how the team is designed or how it is supposed to work. That is, the outcomes depend on the mental models that each team member has of how they work in the team, and the habits that come from their mental models.

Achieving desirable outcomes therefore means getting two things right: designing and maintaining a good intended structure for the team, and the team taking that structure on board and behaving accordingly.

19.3.1 Team culture and structure as a control system

Leveson et al. [Leveson11] discuss how to design systems so that they produce desired emergent behaviors while avoiding undesired behaviors. Their approach treats the problem as a control system, where a control process monitors and shapes the behavior of a lower-level process in ways that lead to the desired high-level results.


The control system in this approach consists of a controller, which monitors the state of the team (the controlled process) and makes decisions about actions the team should take. One or more people in the team take on the role of being the controller. The controller has a process model, which includes the controller’s beliefs about what the team should achieve, how the team is structured, and how all the people in the team are doing. The controller gets feedback from the team in the form of observed behavior and of things team members tell them. Once the controller determines that it is time to act on some issue, the controller can take steps (control actions) to change the team’s behavior.
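To make the structure concrete, the following sketch expresses the controller, process model, feedback, and control actions in a few lines of Python. It is only an illustration of the loop described above; the status items and actions are invented placeholders, and a real team’s controller is of course a person, not a program.

from dataclasses import dataclass, field

@dataclass
class ProcessModel:
    """The controller's beliefs about the team: objectives, structure, and current status."""
    status: dict = field(default_factory=dict)   # e.g. {"design of X": "behind"}

@dataclass
class Controller:
    """Monitors the controlled process (the team) and decides on control actions."""
    model: ProcessModel = field(default_factory=ProcessModel)

    def receive_feedback(self, report):
        # Feedback (status reports, observed behavior) updates the process model.
        self.model.status.update(report)

    def decide(self):
        # Control decision: choose an action for any part of the team that is off track.
        return [f"follow up on {task}" for task, state in self.model.status.items()
                if state == "behind"]

controller = Controller()
controller.receive_feedback({"design of X": "on track", "test rig build": "behind"})
for action in controller.decide():
    print("control action:", action)   # prints: control action: follow up on test rig build

The point of the sketch is the shape of the loop: feedback updates the process model, and control actions follow from what that model says about the team.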

The social norms and habits of respect come from the example set by the team’s leaders: those team members who have greater scopes of authority, or who are recognized as experienced in their discipline. In practice, one team member who is working on the details of some component has little influence to create the team’s norms, but can cause disorder and disrespect that spreads. The establishment and following of positive norms is a collective action problem that requires some degree of compulsion [Olson65]. Preferably, the compulsion is in the form of rewards for following good examples of adhering to norms, but sanction is needed to back up the rewards.

The model in this chapter can serve to organize the design of this control system.

Who is responsible? The responsibility for making the team work is spread over everyone in the team.

Looking at the team as a control system, there are two reciprocal classes of roles: the roles that fill controller functions and those for everyone on the team (the controlled process).

Everyone on the team has the role of being a team member. This role has the responsibility of following team norms and procedures. In terms of a control system, each person is responsible for accepting and following instructions from the controller, and for providing feedback about work and about how the team is functioning. In particular, when anyone on the team detects that there is a problem in how the team is functioning, they are responsible for communicating about the issue with someone whose role includes resolving the issue.

The controller part of the control system can be broken down into three classes of roles. These are:

  1. The observer role: a person who receives feedback in the control system, meaning they observe how team members are doing their work and are responsible for deciding when there may be an issue to resolve.
  2. The decider role: a person who is responsible for deciding how to respond to an issue; that is, for deciding on a control action that should address whatever situation has occurred.
  3. The exceptional role: a person who is responsible for detecting problems with the normal control system roles or for receiving reports about them. This role comes into play when the normal observer and decider roles are not handling a problem. When someone reports a problem with the control system, it is sometimes called a skip-level or whistleblower report.

These roles can be used to support many different team structures. For example, a traditional hierarchical department/team structure can be represented by each department’s or team’s manager filling the observer and decider role for their department or team. The manager over a manager can fill the exceptional role to address problems with a manager’s work. Separately, many organizations create an explicit whistleblower function to address potential corruption or illegal behavior; the people in this function then fill part of the exceptional role.

Process model. This model is how the controller understands both the objectives for the team and the state of the team.

The process model includes the team’s objectives and how well the team is meeting them; its structure, and how people are working with that structure (or not); the roles each person on the team has and how they are progressing on the work associated with those roles; and generally how well each person is doing.

The observer and decider roles use this information to determine when part of the team is not working as it should, and to decide what steps to take to make things work better.

Unlike the control system for a machine, the process model for a team accounts for the well-being of the people on the team.

The process model also needs to consider what people on the team actually understand of the team’s culture and procedures, and their roles. In managing a team of skilled and well-meaning people, I have found that miscommunication or misunderstanding is the most likely source of problems.

Control decisions. Those people who have the decider and exceptional roles are responsible for deciding when there is an issue to be addressed, and what actions to take. Sometimes when there is some indication of an issue, the choice will be to wait and gather more information.

Some issues may appear in one part of the team but, on investigation, will be found to have causes in other parts of the team. If the decider or exceptional role is shared among multiple people, decisions will require deciders working together.

Problems in team execution can arise because the team has outgrown their current structure, not because any one person is behaving wrongly. I discuss this further below.

Control actions. The control actions are how people influence the team to keep its work on track.

The example set by a team’s leaders is perhaps the most important influence. If someone the team looks up to does some activity one way, others are likely to follow: if someone is seen to be careful in following a procedure to get design reviews, for example, others will be motivated to do so as well.

This raises the question of who is considered a leader. Leadership is a social construct; it is not necessarily an explicit role that someone in the team is given. People who are given roles with extensive scope of authority and responsibility are often seen as leaders. Others who are understood to have experience, who mentor other team members, and who establish social connections are also treated as leaders. Having this level of social influence in the team comes with a responsibility to model desired behavior, and should be considered when taking action to fix a team performance issue.

Instructions from one person to another based on scope of authority are a second kind of control action. If someone has a role of managing a subteam, the manager (who has a decider role) can instruct someone on the team to change their behavior. The instruction need not be hierarchical; for example, when two people are peers working on designing related components and are expected to come to agreement on how those components will interact, one of them can inform the other that they will not agree to some part of an interface design.

As I noted in the sidebar on delegation and micromanagement above, there are choices to be made about how instruction should be given. It can be directive, telling the recipient exactly what they should (or should not) do. This is appropriate when related to following a procedure that requires precision, such as operating test equipment that has the potential to cause injury. In other situations this can turn into micromanagement and inhibit the recipient’s ability to improve and work independently. On the other hand, the instruction can take the form of letting someone know that there is a problem and letting them work out how to address the problem. This approach helps the recipient learn and grow, especially if they can discuss potential solutions to get feedback. However, if the recipient is not able to figure out how to make a change, this approach will leave a problem unresolved.

Next, a decider can address an issue through education. Mentoring someone about part of their work can improve their work in the long term as well as addressing an immediate issue.

Finally, sometimes the appropriate control action to address an issue is to change the team’s structure. This can happen as the team grows and authority and communication structure reaches a scalability limit. It can also happen as a project transitions from one phase of work to another—for example, when moving from the initial design and implementation into preparing for placing the system in operation.

Team member behavior. The people on the team make up the controlled process in this approach. A controlled process receives communication from the controller that is intended to change the process’s behavior. The process then generates feedback to the controller as it goes about its behaviors.

The team is made up of people, not machines, so they hear and react in a human way to communication from people in the controller role. When the team’s structures are designed, the ways that people communicate should be worked out so that those who are giving instructions know how to communicate with people, and so that people on the team know how to tell when instructions are being delivered.

People react when they receive instructions or assignments, of whatever kind. In a well-functioning team, the team members will act to confirm the instruction they think they are receiving, then adjust their work behavior accordingly—changing the roles they are working on, adjusting some technical work, and so on.

However, real humans don’t always neatly follow this behavior. Sometimes they misunderstand the instructions. Sometimes they ignore the instructions. Sometimes they develop resentment at the instructions. The communication between people with a controller role and the people on the team must include checks to catch misunderstandings, and continuous communication so that leaders understand how people in the team are feeling about their work.

The controller needs feedback from people on the team in order to continue to make accurate control decisions. In a team, this means that people on the team are providing information. The people have a responsibility to keep those overseeing their work informed of their progress. They are also responsible for communicating when they are dissatisfied with their work situation, or when they observe issues with the project.

Feedback. The people forming the control system have several ways to get feedback and observe the team. Some of these mechanisms can be designed into the system formally; others are informal behavior by the people with control roles.

Getting explicit feedback and reporting from team members is the first formal mechanism. As people on the team make progress on different tasks, they inform the controllers of the work completed, the problems found, and the steps yet to do. These reports can take many forms: updates to a task tracking system, regular status communications, and informal discussions. Explicit reporting has the advantage that it can occur regularly and in a form that encourages documentation of status. It has the disadvantage that it can become impersonal, and the reports can become inaccurate (especially optimistic) over time because team members want to look good to their teammates.

Someone in a controller role can complement these explicit reports with regular informal communication. Some organizations have advocated “management by walking around”, in which a manager informally talks with those who they oversee, without a regular schedule. This interaction ideally happens in person, so that the manager and the team member can treat each other as people and build up social bonds. In-person communication also has the advantage of using the full range of communication methods, such as body language. These informal communications have the disadvantage of not producing a documented record of what was learned, and if done clumsily they can lead to a team member feeling like they are being constantly monitored.

A well-designed team will account for problems in communication between a team member and those who oversee their work directly. A team can build in periodic “skip level” communication, where a team member can discuss their work and their state of mind with someone other than their direct managers, in order to detect and resolve problems with a manager. A well-functioning team will also provide feedback channels for team members to report larger or more systemic problems. In many industries, organizations are required to provide a way for anyone on the team to report corrupt or illegal behavior, for example.

Whatever feedback mechanisms a team uses, the channels should be designed to address bias and sampling problems. For example, if someone only reports on progress when they complete a task, there is no way to detect when they are having problems completing some task; in other words, reporting at completion biases information toward good news. Having only one path for information to come from a team member through a manager to higher-level controllers can introduce bias as the manager digests the information they receive and passes along a summary. This is one reason to combine multiple ways to get feedback.

Working with the control system. The team’s structure has to be designed and redesigned. It is expected to result in the system being built efficiently and accurately, and in the team maintaining its satisfaction and productivity throughout.

Achieving this end only happens when the team is organized deliberately. While historically organizational choices have been made initially based on experience and then by incremental changes, it is possible to do better by explicitly designing and analyzing the team’s structure as a system.

The control system approach allows one to use techniques for designing and analyzing systems that have important emergent properties. In particular, the STAMP model of accident causes [Leveson11, §4.5] and the STPA hazard analysis technique [Leveson11, Chapter 8] provide a sound basis for analyzing how the team is organized. They provide a disciplined methodology for determining what hazards the team could face, such as duplicating work, failure to communicate, disagreements between people, or errors in the work. They also provide a structure for reasoning about how these hazards can occur and how to design the control system to eliminate the underlying causes or to handle them when they occur.

As one example, the STPA hazard analysis methodology calls for identifying and addressing cases where multiple controllers can generate control actions for one controlled process. In a team, this happens when two people have some kind of controller role for one team member. These controllers can give conflicting instructions, or they can give instructions that have unexpected side effects. The hazard analysis methodology includes identifying cases when this can occur and then defining how the multiple controllers will coordinate their decisions to avoid conflicts [Leveson11, §4.5.3].

Team structure and planning. The structures and procedures that a team follows are related to but separate from how the team plans its work. The interactions within the team’s control system are continuous and immediate. They serve to maintain the social bonds that keep a team together. By building social cohesion and team culture, a team’s structure makes the team able to plan and carry out its work.

19.3.2 When to use the team model

The purpose of having a model of teams is to provide a language for describing how a real team is organized, and to provide tools for working out how a team’s organization might need to change.

There are four times to use the team model:

  1. In ongoing team operation;
  2. When forming a new team;
  3. When a team’s structure needs maintenance; and
  4. As a team outgrows its structure.

Ongoing team operation. A team’s structure and culture should determine how the team works and interacts. The purpose of treating the team as a control system is to ensure that it continues to work well, and to provide a basis for adjusting the team’s structure or its members to meet that end.

In ordinary operation, the assignment of roles and tasks has a great effect on how the team is functioning. A good assignment of roles will have at least one appropriate person covering each needed role, and will spread the workload as evenly as possible across the team. In control system terms, to whom a task or role should be assigned is the control decision. It is based on the decider’s understanding of each person’s ability, current workload, and interests. Communicating the assignments makes up the control actions, and then team members do the work.

Once assignments have been made, those with control roles monitor progress. One person may become busier than expected, and workloads may need to be adjusted in response. Someone may have unexpected trouble doing some task, which needs to be detected so that they can be given help.

The team’s culture and social cohesion also need to be monitored and managed. In control system terms, the controller sets the expected norms and procedures and communicates those to the team. The control actions that communicate this information take many forms: documented procedures, documents of team charters and cultural norms, and the examples set by leaders. The team, as the controlled process, will observe all this input and respond in their behavior. The controller is then responsible for watching how the team members work together, learning how they feel about each other, identifying when some people aren’t meeting the expected cultural norms or getting along, and making adjustments accordingly.

Team member evaluations are a part of many organizations’ procedures. These provide an opportunity to give people feedback on how they are doing at performing tasks and how they are fitting into the team’s culture. Having clearly-defined cultural norms and work assignments enables people to give feedback that measures a team member’s work against criteria that everyone should understand in the same way. (And if the feedback process reveals that some people do not understand the criteria in the way they were intended, then this is feedback to the team that the documentation needs to be improved.)

When people detect that the team is not operating as planned, they initiate corrective action. The kind of action depends on the kind of problem. If one person is not working as expected, the actions can be focused on that person: giving them suggestions or education, changing their work assignments, or in the worst case moving them out of the team. If the problem is between multiple people, then the next step is to determine why they are not working well together in order to address the working relationship between them. Sometimes, however, investigation will reveal that the team’s structure or culture needs to be improved. I discuss that below.

Forming a new team. A new team is an opportunity to design the team’s structure and culture. While this is often left to chance, with the first team members jumping into technical work, spending effort early in the process to plan how the team will work pays off in a project that functions well as it proceeds and starts to face organizational challenges.

The way that a team starts out affects how it continues to work years later. The habits that a team forms in its early days continue to influence how people work, and changing these habits is difficult and slow (Section 8.1.5—Principle: Team habits). This means that some effort early on will pay off for a long time.

The model in this chapter provides a way to organize the thinking about how a team should work. How should authority and work be divided among people? Should the team have a hierarchical department structure, or a matrix organization? What cultural norms are expected for people’s behavior toward each other? What procedures should the team follow to do different parts of the work?

The structure of a team is not just a theoretical construct. To work, it must fit the abilities and experience of the people involved. A structure that requires perfection will not work, because nobody can work perfectly. A structure that people can’t understand, or that is too complicated, won’t be followed; or worse, people will try to follow it but do so in some odd way.

The team’s design provides an opportunity to think about how to make the team resilient. What functions should be shared? How can people working on one part of the system help each other? How do those people who are responsible for maintaining team operation keep aware of how the team is doing? What should happen when there is a serious problem within the team?

The decisions about how the team will work should be documented for everyone to read and follow. Having these documented—and brief—helps get everyone into agreement. Going through a process of building draft documents and getting team feedback helps build consensus early on. It also makes the task of adding new people to the team easier (they can read the documents) and smoother (each new person gets the same information others do).

Documenting the rationale for why the team is designed the way it is, along with analyses of how the team’s structure will meet its objectives, provides a basis for maintaining the team’s structure as it grows or as the work changes. It also helps people understand the spirit of the structure, helping them interpret the intent behind the documented structure.

A team typically grows and goes through distinct phases where it needs different kinds of structure; I discuss this below. The initial team structure will likely be simpler than what the team will need a few months later. However, the initial design for the team’s structure should include some thinking about how the team will work as it grows. Because the social organization that is a team has inertia in its habits, starting out the team’s structure in a way that can grow into what it needs to be later will help avoid reorganizations that upset team operations and affect productivity.

Maintaining team function. Sometimes the planned team structure doesn’t work the way it was expected to.

When this happens, the response should be to work out why the structure isn’t working, and then determine how to change the structure to work better.

The tools for systems accident analysis are available to analyze why a team is not functioning. The STAMP methodology provides an organized way to determine possible reasons that a team is not functioning as desired [Leveson11, Figure 4.8]: people with the decider or exceptional role are not providing the needed instructions for some reason; those people do not have an accurate understanding of the state of the team; they are not getting accurate feedback from the team; the team is not getting instructions or not acting on them as expected; or control actions conflict with each other. An analysis following this kind of methodology can reveal where the underlying problems are, and in turn suggest ways to change the team structure or culture.

For example, consider a team where people are not following the defined procedures for getting component designs reviewed and approved before moving on to implementation. An analysis might find that some people are not aware of the procedure (a problem with the control action/controlled process), suggesting that documentation and education about the procedure need to be improved. On the other hand, the problem might come from one group being under pressure to deliver quickly, so that those who are supposed to do reviews or give approval are not able to respond at the needed pace. This would suggest that streamlining reviews or adding reviewing resources would address the problem, or that the group needs schedule relief.

As a different example, consider a team organized into groups based on the component hierarchy. This team is having trouble with component integration: components that interact have passed design and implementation reviews, but when they are combined for verification they do not function as expected. This situation could arise from many sources. The organization into groups might be inhibiting communication between groups, leading to interfaces being designed that do not meet the needs of components being built by different groups. This might, in turn, come from flawed procedures that do not account for cross-group reviews; or it might come from group managers who don’t get along; or it might come from a problem with how interface design artifacts are managed. An issue like this can also reveal related problems, such as project management not detecting the problems quickly or accurately until there is a crisis. I have found that in situations like this there are often several small changes that need to be made together to address the problem.

Team growth. A team’s organization generally starts small and informal, as a very small group starting to investigate a customer’s need or a potential system project. As the project moves forward, the team grows and its needs for structure change. The team also changes as people join and leave, and as people move from role to role.

I have found that most teams go through phases as they grow—rather than showing smooth changes over time. These changes arise from the combination of complexity growth, development of group relationships, and the growth in understanding of the work ahead.

Small groups (of just a few people) have been observed to go through a development sequence [Tuckman65][Tuckman77]. These small groups begin as the group forms, and the people work out how they should relate to each other and how to get work done. As time goes by they develop into a cohesive group that gets work done and where people trust each other. (The studies do not discuss how this process can fail, leading to a group that does not cohere or disbands.)

The interpersonal complexity of a team grows with the size of the team. The number of potential connections between team members is O(n²) in the size of the team. In my anecdotal experience, the amount of time spent on coordinating work within the team grows in line with the number of connections. If there is no structure to the team, at some point the amount of time and effort spent on communication will exceed the amount spent doing work building the system.
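
As a quick illustration of that growth (with n standing for the number of people on the team; the specific team sizes below are my own example):

    \text{potential connections} = \binom{n}{2} = \frac{n(n-1)}{2}

A team of 6 has 15 potential connections; a team of 12 has 66. Doubling the team roughly quadruples the number of coordination paths that may need attention.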

When a project starts, the nature of the system to be built is not well understood. The team has to go through a process of working out the purpose of the system, developing concepts, and eventually beginning to design. Along the way, the team gets increasing understanding of the work ahead.

In practice, the combination of these causes leads the team to change its organization over time. At the beginning, the initial exploration of what the project might be (working out a purpose and finding some initial concepts) is typically done by a small group. This small group will go through a process of learning to work together, but typically it can self-organize and does not need much hierarchy. As the work progresses and a few more people join the project, they will initially try to fit into the self-organized small group. These additions will alter the interpersonal relationships, and at some point the complexity of reaching consensus will necessitate creating some initial structure. The team will settle into this structure. As the team continues to grow, it will accommodate new people into that structure for a while, but eventually it will reach another point where more structure is needed to manage complexity.

The message is that a project should expect its team organization to change over time. Almost every project I have been part of has been resistant to addressing a need for changing team structure, and has put off dealing with it until a crisis occurs. In every case this cost the organization time and money, needlessly setting back the project. A project’s leadership should be alert to the need to periodically reorganize the team so that this can be done before it causes problems.

19.3.3 Example: conflicting instructions leading to inconsistent design

I worked on two projects that had problems building their systems because someone on the team got conflicting instructions on the objectives for some component they were supposed to be building.

In one case, a software developer was tasked with implementing a particular CPU scheduling algorithm in a real-time operating system kernel. This scheduling algorithm had been chosen in order to make certain system safety properties work, and to enable some high-level control features. The developer in question did not understand the assignment, and reached out to someone else—someone not authorized to make decisions about the CPU scheduling algorithm. The developer got advice from the other source and implemented a different scheduling algorithm. The other algorithm could not provide basic safety and control features the system needed. As this project was being executed on a cost-plus contract, the developer’s organization had to pay for someone to remove the work the developer had done and implement the correct algorithm.

In another case, one senior system architect (systems engineer) was responsible for a particular feature set of the system. The system architect was working with a pair of developers to work out a design for those features. A second senior system architect, who was not responsible for that part of the system, was having a conversation with the developers and instructed them to design the features in a particular way. This conflict in instructions to the developers led to confusion that took several days to detect and resolve.

Both these problems reflect two common team design flaws. First, both are instances of conflicting control ([Leveson11, §4.5.3]), in which a controlled process (the developer) receives conflicting control actions. Second, in both cases design authority (Section 19.2.4) had been assigned, but developers got instructions from someone else. In the first case the developer sought out advice from an inappropriate source; in the second, a senior person gave instructions outside of their authority.

The techniques for addressing a potential system hazard apply to the conflicting authority: first try to eliminate the conditions that can lead to a hazard, then make it unlikely to happen, reduce the likelihood of it causing a problem, and then try to limit the damage when it does happen.

The first line of defense is thus to organize the project so that conflicting decisions and authority do not occur, or are at least unlikely. This is most easily done by having exactly one person authorized to make decisions for each part of the system, and by making that information clearly available to everyone on the team. Note that this does not mean that only one person is allowed to design; rather, it means that one person has responsibility for the design. The responsible person can and should delegate the design effort as much as possible to the people actually doing the work, and should focus on setting objectives for the design, guiding the design, and checking that the results are acceptable.

Theoretically, a team can avoid conflicting decisions or directions by having a few people operate in a way where they reach consensus before making decisions. In practice, consensus algorithms work well enough for computer systems, but people find it hard to work that way: communication happens informally, people are in a hurry, or someone gets enthusiastic about a good idea and doesn’t wait to share it with others for agreement first.

The second line of defense is to have regular review points in the project when discrepancies can be caught.

19.4 Directory

Two of the first things people on the team need to know are their own roles and who else is on the team. Once they have that information, they can communicate with others to learn other things they need to know.

Consider the following scenarios.

  1. Person A is working on some component. That component has an interface with another component, and so person A needs to coordinate how they implement their part of that interface with someone working on the other component.
  2. Person B has finished a design for an update to a component. Project procedures say that they need to have the design reviewed and approved before moving on to implementing the design. Person B needs to find out who the reviewers and approver will be.
  3. Person C discovers an ambiguity in the specification for a component, and they are concerned that this ambiguity may lead to a flaw in the designs that follow from the specification. Person C needs to find the people responsible for the specification so they can discuss the potential problem and find a resolution to the ambiguity.

For all these scenarios, the people need to determine who on the team is responsible for some part of the system beyond what they are working on themselves.

To meet this need, the project should maintain some kind of directory of people on the team. This should record:

  1. Who is on the team and how to contact them;
  2. The roles each person holds; and
  3. The parts of the system and the artifacts for which each person is responsible.

This information is generally fairly simple, but it must be kept current. If people come to believe that the directory is likely out of date they will not trust it.
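
As one way to picture how little information is involved, the sketch below records a directory entry as a small data structure. It is a minimal illustration in Python; the field names and example values are my own assumptions, not a prescribed format.

    # Sketch of a team directory entry; field names and values are illustrative.
    from dataclasses import dataclass, field

    @dataclass
    class DirectoryEntry:
        name: str
        contact: str                   # e.g., email or chat handle
        roles: list[str] = field(default_factory=list)
        responsibilities: list[str] = field(default_factory=list)  # components, specs, artifacts owned

    entry = DirectoryEntry(
        name="Person A",
        contact="person.a@example.org",
        roles=["developer"],
        responsibilities=["telemetry component", "telemetry interface specification"],
    )

Whatever form the directory takes, each entry should answer the questions in the scenarios above: who a person is, what roles they hold, what parts of the system they are responsible for, and how to reach them.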

undisplayed image
Sidebar: Summary

Chapter 20: Operations

11 February 2024

Operations covers how people on the team organize the work of building the system.

I introduced the basic ideas of operations in Section 7.3.5. In that section, I model operations as five parts: life cycle, procedures, plan, tasking, and support. In this chapter I go into more detail about each of these parts. The material in this chapter is based in part on the needs analysis reported in Appendix A.

This chapter details the model for operations in general, without recommending specific solutions.

Note that this chapter is focused on the operations directly involved in building the system. An organization has other things that fall under business operations; I defer to others to address that broader topic.

20.1 Purpose

Operations is about organizing work, in the form of tasks. It is complementary to team and artifacts, which I discussed in previous chapters. Operations ensures that people know what tasks they should be doing, similar to knowing what they should be producing (artifacts) and who they do it with (team).

I leave “task” largely undefined, relying on its colloquial meaning. It should be taken to mean some unit of work to be completed.

Operations organizes the work so that:

  1. The right tasks are done at the right time by the right person. Each task is performed by the person with the right skills to do it, and who has an appropriate role in the team. This is accomplished with tasking based on the plan.
  2. Everyone does their work in compatible ways. The team members share a single common model for how to do their work. The rules are clearly documented and understandable by the whole team. People understand what steps will be coming up after they perform a specific task. These conditions lead to people having confidence in what others are doing, and they allow detection and correction of problems. This is accomplished with the life cycle and procedures.
  3. The work is done efficiently. The team avoids work that is not actually relevant to the system being built. They minimize work that is a dead end, re-work due to quality problems, waiting time because of dependencies, and overhead for operations. Operations allows detection and correction of problems, including accountability and feedback on schedule. This is accomplished with procedures and plan, especially in how they account for uncertainty and risk.
  4. The work is of high quality. The team thinks through needs before moving forward, allowing for controlled exploration or prototyping. Work is checked independently to catch flaws, and steps do not fall through the cracks. Work is coordinated so system parts fit together, and flaws are detected and fixed. This is accomplished with procedures and tasking.
  5. The work meets deadlines and budgets. The project can project forward the time and resources required to reach milestones, allowing for uncertainty. The work is actually possible for the team to complete. Project management can detect and resolve potential problems by adapting the plan, changing system objectives, or getting more resources. Progress is visible to project management, funders, and the organization. This is accomplished with plan and tasking, especially flexibility in planning.
  6. The work adapts as needs change. The project gracefully handles requests for changes in purpose or regulation, and it handles learning more about the system as work progresses. Operations supports people making decisions that change the plan. This is accomplished by building checkpoints into the life cycle, and by using procedures to deal with change.
  7. The project supports its customer and funder. The project’s execution fits with acquisition and funding processes. This is encoded in life cycle and possibly procedures.

Each project will work out its own approach to operations. The list above provides objectives against which an approach can be measured.

20.2 Operation model

The model of operations in Section 7.3.5 has five parts:

undisplayed image

The life cycle is the overall pattern of how the project works, with phases and milestones.

Procedures are the checklists or recipes for performing key tasks.

The plan is an evolving understanding of the path forward for the project.

Tasking is the assignment of tasks to people, and figuring out what tasks each person should do next.

Support maintains tools and information needed to do the other parts.

I have ordered the parts of this model by the rate at which they change and at which decisions about them are made. The life cycle is established early in the project and changes slowly after that. Procedures change a bit more frequently, but not often. The plan is updated on a regular cadence, while tasking is continuous. Support activities go on throughout the project.

20.3 Life cycle

A project’s life cycle[1] is the set of patterns that define the big picture of how the work unfolds. It encodes ideas like the system going through phases: development, deployment, update, retirement. It can define phases within those. Within development, for example, there can be concept development, specification, initial design, detailed design and construction, integration, and acceptance.

The idea of a life cycle can apply to the whole system project, or to specific parts of the work. Each component in the system, for example, can have a life cycle for its development or for an update to the component. The life cycle can apply recursively to subcomponents.

undisplayed image

In general, life cycle patterns define when one step of work should be done before another, when steps can proceed in parallel, and the conditions that define when a step is ready to start or when it is complete.

In this way, a life cycle can be viewed as an abstraction of the steps a project or a part of a project will go through. It is expressed as a set of patterns that guide how people do the work: the order in which steps are planned and performed. The actual sequence of events in the project may not match the pattern exactly, but the patterns give people a way to talk about what should happen and compare actual work to the ideal. The life cycle is not a schedule. It is a set of patterns that guide the team as they work out a plan and schedule tasks that achieve that plan.

I use the term life cycle patterns in this text to emphasize that these are ideal sequences of events, simplified to help people organize the work they do. Some of the patterns can be used repeatedly during a project, such as using the life cycle pattern for building each system component.

The life cycle is not specific to a particular system project. A life cycle pattern can be more or less well suited to a project depending on attributes of the system being built—most especially how often there are irrevocable or expensive-to-reverse decisions. An organization can build up a library of patterns, improving them with experience and sharing the learning among many projects.

A project’s life cycle patterns help team members understand how the work they are doing fits with other work. They provide guidance on what to expect from the work others are doing that will lead into work they will do. They help people work out who is doing work related to their own, and who to talk to about that work. They also help people understand what steps will be coming up after they perform one step.

The life cycle approach is not in itself a methodology; nor does it imply a particular diagraming model or formal semantics. It is a technique for working out how the project will order its work that has evolved from a combination of examining several different life cycle standards, observing how teams use Gantt charts for scheduling, and the common practice of sketching things out on a whiteboard. I use an informal diagramming notation that is inspired by the diagrams used in NASA documents.

I introduce the general idea of life cycle here, without advocating for any particular patterns. I discuss specific examples and guidelines for building a life cycle pattern in Chapter 21 and Chapter 30.

Model of life cycle patterns. A pattern generally consists of:

  1. A set of phases, each covering one kind of work;
  2. Conditions or milestones that mark when each phase is ready to start and when it is complete; and
  3. Dependencies that order the phases relative to each other.

In other words, the life cycle can be viewed as a directed graph of phases, with annotations on each phase. (Because the dependencies are time-like, the graph is also acyclic.)

undisplayed image

For example, a simple life cycle pattern might say that a project must start with a phase where it works out and documents the customer’s purpose for the system before proceeding on to other work. That purpose-determining phase would conclude with a milestone where the customer reviews and agrees on the team’s purpose documentation. The next phase would involve developing a general concept for the system. This phase would include review milestones, checking that the concept will meet the customer’s purpose and that it can likely meet the organization’s business objectives. After those reviews, there might be a milestone where the organization makes a go-no go decision about whether to proceed with the project.
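
To make the “directed graph of phases” view concrete, the example above can be written down as data. The sketch below is in Python; the phase names, milestone descriptions, and the helper function are my own illustration of the idea, not a prescribed notation.

    # A life cycle pattern sketched as a directed acyclic graph of phases.
    # Phase names and completion milestones are illustrative only.
    lifecycle_pattern = {
        "determine purpose": {
            "depends_on": [],
            "completion": "customer reviews and agrees to the purpose documentation",
        },
        "develop concept": {
            "depends_on": ["determine purpose"],
            "completion": "concept reviews passed; go/no-go decision made",
        },
    }

    def ready_phases(pattern, completed):
        """Phases whose dependencies are all complete and that are not themselves complete."""
        return [
            phase for phase, info in pattern.items()
            if phase not in completed
            and all(dep in completed for dep in info["depends_on"])
        ]

    print(ready_phases(lifecycle_pattern, set()))                  # ['determine purpose']
    print(ready_phases(lifecycle_pattern, {"determine purpose"}))  # ['develop concept']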

A life cycle pattern can be coarse-grained or fine-grained. A coarse-grained pattern would have phases that apply to the whole project at once, and take weeks or months to complete. The NASA family of life cycles [NPR7120] is coarse-grained: it is organized around major mission events like approval to move from concept to design, or from fabrication to launch. Fine-grained patterns might apply to a single component at a time, such as a component being first specified, then designed, then implemented, then verified, as a sequence of four phases with review checkpoints at the transition between phases.

Some life cycle patterns have phases that are used many times in parallel in a project. Consider a fine-grained life cycle pattern that applies to each component. The general pattern might be:

undisplayed image

A project might apply this pattern to each component being built. When multiple components are being developed in parallel, multiple instances of this pattern will be proceeding at the same time, and different components may be at different points in their cycle.

undisplayed image

Non-linear progress and rewinding. The project’s life cycle patterns do not necessarily imply one-way linear progress; work does not actually happen perfectly linearly.

While someone is working on the specifications for a component, they may well be thinking of design approaches. Following the four-step pattern above, the design thinking falls into a phase later than the specification work. Until the specification is final (and the specification phase is completed), any design work must be considered conditional: it might be made irrelevant as new specifications are worked out. Making the phases explicit helps people understand what work can be relied on as baselined (Section 17.4.3).

A project or the work on one part of the system can potentially move through a phase, progress to another, and later rewind back to the earlier phase. A project might rewind to a design phase when a flaw is discovered. Some implementation work has been done and might still continue while the design is reworked, as long as there is someone to do it and parts of the implementation are unlikely to be affected by the redesign.

The dashed line in the following diagram shows how work proceeds on one component. It proceeds through specification and design into implementation, with accompanying reviews ensuring that both the specification and design are complete. During implementation, the need for a design change arises, and work reverts back to the design phase. Once the redesign is done and reviewed, work goes back to proceeding with implementation.

undisplayed image

Planning versus measurement. There are two ways that one can view life cycle patterns. The first way is as a path that guides the work: one must go here, then here, then there. The other is as a way to measure progress. Being in some phase means certain things are believed done, while other things are in progress and yet others will be done later. These two views are compatible, and it is useful to use both viewpoints.

The difference between the two comes when dealing with changes. If the work on some component is in phase X, what happens when an error is found in work from an earlier phase? Or when a request for a change in behavior arrives? And what if one chooses to build a component in multiple steps, creating a simple version first then adding capabilities over time?

This is where viewing the pattern as a measure of progress is helpful. Consider a component that is to go through specification, design, implementation, and verification phases. When the work is in implementation, the implication is that specifications and designs are complete and correct. If someone then finds a design problem, the expectation that the design is complete is no longer true. This situation leads to the tasks needed to make the condition “the design is complete and correct” true again. Put another way, this “rewinds” the status of the work on that component into the design phase. People will then do the tasks needed to advance back to the implementation phase by correcting the design and performing a review of at least the design changes.
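
One way to picture this bookkeeping is to track, for each component, the phase its work is currently considered to be in, and to allow that status to move backward as well as forward. The sketch below does this for the four-phase example in the text; the class and method names are my own, and a real project would attach this status to whatever tracking tool it already uses.

    # Sketch: tracking a component's phase, including "rewinding" when earlier
    # work is found to be flawed. The phase ordering is the example from the text.
    PHASES = ["specification", "design", "implementation", "verification"]

    class ComponentStatus:
        def __init__(self, name):
            self.name = name
            self.phase = PHASES[0]

        def advance(self):
            i = PHASES.index(self.phase)
            if i + 1 < len(PHASES):
                self.phase = PHASES[i + 1]

        def rewind_to(self, phase):
            # A flaw found in work from an earlier phase moves the status back there.
            if PHASES.index(phase) < PHASES.index(self.phase):
                self.phase = phase

    c = ComponentStatus("telemetry component")
    c.advance(); c.advance()       # specification done, design done: now in implementation
    c.rewind_to("design")          # a design flaw is found during implementation
    print(c.phase)                 # design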

Development methodology. The life cycle model is connected to the development methodology that a project chooses to follow. A project that uses an agile-style or spiral development methodology will use different patterns for some development steps than what a project following a waterfall methodology will use. I will discuss this further in Section 20.5 below.

Documentation. A project should clearly document the life cycle patterns it will use and make them accessible to the whole team. While the patterns are used directly for planning, making them accessible to everyone ensures that everyone knows the rules to follow and reduces misunderstandings about what is acceptable to do.

For some projects, the life cycle will be determined by an external standard. NASA defines a family of life cycles for all its projects [NPR7120]. This flow is designed to match the key decision points where the project is either given funding and permission to continue, or the project is stopped. It defines a sequence of phases A through F, with phases A-C covering development, D covering integration and launch, E covering operations, and F covering mission close out. Specific kinds of projects or missions have tailored versions of this overall life cycle.

Many companies have similar in-house project life cycle standards that revolve around decision points for approving the project for development and ensuring a product is ready for commercial release.

20.4 Procedures

Procedures define how specifically to do actions or tasks defined in the life cycle. They often take the form of checklists or flow charts.

Procedures are related to the system being built, but are generally portable between similar projects.

Having implemented clear procedures will:

Having common procedures for the whole team makes key work steps less matters of opinion and more based on shared fact. This can improve team effectiveness by removing a source of conflict between team members.

A project can realize these benefits only when the team members know what procedures have been defined, when they can find and understand the procedures, and when the team uses those procedures consistently.

Three steps help team members know what procedures are defined. First, the procedures should be defined in one place, with a way to browse the list of procedures as well as a way to find a specific procedure quickly. Second, the life cycle should indicate when one procedure or another is expected to be used. For example, when the life cycle indicates that an artifact should be reviewed, it should reference the procedure for performing the review. Finally, new team members should be shown how to find all this information for themselves.

Understanding and using procedures depends on the procedures being actionable: they should indicate the specific conditions where they apply, and provide a list of concrete steps for someone to perform. This is especially true for procedures that will be used when someone is under stress, such as in response to a safety or security accident. I have often seen “procedures” that say things like “contact the relevant people”—which is unhelpful. The procedure needs to list who the relevant people are (or at least their roles) so that a person in the middle of incident response can contact the correct people quickly.

20.5 Plan

The project life cycle and procedures define how people should get work done in the abstract, before any work actually gets done. Now I turn attention to how the actual work is organized in the plan and tasking, using the life cycle patterns and procedures as structure.

The plan is a record of the current best understanding of the path forward for the project. It contains the foreseeable large steps involved in getting the system built and delivered, and getting it to external milestones along the way. It guides the work, as opposed to people working on tasks at random.

The plan:

Plans versus schedules. I differentiate between a plan and a schedule.

A schedule is a “plan that indicates the time and sequence of each operation”.[3] In practice, a schedule is treated as an accurate and precise forecast of the tasks that a project will perform. People treat the timing information it provides as firm dates, and will count on things being done by those dates. Schedules are often part of contractual agreements. Because people outside the project use a schedule to plan their own activities, a schedule is hard to change.

Schedules are appropriate when the work to be done can be characterized with sufficient certainty. In most construction projects, for example, once the building design is complete, the site has been checked for geologic problems, and permits have been obtained, the remaining steps to actually construct the building are generally well understood and the time and effort involved can be estimated with confidence. However, before the site has been inspected one might not be able to create an accurate schedule because there could be undiscovered problems in the ground (perhaps an unmapped spring or an unstable soil layer).

The plan, on the other hand, is not a detailed schedule. It is a general indication of the steps to be taken, along with as much information about time required for different steps as can be estimated. It will reflect varying degrees of certainty about the steps and timing, from fairly certain in the near term to highly uncertain later in the work. It provides guidance, but it does not represent a promise of dates or exact sequencing of events.

undisplayed image

A plan is dynamic and constantly changing, as it is a reflection of where the project currently stands.

At the beginning of a project that requires innovation, the team is just beginning to work out what the system will be, and so they cannot build a schedule because there are too many unknowns. As the project works out the customer needs and basic concept, the flow of work becomes a little clearer but most of the work ahead is still unknown. People will continue to learn more and more about the system, and at each step there will be fewer unknowns and the certainty of plans can improve. Even so, the exact schedule is not known until the very end of the project—when there are no places left that could hide surprises.

To some degree, the difference between a schedule and a plan is an attitude. A schedule is something people treat as a contract, and so it does not accommodate uncertainty well. It includes a lot of detail; if that detail is uncertain, people will be constantly updating that semi-fictitious detail. A plan is a flexible current best estimate that doesn’t promise much except to accurately reflect what is known, and avoids information that might appear accurate but in fact is not certain. A schedule is useful to someone writing a contract to get something done. A plan is about an honest accounting of where the project stands and where it is going, and thus more useful to the people building the system.

Plan contents. A plan gathers four types of information:

  1. The set of work steps that can be foreseen to be needed. Some will be detailed; others will be vague or general.
  2. Milestones, both internally-defined and those imposed from the outside.
  3. Dependencies among the work steps, and between work steps and milestones.
  4. Estimations of uncertainty about all of these.

The chunks of work and milestones form an acyclic graph, with dependencies as edges between the work or milestones. The work can be annotated with estimates of resources or time required, to the degree those are known—and they should not be annotated if the information cannot be estimated with reasonable confidence.

In addition, some projects will give work steps a priority or deadline. Tasks that should be done soon should be scheduled early, perhaps to meet a deadline, to address uncertainty, or to account for a task that is expected to be lengthy.
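
To illustrate, a plan can be recorded with the same kind of structure used for life cycle patterns: work steps and milestones as nodes, dependencies as edges, and estimates attached only where they can be made with reasonable confidence. Everything named below is invented for illustration; it is one possible encoding, not a recommended format.

    # Sketch of a plan as an acyclic graph of work steps and milestones.
    # Names, estimates, and dependencies are illustrative only.
    plan = {
        "gather customer needs":      {"depends_on": [], "estimate_weeks": (2, 4)},
        "develop system concept":     {"depends_on": ["gather customer needs"], "estimate_weeks": (4, 8)},
        "concept review (milestone)": {"depends_on": ["develop system concept"]},
        # Too uncertain to estimate yet, so no estimate is recorded.
        "implement system":           {"depends_on": ["concept review (milestone)"]},
    }

Steps without estimates are an honest statement that the duration is not yet known; attaching a number there would only create false precision.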

There is no set format for recording a plan. I have used scheduling tools that use PERT charts and Gantt charts as user interfaces. I have used diagramming tools that help the user draw directed graphs. I have used graphs and time tables written on white boards. I have used tools meant for agile development, with task backlogs and upcoming iterations. All of these have had drawbacks—scheduling tools are not meant for constantly-changing plans; agile development tools are structured around that methodology; drawings on white boards and drawing tools are hard to update over time or to share.

Making and updating the plan. The plan starts at the beginning of a project, and is continuously revised until the project ends.

Assembling an initial plan starts with knowing the status of the project and working out the destinations. At the beginning of a project, the status is that the project is largely undefined beyond a general notion of what customer problem the system may solve. The endpoint might be delivering a working system, or it might be delivering a series of systems that grow over time.

undisplayed image
Initial plan for a new project.

If the project is already in progress, one starts on the plan by working out what is currently completed and in work.

undisplayed image
Example initial plan with milestones filled in.

The next step is to fill in major intermediate milestones and work steps. The project’s life cycle patterns should provide a guide to these. For a new project, the life cycle might indicate that the project should start with a phase to gather information about customer needs. As the first phases progress, the team will begin to develop a concept for the structure of the system. If the customer or funder has required some intermediate milestones, those can be laid into the plan, along with very general work steps for getting ready for each of those milestones.

undisplayed image
Example life cycle pattern for the overall project.

It is normal for the plan to have large work steps that amount to saying that somehow the team will get something completed or designed or whatever. In the example above, “implement system” is completely uncertain when the project starts. When one does not know how part of the system will be designed, or how to implement some component, or even how some part of the work should proceed, it is better to put in a work step that accurately reflects the uncertainty. Being accurate about what is known and not known prompts people to find answers to the unknowns, gradually leading the plan toward greater certainty.

The plan then grows according to the system design. As the team works out the components that will make up the system, each new component creates a stream of work to be done to specify, design, implement, and verify that component, as specified by the life cycle. All these add new work steps into the plan, along with dependencies from one step to the next.

undisplayed image
Example pattern for developing a component (linearly).

The plan should be revised regularly. It will change whenever there is some change to the likely structure of the system and as each component proceeds through its specification and design work. Many components will require some investigation, such as a trade study or prototyping, before they can be designed. The plan will evolve as those investigations generate results.

Part way through building the system, the plan will typically become large and show significant parallelism. This is also normal and desirable, because it reflects the true state of development. Mid-project there usually are many things that people could be working on. The plan should reflect all these possibilities so that those managing the project know the true status of the work and can make decisions with accurate information.

undisplayed image
Example plan in progress. Some steps are complete, some are in progress.

The life cycle patterns a project uses provide building blocks out of which people can construct parts of the plan, but they do not dictate the plan entirely. Maintaining the plan is not simply a mechanical process of adding a set of work steps each time someone adds a new component to the design. There are three more factors to consider, and these make maintaining the plan a task requiring some skill.

First, the various components will be integrated into the system. The steps to put the components together and then verify that they interact correctly add more work steps.

Second, a component does not necessarily proceed linearly through specification to design to implementation. Often the design will require investigation, perhaps a trade study to compare possible alternatives. In many cases it is worth building a simple prototype of one or more of these alternatives to learn more about the component before settling on a design. This turns a design step into several steps. Sometimes the outcome of an investigation is that the whole approach to designing a set of components is wrong and design needs to be revisited at a higher level. (This is the rewinding discussed in the section on life cycle above.)

Third, many system development disciplines, such as agile or spiral development, do not proceed linearly with developing a component from start to finish in one go. They often focus on building a simple version of a component or of a collection of components first, and then adding features over time.

Each project will have its own style for addressing these factors, and this will be reflected in the specific work steps included in the plan. For example, when a project follows a spiral development methodology, the plan for developing a part of the system might have several internal milestones: first a simple version of the components that can do some minimal function, then another version or two with increasing function. There might be design, implementation, and verification steps for each component involved for each milestone.

A project should document what methodology it has chosen, so that team members know what to expect and so they can plan consistently.

Plan and tasking. The plan is used to guide tasking—the assignment of specific tasks to specific people (Section 20.6). The plan includes work steps that are in progress and ready to be executed. These are the sources of tasks that people can pick up and work on.

Most of the time, there will be more tasks that are ready to be worked on than there are people to do them. The plan organizes the tasks and thereby helps the process of deciding what someone should do—whether a manager makes task assignment decisions or people pick tasks for themselves. If work steps in the plan include priorities, those will help guide task assignment decisions.

The plan and tasking together support accountability and measurement. They should allow someone to identify when a plan was changed, to see if the change was an improvement in retrospect. They should help identify when some tasks were completed faster or slower than expected, or completed with quality problems. This information can be used to improve forecasting and to identify tasks and procedures that should be restructured.

Plan and forecasting. Most projects will have deadlines they must meet. Customers want estimated delivery dates, so they can make preparations for steps they will take to put the system in operation. Funders may want intermediate milestones to show that their investment is on track. Others want to know the budget—money and time—required to get the system built, or to meet other internal milestones. The team will need to manage project execution in order to meet those deadlines.

One can look at this as a control problem. Forecasts using the plan provide the control input: based on the current plan, including its uncertainties, is the project likely to hit a deadline or not? The control actions are to rearrange the work steps in the plan, or to add and remove steps. Adding or removing steps often means adding or removing capabilities from the system, also known as adjusting the system to fit the time available.

Forecasting using the plan will always be imprecise because the plan reflects the actual uncertainty in the project. In some industries it is possible to estimate the time and effort required for work steps, within a reasonable error bound, once the system is well enough understood—for example, in many building construction projects. However, when building systems that do not have extensive comparable systems to work from, estimates will be unreliable for much of the project’s duration.
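
One common way to produce a forecast that carries this uncertainty along, rather than a single falsely precise date, is to sample ranges of estimates instead of adding up point values. The sketch below applies that idea (a generic Monte Carlo technique, not a method this book prescribes) to a simple chain of steps with made-up three-point estimates.

    # Sketch: forecasting the duration of a chain of work steps by sampling
    # (optimistic, likely, pessimistic) estimates. All numbers are illustrative.
    import random

    steps = {
        "specify":   (2, 3, 6),    # weeks
        "design":    (3, 5, 10),
        "implement": (6, 10, 20),
    }

    def sample_total(steps):
        return sum(random.triangular(lo, hi, likely) for lo, likely, hi in steps.values())

    samples = sorted(sample_total(steps) for _ in range(10_000))
    print("median:", round(samples[len(samples) // 2], 1), "weeks")
    print("90th percentile:", round(samples[int(len(samples) * 0.9)], 1), "weeks")

Reporting a range or a percentile, rather than a single date, keeps the forecast honest about what the plan actually knows.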

There are ways to manage a project’s plan to reduce uncertainty as quickly as possible. I discuss those later in this book.

20.6 Tasking

Tasking is about the day-to-day management of what tasks people are working on and what tasks are ready to be worked.

The choice of what tasks are ready is based on the plan, along with bugs that have been found, management tasks that need to be done right away, and ongoing tasks that do not show up in the plan.

Tasking builds on the plan. The plan should account for which tasks need to be done sooner than others in order to meet deadlines or to avoid stalling because of a dependency between tasks.

The objectives for tasking are:

One can treat tasking as a decision or control process that works to meet those objectives. Other scheduling disciplines, such as job-shop scheduling and CPU scheduling, can provide useful ideas for how to make choices about who should work on what.

There are many different choices about when, who, and how much. Each project will need to define its own approach, usually following whatever development methodology the team has selected. The approach should be documented as a procedure that the team follows.

Decisions about tasking can happen at many different times. It can happen reactively, when one task is completed, when a task someone is working on is stalled waiting for something else to happen, or when some urgent new task arrives (such as a high-priority bug or an external request). It can also happen proactively or periodically, putting together a set of tasks for someone to do ahead of time.

Tasking can be done by different people as well. In the teams model (Section 19.2.4), the authority to make tasking decisions is a role that can be assigned. One person can have a scheduler role and make these decisions. A group can divide up tasks by discussing and reaching consensus. Each person can take on tasks when they are ready for more. Combinations of these also work.

Finally, tasking decisions can occur one task at a time, or they can focus on giving each person a queue of tasks to perform.
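
As a sketch of the simplest of these arrangements (one task assigned at a time from the ready set), the code below picks the highest-priority ready task and gives it to the least-loaded person with the needed skill. The fields, priorities, and skills are invented for illustration; this is one possible policy, not a recommendation.

    # Sketch: assigning one ready task at a time. Task and person attributes
    # are illustrative; a real project would draw them from its plan and directory.
    tasks = [
        {"name": "fix telemetry bug",   "priority": 1, "skill": "software"},
        {"name": "review power budget", "priority": 2, "skill": "electrical"},
    ]
    people = [
        {"name": "A", "skills": {"software"},               "open_tasks": 2},
        {"name": "B", "skills": {"software", "electrical"}, "open_tasks": 1},
    ]

    def assign_next(tasks, people):
        for task in sorted(tasks, key=lambda t: t["priority"]):
            candidates = [p for p in people if task["skill"] in p["skills"]]
            if candidates:
                person = min(candidates, key=lambda p: p["open_tasks"])
                return task["name"], person["name"]
        return None

    print(assign_next(tasks, people))   # ('fix telemetry bug', 'B')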

A large project will have a very large number of tasks—potential, in progress, and completed—to keep track of. Using a shared task tracking tool of some kind is vital. Without one, tasks will be forgotten, or there will be confusion about how they have been assigned. The tracking tool is another one of the tools that the project should maintain (Chapter 18).

Each task must be defined clearly enough that the person doing the work can properly understand what is to be done, and so that everyone can agree when the task is complete.

20.7 Support

The decisions made in planning and tasking need supporting information.

Risk and uncertainty affect choices of what should be done sooner or deferred. I have often chosen to prioritize work that will reduce risk or clarify uncertainty, in order to make the project more predictable down the road. Many projects maintain a risk register, which lists matters that could put the project at risk. These risks are often programmatic, such as the risk that a delayed delivery from a vendor will cause the project to miss a deadline. I have on some projects maintained a separate, informal list of the technical uncertainties yet to be worked out; for example, how should a particular subsystem work?

Project management will also need to manage budgets. Programmatic budgets, most often funding, affect how project execution can proceed. Technical budgets, such as mass, power, or bandwidth, are aspects of the system being built. For both types of budgets, the amount of the resource that has been used and the amount left need to be tracked. The project will also need to estimate how much more will be needed to finish the project. If there isn’t sufficient resource left, then project management has a decision to make—whether to reallocate resources, reduce demand, or find more resources.
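
A minimal sketch of the bookkeeping involved, for either kind of budget, is to compare what remains against an estimate of what is still needed. The names and numbers below are invented for illustration.

    # Sketch: checking a budget (programmatic or technical) against the
    # estimate to complete. Values are illustrative.
    budget = {"allocated": 120.0, "used": 85.0}   # e.g., kg of mass, or funding
    estimate_to_complete = 50.0

    remaining = budget["allocated"] - budget["used"]
    shortfall = estimate_to_complete - remaining
    if shortfall > 0:
        print(f"Short by {shortfall}: reallocate, reduce demand, or find more resources")
    else:
        print(f"Margin of {-shortfall} remaining")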

Almost every project will need to report on how the work is progressing, relative to deadlines and available resources. The plan mechanisms should help people obtain and organize this information.

20.8 Using the operations model

20.8.1 Managing operations

Managing operations has much in common with managing the team structure (Section 19.3.1). The approach to operations is defined early in a project, leads to habits that the team uses, and is maintained as the project moves along. Operations fits into the control system model proposed for teams.

The team decides on its initial approach to operations when the project starts. The initial approach might be simple and worked out on the fly, appropriate for a project that must first sort out what kind of system it is going to build, as happens in a startup company. Other projects will inherit an existing approach to operations based on the organization they are part of or from the team’s previous experience.

A team establishes habits of how to do their work early on in a project. People mostly act on their internalized understanding of how they are expected to work, not by referring to documents or by working out what they should do from first principles. When these habits match the behaviors needed for the project to meet its objectives—such as those listed in Section 20.1—then the project can proceed smoothly. If there is a mismatch, then the project will have problems.

This implies that a project should establish its operations approach early, meaning the life cycle patterns and procedures it uses, along with assigning roles to handle plans, tasking, and support. The earlier that a project makes these decisions, the sooner the team can learn them and begin building habits around the intended approach. I have supported projects that prioritized writing code over working out operations; when those projects later tried to confront the operational problems they were having, they were unable to get people in the team to change the habitual and dysfunctional behaviors they had worked out in the absence of direction.

At the same time, no approach to operations will be perfect, and a project’s needs often change over time. A well-functioning project has people responsible for watching how well project operations are working, detecting when there are problems, and adjusting the life cycle, procedures, or assignments from time to time. The control system approach introduced for teams is helpful for thinking about how to maintain a project’s operations.

20.8.2 Development methodology and operations

Each project will at some point choose a development methodology to follow. There are several popular methodologies, such as waterfall development, spiral development, or agile development, along with a great many variants of each.

The operations model I have presented can support any of these methodologies. The methodologies affect the life cycle patterns, how the plan is structured, and how tasking is done.

Waterfall development is characterized by developing the system linearly, starting with a concept and working through design and implementation of each of the pieces, then integrating those pieces together to form the final system. The life cycle pattern for waterfall development will reflect this linear ordering, and plans will follow the life cycle pattern.

Spiral development is organized around a set of intermediate milestones. The system becomes a bit more complete at each of these milestones (or iterations). Each milestone adds some set of capabilities, and the system, or some part of it, is integrated and operable at each milestone. The life cycle pattern for spiral development will define the way each spiral or iteration proceeds. The plan will reflect how the team will reach each of the upcoming milestones.

Agile development is organized around short cycles (called sprints in some versions of the methodology). Each cycle typically lasts a few weeks, and adds a small number of capabilities to the system. The system is expected to be integrated and operable at the end of each cycle. Unlike spiral development, the objectives for each cycle are typically decided at the beginning of the cycle based on the set of tasks that are ready to execute, and priorities for each task. This means that agile development is primarily about tasking, and it relies on a plan that defines what all the ready tasks are. Agile development can be complemented by a life cycle pattern that imposes discipline on the order of tasks—such as doing specification and design before implementation, or setting a review and approval step to ensure work quality.

In practice, most projects end up using a combination of methods.

The cost or difficulty of changing a decision usually drives a project to combine methods. The easier it is to change a decision, meaning undoing the work of some tasks already completed, the more agile the methodology can be. The more costly it is, the more care that should be taken to ensure that changes downstream are unlikely. (I discuss this further in Section 21.10, using the idea of reversible and irreversible decisions [Bezos16].)

The cost of change is significantly lower near the beginning of a project, when there is less work to be redone and when one change will not cause a cascade of changes to other work already completed. As work progresses, a particular change will become increasingly costly.

The cost of change also depends on the kind of work involved. Software and similar artifacts are malleable. The cost of changing a line of software source code or changing one line in a checklist is, in itself, tiny, though a change in the software may cause a cascade of changes in other parts of the system and may cost time and effort to verify. Changing a built-up aircraft airframe, on the other hand, is costly in itself—in both materials and in effort.

These differences in the cost of change lead to differences in life cycle patterns and planning related to potentially-expensive decisions. For example, the NASA family of life cycles [NPR7120] follows a linear pattern in its early phases so that key aspects of the project can be worked out before the agency commits to large amounts of funding, especially for building aircraft or spacecraft hardware. Parts of some of these projects follow a more agile process after they have passed the Critical Design Review milestone (Section 23.2.1).

20.8.3 Practical considerations

Some people will look at the life cycle and procedure parts of this operations model and say that it is “process”—a term that has acquired a negative connotation. Yes, the life cycle and procedures do define processes that are supposed to guide the team. Process, when done well, helps a team work more effectively and more happily. Such process is simple: it is a guide for how to do common sequences of events, or tasks that are critical to be done a certain way. It provides a checklist to make sure things get done and aren’t missed. It encodes checks to make sure technical work is done correctly.

I have outlined the advantages that life cycle patterns and procedures can bring to operations in Section 20.1 above.

In my experience, the potential disadvantages, and the reasons people have come to dislike the idea of process, arise from three misuses of operations: making it too heavy, making it too complicated, and defining something the team is unwilling to use.

As an example, a colleague told me about a project they had worked on where getting approval to order a fairly simple part (for example, a cable) took multiple approvals and potentially weeks to complete (heavy process). Indeed, nobody was even sure exactly how to go through the process to get an approval to get the part ordered (complicated process). The processes were, presumably, put in place to ensure that only parts of sufficient quality were used and to manage the spending on parts acquisition. In practice the amount of money spent on people’s time far outweighed potential cost savings, and the amount of work required for people to review an order over and over meant that the reviewers did not have the time needed to perform meaningful quality checks.

A “heavy” life cycle or procedure is one that takes more effort or more time than is warranted for the value it provides. This works against the objective of doing work efficiently. Each part of a life cycle pattern or procedure should have a clear reason for being included. The effort and time involved should be compared to that reason, and the procedure or pattern should be redesigned if the comparison shows it is too heavy. To avoid this, each procedure and life cycle pattern should be scrutinized to eliminate any steps that are not actually needed.[2]

A complicated life cycle or procedure is one that involves many steps, often with complex conditions that have to be met before some step can proceed. In the example from my colleague, nobody on the team could figure out all the steps that needed to be done. This can be avoided by, first, ensuring that procedures are as simple as possible, and second, by documenting them and making that documentation easy for people on the team to find and understand.
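
As a concrete illustration of keeping a documented procedure simple and findable, the sketch below encodes a checklist-style procedure as plain data. The format, names, and example steps are my own invention for illustration, not a prescribed tooling choice.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    """One step in a documented procedure."""
    action: str        # what to do
    reason: str        # why the step exists; steps with no reason are candidates for removal
    done: bool = False

@dataclass
class Procedure:
    """A documented procedure: a named checklist with a stated purpose."""
    name: str
    purpose: str
    steps: list[Step] = field(default_factory=list)

    def remaining(self) -> list[Step]:
        """Steps not yet completed, so anyone can see what is left to do."""
        return [step for step in self.steps if not step.done]

# Hypothetical example: a deliberately light parts-ordering procedure.
order_part = Procedure(
    name="Order a simple part",
    purpose="Ensure part quality and spending control without weeks of approvals",
    steps=[
        Step("Check the part against the approved-parts list", "quality control"),
        Step("Get one approval from the responsible engineer", "spending control"),
        Step("Place the order and record it", "traceability"),
    ],
)
print([step.action for step in order_part.remaining()])
```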

Teams are generally willing to follow procedures, as long as a) they know what the procedures are; b) they understand the value of following them; and c) following procedures has been made a part of the team’s norm. This means that the life cycle patterns and procedures should be documented, and their purpose or objectives should be spelled out. Normalizing following the procedures, however, is not something that can be accomplished by just writing something down. This has to be practiced by the team from the start, with leadership setting examples (see Section 19.3). Involving the team in setting up the life cycle patterns and procedures can help people understand and buy into the process.

Bear in mind that when a project adopts a particular life cycle pattern, the project is making an implicit commitment about staffing. If the pattern indicates that certain reviews must happen before key events happen, like ordering an expensive piece of equipment or beginning a complex implementation effort, then the project must ensure that there are enough people with enough time to perform those reviews. If the project does not staff enough, people on the team will quickly learn the (correct) message that the project or its organization does not actually care about the reviews and will begin to work around the pattern.

How all of this is handled for a particular team depends a lot on the team’s size. It’s common for a project to start with a simple life cycle and simple planning while the team is small and the project is uncertain. The project will need to shift strategies at times as the team grows and as the work becomes more complex and interconnected (see team growth in the previous chapter).

Sidebar: Summary

Appendixes

Appendix A: From stakeholder need to model purposes

8 January 2024

A.1 Introduction

In Chapter 16, I presented an approach for determining what features and capabilities the project should support in order to do a good job of building a system and meeting stakeholder needs. In this appendix, I present the details of that derivation.

Bear in mind that this derivation results in a set of objectives for a project. It does not say how any particular project should meet these objectives; each project must decide those things in ways that meet the specific needs of that project and that system. The objectives can be seen as a set of considerations that each project should examine as they decide how to run the project.

The derivation only addresses matters that are related to the project’s approach to building a system. There are many other factors outside this scope: matters of project management, or of policy in the organization that hosts the project. Where appropriate I have made notes of these matters external to the system-building scope.

A.1.1 Stakeholders

The set of stakeholders is:

  1. The customer for which the system is being built;
  2. The team that builds the system;
  3. The organization(s) of which the team members are part;
  4. Funders who provide the investment to build the system; and
  5. Regulators who oversee the system and its building.

I introduced each of these in Section 16.2.

A.1.2 Model elements

I introduced the model for making systems in Section 7.3. This model is organized around the tasks that need to be performed to build the system, and has the following elements:

  1. Artifacts that are created by performing tasks, and represent the system and records about it;
  2. The team that builds the system by performing tasks and making artifacts;
  3. The tools that people on the team use in doing tasks; and
  4. The plan that organizes what tasks need to be done, in what order, and using what resources.

In addition to these elements, I have included an element for matters external to the system-building project: things that stakeholders need but that are not about building the system itself.

A.1.3 Derivation

The derivation maps stakeholder needs onto objectives for parts of the model.

(Figure: the derivation mapping stakeholder needs onto objectives for parts of the model.)

The result is a set of objectives or capabilities that people should consider when working out how the project should operate.

I discuss each stakeholder in the sections that follow, along with tables of the needs or objectives of each. The objectives that support these stakeholder objectives are annotated with a right-pointing arrow: →.
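
To make the notation in the tables below concrete, here is a minimal sketch (my own illustration, not part of the derivation) of how an objective, its statement, and its supporting cross-references might be represented and checked for dangling identifiers. The two entries are transcribed from the Customer and Artifacts tables; everything else is assumed.

```python
from dataclasses import dataclass, field

@dataclass
class Objective:
    """One entry in a derivation table: an identifier, a statement, and the
    identifiers of the objectives that support it (the arrow annotations)."""
    ident: str                            # e.g. "model:2.1.1"
    statement: str
    supported_by: list[str] = field(default_factory=list)

# A small fragment transcribed from the tables in this appendix.
objectives = {
    "model:2.1.1": Objective(
        "model:2.1.1",
        "The project must know what the customer's purpose for the system is",
        ["model.artifacts:2.1", "model.plan:3.2.1", "model.team:2.1.1.1"],
    ),
    "model.artifacts:2.1": Objective(
        "model.artifacts:2.1",
        "The artifacts must include documentation of the customer's purpose for the system",
    ),
}

def dangling_references(objs: dict[str, Objective]) -> set[str]:
    """Identifiers that are referenced but not defined -- a useful check when
    maintaining a derivation like this one by hand."""
    referenced = {ref for obj in objs.values() for ref in obj.supported_by}
    return referenced - objs.keys()

# The plan and team entries are not defined in this fragment, so they are reported.
print(dangling_references(objectives))
```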

A.2 Stakeholders

A.2.1 Customer

The customer (see Section 16.2.1) is a stakeholder who wants the system built because they are going to use the system. They may or may not be funding system development directly—if they are, then they are also a funder below.

model:2 Customer
2.1 Fill purpose
The project must deliver a system that meets the customer’s purpose
2.1.1 Know purpose
The project must know what the customer’s purpose for the system is
model.artifacts:2.1
model.plan:3.2.1
model.team:2.1.1.1
2.1.2 Build to purpose
The project must produce a system that meets the customer’s purpose
model.artifacts:1.1, 2.1, 4.2, 4.4, 4.5, 5.1, 5.2
model.plan:1.2, 2.1, 2.2, 2.3, 3.3, 3.3.2
model.team:2.2.1, 2.2.2, 2.5.1
model.tools:3.1, 3.2, 4.1
2.1.3 Know requirements
The project must know the customer’s reliability, safety, and security requirements
model.artifacts:2.1.2
model.plan:3.2.1
model.team:2.1.1.1
2.1.4 Meet requirements
The project must produce a system that meets the customer’s reliability, safety, and security requirements
model.artifacts:2.1.2, 4.5, 5.1, 5.2
model.plan:3.3, 3.3.5
model.team:2.2.1, 2.2.2
2.1.5 Free of errors
The project must produce a system that is free of errors
model.artifacts:4.5
model.plan:3.3.5
model.team:2.2.2
2.2 On time and budget
The project must deliver a system by the required deadline and within the needed budget
model.plan:1.2.5, 4.1, 4.2
2.2.1 Know budgets
The project must know the budgets and deadlines for delivering the system
model.plan:3.2.2, 4.1
2.2.2 Know consumption to date
The project must know the resources and time that have been used to date and that count against budgets or deadlines
model.plan:4.1.1
2.2.3 Project forward usage
The project must be able to project the resources and time required to complete the system or meet other deadlines
model.plan:1.2
2.2.3.1 Uncertainty
The project must be able to estimate the uncertainty in any forward projections of resources or time
model.plan:1.2.1
2.2.4 Control execution
The project must be able to control execution to adjust resource and time consumption
model.plan:1.2.4
2.3 Certifications
The project must deliver a system that has appropriate certifications or approvals
2.3.1 Know regulations
The project must know the regulations or standards that apply to certification/approval
model.artifacts:8.1
model.plan:3.2.5
2.3.2 Follow process
The project must follow any processes required to get certification/approval
model:2.5.2
model.artifacts:8.2, 8.3
model.plan:3.3.1.1, 3.3.2.1, 3.3.3.1, 3.3.7
2.4 Release and deployment
The project must be capable of releasing a version of the system and deploying it to a customer
model.artifacts:1.1, 6.1
model.plan:3.4
model.team:2.5.1
model.tools:3.5, 4.3
2.5 Evolve system
The project must evolve the system in response to changes in customer or other needs
2.5.1 Receive requests for change
The project must be able to receive and process requests for change from the customer
model.plan:5.1, 5.3
model.team:2.1.1.2
2.5.2 Receive regulatory changes
The project must be able to receive and process changes in regulatory requirements
model.plan:5.2
model.team:2.3.1.2
2.5.3 Know purpose of change
The project must know the purpose of the change (and the change in system purpose that results)
model.artifacts:2.2
2.5.4 Build to meet change
The project must be able to produce a system that meets the changed purpose while maintaining the system’s other purposes and requirements
model.artifacts:1.1, 2.1, 2.2, 4.2.1
model.plan:1.2, 2.1, 2.2, 2.3, 3.3.6
model.team:2.1.1.2, 2.2.1, 2.2.2, 2.5.1

A.2.1.1 Filling purpose

A customer has some purpose for the system, meaning something they want to achieve by deploying and using the system. This is the problem that the customer wants solved, which is a higher-level concern than the specific features that the system will provide.

A customer may have additional requirements on the system. They likely have a need for a minimum level of reliability. They likely have needs related to safety and security of the system.

The project needs to build a system that can meet this purpose and the requirements.

The project can meet these needs by:

A.2.1.2 On time and budget

The customer likely has a deadline by which they would like the system delivered. They likely also have a budget for how much they want to invest in acquiring the system. At minimum, customers generally want the result as soon as possible and for as low a price as possible.

To meet these needs, the project should:

A.2.1.3 Certifications

In many industries, some kind of certification or approval is necessary to operate the system. An aircraft, for example, needs a type certification from the local aviation authority as well as approval for a specific instance of the aircraft. Even if there is no overt certification required, there are often regulatory standards to be met.

The project must build the system in compliance with regulations. When certification is needed, the project must follow the process to get that certification.

To achieve this, the project should:

A.2.1.4 Release and deployment

The customer needs the system actually to be delivered and put into operation. The project must deliver the system, and provide or support its deployment.

To do this:

A.2.1.5 Evolve system

If the system is successful, the customer often finds that it can be made even better with some changes. Or the customer’s needs may change, and they will want the system to adapt to meet their changed needs. The project should be able to maintain and evolve the system to support the customer’s changing needs.

A system may also need to change when regulations change.

The project can support an evolving system by:

A.2.2 Team

The team (see Section 16.2.2) is the collection of stakeholders who build the system. These people need the things that skilled, technical workers generally need: satisfaction, security, confidence, compensation.

Meeting these needs is mostly outside the scope of systems-building itself. These needs are largely met by the project and organization management who create the environment in which the team works. Still, there are aspects of systems-building that can help (or hinder) meeting the team’s needs.

The analysis of a team’s needs presented here is somewhat idealistic. It focuses on skilled workers who are not readily interchangeable, whose value to a project derives in part from the knowledge they carry about the system being built. It assumes workers who are motivated largely by work satisfaction and whose essential material needs are met by their compensation. These assumptions lead to a particular balance of power between the team and the organizations that employ them. This would not apply to a team of interchangeable workers or workers whose material needs are not well met by their employment.

model:3 Team
3.1 Satisfaction in the work
The team must have work that challenges them and results in satisfaction in what they produce
3.1.1 Positive outcome of work
The team must have confidence that their work will have a positive outcome
model.external:1.1
model.plan:1.1, 1.2
3.1.2 Challenging work
The team must find that the project’s work challenges them and makes use of their skills while remaining achievable
model.external:1.1, 1.2
3.1.3 Avoid irrelevant work
The team must believe that they are not being asked to do irrelevant work as part of the project
model.artifacts:1.3, 1.3.1
model.external:1.2
model.plan:1.2.5
model.tools:1.1
3.2 Appropriate staffing
The team must be staffed with the right people to do the work
3.2.1 Sufficient staffing
The project must have sufficient staff, with the right skills, to build the system
model.external:1.3
model.plan:1.2.3, 6.1, 6.3
3.2.2 Not overstaffed
The project must not be overstaffed in a way that leaves some unable to make meaningful contributions
model.external:1.3
model.plan:1.2.3, 6.1, 6.4
3.3 Sufficient supporting resources
The project must provide the team with sufficient resources to do the work
model.tools:3.2, 3.3, 4.1, 4.2, 5.1
3.4 Secure position
The people in the team must feel secure in their position in the team
3.4.1 Understanding of fit
The team members must understand how they fit into the organization
model.external:1.5
model.team:1.1
3.4.2 Clear expectation
The team members must have a clear and correct understanding of their responsibilities in the project
model.plan:1.2.7, 2.4, 3.2.3, 6.2, 6.3
model.team:1.2
3.4.3 Fair evaluation
The team members must have an expectation that their work will be fairly evaluated
model.external:1.7
model.team:1.2.1
3.4.4 Clear lines of authority
The team members must have a clear understanding of the authority of others in the project
model.artifacts:3.1
model.plan:3.2.3, 6.2, 6.3
model.team:1.1.1, 1.1.2
3.4.5 Ability to raise issues
The team members must have the ability to raise issues about the team and about the system, without retribution
model.external:1.4
model.plan:2.4, 3.3.5
model.team:4.1
3.5 Fair compensation
The team must be fairly compensated for their time and effort
model.external:1.6
3.6 Belief in project
The team must be able to believe in the project, its purpose, and its leadership
3.6.1 Belief in objective
The team must have confidence that the organization is accurately working with the customer
model.plan:1.1
model.team:2.1.2
3.6.2 Ethics
The team members must believe that the system will be used in ways that accord with their ethical beliefs
model.artifacts:2.1.3
model.external:1.8

A.2.2.1 Satisfaction in the work

Team members are expected to need satisfaction arising from the work they are doing on the project.

The satisfaction comes in part from believing that the work they are doing will have some positive outcome. That outcome might be that they see the system deployed and having a positive effect on the world. It might be that they see their work acknowledged, publicly or privately, even if the system ultimately is not deployed. It could come from social standing among their peers improving because of their association with the work.

Skilled workers also want work that makes use of their skills—which leads to a sense that they, as a specific individual, are making a contribution to the work. Work that challenges them or from which they learn things contributes to that satisfaction.

Doing work that is seen as not relevant or not likely to have value decreases their satisfaction. If asked to do something that is not achievable, they will lose enthusiasm. If they are asked to do work that they perceive as irrelevant, they will feel that their individual contribution has little value.

To support team satisfaction, the project can:

Other aspects are outside of the project’s scope.

A.2.2.2 Appropriate staffing

A team needs to have enough of the right people to do the work—but not too many people. Having enough people on the team who can do the work contributes to a team member’s sense that the project has a good chance of having a positive outcome.

Having too few people, or too many people who lack necessary skills, leads to an overworked and burnt out team.

Having too many people leads to team members who don’t have useful work to do. It can lead to people making up new work just to feel like they are contributing.

The “right” staffing level is dynamic. It changes over time as the project moves forward: a particular skill in designing electronics boards may be important for one period in the project, but once the necessary hardware has been designed and built, the need decreases. It changes over time as people change. As a team member learns new things, they may find that they should move on to a different project. Life events occur that change a person’s motivations and needs. The key is not to always have the perfect cohort working on the project, but to have a pretty good group and work to address changes as they happen. If the team trusts that management is able to address team composition, people will generally stay satisfied.

Ensuring appropriate staffing involves:

Much of this is outside the scope of the project itself. The organization holds the funds used to pay staff. It also provides the ability to hire and fire people.

A.2.2.3 Sufficient supporting resources

As with staffing, the team needs resources to do their work: a place to work and the tools to do the work, for example. They may need consumable resources as well. For example, a team might need a ready supply of liquid nitrogen in order to test a hardware component that is supposed to operate at low temperature.

If the team lacks these resources, they can’t do their work. This affects their satisfaction.

The project needs to have:

A.2.2.4 Secure position

Team members need to have a sense of security in their position. This means that they need to believe that they understand their position in the project and organization, believe that they will be treated fairly, and believe that issues they raise will be addressed. The opposite of this is when they have a sense of insecurity—because they do not understand what is expected or how they are evaluated, or because they believe that problems will not be resolved, even if they raise an issue.

The sense of security allows people to put their effort into their work, rather than spending their time and energy on worry. The sense also helps keep people on the team so that their knowledge of the system continues to benefit the project.

This comes from the project:

The organization also needs to:

A.2.2.5 Fair compensation

A technical worker needs to believe they are being fairly compensated for their time and effort. They need to be compensated well enough that they are not distracted by want. That compensation may be monetary, but it may take other forms as well.

Setting compensation policy is usually a responsibility of the organization, not the project.

A.2.2.6 Belief in project

Skilled workers often have choices about what project they work on. Many of them are motivated by a belief in the work they do: that it will help its users, or that it will result in some good for the world. If they come to believe that either or both is not true, they will be demotivated.

The project should:

The organization should also maintain an ethics policy that details:

A.2.3 Organization

The people in the team work for the organization, which provides a home for the project (see Section 16.2.3.) The organization is responsible for finding funding for the project and providing a legal entity for doing the work. The organization provides the business operations that make the project possible.

There is no one kind of “organization” that fits all situations. The organization might be anything from a single person, to a company, to a consortium of organizations, depending on the project. The organization might exist to return profit in exchange for the work, or it might be a non-profit or a governmental organization that looks for non-monetary benefits from the project. Some organizations exist only to build and deliver one system; others expect to deliver the system to many customers and to build more systems in the future.

Many of an organization’s needs are not to be met by the system-building project itself; they are met by how well the organization’s management and business operations function. Nonetheless, how the system is built can help or hinder business management or operations.

The diversity of kinds of organizations means that the list of needs below has to be tailored for each project and each organization.

model:4 Organization
4.1 Ability to deliver
The organization must have the ability to deliver the working system to the customer
4.1.1 Ability to communicate with customer
The organization must be able to communicate with the customer
model.team:2.1.1
4.1.1.1 Conflict resolution
The organization must be able to negotiate and resolve conflicts between the team and the customer
model.team:2.1.1.3
4.1.2 Support for the team
The organization must have the infrastructure to support the team
4.1.2.1 Leadership
The organization must have leadership that can run the organization in a way that enables the team
model.external:1.4, 1.5, 1.7, 1.8
4.1.2.2 Infrastructure
The organization must have the ability to staff and finance the team
model.external:1.3, 3.1
4.1.2.3 Resources
The organization must have resources to hire the team and for them to operate
model.external:1.3
4.1.2.4 Workplace regulation
The organization must provide a workplace that meets regulation
model.external:1.9
4.2 Ability to sell
The organization must have the ability to sell the system produced (when appropriate)
model.external:2.1
4.2.1 Articulate value
The organization must be able to articulate the value of the system product being sold
model.artifacts:2.1
4.2.2 Market
There must be a market for the system being sold
model.artifacts:2.1
model.external:2.2
model.plan:3.2.1
4.2.3 Sales and marketing team
The organization must have a sales or marketing capability, with the resources to do its job
model.external:2.3
4.3 Profit
The organization must get enough profit from the project to fund overhead and to support future projects
model.external:3.2
model.plan:1.2.5, 1.2.6, 3.2.6
4.4 Positioning for future work
The organization must be positioned for future projects and/or maintenance of this system
4.4.1 Reputation
The organization must have a reputation for being able to build systems well
model:2.1, 2.2, 2.5
4.4.2 Reusable capability
The organization must have capabilities in processes, teams, and tools that will apply to future projects
model.external:4.1
4.4.3 Ongoing improvement
The organization must be able to learn and improve its capabilities over time
model.external:4.2

A.2.3.1 Ability to deliver

The purpose for the organization pursuing a system-building project is to deliver a system to the customer. If the project does not deliver something, the organization will see little return on its investment and effort.

Of course, an organization might get a contract from a customer and get started, only for the customer to cancel the contract. (Hopefully the organization has taken this into account in its planning.) The organization still needs to have been able to deliver the system, even if the work was stopped.

Beyond the general ability to build a system for the customer, the ability to deliver has two aspects: communication with the customer and support for the team.

When the system being built has a specific customer, the organization needs to be able to talk to them, keep them updated on progress, and hear concerns or issues from the customer. When there is disagreement, the organization needs people who can negotiate and resolve issues.

The project can help this by maintaining the interface with the customer, including having people assigned to work with the customer, documenting what they learn from the customer, and negotiating with the customer as issues arise.

The project team can do little without the organization supporting them. The team needs leadership; it needs workspace and other infrastructure; it needs human resources and payroll and accounting support. The organization needs to:

A.2.3.2 Ability to sell

If the system is expected to be delivered to multiple customers over time, the organization needs to be able to find those customers, make the case to them that the system will benefit them, and work out a deal to deliver the system.

I have written this need in terms of selling, but the same needs apply when something is delivered without a monetary return. An open-source project that is freely available to users does not sell the system for money, but the project only has value if users pick up, deploy, and use the system. The project may want to attract developers to build up an ecosystem of related products or services. Meeting these needs involves making potential users aware of the system and making the case that they will benefit from it.

To be able to sell the system, the organization needs to:

A.2.3.3 Profit

The organization will be expecting to get some kind of return on its effort. That may be a monetary return, but a non-profit or government agency may look for a non-monetary return, such as a community benefit.

The project can support this in two ways. First, the organization can set business objectives for the project, such as expected profit. The project can keep records of these objectives, and take them into account in the system’s design. Second, the project can organize its work as efficiently as possible so that investment goes as far as possible (consistent with deadlines). The project’s management can monitor the time and money being spent and work out how to adjust the project if it looks likely that the project will not meet the organization’s expectations on return or profit.

A.2.3.4 Positioning for future work

Many organizations build multiple systems over their existence—whether this is building multiple bespoke systems for customers, or building multiple products that are delivered to many customers. The ability to continue to deliver profitable systems is a major part of a company’s stock performance: the stock price is determined by the market expectation of future returns to the investors.

An organization’s reputation affects its ability to attract customers and investment, as well as its ability to hire talented staff. The reputation depends in part on its ability to deliver good systems.

An organization can become more productive over time—and thus improve its reputation, its ability to deliver, and its profitability. This comes from learning and improving. If the organization builds up a staff that knows how to run a system-building project well, future projects can be executed more efficiently. Better tools will help the next projects. However, learning and improvement do not often happen by chance; they happen when an organization sets out to learn from its performance.

The project can:

The organization should:

A.2.4 Funders

The funders provide the investment that funds the team building the system (see Section 16.2.4.) The funder provides these resources in the expectation of some kind of return, be that monetary or not. A venture capital funder is most likely to look for a monetary return from future profits of the organization it is funding. A company providing internal funding is more likely looking for the project to add to the company’s capabilities, which will in turn enable the company to increase its future profits. A government agency is likely looking for something that will benefit the public in some way.

As noted earlier, there are many different kinds of funders, from venture capital to company internal funding to customers paying for development.

model:5 Funders
5.1 Return on investment
The funder must get at least the expected return on its investment
model:4.2, 4.3
5.1.1 Visibility
The funder must have sufficient visibility into the organization’s behavior and progress to determine when the project is at risk of not providing a return on investment
model.external:4.3
model.plan:1.2, 4.1, 4.2
5.1.2 Influence
The funder must have influence on the organization or project in order to address performance that will jeopardize return on investment
model.external:4.3.1
5.2 Ability to attract future investment
The project must help the funder attract investment for future projects
model:5.1

A.2.4.1 Return on investment

Funders provide capital to run the project on the expectation that they will get some return on that investment.

In some cases, the return will come from profit realized in building the system (Section A.2.3.3) or from an increase in the value of the organization (Section A.2.3.4). In other cases the return will come from the value of the system after it is delivered and deployed (Section A.2.1.1, Section A.2.3.1).

The funders will also expect to be able to track the organization’s and project’s progress, and to raise issues when they find that there may be a problem that could jeopardize the funder’s return. The organization needs to have people whose responsibility includes interfacing with the funders.

The project can support the interface with funders by maintaining a realistic plan for the work, managing its budget, and keeping the organization informed of progress. The project should also have the processes in place to respond when the funders raise an issue that leads to a potential change to the system.

The project may also need to maintain accurate records and artifacts that allow the funder to audit the project—verifying that the information the funder has received is accurate and complete.

A.2.4.2 Ability to attract future investment

The funders get the capital they invest from somewhere. In many cases, the investment capital comes from their customers: individual and institutional investors for venture capital, legislatures and the public for government investors. The funders will keep their investor customers satisfied if they can show that their investments produce the expected returns, leading to a reputation for using capital wisely. At the same time, funders want to avoid bad press from projects that have problems, which can reflect on the funders’ ability to select organizations or projects.

The ways that the project can address this funder need are all included in the previous section, on the funder’s need for return on investment.

A.2.5 Regulators

Regulators (Section 16.2.5) provide an independent check on work to ensure that it meets regulations or standards, thus ensuring that some public good is maintained that the organization or project might not otherwise be incentivized to meet.

The interaction between the project and regulators depends on the countries involved and the nature of the project. Some industries require licensing or certification of some kinds of system: most aircraft, for example, must obtain type certification from the local civil aviation authority before that aircraft is allowed to fly or be sold. Spacecraft require a set of licenses for launch and communication. Other industries, such as consumer electronics or automobiles in the United States, depend on compliance with regulation but compliance is only checked after the fact.

I include voluntary standards as part of regulation. Non-governmental organizations set interoperability standards; the standards for USB (set by the USB Implementers Forum) and WiFi (set by the Institute of Electrical and Electronics Engineers 802.11 working group) are examples. Other organizations set safety standards that help to ensure consumer products are checked to be safe.

The regulators perform multiple tasks:

model:6 Regulators
6.1 Compliance and certification
The regulator must be able to work with the project to ensure regulatory compliance and (when appropriate) certify the system
6.1.1 Available regulation
The regulator must make information about regulations available to the organization and possibly user
model.artifacts:8.1
model.plan:3.2.5
model.team:2.3.1.1, 2.3.1.2
6.1.2 Application
The project must apply to the regulator for certification and then follow the certification process
model.artifacts:8.2
model.plan:3.3.7
model.team:2.3.1.5
6.1.3 System auditability
The regulator must be able to audit that the system complies with regulations
model.artifacts:4.2, 4.4, 4.5, 8.4
model.plan:3.3.3.1
model.team:2.3.1.3
6.1.4 Process auditability
The regulator must be able to audit that the organization has followed required processes in building the system
model.artifacts:8.3
model.team:2.3.1.3
6.2 Monitoring
The regulator must be able to monitor the organization, project, and/or system for compliance with regulation
model.team:2.3.1.3
6.2.1 Accurate information available
The organization and/or user must make available to the regulator accurate and complete information about the system and organization behavior
model.artifacts:4.2, 4.4, 4.5, 8.3, 8.4
6.2.2 Notify regulator
The organization or user must proactively provide information to the regulator when a potential regulatory problem is detected, as required by regulation
model.plan:3.3.7.1
model.team:2.3.1.4
6.3 Problem resolution
The regulator must be able to work with the project and/or user to identify and resolve potential regulatory problems
6.3.1 Communicate with organization or user
The regulator must be able to communicate with the organization or user about potential regulatory problems
model.plan:7.1
model.team:2.3.1.3
6.3.2 Accurate information
The regulator must obtain cooperation and accurate information from the organization or user to investigate a potential regulatory problem
model.artifacts:4.2, 4.4, 4.5, 8.3, 8.4
6.3.3 Respond to remedy
The organization or user must be able to respond to a regulator’s remedy
model.plan:7.1
model.team:2.3.1.5

A.2.5.1 Compliance and certification

The regulator makes regulations (or standards), and makes them available to teams building affected systems.

The project responds to the regulations by designing and building the system so that it meets the regulations, maintaining records needed to show that the regulations have been met, and beginning a process for getting certifications or licenses when appropriate.

The project is responsible for:

A.2.5.2 Monitoring

In some cases, the regulator must monitor the project’s work—for example, during aircraft certification, which is generally a joint effort between the aviation authority and the company building the aircraft. A regulator might also need to monitor the project after a violation has been found and the team is working on remedial action.

Accurate and timely information is paramount when this occurs. The project must maintain good records to be able to provide that information to the regulators. The information potentially covers everything about the project: the design and analyses of the system, its implementation, records of design rationales, and logs of the processes followed.

The team must also be prepared to notify the regulator proactively as situations arise. The team should have people who will watch for situations and communicate with the regulator.

A.2.5.3 Problem resolution

I have never observed a licensing or certification process go entirely without problems. The processes and regulations are often complex, and unless a team has done the process several times before, there will almost certainly be things the team needs to learn to get through it.

This means that there will be problems to resolve. Sometimes the team will discover the problem and need guidance from the regulator. Other times the regulator will raise the issue.

The team can smooth the problem resolution process by:

A.3 Model elements

All of the objectives in the previous section map to objectives related to artifacts, team, tools, and plan. Some of them also map to things other than the system-building that goes on in the project.

This section lays out tables of the objectives for each element of the model. Each objective is annotated with its parents; that is, the objectives that are the reason that this objective is included. These are annotated in the tables with an arrow pointing down and right: ↘. If one of the objectives has children, those are annotated with a right-pointing arrow: →.

A.3.1 Artifacts

model.artifacts:1 Artifact management
1.1 Store artifacts
The project must have a place to store artifacts
model:2.1.2, 2.4, 2.5.4
model.artifacts:2.1, 2.2, 3.1, 4.2, 4.4, 4.5, 5.1, 5.2, 6.1, 6.2, 7.1, 7.2, 7.3, 7.4, 8.1, 8.2, 8.3, 8.4, 9.1
model.tools:2.1, 3.2.1, 3.4
1.1.1 Consistent versioning
The artifact storage must be able to maintain versions of all artifacts that are consistent with each other
1.2 Finding artifacts
Team members must be able to find artifacts that they are looking for
model.artifacts:2.1.1, 3.1.3, 4.2, 4.4, 5.2, 6.1, 8.1
1.2.1 Discovery
Team members must be able to learn about artifacts they need to know of that they didn’t previously know existed
1.3 Status
Anyone looking at an artifact must be able to determine the status of that artifact (work in progress, proposed, approved, complete, and its version)
model:3.1.3
model.artifacts:2.1, 2.2, 3.1, 4.2, 4.4, 4.5, 5.1, 5.2, 6.1, 6.2, 7.1, 7.2, 7.3, 7.4, 8.1, 8.2, 8.3, 8.4, 9.1
1.3.1 Support workflow
model.artifacts:2 Purpose-related
2.1 System purpose
The artifacts must include documentation of the customer’s purpose for the system
model:2.1.1, 2.1.2, 2.5.4, 4.2.1, 4.2.2
model.plan:3.3.1
model.team:2.1.1.1
model.artifacts:1.1, 1.3
2.1.1 Discoverable purpose
The artifacts that document the system’s purpose must be readily and accurately discoverable by members of the project team
model.artifacts:1.2
2.1.2 System requirements
The artifacts must include documentation of the customer’s reliability, safety, and security requirements for the system
model:2.1.3, 2.1.4
2.1.3 System usage
The documentation of the customer’s purpose must include accurate information about what the system will be used for
model:3.6.2
2.2 Changes in purpose
The artifacts must include records of requests made for changes to the system’s purpose, including the status of that request and any artifacts resulting from an approved change
model:2.5.3, 2.5.4
model.plan:5.1, 5.2, 5.3
model.team:2.1.1.2, 2.3.1.2
model.artifacts:1.1, 1.3
2.3 Reasons for building system
The artifacts must include documentation of why the team has chosen to build the system
model.plan:1.1
model.artifacts:3 Team-related
3.1 Team structure
The artifacts must include documentation of the structure of the team
model:3.4.4
model.plan:3.2.3, 6.3, 6.4
model.team:1.2
model.artifacts:1.1, 1.3
3.1.1 Team membership
The documentation of team structure must include accurate records of who is on the team
3.1.2 Roles and authority
The documentation of team structure must include accurate records of the roles and authority that each team member has
3.1.3 Navigability
Members of the team must be able to conveniently and accurately navigate the records of team structure
model.artifacts:1.2
model.artifacts:4 System-related
4.1 Technical uncertainty
The artifacts must include records about the uncertainties or risks identified for the system being built
model.plan:8.2
4.2 Specification and design artifacts
The artifacts must include accurate records of the specification and design of the system components and structure
model:2.1.2, 6.1.3, 6.2.1, 6.3.2
model.artifacts:1.1, 1.2, 1.3
4.2.1 Rationales
The design-related artifacts must include rationales for the design choices made
model:2.5.4
4.4 Implementation artifacts
The artifacts must include the implementation of the system
model:2.1.2, 6.1.3, 6.2.1, 6.3.2
model.artifacts:1.1, 1.2, 1.3
4.5 Analysis artifacts
The artifacts must include accurate analyses of the system component and structure design or implementation
model:2.1.2, 2.1.4, 2.1.5, 6.1.3, 6.2.1, 6.3.2
model.artifacts:1.1, 1.3
model.artifacts:5 Verification-related
5.1 Verification tests
The artifacts must include implementations of tests used for verifying the system implementation
model:2.1.2, 2.1.4
model.artifacts:5.2
model.artifacts:1.1, 1.3
5.2 Verification results
The artifacts must include accurate results of performing verification tests, reviews, and analyses
model:2.1.2, 2.1.4
model.artifacts:1.1, 1.2, 1.3, 5.1
model.artifacts:6 Release/deployment-related
6.1 Release/deployment procedures
The artifacts must include the procedures to be used to release or deploy the system
model:2.4
model.artifacts:1.1, 1.2, 1.3
6.2 Release/deployment records
The artifacts must include records of each release and deployment made of the system
model.plan:3.4
model.artifacts:1.1, 1.3
model.artifacts:7 Management-related
7.1 Budget records
The artifacts must include records tracking resource budgets
model.plan:4.1
model.artifacts:1.1, 1.3
7.2 Roadmap and plan
The artifacts must include the plan for completing the system
model.plan:1.2
model.artifacts:1.1, 1.3
7.3 Tasking
The artifacts must include records about the tasks currently being performed, or that are scheduled to be performed in the near future
model.plan:2.1, 2.2
model.artifacts:1.1, 1.3
7.4 Project uncertainty
The artifacts must include records about the uncertainties or risks identified for project execution
model.plan:8.1
model.artifacts:1.1, 1.3
model.artifacts:8 Regulatory-related
8.1 Regulations
The artifacts must include all the relevant regulations that the system must meet (or references to them)
model:2.3.1, 6.1.1
model.plan:5.2
model.team:2.3.1.2
model.artifacts:1.1, 1.2, 1.3
8.2 Certification process
The artifacts must include information on the certification process
model:2.3.2, 6.1.2
model.artifacts:1.1, 1.3
8.2.1 Application
The artifacts must include records of applications made for certification
8.2.2 Certifications
The artifacts must include records of certifications that have been granted or denied for the system
model.plan:3.3.7
8.3 Regulatory process records
The artifacts must include records of the process being followed to meet regulation or obtain certification
model:2.3.2, 6.1.4, 6.2.1, 6.3.2
model.artifacts:1.1, 1.3
8.4 Regulatory verification records
The artifacts must include records showing that the system has been verified against regulatory requirements
model:6.1.3, 6.2.1, 6.3.2
model.plan:3.3.3.1
model.artifacts:1.1, 1.3
model.artifacts:9 Audit
9.1 Approvals
The artifacts must include records of reviews and approvals of designs and implementations
model.artifacts:1.1, 1.3, 1.3.1
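
As one way to picture the storage, status, and versioning objectives near the top of this table (1.1, 1.1.1, and 1.3), the sketch below models an artifact record using the statuses named in objective 1.3. It is an illustration under assumed names, not a required implementation.

```python
from dataclasses import dataclass
from enum import Enum

class Status(Enum):
    """The artifact statuses named in objective 1.3."""
    WORK_IN_PROGRESS = "work in progress"
    PROPOSED = "proposed"
    APPROVED = "approved"
    COMPLETE = "complete"

@dataclass(frozen=True)
class ArtifactRecord:
    """A stored artifact with enough metadata to support objectives 1.1-1.3:
    it can be found by name, it carries a version, and its status is visible."""
    name: str
    version: str
    status: Status

# Consistent versioning (objective 1.1.1): a baseline is a set of artifact
# versions that are meant to go together.
baseline = {
    ArtifactRecord("system purpose", "0.3", Status.APPROVED),
    ArtifactRecord("design description", "0.3", Status.PROPOSED),
}
print(all(artifact.version == "0.3" for artifact in baseline))
```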

A.3.2 Team

model.team:1 Organization
1.1 General structure
Each team member must be able to find and understand the structure of the team
model:3.4.1
1.1.1 Finding team members
Each team member must be able to find essential information about other team members
model:3.4.4
1.1.2 Reporting
Each team member must be able to accurately determine the reporting structure of the team
model:3.4.4
1.2 Roles and responsibilities
Each team member must be able to accurately find and understand their roles and responsibilities
model:3.4.2
model.artifacts:3.1
1.2.1 Clear responsibilities
Each team member must be able to accurately find and determine the responsibilities on which their performance will be evaluated
model:3.4.3
model.team:2 Capabilities
2.1 Customer-related
2.1.1 Customer interface
The team must have people whose responsibility is to work with the customer
model:4.1.1
2.1.1.1 Learn and communicate the customer’s purpose
The team must have people whose responsibility is to work with the customer to identify the customer’s purpose and requirements for the system and to communicate that to the rest of the team
model:2.1.1, 2.1.3
model.plan:3.2.1
model.artifacts:2.1
2.1.1.2 Learn and process change requests
The team must have people whose responsibility is to receive requests for changes, document those requests, and drive the process to decide on and resolve the requests
model:2.5.1, 2.5.4
model.artifacts:2.2
2.1.1.3 Ability to negotiate
The people on the team who interface with the customer must be able to raise issues to the customer and negotiate resolutions of issues or conflicts
model:4.1.1.1
2.1.2 Internal communication
The team must have people whose responsibility includes regularly informing the rest of the team about the status of working with the customer
model:3.6.1
2.2 Technical capabilities
2.2.1 Ability to build system
The team must have the skills needed to build a system that meets the customer’s purpose
model:2.1.2, 2.1.4, 2.5.4
2.2.2 Ability to verify system
The team must have the skills needed to verify that the designed or implemented system meets the customer’s purpose, regulatory requirements, and other constraints
model:2.1.2, 2.1.4, 2.1.5, 2.5.4
2.2.3 Track technical uncertainty
The team must have people whose responsibility includes identifying risk or uncertainty related to the system being built, documenting those uncertainties, and ensuring that the uncertainties are resolved
model.plan:8.2
2.2.4 Ability to release/deploy system
The team must have the skills needed to release or deploy the system
model.plan:3.4
2.3 Regulator interface
2.3.1 Regulator interface
The team must have people whose responsibility is to work with regulator(s)
2.3.1.1 Identify regulation
The team must have people whose responsibility includes identifying relevant regulation or certification requirements and documenting them
model:6.1.1
model.plan:3.2.5
2.3.1.2 Detect and handle regulatory changes
The team must have people whose responsibility includes detecting that relevant regulations have changed, documenting those changes, and driving changes to plans to address the changes
model:2.5.2, 6.1.1
model.artifacts:2.2, 8.1
2.3.1.3 Handle regulator requests
The team must have people whose responsibility includes receiving and responding to requests for information from the regulator
model:6.1.3, 6.1.4, 6.2, 6.3.1
2.3.1.4 Handle regulator notification
The team must have people whose responsibility includes notifying the regulator at identified events that affect certification or regulatory compliance
model:6.2.2
2.3.1.5 Perform certification/approval process
The team must have people whose responsibility includes working with the regulator(s) to obtain certification or approval
model:6.1.2, 6.3.3
model.plan:3.3.7
2.3.2 Process oversight
The team must have people whose responsibility is to ensure that the project is following processes that will result in a system that meets regulations and/or obtain certification
model.plan:3.3.1.1, 3.3.2.1, 3.3.3.1
2.4 Management capabilities
2.4.1 Track plan
The team must have people whose responsibility includes building and maintaining the project plan
model.plan:1.2, 5.4.1
2.4.2 Detect and respond to problems
The team must have people whose responsibility includes detecting when the project may not meet deadlines and oversee the response
model.plan:1.2.4
2.4.3 Prioritize and assign tasks
The team must have people whose responsibility includes determining which tasks should be performed next according to some prioritization, and assigning those tasks to team members
model.plan:2.1, 2.2
2.4.4 Maintain team information
The team must have people whose responsibility includes maintaining information about the team
model.plan:6.2, 6.3, 6.4
2.4.5 Track staffing levels
The team must have people whose responsibility is to detect when staffing levels need to change, and then ensure that needed changes are made
model.plan:6.1
2.4.6 Track project uncertainty
The team must have people whose responsibility includes identifying risk or uncertainty related to project execution, documenting those uncertainties, and ensuring that the uncertainties are resolved
model.plan:8.1
2.5 Support capabilities
2.5.1 Maintain tools
The team must have people whose responsibility includes maintaining the tools used for building the system
model:2.1.2, 2.4, 2.5.4
model.team:4 Exceptions
4.1 Raise technical issues
Each team member must know how to raise technical issues when they find them
model:3.4.5
model.team:5 Other
5.1 Tracking
The team must be able to track time spent building the system
model.plan:4.1.1

A.3.3 Tools

model.tools:1 General
1.1 Automate simple tasks
Where possible, the project should use tools to automate simple or repetitive tasks
model:3.1.3
model.tools:2 Artifact management
2.1 Digital artifact storage
The tools must provide for storing and managing digital artifacts
model.artifacts:1.1
model.tools:3 Hardware support
3.1 Design tools
The project must have the design tools needed to design hardware components
model:2.1.2
3.2 Manufacturing
The project must have the tools needed to manufacture hardware parts
model:2.1.2, 3.3
model.tools:3.4
3.2.1 Stock storage
The tools must provide for maintaining stock materials used to build hardware components
model.artifacts:1.1
3.3 Hardware test
The project must have tools to perform verification tests on hardware components
model:3.3
model.plan:3.3.3, 3.3.5
3.4 Inventory management
The project must have the facilities and tools to maintain hardware parts inventory and track its content
model.artifacts:1.1
model.tools:3.2, 3.5
3.5 Hardware deployment
The project must have tools for delivering and distributing hardware components
model:2.4
model.tools:3.4
model.tools:4 Software support
4.1 Software build
The project must have tools to build software components
model:2.1.2, 3.3
4.2 Software test
The project must have tools to perform verification tests on software components
model:3.3
model.plan:3.3.3, 3.3.5
4.3 Software release
The project must have tools for making, maintaining, and distributing software releases
model:2.4
model.tools:5 Facilities
5.1 Team facilities
The project must have facilities in which the team can work to develop the system
model:3.3

A.3.4 Plan

model.plan:1 General roadmap
1.1 Overall objective
The roadmap must include a clear statement of the objective(s) of the system-building effort
model:3.1.1, 3.6.1
model.artifacts:2.3
model.plan:3.2.4
1.2 Plan to completion
The project must maintain a plan that shows an estimation of the time and effort required to complete the system
model:2.1.2, 2.2.3, 2.5.4, 3.1.1, 5.1.1
model.artifacts:7.2
model.team:2.4.1
1.2.1 Reflect uncertainty
The plan must accurately reflect the degree of uncertainty (or risk) in what is known about the steps needed to complete the system
model:2.2.3.1
model.plan:8.1, 8.2
1.2.2 Update plan
The plan must be updated as work is completed or uncertainty changes
model.plan:3.3.6, 5.4.1
1.2.3 Include resource estimate
The plan must include estimates of the time and resources required to complete steps in the plan
model:3.2.1, 3.2.2
1.2.4 Detect and handle deadline problems
The project must include processes and milestones that will detect when project deadlines will not be met, determine how to respond, and ensure the response is executed
model:2.2.4
model.team:2.4.2
1.2.5 Only project-relevant work in plan
The project must include only work that is relevant to building the system, or managing and supporting that development, in the plan
model:2.2, 3.1.3, 4.3
1.2.6 Efficient execution
The project must organize and plan the work in an efficient way, minimizing time to completion and/or cost without sacrificing quality, customer purpose, or requirements
model:4.3
model.plan:2.2, 8.1, 8.2
1.2.7 Navigability
The project plan information must be accessible to team members in a way that allows them to understand how the work will proceed and how it will affect their assignments
model:3.4.2
model.plan:2 Sequencing and prioritization
2.1 Track current effort
The project must include processes to track what tasks are currently being worked on or have been assigned to people to be worked on in the near future
model:2.1.2, 2.5.4
model.artifacts:7.3
model.team:2.4.3
2.2 Assign next tasks
The project must include processes to determine what tasks should be worked on in the near future, and by whom
model:2.1.2, 2.5.4
model.plan:1.2.6
model.artifacts:7.3
model.team:2.4.3
2.3 Detect and handle tasking problems
The project must include processes and milestones to detect when there are problems performing one or more tasks, determine how to respond, and ensure the response is executed
model:2.1.2, 2.5.4
2.4 Navigability
The scheduling information must be accessible to team members in a way that allows them to determine what work they should be doing, and who is doing work related to theirs
model:3.4.2, 3.4.5
model.plan:6.1
model.plan:3 Process and life cycle
3.1 Defined life cycle
The project must have a defined life cycle that defines the processes people must follow to build and deploy the system
model.plan:3.2, 3.3
3.2 Project startup
The project life cycle must include the processes involved in starting the project
model.plan:3.1
3.2.1 Learn and verify customer purpose
The project life cycle must include an initial step to learn the customer’s purpose for the system and ensure that the team properly understands that purpose
model:2.1.1, 2.1.3, 4.2.2
model.team:2.1.1.1
3.2.2 Learn and verify available resources
The project life cycle must include an initial step to determine the resources initially available for building the system
model:2.2.1
model.plan:4.1
3.2.3 Establish organization structure
The project life cycle must include an initial step to decide on and document the structure of the team that will be working on the project
model:3.4.2, 3.4.4
model.artifacts:3.1
3.2.4 Establish reasons for building the system
The project life cycle must include an initial step to determine whether the team should build the system, and if so, why
model.plan:1.1
3.2.5 Establish regulatory constraints
The project life cycle must include an initial step to determine what regulations apply to the system, including the potential need for certification
model:2.3.1, 6.1.1
model.team:2.3.1.1
3.2.6 Establish organization expectations
The project life cycle must include an initial step to determine what expectations the organization has of the project, including definition of constraints on the project and system
model:4.3
model.external:3.2
3.3 System building
The project life cycle must include the processes involved in building the system
model:2.1.2, 2.1.4
model.plan:5.4
model.plan:3.1
3.3.1 Design to purpose
The project life cycle must provide processes and milestones that ensure that the system design accurately reflects the customer’s purpose for the system
model.artifacts:2.1
3.3.1.1 Design to meet regulation
The project life cycle must provide processes and milestones that ensure the system design meets regulatory requirements
model:2.3.2
model.team:2.3.2
3.3.1.2 Design for release/deployment
The project life cycle must provide processes and milestones that ensure that the resulting system can be released or deployed
model.plan:3.4
3.3.2 Build to purpose
The project life cycle must provide processes and milestones that ensure that the built system accurately reflects the customer’s purpose for the system
model:2.1.2
3.3.2.1 Build to meet regulation
The project life cycle must provide processes and milestones that ensure the built system meets regulatory requirements
model:2.3.2
model.team:2.3.2
3.3.2.2 Build for release/deployment
The project life cycle must provide processes and milestones that ensure that the built system can be released or deployed
model.plan:3.4
3.3.3 Verify against purpose
The project life cycle must provide processes and milestones that verify that the system meets the customer’s purpose
model.tools:3.3, 4.2
3.3.3.1 Verify against regulation
The project life cycle must provide processes and milestones that verify that the system meets regulation
model:2.3.2, 6.1.3
model.artifacts:8.4
model.team:2.3.2
3.3.4 Verify no extraneous behavior
The project life cycle must provide processes and milestones that verify that the system does not include functions or behavior that is outside the customer’s purpose
3.3.5 Identifying and fixing errors
The project life cycle must provide processes and milestones that ensure errors will be detected with high likelihood, and that detected errors will be fixed
model:2.1.4, 2.1.5, 3.4.5
model.tools:3.3, 4.2
3.3.6 Adaptation
The project life cycle must provide processes and milestones to adapt the plans and design as the team learns about the system or as uncertainties are resolved
model:2.5.4
model.plan:1.2.2
3.3.7 Perform certification/approval
The project life cycle must provide processes to result in certification or approval, if required for the system
model:2.3.2, 6.1.2
model.artifacts:8.2.2
model.team:2.3.1.5
3.3.7.1 Regulatory notification
The project life cycle must define events at which the project must notify regulators, and processes by which the notification information is gathered and delivered
model:6.2.2
3.4 System release and deployment
The project life cycle must include the processes involved in releasing and deploying the system
model:2.4
model.artifacts:6.2
model.plan:3.3.1.2, 3.3.2.2
model.team:2.2.4
model.plan:4 Budgets
4.1 Resources
The budgets must include the amount of various resources allocated to the project
model:2.2, 2.2.1, 5.1.1
model.artifacts:7.1
model.plan:3.2.2
4.1.1 Amount used
The budget must accurately record the amount of resources already used in the project
model:2.2.2
model.team:5.1
4.1.2 Amount remaining
The budget must provide accurate measures of how much resource remains available
4.2 Deadlines
The budgets must include milestones or other deadlines that the project must meet
model:2.2, 5.1.1
model.plan:5 Change handling
5.1 Receive change request
The project must have a process for receiving and documenting a request for change in purpose or features
model:2.5.1
model.external:4.3.1
model.plan:7.1
model.artifacts:2.2
5.2 Receive regulatory change
The project must have a process for detecting and receiving changes in regulatory requirements
model:2.5.2
model.artifacts:2.2, 8.1
5.3 Decision process
The project must have a process for determining whether to proceed with a change request or not
model:2.5.1
model.external:4.3.1
model.artifacts:2.2
5.4 System change
The project life cycle must provide processes and milestones for building changes to the system
model.external:4.3.1
model.plan:7.1
model.plan:3.3
5.4.1 Plan change
The project life cycle must provide processes for updating plans when change requests are accepted for building
model.plan:1.2.2
model.team:2.4.1
model.plan:6 Team-related
6.1 Determine need for staffing changes
The project life cycle must provide processes and milestones for detecting when a change in staffing is appropriate, and for following through on the needed changes
model:3.2.1, 3.2.2
model.external:1.3
model.plan:2.4
model.team:2.4.5
6.2 Maintain team information
The project life cycle must provide processes and milestones for maintaining information about the structure, roles, responsibilities, and authority in the team
model:3.4.2, 3.4.4
model.team:2.4.4
6.3 Adding team members
The project life cycle must provide a process for adding new team members to the processes and tools, and educating them about the project
model:3.2.1, 3.4.2, 3.4.4
model.external:1.3.1
model.artifacts:3.1
model.team:2.4.4
6.4 Removing team members
The project life cycle must provide a process and definitions of triggering events for removing a member from the team
model:3.2.2
model.external:1.3.2, 1.4
model.artifacts:3.1
model.team:2.4.4
model.plan:7 Regulatory-related
7.1 Receive and process regulatory issues
The project life cycle must provide a process by which the project can receive notification of issues from the regulator, identify remedies, implement the remedies, and respond to the regulator
model:6.3.1, 6.3.3
model.plan:5.1, 5.4
model.plan:8 Risk and uncertainty
8.1 Project uncertainty
The project life cycle must track and manage uncertainties or risks related to project execution
model.plan:1.2.1, 1.2.6
model.artifacts:7.4
model.team:2.4.6
8.2 Technical uncertainty
The project life cycle must track and manage uncertainties or risks related to the system being built
model.plan:1.2.1, 1.2.6
model.artifacts:4.1
model.team:2.2.3
8.2.1 Efficient burn-down
The project life cycle must provide a process and milestones that lead to efficiently reducing technical uncertainty as the project progresses

A.3.5 External responsibilities

model.external:1 Team-related
1.1 Satisfaction
Project and organization management must take steps to give team members the information they need to be confident that the project is on track and that their work will challenge them
model:3.1.1, 3.1.2
1.2 Appropriate assignments
Project management must take steps to match team members with tasks that are needed and that challenge the team member
model:3.1.2, 3.1.3
1.3 Appropriate staffing level
Project management must manage the makeup of the team to ensure that the project has the right people to do the work, including having processes to hire, contract, or let go of staff
model:3.2.1, 3.2.2, 4.1.2.2, 4.1.2.3
model.plan:6.1
1.3.1 Hiring
Project management and the hosting organization must be able to hire or move in staff when needed for the project
model.plan:6.3
1.3.2 Firing or transfer out
Project management and the hosting organization must be able to move out or let go of staff who are no longer needed for the project
model.plan:6.4
1.4 Respond to team issues
Project management must respond appropriately to issues raised by team members, both technical issues and non-technical issues
model:3.4.5, 4.1.2.1
model.plan:6.4
1.5 Fit to organization
The organization must provide the team with an understanding of how the project and the team members fit into the organization
model:3.4.1, 4.1.2.1
1.6 Compensation
Project and organization management are responsible for setting compensation levels for team members at a level that is acceptable to both the staff and the project/organization
model:3.5
1.7 Evaluation
Project and organization management are responsible for setting and documenting the evaluation process that will be used to judge each team member’s performance
model:3.4.3, 4.1.2.1
1.8 Ethics policy
Project and organization management are responsible for setting and documenting an ethics policy that applies to all team and organization activities, along with mechanisms for reporting and resolving potential ethics violations
model:3.6.2, 4.1.2.1
1.9 Workplace regulation
Organization management must provide a workplace that meets regulation
model:4.1.2.4
model.external:2 Customer-related
2.1 Ability to sell
The organization must have the ability to sell the system produced (when appropriate)
model:4.2
2.2 Ability to define market(s)
The organization must have the ability to define plausible market(s) for the system
model:4.2.2
2.3 Sales and marketing capability
The organization must have a capability to perform sales and marketing of the system
model:4.2.3
model.external:3 Resource-related
3.1 Sufficient resources
The organization and project management are responsible for obtaining funding and other resources sufficient to perform the project
model:4.1.2.2
3.2 Define expected return
The organization must define the expected return on investment for the project to build the system
model:4.3
model.plan:3.2.6
model.external:4 Organization-related
4.1 Reusable capability
The organization must take steps to build up reusable capabilities that can apply to multiple projects
model:4.4.2
4.2 Ongoing improvement
The organization must have the capability to learn and improve its capabilities over time
model:4.4.3
4.3 Communication with funder
The organization must communicate with the funder in a way that gives the funder visibility into the organization’s behavior and progress on the project
model:5.1.1
4.3.1 Receive and process funder concerns
The organization must have the capability to receive concerns from the funder, discuss the concerns, and take steps to address the concerns
model:5.1.2
model.plan:5.1, 5.3, 5.4

Appendix B: The Heilmeier questions

22 July 2024

The Heilmeier questions, also known as the Heilmeier Catechism, were developed by George Heilmeier, an early director of the US Defense Advanced Research Projects Agency. These questions, adapted from the DARPA web site [Heilmeier24], are:

1. What are you trying to do? Articulate your objectives using absolutely no jargon.
2. How is it done today, and what are the limits of current practice?
3. What is new in your approach, and why do you think it will be successful?
4. Who cares? If you are successful, what difference will it make?
5. What are the risks?
6. How much will it cost?
7. How long will it take?
8. What are the mid-term and final “exams” to check for success?

Some variants add additional questions:

Further reading and inspiration

The content in this book has been inspired by many others. The following are a few of the key sources I have used; they expand on the ideas discussed here.

Works of Christopher Alexander. Christopher Alexander was an architect who advocated for human-centered design of buildings and cities. His approaches to designing and building buildings and towns influenced more general systems development practices, including the ideas of pattern languages and agile development.

There are three Alexander books that have influenced my thinking. In these works, he addresses questions of how structure and pattern apply to complex systems—in particular, how the relationships between things are a necessary part of understanding how a system works. (The system organization in Chapter 12 is informed by his ideas.)

Engineering a Safer World. Nancy Leveson and colleagues have developed a set of methodologies for designing and understanding safe and secure systems. The book Engineering a safer world [Leveson11] makes the case that safety and security must be treated using a systems approach, and then presents a causality model (STAMP), a safety and security design analysis technique (STPA), and an incident analysis technique (CAST).

This work affected how I think about designing safe and secure systems. Before encountering it, I had cobbled together a set of techniques to support designing secure systems, based on collaboration with a number of people on a series of DARPA projects. I used basic fault tolerance techniques. I worked with other safety standards, notably ISO 26262 [ISO26262], and found that the methods in those standards were missing many of the hazards I was finding. Leveson’s book provided me with a way to articulate the overall systems aspect of safety and security work, as well as better tools for the job.

Scaling People. Claire Hughes Johnson has translated her experience building teams and companies into a book that does for team structure and operations what I am trying to achieve for systems building. Her book, Scaling People [Johnson22], sets out basic models for how to build an organization and grow it. The book begins with a few key behavioral principles that apply to engineering work just as well as to business operations. The book proceeds to develop a model for operations, organized as four “core frameworks”.

Two specific ideas from the first core framework resonate with the engineering work I set out in this book. First, she stresses the importance of writing down founding documents: a record of what an enterprise is for, its goals, philosophy, mission, and principles. Second, she makes the case for defining an “operating system” for the organization. It documents “a set of norms and actions that are shared with everyone in the company”; it defines the basic structures and processes that the company follows. The book is notable for not just presenting these ideas, but making the case for why each idea is important based on her experience and on the experience of others in the industry. It also provides examples of the documents that companies have actually used—so that it’s clear how to put the ideas into practice.

This book has helped me articulate ideas about team organization, communication, and the value of documenting processes.

Drucker. XXX

Acknowledgments

Bibliography

[14CFR23] “Part 23—Airworthiness standards: normal category airplanes”, in Title 14, Code of Federal Regulations, United States Government, December 2023, https://www.ecfr.gov/current/title-14/chapter-I/subchapter-C/part-23, accessed 14 February 2024.
[ARP4754] “Guidelines for Development of Civil Aircraft and Systems”, SAE International, Standard 4754 rev. A, December 2010.
[Agile] Wikipedia contributors, “Agile software development”, in Wikipedia, the Free Encyclopedia, https://en.wikipedia.org/w/index.php?title=Agile_software_development&oldid=1198512141, accessed 14 February 2024.
[Alexander02] Christopher Alexander, The Nature of Order, Berkeley, California: The Center for Environmental Structure, 2002.
[Alexander15] Christopher Alexander, A City is not a Tree, Portland, Oregon: Sustasis Press, 2015.
[Alexander77] Christopher Alexander, A Pattern Language, Oxford University Press, 1977.
[Bain99] David Haward Bain, Empire Express: Building the First Transcontinental Railroad, New York: Viking, 1999.
[Berger24] Eric Berger, “The surprise is not that Boeing lost commercial crew but that it finished at all”, Ars Technica, 6 May 2024, https://arstechnica.com/space/2024/05/the-surprise-is-not-that-boeing-lost-commercial-crew-but-that-it-finished-at-all/, accessed 28 May 2024.
[Bezos16] Jeffrey P. Bezos, “2015 Letter to Shareholders”, Amazon.com, Inc., 2016, https://s2.q4cdn.com/299287126/files/doc_financials/annual/2015-Letter-to-Shareholders.PDF, accessed 22 February 2024.
[Brand94] Stewart Brand, How Buildings Learn: What Happens After They’re Built, New York: Penguin Books, 1994.
[CMMI] ISACA, “What is CMMI?”, https://cmmiinstitute.com/cmmi/intro, accessed 24 March 2024.
[Collins74] Michael Collins, Carrying the Fire: An Astronaut’s Journeys, New York: Farrar, Straus and Giroux, 1974.
[Conway68] Melvin E. Conway, “How do committees invent?”, Datamation, vol. 14, no. 4, April 1968, pp. 28–31, http://www.melconway.com/Home/pdf/committees.pdf.
[DOD10] DoD Deputy Chief Information Officer, “DoD Architecture Framework Version 2.02”, Department of Defense, United States Government, August 2010, https://dodcio.defense.gov/Library/DoD-Architecture-Framework/.
[Durkheim33] Emile Durkheim, The Division of Labor in Society, George Simpson, translator, Glencoe, Illinois: The Free Press, 1933.
[Fowler05] Martin Fowler, “Code as Documentation”, 22 March 2005, https://martinfowler.com/bliki/CodeAsDocumentation.html, accessed 15 March 2024.
[Garmin13] G3X Installation Manual, Garmin, 190-01115-01, Revision K, July 2013.
[Hardin20] Russell Hardin, and Garrett Cullity, “The Free Rider Problem”, in The Stanford Encyclopedia of Philosophy, Edward N. Zalta, editor, Metaphysics Research Lab, Stanford University, 2020, https://plato.stanford.edu/archives/win2020/entries/free-rider/, accessed 28 March 2024.
[Heilmeier24] George H. Heilmeier, “The Heilmeier Catechism”, in DARPA, https://www.darpa.mil/work-with-us/heilmeier-catechism, accessed 13 July 2024.
[Horney17] David Craig Horney, Systems-theoretic process analysis and safety-guided design of military systems, M.S. thesis, Department of Aeronautics and Astronautics, Massachusetts Institute of Technology, June 2017, https://dspace.mit.edu/handle/1721.1/112424.
[ISO26262] “Road vehicles — Functional safety”, Geneva, Switzerland: International Organization for Standardization, Standard ISO 26262:2018, 2018.
[ISO42010] “Systems and software engineering—Architecture description”, Geneva, Switzerland: International Organization for Standardization, Standard ISO/IEC/IEEE 42010, December 2011, http://www.iso-architecture.org/42010/index.html.
[J1939] “On-Highway Equipment Control and Communication Network”, SAE International, Standard J1939/1_202109, September 2021.
[JO711065] “Air Traffic Control”, Federal Aviation Administration, Department of Transportation, United States Government, Order JO 7110.65AA, April 2023, https://www.faa.gov/regulations_policies/orders_notices/index.cfm/go/document.information/documentid/1029467.
[JPL00] JPL Special Review Board, “Report on the Loss of the Mars Polar Lander and Deep Space 2 Missions”, Jet Propulsion Laboratory, Report JPL D-18709, March 2000, https://smd-cms.nasa.gov/wp-content/uploads/2023/07/3338_mpl_report_1.pdf.
[Jacobson88] Van Jacobson, “Congestion Avoidance and Control”, Proc. SIGCOMM, vol. 18, no. 4, August 1988.
[Johnson22] Claire Hughes Johnson, Scaling People: Tactics for Management and Company Building, South San Francisco, California: Stripe Press, 2022.
[Kalra16] Nidhi Kalra, and Susan M. Paddock, “Driving to safety: How many miles of driving would it take to demonstrate autonomous vehicle reliability?”, Santa Monica, CA: RAND Corporation, Report RR-1478-RC, 2016, https://www.rand.org/pubs/research_reports/RR1478.html.
[Klein14] Gerwin Klein, June Andronick, Kevin Elphinstone, Toby Murray, Thomas Sewell, Rafal Kolanski, and Gernot Heiser, “Comprehensive formal verification of an OS microkernel”, ACM Transactions on Computer Systems, vol. 32, no. 1, February 2014, https://trustworthy.systems/publications/nicta_full_text/7371.pdf.
[Kruger00] Justin Kruger, and David Dunning, “Unskilled and unaware of it: how difficulties in recognizing one’s own incompetence lead to inflated self-assessments”, Journal of Personality and Social Psychology, vol. 77, no. 6, January 2000, pp. 1121–1134.
[Leveson00] Nancy G. Leveson, “Intent specifications: an approach to building human-centered specifications”, IEEE Transactions on Software Engineering, vol. 26, no. 1, January 2000, http://sunnyday.mit.edu/papers/intent-tse.pdf.
[Leveson11] Nancy G. Leveson, Engineering a safer world: systems thinking applied to safety, Engineering Systems, Cambridge, Massachusetts: MIT Press, 2011.
[Lynch89] Nancy A. Lynch, and Mark R. Tuttle, “An introduction to input/output automata”, Cambridge, Massachusetts: Massachusetts Institute of Technology, Technical memo MIT/LCS/TM-373, 1989, https://www.markrtuttle.com/data/papers/lt89-cwi.pdf.
[NASA16] “NASA Systems Engineering Handbook”, National Aeronautics and Space Administration (NASA), Report NASA SP-2016-6105 Rev2, 2016.
[NASA19] “Debris Assessment Software User’s Guide, Version 3.0”, National Aeronautics and Space Administration (NASA), Report NASA TP-2019-220300, 2019.
[NPR7120] “NASA Space Flight Program and Project Management Requirements”, National Aeronautics and Space Administration (NASA), NASA Procedural Requirement NPR 7120.5F, 2021.
[NPR7123] “NASA Systems Engineering Processes and Requirements”, National Aeronautics and Space Administration (NASA), NASA Procedural Requirement NPR 7123.1D, 2023.
[OConnor21] Timothy O’Connor, “Emergent Properties”, in The Stanford Encyclopedia of Philosophy, Edward N. Zalta, editor, Metaphysics Research Lab, Stanford University, 2021, https://plato.stanford.edu/archives/win2021/entries/properties-emergent/, accessed 13 February 2024.
[OMG17] “Unified Modeling Language”, Object Management Group, Standard version 2.5.1, December 2017, https://www.omg.org/spec/UML/2.5.1/PDF.
[Olson65] Mancur Olson, The Logic of Collective Action: Public Goods and the Theory of Groups, Harvard Economic Studies, Cambridge, Massachusetts: Harvard University Press, 1965.
[Parnas72] D. L. Parnas, “On the criteria to be used in decomposing systems into modules”, Communications of the ACM, vol. 15, no. 12, December 1972, pp. 1053–1058.
[Smith22] Adam Smith, An Inquiry into the Nature and Causes of the Wealth of Nations, Edwin Cannan, editor, Third ed., London: Methuen and Co., 1922.
[Spiral] Wikipedia contributors, “Spiral model”, in Wikipedia, the Free Encyclopedia, https://en.wikipedia.org/w/index.php?title=Spiral_model&oldid=1068244887, accessed 14 February 2024.
[TTSB21] “Major Transportation Occurrence—Final Report, China Airlines Flight CI202”, Taiwan Transportation Safety Board, Report TTSB-AOR-21-09-001, September 2021, https://www.ttsb.gov.tw/media/4936/ci-202-final-report_english.pdf.
[Thompson99] Adrian Thompson, and Paul Layzell, “Analysis of unconventional evolved electronics”, Communications of the ACM, vol. 42, no. 4, April 1999, pp. 71–79.
[Tolstoy23] Leo Tolstoy, Anna Karenina, Constance Garnett, translator, Project Gutenberg, 2023, https://www.gutenberg.org/ebooks/1399.
[Tuckman65] Bruce W. Tuckman, “Developmental sequence in small groups”, Psychological Bulletin, vol. 63, no. 6, 1965, pp. 384–399.
[Tuckman77] Bruce W. Tuckman, and Mary Ann C. Jensen, “Stages of small-group development revisited”, Group and Organization Studies, vol. 2, no. 4, 1977, pp. 419–427.
[Wilson05] Simon P. Wilson, and Mark John Costello, “Predicting future discoveries of European marine species by using a non-homogeneous renewal process”, Journal of the Royal Statistical Society Series C: Applied Statistics, vol. 54, no. 5, November 2005, pp. 897–918, https://academic.oup.com/jrsssc/article/54/5/897/7113002.
[Zhang90] Lixia Zhang, and David D. Clark, “Oscillating behavior of network traffic: a case study simulation”, Internetworking: Research and Experience, vol. 1, 1990, pp. 101–112.