Making systems

Volume 1: Fundamentals
Richard Golding

Copyright ©2024 by Richard Golding

Release: 0.3-review

Table of contents

Chapter 1: Introduction

9 May 2024

This book began as many presentations and short documents that I put together for different projects over the years. Those presentations covered topics from basic requirements management to good distributed system design to how to plan and operate a project that was regularly in flux. A few of the documents were retrospectives about why a project had run into trouble or failed. Others were written in an attempt to head off a problem that I could see coming.

I have worked on many projects. Most of these have been about building a complex system, or one that required high assurance—a system where safety or security is critical to its correct operation. Some have gone well, but all have had problems. Sometimes those problems led to the project failing. More often they have cost the project time and money, or resulted in a system that was not as good as it should have been. In every case the problems have caused unnecessary effort and pain for the team working on the project.

This raised the question: what could be learned from these projects? How can future projects go better?

I began to sense that there were some common threads among all the education and advice I was putting together for these teams, and the problems they were having. With the help of some colleagues who were working on their own challenging projects, I began to sort through these impressions in order to articulate them clearly and gather them in one place.

I have found that many of these problems have come at the intersection of systems engineering, project management, and project leadership. Building a complex system effectively requires all three of these disciplines working together. Most of the problems I have seen have arisen from a breakdown in one or more of them: where there is capable project management, for example, but poor systems engineering, or vice versa.

The intersection is about how each of these disciplines contributes to the work of building a system. The intersection is where people maintain a holistic view of the project. It is where technical decisions about system structure interact with work planning; where project leadership sets the norms for how engineers communicate and check each other’s work. It is where competing concerns like cost versus rigor get negotiated. And it is where people take a long view of the work, addressing how to prepare for the work a year or more in the future.

I’ve worked with many people who were good at one of these disciplines, but didn’t understand how their part fit together with others to create a team that could build something complex while staying happy and efficient. I have worked with well-trained systems engineers who knew the tools of their craft, but did not know how or, more importantly, why to use them and how they fit together. I have worked with project managers who had experience with scheduling and risk management and other tools of their craft, but lacked the basic understanding of what was involved in the systems part of the work they were managing. I have also worked with engineers and managers who were tasked with assembling a team, but did not understand what it means to lead a team so that it becomes well-structured and effective.

In other words, they were all good at their individual disciplines but they lacked the basic understanding of how their discipline affects work in other disciplines, and how to work with people in the other disciplines to achieve what they set out to do.

And that brings me to the basic theme of this book: that making systems is a systems problem, an integration problem. The system that is being built will be made of many pieces that, in the end, will have to work well together. This requires at least some people having a holistic, systemic view of the thing being built. The team that builds the system is itself a system, and its parts—its people, roles, and disciplines—need to work together. The team is something to be engineered and managed, and it needs people who maintain a holistic view of how its parts work together.

This book is not a book on systems engineering or project management per se. Rather, it provides an overarching structure that organizes how the systems engineering, project management, and leadership disciplines contribute to systems-building. While I reference material from these disciplines as needed, do not expect (for example) to learn the details of safety analyses here. I do discuss how those analyses fit with the other work needed for building a system, and provide some references to works by people who have specialized in those topics.

This book is for people who are building complex systems, or are learning how to do so. I provide a structure to help think about the problems of building systems, along with ways to evaluate the different approaches one can take to solving those problems on a specific project. I relate experience and advice where I have some.

Using this book. This book covers two topics: the system being built, and how to go about building that system. These topics are intertwined, because the point of going to the effort of building a system is to end up with a well-functioning one.

The first two parts of this book are meant for everyone, and to be read first. They provide a general foundation for talking about making systems: a simplified but holistic view of the subject. Part I presents a short set of case studies to motivate what I’m talking about. Part II presents models for thinking about systems and the making of systems at a high level, along with recommended principles for both.

The two parts that follow provide more detailed discussions of what systems are (Part III) and what systems-building is (Part IV). These parts expand on the material in Part II, providing more structure for talking about each of their subjects. These parts are meant to be read after the foundational parts, but need not be read in order or all at once.

The remaining parts go into depth on concepts and tools that help with building complex systems. These include topics like project life cycles, system design, team organization, and planning. These parts use the language built up in the first parts. The later chapters are meant to be read as needed: when you find you need to know about a topic, dip into relevant chapters.

Chapter 2: A note of caution

15 August 2024

This work aims to help people understand how to do a better job of building complex systems. The strategy I use is to gather together in one place many things that people have already learned, but not necessarily understood as connected.

This strategy has good company. Many people over the years have worked to improve engineering and management practices. Many of those works have led to improved project performance and better systems—and every one of them I know about has had a downside. This work will be no exception.

I imagine the reader as I am writing this work, as if we are having a conversation. But in truth this is not a conversation; whoever reads this cannot ask clarifying questions, and I cannot respond with better explanations where the writing is unclear. This leaves me to wonder: will the reader read and understand what I meant to say?

Everything I am writing is based on my own experience, whether it comes directly from projects I have worked on or from what I have learned from others. This raises, as for any work, questions about biases in viewpoint and correctness. I have tried to question my viewpoints by checking them with others and comparing my conclusions to their experience, but that will always be an imperfect approach to the truth. So there are two other questions: is what I present here correct? And will it apply to new situations that I do not now imagine?

There is yet a third worry in writing a work like this. Will it come to be treated as the truth, unquestioned? Will someone treat it as dogma?

A work like this, which provides practical guides for doing complex projects, can prod people who already have some experience into new thinking about how they do their work, adding new perspectives or providing overviews that help them think about what they are already doing. It adds to what they already know, but does not serve as the only guide for how they work.

In the longer term, though, every work like this that I know about has come to be taken as a One True Approach, where decisions are justified by saying “that’s how it is written”—without the people involved actually understanding why that approach says what it says, and not thinking through how the guidance applies to the actual work they have in front of them.

There are two examples that illustrate this behavior.

NASA has an extensive set of processes and procedures, with extensive documentation. The NASA Systems Engineering Handbook [NASA16] is an accessible way to start exploring that process. The processes have evolved over several decades of experience in what leads space flight missions to fail or succeed, and the procedural requirements are full of small details. People will do well to understand and generally follow those procedures for similar projects.

However, this has led too many people within NASA and the associated space industry to follow these requirements blindly. A NASA project must go through a sequence of reviews and obtain corresponding approvals to continue (and get funding). I have watched projects treat those reviews as pro forma: they need a requirements review, so they arrange a requirements review, but nobody actually builds a useful (or even consistent) set of requirements to be reviewed. A preliminary design review has a checklist of criteria to be met, so presentation slides are prepared for each point, but the reasons behind those criteria aren’t really addressed. And a little while later the project is canceled because it isn’t making good engineering progress.

Similarly, the Unified Modeling Language (UML) [OMG17] brought together the experience of many software and system engineering practitioners to create a common notation for diagramming to describe complex systems. The notations provided a common language teams could use to express the structure and behavior of systems, so that everyone on a project could understand what a diagram meant. A common notation allowed organizations to build tools to generate and analyze these drawings. While not everyone follows the notation standard exactly, the standard has improved the ability of many engineers to document and understand systems. I certainly use many elements of UML diagramming regularly.

The UML has had a corresponding downside. Engineers who first learned how to think about systems using UML have had trouble thinking in other ways. In particular, there are some aspects of system specification and design that are not fundamentally graphical—but some people who have grown up on UML find themselves uncomfortable working with tabular or record-structured information in databases (such as for requirements). I worked with one engineer who was using the SysML dialect of UML, which does not include all of the diagram types in the main UML language. He needed a kind of diagram that is in UML but not in SysML, yet was shocked at the suggestion that he should simply use the UML diagram anyway, because it “wasn’t part of SysML”.

Generalizing from these, the problem is that when something provides a general, comprehensive guide about how to do complex work, this thing can come to be treated as the only answer. People who only learn from that one source can end up with a constricted understanding of how to do the work.

And so I hope that those who read this work will not take what is written here as the only word on the subject. A guide is not a substitute for learning and thought. I hope that readers will take what I have gathered here as inspiration to think about the work they will encounter in making complex systems, drawing on their own experience and the experience of others around them as well as what has been written. I hope that the points in this book help people keep in mind why something is being done, so that they can address the spirit of the need in addition to the rules of any particular procedure or methodology.

Finally, this last point has caused some reviewers to raise a concern about this kind of caution. Because this book insists that it does not have the complete answers to anything, and that people will have to think for themselves to do good systems work, it leaves the door open for anyone to justify to themselves any choice they care to make.

This is a valid concern. I have worked with lots of people who have made bad management and engineering decisions, and who were convinced they were right. In truth everyone makes poor decisions, and everyone has limited perspective. Every single project and every single person involved in building systems will face difficult decisions and will make some of them poorly.

In the end, I have decided that this is not something one can address with a book. I do not know who will read this work in future; I cannot know or address the specific problems they will have. I cannot have a conversation with each of you to try to sort through the actual problems and decisions you encounter. All I can do is provide you with one perspective, and hope that you will add it to your own in useful ways.

Part I: System stories

A set of case studies illustrating what can go right and wrong in a project to build a system.

Chapter 3: Making a simple system

30 April 2024

This book is about both what a well-built system is and how to make that happen. To begin, I’ll start with a simple story: building a small cottage model out of Lego™ bricks.

This story is made up, but it reflects some of the situations I have found in real projects I have worked on. It deliberately illustrates problems in a simplified and perhaps exaggerated way to make them clear. The simplifications include: a very small team, and one that doesn’t need to grow during the project; customer “needs” that are simple; and a project that does not need to consider real emergent properties like safety, security, or even mechanical strength.

3.1 The request

A customer wants a small cottage model, built out of Lego™ bricks. They would like the cottage to be white. They would like it to have a window. They have a base plate they would like it to fit on.

Someone on the team works with the customer to get this information and understand the needs. This results in a sketch, which the customer agrees reflects what they have asked for (Figure 3.1).

Figure 3.1: Sketch of customer needs

3.2 Building the cottage

The project gets its team together, and they begin discussing how to design and build the cottage. Based on the sketch concept, they decide to split the work: one person for each of four walls, and one person for the roof.

The team discusses some basic design parameters. They decide on the length and height of each of the walls, based on the size of the base plate and the rough ratio of the sides in the sketch. They also decide which wall will get the window.

Each person on the team then begins designing and building their part, based on the sizes they have agreed on. The result is a set of five assemblies (Figure 3.2).

Figure 3.2: Initial components

Right away there are some visible problems.

The team then try to integrate the assemblies together to make the cottage. The result is not good (Figure 3.3).

Figure 3.3: Initial integration, with problems

There are integration problems.

At this point, the team addresses some of these issues. They add roof supports to the front and back walls, and redesign all the walls to interlock at the corners.

The result is a structure that integrates all the components (Figure 3.4).

Figure 3.4: Initial integrated cottage

There are still problems with the integrated cottage.

The problems with the side wall come from one of the team members rushing to rebuild that wall after they were reminded that the cottage was to be all white and not have red stripes.

The missing door is a specification problem that came to light when the customer saw the completed cottage. The original sketch developed with the customer didn’t include a door—it only had an annotation about a window. People implicitly know that cottages need doors, but builders may leave the door out if it isn’t explicitly specified.

After some systems work, the team corrects the problems, fixing the side wall and adding a door. Correcting the problems involved taking the cottage more than halfway apart and rebuilding it. The result meets what the customer wanted (Figure 3.5).

Figure 3.5: Final integrated cottage

3.3 Retrospective on building

There were several problems that the team encountered building the cottage.

The team did not work with the customer to develop a thorough understanding of the customer’s needs. The team only had a minimal writeup of the needs, and that writeup left an important need implicit (the need for a door).

Next, the team did not develop a concept of the system (the cottage) and check that concept with the customer. For example, the team could have made a more realistic drawing of the cottage, and talked with the customer about how the cottage would be used. Checking a concept would have probably caught the missing implicit requirement for a door.

To their credit, the team decomposed the cottage into components (walls and roof), defined some dimensional requirements each would meet, and assigned someone to design each component. Unfortunately the team did not work out and document the interfaces between components. This meant that no one looked at how the walls would be joined (interlocking or not), and no one looked at how the roof would be supported on some walls.
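
To make “documenting an interface” concrete, here is a minimal sketch of what even a toy interface agreement between two wall builders could look like. The sketch is hypothetical: the component names, dimensions, and the choice of Python are my own invention, not something the story’s team produced. The point is only that an interface definition can be written down explicitly and checked before anyone starts snapping bricks together.

```python
from dataclasses import dataclass

@dataclass
class WallSpec:
    """One builder's declared plan for their wall (hypothetical values)."""
    name: str
    length_studs: int       # length along the base plate, in studs
    height_bricks: int      # wall height, in brick courses
    corner_interlock: bool  # offsets alternate courses to interlock at corners?
    roof_support: bool      # carries part of the roof?

def corner_interface_problems(a: WallSpec, b: WallSpec) -> list[str]:
    """Check that two adjoining walls agree on their shared interface."""
    problems = []
    if a.height_bricks != b.height_bricks:
        problems.append(f"{a.name} and {b.name} disagree on wall height")
    if a.corner_interlock != b.corner_interlock:
        problems.append(f"{a.name} and {b.name} disagree on corner interlocking")
    return problems

# Hypothetical example: the front wall and one side wall, as each builder planned them.
front = WallSpec("front wall", length_studs=16, height_bricks=9,
                 corner_interlock=True, roof_support=True)
side = WallSpec("side wall", length_studs=10, height_bricks=8,
                corner_interlock=False, roof_support=False)

for problem in corner_interface_problems(front, side):
    print(problem)
```

Even something this small would have forced the two builders to agree on wall height and corner interlocking before building, which is exactly the kind of mismatch the team discovered only at integration.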

One of the team members building a wall did not follow requirements about color—or perhaps the color requirement was missing or unclear.

Finally, the team members did not communicate with each other. Ideally, each one would have shared abstract designs for their component with the people building components connecting to that component. Sharing these designs would likely have caught that each team member had different understandings of how their components would be joined together.

The outcome was that the project took longer than it should have, because of rework that could have been avoided.

Of course, this story is simpler than building a real building would be. A real building has multiple internal systems, such as electrical, plumbing, or HVAC, that would create many more interfaces among components. A real building has to be designed to be mechanically sound; this requires systematic analysis to ensure that the building will stay up even in unusual events like storms or earthquakes. A real building also has safety concerns, like fire safety. Finally, building a real building is regulated in most places, requiring permits, inspections, and approvals from external authorities to ensure regulatory compliance.

3.4 Adding to the cottage

Some time passes, and the customer decides that they would like a larger model cottage, and they make a request to add on to the initial version. The team that built the original cottage has moved on to other projects.

A new team talks with the customer to learn what the customer wants. How much larger do they want the extended cottage to be? Should it be extended horizontally or vertically? The customer indicates that an extension adding between 50% and 100% of the original floor area would be sufficient, and the customer prefers a horizontal extension.

The team next has decisions to make about the overall design of the extension. They settle on an approach that matches the style of the original part and adds a little over half the floor area. They suggest to the customer that a window in the extension would be a good idea, and the customer agrees.

The new team does not have access to the team that made the original design decisions. They have to reverse engineer the design approach used by examining the cottage as built.

The original cottage was located toward the back of the base plate, and the team has decided that the extension should be at the back of the original. This implies that the team will have to move the original cottage forward. The team examine the original structure and determine that it can be moved on the base plate without problems.

The new team works together to design and build the extension. They have learned about the problems that the original team had, and so they manage the interfaces between walls and with the roof better. However, they don’t have access to the decisions that the original team made about interlocking the walls for strength, and so they build the extension as a separate unit.

Figure 3.6: Cottage with addition

3.5 Retrospective on addition

This illustrates a common scenario: changes are made to a system long after it was originally built. The changes can be complex projects on their own. The original team may be long gone, or they may no longer remember details that were not written down. Knowing the original design decisions and the rationale behind them affects how the changes are designed.

The changes do not just add features (new space): they add interfaces between the new parts and the original, and they can change interfaces within the original.

The team that built the original cottage did not document the design decisions they made. The team building the addition had to reverse engineer the design from the built cottage. The lack of information about the rationale for how walls were connected led to a different, less structurally sound approach for connecting the addition to the original structure.

The project to build the addition took longer than it would have if the team had not had to reverse engineer the design. The lack of design rationale led to a structural solution that is sufficient for plastic bricks but would not work in a real structure.

In this story, the new team did learn from the original experience that they should do systems-level work. They worked through the interfaces between new parts, and this led the new project to go more smoothly than the original. The lesson is that learning over time matters.

Once again, this story is a simplification of a real building project. A real building would have far more interfaces: electrical circuits and plumbing might need to be extended. The structure of a real extension would have to be integrated into the original structure.

This story did not show the value of designing the original to be expanded. In the example, the original cottage could have been placed forward on the base plate so there was space for a later addition. In a real building, by analogy, designing the electrical main panel to have space for additional circuits and enough capacity to add more usage would make an addition easier.

3.6 Principles

As I present these stories, I will link them to the principles in Chapter 8 that can provide solutions.

Project leadership. Some of the problems in this story relate to how the cottage-building project was led. The most relevant principle is Section 8.1.3—Principle: Systems view of the system. The original team’s work would have gone more smoothly if they had had someone responsible for ensuring that the system made sense as a system.

System-building tasks. Some of the problems related to how the original team went about its work—which resulted in problems with the final system product.

The team. This story does not illustrate many problems with the team itself. However, the team building the original cottage built each of the components in isolation, and did not discover that their parts would not integrate until the parts had been built.

Chapter 4: Stories about building systems

This chapter presents some case studies of how people have built complex systems.

4.1 Developing a spacecraft mission without engineering the system

10 April 2024

The project. I worked on a NASA small spacecraft project. The project’s objective was to fly a technology demonstration mission to show how a large number of small, simple spacecraft could perform science missions. The mission objectives were to demonstrate performing coordinated science operations on multiple spacecraft, and to demonstrate that the collection of spacecraft could be operated by communicating between one spacecraft and ground systems, and the spacecraft then cross-linking commands and data to perform the science operations.

The problem. The mission had one set of explicit, written mission objectives to perform the technology demonstration. It also had a number of implicit, unwritten constraints placed on it, primarily to re-use particular spacecraft hardware and software designs.

Those two sets of objectives resulted in conflicts that made the mission infeasible. There were three key technical problems: power consumption was far in excess of what the spacecraft’s solar panels could generate; the legacy radios could not communicate effectively over the distances involved; and the design had insufficient computing capability to accurately compute how to point the spacecraft for cross-link communication.

Conflicts like these are not uncommon when first formulating a system-building project, and NASA processes are structured to catch and resolve them. The NASA Procedural Requirements (NPRs), a set of several volumes of required processes, require projects to formalize mission objectives and analyze whether a potential mission design is feasible. This work is checked at multiple formal reviews, most importantly the Preliminary Design Review (PDR).

At the PDR, expected project maturity is:

Program is in place and stable, addresses critical NASA needs, has adequately completed Formulation activities, and has an acceptable plan for Implementation that leads to mission success [italics mine]. Proposed projects are feasible with acceptable risk within Agency cost and schedule baselines. [NPR7120, Table 2-4, p. 30]

This project, however, failed at three of the necessary steps. First, the project did not perform top-down systems engineering, such as proper documentation of mission objectives, a concept of operations, and a refinement of those into system-level and subsystem-level specifications. In particular the implicit and undocumented constraints were never documented as requirements; they were tacitly understood by the team and rarely analyzed. The requirements that were gathered were developed by subsystem leads; they were inconsistent and did not derive from the mission objectives. Second, individual team members did analyses that showed problems with the radios, their antennas, and the ability to point the spacecraft in such a way that cross-link communications would work. The people involved repeatedly tried to find a solution within their individual domains of expertise, and the problems were never raised up to be addressed as a systemic problem. Finally, the PDR was the last check where these problems should have been brought to light, since a genuine refinement of the mission objectives and the concept of operations would have failed to show communication working. Instead, the team focused on making the review look good rather than addressing the purpose of the review.

Outcome. The project proceeded to build the hardware for multiple spacecraft, and began developing the ground systems and the flight software. After several months, the project neared the end of its budget, and the spacecraft design was canceled. Something like two years’ worth of investment was lost, and the capability of performing a multi-spacecraft science mission was never demonstrated.

The agency later found some funds to develop a much simplified version of the flight software and relaxed the mission objectives substantially to only performing some minimal cross-link communications. A version of that mission was eventually flown.

Solutions. The project made four mistakes. Each one of them could have been corrected if the project had followed good practice and NASA's required procedures.

First, the conflicting mission objectives and constraints should have been resolved early in the project. NASA has a formal sequence of tasks for defining a mission and its objectives, leading to a mission definition that is approved and signed by the mission’s funder. If the project had followed procedure, the implicit constraints would have been recorded as a part of this document. Documentation would have encouraged evaluation of the effects of those constraints.

Second, the project did not do normal systems engineering work. The systems engineering team should have documented the mission objectives, developed a concept of operations for the mission, and performed a top-down decomposition and refinement of the mission systems. In doing so, problems with conflicting objectives would have been apparent. The systems leadership would have been involved in analyses of the concept, and thus been aware of where there were problems.

Third, the team lacked effective communication channels that would have helped someone working one individual problem raise the issues they were finding up to systems and project leadership, so that the problems could be addressed as systems issues. For example, one person found that the flight computer would not be able to perform good-enough orbit propagation of multiple spacecraft so that one spacecraft would know how to point its antenna to communicate with another. A different person found problems with the ability of the radios to communicate at the ranges (and relative speeds) involved.

Finally, the PDR should have been the safety net to find problems and lead to their resolution. The NASA procedural requirements have a long list of the products to be ready at the PDR. More than 30 are specifically the responsibility of systems engineering [NPR7123, Table G-6, p. 81], and the project overall has a similar number of products [NPR7120, Appendix I]; there is some overlap between these lists. The team took a checklist approach to these products, putting together presentations for each topic in a way that highlighted progress on the individual topics but failed to address the underlying purpose: showing that there was a workable system design.

Had any of these mechanisms worked, the systems and project leadership would have detected that the conflicting mission objectives were infeasible and led the project to negotiate a solution.

Principles. This example is related to several principles for a well-functioning project.

4.2 Marketing and engineering collaboration

12 April 2024

The project. I worked at a startup company that was building a high-performance, scalable storage system. The ideas behind the system came from a university research project, which had developed a collection of technology components for secure, distributed storage systems.

The company had developed several proof-of-concept components and was transitioning into a phase where it was getting funding and establishing who its customers were. The company hired a small marketing team to work out what potential customers needed and to begin building awareness of the value that the new technology could bring.

The problem. The marketing team had experience with computer systems, but not with storage in particular. They could identify potential market segments, but they did not have the background needed to talk with potential customers about their specific needs.

The engineering team were similarly not trained in marketing. Some of the team members had, however, worked at companies that used large data storage systems, and so had experience being part of organizations like the potential customers.

Solutions. The marketing team set up a collaboration with some of the technical leads. This collaboration left each team in charge of their respective domains, with the technical leads helping the marketing team do their work and the marketing team providing guidance about customer needs to the engineering team.

One of the technical leads acted as a translator between the marketing and engineering teams, so that information flowed to each team in terms they understood. Technical leads joined the marketing team on customer visits, helping to translate between the customers’ technical staff and the marketing team. The marketing team conducted focus group meetings, and some of the technical leads joined in the back room to help frame follow-up questions to the focus groups and to help interpret the results.

Outcome. The collaboration helped both teams. The marketing team got the technical support and education they needed. The engineering team got a proper understanding of what customers needed, so that the system was aimed at actual customer needs.

Principles. This example is related to the following principles:

4.3 Missing implicit requirements

13 April 2024

The project. This occurred at the startup I worked at that was building a scalable storage system.

The problem. The team had a focus on making the system highly available, to the point where we had an extensive infrastructure for monitoring input power to servers and providing backup power to each server. If the server room lost mains power, our servers would continue on for several minutes so that any data could be saved and the system would be ready for a clean restart when power came back on. We did a good job meeting that objective.

What we forgot is that people sometimes want to turn a system off. Sometimes there is an emergency, like a fire in a server room, and people want the system powered off right away. Sometimes preventing the destruction of the equipment is more important than losing a few minutes’ worth of data. We had no power switches in the system and no way to quickly power it down.

Outcome. In practice this wasn’t too serious a problem because emergencies don’t happen often, but it meant that the system couldn’t pass certain safety certifications.

Solutions. We made two mistakes that led to the problem.

The first mistake was that everyone on the team saw high availability as a key differentiator for the product, and so everyone put effort into it. This created a blind spot in how we thought about necessary features.

The second mistake was that we did not work through all of the use cases for the system, and so missed implicit features, including powering the system off. Building up a thorough list of use cases can serve as a way to catch blind spots like this, but the team did not build such a list.
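
As an illustration of the kind of checklist I mean, the sketch below shows a use-case list checked for coverage against requirements. Everything in it is invented for this example (the use-case wording, the requirement IDs, and the choice of Python); the real product had no such artifact. A check like this would have flagged the missing “power the system off” case.

```python
# Hypothetical use-case coverage check; the use cases and requirement IDs are invented.
use_cases = [
    "serve client reads and writes",
    "ride through a mains power outage",
    "restart cleanly after power returns",
    "power the system off quickly in an emergency",  # the blind spot in this story
]

# Which use cases each requirement claims to cover.
requirement_coverage = {
    "REQ-AVAIL-1": ["serve client reads and writes",
                    "ride through a mains power outage"],
    "REQ-AVAIL-2": ["restart cleanly after power returns"],
}

covered = {uc for claimed in requirement_coverage.values() for uc in claimed}
for uc in use_cases:
    if uc not in covered:
        print(f"No requirement covers use case: {uc!r}")
```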

Principles. This is related to one principle:

4.4 Building at a mismatch to purpose

15 April 2024

The project. I consulted on a project to build a technology demonstration of a constellation of LEO spacecraft for the US DOD. This constellation was to perform persistent, worldwide observations using a number of different sensors. It was expected to operate autonomously for extended periods, with users worldwide making requests for different kinds of operations. The constellation was expected to be extensible, with new kinds of software and spacecraft with new capabilities being added to the constellation over time.

One company organized the effort as the prime contractor. That company built a group of other companies of various sizes and capabilities as subcontractors. The team won a contract to develop the first parts of the system.

The problem. The constellation had to be able to autonomously schedule how its sensors would be used, and where major data processing activities would be done. For example, someone could send up a request for an image of a particular geographic region, to be taken as soon as possible. The spacecraft would then determine which available spacecraft would be passing over that region soon. Some of the applications required multiple spacecraft to cooperate: taking images from different angles at the same time, or persistently monitoring some region, handing off monitoring from one spacecraft to another over time, and performing real-time analysis on the images gathered on those spacecraft.

The prime contractor selected its team of other companies and wrote the contract proposal for the system before doing systems engineering work. This meant that neither a detailed concept for the system’s operation nor a high-level design had been done.

After the contract was awarded, the team had to rapidly produce a system design. This effort went poorly at first because the system’s concept had not been worked out, and different companies on the team had different understandings of how the system would be designed. The team had to deliver an initial system concept of operations and requirements quickly; the requirements were developed by asking someone associated with each expected subsystem to write some requirements. Needless to say, the concept, high-level design, and requirements were all internally inconsistent.

After the team brought me on to help sort out part of the design problems, we began to do a top-down system design and establish real specifications for the components of the system. We were able to begin to work out general requirements for the autonomous scheduling components.

The project team had determined that they needed to use off-the-shelf software components as much as possible, because the project had a short deadline. One of the subcontractor companies was invited onto the team because they had been developing an autonomous spacecraft scheduling software product, and so the contract proposal was written to use that product.

However, as we began to work out the actual requirements for scheduling, it became apparent that the off-the-shelf scheduling product did not match the project’s requirements. The requirements indicated, for example, that the system needed to be able to schedule multiple spacecraft jointly; the product only handled scheduling each spacecraft independently. The system also had requirements for extensibility, adding new kinds of sensors, new kinds of observations, and new kinds of data processing over time. This suggested that strong modularity was needed to make extensibility safe, but the off-the-shelf product was not at all modular.

Outcome. The mismatch between the decision to use the off-the-shelf scheduling product and the system’s requirements led to both technical and contractual problems.

The technical problem was that the scheduling product could not be modified to work differently and thus meet the system requirements. The project did not have the budget, people, or time to do detailed design of a new scheduling package that would meet the need.

The contractual problem was that the subcontractor had joined the project specifically because they saw a market for their product and were expecting to use the mission to get flight heritage for it. When it became clear that their product did not do what the system needed, they discussed withdrawing from the project.

In the end, the customer decided not to continue the contract and the project was shut down.

Solutions. This project made three mistakes that, had they been avoided, could have changed the project’s outcome.

First, the team did not do the work of early stage systems engineering to work out a viable concept and high-level design before committing to partners and contracts. This would have made it clear what was needed of different system components. It would also have provided a sounder basis for the timelines and costs in their contract proposal.

Second, the team made design and implementation choices for some system components without understanding the purpose that those components needed to fill.

Finally, the team made commitments to using off-the-shelf designs without determining whether those designs would work for the system.

Principles. The solutions above are related to the following principles:

4.5 The persistence of team habits

6 May 2024

The project. I consulted for a company that was working to build an autonomous driving system that could be retrofitted into certain existing road vehicles.

The company had started with veterans from a few other autonomous driving companies. They began their work by prototyping key parts of a self-driving system, to prove that they had a viable approach to solving what they saw as the key problems. This resulted in a vehicle that could perform some basic driving operations, though it was always tested with a safety driver on board.

The team focused only on what they saw as the most important problems in an autonomous driving system. They believed that it was important to demonstrate a few basic self-driving functions as rapidly as possible—in part because they believed that this would help them get funding, and in part because they believed that this would help them forge partnerships with other companies. They focused on a simplified set of capabilities, including sensing, guidance, and actuation mechanisms for driving on a road.

The problem. This focus meant that the team developed a culture, along with a few partially documented processes, oriented toward building a prototype-style product, even as they began to fit their system into multiple vehicles and test them on the road (with safety drivers). When they found a usage situation in their testing that their driving system did not handle as they felt it should, they added features for that situation to the sensing and guidance components and to the simulation tests they used on those components. In other words, the engineering work was driven largely reactively.

The team did not spend effort on analyzing whether the new features would interact correctly with existing features, relying on simulation testing to catch regressions. They did not develop a plan for features that they would need, and for how they would integrate other systems with the core functions they had already prototyped.

Some of the team members had some awareness that they needed to improve the safety of the driving system and the rigor with which the team designed and built the system. These team members, some of whom were individual engineers and some who were leaders, tried from time to time to define some basic individual processes—like defining requirements before design, or conducting design reviews. Their goal was always to move the team incrementally toward sound engineering practice.

None of these attempts worked: each time, a few people would try a new procedure, task, or tool, but a critical mass of the team would keep working the way they had been in order to keep adding new features in response to immediate needs.

Outcome. After nearly two years, the team had not changed its practices and continued to work as if they were building a prototype. The team in general did not define or work to requirements; they did not analyze the systems implications of potential new features before implementing them. The team was making little progress on developing a safety case for the system.

Solutions. The fundamental problem was a misalignment between the incentives that drove the team in the short term and long-term practices needed to build a safe and reliable system.

The team as a whole, from the leadership down, developed habits focused on developing a proof of concept that would let the company get additional funding, as well as attract good staff and help the company build partnerships. This was the right choice for the company in its early days, because a company that cannot get funding does not get to move on to the long term. This short-term focus drove the habits and culture of the early company.

Later, as the company got funding and built up a team to build the system, they needed to change their practices. Changing a team’s culture and habits is hard: the existing practices had, after all, worked for the team initially. The team’s habit of focusing on short-term results, in particular, defined how they organized all their work.

To become a company building a product that is viable in the long term, a team like this must make a deliberate change to its culture, habits, and practices. A disruptive change like this does not happen spontaneously: a team’s culture defines the stable environment in which people can do what they understand to be good work. This creates a disincentive to make a change that disrupts how everyone works together.

Deliberate and pervasive changes come from the team’s leadership. The leadership must first recognize that a change is needed and work out a plan for what to change, how quickly, and in what way. The leadership then have to explain the changes needed, create incentives that will overcome the disincentives to change, and hold people on the team accountable for making the changes.

Principles. This case reflects some more of the principles outlined in Chapter 8.

4.6 Heavyweight, understaffed processes

24 April 2024

The project. A colleague was an engineer working on an electronics-related subsystem at a large New Space company that was building a new launch vehicle.

The team in question was responsible for designing one of the avionics-related subsystems and acquiring or building the components. This required finding suppliers for some components and ordering the necessary parts.

The problem. The company had processes in place for both vendor qualification and parts ordering. They included centralized software tools to organize the workflow.

The vendor qualification process began with submitting a request into the tools. The request was then reviewed by a supplier management team; once they approved a supplier, the avionics team could start placing purchase requests to buy parts. The purchase request would similarly be routed to an acquisition team that would make the actual purchase from the supplier.

The intents of this process were, first, to take the work of qualifying potential vendors and managing purchases off the engineering team, and second, to ensure that the vendors were actually qualified and that parts orders were done correctly.

From the point of view of the engineers building the avionics, the processes were opaque and slow. They would put in a request, and not know if they had done so properly. Responses took a long time to come back. At one point, my colleague reverse engineered the vendor qualification process in order to figure out how to use it; the result was a revelation to other engineers.

It also appeared that the positions responsible for processing these requests were understaffed for the workload. In practice, the people in those positions rarely had the time to do proper reviews of the vendors.

Outcome. Having supply chain processes was a good thing: if they worked, they increased the likelihood that the acquired parts would meet performance and reliability requirements, that the vendors would deliver on schedule and cost, and that the cost of acquiring parts remained within budget.

However, getting vendors qualified to supply components and then getting the components took a long time, delaying the system’s implementation and then delaying testing and integration.

The suppliers and the parts did not get the intended scrutiny, which may have let problem suppliers or parts through.

The company acquired a reputation with its employees of being slow and difficult to work for.

Solutions. There are four things that could have been done to make these processes work as intended.

First, the processes should be documented in a way that everyone involved knows how the process works. In this situation, it seems that people playing different parts in the process knew something about their part, but they did not understand the whole process; if there was documentation, the people involved did not find it. The process documentation should inform all the people involved what all of the steps are, so they understand the work. It should make clear the intent of the process. It should also make clear what would make a request successful or not.

Second, the processes should be evaluated to ensure that every step adds value to the project, compared to not doing that step or doing the process another way.

Third, the supporting roles—in this case, those tasked with reviewing and approving requests—should be staffed at a level that allows them to meet demand.

Finally, the project should regularly check whether its processes are working well, and work out how to adjust when they are not working.

Principles. The following principles apply:

4.7 Planning the transcontinental railroad

24 April 2024

The project. The first transcontinental railroad to cross North America was built between 1862 and 1869 [Bain99]. It involved two companies building the first rail route across the Rocky Mountains and the Sierra Nevada, one starting in the west and the other in the east. It was built with US government assistance in the form of land grants and bonds; the government set technical and performance standards that had to be met in order to get tranches of the assistance. The technical requirements included worst-case allowable grades and curvature. The performance requirements included establishing regular freight and passenger service to certain locations by given dates.

The problem. The companies building the railroad had limited capital available to build the system. They had enough to get started, but continuing to build depended on receiving government assistance and selling stock. Government assistance came once a new section of continuous railroad was accepted and placed into service. In addition, the two companies were in competition to build as much of the line as possible, since the amount of later income depended on how much each built.

This situation meant that the companies had to begin building their line before they could survey (that is, design) the entire route. They operated at some risk that they would build along a route that would lead to someplace in the mountains where the route was uneconomical—perhaps because of slopes, or necessary tunneling, or expensive bridges.

Because the building began before the route was finalized, the companies could not estimate the time and resources needed for construction beyond some rough guesses. The companies worked out a general bound on cost per mile before the work started, and government compensation was based on that bound. In practice the estimate was extravagantly generous for some parts of the work.

Solutions. The initial design risk was limited because there were known wagon routes. People had been traveling across the Great Plains and the mountains in wagons for several years. While the final route did not exactly follow the wagon routes, the early explorations ensured that there was some feasible route possible.

The companies built their lines in four phases: scouting, surveying, grading, and track-laying. (In some cases they built the minimal acceptable line with the expectation that the tracks would be upgraded in the future once there was steady income.) Scouting defined the general route, looking for ways around bottlenecks like canyons, rivers, or places where bridges or tunnels would be needed. Surveying then defined the specific route, putting stakes in the ground. The surveyed route was checked to ensure it met quality metrics, such as grade and curvature limits. After that, grading crews leveled the ground, dug cuts through hills, and tunneled where necessary. Finally, track-laying crews built bridges and culverts where needed, then laid down ballast, ties, and rail. After these phases, a section of track was ready for initial use.

Scouting ran far ahead of the other phases, sometimes up to a year ahead. Survey crews kept weeks or months ahead of grading crews. The grading and track-laying crews proceeded as fast as they could. All this work was subject to the weather: in many areas, work could not proceed during winter snows.

Outcome. The transcontinental railroad was successfully built, which opened up the first direct rail links from one coast of North America to the other. The early risk reduction—through knowledge of wagon routes—accurately showed that the project was feasible.

The companies were able to open up new sections of the line quickly enough to keep the construction funded. The companies received bonds and land grants quickly enough, and revenue began to arrive.

The approach of scouting and surveying worked. The scouting crews investigated several possible routes and found an acceptable one. While there were instances of tentatively selecting one route then changing for another—sometimes for internal political reasons rather than technical or economic reasons—no section of the route was changed after grading started. In later decades other routes were built, generally using tunneling technology that was not available for the first line. Many parts of the original line are still in regular use.

Principles. The transcontinental railroad project was an example of planning a project at multiple horizons, where the work of implementation begins before the design is complete, and where the plan and the design are continuously refined.

Part II: Systems background

Foundational definitions used throughout the rest of this book, including:

Chapter 5: What making systems is

9 May 2024

This book is about the work involved in making a system—what a system is, and how to do a good job making one.

Part I presented a set of case studies that showed how system-building projects can go well—or not. This leads to two questions: How does one build a system well? And how does one avoid the problems?

To start finding answers to these questions, consider three aspects of making a system: what a system is; the activities involved in making it; and the people who do the activities that make the system.

A system. A system is “a regularly interacting or interdependent group of items forming a unified whole”.[1] Other definitions speak to a set of items or components that work together to fulfill a purpose.

This definition includes some of the key aspects of a system.

For artificially built systems, the system is the outcome of all the work that people do to make the system.

Having a purpose distinguishes a system built by people from systems in nature. A natural system often just exists, and any meaning or purpose to it is assigned after the fact by people. A human-built system, on the other hand, does something for someone. The purpose of a human-built system can be described in terms of what the system does for someone, and why it is worth the effort to make a system do that.

Most systems are not static: they will evolve rapidly as they proceed from concept through design; once they are in operation, they will continue to evolve as their users’ needs evolve.

The next chapter, Chapter 6, discusses more about what a system is.

Making a system. The work of making a system can be seen as a string of activities, the life cycle of the system. It begins with an idea. That idea might be a user’s need, or it might be an idea for a new way to do something that might fill an as-yet-unidentified user’s need. The work proceeds to translate that idea into designs and then into a working system. This work goes through a number of steps, such as developing a concept, specifying its pieces, designing and implementing them, integrating the parts and verifying the assembly. Once a system has been built, it can be placed into operation. A system that has been in operation may change over time: users’ needs change, or technology changes. Eventually, every system is retired and disposed of.

All these activities are done by a team of people who are building the system, and the point of spending the effort is for the system, once built and in operation, to fulfill its purpose.

Chapter 7 discusses more about how to make a system.

Who does the work. A team of people working together does all the activities involved in making the system. For complex systems, the team can get large and may involve people at different companies and with different skills.

The team of people is itself a system: a set of people whose purpose is to build the objective system, and who interact with each other through discussion, documentation, and artifacts. A team that is functioning well is able to focus its efforts on the purpose of the system it is building. The team is organized so that its members each have the information they need to do their part, and so that they communicate enough that the pieces of the system they create work together.

Key roles. A team that functions well, like any human-built system, does not happen by accident; it happens because someone takes the effort to design and implement it so that it works well.

In practice, there are three roles that do this work of organizing and running the team. These roles may be divided among team members in many different ways, but every team building a complex system needs the three roles filled somehow. The roles are systems engineering, project management, and project leadership.

The intersections. Having teased apart the ideas of system, system-building, and people, and the ideas of systems engineering, project management, and project leadership, the next step is to acknowledge that none of these things are in fact separate.

The objective of a project is to produce a system. The way to produce it is to do system-building work. The people in a team do that work. All three must fit together: the way that the work gets done determines whether the resulting system meets its purpose. How the team is organized, its culture and habits, govern how the people will do the work.

While systems engineering, project management, and project leadership are different roles and involve different skills, they work together. Leadership by itself gets nothing done; that comes from engineering and management. Leadership and management without systems engineering might produce a system but it probably won’t work. Leadership and engineering without management usually means a lot of engineers running around doing cool things but also wasting time and resources and not actually getting things done. Management and engineering without leadership isn’t able to make decisions or take responsibility.

The people filling each of these three roles also need to understand their counterparts’ roles. A systems engineer who designs something that would require more time or resources than the project has is not going to be effective. A project leader who does not understand the work the team does is not going to model good work practices. A project manager who does not understand the engineering is not going to build a plan or schedule that makes sense.

Systems work, in the end, is about making a coherent whole out of the parts at hand. The work of making a system is just as much a systems effort as the product itself. Only when the parts fit together does the work get done as it should.

Chapter 6: Elements of systems

21 May 2024

6.1 Introduction

Working with systems is about working with the whole of a thing. It is a bit ironic that to make the whole accessible to rational design, we need to talk about the parts that make up systems work.

That is one of the first points about systems. Most systems are too complex for a human mind to remember and understand as a whole at one time. To work on these systems, we must find ways to abstract and to subset the problem. This book discusses some of the techniques for slicing a system into understandable parts, along with ways to use those techniques and why to use them. In the end, however, everything in here deals with carefully-chosen subsets of a system.

This chapter covers some of the essential concepts and building blocks that are the foundation for the techniques discussed in the rest of this book.

The subjects for systems work can be divided into five groups: the system’s purpose, its boundary, its parts and the views people take of them, its structure and the properties that emerge from that structure, and the evidence that the system meets its purpose.

The first four subjects are connected by a reductive approach to explaining complex systems, in which the high-level purpose is explained by reducing it to simpler constituent parts and structure, and conversely expressing the purpose as emergent from these simpler parts. The final subject is about ensuring that the system does what it is supposed to do (and only that).

6.2 System purpose

Every system that is designed and built has a purpose. That is, someone has an expectation of the benefits that will come from building the system, and they believe that those benefits will outweigh the costs (in resources, time, or opportunities) that will be incurred building the system.

Every system must be designed and built to address its purpose, and no other purposes, at the lowest cost practically achievable. This point may seem uncontroversial on its surface, but I have observed that the majority of projects fail to work to this standard, and incur unnecessary costs, schedule slips, or missed customer opportunities. Every design choice must be weighed according to how well each option helps satisfy the purpose; if an option does not help, it should not be chosen.

Making design decisions guided by a system’s purpose means that the team must understand what that purpose is. The purpose must be recorded in a way that all the team members can learn about it. It also needs to be accurate: based on the best information available about what the system’s users need, and as complete as can be achieved at the time. The record of the purpose should avoid leaving important parts implicit, expecting that people will know that systems of a particular kind should (for example) meet certain safety or profitability objectives; people who specialize in one area will know some of these implicit needs but not others. The purpose documentation should also include secondary objectives, such as meeting regulatory requirements or leaving space in the design for anticipated market changes.

The understanding of a system’s purpose and costs will shift over time, both as the world changes and as people learn more accurately what the value or cost will be. When the idea for the system is first conceived, the purpose may be accurate for that time but the understanding of the cost is likely to be rough. As design and development progress, the understanding of cost improves, but the needs may change or a customer may realize they misunderstood some part of the value proposition.

A system’s purpose also changes over longer periods of time. People add new features to an existing product to expand the market segment to which it applies or to help it compete against similar products. The technology available for implementing a system changes, creating opportunities for a faster, cheaper, or otherwise better system.

Systems leadership has to balance the need for a clear and complete statement of a system’s purpose with the fact that the understanding of purpose will change over time. The agile [Agile] and spiral [Spiral] management methodologies arose from this need for balance between opposing needs. Later chapters address how systems engineering methodologies can help address this need.

Working in a way that is driven by system purpose requires discipline in the team and its leadership. Many junior- and mid-level engineers are excited about their specialist discipline, and want to get to designing and building as quickly as possible—after all, those are the activities they find fulfilling. I have observed team after team proceed to start building parts of a system that they are sure will be the right thing, without spending the effort to determine whether those parts are actually the right ones. Those design decisions may end up being correct many times, which leads to a false confidence in decisions taken this way (“I’m experienced; I’m almost always right!”). The flaw is that the wrong decisions can have a high cost, high enough to outweigh any benefit from the rapid, unstudied decision.

I have seen many teams say—rightly—that they need to make some design decision quickly, see whether it works, and then adjust the design based on what they learn. This line of reasoning is both a good idea and dangerous. If a team actually does the later steps of evaluating, learning from, and changing the design, then this approach can result in good system design. (This is discussed further in later chapters on prototyping.) However, most teams lack the leadership discipline to follow through on this plan: once there is some design in place, pressures to keep moving forward drive teams to live with the bad initial design and accept complexity and errors. It requires discipline and commitment from the highest levels of an organization to take the time needed to learn from an early design and change what they are doing. The leadership must be prepared to push back against pressures to just live with a poor design, to require their team to take the time to learn and adjust, and to be clear with external parties, such as investors, that this plan is a necessary and positive way to realize a good product.

In a later chapter, we discuss techniques that can help to keep a system’s development grounded in its purpose, while adapting to changes in purpose and learning about the system’s design choices over time.

6.3 System boundary

A system has a boundary that defines what is within the system and what is not. What the system does (its functions) and what it uses to do them (its components) are within the system.

The rest of the world is outside the system. The outside world includes the system’s environment: the part of the world with which the system interacts.

The boundary defines the interface between the system and its environment.

What is inside the system and where the boundary lies are within the control of the project building the system. The project must adapt its work to everything else outside the system boundary.

6.4 System parts and views

Systems are designed and built by people. The methods used to build them must account for two human issues. First, most systems today are too complex for one person to keep in mind all the parts at one time, leading to a need to work with subsets of the system at any given time. Second, most systems also require multiple people to design or build, either because of specialties or the total amount of work involved. This leads to the need to break the work up into parts for different people to work on.

There are two techniques used to address this need. First, systems are divided into component parts, typically in a hierarchical relationship: the system is divided into subsystems, which are in turn subdivided, until they reach component parts that are simple enough not to require further subdivision. Second, people approach the system through narrow views, each of which covers one aspect of the system but across multiple component parts—such as an electrical power view, an aerodynamics view, or a data communications view.

Dividing the system into component parts creates pieces that are small enough to reason about or work on in themselves. The description of the part must include its interfaces to other parts, so that the design or implementation can account for how it must behave in relation to other parts. However, the interface definitions abstract away the details of other parts, so that the person can concentrate their attention on just the one part.

Dividing up the system also allows different people to work on different parts, as long as both parts honor the interfaces between them. The division into parts, and the definition of interfaces, create divisions of responsibility and scope for communication for the different people. This is addressed further in the Teams section (Section 7.3.3).

The hierarchical breakdown of the system into components and subcomponents provides a way to identify all of the parts that make up the system, ensuring that all can be enumerated. It also defines a boundary to the system: the system is made up of the named parts, and no others.

Reasoning about views of a system provides a similar and complementary way of managing the complexity of reasoning about a system by focusing on one aspect across multiple parts, and abstracting away the other aspects. This allows different people to address different aspects, as long as the aspects do not interact too much. For example, specialist knowledge, such as about electrical system design, can be brought to bear without the same person needing to understand the aerodynamics of the aircraft in which the electronics will operate.
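
To make these two techniques concrete, here is a small sketch in Python (the component names and attributes are invented for illustration; this is not a recommended representation): a hierarchical decomposition into parts, and a view that cuts across the hierarchy by selecting a single aspect of every component.

    # Minimal sketch (invented component names): a hierarchical decomposition
    # into parts, and a view that selects one aspect (here, electrical power
    # draw) from every leaf component, across the whole hierarchy.

    system = {
        "aircraft": {
            "avionics":  {"flight computer": {"power_w": 120, "mass_kg": 2.1},
                          "radio":           {"power_w": 45,  "mass_kg": 0.8}},
            "actuation": {"aileron servo":   {"power_w": 60,  "mass_kg": 1.5}},
        }
    }

    def leaves(node, path=()):
        """Yield (path, attributes) for every leaf component in the hierarchy."""
        for name, child in node.items():
            if child and all(isinstance(v, dict) for v in child.values()):
                yield from leaves(child, path + (name,))
            else:
                yield path + (name,), child

    # An "electrical power" view: one aspect of the design, across all parts.
    power_view = {"/".join(path): attrs["power_w"] for path, attrs in leaves(system)}
    print(power_view)
    # {'aircraft/avionics/flight computer': 120, 'aircraft/avionics/radio': 45,
    #  'aircraft/actuation/aileron servo': 60}

The same hierarchy could yield a mass view or a data-communications view by selecting a different aspect, which is the essence of working on one aspect at a time.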

Sidebar: Non-reductive systems

This approach of defining a system in reductive terms—using parts and structure—is not a formal necessity of systems in general. Rather, this approach is used as a way for ordinary people to define, build, and check systems.

There are numerous examples of non-human processes that have developed complex systems that are not easily explained reductively. Many of these were developed using evolutionary methods, both biological and electronic. Others arise from other optimization and machine learning techniques. These generative design tools have been demonstrated in mechanical and electronic design.

Consider the circuit discussed by Thompson and Layzell [Thompson99]. This circuit was developed by evolving a design on an FPGA, so that the result would distinguish between inputs at two different frequencies. The resulting circuit design achieves its objective, but is not readily understandable by decomposing the design into individual elements on the FPGA—indeed, some cells that did not appear to be used directly were nonetheless essential to the circuit’s function. Further, the circuit only worked well on the specific FPGA chip on which it was evolved; when moved to another FPGA of the same model, it was reported to work poorly.

While these designs are not readily understood by decomposition, they still must be verified for conformance with their purpose. This starts with a clear definition of purpose, from which the fitness or objective function used in optimization can be derived. For critical systems or components, the objective function must not only specify what the desired behaviors are, but also the undesired behaviors and the behaviors when the system is outside its intended performance environment. In some methods, the objective “function” can be an adversarial neural network that must itself be trained based on the system’s purpose. The result of the generative or optimization method must also be verified against the purpose to check that the result is in fact correct—which can catch errors in building the objective function, or subtle dependencies on environment.
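
As an illustration of deriving an objective function from purpose, here is a sketch in Python (the case structures and weights are hypothetical choices of my own, not part of any particular tool): it rewards the desired behavior while explicitly penalizing undesired behaviors and poor behavior outside the intended operating environment.

    # Minimal sketch (hypothetical cases and weights): a fitness function for a
    # generative or evolutionary design method that scores desired behavior and
    # explicitly penalizes undesired and out-of-envelope behavior.

    def fitness(candidate, evaluate, nominal_cases, hazard_cases, off_nominal_cases):
        """Higher is better. `evaluate(candidate, case)` returns the candidate's
        observed output for one case; each case carries its expected output."""
        score = 0.0

        # Reward the desired behavior on nominal inputs.
        for case in nominal_cases:
            if evaluate(candidate, case) == case["expected"]:
                score += 1.0

        # Penalize undesired behaviors heavily, e.g. outputs that would be unsafe.
        for case in hazard_cases:
            if evaluate(candidate, case) in case["forbidden_outputs"]:
                score -= 10.0

        # Penalize poor behavior outside the intended operating environment
        # (temperature, supply voltage, a different chip, and so on).
        for case in off_nominal_cases:
            if evaluate(candidate, case) != case["expected"]:
                score -= 2.0

        return score

Even with such an objective function, the final candidate still needs to be verified directly against the purpose, as noted above, rather than trusted because it scored well during optimization.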

6.5 Structure and emergence

Decomposing a system into component parts is one part of the system’s design; the other part is how those components relate to each other. The relations between parts define the structure of the system. These relations include all the ways that components can interact with each other, at different levels of abstraction. At low levels, this might be interatomic forces at the molecular level; at medium levels, mechanical, RF, force, or energy transfers; at higher levels, information exchange, redundancy, or control.

The structure needs to lead to the system’s desired aggregate properties, such as performance, safety, reliability, or specific system functions like moving along the desired path or providing reliable electrical service.

The aggregate properties are emergent, and arise from the way the structure combines the properties of individual components.[1] The structure must be designed so that the system has the desired emergent properties and avoids undesired ones. For example, a simple redundant system has a reliability property that arises from combining two or more components that can perform the same function with a pattern of interaction: each component receives the same inputs and generates consistent outputs, the results are combined in a defined way, and each component responds to failure in a defined way.
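
To make the redundancy example concrete, here is a minimal sketch in Python (the interfaces are hypothetical, and real redundancy schemes involve much more, such as input agreement and fault detection): the reliability property arises from the structure of replication plus voting, not from any single component.

    # Minimal sketch (hypothetical interfaces): three replicas receive the same
    # input and a voter combines their outputs, masking one faulty or failed
    # replica -- reliability as an emergent property of the structure.

    from collections import Counter

    def vote(replicas, sensor_input):
        """Return the majority output of the replicas, or raise if there is none."""
        outputs = []
        for replica in replicas:
            try:
                outputs.append(replica(sensor_input))  # every replica sees the same input
            except Exception:
                continue                               # a crashed replica casts no vote
        if not outputs:
            raise RuntimeError("all replicas failed")
        value, count = Counter(outputs).most_common(1)[0]
        if count >= 2:                                 # two-out-of-three agreement
            return value
        raise RuntimeError("no majority: replicas disagree")

    # Usage: the same function computed three ways; the third replica is faulty.
    replicas = [lambda x: x * 2, lambda x: x * 2, lambda x: x * 2 + 1]
    print(vote(replicas, 21))   # -> 42, despite the faulty replica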

The structure must be designed to avoid unanticipated emergent properties, especially when those properties are undesirable. In a safe or secure system, for example, it is necessary to show that the system cannot be pushed into some state where it will perform an unsafe action or provide access to someone unauthorized. Avoiding unanticipated emergent properties is one of the hardest parts of correctly designing a complex system.

The structure must be well-designed for the system to meet its purpose, and for people to be able to understand, build, and modify it. In particular the structure needs to be:

There are good engineering practices that should be followed to achieve these aims, as we discuss in later chapters.

Finally, the structure determines the interfaces that each component part must meet. Those interfaces in turn determine a component’s functions and capabilities, which guide the people working on the component, as discussed in the previous section.

6.6 Evidence

It is not enough to design and build the system; the team must also show that the system meets its purpose.

The team developing or maintaining the system must be able to show that the system complies with its purpose to customers, who need to know that the system will do what they expect; to investors, who need evidence that their investment is being used to create what they agreed to fund; and to regulators, especially for safety- or security-critical systems, who are charged with ensuring that systems function within the law.

The team also needs to ensure that pieces of the system meet the system’s purpose as they are developing or modifying those pieces. They must be able to judge alternative designs against how well they meet the purpose, and once built they must be able to check that the result conforms to purpose.

The process of showing that a system or a component part fulfills its purpose involves gathering evidence for and against that proposition, and combining the evidence in an argument to reach an overall conclusion about compliance. There are many kinds of evidence that can be gathered: results of testing, results of analysis, expert review, or demonstrations of the system. These individual elements of evidence are then combined to show the conclusion. The combination usually takes the form of an argument: a tree of logical propositions starting with the purpose and decomposing hierarchically into many lower-level propositions that can be evaluated using evidence. The process must show that the structure of the argument is both correct and complete in order to justify the final conclusion.
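
As a small illustration of such an argument, here is a sketch in Python (the class and field names are illustrative assumptions, not a standard notation such as GSN): propositions form a tree whose leaves are supported directly by evidence, and the top-level claim holds only if every branch is supported.

    # Minimal sketch (illustrative names): an argument as a tree of propositions.
    # Leaf propositions are supported directly by evidence items; an intermediate
    # proposition is supported only if all of its sub-propositions are.

    from dataclasses import dataclass, field

    @dataclass
    class Evidence:
        description: str        # e.g. "test report", "timing analysis memo"
        passed: bool

    @dataclass
    class Proposition:
        claim: str
        evidence: list = field(default_factory=list)   # Evidence items (leaves)
        children: list = field(default_factory=list)   # sub-propositions

        def supported(self) -> bool:
            if not self.evidence and not self.children:
                return False                           # an unsupported claim
            return (all(e.passed for e in self.evidence)
                    and all(c.supported() for c in self.children))

    # Usage: a tiny argument that the system meets its purpose.
    argument = Proposition(
        claim="The system meets its purpose",
        children=[
            Proposition("Implementation complies with design",
                        evidence=[Evidence("integration test results", True)]),
            Proposition("Design complies with specification",
                        evidence=[Evidence("design review record", True)]),
            Proposition("Implementation validated against original purpose",
                        evidence=[Evidence("end-to-end validation report", True)]),
        ],
    )
    print(argument.supported())   # True only if every branch is supported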

Pragmatically, arguments about meeting purpose usually follow a common pattern, as shown below. The primary argument that the implementation meets the purpose consists of a chain of verification steps. The implementation complies with a design, which complies with a specification, which complies with an abstract specification, which complies with the original purpose. As long as each step is correct, then the end result should meet the original purpose—but at each step there is the possibility of misinterpretation or missing properties, or that the verification evidence at each step is not as complete as believed. In practice this approach leaves plenty of uncaught errors in the final implementation. To catch some of these errors in the chain of verification steps, common practice is to perform an independent validation, in which the final implementation is checked directly against the original purpose.

[Figure: the chain of verification steps from purpose through specification and design to implementation, with an independent validation of the implementation against the original purpose]

Some industries, particularly those dealing with safety-critical automotive and aerospace systems, add an additional kind of evidence-based correctness argument. This is often called the safety case or security case, and consists of an explicit set of propositions, starting with the top-level proposition “the system is adequately safe” (or secure) and showing why that conclusion is justified using a large hierarchy of propositions. The lowest-level propositions in the hierarchy consist of concrete evidence; intermediate propositions combine them to show that more abstract safety or security properties hold.

Finally, evidence takes many forms, depending on what needs to be shown. Some correctness propositions can be supported by testing. These typically show positive properties: the system does X when Y condition holds. Some of these conditions are hard to test, and are better shown by analysis or human review of design or implementation. Negative conditions are harder to show: the system never does action X or never enters state Y, or does so at some very low rate. These require analytic evidence, and cannot in general be shown by testing.

We discuss matters of correctness, verification, validation, and the related arguments in later chapters.

6.7 Using this model

The model in this chapter provides a way to think and talk about systems work. As a team begins a systems-building project, it will be gathering information or making decisions that can be organized using this model. The model can help guide people as they work through some part of the system. For example, the system’s purpose is reflected in the emergent behavior of the system, which in turn depends on the structure of how components interact. When the system is believed to be complete, the team should be able to verify that all of the relations indicated by this model are defined and correct. Later, as the system needs to evolve and the team makes changes to the system, this model helps them reason about what is affected by some change.

This model of systems provides a foundation for organizing the work that needs to be done to build the system. The next chapter presents a model for this work of building a system or component. The information about one component is represented in a set of artifacts, and there are tasks that make those artifacts. The structure of the artifacts, and thus of the tasks, is based on the model of systems and components in this chapter.

Part III goes into greater detail about each part of this model.

Chapter 7: Elements of making a system

29 March 2024

7.1 Introduction

The previous chapter defined what a system is. In this chapter, I turn attention to how to make that system. “Making” includes the initial design and building of the system, as well as modifications after the initial version has been implemented.

Making the system is a human activity. Building a system correctly, so that it meets its purpose, requires a team of people to work together. Building systems of more than modest complexity will involve multiple people, usually including specialists who can work on one topic in depth and people who can manage the effort. It involves people with complementary skills, experiences, and perspectives. Such systems take time to build, and people will come and go on the team. Systems that have a long life that leads to upgrades or evolution will involve people making modifications who have no access to the people who started the work.

This chapter provides a model to organize and name the things involved in the making of a system—the activities, the actors, and what they work with. Later chapters provide details on each part of this model. This model includes both elements that are technical, such as the steps to design some component, and elements that are about managing the effort, such as organizing the team doing the work or planning the work. Note that this model does not attempt to cover all of managing a system project—there is much more to project management than what I cover here.

The model presented in this chapter only serves to name and organize. I do not recommend particular approaches for each element of the model here; I only describe attributes that good approaches should have. Later parts of this book address ways to achieve many of these things. For example, the team that is designing a system should have an organization (a desirable attribute), but I do not address which organizational structures one can choose from.

The assembly of all the parts involved in making a system is itself a system. In those terms, this chapter presents the purpose (Chapter 9) of the system-making system and a concept for how to organize the high-level components (Chapter 11) of that system.

7.2 Objective

This model of making captures the activities and elements involved in executing the project to make or update a system.

The approach used for making the system should:

7.3 Model

The making model has five main elements:

  1. Artifacts: the things created that make up the system and its records
  2. Tasks: the activities that are performed to make artifacts
  3. Team: the people who perform tasks
  4. Tools: things that the team uses in performing tasks
  5. Operations: how the team manages the work to be done
[Figure: the five elements of the making model: artifacts, tasks, team, tools, and operations]

7.3.1 Artifacts

The artifacts are the things that are created or maintained by the work to make the system.

The artifacts have three purposes. First, the artifacts include the system’s implementation—the things that will be released or manufactured and put in users’ hands. The artifacts should maintain the implementation accurately, and allow people to identify a consistent version of all the pieces for testing or release. Second, the artifacts are a communication channel among people in the team, both those in the team in the present and those who will work on the system later. These people need to understand both what the system is, in terms of its design and implementation, and why it is that way, in terms of purpose, concept, and rationales. Finally, the artifacts are a record that may be required for future customer acceptance, incident analysis, system certification, or legal proceedings. Those evaluating the system this way will need to understand the system’s design, the rationales for that design, and the results of verification.

The artifacts should be construed broadly. They include:

Artifacts other than the implementation are valuable for helping a team communicate. Accurate, written documentation of how parts of the system are expected to work together—their interfaces and the functions they expect of each other—is necessary for a team to divide work accurately.

Many engineers focus solely on the implementation artifacts, especially in startup organizations that are trying to move quickly, and do not produce documents recording purpose, design, or rationales. If the organization is successful and the system it is building enters service, at some point this other information will be required—as the team membership turns over, or as the complexity of the system grows, or as the team finds flaws that need to be corrected. The startups I have observed have all had to reconstruct such information after the fact; the reconstructed information is less accurate and costs more than it would have if it had been recorded from the beginning.

Finally, the artifacts should be under some kind of configuration management. Artifacts will evolve as work progresses. One artifact may be a work in progress, meaning others may want to review or comment but that they should not count on the artifact’s contents being stable. An implementation artifact may reflect some design artifact; when the design artifact is revised, people must be able to see that the implementation reflects an older version of the design. When the implementation artifacts are packaged up and released, the resulting product needs to have consistent versions of all the implementation parts.
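
The sketch below illustrates the configuration-management point in Python (the artifact names and the versioning scheme are hypothetical and deliberately simplified): a release baseline pins one revision of each artifact, and a simple check flags an implementation artifact that reflects an older revision of its design.

    # Minimal sketch (hypothetical artifact names, simplified versioning): each
    # artifact has a current revision; implementation artifacts record which
    # revision of their design they reflect, and a release baseline pins one
    # revision of every artifact.

    artifacts = {
        "control-design": {"revision": 4},
        "flight-sw":      {"revision": 14, "implements": ("control-design", 3)},
        "user-manual":    {"revision": 2},
    }

    release_baseline = {"control-design": 4, "flight-sw": 14, "user-manual": 2}

    def stale_implementations(artifacts):
        """Implementation artifacts whose referenced design has since been revised."""
        for name, info in artifacts.items():
            if "implements" in info:
                design, built_against = info["implements"]
                current = artifacts[design]["revision"]
                if current > built_against:
                    yield name, design, built_against, current

    def baseline_is_current(artifacts, baseline):
        """True if the baseline pins the current revision of every artifact."""
        return all(artifacts[name]["revision"] == rev for name, rev in baseline.items())

    for name, design, old, new in stale_implementations(artifacts):
        print(f"{name} reflects {design} rev {old}; the design is now at rev {new}")
    print("baseline current:", baseline_is_current(artifacts, release_baseline))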

7.3.2 Tasks

These are the individual activities that team members perform. The tasks use and generate artifacts. I rely on the colloquial definition of “task” and do not try to formalize the term here.

Systems projects usually have vast numbers of tasks. These include tasks for designing, building, and verifying the system; they also include tasks for managing the project, reviewing and analyzing parts of the system, and approving designs and implementations.

There are usually far more tasks to be worked on than people to do them. Tasks also usually have dependencies: something needs to be designed before it is implemented, or one part of the system should be designed before another.

Tasks, in themselves, need to be known and tracked. People on the team need to know what they can be working on, and who is doing other tasks that might relate to their work. Managers need to be able to track what is being done, what tasks are having problems, and ensure that tasks are coordinated and completed.

Operations, discussed below, addresses questions of what tasks are needed and which ones should be performed in what order.

7.3.3 Team

These are the people who do the tasks. They are not an amorphous group of indistinguishable and interchangeable parts; each person will have their own abilities and specialties. Each person will also have their own authority, scope, and responsibilities.

The team should be organized. This means:

In addition, the team needs to be staffed with enough of the right people to get work done. This means that people with management responsibility need to know who is on the team and their respective strengths, as well as the workload each one has and the overall plan for moving the project forward.

7.3.4 Tools

These are things that the team uses to get its tasks done. The tools are not part of the system being produced, though they are often systems in their own right. An end user of the system being produced will not use these tools, either directly or indirectly.

The tools include things like:

7.3.5 Operations

Operations is about organizing the work that the team does. Its primary function is to ensure that the right tasks are done by the right people at the right time.

Operations sets up “a set of norms and actions that are shared with everyone” in the project [Johnson22, Chapter 2]. It gives people in the team a shared set of rules and procedures for doing their work, and it uses those procedures to manage a plan and tasks that coordinate that work. When people share a set of rules and procedures, they can each have confidence in how others are working and in the results that others produce.

There are two primary objectives for operations: making sure the work proceeds efficiently, and ensuring product quality. Operations has secondary objectives, including keeping the organization informed of progress and needs.

Ensuring the project runs efficiently implies several things.

Ensuring quality means:

[Figure: objectives for operations]

I look at operations through the lens of the tasks that people on the team will do. Operations is about tracking what tasks need to be done, who is working on them, and how those tasks are going. Operations is, in a way, a feedback control system that keeps the flow of tasks running smoothly.

Operations is more than overseeing tasks, however. It is equally about guiding the team through its work, especially in how people should coordinate their efforts. This starts with setting out the guidelines for how work should get done: procedures and process. That leads to planning, which sets the longer-term direction for the project’s work and allows project management to check whether the work is proceeding well. Planning leads to managing the work being done at the moment. All of these depend on information that supports decisions that have to be made.

I use the following model to define the parts that make up operations. This model has a flow from a project life cycle, which is established early in the project and changes rarely, through parts that organize the work, onward to day-to-day tasking. I explain this model in more detail in Chapter 20.

Life cycle. This defines the overall patterns of actions that the team will perform as it does the project. It defines phases of work and how one phase should happen before another. A typical phase is made up of many tasks; it covers (for example) the work of designing some component. The life cycle also defines milestones, which provide planned times when checks on the work in a phase are done.

A life cycle pattern says things like: “First work out purpose, then specifications, then design, then implementation. At the end of each of these phases, have a review with one person designated to approve moving forward.”

[Figure: an example life cycle pattern, with phases and milestone reviews]

There are many different life cycle patterns, and usually an organization or a project will need to pick one—and then customize the life cycle to meet its specific needs. Sometimes the life cycle will be determined by external requirements; for example, NASA defines a common life cycle for all its projects [NPR7120].
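
As a small illustration in Python (the phase names, milestones, and approvers are entirely invented, not a prescribed pattern), a customized life cycle can be written down as an ordered set of phases, each closed by a milestone review with a designated approver.

    # Minimal sketch (invented phases and approvers): a life cycle pattern as an
    # ordered list of phases, each closed by a milestone review that a designated
    # person must approve before the next phase starts.

    life_cycle = [
        {"phase": "purpose",        "milestone": "purpose review",         "approver": "product lead"},
        {"phase": "specification",  "milestone": "specification review",   "approver": "chief engineer"},
        {"phase": "design",         "milestone": "design review",          "approver": "chief engineer"},
        {"phase": "implementation", "milestone": "verification readiness", "approver": "project manager"},
    ]

    def next_phase(approved_milestones):
        """Return the first phase whose closing milestone has not yet been approved."""
        for entry in life_cycle:
            if entry["milestone"] not in approved_milestones:
                return entry["phase"]
        return None   # all phases complete

    print(next_phase({"purpose review"}))   # -> "specification"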

Procedures. While the life cycle defines in general what to do, the procedures define how to do some tasks. They provide specific instructions for how to do particular actions or tasks. The instructions might take the form of a checklist, a flow chart, or a narrative.

People on the team need to know how to do things that require coordination. While team members should be able to do most of their work independently, at some point they will need to work together. The work will go more smoothly if everyone understands when they need to work together and how to do it.

There are also some tasks that are procedurally complex, even when only one person is involved. For these tasks it is helpful to have written down the steps to perform—which serve in effect as a checklist.

Procedures should be defined for tasks where getting the actions right is critical or where the task is complex. In the example below, checking a document artifact into a repository is simple, but needs to be done correctly. Performing a design review and approval has potentially many steps to go through: communicating the design to others for review, an approval decision by a designated team member, and changing the status of the design documents to show that the design has been released. When the life cycle defines a point in the project when something should be checked, such as during a review, procedures ensure that all the needed checks actually happen.

[Figure: example procedures: checking a document into a repository, and performing a design review and approval]
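
As an illustration in Python (the steps are assumed from the design review example above, not taken from any particular organization’s process), a written procedure can be as simple as an ordered checklist that makes it obvious which steps remain before the review can be closed.

    # Minimal sketch (assumed steps): a procedure as an ordered checklist, so a
    # review cannot be closed out while required steps remain open.

    design_review_procedure = [
        "circulate the design to reviewers",
        "collect and resolve review comments",
        "obtain approval from the designated approver",
        "mark the design documents as released in the repository",
    ]

    def remaining_steps(procedure, completed):
        """Return the steps that still have to be done, in order."""
        return [step for step in procedure if step not in completed]

    completed = {"circulate the design to reviewers",
                 "collect and resolve review comments"}
    for step in remaining_steps(design_review_procedure, completed):
        print("still to do:", step)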

Documented procedures help the team perform tasks accurately, helping to make sure that steps aren’t missed. They also help the team do those tasks in compatible ways so that one person’s work can build on another’s.

I have seen teams that try to operate without some ground rules for working together. This can work quite well for teams up to three or four people, and when the artifacts they produce do not need high assurance (that is, when what they produce is not safety- or security-critical). On larger teams that have not written down their basic process rules, I have always seen failures to communicate or consult. These failures sometimes led to errors in the system that had to be corrected later once found. Sometimes they led to one person damaging another person’s work, requiring time and effort to recreate overwritten designs.

Documented procedures also provide a way for the project to learn and improve. If some procedure is not working well, the team can identify which procedure is the problem and then change it. As long as team members then follow the revised procedure, the team’s ability to work should improve over time. Contrast this to not documenting a procedure: some people may have opinions on how to do it better, and they may start doing it the new way, but not everyone will know about the change, and people may forget it after a little while. This makes learning slower and less reliable.

Plan. The plan defines the overall intended path forward to a completed system, along with selected milestones along the way. It is a current best estimate of the general steps needed to move the project toward that goal.

A plan records the approach the team intends to take to build the system. It lays out the phases of work expected, in coarse to medium granularity. In doing so, it records decisions like the flow from specification to design to implementation to verification. It records when the team decides to investigate different ways to design some component, perhaps prototyping some of the ways. It documents expected dependencies and parallelism.

[Figure: an example plan, showing phases of work and the dependencies among them]

The plan is, therefore, a record of how parts of the life cycle pattern are applied to this specific project. Just as there are many patterns that a project can choose to use, there are many different ways to organize the project’s work. I discuss these choices in depth in Chapter 20.

A plan is not necessarily a schedule. A schedule is usually taken to mean a sequence of events with a high confidence of accuracy and completeness. A plan, on the other hand, reflects the uncertainties that come with developing a complex system. In the beginning, the plan can be specific about a few things in the near term but must be vague about the longer term until enough design has been completed to fill out later work. As a project progresses and more and more becomes known, the plan should converge to something like a schedule.

A plan is broader than a list of specific tasks. It consists of a number of work phases, and dependencies among them. This information then guides the specific tasks, as discussed in the section on tasking below.
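
The sketch below illustrates the distinction in Python (the phases and duration ranges are hypothetical): a plan records coarse phases, their dependencies, and deliberately wide duration estimates that narrow as the design firms up, rather than the fixed dates of a schedule.

    # Minimal sketch (hypothetical phases and estimates): a plan as coarse phases
    # with dependencies and duration *ranges*, not the fixed dates of a schedule.

    plan = {
        "concept":         {"depends_on": [],          "weeks": (3, 5)},
        "prototype pump":  {"depends_on": ["concept"], "weeks": (4, 10)},   # trade study still open
        "detailed design": {"depends_on": ["concept"], "weeks": (8, 20)},
        "integration":     {"depends_on": ["prototype pump", "detailed design"],
                            "weeks": (6, 16)},
    }

    def finish_range(plan, phase):
        """Crude roll-up of optimistic and pessimistic finish times, in weeks from start."""
        deps = plan[phase]["depends_on"]
        lo_start = max((finish_range(plan, d)[0] for d in deps), default=0)
        hi_start = max((finish_range(plan, d)[1] for d in deps), default=0)
        lo, hi = plan[phase]["weeks"]
        return lo_start + lo, hi_start + hi

    print(finish_range(plan, "integration"))   # -> (17, 41): a plan, not a schedule

As open questions are resolved, the ranges tighten and the plan converges toward something like a schedule.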

Plans are used in prospect, in the moment, and in retrospect. They should provide guidance on what direction the work will likely go in the future, even when that direction has uncertainty. They are used in the present to track what is happening now. They provide history of what has been done, to understand how the team’s work compares to predictions and to provide accountability for everyone responsible for working on the project.

I have never encountered a project that had a single plan for the whole duration of the work. Plans have always been dynamic. Early in the project, we knew that we needed to develop a concept for the system but did not yet know enough to sketch out the work involved in building that concept. Later we had a general structure for the system, but there were technical questions to resolve; once resolved, we would know what we were building. Later in the project, we would find defects or we would get a change order, resulting in unanticipated work.

Tasking. This is the day-to-day definition of tasks to be done, their assignment to team members to perform, and tracking their progress.

Tasking involves continuous decision-making: the choice of which tasks should be performed next, or which tasks should be interrupted to deal with higher-priority tasks. These choices merge several streams of potential tasks: ones that derive from the nearest parts in the plan; ones made newly urgent by a change in what is known about the system; ones about fixing errors that have been discovered; and tasks related to new outside requests.

[Figure: streams of potential tasks merging into tasking decisions]

The team will need to keep track of both the potential tasks and the ones that have been assigned and are being worked on. This implies record-keeping artifacts.

The criteria for deciding about tasks should be encoded in procedures, as discussed above. The procedure for choosing tasks can be viewed as a control system that responds to project events by adjusting the set of tasks assigned for work, with the aim of making the project’s execution run efficiently. “Efficiently” means meeting the goals set out above for operations: ensuring that the right work is done, that people aren’t blocked from getting work done, and that the work follows the ordering and dependencies needed for high-quality work.
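
Here is a minimal sketch of one such tasking decision in Python (the stream names and the priority rule are assumptions for illustration): candidate tasks from several streams are merged, and the most urgent unassigned tasks go to whoever is free.

    # Minimal sketch (illustrative streams and priority rule): one tasking decision
    # that merges streams of candidate tasks and assigns the highest-priority ones
    # to the people who are free.

    import heapq

    def next_assignments(streams, free_people):
        """streams: {stream name: [(priority, task), ...]}, lower number = more urgent.
        Returns (person, task) pairs for this tasking decision."""
        candidates = []
        for stream, tasks in streams.items():
            for priority, task in tasks:
                heapq.heappush(candidates, (priority, task, stream))
        assignments = []
        for person in free_people:
            if not candidates:
                break
            priority, task, stream = heapq.heappop(candidates)
            assignments.append((person, f"{task} [{stream}]"))
        return assignments

    streams = {
        "plan":          [(3, "draft pump interface specification")],
        "defects":       [(1, "fix overheating fault seen in bench test")],
        "change orders": [(2, "assess customer change request")],
    }
    print(next_assignments(streams, ["Ana", "Ben"]))
    # [('Ana', 'fix overheating fault seen in bench test [defects]'),
    #  ('Ben', 'assess customer change request [change orders]')]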

How the tasking control system works depends on the development methodology used in the project. Agile development, for example, often focuses on making tasking decisions at regular intervals (for each “sprint”); other methodologies focus on making tasking decisions continuously.

Support. The decisions made during operations take into account several kinds of supporting information. These include:

Sidebar: Resource-constrained projects

Traditional project planning approaches grew out of projects, such as building construction, that focus first on time and budget. This kind of project treats the completion date as the driving factor in organizing work, and assumes both that as many workers can generally be brought in as are needed to complete the work quickly and that parallelism between tasks is limited primarily by dependencies between tasks. For example, in building a house, one contractor typically brings in a team to frame the structure, while another brings in a team to add the electrical wiring or plumbing. Each of these teams can bring in as many people as needed to get the work done, and then those people go on to another construction project elsewhere when their part is done.

This model of project planning leads to tools organized around a graph of dependencies between tasks. These tools usually provide analyses like critical path analysis, which shows the longest path through the graph of tasks and therefore the hardest constraint on how quickly the work can be completed. Planning the project well often hinges on understanding the dependencies between tasks and the critical path through them.
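
For readers who have not used it, here is critical path analysis in miniature, in Python (the tasks and durations are made up): the critical path is the longest chain of dependent work, and it bounds how quickly the project can finish no matter how many workers are added.

    # Minimal sketch (made-up tasks and durations): the critical path through a
    # task dependency graph is the longest chain of dependent work.

    tasks = {
        "foundation": {"days": 10, "depends_on": []},
        "framing":    {"days": 15, "depends_on": ["foundation"]},
        "wiring":     {"days": 7,  "depends_on": ["framing"]},
        "plumbing":   {"days": 9,  "depends_on": ["framing"]},
        "drywall":    {"days": 5,  "depends_on": ["wiring", "plumbing"]},
    }

    def critical_path(tasks, task):
        """Return (total days, path) for the longest chain ending at `task`."""
        best_days, best_path = 0, []
        for dep in tasks[task]["depends_on"]:
            days, path = critical_path(tasks, dep)
            if days > best_days:
                best_days, best_path = days, path
        return best_days + tasks[task]["days"], best_path + [task]

    print(critical_path(tasks, "drywall"))
    # -> (39, ['foundation', 'framing', 'plumbing', 'drywall'])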

Most complex technical system projects, on the other hand, do not fit this model well. Each person working on the project needs to understand the context of their work, and there is usually a substantial cost to add someone to the project—largely in them learning about how the project works and how the system is organized. The collection of trained people on the team constitutes a valuable resource that the organization tries to keep around to maintain the system or to work on similar systems.

This situation leads to a different approach to planning work. While dependencies are certainly important, there are often many tasks that any one person can work on (and it is common to expect some degree of multitasking). In this case, getting the order of operations precisely right is not as important. It is more important to ensure that everyone can stay busy and that any major dependencies are accounted for.

7.4 Using this model

This chapter has presented a model for thinking about the work involved in making a system. This model, in itself, does not prescribe any particular way of managing building a system; it only names the topics that need to be addressed and provides some objectives by which an approach can be judged.

In Part IV, I go into more depth about each of the elements in this model.

Those who manage a project will need to decide how they will go about organizing their work. As I noted earlier, how a project is organized and run is itself a system, and the techniques discussed in this book apply as much to designing and operating the project’s operations as they do to designing, building, and operating the system product. Chapter 6 and Part III discuss the model for what a system is.

Part V discusses how the work of building a system can be organized around the life cycle of a project. Chapter 21 introduces the idea of a life cycle. It also introduces the idea that a life cycle model provides a basis for working out the tasks that need to be done to build the system. Subsequent chapters discuss each of the phases of a life cycle, along with the artifacts and activities that go into each one.

Part XI discusses ways to organize the team that will do the work.

Part XIII presents approaches for planning and organizing the tasks that need to be done.

Chapter 8: Principles for a well-functioning project

3 May 2024

I have been a part of many projects. These projects built a wide range of systems, including specialized small business record keeping, local government IT applications, low-level graphical user interface tools, large storage systems, spacecraft systems, and ground transportation.

Some of these projects went well. They produced systems that were useful for their customers. The systems held up over many years of use, working correctly and supporting their users in ways they needed. The projects proceeded (fairly) smoothly: no major unexpected flaws, teams that worked together well, completion within close to the expected time and resources.

Paraphrasing Tolstoy, all well-functioning projects are alike; each project that has problems has problems in its own way [Tolstoy23]. Though there are several ways for a project to go well, there are far more ways they can go wrong—and it takes deliberate effort to keep a project on the path that goes well.

I have watched many of these projects struggle through problem after problem, most of them self-inflicted. The causes have included poor team organization, lack of a coherent system design, lack of taking the time to think through designs, lack of design, internal organization politics, and many others. The struggles led to canceled projects, startups that had to raise extra funding rounds and missed their market opportunity, and unsafe systems being used in public spaces—often with consequences not just for the people building the system but for their funders and for society at large.

This book was inspired by observing these problems and finding ways to do better the next time.

So what does a project need to do to function well? To develop a useful, safe system, on a reasonable schedule and budget? To keep its team functioning at a sustainable pace, without internal disruptions? The rest of this book seeks to provide some answers.

My general principles fall into four categories:

  1. The project or organization leadership;
  2. The tasks for building the system;
  3. The plans for building it; and
  4. The team that builds it.

For each of these, I will list some principles I have found important to making a project run well, or to keeping it running well.

8.1 Project leadership

I have watched many projects, especially in startup companies, try to create a team of the best specialists: executives who are skilled at fundraising and external relations; an HR person who has a track record at recruiting; someone with marketing skills and connections; and a few engineers who can build the key technical parts of the system. Most of the projects staffed only with such specialists have either failed or had serious problems with execution.

These projects had a gap at the center of the work. Everyone is responsible for some piece, but there is no one whose responsibility is to link the pieces together: to build either the team or the product as a coherent system. People in the team generally don’t really understand each others’ work. They have trouble finding how to work with each other. The executives don’t understand the work or the team, and issue instructions that don’t make sense. The team makes poor technical decisions because no one understands how the artifacts they are building must work together.

This gap leaves three needs unmet. First, there is communication and translation between the executive team and the engineering team. Second, there is organizing and running the engineering team. And third, there is maintaining a systems view of the team’s technical work.

8.1.1 Principle: Communication and translation

Have at least one person in the organization who can communicate with people in the executive team, marketing, and engineering, and translate among them.

The executive team is, in most organizations today, a collection of specialists in running the company as a whole: corporate activities, finance, legal, public relations, marketing. I have found this to hold equally for independent companies, especially startups, and for projects that are part of larger organizations. The details may differ but the roles are largely similar.

The engineering team is also mostly a collection of specialists in one area or another, according to the needs of the system being built. They will understand parts of the system, but few of them are tasked with making all the parts cohere so they work together. Most of the people on the engineering team contribute through specific, deep skills.

The communication need is to represent these parties to each other. The executive team is responsible for setting the overall direction for the project. The engineering team needs this direction translated into actionable directions. The executive team is also responsible for high-level safety and security decisions (e.g. what kinds of safety hazards the company will address in its system products), and those decisions then need to be translated into the safety and security engineering processes. In the other direction, the engineering team needs to provide feedback to the executive and marketing teams on the feasibility and cost of different possible feature or market decisions the executive team could make.[1] The project management part of the engineering team also has the information about how work is progressing and can provide information about the time and people needed to reach different milestones.

8.1.2 Principle: Provide staff to run the engineering team’s operations

Designate at least one person to oversee how the team building the system operates. This person (or people) organize the team, and adjust how it operates as the team grows and the work progresses.

An organization is a system, and a team of more than a handful of people will not self-organize in a useful way. I will argue below that this system needs careful design to work well.

I consulted with a small startup that did not have someone responsible for organizing the engineering team. The startup had begun with a very few people, who were figuring out the basics of what their company could build. The co-founders did not create an organization below the executive level; instead, they expected that they could all just work together and figure it out. And, predictably, they did not figure it out once they added a few more people to the team and had to specialize.

Johnson [Johnson22] discusses how to organize a growing company, and I recommend her work to the reader. She presents many ideas about what to do to organize a company’s operations. While that book focuses more on the human-oriented parts of operations, such as hiring and performance evaluation, the ideas it presents provide a solid foundation for parts specifically about engineering, such as how to organize design and implementation verification (which are as much a human activity as a technical one).

An organization that is going to successfully build a complex system will need to designate someone as having the primary responsibility for creating and maintaining the team’s structure and patterns of behavior. Either that, or they need to get improbably lucky.

8.1.3 Principle: Systems view of the system

A team building a complex system must have at least one person who is responsible for the system as a whole, not just its parts.

A coherent, working system does not occur by chance. It requires deliberate effort for a collection of parts to work together, and for the collection to fulfill the purpose of the system.

This deliberate effort can be achieved, theoretically, by a group of uncoordinated specialists. However, this amounts to the Infinite Monkey Theorem, where enough workers and enough time can produce any system. For realistic systems, the time required might be many times the projected lifetime of the universe.

In reality, the majority of the engineering team is responsible for parts of the system, not the whole thing. It is not the job of these people to be responsible for the systems view of the whole; nor is it usually their training or experience.

Building a system requires coordination so that the parts work together. This can be achieved by designating one or a few people to be responsible for the coordination, or by having the parts-builders work by consensus. Work by consensus requires skills and time that few people have, unless the team has no more than perhaps five or six members.

Building a coherent system also requires having a way to measure coherence and satisfaction of system purpose. If a team is to work by consensus, all members of the team must have a consistent understanding of these criteria. If a smaller group is responsible for the system as a whole, then fewer people are required to share this understanding.

The shared understanding starts with the purpose for the system. The definition of the system’s purpose is outside the engineering team’s scope; it comes from the customer or their proxies by way of marketing roles (Section 6.2 and Chapter 9). The translation of information about customer needs into an actionable system purpose is the responsibility of a system role. This includes documenting the system purpose, developing a concept of the system, and writing down top-level system specifications. In doing so, the role works with the executive and marketing teams to confirm that the purpose and concept as developed match what the customer and organization actually intend.

The systems role is responsible for ensuring that the component parts of the system fit together into a coherent system. To meet this responsibility, the systems role is responsible for the design of the high-level decomposition of the system into parts, and how those parts are related—the functional and non-functional relationships (Section 6.5 and Chapter 12). While the systems role delegates the work to design and build the components, the role does check that the results match the specification of how the components interact. The systems role also guides the order of work, especially for how to plan integration.

8.1.4 Principle: The team is a system

A well-performing team is deliberately designed to have a structure that gives each member incentives and support to work together. The team’s leadership establishes the design, and monitors the team’s function to adapt the team structure when needed.

An effective team does not happen by accident. When a team is not given a structure and rules about how to work together, they will find ways to work. They will build up habits in response to a few specific early needs—and those habits will not make for a team that communicates well, cooperates well, or makes good systems decisions.

When medium to large teams try to self-organize, they react to problems they face immediately, and each person determines their response based on their own values and self-interest. The team members are not trained or incentivized to plan the team’s organization for future needs; instead, they find ways to work through individual problems as they come up. The team members in general do not have a view of the entire effort that will be needed to build the system, and so they find solutions based on their specific needs.

Team work exhibits variations of collective action problems. [Olson65] These problems occur when a group must work together; each member of the group must contribute in some way, and in return everyone in the group receives some benefit. The optimal strategy for an individual is often at odds with the optimal strategy for the common good. Many commonly-known cooperation problems, such as the tragedy of the commons or the prisoner’s dilemma, are kinds of collective action problems. (In fact an engineering team represents a particularly complex kind of collective action problem, because the contributions of different group members can combine non-monotonically: the value of one person’s contribution to the common good can be negated by another’s contribution.)

In other words, the natural tendency for a group is to form an organization that is reactive to immediate needs and to individual objectives, rather than the long-term objectives of the project as a whole.

Creating an effective team is, therefore, a deliberate act. It involves working out what the team needs to do as a whole, and then designing a structure for the team. That structure should address:

Maintaining a team’s effectiveness is also a deliberate act: good project leadership monitors how the team is doing and adapts organization or processes when needed. The team organization, or its processes, or its role assignments may work well for a while, but not fit the team’s needs as well later. The project’s leadership may set up a team organization or process and then find it doesn’t work as well as expected.

The organization of a team can be evaluated against the objectives in Section 7.3.3: how well people know how they fit into the organization and how that affects the actions they take.

I discuss matters of designing a team in Chapter 19 and in Part XI.

8.1.5 Principle: Team habits

A team with good habits and culture can get work done. A team with poor habits will not, except by unlikely random chance.

Whether a team follows procedures and processes depends on whether following them is the norm for the team.

Teams follow habits. Habits and norms provide stability to team members: when they know what to expect, they can get on with their work. This creates an incentive to keep following habits and not change them.

Establishing good habits at the beginning of a project is not difficult. Changing habits later is quite hard and rarely successful. The leadership of a team has one opportunity to set up a team to follow a process without undue effort. When they squander that opportunity, the project has difficulty from then on. If people in a team do not have a de jure process to follow, they will work out ways to get things done, and those habits will become the default way they work. Those habits are likely to have been worked out in reaction to a few specific, immediate situations; they won’t account for the indirect ways that one piece of work affects another, and thus will not meet the project’s needs well.

It is possible to change a team’s habits after the fact. However, it takes a lot of time and effort. The transition from one way of working to another is slow, as people follow their old habits without thinking until new ones set in. People will need constant reminders and incentives to change their behavior. There will be a period when people are doing a mix of old and new, which can increase chaos for a while (and often creates extra work to clean up the differences). People will feel extra stress, and often there will be a decrease in morale or civility in the team until they settle into the new norm.

Most of the projects I have worked on over the years have been about innovation. The people who start such projects do so because they are excited about what they can build, whether the technical aspects or the market aspects. They are motivated to get moving as quickly as they can, and they usually try to make a prototype or do a demonstration as soon as possible. They are not excited about the work of crafting a team; if they need that, they will get to it later when they have the prototype built, or when they have the next funding round…

This tendency is often exacerbated by the way some funders behave. They reward market opportunity and technical originality, which incentivizes a team to build the market case and technology demonstrations as quickly as possible. Funders rarely reward or even evaluate whether the project leadership has the capability to form a well-functioning team. When the funder does not value a team’s ability to execute effectively and efficiently, the leadership will not put the effort into crafting the team.

A project’s leadership must incentivize and model following processes in order to build a team’s habits. I am aware of a company that set out anti-corruption processes, including ethical standards and a hotline for reporting violations. The leadership did not, however, make it clear to the employees how reports would be acted on, and there was no demonstration of the standards being enforced. The employees correctly concluded that the leadership was not serious about enforcing the standards, and this led to significant internal theft.

8.1.6 Principle: Keep it lightweight and actionable

People will use processes that they can figure out how to follow and that clearly give them benefit. Don’t make processes more demanding than the team can handle.

People will generally follow prescribed practices and procedures as long as 1) they know about them; 2) they understand them well enough to perform them; and 3) the practices have high value relative to the effort required.

The first aspect implies that processes and procedures are documented and organized in a way that team members can find them. This also implies that when people join the team, they are taught how to find and use them.

A practice or procedure must be both clearly written and actionable for people to understand it and use it. I have encountered “plans” or “procedures” on multiple projects that amounted to a list of aspirations, rather than a specific set of actions that someone could follow. In one example, a security incident response procedure said things like “we will contact the responsible parties”, without naming who the responsible parties are (or even better, listing them with contact information). Had there been an actual incident, vague statements like this one would have led to time spent figuring out who the responsible parties were, and likely coming up with a wrong answer when under the time pressure of trying to resolve a critical incident.

A process or procedure that requires too much time or effort will lead people to try to create workarounds, usually subverting the reason that a procedure was established. This is the problem of a procedure that people perceive as too “heavy”. Keeping procedures as simple as possible will help. At the same time, some work is simply complicated, perhaps needing several people involved because it affects all of them. When some work is necessarily complex, it is vital to clearly document the process so that everyone involved understands both their own role and what the others involved will be doing.

I will discuss these topics more in Chapter 20, and especially in Section 20.8.3.

8.2 System-building tasks

Most engineers understand the need to use good technical judgment as they build a part of a system, but it is just as important to follow good practices in how the team approaches the work.

8.2.1 Principle: Start with a purpose before doing work

Understand why something is being built—its purpose—before trying to design and build it.

This is one of the most important principles in this book, and it applies in a great many ways.

“Purpose” here means the objectives for some work, the need that is to be met by doing the work or the reasons that it is worthwhile to spend the time and resources involved.

If someone starts designing or building something without understanding the purpose of the work, it is unlikely that what they build will actually meet the need that caused them to start the effort. And even if they do meet the need, perhaps by focusing on the purpose part way through the work, they are likely to have spent time and resources in false directions.

When someone takes on a task, whether to build part of the system or to oversee team operations, it is that person’s responsibility to ensure that they accurately understand the purpose of the work. Ideally they will be told the purpose as part of the task, but the person is still responsible for confirming that they correctly understand the purpose. I have found that taking explicit steps to confirm understanding saves time and effort, even for small tasks.

At the same time, the person who defines a task is responsible for ensuring that there is a clear purpose to the work and for communicating that purpose to whoever takes on the task. In other words, establishing the purpose for work is an act of communication between the person defining the task and the person doing it.

This principle applies to building a whole system. As I discussed in Section 6.2, a system needs a purpose—a customer need, for example—that it will fulfill. This purpose originates with the customer, or whoever will use the system, and with the value that the system will provide them.

The principle also applies to building components of a system. Each component (Section 11.2) has some role in the system: functions, behaviors, or properties that it should have that contribute to the system as a whole meeting its high-level purpose.

Other work also should have a purpose. Organizing the team, maintaining the project plan, and reviewing a component design are all tasks that have purposes. Someone doing these tasks should understand why the work is being done, and they should ensure that how they do the work addresses that purpose even if the associated procedures don’t spell out every step involved.

I argue in an upcoming principle that successful projects perform checks to ensure that completed work correctly fulfills its purpose. Without a clearly-defined purpose, it isn’t possible to determine whether a design or implementation or plan is correct or accurate.

I discuss how purpose fits into a system-building project throughout the rest of this book. I address the purpose for a system in Chapter 9. Each chapter in Part IV, on how to make a system, discusses the purpose of steps in building a system. As I present more specific topics, such as specifications (Chapter 33) or designs (Chapter 35), I present the purpose for that aspect of system-building before talking about what it is or how it works.

8.2.2 Principle: Evaluate tools before adopting them

Investigate whether tools, procedures, methodologies, designs, or implementations fit the project’s purpose before adopting them.

Every complex system is different from others in some way. The differences may be technical, such as how some component must behave, or they may be operational, such as the kind of team, the organization hosting the project, or the customer’s needs.

Differences mean that things taken off the shelf may or may not address the project’s need. An off-the-shelf electronics board might be a good fit, or it might not be available within the time needed, or it may lack a key security feature, or it may have reliability features that the project’s design does not need (but that do not interfere with how the board will be used). Similarly, a development methodology might address the project’s need for moving quickly and being flexible, but it might not work for a project’s distributed team.

In many cases an off-the-shelf methodology or design can be used in many different ways. The team may need to make choices about which of those ways are helpful for this specific project, and may need to adapt a procedure or methodology to fit what this project needs.

A well-functioning project will evaluate something that can be adopted, whether it is a component design or a procedure or a tool, against what the project needs that thing to be. Something that might be adopted can be measured in terms of the benefits of using it, the costs of adapting it to meet the project’s needs, and the costs of using the thing without adapting it. If the benefit outweighs the costs, then the thing can be used. If the thing does not quite meet the project’s need but can be adapted, then an investigation will reveal how to adapt it.

Sometimes a project will be obliged to adopt a process or use a component that is not a good fit. In that case the thing should be evaluated so that the team has a clear-eyed understanding of what problems could arise, and they can work out mitigations to avoid the worst problems.

This principle has a serious risk: that it will become an excuse for the Not Invented Here syndrome. No project has the time or resources to invent everything from scratch—especially when reinventions often lose sight of the experience that has gone into existing procedures or components. A team has to balance using tools that are pretty good but not perfect against the cost of inventing from scratch.

The idea of satisficing applies here: adopting a solution that is good enough to satisfy a need, without attempting to find a perfect one. Writing of adapting buildings, Brand observes:

The solutions are inelegant, incomplete, impermanent, inexpensive, and just barely good enough to work. The technical term for it, which arose from decision theory a few decades back, is “satisficing”. It is precisely how evolution and adaptation operate in nature.

Even after generations of satisficing, the result is never optimal or final. […] The advantage of ad hoc, make-do solutions is that they are such a modest investment, they make it easy to improve further or tweak back a bit. [Brand94, page 165]

8.2.3 Principle: Take care with build-versus-buy decisions

Carefully evaluate each choice of whether to design or build something within the project, or acquire it from outside. Be particularly careful about the team’s ability to accurately make this evaluation.

Projects often have choices about whether to design and build something themselves, or to acquire it from somewhere outside the project.

Too often, the choice is made without deliberation. When the wrong option is chosen, the cost can be significant: spending resources to acquire something that doesn’t work well, or to build something that is not very good.

There are reasons to choose to build something inside the project. These include:

There are also reasons to acquire a component.

Sometimes there are overriding concerns in making the decision. If the team does not have someone with the skill to develop the component, it will have to be acquired. If no outside organization offers a component that fits, it will have to be built. If the time to build is too long, the component will have to be acquired; if the lead time to acquire it is too long, it will have to be built.

Other times the decision depends on the costs and benefits of each option.

Two cost considerations are often overlooked. First, a custom-built component can be made a perfect match for the system’s needs, while an acquired component may have to be adapted or may have unneeded features (which can become a liability). The cost of adaptation has to be considered in addition to the cost of acquisition. Second, a custom-built component carries an opportunity cost as well as the direct cost of building it. If a custom-built component is not essential to the system purpose or the related business purpose, then the resources used to build it might be better spent on something more central to the purpose.
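As a simple illustration of weighing these overlooked costs, here is a small Python sketch; every figure and cost category in it is an invented assumption, not a recommended costing model.

# Hypothetical build-versus-buy comparison. Every figure is invented for
# illustration; a real evaluation would use the project's own estimates.
buy = {
    "acquisition": 40_000,   # purchase or license price
    "adaptation":  25_000,   # effort to adapt it to this system's needs
    "integration": 10_000,
}
build = {
    "design_and_build": 90_000,
    "verification":     30_000,
    # Opportunity cost: the same engineers could have spent this effort
    # on work more central to the system's purpose.
    "opportunity":      35_000,
}
print("buy:", sum(buy.values()), "build:", sum(build.values()))

In this made-up case the acquired component comes out cheaper even after paying to adapt it; different assumptions can flip the answer, which is why the evaluation needs to be deliberate.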

Teams, and individual team members, need to consider their ability to make an objective build-versus-buy decision. I have observed many people who chose to build something new not for sound technical or business reasons, but because they were excited about building that thing. I have seen other cases where someone decided to acquire a component because they were not interested in the effort required to design and build it well. Worse, too often the Dunning-Kruger effect [Kruger00] applies: the person making the decision is not aware of whether they have the knowledge to make an accurate decision, or of how their biases are driving it.

8.2.4 Principle: Follow the spirit, not just the letter

When a project has adopted a procedure or tool, that procedure or tool has a purpose. When using it, keep the purpose in mind and make sure that purpose is met—not just following a procedure or using a tool blindly.

A well-functioning project does not adopt its procedures or methodologies on a whim; it ties them to purposes. In organizations like NASA, the procedure standards represent several decades of accumulated experience. While a procedure may not be written in a way that makes the purpose and experience clear, those reasons exist behind what has been written.

I worked on a NASA project that reached its Preliminary Design Review (PDR) milestone. The team followed the long NASA checklist for what should be presented at that review. Unfortunately, the team did not keep in mind what the PDR was actually for: ensuring that the early, conceptual design coheres as a system and showing that the system is ready to proceed to steps that will involve greater investment. Instead they developed material that checked each box on the agenda, without addressing the system as a whole. The reviewers could tell that the design did not make sense; moreover, the review failed to reveal the actual problems that the design had.

A team should document the reasons or purposes for which they adopt a procedure or a tool. Similarly, each person on a team should put effort into understanding why the team has adopted procedures and tools.

8.2.5 Principle: Document things so there is a future

Document both how things work and why they work so that people can understand the system when they work with it in the future.

It is easy to want to design or implement at full speed, keeping focused on the immediate goal: getting the thing built.

That goal misses the larger purpose of building something—that the built thing meets its purpose and specification, and that it continues to do so as the system evolves.

In practice, the initial design and implementation of a component involves much less effort than is spent on checking that implementation, integrating it with other components, fixing bugs, and making changes later. A project that is building a system to succeed in the long term optimizes for all these other tasks, not just the initial design or implementation.

All these later tasks involve understanding specification, design, or implementation of a component. Understanding means not just being able to see the design or implementation artifact, but also knowing why the component is what it is. This includes documenting the rationales that led to significant decisions about the component. It also includes providing people a guide to understanding the component’s design or implementation, especially if there are subtle aspects to the component that are easy to miss if one is looking just at a design document or an implementation.

When someone is handed the code for some component and asked to change some behavior, and that person isn’t the one who initially implemented the component (or they are the same person, but it was a while ago), they begin by building up a mental model of how the component works. Once they have that mental model, they can work out how to change it. They will think of different ways they could make the change, and evaluate them to see whether the changes will have the effect they intend and will not have some other undesired effect.

Building up an accurate mental model involves working out the constraints that led to the component’s design, the major decisions about how the component is structured, and how different parts of the component work together to achieve its functions. This information is not encoded directly in software source code or mechanical drawings or circuit designs; those artifacts are the products of a process that worked through the constraints and decisions on the way to producing them.

The person who is tasked with changing a component, and thus with building up a model of how that component works, can get information in two ways: from documentation or by reverse-engineering the implementation artifacts. In practice it is usually best to do both. A circuit design is the truth about how an electrical component works, and so it is the most accurate way to learn about the implementation. However, a circuit design or software source code leaves out the rationale for why the design is the way it is. Having documentation about the design, about why the design is the way it is, and a guide to the implementation will help the person understand the component more accurately and more quickly.

Of course, having documentation only helps if that documentation is accurate. If the documentation doesn’t match how the component was actually implemented, then the documentation will lead someone astray when they try to learn how a component works.

There is a saying in agile software development that “the code should be documentation”. This is usually interpreted as “the code should be the only documentation”, which is not what the people who developed agile methodologies intended.[2] The point in the agile methodology is that software code is necessarily documentation, and so it should be written clearly enough that others can read and understand it.

I have experienced both the advantages of having good documentation and the disadvantages of having no or inaccurate documentation. Many years ago, I developed a multithreading package for a research system. That package included a peculiar thread-synchronization primitive tuned for that specific application; correct operation depended on some unobvious code in one place. It took some time to analyze the design to identify that constraint, and if I had not written it down I would not have remembered it correctly when I had to modify the package a year or two later. On the other side, on a personal project I built a responsive, single-page web application using a combination of JavaScript code running in the browser and Ruby code running on the server. I did not document the design, and when I needed to improve it after a couple of years I had to reconstruct the design. I spent much more time than I would have liked on that reconstruction.
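A minimal sketch of the kind of rationale note that saves this reconstruction effort; the component and the constraint described in it are hypothetical, not the code from either project above.

import threading

class HandoffGate:
    """Hypothetical one-shot handoff primitive (illustration only).

    Rationale (the part the code alone does not convey): release() may be
    called from an interrupt-style callback, so it must never block. That is
    why this class uses an Event instead of a Condition whose lock would be
    held across the notify; an earlier (hypothetical) design that blocked in
    release() could deadlock under rare timing.
    """

    def __init__(self):
        self._ready = threading.Event()

    def release(self):
        # Non-blocking by design; see the class docstring for the rationale.
        self._ready.set()

    def wait(self, timeout=None):
        # Returns True once release() has been called (or False on timeout).
        return self._ready.wait(timeout)

The docstring records a decision that the code cannot express on its own: why release() must not block. That is exactly the information a future maintainer would otherwise have to reverse-engineer.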

8.2.6 Principle: Build in checks

Make independent checks of all critical specifications, designs, and implementations a normal and expected part of project work. Define in advance who will do the checks and when they will do them.

Having one person check another’s work is a basic mechanism for maintaining quality, safety, and security in a system. It applies equally to technical work, such as verifying that a design matches specifications, and to project operations, such as checking that a procedure is working as intended or that team communication is flowing.

Note that this does not mean that developers can avoid writing unit tests or performing design analyses. They should be doing those, and independent checks should be done as well.

There are many advantages to performing reviews or checks:

There are two significant disadvantages that can lead to a team skipping checks. First, checks take time and effort. When a team is pressed for time or short-handed, it’s easy to let a check go by. Second, a review done poorly can feel like a lack of trust or an attack on someone’s work.

Nonetheless, checks and reviews are important enough that a well-functioning project will find ways for checks to happen.

Having checks be a built-in norm for the team helps address the disadvantages. If everyone knows that checks are going to happen, the time and effort involved will be planned for. People will notice if checks are being skipped, and will ask why—helping to ensure that the checks actually do happen. Separately, when everyone’s work is checked, it becomes easier to convey the sense that no one is being singled out or is not trusted.

I discuss how checking can be built into a project’s life cycle patterns in Chapter 20.

8.2.7 Principle: Work against cognitive biases

Take deliberate, ongoing actions to avoid the negative effects of cognitive biases, such as confirmation bias or team echo chambers, and missing or incorrect information.

The work of building a system involves making many complex decisions. These decisions are based on the information that the person making the decision has, along with their skills, experience, and biases.

Incorrect decisions can be made when people work from beliefs or biases that are inaccurate. This leads to concepts or specifications that reflect the errors, and from there to designs and implementations that do not meet system needs. There are many terms for these situations, including confirmation bias, echo chambers, and recency bias.

These errors arise from many different causes:

These biases can lead to serious system flaws when incorrect decisions are made about high-level system design or safety and security functions.

There is no one method that will eliminate these problems. Indeed, many of these problems are a necessary flip side to cognitive behaviors that have positive outcomes, such as group agreement and pruning a search space when making decisions.

A well-functioning team takes deliberate and ongoing steps to reduce the problems that come from cognitive bias. These address the problems from two directions: prevention, by making complete information available, and reducing occurrence, by building into the project’s procedures methods to avoid or catch problems.

A project can reduce the chances of cognitive bias issues by maintaining complete written records of key information. Information about customer needs (and how those were determined) and rationales for design decisions are most important. Completeness in designs and verification records also helps. Sharing changes to information widely, as well as documenting them in writing, helps keep team members from working from outdated assumptions.

Reducing occurrences of erroneous bias involves finding ways to see around the bias to information that would otherwise be ignored or dismissed. This almost always comes from finding a way to look at a problem from a different perspective. Training team members to take deliberate steps to falsify their own hypotheses gives each team member an improved perspective. Building in reviews where decision rationales are explained to people with different perspectives helps catch biased decisions before they cause errors. Designating someone to be a devil’s advocate in discussions about complex decisions makes it clear that the team takes the possibility of bias seriously.

Continuous training for team members, in their own disciplines and in related ones, improves their skills beyond what they learn by experience. Greater knowledge and skills help combat the kinds of cognitive bias related to the Dunning-Kruger effect. Training in related but different subjects improves open-mindedness, giving team members new perspectives to use in thinking through decisions.

Project leadership has an important role in avoiding problems that arise from bias. Good leadership models behaviors where the leader explicitly looks for falsifying evidence and alternative perspectives. The leadership has the ability to allocate effort to investigating decision alternatives and being the devil’s advocate in discussions. The leadership sets expectations for the rest of the team by inspecting decision rationales to ensure that steps have been taken to address possible biases.

8.3 Plan for building the system

Complex systems, with dense graphs of relationships between their parts, cannot be built without a plan. A project cannot get such a system built by following a random walk through the space of possible tasks. However, plans are often overdone: they try to lay out a definite schedule where in fact there are unknowns, and then scheduling crises follow when something runs long or over budget. A middle ground works better: a plan that remains honest about what is known and what is not, that allows flexibility as the project moves forward, and that still guides the work in a consistent direction.

8.3.1 Principle: Prioritize work by risk or uncertainty

Put effort into work that carries risk or uncertainty as early as possible.

Common project management practices advocate paying attention to the critical path: the set of tasks that must be completed on time in order for the project as a whole to complete on time. If any one of these critical tasks runs late, the project as a whole will be late. Each task has some measure of slack: the amount its start or finish can slip without delaying the end of the project. If a task has no slack, meaning it must start and finish on time, it is part of the critical path. Most projects have at least one sequence of critical tasks from the start (or from the present) to the end of the project.

This definition of critical path is useful but overly simplistic. It is useful because it gives a way to identify work that can put the project at risk, and once identified that work can get extra attention to make sure it goes as planned. The definition is simplistic because, at least in the basic formulation, it assumes that the graph of tasks and the duration and dependencies of each task are all known.
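As a concrete sketch of that basic formulation, the following Python example computes earliest finish times, latest finish times, and slack for a toy task network; the tasks, durations, and dependencies are invented, and the calculation assumes they are all known, which is exactly the simplification discussed above.

# Toy critical-path calculation with invented tasks and durations (days).
tasks = {
    # name: (duration, [predecessors])
    "spec":      (5,  []),
    "design":    (10, ["spec"]),
    "build_hw":  (20, ["design"]),
    "build_sw":  (15, ["design"]),
    "integrate": (10, ["build_hw", "build_sw"]),
}
order = ["spec", "design", "build_hw", "build_sw", "integrate"]  # topological order

# Forward pass: earliest finish time for each task.
earliest_finish = {}
for name in order:
    dur, preds = tasks[name]
    start = max((earliest_finish[p] for p in preds), default=0)
    earliest_finish[name] = start + dur

project_end = max(earliest_finish.values())

# Backward pass: latest finish time that does not delay the project.
latest_finish = {name: project_end for name in order}
for name in reversed(order):
    dur, preds = tasks[name]
    latest_start = latest_finish[name] - dur
    for p in preds:
        latest_finish[p] = min(latest_finish[p], latest_start)

for name in order:
    slack = latest_finish[name] - earliest_finish[name]
    print(name, "slack:", slack)   # zero-slack tasks form the critical path

In this toy network the zero-slack chain spec, design, build_hw, integrate forms the critical path, while build_sw carries five days of slack.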

The critical path method is a special case of the general principle of using risk and uncertainty to inform project planning. In general, what work could lead to the project being delayed, or to the project failing?

There are at least four kinds of risk or uncertainty to consider.

First, there is the risk that some external event will affect the project. A customer might change their needs. Regulation might change, affecting how the system must be designed. A supplier might go out of business and thus not deliver components. Weather might delay an essential testing operation. Some geopolitical event might happen that changes the ability to manufacture an essential part.

Second, there is uncertainty about how to build parts of the system. At the beginning of a project, there is neither concept nor design for the system, and so the time required to build it is uncertain. As the design begins to develop, some parts of the system will have low technical risk because they involve well-understood problems, such as wheels for a road vehicle. Other parts cannot be built using available designs, such as a spacecraft that needs a low-mass, low-power radio subsystem that can communicate with another spacecraft. If the team can find or develop an appropriate radio, then the project can move forward—but if it cannot, then the system design or the mission will require significant re-work. It may not even be possible to meet the customer needs within the time and budget they require.

Third, there is uncertainty about the time and effort required to build something. There may be a likely technical solution for some component, but the difficulty of constructing it may have hidden surprises. The time needed for a supplier to provide a purchased component might not be known until a contract is signed with them. The complexity of testing the integration of certain components and fixing bugs might not be understood.

Finally, there is schedule risk from a “long lead” task or sequence of tasks that will take a long time to complete.

A well-functioning project searches out risks and uncertainties like these and puts attention and effort on them. Deliberately spending effort addressing technical and schedule risks early in a project means that potential problems are addressed when it is cheapest to handle them. Consider finding out halfway through a project that there simply is no component available to fill some need. Addressing this might require a redesign of much of the system—but much effort has already been spent building parts of the system that now must be discarded. This is a waste of resources; more seriously, it presents a problem that all too often leads project management to decide to fudge the solution and build a system that does not work as needed.

This principle requires dedication to examining the state of the project thoroughly and without bias.

8.3.2 Principle: Prioritize integration

Integrate components as early as possible. When possible, integrate mockups or skeleton components before building out the component details.

There is common wisdom that the cost of fixing an error in a complex system generally increases over time, up to the release into production. While the hard evidence for this is lacking, I find general acceptance that this occurs, though with plenty of exceptions.[3] The idea of increasing cost over time has led to methods that successfully catch errors early, including concept, requirement, and design reviews, test-driven development, and automated checking tools.

Studies such as those reported by Leveson [Leveson11, Sections 2.1 and 2.5] suggest that the greatest cause of system failures now comes from design errors related to the interaction of separate components: the robustness of individual components is not the problem, but instead how components work together. This appears to be the case even with requirement and design reviews, which certainly catch many errors before they are implemented.

I have found two methods help reduce integration-related errors.

The first method is to use semi-formal, top-down design analysis methods in conjunction with design reviews. I recommend the STPA method that Leveson presents. [Leveson11] The Mars Polar Lander loss review called out the lack of such analyses as a significant contributor to the loss of the spacecraft [JPL00, Section 5.2.1.1, p. 16].

The other method is to organize development around integration, so that the component interactions can be tested (not just analyzed) as soon as possible. This principle means focusing on how components will work together before implementing fully detailed components. This leads to building the system in increments, starting with a collection of stub or skeleton components that implement a few parts of the component behaviors and integrating them together into a partial system with limited capabilities. This partial system is then tested, with an emphasis on seeing if the interactions work correctly. Once problems with the integration are sorted out, another tranche of functionality can be added and tested. Along the way, one always has a partial system that runs.

Integration-first development has two benefits. First, if the component interactions do not work well, a redesign will affect multiple components; detecting the problem before investing in all the details of those components means less re-work. Second, it is usually easier to test interactions with mockup or skeleton components than with “real” components. One can instrument the mockups to observe detailed states that are harder to observe in a complete implementation. One can also add fault injection points to make it easier to create off-nominal test scenarios.
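A minimal sketch, with invented component names, of what a skeleton component with instrumentation and a fault-injection point might look like during an integration-first increment.

import random

class SkeletonTelemetryStore:
    """Hypothetical stub standing in for a real telemetry-storage component.

    It implements just enough of the interface for integration testing,
    records every interaction so the test harness can observe the exchange,
    and exposes a fault-injection knob for off-nominal scenarios.
    """

    def __init__(self, fault_rate=0.0):
        self.fault_rate = fault_rate   # probability that a write is rejected
        self.calls = []                # instrumentation: observed interactions

    def write(self, record):
        self.calls.append(("write", record))
        if random.random() < self.fault_rate:
            raise IOError("injected fault: storage unavailable")
        return True

# Integration test sketch: wire the stub to the producer component (real or
# also stubbed) and check that the interaction handles faults.
store = SkeletonTelemetryStore(fault_rate=1.0)   # always fail, to test recovery
try:
    store.write({"sensor": "temp", "value": 21.5})
except IOError:
    pass  # the producer's retry or fallback path would be exercised here
assert store.calls[0][0] == "write"

Because the stub records every call and exposes a fault_rate knob, off-nominal integration scenarios can be exercised long before the real component exists.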

This principle is not one to apply blindly, however. The purpose of integration-first development is to address uncertainty or risk that comes from potential component interaction problems. Some components may have their own internal technical risks, and sometimes it is more important to sort out that risk before addressing component interaction risks. Of course, the ideal would be to address both in parallel.

8.3.3 Principle: Have a long-term plan

Maintain a plan for how to get from the present to a completed system. Detail out the near future; have a concrete but less detailed plan in the medium term; and have a general approach beyond that. Evolve the plan as understanding about the work changes.

Consider the task of planning a route for walking from one place to another. If one has a map of roads or trails connecting the locations, one can search out a path by using a standard shortest-path graph algorithm, which evaluates various parts of paths in an orderly way until it finds a “best” path.

This is analogous to building a system with few unknowns. One can start by designing the system on paper and checking the design. This approach is a low-risk way to build a system, as long as one can be sure that all of the components can be built as designed and that their integration into a system will work as planned. This situation applies when building a system that has strong similarity to other systems, so that there is an existing body of knowledge about what works. It is the basis for repeatable engineering methods, as evaluated by standards such as CMMI. [CMMI] It is also the situation that led to the waterfall system development methodology.

What if there is no map? What if the terrain in between is unknown, and the distance is far enough that one can’t do something like climb a hill and look?

Most projects that are working to build an innovative complex system have a situation like this. At the beginning, there is no obvious path to follow to get to the desired system; indeed, there may not be any path that gets there if the desired system is not feasible.

The team working on the project needs a plan that will guide their work, giving it a general direction for the long term, some concrete plan for the medium term, and details in the short term. As the work progresses, some of the medium-term work will turn into specific, detailed tasks. Some of the tasks will provide information that fleshes out parts of the general, long-term work into more concrete medium-term work. Sometimes bug reports or change requests create new short-term tasks that change the medium- and long-term parts of the plan.
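One way to picture the shape of such a plan is as a rolling structure with three horizons of detail; the sketch below uses invented tasks and is only an illustration of the differing levels of detail, not a planning tool.

# Invented example of a plan with three horizons of detail.
plan = {
    "short_term": [   # specific, estimated tasks for the next few weeks
        {"task": "implement command parser stub", "owner": "A", "days": 4},
        {"task": "review thermal interface spec",  "owner": "B", "days": 2},
    ],
    "medium_term": [  # concrete but less detailed work items
        "integrate command path with simulator",
        "select radio supplier and start procurement",
    ],
    "long_term": [    # general direction only; detail is added as work progresses
        "flight software qualification",
        "system-level verification campaign",
    ],
}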

A plan like this benefits the team. It helps ensure that people get all the tasks done, without some getting missed. It conveys decisions about how work is prioritized, which helps the team work independently. It gives a basis for measuring progress and predicting whether milestones will be reached on time.

The act of maintaining the plan provides the opportunity to think about priorities (such as those in the previous principles) and the dependencies between parts of the work.

A flexible, evolving plan strikes a middle ground between a fixed schedule and a purely reactive tasking approach. A fixed schedule, of the kind often associated with the waterfall development methodology, either becomes a fiction after a few weeks when unknowns intrude on the planned perfection, or it becomes flexible anyway and then takes effort to maintain without any discipline for doing that maintenance. A purely reactive approach, which can be seen in Agile methodologies taken to an extreme, risks the team wandering around chasing whatever immediate priority comes along, and then having execution difficulties when some work requires more planning than one sprint’s duration.

Of course real projects rarely take either extreme approach; in practice real projects adjust schedules over time. Having a discipline for maintaining a plan from the beginning helps the evolution proceed smoothly.

8.3.4 Principle: Set up intermediate internal milestones

Define regular internal milestones for showing a part of the system working in an integrated way.

Internal milestones that demonstrate some system function give the team a focus for their work in the medium term.

Each milestone demonstrates a set of system capabilities working, especially if those capabilities involve integrating functionality in multiple components. The milestones include a demonstration of the new capability working, in order to prove that the system is working and to give the team a concrete success to celebrate. Internal milestones like these put the team’s focus on a part of the system, leading to capabilities that are integrated together early. (This approach supports the principle of prioritizing integration, above.)

The functionality for each milestone should represent some significant amount of work. I have scheduled such milestones about two or three months apart. If a project is using Agile-style sprints, the milestone should include the effort from several sprints.

I have often focused these milestones on some high-level system function or on some pathway through the system. In the software effort on one multi-spacecraft project, the first milestone demonstrated that the basic software and communication frameworks functioned in a testing environment. The next milestone showed simple control loops in the flight software working; the milestone after that, collective guidance for the collection of spacecraft. Each milestone built on the work of the ones before it.

Of course, not all of the team need be involved in one of these milestones. Part of the team may be working in parallel on other functions. In the multi-spacecraft system example, other parts of the team were working on spacecraft hardware design, mission design, ground systems, and so on.

There is a risk in this approach: that the team takes too narrow a focus and fails to account for the larger system. Any focused effort, whether for an internal milestone or for something else, must be balanced by consideration of the whole system. In the project above, the systems engineering team kept working in parallel to the software teams in order to ensure that the software designs continued to meet mission needs and would integrate properly with the spacecraft hardware and ground systems.

8.3.5 Principle: Use prototyping safely

Use prototyping to validate a concept or determine if an approach is technically feasible. Never let a prototype escape and become treated as a part of the real system.

Building a prototype of a component or a part of the system is an excellent way to learn about how the component or part can be built, and how it will work. It is also a good way to check that a potential design will meet its needs.

Building a prototype is also one of the more dangerous activities that a team can do while building a system. The risk is that a prototype will appear to function in the way needed and will be treated as if it is an initial version of the “real” component, even though it is not.

A prototype has value when it can be developed quickly, at lower cost than its “real” counterpart. Taking shortcuts, implementing only some of the functionality, not performing much verification—these are all positive approaches when building a prototype and negative ones when building a component to be used in the final, deployed system.

One example of what can happen comes from a colleague. He was tasked with building some sample software code that would show developers how one could construct a particular kind of application on a new operating system product. The sample code was intentionally simple; it illustrated a particular flow of activity that an application would need to do. It was not a full application in itself. He took some shortcuts in non-essential parts of the code, making the primary part of the application robust but (for example) making some helper functions non-reentrant because they were not an essential part of what was being illustrated. Unfortunately, after this code was published as part of a tutorial, people began blindly copying the helper functions—even though the example was labeled as illustrative only. This led to other organizations releasing buggy applications because they took the easiest and fastest route to building their application by just copying the helper functions.

I observed another example in an ambitious autonomous vehicle system. The company in question began development of their vehicle by building prototypes of several key systems, both hardware and software. In doing so they learned a lot about the problems they were trying to solve. The prototyping effort did what it should: it provided information about how the system should be designed as well as a platform for experimenting with algorithms (such as some of the control systems). Unfortunately, the company did not label or treat these artifacts as prototypes; they saw them as early versions of the real system. The prototypes allowed them to demonstrate vehicles that could perform some operations to investors. This led to increasing pressure to get more features implemented, and to correct problems they found with the vehicle operations as soon as possible. The prototypes had never been designed for reliability, safety, or security, and early safety analyses found significant flaws. Interestingly, the company did treat their hardware platforms as prototypes, and built a hardware platform that was designed to meet safety and security requirements to replace the early prototype boards.

These examples point to both the positive and negative sides of prototyping. On the positive side, in both examples, developing a simplified version of the system in question helped people understand the problem at hand. The effort to develop the prototype went faster because it focused on only the essential elements of what needed to be learned, and omitted aspects that would be needed for a production system. On the negative side, in both cases the prototypes ended up being treated as production-ready. The prototypes, having been built without the rigor needed for correct, safe, or secure function, led to flaws in the system products. These flaws increase the cost of building a working system, and they tend to be discovered late in development when it is far more costly to correct them. (One startup company I worked with had to rebuild a third of its project when they realized how much they were spending to try to patch up the prototype-quality software they had written; they had to go through extra venture funding rounds to get their product released.)

Prototyping, then, is a necessary and helpful part of building a complex system, but it must be done with a discipline that keeps prototypes separate from the “real” system components.

Some project managers have talked with me about solving this by policy: they will have their team build a prototype, but they will ensure that the prototype is not used for production, and they will put building a real component into the schedule. Unfortunately, I have then seen this resolve fade away quickly as the project begins to run late, to have funding issues, or to face an important demonstration coming soon. These imperatives have always, in my experience, taken precedence over system correctness and even over the longer-term cost and schedule to build a working system.

Prototypes are used more safely when they cannot be used in the real system. For example, people often construct storyboards or slides of the user interface for an application. These storyboards allow the developers and potential users to explore how the interface will work, but they cannot be made into an executable application. Similarly, building a software prototype using languages or tools that cannot be integrated fully into the production system helps keep that software from being used in production. Using prototype hardware that is similar but perhaps in a different form factor allows a team to see if a hardware design can work without risking the prototype being put into production.[4]

8.3.6 Principle: Analyze for feasibility

Analyze a system concept for feasibility before committing large amounts of resources to it.

I have worked on multiple projects that were, in retrospect, infeasible. Project A was trying to build a collection of cubesats to demonstrate cross-link communication between the spacecraft; no radio or flight computer was available that could achieve communication between the spacecraft except for a brief period at the start of the mission. Project B involved designing a commercial system for which no commercial business case existed—the system was fundamentally a public good that would not generate a commercial return on investment. A third, Project C, depended on multiple competing government contractors voluntarily developing a shared system architecture, when the rational behavior for all the contractors was to focus only on their own work. Yet another, Project D, depended on secure operating system technology that did not yet exist.

In all these cases, large amounts of money and effort were spent before the projects were canceled.

With hindsight, it is clear that the problems with all but one of these projects could have been detected early. In Project A, basic systems engineering could have created a mission concept of operations and modeled whether available radio and computing hardware was up to the task. The incentive for competing contractors in Project C not to collaborate was clear from the beginning, but the management overseeing the project chose to continue anyway. The missing technology in Project D was identified early but the customer insisted on proceeding.

Project B was the exception. It was defined as a two-year, limited-time exploration of the problem. At the beginning of the project, no one involved knew whether the system was feasible or whether there was a business case. Over the course of the project we learned about the nature of the system, including that it produced a public good [5] rather than a private good, and thus was not a sensible commercial product.

8.4 The team

A project’s people do the work of building the system. The team is itself a system made up of complex parts, and how effectively it works depends on how well it is organized and led. Supporting a team with the structure it needs, and in particular with the communication channels it needs, gives the team a fighting chance of working effectively and working through the difficult problems that will come along.

8.4.1 Principle: Document team structure

Define clear roles and responsibilities for each member of the team. Document and share that information so everyone has an accurate understanding.

As I noted earlier (Section 8.1.4), the team is itself a system. As a system, it has structure—who is on the team, what their roles and authority are, and how people should communicate (Section 7.3.3).

There are many ways projects can structure their teams. The specific choices depend on the nature of the project—the number of people, the range of disciplines involved, whether there is one organization or many.

In a well-functioning project, everyone on the team will have a common understanding of what that structure is. Each person will know who they should communicate with and when. Each person will know what their areas of responsibility and authority are, so that they know when they can make a decision and when they should work with someone else. They also will know who to go to for answers to questions about other parts of the system.

A shared understanding of team structure becomes most important when people find problems to address. If one person finds a problem with the design of a component, they will need to work with the people who are responsible for components sharing functional or non-functional relationships (Section 12.2). If there are interpersonal problems between two team members, the responsibility for escalating problem resolution should be clear.

Clear team structure enables delegation. In a project of more than trivial complexity, the work must be shared among multiple people. Sharing responsibility only works when both parties trust each other: that both will do their part of the work, that both will communicate what should be done and the progress that has been made, and that both will communicate when they find a problem with the planned work. This trust depends on a shared understanding of the rules about responsibility and communication.

8.4.2 Principle: Plan on reorganizing the team as it grows

Adapt the structure of the team as it grows, to reflect the increased coordination needed as the number of interactions increases.

A very small team, of up to around five people, needs little formal structure, because all the people can interact directly with all the others to coordinate the work. A large team needs formal structure, with defined scopes of responsibility and communication paths. In between, the team needs some degree of structure.

As a team grows, it will move gradually from the size where it needs little structure to needing more and more. It will reach points where it outgrows the structure it has and needs to adopt a more formal one. I have observed that teams need to change structure at around 5, 30, and 70 people.
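One way to see why structure becomes necessary is to count the potential person-to-person communication paths, n(n-1)/2; the short sketch below applies that arithmetic to the team sizes mentioned above.

def pairwise_paths(n):
    """Number of distinct person-to-person communication paths in a team of n."""
    return n * (n - 1) // 2

for n in (5, 30, 70):
    print(n, "people ->", pairwise_paths(n), "possible paths")
# 5 -> 10, 30 -> 435, 70 -> 2415: direct everyone-to-everyone coordination
# stops scaling long before the larger sizes.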

In a well-functioning project, the leadership monitors the team’s performance to detect when the team is reaching a size where it needs a change in structure.

Some of the signs that a team needs to move to a more formal structure include:

8.4.3 Principle: Have shared procedures

Document procedures that everyone on the team will use for important tasks.

Procedures define how people perform certain tasks (Section 7.3.5 and Section 20.4). These procedures should be documented and easy for everyone on the project to find. The team should have a cultural norm of following the procedures—not just the letter of the procedure, but the spirit of it as well.

When people work together, one person does part of the work and another builds on it. For this to succeed, people need confidence that the work they build on has been done properly. Part of that assurance comes from having shared procedures and a team norm that everyone follows those procedures.

Some procedures are simple lists of steps or checklists. For example, if a team is using a shared artifact repository like git, everyone needs to follow conventions about how to check in work, maintain branches, and baseline versions (such as by merging to a main branch). If someone does not follow the procedures, the state of the repository can become damaged.

Other procedures are more complex. Completing a Preliminary Design Review (PDR) in the NASA life cycle (Section 23.2.1) means that the project is ready to commit money and resources to begin detailed design and, later, implementation. This is a check on the whole project, not just on the design of one part. Passing the review implies that many project artifacts are completed, at least to a preliminary level: cost and schedule baselines, security and export control plans, orbital collision and debris avoidance plans, specifications to at least three levels, technical concepts, operational concepts, and many others. If the project continues but some of these conditions are not met, then the project is likely to have serious problems later. (This was the case on a NASA project I worked on.)

8.4.4 Principle: Define regular communication paths

Document regular times and media for team members to communicate with each other.

The work the team does is interconnected. A decision about one part of the system affects other parts, following the system structure relationships. The decisions are based on information that, in turn, comes in part from the other parts of the system. Others on the team are responsible for ensuring that the project is making progress, including detecting when something is not going as expected.

Regular communication ensures that this information is pushed to the people who need it. A well-functioning team knows when to share information (such as times when decisions are being made), and who to share it with (the people whose work it will affect). Such a team will also avoid pushing information to those who do not need it. This avoids inundating people with useless information and thereby obscuring information they do need.

To achieve this, make sure that the project’s operational procedures include defined points when team members are expected to communicate. These might include starting the design for a component, proposing changes to an interface, and presenting a component’s design or implementation for review and approval.

Other team members need regular communication for other purposes. Status updates provide information to update the project’s plan. Other communication ensures that the team is working well, helps project leadership keep a finger on the team’s productivity and satisfaction, and provides a way for everyone on the team to learn the project’s overall goals. Johnson discusses communication as a foundation for team functioning [Johnson22, Chapter 2] and how communicating feedback is essential for keeping team members working at their best [Johnson22, Chapter 5].

8.4.5 Principle: Define exceptional communication paths

Define and document clear expectations about when and how someone will raise issues with others. Make this an essential part of the team’s cultural norms.

Delegation and sharing work are essential to a team that is building a complex system, and both are based on mutual trust. One part of that trust comes from each party doing their work well, following the project’s procedures and the team’s norms. The other part is being able to trust that people will communicate when there is a problem. (See Section 19.1 for more on this.)

There are many things that can go wrong. Someone can find an error in a specification or design. They can find that they don’t have the resources or skills to complete some task. People can have disagreements that they cannot resolve. A supplier can be late providing some component.

When these things happen in a well-functioning team, people will communicate rather than keep the problem to themselves. The project’s operational procedures should make it clear how to handle some of these cases. For example, when someone finds a design error, they work with the person responsible for the design to find a solution, and they let others know whose work could be affected by the design change. Ideally, they will ask for feedback from these other people to make the proposed changes work for related parts of the system.

Communicating about exceptional situations only works if both the person raising an issue and the recipients can trust that the message will be heard and acted upon, and that all the parties involved will handle the matter respectfully. Much has been written about how to create an environment where this happens—see Johnson [Johnson22], for example—and I will not try to add to what others have written.

8.4.6 Principle: Train team in communication skills

Communication is only effective when information passes accurately among the participants, and when everything that needs to be communicated gets heard. Effective communication is a skill that can be learned.

There are many ways communication can go wrong. One person can say something and the other person can understand something different. Something can be said that causes the hearer to have an emotional reaction that interferes with hearing and understanding. Two people can be trying to exchange multiple pieces of information, but things interfere and some key information doesn’t get shared. Someone can have something important to say, but withhold the information out of fear of an inappropriate reaction from the person who needs to hear it.

In safety-critical environments, such as air traffic control, pilots and controllers talk using a pre-defined vocabulary, follow pre-arranged patterns for who can talk when, and each party always reads back key information to confirm correct understanding [JO711065]. These rules have been developed over the years to ensure that each party can speak when they need to, that everyone involved will understand what is said in the same way, and that key information is checked.

A well-functioning team has a shared culture of communication practices. These practices include many of the principles found in ATC communication, such as careful definition of terms and reading back or paraphrasing to confirm what has been heard (sometimes called active listening). In addition, people will have uncomfortable things to say and hear while working to build a system, and the team’s communication practices will have to handle messages that could trigger emotional reactions without breaking trust within the team. The communication practices should also encourage regular communication to happen as a matter of course, rather than relying on people remembering to talk to each other.

There is a lot of useful information available in books, courses, and classes on how to improve communication within a team.

8.4.7 Principle: Provide independent resources for checks

Explicitly organize the team so that people have responsibilities for checking others’ work, including through reviews and by doing testing. Manage relationships in the team to keep the checking from being taken personally.

Building checks into the work plan is a principle listed above. The principle of doing checks requires having team members available to do those tasks. Having someone who did not do the design or implementation perform checks improves the odds that they will find a problem, because they do not share the implicit assumptions and biases of the designer or developer. This implies that a well-functioning team will be staffed to provide for independent checks, and that some team members know they will be responsible for checks.

It is easy to underestimate the effort required for reviews and tests. Doing a meaningful design review takes significant effort, because the reviewers need to actually understand the design—not just look for particular easy-to-find markers that might indicate a problem.

I have heard many opinions about how much of a team’s effort should be allocated to reviews and checks, anywhere from half the effort to a small fraction. My own experience has been that the teams where about one-third of total effort was allocated to reviews and testing had better outcomes than the teams with less effort available. The appropriate fraction of resources is likely dependent on many factors not yet appreciated.

Reviewers and testers can end up having an adversarial relationship with designers and implementers, and so the way reviewing and testing tasks are allocated requires some care. In one organization I worked with that had permanent testing teams separate from developer teams, the developers looked down on the testers and relations between the teams were sometimes difficult. While some tension is useful so that the work remains independent, careful management will monitor the relationships and work to ensure that the interactions between developer and checker do not become personal and that the skills required for both roles are honored.

Part III: Systems

A detailed model of what systems are. This includes

XXX add synthesis

Chapter 9: Purpose

17 August 2023

9.1 Introduction

Creating a system requires time, effort, and many other resources. The result of spending those resources should be worth the expenditure: the system should do something useful for someone.

This is another way of saying that the system should have a purpose, and that the purpose should be expressed in terms of what the system can do for the people or organizations that will depend on it. This definition of a system’s purpose means that it depends both on what the system does and who it does it for; both must be worked out to be able to accurately reason about a system’s purpose.

The list of who the system is for should be expansive, including everyone who has an interest in the system. This includes the system’s users, who will need to benefit directly from what the system does. It also includes the people or organizations who build and maintain the system and their investors, who will need to get benefit from the effort and resources they put into making the system. It includes others, such as regulators or industry groups, who represent the public interest in avoiding dangerous activities. This list amounts to the (often-abused) term stakeholder, interpreted broadly.

Each of these stakeholders will have a different interest in the system. The needs of each stakeholder must be discovered and recorded. Users derive benefit from the system’s explicit behaviors. Builders and funders derive benefit from compensation for the system, and in the longer term from the potential opportunity to evolve the system, provide it to others, or develop new systems. Regulators, industry groups, and the public derive benefit from how the system affects the world at large in terms like safety, fairness, or security. All these needs must be satisfied, and they cannot be satisfied reliably unless they are known.

9.2 Why purpose matters

Purpose provides a basis for deciding whether something is worth doing, or for choosing among different ways to do something. It guides the design and implementation: each part of the design can be judged on whether it adds to meeting the purpose or not. The sum can be judged on whether it meets all or enough of the purpose to justify building or deploying the system.

This principle applies to parts of the system as well as to the system as a whole. Each part has a purpose that it needs to fulfill in order for the system to fulfill its purpose.

Purpose matters because of what happens when one does not give it enough consideration. I illustrate this with two examples, from among the dozens I have encountered.

Early in my career, I was tasked with building software that would be used by machine shop workers to process repair work orders and manage parts inventory. This system would be installed on minicomputer systems with terminals around the shop. I had what I thought was a clever idea for the user interface, based on the ideas of non-modal UIs that were beginning to enter the world in the early 1980s. The result met all of the functional requirements needed—and was completely unusable. I had focused on building something I thought surely would be good without doing the work to understand the needs of the shop workers who would use the system.

More recently, I worked with a startup that was building a software system to control a small vehicle. The software designer had decided that the foundational software infrastructure should provide an event loop mechanism, where the infrastructure would cycle at some frequency, and in each cycle would call functions to read sensor data, perform computations, and write commands to actuator devices. This is a common design pattern for this kind of system, and a reasonable starting point. However, when the designer was asked how they envisioned this being used to implement PID controller logic, it turned out that they had never considered what a controller would need, and many necessary capabilities were missing. By the time the first version of the system was released for deployment, the vehicle had no control systems implemented.
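
As a rough illustration of the gap, the following sketch (in Python, with hypothetical names; it is not the startup’s actual design) shows a minimal fixed-rate event loop and what a PID controller needs from it: state that persists between cycles and the elapsed time since the previous cycle. Whether the infrastructure provides such capabilities is exactly the kind of question that working through the component’s purpose would have raised.

    # Minimal sketch (hypothetical names, not the startup's design) of a fixed-rate
    # event loop and the hooks a PID controller needs from it: state that persists
    # between cycles and the elapsed time (dt) since the previous cycle.
    import time

    class PidController:
        def __init__(self, kp, ki, kd, setpoint):
            self.kp, self.ki, self.kd = kp, ki, kd
            self.setpoint = setpoint
            self.integral = 0.0          # controller state that must survive between cycles
            self.prev_error = None

        def update(self, measurement, dt):
            error = self.setpoint - measurement
            self.integral += error * dt
            derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
            self.prev_error = error
            return self.kp * error + self.ki * self.integral + self.kd * derivative

    def run_loop(read_sensor, write_actuator, controller, period_s=0.01):
        """Cycle at a fixed rate: read sensors, compute, write actuator commands."""
        last = time.monotonic()
        while True:
            now = time.monotonic()
            dt = now - last              # controllers need the real elapsed time
            last = now
            measurement = read_sensor()                   # 1. read sensor data
            command = controller.update(measurement, dt)  # 2. perform computations
            write_actuator(command)                       # 3. write actuator command
            time.sleep(max(0.0, period_s - (time.monotonic() - now)))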

The common thread in these examples is that in neither case did the person responsible work through the system’s purpose in order to ensure that what was built would be useful. Instead, the designs were based on an unvalidated belief about the right design, and the choices resulted in unusable implementations.

In both cases, a significant amount of time was spent building a system that did not work. In both cases the resulting system could potentially have been redesigned and reimplemented, but building the wrong thing had used up the available time and delivery deadlines were close by the time they were finished. In the case of the shop management system, the project subsequently failed as a result. In the vehicle control system, at the time of writing it remains to be seen if the team can get funding and time to correct the errors.

Both examples would probably have turned out better if effort had been put into a proper articulation of what the system needed to provide before anyone went into depth on design.

I discuss gathering information about purpose and documenting it in Chapter 31.

9.2.1 Not monolithic or fixed

While it would be nice if purpose could be defined once and then remain fixed for the life of the system, this rarely happens.

First, a system’s purpose is rarely fully understood, especially at the beginning of a project to build the system. A team can begin by talking to potential stakeholders and finding out what they need, but inevitably someone will recognize an important system behavior well after design or implementation is in progress. Not all of the stakeholders may be apparent at the beginning: for example, in one project I worked on, insurers turned out to be an important stakeholder, but we didn’t appreciate that for quite some time. A team must expect that their understanding of a system’s purpose will be rough at the start and become more accurate over time.

Second, purpose is not usually monolithic: there are many things that could be part of the system’s purpose, and usually people want many more things than are practical to build. The list of potential features usually has to be narrowed down from a long list of user or stakeholder wishes to a short list of the most important features—perhaps with a plan to add more capability over time. This means being able to separate the different features or properties and rank them by importance and achievability. A team must expect that items will be added and removed from a system’s agreed-upon purpose as time goes by.

Finally, needs change. If a project to build and deploy a system takes a few years, the world in which it is deployed will likely be different from the world when the project started. Available technology may change, or the user’s market may have shifted, or new regulation may come.

The result of these conditions is that a system’s purpose is not fixed, and the team building the system must be prepared for these changes. Being prepared means regularly checking for changes in stakeholder value and recording what is learned. It means using design and development processes that can adapt to these changes when they happen. And it means a management commitment to managing change honestly, pushing back on user requests when needed and supporting the development team when changes need to be made. It also means that an organization must be prepared to end a project when the system’s intended purpose no longer has enough value to its stakeholders.

Chapter 31 discusses how to gather information about purpose, and how to work with that as the understanding changes.

XXX add references to prototyping and end user validation

9.2.2 Inconsistent or conflicting purposes

Having multiple stakeholders usually means that two or more stakeholders will have incompatible needs or desires. Even a single stakeholder may have conflicting desires.

This can cause two problems. First, conflicting needs make it hard to design a system that meets its purpose. Second, conflicting objectives make it harder to rank and choose among potential system objectives.

There is no simple recipe for handling such inconsistencies. One first has to recognize when an inconsistency or conflict exists, which requires understanding what all the stakeholders are saying and understanding the implications of that information. Then one has to work with the stakeholders to find a resolution—be that a negotiation that produces a compromise, or a realization by one party that their needs cannot be met. This can lead to difficult discussions, especially with customers: it is hard to tell a customer that current regulations make some feature they strongly desire illegal.

9.3 Explicit purpose

A system’s or component’s purpose can be separated into explicit and implicit parts. I use a simplified eVTOL aircraft as an example to explain these.

The explicit part is what stakeholders who will directly use the system say they need. This includes:

The stakeholder can only rarely specify exactly what they want. They may have a general idea, but it often requires several discussion sessions for them to express the idea clearly. The team eliciting the purpose from them usually needs to employ active feedback techniques, providing the stakeholder with an interpretation of what the team thinks they have said in order to validate that they have correctly understood the needs.

See ! Unknown link ref for more about different kinds of stakeholders and projects, and what must be done to learn about each kind.

9.4 Implicit purpose

A system’s implicit purpose comes from stakeholders who are involved in the system but are not its direct users.

9.5 Using purpose

A system’s purpose must guide its design and development. This means that the purpose provides the standard on which design and management decisions can be made. There are several activities in system development that depend on purpose.

undisplayed image

First, a project must actively gather and validate its understanding of the system’s purpose. This activity must be explicitly planned for, and sufficient time and resources provided. The resulting information should be validated with the customer and recorded in artifacts that can be referenced throughout the life of the system.

Second, the desired purpose is almost always more complex than what can be developed feasibly at first. The initial desires need to be ranked and pared down to what is essential.

Third, every project has a “go/no-go” decision checkpoint, when the team decides whether to proceed with building a system or not. The fundamental question is whether a system can be built that meets all its important purposes, and this requires an analysis to determine whether that is feasible. Is it likely that a system can be built that meets the customer needs? And that will provide necessary compensation to the organization that builds it? Will other stakeholders agree to it? If the answer is no to any of these, then the team should not proceed further in building the system.

Next, purpose should guide design and implementation decisions. Each part of the system must play a role in meeting a stakeholder need, and the team should be able to articulate how it does so. If some part does not support the system purpose, it should not be built. If there is a choice to be made between different design or implementation approaches, the one that best meets the system’s purpose should be chosen. Moreover, the team must be able to explain how each of these choices was made. Chapters ! Unknown link ref present methods for ensuring this happens.

Finally, the system’s design and implementation should be checked against the decided purpose. ! Unknown link ref discusses system validation and acceptance.

Sidebar: Summary

Chapter 10: System scope

20 May 2024

10.1 Introduction

A system’s purpose defines why it exists—the reasons it might be built.

What the system is comes next. This is a high-level view of what the system is and will do—and not how it does those things.

The definition of what a system is starts by defining the boundary between the system and the rest of the world. There are things that are part of the system, which I will call the system’s scope. The rest of the world provides an environment in which the system operates. Interactions between the system and its environment take place at the boundary between them.

undisplayed image

The things within the system’s scope are what is being built, and thus under the control of the team making the system. This includes the functions, behaviors, and qualities of the system that are visible from outside the system. These are interactions between the system and its environment across the boundary between the two. These interactions should fulfill the system’s purpose.

What is in the environment is not under the builders’ control. The team building the system should understand these things, but they can’t be changed.

The environment includes the things that interact with or use the system. This includes things that go in and out of the system, physical places where the system operates, and the ambient environment (atmosphere, electromagnetic forces, dust, vibration, or radiation). The environment also includes those who will use the system, and thus define the purpose for the system to exist.

A caution: the system’s scope covers what the system does, and does not address how the system does that. Matters of how the system is designed to meet its scope are separate.

Sidebar: Keeping specification and design separate

One often reads statements like “good practice is to keep specification separate from design” or “requirements should not address the how, only the what”.

Why is this good practice?

The separation comes from the difference in the tasks involved in working out what something should do and how it should do it. Working out a system’s purpose or scope is a matter of working with a customer, real or potential, to learn about needs in the world outside the system-building project. Everything relating to the purpose or scope should come from other people and organizations—the team may choose which needs they will try to meet, but they cannot in general act as if the customer wants something different from what they actually do. The design, on the other hand, is about figuring out what kind of system design will meet those needs. All the decisions about the system’s design are within the control of the team, as long as those decisions end up supporting the customer’s needs.

In other words:

Purpose and scope come from the outside, not from within the project.

Design and implementation are for the project team to work out.

When design decisions are mixed into scope or specifications, it is often a sign that the team has skipped over some of the deliberative steps of working out why some design is the best choice and jumped directly to a conclusion. This also can impose false constraints on the team: I have seen people avoid looking at design alternatives because they believe that some design decision came from a customer and can’t be changed.

Mixing scope or specification and design also can cause problems later, when the system is modified. Someone working out how to change a system needs to know why certain design decisions were made in order to understand the effects of changing the design. When specification and design are mixed, people often don’t record the rationale behind the design decision and the people who later need to understand the rationale have to guess.

10.2 Why scope and boundary matter

Building a system starts with working out the system’s scope. All of the specifications of the system are a refinement of the scope, and all of the design follows from that.

At some point early in a project, one will want to know how big the effort to build the system will be. This depends on knowing what the system will be.

Knowing what is in the environment—and thus not changeable by the team building the system—defines constraints on building the system.[1]

Finally, defining a system’s scope, boundary, and environment provides a way to check that the team understands the customer’s purpose properly by asking the customer to review the scope and boundary.

10.3 Content

The what of a system is the root of all the design of the system and its pieces. As discussed in the next chapter, the model of the whole system is the root of a hierarchy of component parts that define how the system is made. That chapter provides a model for how to define each component, including the system as a whole.

The team will use documentation of the system’s scope and boundary over and over as they build the system, meaning that the information should be organized in artifacts that people can readily find and understand. (See Chapter 17 for discussion of what this implies.)

The system’s scope includes a few kinds of information: a concept, objectives, constraints, assumptions, and environment.[2]

A concept for the system provides a description of what the system will do. The concept is generally narrative, telling stories about the system. The concept should include major usage scenarios, for how the system’s customer will interact with the system and how the system will interact with its environment. I have often used a few diagrams to illustrate the concept. People looking at the concept should come away with an understanding of generally what the system will do and, equally important, what is not in the system’s scope.

Objectives (or goals) are a more organized way to present similar information. This takes the form of a list of the things people want out of the system: its behaviors or functions, and its properties. These will be general statements, and the process of developing a specification of the system will refine these into something more precise.

Constraints list limitations on acceptable system designs. The constraints do not establish what the system does, but only constrain how well it does those things. Many constraints relate to safety or security. For example, the system might need to meet some safety standard. Initial constraints will be general, and they will be refined as the system’s specification and design are worked out. Many constraints lead to analyses that work out in greater detail what these constraints imply.

The assumptions record information that affects how the system can be designed but that might otherwise be forgotten or overlooked. (This parallels the choice between making objectives explicit and leaving them implicit.) The assumptions are often organized as a list, and they guide later design decisions.

Finally, the environment lists information about the world in which the system will operate. The environment constrains how the system can operate or how it must be designed: to accommodate a certain level of vibration, for example, or that cellular radio coverage will be variable over a region where a vehicle will operate.

10.4 Using scope and boundary

The scope and boundary are a realization of the system’s purpose. The record of the scope should be traceable to the purpose. The team uses the scope and these traces to check that the definition of scope meets all of the system’s purpose, and that there aren’t significant parts of the scope that are not based on some part of the purpose.

The act of defining the system’s scope helps reveal the details of the system’s purpose and constraints. Discussions with a customer or other stakeholder are usually informal and incomplete. The discussions result in notes and drawings, but they are rarely directly usable for working out the system’s specification. The tasks of working through records of those discussions and organizing a model of the system’s purpose will reveal ambiguities in what the customer has said, or gaps or inconsistencies. The team can then work with the customer to resolve those issues so that the definition of scope is more complete.

The team will use the definition of the system’s scope to document top-level specifications for the system, which then inform the system’s design and its decomposition into component parts.

As the project moves forward, the team will work out the design for high-level system properties such as safety, security, or reliability. The tasks that build the designs for these emergent properties (Section 12.4) begin with the definitions of what safety or security the system is expected to provide. Those definitions are part of the scope.

Sidebar: Summary

Chapter 11: Component parts

20 May 2024

11.1 Introduction

In this book, I describe a system in terms of its parts and its structure. The system overall has a purpose, which can be described in terms of things it should do or properties it should have. The system meets this purpose by combining the parts together with the structure of how the parts interact. One should be able to show that the desired system behavior and properties follow from the combination of parts and structure.

In this chapter, I start by discussing components, the term I will use for parts. In the next chapter I will discuss structure, and how the combination of parts and structure leads to emergent properties that meet system needs.

Terms. I use the term “component” as a generic term for a part of the system. Some approaches use different terms, such as “element” or “item”. Other approaches use different terms depending on the level of encapsulation: system, subsystem, component, subcomponent, for example. I use the term “component” throughout, with “system” reserved for the system as a whole, and “subcomponent” used to denote a component that is part of another component.

11.2 Definition of component

A component is something that is part of a system and that people can think of as a unit. “Unit” implies some kind of singular aspect to the component: one purpose, one implementation, or one boundary, for example.

The unitary nature of a component means that the world can be divided into that which is within the component, that which is at the boundary, and that which is outside the component—its environment and the rest of the world.

This definition implies that different people will see different things as unitary components, often depending on the level of abstraction one wants to work with. One person may think of “the electrical storage system” as a unitary component, while another person may think of battery cells and power regulator chips as components, and the electrical storage system as a collection of components. Both views are correct, and both are useful at different times or for different people.

The focus on unitary purpose or boundary is a way to address complexity in a system. The focus is meant to help humans organize and understand the system they are working with by taking a divide-and-conquer approach. It means that some people can focus their attention solely on the component, making sure that it is designed and implemented to meet its purpose while not having to think about the rest of the system. The focused attention on the component must be complemented by attention on the system structure that connects the component to others, as described in the next chapter.

There are three related principles that can help identify what is a component and what is not. (Some of this is based on principles presented by Parnas [Parnas72].) These are only guides, and there are exceptions to each of them.

The goal of the first principle is to organize components around their purpose. If a thing has multiple purposes, that suggests that it might be divisible into smaller parts, each with a sharper focus, or that part of the thing might be better combined into something else with a similar purpose. On the other hand, if there is some feature that is implemented by more than one component, then those components are candidates to merge together. This is particularly true when those components contribute to some important emergent property (Section 12.4).

The second principle addresses how independent a thing is from other things. Independence can be viewed in terms of causal relationships with other components, as covered in the next chapter. The more tightly two things are related, the more they will have to be designed, implemented, and tested together; the less they are related, the more they can be worked on independently. If two things are strongly related, one should consider merging them into a single component; if they are loosely related, they can be more readily treated as separate components.

The final principle is also related to independence. If the design or implementation of a thing can be replaced with little or no effect on the design of the rest of the system, then that is evidence that the thing is independent and can be treated as a component. Having clear and narrow interfaces between the thing and the rest of the system is a sign that the component is independent. More broadly, replaceability is often an indication that something should be considered a separate component.

There is one additional indication that something should be treated as a component: when it is something that is usually sold or acquired as a unit. Electronic chips, antennas, motors, and batteries are all generally bought as units. Software packages are often acquired as units, whether bought or acquired as open source. A person hired as a contractor to fill a commonly-defined role can be seen as a component in a system.[1]

11.2.1 Component purpose

Every component has a purpose, which defines how that component contributes to the system as a whole. “Purpose” is a broad term, including behaviors that the component should have, properties it should exhibit, or functions it should provide. A component’s purpose is not necessarily defined precisely; sometimes, the purpose is a somewhat ambiguous prose statement of what a human wants the component to do or be. Turning that ambiguous statement of purpose into a precise and actionable definition is part of the engineering process. I discuss this in ! Unknown link ref.

I discussed the purpose of the whole system in Chapter 9. The system purpose is the purpose for the top-level component in a hierarchy, which represents the whole system.

Most human-designed components have a single primary purpose or property, possibly with multiple secondary purposes. Consider a battery: its primary purpose is to store electrical energy and make it available to other parts of the system. The battery may have a number of secondary purposes, such as providing mechanical structural rigidity, providing thermal mass to help maintain a constant temperature in the system, or contributing to the location of the system center of mass.

Each component has a number of properties that derive from its purpose: its state and behaviors, the inputs and outputs it can provide, and constraints on how it should be used. The documentation of these properties provides an unambiguous and precise specification of the component.

People working on the component need to have the purpose (and the specification that derives from it) available as they do their work. This information guides how they design the component, and how they verify that a design or implementation meets its needs. It is important that all of the purpose is available to them in one place so that they know they are considering everything they need to consider, without hidden surprises they couldn’t find.

11.2.2 Limits of the component approach

Components help human engineering and understanding—but when humans aren’t doing the design, there are limits on how the approach applies.

Consider a mechanical structure designed with a generative design tool. The tool can take in a specification of what the structure should achieve—forces, attachment points, and so on—and will find a design that optimizes for given criteria such as weight or cost. These structures often do not resemble ones people design because the tool can explore a more complex design space than a person can, and as a result it often produces substantially better results than human designs. Such designs can also potentially co-optimize multiple functions, such as a mechanical structure that includes channels for coolant flow within the structure or that meets RF reflection properties. While a person could make such a design, generative tools can do so at far lower cost.

As a second example, consider a neural network trained to recognize elements in a visual scene. The neural network is designed by performing a training process that uses a large number of examples of the kind of recognition the system should perform. The resulting network is typically much more accurate than a manually-designed algorithm. However, it is difficult to investigate the network itself to determine how the connections in the network lead to accurate (or inaccurate) image recognition. It is difficult to look at a specific connection in the network and explain how it affects the result, or how changing that setting will change recognition properties.

Both these examples are components that will be part of a larger system. As components, they have a defined purpose, from which a specification can be derived defining what the component should do. From there, automated methods take over to produce the design (for the mechanical part) or directly produce the implementation (for the visual recognition component). If these components were designed by people, we would expect that we could review and understand the component’s design as a check on its correctness. For machine-generated components, however, we can only verify that the design or the implementation complies with its specification.

There is one significant difference between the two examples: how they can be verified for compliance with their specification. A mechanical component’s specification is generally complete: all of the conditions in which the component should function and the component’s behavior in each environment can be specified. This means that compliance can usually be checked using finite element analysis software tools, and example components can be built and subjected to their intended loads. Components implemented using neural network methods, on the other hand, usually are expected to function in a complex environment that is too large to fully enumerate. The training methods use a number of example cases, and induce from those examples an implementation that should properly generalize to all, or enough, real cases. The component’s compliance therefore cannot be completely verified; it must be checked statistically.
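
As a sketch of what statistical checking can look like (one possible approach, not a prescription), the following measures a recognition component’s error rate on a test sample and attaches a simple Hoeffding-style confidence bound; the sample size and failure count are illustrative.

    # Sketch of one way to check compliance statistically: measure the error rate
    # on a test sample and attach a simple Hoeffding-style confidence bound.
    # The sample size and failure count here are illustrative.
    import math

    def error_bound(num_samples: int, confidence: float = 0.95) -> float:
        """Half-width of a (confidence)-level bound on the true error rate."""
        delta = 1.0 - confidence
        return math.sqrt(math.log(2.0 / delta) / (2.0 * num_samples))

    failures, samples = 37, 10_000
    measured_error = failures / samples                 # 0.0037
    print(f"measured error {measured_error:.4f} "
          f"+/- {error_bound(samples):.4f} at 95% confidence")   # +/- 0.0136

The bound shrinks only with the square root of the number of test cases, which is one reason statistical verification of such components requires large, carefully chosen test sets.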

11.3 Divide and conquer: the component breakdown structure

The component approach involves breaking down the system into unitary component parts, in order to make each part manageable by a person. However, as we have seen, different people use different levels of abstraction to understand the parts of the system.

In practice, people divide up a system first into major subsystems, and those into smaller components, and so on until the components are simple enough to deal with. This recursive division defines components at varying levels of abstraction: the electrical power system as a whole, with the power storage, power distribution, and power generation components as parts of the overall power system.

The following is an (intentionally) partial breakdown structure for a spacecraft, illustrating how the spacecraft as a whole (the “space segment” of the whole system) is organized into multiple trees of components.

undisplayed image

This recursive division creates a tree-structured component breakdown structure of the parts of the system. The breakdown structure organizes components in a way that helps people find components, including both finding a specific component that they are looking for and discovering related components that they do not already know about. The structure also defines levels of abstraction that allow people working at different system levels to focus their attention.

The breakdown structure organizes components, but it does not define the system structure, which I will discuss in the next chapter. The system structure defines how components interact with each other, which generally crosses different parts of the breakdown structure tree.

The system and high-level components should be broken down into subcomponents that have a strong internal relatedness and weaker relationships between subcomponents, as I discussed earlier. In doing so, the high-level component provides an abstraction of its subcomponents. This usually means breaking into subcomponents either by function or physical location. Most people think first of dividing by function: the electrical system, the hydraulic system, the communication system. Location is often more implicit. For example, a space flight mission is organized first by ground system, launch system, and flight system (physical locations) and then by function in each location.

A system will not necessarily have a single optimal breakdown structure. When there is no single best structure, one must pick one approach and stick with it. Some systems will have lower-level components that contribute to multiple high-level functions. If the system is organized according to the high-level functions, then the low-level components could fit into multiple branches of the hierarchy. I will discuss this further in the next chapter, when I cover how one uses hierarchy to organize the structure of the system.

Keeping components together that are functionally related is important. Part of the purpose of the hierarchy is to help designers and implementers: the hierarchy should guide them toward the information they need and should not hide or lead them away from that information. I have worked on some projects where the team decided to consider the esthetics of the hierarchy and tried to balance the depth of branches. While the resulting hierarchy was easier to draw, actually using the organization became more complex and error-prone. High-level components no longer provided an abstraction of a collection of subcomponents as a whole. Instead, the collection of related subcomponents was split between two or three high-level components; nowhere was the one abstraction of the whole set represented. Building specifications, tests, and project plans became harder because related things were no longer related in the hierarchy.

11.4 Component characteristics

Each system component is defined by a number of characteristics. These characteristics define an external view of the component: information about the component that can be observed without knowledge of how the component is designed internally. The characteristics constrain the component’s internal design, but should only include those aspects that will affect how the component fits with other components to make up the system.

There are six kinds of characteristics in the component model used in this book:

Form. The “shape” of the component. The component does not typically change its form over time. For physical components, form is obvious: the geometry of the volume or area that the component occupies. Form might include the material of which a physical component is made. For electronic or data components, form is how it is packaged: a data file in some format, or a software component in the form of an executable application.

Examples include:

State. This is the mutable “condition” of the component at a particular point in time. More formally, state is the information that is necessary and sufficient to encapsulate the past history of the component, so that any reaction that the component performs to some input is fully determined by the input and the state. State can be discrete (such as binary-encoded digital data) or continuous (such as the angle and angular momentum of a rigid body at a point in time).

Practical examples include:

Actions or behaviors. These are the state changes that the component can perform. Some behaviors are reactive, meaning they are initiated by some input. Other behaviors are continuing, meaning that they continue to be performed without further input.

Examples of reactive behaviors:

Examples of continuing behaviors:

Interfaces. These are the ways in which a component is connected to other components in the external world, and are the only way to observe the component’s behavior from outside. Inputs can be given to a component, and output can be received from it. Inputs and outputs create a causal relationship between actions in one component and another. Inputs trigger reactive behaviors in the component that receives the input. Outputs can be a result of a reactive behavior, or an observation of a continuing behavior. Outputs are the only way another component can observe information about a component.

Examples of inputs:

Examples of outputs:

Non-functional properties. Components often have some properties that do not change over time (or change very slowly). These properties are not state per se, but they create important constraints on the component’s design and implementation and affect how the component should behave.

Some non-functional properties:

Environment. A component is also characterized by the expected environment in which the component will operate. This can be viewed formally as part of the component’s interface, but in practical terms it is useful to call it out separately. The environment specification typically includes information like the storage and operating temperature range, humidity, atmosphere, gravitation or acceleration, electronic signal environment, or radiation.
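
To make the six kinds of characteristics concrete, here is a small sketch (in Python) of how a component’s external view could be captured as structured data. The field names and the battery values are purely illustrative, not a standard.

    from __future__ import annotations   # lets the dict[...] annotations work on older Pythons
    from dataclasses import dataclass, field

    @dataclass
    class ComponentSpec:
        """External view of a component: the six kinds of characteristics."""
        name: str
        form: str                                                      # shape, packaging, material
        states: dict[str, str] = field(default_factory=dict)          # state name -> description
        behaviors: dict[str, str] = field(default_factory=dict)       # behavior -> reactive/continuing
        interfaces: dict[str, str] = field(default_factory=dict)      # interface -> direction
        non_functional: dict[str, str] = field(default_factory=dict)  # property -> requirement
        environment: dict[str, str] = field(default_factory=dict)     # condition -> expected range

    battery = ComponentSpec(
        name="battery",
        form="prismatic cell pack, 200 x 150 x 40 mm",
        states={"charge": "0-100% state of charge", "temperature": "cell temperature"},
        behaviors={"discharge": "reactive (on load)", "self-discharge": "continuing"},
        interfaces={"power terminals": "output", "charge input": "input", "thermal path": "output"},
        non_functional={"mass": "<= 2.5 kg", "capacity": ">= 500 Wh"},
        environment={"operating temperature": "-10 to +45 C"},
    )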

! Unknown link ref details more about components and how to specify them.

11.4.1 Characteristics and hierarchy

A high-level component provides an abstraction for the subcomponents that make it up. This implies that each of the characteristics of a high-level component—its form, state, behaviors, and so on—needs to be reflected in the subcomponents. For example, if the high-level component has some state A, then one or more of its subcomponents must have some state that, when aggregated, implements A. If the high-level component has form B, then the subcomponents, when put together, must have that same shape.

Consider a radio communications component. The purpose of the component is to send and receive data packets with another radio somewhere else. The radio component has interfaces to communicate data with another local component, an interface to emit and receive RF signals, and other interfaces for control, configuration, power, and heat transfer. This example radio component, similar to those that might be used on a small spacecraft, has an antenna that is initially retracted but can be deployed on command.

undisplayed image

The radio is built of a number of subcomponents. These subcomponents must implement the state of the radio overall, as well as all its interfaces. The diagram below shows a simplified possible implementation.

undisplayed image

The set of subcomponents implements each of the interfaces named in the high-level radio component. Many of them are provided by the transceiver component, but the antenna handles the RF signal sending and receiving.

The state of the high-level radio is divided over the subcomponents. Again, much of the state is contained in the transceiver component, as it performs the data manipulation. The deployment state is a physical property of the antenna: it is either retracted or extended.

In the example implementation, however, there are multiple powered components—the sensor and actuator related to deploying the antenna in addition to the transceiver. This results in a more complex power state than defined in the higher-level radio component: some of the components could be powered on while others could be powered off, rather than a binary on/off overall state. During design, discrepancies like this should lead to improving the specification of the state of the high-level component.
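
A small sketch (with illustrative names and rules, not a specification) shows the aggregation issue: once the transceiver, deployment sensor, and deployment actuator each have their own power state, a simple on/off state for the radio as a whole no longer captures every possible condition.

    # Sketch of the aggregation issue described above: once the transceiver,
    # deployment sensor, and deployment actuator each have their own power state,
    # the radio's overall power state is no longer a simple on/off.
    SUBCOMPONENTS = ("transceiver", "deploy_sensor", "deploy_actuator")

    def radio_power_state(powered: dict) -> str:
        """Derive an abstract radio power state from subcomponent power states."""
        on = [name for name in SUBCOMPONENTS if powered.get(name, False)]
        if len(on) == len(SUBCOMPONENTS):
            return "on"
        if not on:
            return "off"
        # A condition the original two-state model of the radio cannot express:
        return "partially powered: " + ", ".join(on)

    print(radio_power_state({"transceiver": True,
                             "deploy_sensor": False,
                             "deploy_actuator": False}))
    # prints: partially powered: transceiver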

11.5 Downsides

As I have noted, breaking a system into separate and independent components benefits the people who need to understand the components. This advantage generally outweighs other considerations, but there are downsides to this approach.

The first downside is that a reductive approach doesn’t allow for many kinds of system optimization. Having two separate components means that the two are not jointly optimized.

Software language compilers illustrate this. If each program statement is considered independent, the compiler translates each statement into a block of low-level machine code. However, optimizing compilers break this independence, and gain large speed improvements in the generated code. For example, a code optimizer can detect when two statements perform redundant computations and merge them. An optimizer can detect that a repeated computation (in a loop, for example) can be moved out of the loop and performed only once.
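
A hand-written before-and-after sketch in Python illustrates the kind of transformation involved; an optimizing compiler performs the equivalent on the machine code it generates.

    # Hand-written before-and-after sketch of the kind of change an optimizer makes;
    # a compiler performs the equivalent on generated machine code.
    import math

    def total_unoptimized(values, scale):
        total = 0.0
        for v in values:
            total += v * math.sqrt(scale)   # loop-invariant work repeated every iteration
        return total

    def total_optimized(values, scale):
        k = math.sqrt(scale)                # computed once, outside the loop
        total = 0.0
        for v in values:
            total += v * k
        return total

Treating each statement as independent would forbid the second form; the optimizer must first prove that the two forms behave identically.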

Software optimizers allow a developer to write understandable code while the optimizer performs transformations that can be proven to maintain correctness but that make the resulting machine code hard for a person to understand and verify. There remains the possibility of system optimizers that perform similar translations, but they are not generally available today.

The second downside is that breaking a system into many components creates an organizational problem: how does one name or find a particular part? A hierarchical component breakdown can help organize the pieces.

11.6 Why components matter

We split complex systems into component parts in order to make parts that are understandable by the people who have to work on the parts. The approach also makes it easier to manage parallel design, implementation, and verification of the parts. If one wants to acquire a component from an outside source, having a definition of what the component is helps the acquisition process.

Each of the people working on the system needs information to work on their parts. Defining a component provides a locus around which to organize the information related to a component. Having a model of what a component is provides a basis for designing artifacts that will contain the right information.

Different people will need to work at different levels of abstraction in the system. Organizing the components hierarchically provides these different levels of abstraction.

The people working on the system need to find pieces of the system, both when they are looking for information about a specific piece and when they are trying to learn what the pieces are. The hierarchical structure provides a way to name and find information about a component, and provides a structured index to help people browse and discover.

Finally, it is generally understood that the structure of a system is related to the structure of the team that builds it [Conway68]. I discuss this further in Chapter 19. XXX add ref to detailed team structure chapter

Sidebar: Summary

Chapter 12: Structure and emergence

1 July 2024

12.1 Introduction

Component parts of a system define the building blocks out of which a system can be built, but by themselves they do not create the complex, high-level behaviors that systems are built to exhibit. System behaviors and properties arise from how the component parts work together. How the components are connected, and how they interact over those connections, is the structure of the system.

In this chapter, I define what is meant by system structure and provide examples of how behavior can emerge from the combination of components and their interactions.

To build a system, one generally has to build a model of what the system is and does. This model will play essential roles in designing a system and analyzing its design. Enquiry into how to organize information about a system’s structure helps one develop a useful model, and so in this chapter I present an informal way to model a system’s structure.

12.2 Definition

The meaning of “system structure” has been debated, but I use the following definition, chosen for its engineering utility:

Structure is how each component part’s behavior relates to each other component part’s behavior.

This structure can be expressed as the graph of how components affect each other.

Components can be related in two different ways:

Functional relationship. The functional relationship is a relation from one component to another that maps how some output on an interface of one component can potentially be received on an interface of another component, and thereby cause a reaction in the receiving component. That is, the functional relation is a map of possible interactions that can be viewed as a directed graph, with components as nodes and directed edges showing how causation can flow between them.

undisplayed image

Consider two electronic components connected by a signaling line, similar to those used in several serial communication standards. One component is able to send a signal on the line by changing the voltage relative to a common ground; the other component is able to observe the voltage and determine what signal was sent. By sending a sequence of different voltage levels, the sender can transmit a series of zero and one bits over the line to the receiver. The receiver can decode the bits into a message, perhaps containing a number, and act on the message it has received.
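
The directed-graph view can be captured in a simple data structure. The following sketch (with hypothetical component names) records which components’ outputs can reach which other components, and checks whether causation can propagate from one component to another.

    # Minimal sketch of the functional relationship as a directed graph:
    # components are nodes; an edge A -> B means an output of A can be received
    # by B and cause a reaction there. The component names are hypothetical.
    functional_relationships = {
        "sensor":     ["controller"],             # sensor output can affect the controller
        "controller": ["actuator", "telemetry"],  # controller output affects both
        "actuator":   [],
        "telemetry":  [],
    }

    def can_affect(graph, source, target, seen=None):
        """Return True if behavior in `source` can causally propagate to `target`."""
        if source == target:
            return True
        seen = set() if seen is None else seen
        seen.add(source)
        return any(can_affect(graph, nxt, target, seen)
                   for nxt in graph.get(source, ()) if nxt not in seen)

    print(can_affect(functional_relationships, "sensor", "actuator"))   # True
    print(can_affect(functional_relationships, "actuator", "sensor"))   # False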

This functional relation is separate from and mostly independent of the component breakdown, defined in the previous chapter. The component breakdown is primarily about organizing the parts so they are identifiable, and it does not imply a causal relationship. Functional relationships show how components in different parts of the component hierarchy work together. The component breakdown is helpful for defining levels of abstraction; we deal with those in the next section.

Non-functional relationship. A non-functional relationship between two components indicates how their behaviors may be related in non-causal ways, such as two components being independent of each other or showing correlated behaviors. These effects do not depend on interaction between the components, but instead are based on inherent characteristics or history of each component.

undisplayed image

Independence and correlation are typical non-functional relationships. These terms are defined in the usual statistical sense. Informally, two components are independent if the probability of some event occurring on both components is the same as the product of the probability of each event occurring on its own. Events on two components exhibit some degree of dependence if the probability of both occurring is different from the product of each event occurring on its own. For a positive correlation, when one event occurs the other is more likely to occur. For a negative correlation, when one occurs the other is less likely to occur. At the extremes, one event occurring means the other is certain to occur, or that the two events never occur together.

Many non-functional relationships are the result of common-cause events. This can occur when two otherwise-independent components A and B have functional relationships with a third component C. When an event occurs in C, it interacts with both A and B so that both change their states. After such an event, the states of A and B are no longer independent.

undisplayed image

System reliability is often built on a foundation of failure independence. For example, data can be stored in two copies, so that if one copy fails the other remains available. A scheme like this fails when both copies fail, and so the copies are designed to be independent to minimize the chances of both failing together. Independence can be a result of using different technologies to store each copy, or using devices from different manufacturers. Two devices from the same manufacturing batch might share a common manufacturing defect, which would increase the probability that both will fail.
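
A small worked example (with illustrative probabilities) shows why independence matters. If each copy fails in a given year with probability 0.01 and the failures are independent, both copies fail together with probability 0.0001; if a shared defect means the second copy fails half the time the first one does, both fail together with probability 0.005—fifty times more often.

    # Worked example with illustrative numbers: probability of losing both copies.
    p_fail = 0.01                          # chance one copy fails in a year

    p_both_independent = p_fail * p_fail   # 0.0001 when the copies fail independently

    # If a shared manufacturing defect means the second copy fails half the time
    # the first one does, the failures are positively correlated:
    p_both_correlated = p_fail * 0.5       # 0.005 -- fifty times more likely

    print(p_both_independent, p_both_correlated)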

12.2.1 Examples of functional relationships

Here is a list of some kinds of functional relationships that I have encountered in systems I have worked on. The first few relationships are simple and primitive from an engineering point of view, while the later examples are built as abstractions on top of simpler relationships.

12.2.2 Examples of non-functional relationships

Non-functional relationships capture ways that components can behave in coordinated ways without a direct causal relationship between them. These are typically states or behaviors that occur because two components share some common state (or do not share such state).

The following examples all relate to the independence or dependence of different components that are being used redundantly to improve reliability.

12.3 Abstraction

An abstraction is a summary or reduced form of a more complex thing, usually focused on the essential or intrinsic aspects of the complex thing. The abstraction is separate from any concrete instantiation of it.[1]

People use abstraction to manage the complexity of a large system. In an airplane, people talk about “the electrical system” or “the powerplant”—things that are built out of thousands of subcomponents, but which are usefully thought of as whole things in themselves. While the component breakdown structure, in the previous chapter, is one example of abstracting the details of multiple components into one larger, abstract component (or subsystem), most complex systems have multiple, overlapping ways to abstract and simplify views of the system.

12.3.1 Kinds of abstraction

Because abstraction is a powerful tool, it is used in multiple ways in understanding complex systems. Here are three of those ways.

First, abstraction is used to understand component structure. It shows how one complex, abstract component is made up of a number of subcomponents. The focus is on the hierarchical containment or decomposition of components. The high-level component has a single set of functions or properties it provides in the system, and the abstraction shows how this set is provided by functions or properties in subcomponents. Note that the high-level component need not have a unitary physical realization; instead, it may be realized in many physical subcomponents spread throughout the system’s physical space.

Second, abstraction is used to show how the objectives for a system or component are realized in lower levels. The focus is on how an objective (or purpose) is decomposed into objectives for lower-level parts of the system. It can be expressed as the tracing of a high-level objective to lower-level objectives. An abstracted objective need not be constrained to follow the system’s component hierarchy. High-level objectives might be decomposed into lower-level objectives within the same component. High-level objectives might also be decomposed into objectives for multiple different components, some of which are not close together in the component hierarchy.

Finally, abstraction is used to reason about the chain of how an abstract property or objective is mapped through layers of specification to a design or implementation that realizes that objective. This is similar to means-end hierarchies, which have been used to reason about how products are selected. Leveson uses a five-layer approach, starting with system purpose, mapping that to system design principles, then black box behavior, then physical and logical function, and finally physical realization [Leveson00, §4.2.1]. This use helps people reason about the specification and design process as well as about the structure of the system.

12.3.2 Abstraction of objectives or component structure

In general, abstracting structure is about taking a relation between two (or more) high-level components and breaking it down into relations between subcomponents. In the example below, two high-level components A and B have a functional relationship. A and B are both abstractions of a set of subcomponents. The relationship between A and B is an abstraction of the relationship between the A.2 and B.1 subcomponents.

undisplayed image

As a concrete example, consider software on two microcontrollers that communicate over a serial line. The software on each breaks down into an application software component and a serial driver. The serial drivers communicate (over a serial cable) directly.

undisplayed image

Non-functional relationships can follow a similar pattern. If two high-level components A and B exhibit some kind of correlated behavior without direct causation, and those high-level components decompose into lower-level components, then at least one of the subcomponents of A must have a corresponding non-functional relationship with a subcomponent of B.

12.3.3 Overlapping abstractions

Abstraction is not necessarily purely hierarchical: some high-level abstractions overlap. Two different people can look at the same component and need to work with different aspects of it, and see it as part of different high-level abstractions. This is common in systems of even moderate complexity.

Consider an aircraft with modern avionics and engine systems. The avionics provide many functions: flight deck displays, pilot inputs, navigation, radio communications, autopilot, among many others. The powerplant provides thrust to move the aircraft and electrical power to run other systems, but in a modern aircraft it also includes an engine controller (FADEC) that provides autonomous management of engine operations.

undisplayed image

The avionics and powerplant overlap. The flight deck display shows engine status: thrust, temperature, thrust reverser deployment, and alerts when there are engine problems. The pilot thrust levers are connected to the avionics, but provide commands to the engine controller. The autopilot needs to know the capabilities of the engines and how to provide them with control settings.

This overlap leads to a question: is the engine display function part of avionics or part of the powerplant? The answer is that it is part of both, depending on who is looking at that part of the system.

Consider a specific avionics unit for general aviation aircraft: the Garmin G3X display [Garmin13]. It can connect to an engine interface adapter, which in turn connects to sensors or a digital engine controller on the engine. The display is a general-purpose component, which can provide a pilot with many different kinds of information; engine status is just one function. The G3X unit contains a configuration database that defines what engine information it will be receiving, how to display that information to the pilot, and the conditions when it should issue alerts. This database resides within the avionics display unit, implying that someone designing the avionics system will be concerned with it. However, the database is specific to the powerplant installed on the aircraft—changing an engine model requires changing the database—and so it is of concern to people designing the powerplant.

undisplayed image

This pattern is common in systems that have multiple functions: some particular component will contribute to multiple high-level functions, and different people will see that component as part of one abstraction or another based on what functions they are working on. Models of the system must accommodate these overlaps.

When two abstractions overlap, shared components must implement behaviors and properties that accurately support both higher-level abstractions. In the G3X avionics example, the configuration database needs to address the configuration of the powerplant as well as the interface that supports pilot information displays. This can add complexity to designing the shared component, since behavior that supports one abstraction must not interfere with behavior necessary to support the other.

12.3.4 Abstracting a relationship

Some relationships between high-level, abstract components are themselves abstractions.

Consider once again the example of two microcontrollers that communicate with each other, as in the earlier section, but this time they communicate using a wired Ethernet rather than a serial cable. At the abstract level, there is a functional relationship from A to B where A sends data to B.

undisplayed image

The data communication relationship, however, is an abstraction. The microcontrollers communicate using an Ethernet, which might consist of a pair of cables and a switch. The cables and switch reify the abstract relationship, meaning they take the abstract and make it into something real.

undisplayed image

The inputs to and outputs from the reified data communication link are the same (at the abstract level) as the high-level abstract relationship: data gets transferred from microcontroller A to B.

This is an example of a general pattern. Two components at a high level may have a functional relationship, and both the components and the relationship between them decompose into a number of subcomponents. The consistency between the high-level abstraction and the lower-level details must be maintained, of course, but nothing requires that a high-level relationship decompose only into lower-level relationships; as in this example, it may decompose into lower-level components as well.

In fact this pattern continues recursively down to the lowest observable levels. In the example, microcontroller A passes data into the Ethernet cable as a set of low-level electrical signals. Those signals, in turn, are made up of yet lower-level electromagnetic behaviors of the atoms in the conductors that join the microcontroller to the cable.

12.3.5 Consistency

A high-level abstraction and a lower-level implementation of the abstraction need to be consistent with each other. Speaking broadly, the high and low levels are consistent with each other if the low level implements everything in the high-level abstraction, and everything in the low-level implementation is reflected in the abstraction—that is, neither level adds anything to or removes anything from the other.

Abstraction does imply simplification, however. The high-level abstraction of a distributed software component might have a “logging” relationship to a centralized monitoring system. The decomposition of that relationship might involve a logging subcomponent within the software that uses a network connection to send log records to a receiver component within the monitoring system. The high-level logging relationship focuses on the ability to reliably and securely send log information to the monitoring system. To be consistent, the lower-level details must provide a way to transfer that information—using the network to move the data, for example. The statement that the information is sent securely—which would need to be better defined at the high level—might be matched by state and behaviors of the endpoint software components to authenticate each other and encrypt data in transmission.

Continuing this example, the lower-level implementation would not be consistent with the high-level abstraction if the network communication mechanism provided a way to send information in the other direction, from the monitoring system to the distributed software component.

We can put this in somewhat more formal terms as follows.
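One way to sketch a formalization, simplified to a single abstraction mapping rather than the richer set of mappings a full treatment would need: let L be the set of states, behaviors, and interactions that the lower-level components and relations can exhibit, let H be the corresponding set for the abstraction, and let α be the mapping from lower-level elements to abstract ones. Consistency then amounts to requiring

    \alpha(L) = H, \qquad \text{that is,} \qquad
    \forall \ell \in L,\ \alpha(\ell) \in H
    \quad \text{and} \quad
    \forall h \in H,\ \exists \ell \in L : \alpha(\ell) = h .

The first condition says the abstraction reflects everything the lower level can do; the second says the abstraction includes nothing the lower level cannot do.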

This definition of consistency means that an abstract component or relation has to reflect all of the states, behaviors, or interactions that the lower-level components or relations can have, so that the abstraction models all of what the lower levels will do and adds nothing to what the lower-level parts do. In reverse, the lower-level components or relations must implement all of what the abstract components or relations do, without adding other behaviors or interactions.

12.4 Emergent system properties

Emergence is the complement of abstraction: it is how high-level properties or behaviors arise from the properties or behaviors of a collection of lower-level components and their interactions. Put another way, one designs the emergent properties in a system to make abstractions true. Previously, in Chapter 6, I introduced the idea that system properties and behaviors are emergent from the properties and behaviors of the components that make it up, combined with the way those components interact. This idea continues recursively through a system, where each high-level abstraction is achieved by designing how subcomponents work and interact.

Emergent behaviors or properties are usually things that cannot be sensibly talked about at lower levels: these are properties that the individual components do not have on their own, but that the aggregation does when the components are combined. In physics, concepts such as gas pressure are emergent: no individual gas molecule has meaningful pressure, but the collection of a large number of molecules in an enclosed space gives rise to measurable pressure. Similarly, the shape of a leaf is emergent. No cell making up the leaf in itself has a property of the shape of the leaf, but the aggregation of all the cells as well as how those cells interact as the leaf grows (that is, morphogenesis) leads to a consistent shape that can be perceived in the whole.

In engineered systems, properties such as safety or “correct behavior” are emergent from the design of components and their interactions [Leveson11]. Consider an automobile: it has a property that the driver must be able to control its speed. The driver’s ability to control arises from the driver’s ability to give commands to regulate speed and the vehicle’s correct response to those commands. The vehicle’s speed arises from a combination of motor behavior, brake behavior, wheel interfaces to the road surface, vehicle inertia, and external forces like wind or gravity. One can talk about the rotational rate of the motor, or the degree to which brakes are applied, but driver control over speed arises from the combination of all these things.

An emergent, high-level property is said to supervene on the low-level properties of components. A change in the high-level property can only occur when there is a change to the low-level properties. This principle implies that one can in general design low-level properties in order to achieve a desired high-level property. It may be difficult to do this design, of course, but it is possible; properly-designed low-level properties need not create undesired emergent behavior.

Designing a system so that a desired property or behavior emerges from components involves placing constraints on how lower-level components behave and interact. This is a top-down approach to handling emergent behavior. Reliability properties, for example, are often met using redundant components; for those redundant components to provide reliability, they must be connected in a way where one component can provide service when another fails—a property arising from how the redundant components interact with other components. The redundant components must also exhibit a non-functional relation of some degree of failure independence. I will discuss several more examples in the coming sections.

It is generally more effective to work top down, from a desired emergent property of an abstraction to the components and relations that will make it up, than to work bottom up, starting with a set of component behaviors and hoping a desired abstract property will emerge. Component properties combine in unexpected ways, and determining whether they combine in a way that produces the desired result and at the same time avoids unintended consequences is most often a nearly-intractable problem. Working top down means determining the constraints that must apply to the components and structure that implement the abstraction; analyzing (or designing) the components to determine if they meet those constraints is a simpler and more tractable problem.

For example, the software components inside most operating systems cannot be evaluated for good evidence that they provide the operating system’s intended features in all usage scenarios—and practical experience with popular operating systems shows that most contain large numbers of undiscovered errors. Those operating systems were generally built from the bottom up, with new components being developed on their own following only a minimal goal of function, and then added to an existing system. Only a very few operating systems or software systems of comparable complexity have been analyzed to prove that they actually implement their stated function correctly, and those examples have all started with clear definitions of the abstract behavior and worked from there to design the lower-level components and structure [Klein14].

12.4.1 Examples of emergent properties

Emergent properties can be simple or complex; what they share is that the combination of properties or behaviors from multiple components yields something of a nature that would not apply to the individual components. Here is a set of examples illustrating different kinds of emergent properties or behavior, ranging from the almost trivial that one might not ordinarily think about as emergent to the complex, and including both desired and undesired emergent behaviors.

12.4.1.1 Reliable data communication

Reliable communication happens when information is sent from one place to another, with the information received matching the information sent. “Reliable” is usually qualified: a maximum probability that any arbitrary bit or message that is received does not match what was sent, and qualifications on the environmental circumstances such as distance between sender and receiver, or the absence of deliberate interference.

At a high level, communication involves an information source and an information sink. The source and sink have a functional relation of sending information from one to the other.

At the lower level, communication involves a set of components. The information source and sink remain. The functional relationship between them is reified by a chain of components: a transmitter, a receiver, and the medium between transmitter and receiver. It also involves various encodings used in sending from transmitter to receiver over the intermediate medium. The components have functional relations from one to the next, for moving information along this chain of components. The transmitter and receiver have a non-functional relationship: agreement on the encodings to be used to move information over the medium between them.

undisplayed image

Neither the transmitter nor the receiver by itself moves information reliably from source to sink. Instead, reliable transmission is a simple emergent property of combining all the lower-level components and their relations. The reliability comes from properly matching the designs of the transmitter and receiver, including how they encode signals for transmission and reception, so that they can achieve the desired reliability on the medium that connects them.
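To make the emergence concrete, here is a toy sketch in Python (not a model of any particular protocol): the transmitter appends a check value using an encoding the receiver agrees on, the receiver discards corrupted frames, and retransmission continues until a clean frame arrives. Reliable delivery emerges from the matched pair plus the agreed encoding; no single piece provides it alone.

    import random

    # A toy sketch of reliability emerging from matched transmitter and receiver
    # behavior plus an agreed encoding; not a model of any real protocol.
    random.seed(0)   # make the example repeatable

    def checksum(payload: bytes) -> int:
        return sum(payload) % 256                      # the agreed encoding

    def transmit(payload: bytes) -> bytes:
        return payload + bytes([checksum(payload)])    # append the check byte

    def noisy_medium(frame: bytes, error_rate: float) -> bytes:
        # Corrupt each byte with some probability, standing in for a noisy channel.
        return bytes(b ^ 0xFF if random.random() < error_rate else b for b in frame)

    def receive(frame: bytes):
        payload, check = frame[:-1], frame[-1]
        return payload if checksum(payload) == check else None   # reject corruption

    def send_reliably(payload: bytes, error_rate=0.02, max_attempts=50):
        """Retransmit until the receiver accepts a frame (or give up)."""
        for _ in range(max_attempts):
            delivered = receive(noisy_medium(transmit(payload), error_rate))
            if delivered is not None:
                return delivered
        raise RuntimeError("link failed to deliver the data")

    assert send_reliably(b"sensor reading") == b"sensor reading"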

12.4.1.2 Door closing and latching

Consider a door, perhaps to a cabinet. The door can be open or closed. When open, it can be closed by a person acting to close it. If no one acts on the door, it might remain open or close on its own. When the door is closed, it remains closed until a person takes a specific action to open it. “Remaining closed” means that the door stays closed even when force up to some defined limit is applied to the door. These behaviors should occur reliably for at least some number of open-and-close cycles. They only need to hold reliably in some benign environment (no deforming forces, no corroding atmosphere, and so on).

This is an example of an emergent property of a high-level component that can be achieved by properly designing the subcomponents that make it up.

undisplayed image

One possible implementation of the door that would meet this high-level property uses a latch to hold the door closed. When the door swings closed, the latch engages and keeps the door closed. The latch can be connected to a knob or lever that a person can use to release the latch, allowing the person to perform a two-part action to open the door (release the latch, apply force to the door to move it open).

The high-level door thus decomposes into three subcomponents: the basic door, a latch, and a knob. These three subcomponents, plus the door’s user, have four functional relationships:

  1. Latch to door: the latch holds the door closed when engaged.
  2. Knob to latch: the knob can be moved to disengage the latch.
  3. User to door: apply force to open or close the door.
  4. User to knob: apply force to turn the knob.
undisplayed image

The high-level opening action that a user can apply to open the door decomposes into a sequence of lower-level actions: a turn action applied to the knob, an opening force applied to the door, probably followed by a release action on the knob. The high level closing action decomposes into, first, ensuring that the knob is released, then applying a closing force to the door.

The implementation admits states that do not directly map to the high-level states of the door. For example, the implementation allows the user to turn the knob and then take no further action. This leads to a state of the system where the door is in the closed position and the latch is disengaged. If the environment applies an opening force to the door, the door is not restrained and will swing open. A designer will have to work out what these intermediate states are, and determine whether they are acceptable or not. (In this case, the situation might be resolved by saying that the high-level “open” condition maps to any implementation state where the door position is not closed or the latch is disengaged. Handling intermediate implementation states is not always so simple.)
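The mapping between implementation states and abstract states can be written down explicitly, which is often a useful design check. A small sketch, following the resolution just described (the representation is hypothetical):

    from itertools import product

    # A small sketch of mapping low-level door states to the high-level
    # open/closed abstraction: the door counts as closed only when it is in the
    # closed position AND the latch is engaged; every other combination maps
    # to open.

    def abstract_state(in_closed_position: bool, latch_engaged: bool) -> str:
        return "closed" if (in_closed_position and latch_engaged) else "open"

    for position, latch in product([True, False], repeat=2):
        print(f"closed position={position!s:5}  latch engaged={latch!s:5}"
              f"  ->  {abstract_state(position, latch)}")
    # The (closed position, latch disengaged) row is the intermediate state
    # discussed above; this mapping classifies it as open.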

The knob and latch will have properties that, together, support the high-level property that the door will remain reliably closed through some number of open-and-close cycles. These properties likely involve constraints on the wear imposed on each of them each time the door opens or closes, and the amount of wear before they begin to be unreliable. Similarly, the property that a closed door stays closed when some amount of force is applied to the door decomposes into properties on the latch and knob to ensure they will hold the door in position.

The overall property of remaining closed is an emergent property of the design of the latch and knob. The latch by itself is neither closed nor open; that is a property of the door that arises when the latch is engaged and the door is in a closed position.

12.4.1.3 Failure resilience

A failure resilient component is one that can mask one or more failures of its parts while continuing to provide correct behavior. This is one way to meet a goal that a component is reliable or available; the other way is to make the fundamental reliability of the component higher.

For a concrete example, consider a control system for an autonomous road vehicle. The control system takes in commands from a user or other outside system, then must provide correct, active control of the vehicle’s attitude and movement to travel on the commanded path. Typical acceptable failure rates are one in 10⁷ to 10⁹ operational hours. The vehicle should fail safely where possible when the control system fails, but I will leave that aside in this discussion.

Many systems achieve this level of failure resilience using redundancy and voting. In this approach, multiple independent processors run the control algorithm synchronously, each receiving the same sensor input and generating actuation output. The actuation output from each processor is fed to voting logic, which determines whether a majority of the processors are generating consistent output and if so applies that output to the plant being controlled. If one of the processors fails by stopping, or by generating different outputs, the voting logic masks out the presumed failure.

undisplayed image

The combined components will generally perform the same operations as one single computing component by itself, but the combination will fail less frequently. This improvement is an emergent property of the combination. It depends on two non-functional relationships between the redundant components: that they all exhibit the same behavior, and that they generally fail independently.
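A minimal sketch of the voting logic, with hypothetical output values (real voters also handle timing, value tolerances, and fault latching):

    from collections import Counter

    # A minimal sketch of majority voting over redundant channels. A channel
    # that has stopped is represented by None; values are hypothetical.

    def vote(outputs, required_majority=2):
        """Return the majority output, or None when no majority exists."""
        valid = [out for out in outputs if out is not None]
        if not valid:
            return None
        value, count = Counter(valid).most_common(1)[0]
        return value if count >= required_majority else None

    assert vote([12.5, 12.5, 12.5]) == 12.5    # all channels agree
    assert vote([12.5, 99.0, 12.5]) == 12.5    # one faulty channel is masked
    assert vote([12.5, None, 12.5]) == 12.5    # one silent channel is masked
    assert vote([12.5, 99.0, 7.0]) is None     # no majority: declare failure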

For the example vehicle control system, I found that the approach of using three identical embedded computers was (based on reliability analysis, not measurement) likely to provide only a modest improvement to overall vehicle safety. The redundant computers were not fully independent: they ran the same code, they shared the same power source, and they were subject to heat and vibration in the same environment, all of which increased the chances that two or more computers would fail together. They had a greater degree of independence with respect to matters like a cable vibrating out of its connector or dust shorting out traces on the boards. In other applications, such as spacecraft, there are more sources of independent failure, such as radiation upsets. For spacecraft and aircraft, the cost of unreliability is also higher than for a road vehicle, making this approach to redundancy worthwhile.
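A rough way to see why shared code, power, and environment matter so much is the standard beta-factor treatment of common-cause failure (a generic sketch, not the analysis from that project). Let p be the probability that one computer fails during some exposure period, and β the fraction of failures that strike all channels together. A two-out-of-three voting arrangement then fails with probability of roughly

    P_{\text{independent}} \approx 3p^{2}, \qquad
    P_{\text{common cause}} \approx 3\bigl((1-\beta)p\bigr)^{2} + \beta p .

For small p the βp term dominates, so the achievable improvement over a single computer is limited to roughly a factor of 1/β. With p = 10⁻⁴ and β = 0.05, for example, the arrangement fails with probability of about 5 × 10⁻⁶: much better than a single computer at 10⁻⁴, but far from the 3 × 10⁻⁸ that full independence would give.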

An incident involving an Airbus A330 landing on 14 June 2020 illustrates how lack of independence between supposedly-redundant computer systems leads to failure [TTSB21]. The Airbus A330 uses three flight control primary computers; on landing, these control the braking, thrust reversers, and spoilers that slow the aircraft on the runway. In this incident, there was an error in the flight control law implemented in all three flight computers. On touchdown, the flaw was triggered in one flight computer after another until all three had failed, leaving the pilots only manual control of the brakes. The pilots were able to apply manual braking to stop the aircraft before running off the end of the runway. The failure occurred because there was a design flaw common to all three flight computers, meaning that there was no redundancy in the face of the particular condition that occurred on that landing.

12.4.1.4 Undesired emergent properties

Components are usually designed and organized so that together they achieve the desired emergent system properties. However, the same design can exhibit other emergent properties that are undesirable.

Network congestion is a commonly-cited example of undesirable emergent behavior. In its simplest manifestation, when multiple streams of data meet and cross at some router in the network, the streams can overwhelm the router’s capacity to process and forward data. The router typically drops some packets in order to try to keep up, which causes some of the streams in turn to detect missing packets and retransmit them—causing even more traffic through the router. This was first observed in the Internet in October 1986, when a particular congested network link was moving about 0.1% of the data it normally could when not congested [Jacobson88].

This has led to congestion avoidance and congestion control mechanisms in Internet protocols, which aim to either keep stream data rates below the level when congestion starts or recover quickly when congestion does occur. The sender and receiver behaviors in the congestion control mechanisms, however, have been found to lead to behavior synchronization across multiple senders, leading to oscillating loads that repeatedly overwhelm a bottleneck, then back off, wasting resources for a while, until the cycle leads to another period of congestion [Zhang90].
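The heart of those mechanisms is the additive-increase/multiplicative-decrease rule described in [Jacobson88]. A simplified sketch (real implementations add slow start, round-trip timing, and much else):

    # A simplified sketch of additive-increase / multiplicative-decrease (AIMD):
    # grow the sending window gradually while the network accepts traffic, and
    # cut it sharply when packet loss signals congestion.

    def aimd_step(window: float, loss_detected: bool,
                  increase: float = 1.0, decrease_factor: float = 0.5,
                  minimum: float = 1.0) -> float:
        if loss_detected:
            return max(minimum, window * decrease_factor)   # back off multiplicatively
        return window + increase                            # probe for capacity additively

    # Toy run against a fixed bottleneck: the window traces the familiar
    # sawtooth. When many senders run this loop against the same bottleneck,
    # their sawtooths can synchronize, producing the oscillation noted above.
    window, capacity, trace = 1.0, 20.0, []
    for _ in range(40):
        window = aimd_step(window, loss_detected=(window > capacity))
        trace.append(round(window, 1))
    print(trace)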

These behaviors are similar to other unstable situations, where once a system starts to behave poorly it gets progressively worse. In many of these cases, congestion or overload makes it more difficult for the mechanisms that would address the situation to work.

The lesson to draw from the possibility of undesirable emergent behavior is that system designs need to be analyzed to look for such negative behavior—not just analyzed to ensure that desired behaviors happen. This is related to a kind of confirmation bias, where one is motivated, usually unintentionally, to look for evidence that confirms what one wants or expects. It often requires deliberate effort to look for evidence of negative behavior.

12.4.1.5 Spacecraft imaging a ground location

The final example takes the basic principles in the previous, simpler examples and combines them into a realistically complex case.

Consider a spacecraft system that is intended to take images of ground locations and send those images to users on the ground. The system includes many different parts:

The process of taking an image involves every one of these parts, as well as others omitted from the example to keep the list from getting too long to read. It includes:

If any one of those steps fails to happen properly, the system as a whole will fail to achieve its objective. At the same time, no one component involved in these steps achieves the system objective by itself. In other words, the system behavior of taking an image of a ground location is an emergent property of the system as a whole.

This example is typical of most system properties and behaviors, in that achieving the desired behavior involves many components working properly together. This implies that all these components have been designed to have their individual properties, and that the components have been wired together with the right functional and non-functional relations to work together.

This example also illustrates a common issue: that components depend on other components for their function. For example, the ability for the spacecraft to communicate with the ground depends on the spacecraft being able to determine when it is coming in range of a ground station. This means that the spacecraft must be able to tell where it is, which might rely on the GPS system. If there were to be a problem with the GPS constellation, the spacecraft would not be able to communicate correctly. This kind of dependency creates non-functional relationships—in this case, a non-functional relationship between spacecraft-to-ground communications and the GPS constellation: communications will function only when the GPS constellation is working properly.

12.4.1.6 Safety and security

Leveson argues that safety is a fundamentally emergent property:

Safety, on the other hand, is clearly an emergent property of systems: Safety can be determined only in the context of the whole. Determining whether a plant is acceptably safe is not possible, for example, by examining a single valve in the plant. In fact, statements about the “safety of the valve” without information about the context in which that valve is used are meaningless. Safety is determined by the relationship between the valve and the other plant components. As another example, pilot procedures to execute a landing might be safe in one aircraft or in one set of circumstances but unsafe in another. [Leveson11, §3.3]

I argue that related properties, including security, are similarly emergent and must be understood, designed, and analyzed in terms of how components are related.

12.5 Working with structure

The notions of components, structure, and emergence form a foundation for the work to be done when designing and building a system. Upcoming chapters will define the tasks, artifacts, and processes involved in terms of this basic model of how systems can be organized.

For example, the design of a system consists of artifacts that document what the components are in the system, and the desired properties and relations that connect them. Verifying the design involves gathering evidence for and against whether the behaviors that emerge from the components and their relations match the desired system behaviors. A design can be evaluated based on properties of the graph of relations between components, and the graph of relations can guide investigations into whether subtle non-functional relations (such as expected component independence) will hold.

In addition, there are common design patterns of components and relations that provide guidance for implementing complex behaviors. These design patterns can be expressed in general terms of components and relations, making the patterns broadly applicable rather than specialized to a particular use case.

Sidebar: Emergence all the way down

I have taken a pragmatic approach to abstraction and emergence, focusing on the kinds of relations and abstractions one actually encounters in building most real systems. This means only drilling down into lower layers of abstraction as far as is needed, and not as far as it could go.

Consider data that is exchanged between two electronic components. Data is an abstract component that has no direct physical reality; it is an emergent property of lower-level components and relations. The data itself is dependent on mechanisms for observation and interpretation by people—including agreement between sender and receiver on what the data “mean”. The data are transmitted from one component to another using low-level electrical signals over wires; the signals are designed to move the data from one component to the other. The low-level electrical signals are themselves an emergent property of yet lower level atomic and electromagnetic behaviors in the transmitter, wires, and receiver. These may in turn be emergent properties of yet lower level structures and forces, some of which may not yet be understood.

It is intriguing to think about how far one can take this approach. Luckily we can usually stop at some practical level and take the rest for granted.

Sidebar: Summary

Chapter 13: System views

20 May 2024

13.1 Introduction

Systems are too big for one person to understand all the facts at once. It’s necessary to focus on subsets to manage the scale.

At the same time, different people have different interests as they are working on a system. They need a particular kind of information about part of the system, but do not need to be distracted by other kinds of information.

These needs for subsetting lead to developing multiple views on a system. Each view defines a subset of the information on a system, with the subset defined to support a particular person’s needs and interests. Ideally, each person can do their work using one view or another, and when all the work has been done using many different views the work has addressed all of the system.

Some of these views have a technical focus, being about the function or properties of the system and its parts. These views support those who design, analyze, implement, or verify parts of the system. Other views are non-technical, supporting people who manage the project, organize the teams doing the work, handle scheduling, and similar tasks.

Views highlight some information and hide other information in order to help someone perform a task. If the view shows too much information, then the person using the view will have trouble finding the specific pieces of information they need. They may, indeed, be distracted by irrelevant information. On the other hand, if the view is hiding information that the person needs, they are likely to work with the incomplete information they have and infer that the system does not include the missing information.[1]

The view concept I am defining here is a general mechanism for subsetting information about the system. There are several architecture framework standards that define “view” and “viewpoint” concepts, including DODAF [DOD10] and ISO 42010 [ISO42010]. The view concepts in those framework standards arise from ideas about the processes that should be used to build systems well, and are thus more specific than the general idea presented here. These standards focus on developing models of a system’s design, with subset views that are motivated by exploring the objectives that system stakeholders have in the system. The approach in these standards is one way to use the general idea of subsetting information about a system based on some focus; I will discuss this further in later chapters when I turn attention to how to build systems using the foundational concepts presented now.

13.2 Technical views

Technical views are ones that subset the contents of a system in a way useful to the designers, implementers, or verifiers of the system. These views focus on how a part of the system functions or is organized in some technical sense.

These views can focus in different ways, depending on the specific need:

A view focused on a set of components is useful to someone responsible for a particular subsystem or abstraction. The view can collect all the components, at varying levels of abstraction, related to one part of the system. This might be defined as one or more subtrees in the component hierarchy (Section 11.3)—for example, all the components that make up an electrical power system for a spacecraft. It might also start from some other abstraction. Views like this can be used when working out how an abstraction is to be realized in concrete subcomponents (Section 12.3). They can also be useful for checking whether certain design properties hold, like total mass.

undisplayed image

A view focused on a path through the system is useful for working out or checking how behaviors are realized. Such a view might start with an event in one component, then trace how one event causes events in adjacent components, onward until the high-level behavior is complete. Views like this are useful when checking where a path might have gaps that need to be addressed. They are also useful for checking that a causal path among abstract components and relations is properly realized in concrete subcomponents.

Looking at a path can help reveal what conditions need to hold for each step in the path to occur properly. For example, in the spacecraft commanding example in the previous chapter, a ground pass has to happen successfully if a command message is to be received at the spacecraft. A successful ground pass requires a functioning and available ground station, accurate ground knowledge of where the spacecraft will be, knowledge in the spacecraft of where a ground station is and when it will be in range, and the ability to operate the communications subsystem.

undisplayed image

The third kind of view focuses on trees or graphs of dependencies. This information is useful to someone who is verifying that some safety or security condition holds. It is also useful for revealing where there are unexpected vulnerabilities in a system. In particular, looking at the transitive closure of dependencies can reveal unexpected shared dependencies between two components. In the spacecraft commanding example above, a spacecraft’s ability to know when it should operate its transceiver for a ground pass might be based on the spacecraft knowing its location through GPS. This creates a dependency on a GPS receiver on board and the correct function of the GPS constellation. Further, it may require the spacecraft to maintain an attitude where GPS antennas can see the GPS constellation; this may conflict with other demands on spacecraft attitude (like pointing an antenna toward a ground station). Both the communications transceiver and GPS receiver may rely on a shared electrical power system.

undisplayed image
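The shared-dependency check described above amounts to computing the transitive closure of depends-on edges and intersecting the results for two functions. A small sketch, with hypothetical component names drawn from the spacecraft example:

    # A small sketch of a dependency view: compute the transitive closure of
    # depends-on edges and look for components that two functions share.
    # Component names are hypothetical.

    depends_on = {
        "commanding":      {"transceiver", "pass_scheduling"},
        "pass_scheduling": {"gps_receiver", "attitude_control"},
        "transceiver":     {"electrical_power"},
        "gps_receiver":    {"electrical_power", "gps_constellation"},
    }

    def closure(component, graph):
        """All components reachable from `component` along depends-on edges."""
        seen, stack = set(), [component]
        while stack:
            for dep in graph.get(stack.pop(), set()):
                if dep not in seen:
                    seen.add(dep)
                    stack.append(dep)
        return seen

    shared = closure("transceiver", depends_on) & closure("gps_receiver", depends_on)
    print(shared)   # {'electrical_power'}: both rely on the shared power system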

These three kinds of views are not mutually exclusive. Often someone can benefit from starting in one view, such as a path through the system, and then use other views to explore or refine the system, such as checking on dependencies.

13.3 Non-technical uses

Some views are useful for managing project execution. As a manager or lead, I have been responsible for working out what tasks people need to do to develop the system to some milestone, along with potential dependencies among tasks and estimates of the time and resources needed. I have needed to understand the system in order to derive this information about tasks.

For example, I have often started with a high-level design for a part of a system, containing a few abstract components and relations and a few paths through them for performing key behaviors. I have used one or two paths through those components to sketch out milestones that the team can design and develop toward; at each milestone, the designs or implementations will be integrated to demonstrate some level of functionality. This management step uses views of a few paths through the system. After that, I have worked from the view of components and relations that feed into each milestone to work out a set of design and development tasks that will get each part ready for its milestone. These steps use information about the components and relations involved to work out both the individual tasks and how those tasks might depend on each other, leading to constraints on how the effort can be scheduled. I expand on these techniques later in this work.

Following paths through a system, as well as tracing through the ways that abstractions are decomposed, allows one to find gaps in the current understanding. These gaps represent uncertainty, which can lead to risk. Further, following the paths that lead from some uncertainty to other components or relations helps one work out how much of the rest of the system may be affected by that uncertainty. This allows one to judge the potential effects of changes that may arise from the uncertainty; the magnitude of the effects is part of determining how much developmental risk some gap poses. I discuss how to use this kind of analysis later in this work.

Sidebar: Specifying a view

The descriptions above may seem focused on extracting subsets of a defined system, but the view concept is intended more generally.

In set theory, subsets are often specified one of three ways: by listing the elements of the subset; by constructing the subset through combinations of set operations such as intersection and union; and by specifying a characteristic function—essentially, a description of a query on the set.

All of these have been useful to me at one time or another. While a system is being designed, the population of components and relations that make it up will be changing constantly. The path through components and relations to achieve some function will be steadily refined; in many cases, there may be two or more alternative designs for parts of the same path to compare. This case lends itself to a query-like formulation of views, which are updated as the system’s contents change. On the other hand, tasks to verify that a design or implementation is correct and complete benefit from an unchanging snapshot. This way someone can step through each part of the system, verifying each piece and each integration, without having that work change as people make changes to the system.
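As a concrete sketch of the three styles, over a hypothetical component catalog:

    # A sketch of the three ways of specifying a view, over a hypothetical
    # catalog of components tagged with a "domain" attribute.

    catalog = {
        "solar_array":  {"domain": "electrical"},
        "battery":      {"domain": "electrical"},
        "pdu":          {"domain": "electrical"},
        "transceiver":  {"domain": "communications"},
        "gps_receiver": {"domain": "navigation"},
    }

    # 1. Enumeration: a fixed snapshot, useful for verification work that
    #    must not shift while the work is being done.
    eps_view = {"solar_array", "battery", "pdu"}

    # 2. Set operations: combine existing views.
    radio_nav_view = ({"transceiver"} | {"gps_receiver"}) - eps_view

    # 3. Characteristic function: a query, re-evaluated as the catalog changes
    #    during design.
    def view(catalog, predicate):
        return {name for name, attrs in catalog.items() if predicate(attrs)}

    assert view(catalog, lambda a: a["domain"] == "electrical") == eps_view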

Sidebar: Summary

Chapter 14: Evidence of meeting purpose

20 May 2024

14.1 Introduction

An implemented and operational system needs to meet its purpose (Chapter 9). After all, that purpose is the reason that resources have been spent on developing the system and using it. Meeting purpose means two things: that the system does all the things it is supposed to, and that it does not do things it is not supposed to.

One cannot assume that a system meets its purpose. Each system needs to be evaluated to determine whether it actually does or not, and if not, how and where it does not. The evaluations catch design and communication errors that occur when one party thinks they have specified what is needed, and another party does not understand what was meant or makes a mistake in translating the specification into practice.

How a system works changes over time as well, and regular re-evaluation catches cases where operational behavior diverges from what is needed for correct or safe operation. This includes wear and tear on the system that must be corrected with maintenance. It also includes changes in how the system is operated—from operator practice to management organization and environmental context.

In this work I talk about gathering evidence of a system meeting its purpose.

Parts of a system’s purpose can be specified quantitatively or qualitatively. Quantitative purposes can lead to deterministic ways to check that the system meets the purpose. Complex quantitative purposes, however, aren’t necessarily so easily evaluated: computational complexity or the difficulty in actually measuring system behavior can lead to quantitative properties that cannot be easily or definitively evaluated.[1] For these complex quantitative problems, one must be satisfied with statistical evidence that indicates whether the property is likely true. Qualitative purposes are not amenable to proof of satisfaction. These purposes are evaluated by human judgment, which again leads to evidence but not proof of satisfaction.

Systems engineering processes often use the terms verification and validation (or just V&V). These are both special cases of the general need to gather evidence for and against whether a system meets its purpose or not. In this chapter I focus on the general matter of checking a system, and I will note, here and in later chapters, when these specific uses of evaluating a system apply.

14.2 When to evaluate a system

Checking whether a system meets its purpose is an ongoing need, starting from when the system is first conceived, through system design, implementation, and operation. In general, a system should be evaluated any time its purpose changes, or any time its design, implementation, or operation changes.

In practice, there are five times in a system’s life cycle when the system—whether in design, in implementation, or in operation—gets checked against its purpose.

  1. At each of the individual steps from initial concept, through specification, design, and implementation.
  2. At the time when the system is accepted for deployment.
  3. Periodically and regularly while the system is in operation, to monitor for drift.
  4. At each step when a change is requested, from concept through design and implementation.
  5. At the time when a changed system is accepted for deployment.
undisplayed image

During development, systems are checked in two ways: step by step, and a separate evaluation of the whole system when implementation is complete. The step by step checking occurs at each development step, including generating a concept for the system, generating a specification, designing, and then implementing the design. The expectation is that if each of these steps is correct, then the concept will follow the purpose, specification will follow concept, and so on, and the resulting implementation will properly meet the system’s purpose. (See the figure in Section 6.6.) In practice something gets missed or misinterpreted at some step of development, and so the argument that each step is correct does not hold. Separately evaluating the implementation at the end directly against the original statement of purpose allows one to cross-check the step-by-step evaluation. It helps one find which step had a mistake and thus where to make corrections.

Evaluations are part of the process of working out components’ specifications and designs. The idea of safety- or security-guided design [Leveson11, Chapter 9][Horney17] is to start with safety or security objectives as part of a component’s purpose (or the system’s purpose), refine those objectives into parts of the component’s specification, and then use this to help guide design work. Using safety or security objectives means conducting analyses of specifications or designs to see if they address the objectives, and adjusting the specification or design until there is evidence that they do meet the objectives.

Any time the system’s purpose changes, the system must be re-evaluated in light of the change. This involves repeating steps in the life cycle shown above. Re-evaluation is easy early in initial design; the later in the life cycle a change comes, the more expensive re-evaluation gets. The scope of what parts of the system need to be re-evaluated can be limited by examining the structure of the system and how a change propagates from one component or relation to another.

A system should be evaluated regularly while in operation. In practice, systems drift over time from how they are originally designed and implemented. People who are part of the system, whether as operators, oversight, or management, can shift in their understanding of what they need to do, and often find shortcuts for their role as they adapt to the system they work with. Mechanical parts of the system can wear, changing their behavior or increasing the chances of failures. The environment in which a system operates can change, perhaps with people moving near an installation that was previously isolated or maintenance budgets being cut. As a simple example, one early software system I built included a billing module that would create itemized invoices to be sent to insurance companies that were expected to reimburse for medical expenses. Over time, the people who should have been running the module and creating invoices did so less regularly than they should have, leading to revenue problems for the business. Leveson discusses several other examples [Leveson11, Chapter 12].

Finally, a system’s purpose usually changes over time. The users need new features, or some assumption about how they will use the system will be found to be wrong. Regulations or security needs may change. All of these lead to a need to change the system’s design and implementation. The team will recapitulate the development process to make these changes, including evaluating the updated concept, design, and implementation against the new purpose.

14.3 Kinds of evidence

There are two kinds of evidence: positive evidence and negative evidence. Both are needed to evaluate whether a system meets its purpose.

Positive evidence is an indication that the system properly implements some desired property or behavior. Positive evidence is what most people think of first: that the mass of system hardware is within some maximum amount, or that the system performs action X when condition Y occurs.

Negative evidence is an indication that the system does not do something it is not supposed to. Safety and security evaluations are fundamentally about collecting this kind of evidence: that the system will not do some unsafe action or enter into some unsafe state. Negative evidence is therefore vital to determining whether a system meets its objectives, but negative evidence is generally much harder to establish than positive evidence. In practice, analytic methods are the only ways we currently have to establish the absence of a condition.

Bear in mind that, as the saying goes, absence of evidence is not evidence of absence; that is, no amount of testing that fails to find an undesired condition can establish with certainty that a realistic system is free of that undesired condition. Negative evidence through testing requires testing every possible scenario, which is infeasible for anything other than trivial behaviors. Testing a very large number of scenarios can potentially generate a statistical argument for the absence of an undesired condition, but only if the scenarios chosen can be proven to provide sufficient, unbiased coverage of all possible scenarios, including rare scenarios. I have never found an example of someone being able to construct an argument for the significance of the test scenarios in a real-world system. Kalra and Paddock [Kalra16] present an analysis for testing autonomous road vehicle safety, and show that it would require an infeasible number of driving miles to show the absence of unsafe behaviors—and they conclude that alternate means are needed to determine whether autonomous road vehicles are sufficiently safe.

Many undesirable behaviors or conditions cannot be completely eliminated from a system, and instead the standard is to show that the rate at which these behaviors occur is sufficiently rare. For example, aircraft are expected to experience failures at no more than some rate per flight hour in order to be certified for operation. These safety conditions lead to a need for evidence of statistical bounds on rate of occurrence at a given confidence level.[2] If these bounds are sufficiently loose, then a carefully-designed test campaign can provide statistically significant evidence. However, statistical significance and confidence rely on the test scenarios either being selected without bias, or with a way to correct for selection bias. This means, for example, ensuring that there is no class of scenarios that are avoided in selection. It also means understanding the probability of rare but important scenarios occurring and accounting for that rarity in the number of scenarios tested or in the way scenarios are selected.
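To give a sense of scale, here is a standard exponential-model calculation (a generic illustration, not tied to any particular system). If a test campaign observes zero failures over T operating hours, the one-sided upper bound on the failure rate λ at confidence level C is

    \lambda \;\le\; \frac{-\ln(1 - C)}{T} .

Demonstrating λ ≤ 10⁻⁹ per hour at 95% confidence, with −ln(0.05) ≈ 3.0, would therefore require on the order of 3 × 10⁹ failure-free test hours, which is why analysis rather than testing alone has to carry most of the burden for claims at these rates.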

14.4 Methods of gathering evidence

There are three general methods for gathering evidence about systems satisfying their purpose:

Experimentation tests an operational system (or part of a system) to show positive evidence about some desired capability. This is the gold standard for gathering positive evidence.

Experimentation is usually divided into two categories: testing and demonstration. Testing involves setting the system into a defined condition and providing it defined inputs, measuring the system’s response, and comparing that response to expectations. Tests are expected to be repeatable. Demonstration is more open-ended, where the system is operated for a while, possibly by people, and not always in a fully-scripted, repeatable way. Demonstrations can address some non-quantitative conditions, such as whether people like something or not.

Inspection or review is a way to check a design or system for things that cannot be readily measured by experimentation. These methods use human expertise to check the system for specific conditions. They are primarily used to gather positive evidence, but they can be useful for gathering negative evidence when other methods don’t apply. In the simplest form, inspection checks simple conditions that would be difficult to automate; for example, that a physical car has four wheels. For more complex reviews, humans observe the system and think about what they observe to determine whether it meets expected behavior.

Analysis can be used to collect both positive and negative evidence. Indeed it is generally the most useful way to gather negative evidence—which is often about thoroughness, and analytic methods are better at ensuring all possibilities have been examined. Analysis takes as input a model of the system, extracted from its design or its implementation. It then applies algorithms that work to prove or disprove statements about that design, such as whether there exists some sequence of state transitions that would cause a component to enter an undesired state. The evaluation is usually performed using automated computational tools, though it can sometimes be done by hand for analyses of modest complexity. I have used analytic methods occasionally, usually for foundational components or abstractions on which the system depends for its correct operation. The first time I used it, on the design of a synchronization mechanism in a multi-threading computing environment, it caught a subtle flaw that would have occurred rarely but would have been difficult to detect. On another project, colleagues and I proved the correctness of the design of a family of distributed consensus algorithms—which helped us accurately implement the algorithms. The seL4 operating system kernel [Klein14] has a formally proven design and implementation, showing that its implementation provides key confidentiality, integrity, and isolation properties as well as functioning according to its specification.
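The flavor of such an analysis can be sketched in a few lines: enumerate every state a model can reach and check that none of them is undesired. The model below is a toy, hypothetical lock example, and real tools handle vastly larger state spaces, but the principle of exhaustiveness is the same.

    from collections import deque

    # A toy sketch of exhaustive state-space analysis: explore every reachable
    # state of a transition model and report any that violate a stated condition.

    transitions = {
        "locked":   {"request": "checking"},
        "checking": {"grant": "unlocked", "deny": "locked"},
        "unlocked": {"release": "locked"},
    }
    undesired = {"unlocked_without_check"}   # states the design must never reach

    def reachable(initial, transitions):
        seen, queue = {initial}, deque([initial])
        while queue:
            state = queue.popleft()
            for next_state in transitions.get(state, {}).values():
                if next_state not in seen:
                    seen.add(next_state)
                    queue.append(next_state)
        return seen

    bad = reachable("locked", transitions) & undesired
    print("no undesired state reachable" if not bad else f"reachable: {bad}")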

14.5 Completeness and minimality

Separate from these methods for gathering evidence, one also needs evidence of completeness and minimality.

When a system is believed to be complete, one doesn’t want only to show that one or a few purposes are met; eventually one needs to provide evidence that all purposes are met. This does require knowing what the purpose is, and then being able to provide evidence showing each part of it has been satisfied.

One also needs to show that the system as designed or implemented does not do things that don’t derive from and support the purpose. This includes showing that safety and security properties (of bad things not happening) are met. It also includes ensuring that people have not inserted features that the end users do not need or want, which would imply that development resources have been mis-spent and that the system can potentially do things the users will find undesirable.

Sidebar: Summary

Chapter 15: Synthesis

14 October 2024

15.1 Introduction

The previous chapters have covered a model for understanding what a system is. The coming chapters are about how to build that system.

This chapter presents where these two arcs meet: the web of artifacts that make up a system and record what it is—the system artifacts. This web is a reification of the abstract model in the previous chapters; the making of this web is the work of making the system.

The system artifacts are a (directed) graph. The nodes in the graph are artifacts that record some aspect of the system, such as a component’s implementation or a requirement or a verification result. The edges in the graph record relationships between nodes, such as a “part-of” relationship between two components, or a “satisfies” relationship between a specification and a design.

The graph contains all the information about the system, as discussed in the previous chapters. It starts with one or more project ideas. These ideas lead to stakeholders and to purposes and constraints. These flow to nodes making up the concept, and from there onwards to implementations and verification evidence.

The development and maintenance or evolution phases of a system-building project are about creating and maintaining this structure. The graph does not spring into existence fully formed; nor is it built in one sweep from the top (idea) to the bottom (implementation). The work starts at the top of the graph with the ideas and stakeholders, with the rest of the structure unknown at the start. People explore downward in multiple threads. Some threads progress deeper than others at any given time. Some threads explore in one direction, find that it is unhelpful, and start over in a different direction. Most nodes start out simple, and are added to and corrected many times as the team explores the graph.

The business of making a system well is, then, about starting from the basic idea and, first, exploring and building the structure of system artifacts efficiently; and second, building a structure with good form. Efficiently means using as little time and other resources as possible to build the structure; this implies using resources concurrently and minimizing the amount of work that is re-done (because of false starts or poor workmanship). Good form means that the structure accurately reflects stakeholder objectives and constraints; the system meets those while avoiding other features that are not about customer needs; and the graph accurately reflects the derivation from objectives to implementation. A good system artifact graph will contain all the information people need, but not too much more.

In the next part, I will discuss how one goes about making a system, which amounts to organizing how the team builds this structure without losing track of what they are doing.

15.2 Structure of artifacts

The graph of system artifacts contains all of the information about a system—its purpose, structure, implementation, and why all of that is the way it is (the rationale).

The graph has many types of nodes, including:

There are many different kinds of edges between these nodes, showing the relations between the artifacts. Some of the most common are:

The graph in Figure 15.1 illustrates a small part of the upper levels of an example system artifacts graph. The illustration is a simplification; the specification node in the graph, for example, is actually a collection of requirement trees, each of which has derivation relationships to components above and below. Similarly, a design is actually a collection of many things; the illustration leaves out the relationships between the subcomponents within one higher-level component. In practice the graph for even a fairly simple system will have at least tens of thousands of nodes; for complex systems, orders of magnitude more.

This graph of artifacts is what the project builds, bit by bit, over the course of development.

undisplayed image
Figure 15.1: Top portion of the system artifact graph
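The graph itself can be recorded quite plainly. The sketch below gives the flavor; the node and edge kinds are hypothetical examples, not a prescribed schema:

    # A sketch of recording the system artifact graph. Node and edge kinds here
    # are hypothetical examples, not a prescribed schema.

    nodes = {
        "idea-1":        {"kind": "idea"},
        "stakeholder-1": {"kind": "stakeholder"},
        "objective-1":   {"kind": "objective"},
        "req-eps-12":    {"kind": "requirement"},
        "design-eps":    {"kind": "design"},
        "impl-pdu":      {"kind": "implementation"},
        "test-pdu-7":    {"kind": "verification evidence"},
    }

    edges = [
        ("stakeholder-1", "holds",     "objective-1"),
        ("objective-1",   "derives",   "req-eps-12"),
        ("design-eps",    "satisfies", "req-eps-12"),
        ("impl-pdu",      "part-of",   "design-eps"),
        ("test-pdu-7",    "verifies",  "impl-pdu"),
    ]

    def follow(node, relation):
        """Trace one kind of edge out of a node, e.g. a derivation chain."""
        return [dst for src, rel, dst in edges if src == node and rel == relation]

    print(follow("objective-1", "derives"))   # ['req-eps-12']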

15.2.1 Layering

Looking at the whole artifact structure, as opposed to just particular kinds of information, reveals problems with narrower views.

Most projects and almost all tools organize requirements into a tree. Each component in the breakdown structure has its own set of requirements. These start with the system objectives and constraints, and flow down from one component to another along the component breakdown structure.

In isolation, this structure is missing essential information about the mapping of requirements on component C to the requirements on its children C1, C2, …, Cn. The set of Ci implicitly encodes how C is decomposed into subcomponents. By itself, this leaves out the roles of each subcomponent within C and the relationships between the subcomponents. The requirements on Ci derive only in part from the requirements on C; they depend equally on the role that Ci plays within the larger component.

Concretely, consider a spacecraft electrical power system (EPS). The EPS has many requirements, such as providing some minimum wattage to other spacecraft subsystems over the course of the average orbit, being able to passivate (permanently shut down) the EPS, controlling which subsystems are powered on, and so on. The EPS is built from several components: solar cells that generate electricity, a battery to store the generated energy, a power distribution unit (PDU) that switches where electricity is going, and others.

undisplayed image

The passivation requirement, which comes from regulation requiring all energy sources in a spacecraft to be fully and permanently de-energized when the mission ends, must be implemented by those subcomponents. This means that the battery must be drained to zero charge and the EPS must be placed in a mode in which it will never afterward provide any power to any other spacecraft subsystem. This includes the battery never being recharged. Figure 15.2 shows a set of requirements that encode this.

undisplayed image
Figure 15.2: Requirements flowdown for spacecraft electrical power system passivation

One way to meet the passivation requirement would be to add two features: fuses between the solar panels and the PDU and between the battery and PDU, which when blown would permanently disconnect them; and a mechanism to drain the battery, perhaps by connecting it to a circuit with a resistor that will dissipate any stored energy.

In the requirements flow-down above, there are design decisions hidden in the requirements tree: where and how these functions will be implemented. The disconnection fuses might be implemented as part of the PDU or as part of the solar panels and battery; the design decision was to assign these functions to the solar panels and battery rather than having them within the PDU. Similarly, the function to drain the battery could be implemented in the battery or in something else attached to the PDU. The hidden decision was to assign that function to the battery.

In fact, requirements (or specifications) and design (the component breakdown, and the way high-level components are decomposed) are interleaved: they form alternating layers, and one cannot be correctly understood without the other. A design records how a component will be made up of subcomponents, and defines the roles and functions that are allocated to each of the subcomponents. The roles and functions define the purpose for the subcomponent. The requirements placed on the subcomponent are derived from the requirements of the higher-level component as transformed through the purpose of the subcomponent.
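One way to keep these layers explicit is to record, with each derived requirement, both its parent requirement and the design decision that allocated the function. The fragment below is a hypothetical sketch of the EPS passivation flow-down; the identifiers and wording are invented for illustration and are not taken from Figure 15.2.

    # Hypothetical flow-down records for the EPS passivation example.
    # Each child requirement carries both its parent and the design
    # decision (the allocation of a role) that justifies the derivation.
    eps_passivation_flowdown = [
        {
            "id": "EPS-REQ-40",
            "text": "The EPS shall be capable of permanent passivation at end of mission.",
            "level": "EPS",
            "derived_from": ["SYS-REQ-12"],      # the regulatory passivation requirement
            "design_decision": None,
        },
        {
            "id": "BAT-REQ-7",
            "text": "The battery shall provide a means to drain stored energy to zero charge.",
            "level": "Battery",
            "derived_from": ["EPS-REQ-40"],
            "design_decision": "Allocate the energy-drain function to the battery "
                               "rather than to a unit attached to the PDU.",
        },
        {
            "id": "BAT-REQ-8",
            "text": "The battery shall include a fuse that, when blown, permanently "
                    "disconnects it from the PDU.",
            "level": "Battery",
            "derived_from": ["EPS-REQ-40"],
            "design_decision": "Place disconnection fuses in the battery and solar "
                               "panels rather than inside the PDU.",
        },
    ]

Recording the allocation decision alongside each derived requirement makes the alternating specification and design layers visible, rather than leaving them implied by the shape of the requirements tree.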

undisplayed image

15.3 Exploration and development

Development of the system is a process of exploration to find and create the contents of the system artifacts graph. This can be thought of as similar to the way the original North American transcontinental railroad was developed (Section 4.7), but with many more choices.

In an idealized situation, where there were no significant uncertainties and where the team had system design patterns to follow, the exploration could start with the idea, proceed to work out the stakeholders and their needs, and then the team could work their way downward step by step until they reached the implementation and verification artifacts at the bottom edge of the graph. I am not aware of any project where this ever happened.

Unfortunately, that kind of steady, unidirectional progress is what too many people expect when they think about the whole mass of information that a project develops. Knowing instinctively that such development doesn’t actually happen, they turn away from organizing the project’s work and focus on building the system implementation artifacts in whatever way they can manage. This is especially true of startups and of small projects.

What gets lost is that the system artifact graph is the end product of building a system, not the process for building it. The state of the artifact graph at any moment is a record of what the project has done to that time. The graph is not the plan; the plan is its own artifact (Section 20.5).

The process of making all the parts of the system artifact graph—how the graph grows and changes—is always complex. The process is non-linear and non-monotonic: people will sketch some node in the graph, explore around it, and come back to improve the sketch based on what they have learned. Some people may start working from the top (stakeholders and their needs) downward, but others may have ideas about how parts of the middle of the graph might be structured, and work from the middle out. When there are choices for how some component might be structured, the team explores multiple tentative options, creating multiple parts of the graph, before selecting one of the options and removing the other tentative parts.

The challenge for a team is to develop the graph efficiently and correctly, while allowing for the necessary complexity of working in non-linear ways.

There are two goals in developing the graph efficiently, without wasting resources. The first goal is to minimize errors and rework. If someone develops a component design that does not meet its specification, or an implementation with bugs, then at least some of that effort has been wasted and more effort will be spent fixing the problems. Rework also happens when two people duplicate work that only one of them needed to do. The second goal is to keep the whole team working, without anyone stalling while waiting for someone else to complete something. The time someone spends waiting without being able to do useful work on the project is time lost. Organizing the work to maximize potential concurrency can help achieve this second goal.

There are other techniques that keep a project moving quickly, such as assigning work to the person best able to do it. These are in addition to the business of exploring to find the system artifact graph; they are covered in a later chapter.

While a system is being developed, parts of the system artifact graph will be uncertain. Some parts will still be empty, waiting for someone to start working them out. Some nodes will be incomplete, needing more detail or to be checked for correctness. There may be components that have been specified but for which no feasible design yet exists. The overall process of developing a system is one of driving these uncertainties down to zero (which only happens when the last implementation completes verification and the system is done).

Any time someone is working on something that depends on something that is uncertain, there is a risk that the work they do will have to be redone. The greater the uncertainty, the greater the likelihood of rework. This suggests that development should proceed cautiously from highly certain parts to less certain ones, not moving onward until uncertainty has been worked down. Unfortunately this does not work: high-level decisions are often tentative, depending on whether something at a lower level works out. For example, one might decide that the electrical power system from the previous section should include a battery, solar panels, and a power distribution unit. Whether this decision works out will depend on whether suitable battery and solar panel components are available. If they aren’t, or if the available components are too heavy or don’t produce enough power, then that initial design would have to be revisited. At the same time, the decision to use a battery and solar panels for a spacecraft operating in Earth orbit is probably feasible—that is, there is only moderate uncertainty—and so it is reasonable to move forward tentatively with that approach. As I will discuss later, I would then plan to investigate whether appropriate solar panels and batteries are available as next steps.

The final structure of the system artifacts graph should be correct. Some of what makes it correct:

The process of exploring and developing the system artifact structure will necessarily involve evaluating alternatives for one part or another. A good final structure will come from having explored enough alternatives to have confidence that good choices were made (even if they were not formally optimal). Information about what alternatives were considered and why the result was chosen should be included in the rationales in the structure.

Finally, a good final system artifact structure is complete to a reasonable level of detail, but does not go overboard. It should contain enough information that people can learn what they need from looking at the graph and reading about its nodes, but without including details that won’t help them understand both what the system is and why it is that way.

15.4 Views

The system artifact graph quickly becomes large and intricately connected. For even modest systems, it becomes more than one person can hold in their mind.

People need tools to help them see the parts of the graph that are relevant to their work, while keeping parts that are not useful to them out of their way. I discussed the idea of views in Chapter 13: the ability to focus on subsets of the system or its associated artifact graph.

There are some common ways people need to be able to view subsets of the information. When they are working on one component, they need to see all the information about that component and all its context (relations, derivations, neighbors, parents). When they are working on the structure—components and relations—they need structural information. When they are working on an emergent behavior that crosses many components, such as a safety or security property, they need to see the affected components and their relations, along with information about how component behaviors at different levels combine to create the emergent behavior. When they are verifying some aspect of part of the system, they need to see the specification they are verifying against, along with structural information to help them work out how to verify that aspect—perhaps by building a test that uses component interfaces.
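One way to picture a view is as a filter over the artifact graph: a rule that selects the nodes relevant to a task, plus the relations among them. The sketch below builds on the hypothetical ArtifactGraph structure shown earlier in this chapter; the function names and the crude keyword matching are illustrative only, not a recommended implementation.

    def component_view(graph: ArtifactGraph, component_id: str) -> ArtifactGraph:
        """Everything directly related to one component, plus the component itself."""
        keep = {component_id}
        for e in graph.edges:
            if e.source == component_id:
                keep.add(e.target)
            if e.target == component_id:
                keep.add(e.source)
        view = ArtifactGraph()
        view.nodes = {i: graph.nodes[i] for i in keep}
        view.edges = [e for e in graph.edges if e.source in keep and e.target in keep]
        return view

    def concern_view(graph: ArtifactGraph, keyword: str) -> ArtifactGraph:
        """Artifacts touching a cross-cutting concern such as 'safety' or 'security'.
        Keyword matching on the summary is a crude stand-in for real tagging."""
        keep = {i for i, a in graph.nodes.items() if keyword in a.summary}
        view = ArtifactGraph()
        view.nodes = {i: graph.nodes[i] for i in keep}
        view.edges = [e for e in graph.edges if e.source in keep and e.target in keep]
        return view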

I have found that I often use three kinds of views:

15.5 Relation to making a system

The preceding chapters have presented an abstract model for what a system is. This model provides tools for thinking about systems: their purpose, scope, components, and structure.

The coming chapters cover how to build the system. They present a model organized around the tasks people do, in which a team uses tools to create artifacts, and the work is organized by how the project operates.

The system artifacts graph is where these two threads meet. The system artifacts are a reification of a system’s abstract model; they record all the information about a system and include the final implementations that are eventually manufactured and deployed. The system artifacts, therefore, must accurately represent the information in the model. The system artifacts are the object of all the work building the system. The nature of the system artifacts graph constrains how the team organizes itself and its work. In particular, the life cycle and planning that a team adopts are statements of how the team will explore and develop the system artifacts.

Sidebar: Summary

Part IV: Making a system

A detailed model for how to go about building a system:

Chapter 16: Approach

21 May 2024

Making a system is about the activities to build the system and the people who do that work. In Chapter 7, I laid out a basic model for these activities and what they involve. The model involves five elements (repeated from that chapter):

undisplayed image

The model is organized around the tasks that are performed to build the system. The tasks generate artifacts, including design and implementation. The team is the people who do these tasks. The people use tools to do some of the tasks. And finally, the plan organizes the work.

This model provides a template for thinking about how to set up the processes and policies for a system-building project. That is, when it comes time to do a project, one can use this model to help guide the decisions about how the project will be run. In this book I do not specify how one should make these decisions—each project has its unique needs, and no one recommendation will be a good solution for every project. Instead, the model provides a framework for understanding what decisions need to be made, and in later chapters I provide menus of choices for different parts of the model.

All the pieces of running a project are themselves a system, whose purpose is in general to get the system built. In this part, I follow a general approach for designing any system in order to lay out a set of functions that each part of the model can have. Doing so lays out a framework of criteria by which someone can judge potential designs for their project’s organization.

undisplayed image

The approach, then, begins with working out the purpose of the system for running the project. The purpose in turn derives from the stakeholders who must be satisfied with the execution. In the rest of this chapter, I lay out a template list of stakeholders and the needs each of them might have. This set of needs will then provide guidance for what each component part of the model—artifacts, team, plan, and tools—should have in order to satisfy the stakeholders.

16.1 Purpose

The primary purpose of the system that is the project is:

Get the system built, accepted, in operation; maintain and evolve it.

There are also secondary objectives that different stakeholders will have, which we will discuss next. These include, for example, needs of the organization hosting the team that does the work: the organization in most cases expects at least to be able to cover the cost of development. If the organization doesn’t believe that it can cover the cost, it may well decide not to pursue the project.

In the next step, I identify potential stakeholders. Following that, I will identify the potential needs each can have, the different forms each stakeholder can take, and how each stakeholder relates to the organization that runs the project.

16.2 Stakeholders and needs

The first step in working out a system’s purpose is to identify the stakeholders who define the purpose (or put constraints on the project that are, in effect, part of the purpose).

I group stakeholders into five classes:

  1. The customer for which the system is being built;
  2. The team that builds the system;
  3. The organization(s) of which the team members are part;
  4. Funders who provide the investment to build the system; and
  5. Regulators who oversee the system and its building.

Each of these is meant to be a role, rather than a single entity. For example, when a system is built under contract for an organization that is paying for the work, that organization is both the customer (they will be using the system) and the funder (they are paying for the system-building).

16.2.1 Customer

The customer is the person or organization(s) that will use the system once it has been built and deployed. The system’s value in the world ultimately derives from what the customer can do using the system.

The customer primarily cares about the system meeting some need they have. In addition, they care that the system:

Variations. The simplest customer relationship is when one organization contracts with another to build the system for it. In this case, it is clear who needs to be satisfied with the system (the one paying for it).

Other times the customer is internal: an organization determines that it needs some system for its own use. Who defines the purpose of the system is then usually clear—though sometimes it is not, because there is no clean separation between the “customer” and the builders.

Finally, the most complex situation occurs when the customer is hypothetical. This occurs when an organization builds a system product in hopes of providing it to future (paying) customers. In this case, there is no one person or organization who can dictate the system’s purpose. Instead, the team designing the system must build up an idea of who potential customers are and what they might want.

I discuss the different kinds of customers further in Section 32.6.

Relation to broader organization. Most organizations have someone or a team responsible for finding and working with customers. This might be a business development group, or a sales and marketing group. These people will be responsible for actually working with the customer, and they should stand in as a proxy for the customer during internal discussions. The systems aspects that I discuss here support the interface between the marketing or business development people and the people who build the system that is delivered to the customer.

16.2.2 Team

The team is the collection of all the people who do the work to design and build the system. This group includes developers and engineers, managers, contracting specialists, marketing, and everyone else who does tasks related to getting the system built.

Many of the things that the team needs are not directly related to building the particular system, but are aspects of the organization for which they work. An organization’s policies and management have the most effect on whether the team is satisfied, but there are aspects of the systems work that can support (or hinder) meeting the team’s needs.

The people in the team need, in general:

This list is based on the analysis documented in Section A.2.2.

Variations. The team can be as simple as one or two people, or it can range up to hundreds. The team can be all within one organization, or it can be spread over multiple organizations (such as when multiple organizations collaborate on a project). A team can also be viewed as including external vendors who provide parts of the system or essential services.

Relation to broader organization. Most of a team’s needs are matters of project management and business operations, not of systems-building in itself. The organization defines its human resources policies, for example, which address matters of how people are evaluated or paid, and how they can report problems.

However, the organization of systems work can help to meet these needs. Accurate staffing depends on understanding the work to be done, which in turn depends on the system’s design. Well-defined job descriptions and processes help people understand how to get their job done, contributing to people feeling secure in their position.

16.2.3 Organization

The organization is the entity or entities for whom the people in the team work, and which provide a legal entity for the project. I use the term “organization” rather than “business” or “company” because there are many kinds of organizations that can run a project: a government, a consortium of other organizations, a non-profit organization, or an informal group of people can all run a project.

All organizations share one concern: the ability to deliver the system. This includes having the ability to communicate with the customer (or model potential customers) and the ability to hire and support the team doing the work.

Organizations also share a need to maintain their reputation. If an organization has a reputation for delivering good systems, on time and on budget, it will be more likely to be able to keep going.

Some organizations have additional needs, focused around how the project will position them to deliver other things to other people. An organization may need to show a profit—enough to fund the organization’s overhead and to deliver returns to funders. An organization may need to be able to sell the system to potential customers. And an organization may need the project to position the organization for future work, based on improving the organization’s capability and maintaining its reputation.

Variations. There are many different kinds of organizations. These include:

Relation to broader organization. Obviously, most of an organization’s needs are addressed not by the team building a system, but by the organization’s management and operations. The systems project supports these needs, however. The organization needs to be able to estimate the cost and time involved in a project in order to ensure that it has the funding needed to complete the project. The organization’s reputation depends in no small part on its ability to execute the systems-building project, so things that help the project move ahead efficiently and smoothly will be good for the organization.

16.2.4 Funders

Funders provide the capital or other resources needed to build the system.

A funder has one primary need: the return on their investment. The return may be monetary (profit from sales of the system) or it may be more intangible (a business ecosystem, regional economic development).

Some funders will have secondary needs, such as enhancing their reputation and positioning themselves for funding future projects.

Variations. Funders can be external to the organization building the system, providing investment in the expectation of a monetary return. Venture capital funding is one example of this kind of funder.

The customer can be a funder when the customer pays for building the system. This can be a commercial customer funding the project through a contract. This can also be a government organization providing a development contract. The expected return in these cases is primarily the system itself, and secondarily less tangible benefits like the development of capacity to build similar systems.

A project can also use internal funding. This occurs when an organization has the capital to develop a system itself. The organization generally expects a return on its investment either by improving the organization’s own capabilities, such as by building a tool that helps the organization run better, or by providing a product that the organization can sell for a monetary return.

Relation to broader organization. While the organization has the primary responsibility for working with funders, a systems-building project can help meet the funders’ needs by building the system efficiently, using the investment well, and by producing a good system, which helps ensure that the expected return will occur.

16.2.5 Regulators

Regulators in general are people or organizations independent from the team and project. The regulators provide an external check on organizations and products to ensure they meet safety and security regulations, or that they provide legally-required public value.

Regulators need compliance with regulation in the system and in the work the team does to build the system. The regulator may verify that regulations have been met by inspecting the final system or by auditing records of the system’s creation. The regulator may block a system’s deployment until the system can be certified as meeting these requirements, as happens with aircraft. Alternatively, the regulator may depend on the team to know and follow the regulations and only check the system’s compliance when something goes wrong. The US automotive industry is an example of this.

The systems-building process, at minimum, supports regulators’ needs by ensuring the team knows and follows the regulations. This often involves dialog with the regulatory organization to ensure that the team has all the information it needs, and to ask for clarification or guidance when the team is unsure about a regulation. The team also needs to maintain records that can be checked to show how it has complied with regulations. When the system requires certification before being deployed, the team usually needs to engage with the regulators to ensure the process goes smoothly.

Variations. A government organization is the obvious regulator. They have the charter to look after the public interest, especially when a project has incentives that would work against that interest.

Industry organizations can act as de facto regulators. A group of companies can come together to set voluntary standards for the systems they make. The groups that coordinate Internet naming (the Internet Corporation for Assigned Names and Numbers, ICANN) or that standardize WiFi for interoperability (the IEEE Standards Association and the Wi-Fi Alliance) are examples. These organizations do not have authority to penalize systems that do not comply, but a non-compliant system is not allowed to claim compatibility.

Finally, there are non-governmental organizations that set safety or security standards, often for a particular industry. ISO, SAE, and others provide safety standards (such as [ISO26262] or [ARP4754]), and companies have grown up around them to help other organizations comply with the standards. These organizations also have no authority to penalize non-compliant systems directly, but compliance is usually used as evidence that government regulations are met, or as a defense against lawsuits.

16.3 Mapping needs to model

The previous section introduced a set of stakeholders that have an interest in how the project operates, and a summary of each of their needs. The next step is to work out how the model for performing the project can support meeting those needs (see the diagram above). This involves mapping the stakeholder needs to each of the parts of the model (artifacts, team, tools, plan).

I developed this detailed mapping. Appendix A reports the details of each stakeholder and their needs, along with the full derivation from needs to the requirements for the pieces of the system-building model. The mapping has the form of tables of requirements or objectives, with each stakeholder need mapped to one or more objectives for each part of the system-building model. The result is that every stakeholder need is either supported by aspects of the system-building model, or is explicitly labeled as the responsibility of others outside the system-building project. The derivation also shows that every objective listed for the system-building model is justified by helping meet some stakeholder need.

The remaining chapters of this part of the text explain what each part of the model should be or do. These chapters are based on the derivation in the appendix.

Sidebar: Summary

Chapter 17: Artifacts

25 May 2024

17.1 Purpose

Artifacts are all the things created in the process of making a system. They start with records of the purpose of the system and the requirements it must fulfill. They include the implementation of the system ready to deploy—such as hardware inventory in a stock room and software ready for installation. The artifacts include everything in between: design, source code, verification records, rationales for decisions, records of reviews and approvals, and many, many more. The artifacts also include information used by the team to help do its job, such as information about who is on the team, processes to follow, and how the team operates.

The objectives for artifacts are documented in Section A.3.1.

The artifacts have three functions: as deliverables, as communication, and as a record of the project for auditing.

As deliverables, the implementation artifacts are the actual system to be deployed. It should be possible to take a set of implementation artifacts, assemble them (following instructions that are themselves artifacts) and have a working instance of the system. These artifacts are joined by things like records of regulatory approval and information associated with serial numbers or versions showing the history of the specific artifacts deployed in the system.

Most of the artifacts, however, are for communication: between people working on one task and another, between the customer and system designers, between those who implement and those who verify. Sometimes those people are working concurrently, such as when two people design two components that are expected to work together. Sometimes the communication is between someone who specifies attributes for a part of the system and someone who implements that part. The communication is also between someone who made a design decision and someone who, years later, must understand that decision in order to make changes to the system.

Audit is a special case of communication. It is between the project and someone outside who will be checking the project’s work. In many cases the external party will have an adversarial role, looking to find mistakes or violations. Regulators, for example, may look through records to check that the team has followed processes that meet regulatory requirements.

Note that there are many ways to achieve the objectives laid out in this chapter. Each project will need to determine how to handle its own artifacts. The specific solution will depend on the complexity of the project, the size of the team, and requirements from the organization or industry. The appropriate solution may change over time: as a team grows, it may need more formal mechanisms.

I have seen a range of working approaches for handling artifacts. Two projects kept track of planning information on designated whiteboards. Others maintained plans in project management tools. (The whiteboard approach had a problem: one time someone erased the board. Luckily there was a recent picture of its contents.)

I have also been on projects that had an overly complicated solution. One project was a joint venture between multiple companies on multiple continents. That project used multiple repository tools for different kinds of information. There was a process for proposing design and implementation changes, but no one knew quite what it was or how to follow it. After a few years that joint venture fell apart, in part because the teams could not figure out how to work together.

Whatever solution you adopt, it is important that it fit your project and team. It should be capable enough to manage the kinds of artifacts your team will use, and simple enough for the team to use.

The objectives in this chapter can help you work out what capabilities your solution should handle.

17.2 General principles

The artifacts are meant to be shared, at least within the team and sometimes to people outside. The people using these artifacts will come and go, so supporting people who will use them in the future is as important as sharing in the moment.

This leads to some general principles about artifacts.

People should be able to find the artifacts they need. An artifact is not useful if the people who need it don’t know it exists, or if they don’t know how to find it. The artifacts should be organized in some way that helps people find them.

“Finding” has multiple aspects. It can mean that when they know something exists, they can get to that artifact conveniently. It can mean that they know that a general kind of thing probably exists, and they need to be able to navigate through to the artifacts of that kind. They may not know what is out there, and need to be able to browse or discover artifacts in order to learn about the system. Or it might mean that they need to have confidence that they can itemize all of a certain kind of artifact, without missing any.

People should have confidence that they have found the correct artifact. In the worst case, someone will look for a particular thing and find three or four potentially-relevant artifacts. Which, if any, of those should they believe? What if they disagree with each other?

This principle generally means, first, that any particular piece of information or artifact should be in one place. There should not be two different artifacts that appear to be authoritative sources for the same piece of information. It also means, second, that when there are legitimately multiple versions of an artifact, those versions should be clearly identified and that a user should see consistent versions of different artifacts unless they take explicit actions to see different versions.

The artifacts should be maintained securely. The system that the customer will ultimately use is based on many artifacts that the project maintains. If someone subverts or damages some of those artifacts, the resulting system will be compromised. If someone destroys some of the artifacts, some of the team’s work will be lost.

This argues at minimum for maintaining the integrity of the artifacts, meaning that the artifacts or the collection of them cannot be modified in an unauthorized way. (Good practice is that any change to an artifact can be traced reliably to the person who made the change.)

Some of the artifacts may need to be kept confidential, if they contain secret information. Almost every project has some information to be kept confidential, at minimum as part of maintaining the integrity of artifacts. (Login credentials, for example.)

17.3 Kinds of artifacts

This section lists the kinds of artifacts that the analysis in Appendix A showed contribute to meeting stakeholder needs. The artifacts are listed in the order used in that analysis.

17.3.1 Purpose and constraints

These artifacts include clear documentation of the customer’s purpose for the system. Every feature of the system derives, directly or indirectly, from this purpose. If that purpose is not written down, the team is unlikely to accurately design to meet those needs—and is likely to add features that the customer does not want (so-called “gold plating”). These artifacts should be visible to most of the team in order to guide them as they design, build, or verify the system.

The customer’s non-functional constraints should be included. This includes the safety, security, and reliability they expect.

Constraints from other stakeholders should also be documented. The organization may place constraints on the project, such as expected profitability. Regulators can place many constraints that must be met to license or certify the system.

The understanding of the purpose or constraints will change over time. A customer will find they have needs they did not initially realize, or they will discover that whatever purpose was agreed with the team is not quite what they meant. An organization or regulators may change their constraints as time goes by.

There should be an explicit record of the changes requested or identified. If a change is accepted—and the project may choose to reject some changes—then it should lead to a new version of the purpose and constraints. It should be possible to determine whether other artifacts, such as requirements or design, are consistent with a particular version of the purpose and constraints.

The specific kinds of artifacts include:

17.3.2 Team information

Maintaining information about the team helps the team work together.

I worked on one project where the management did not want to put together an organization chart or a list of team members. I ended up talking to the wrong person about a particular technical subject—that person was happy to talk about it, but it turned out they were not actually on the part of the team working in that area. Their opinions turned out not quite to agree with those of the person actually in charge, but I hadn’t been able to find the person I should have been talking to.

This kind of confusion is more common than people expect, and it results in people getting the wrong information, or in people not getting information they should.

Information about the team is only valuable if it is accurate, however. The team should have someone responsible for keeping it up to date—meaning that ideally updating the information is a normal part of the processes (Chapter 44) for bringing in a new team member or changing assignments.

The specific kinds of artifacts that will help include:

17.3.3 System artifacts

These artifacts are the system that is being built—the majority of the work of a project.

The system artifacts include:

The exact set of these system artifacts depends on the process and life cycle (Section 20.3) that the project uses. If the life cycle has some review milestone that a part of the system is supposed to meet, then there may be documents or analyses specific to that review.

That said, good system building practice involves some core kinds of artifacts: specifications, designs, and implementation.

The artifacts should include some items that are more about the system building process than about the deliverable system itself. These include:

How the team maintains these artifacts can vary widely. Many software efforts use version control systems, which maintain versioned software artifacts in a repository server. Many hardware design tools either provide their own versioning repository, or are designed to work with a separate repository system. For hardware artifacts—not their design—one must work out where to store and how to track each physical artifact.

17.3.4 Verification artifacts

Verification artifacts support verifying that the system (or components in it) meet their intended purpose and specification, and that they are free of errors.

These artifacts include:

These constitute a record of which parts of the system have been checked and found to meet their verification criteria.

Verification should be repeatable. The artifacts maintained for doing verification checks should be complete enough that different people can perform the checks in the same way. The instructions for performing checks should be clear. The test equipment should be maintained and people should have instructions on how to use it. Software test environments should be controlled so that when a test is run twice, it is in the same environment both times.

The verification results are generated by people performing checks, and used by people reviewing part of the system to ensure it has been checked before it is accepted as working. They may also be audited by regulators or other outsiders who will be checking whether the project has built the system properly.
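A simple way to support both repeatability and later review is to record each verification run with enough information to reproduce it. The record below is a hypothetical sketch; the field names and values are invented examples, not a required format.

    # Hypothetical record of one verification run. Enough detail is kept
    # that a different person could repeat the check, and an auditor can
    # later see what was verified, against what, by whom.
    verification_record = {
        "procedure_id": "VER-EPS-021",            # which written procedure was followed
        "procedure_version": "1.3",
        "item_under_test": "PDU",                 # component or subsystem checked
        "item_version": "rev C, serial 0007",
        "requirements_checked": ["EPS-REQ-40", "PDU-REQ-15"],
        "environment": {
            "test_bench": "EPS flatsat #2",
            "software_image": "eps-fsw 2.4.1",
            "calibration_date": "2024-04-02",
        },
        "performed_by": "J. Doe",
        "date": "2024-04-18",
        "result": "pass",                          # pass / fail / pass-with-deviation
        "evidence": ["logs/ver-eps-021-run3.log"],
    }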

17.3.5 Release, manufacture, and deployment

Releasing and deploying a system are complementary steps. Releasing involves taking implementation artifacts and making them available for manufacture or distribution.[2] Manufacturing the system follows if needed—involving producing and assembling hardware, or packaging software into a deployable form. Deployment takes the manufactured system and sets it up for a customer to operate.

undisplayed image

The artifacts should include the procedures used to release, manufacture, and deploy the system. The release procedures define the sequence of steps involved in taking a version of the implementation artifacts, checking that they have been verified and meet the intent of a release (such as the features implemented or bugs fixed), and placing copies of those artifacts in a separate area as a release. The manufacturing procedures define how to take those released artifacts and manufacture products that are ready for deployment: assembling hardware according to a released hardware design, for example, and giving them serial numbers. The deployment procedures tell how to take those manufactured artifacts and install them so that they are a working customer system.

There are different variations on this flow of operations depending on whether one is releasing and deploying a whole system or an update, whether the artifacts are electronic (software or data) or physical (hardware components), and whether the system will be mass produced or not.

Hardware components will generally start with a release of a hardware design. That hardware design is the basis for manufacturing instances of the component. Whether it is a single unit made in house or many units produced in a dedicated facility, the manufacturing procedure determines how the hardware products are made. Before finishing manufacture, hardware components are typically given an identity, often recorded as a serial number, that identifies the specific component instance and associates it with records like which design release version was used, what subcomponent parts were used, date of manufacture, and so on. Then the part is placed in inventory from which it can be deployed.

Software components most often follow a different path. Being electronic information rather than physical, software has no physical manufacturing step. The release procedure gathers implemented software and creates a deployable package from it. The manufacturing procedure gives the package an identity (a release number) and signs it or otherwise sets up security protections. It can then be copied to a server that makes it available for distribution and deployment.

Deployment procedures take hardware from inventory and software from a distribution server and put them into use for a customer. This could be as simple as letting customers know that a software update is available for download. It could involve moving a number of physical components to a customer site, setting them up, and performing deployment checks to ensure that the installed system is working. It could be as complex as delivering a spacecraft to its launch provider, preparing it for launch, and having the spacecraft start up on orbit.

The whole process of producing deployed systems often generates a lot of records. Hardware devices have associated records about what specific design was used, what subcomponents were used, and when and where each unit was manufactured, and then accumulate service records: when deployed, what defects were reported, what repairs were made, how the device was disposed of at end of life. Software has similar records: the identity of the software image, the versions it contains, how it was built, when it was made available for deployment, where it has been deployed, and its service history.
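As a purely illustrative example of this kind of record keeping, a project might maintain a per-unit record along the following lines; the fields and values shown are invented, and a real project would define its own set.

    # Hypothetical as-built and service record for one hardware unit.
    unit_record = {
        "serial_number": "PDU-0007",
        "design_release": "PDU rev C",             # which released design was built
        "subcomponents": ["relay board SN 0113", "housing lot 42"],
        "manufactured": {"date": "2024-03-11", "site": "in-house shop"},
        "deployed": {"date": "2024-05-02", "customer": "Flight unit 1"},
        "service_history": [
            {"date": "2024-06-10", "event": "defect report DR-88: relay chatter"},
            {"date": "2024-06-14", "event": "repair: relay replaced, retest passed"},
        ],
        "disposition": None,                        # filled in at end of life
    }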

17.3.6 Project operations

Artifacts that support operations can be broken down in the same way that operations itself is (Section 7.3.5 and Chapter 20).

The project’s life cycle and procedures can be maintained in simple documents. Because these documents serve as a reference for team members, it is important that people be able to find easily the parts of the documentation they need for a particular situation: for example, if someone is setting up a design review for a particular component, they need to find the procedure for design reviews. The documents also need to support people reading through the life cycle or procedures to learn how the project operates in general. Having a good table of contents or index and accurate summaries can help them understand the breadth of operations before they need to learn about some specific procedure.

I have worked on several projects—especially including NASA projects—that develop complex “management plans” and “systems engineering management plans”. I have found that few people in the team actually use these documents. The management plans often follow a template that speaks to the team’s aspirations (“the team will do X”) but does not lay out the actual procedures (“do X by doing Y and Z”). The information in these plans is also often organized for a management reviewer, rather than for the people who need to follow the procedures. As a result, the documents sit unread after being approved and the team operates on shared lore about how to do one task or another, and the plans become increasingly out of date as the team’s practice diverges from the original intent.

Instead, the life cycle and procedure documentation should:

Beyond the life cycle and procedures, planning and tasking activities involve creating and maintaining records. These artifacts are often maintained using specialized tools, such as project planning tools and task management (or issue management) systems.

Operations also maintains records of supporting information, such as budgets, risk registers, and lists of technical uncertainty.

17.3.7 Regulatory artifacts

Working with regulators typically involves a lot of records. The team uses some of these to guide how it builds the system. Other records form a legally-binding record of what the project has done and how the team has interacted with the regulators.

First, the artifacts should include records of the regulations that the project must comply with. This might be as simple as references to publicly-available reference sources (such as web sites that make current government regulations available). It may also include documents that explain what these regulations mean. This information is only of value if it is accurate; this means it must be kept up to date as regulations change. (In some fields, it is worthwhile having someone who tracks likely upcoming regulatory changes so that the team can anticipate those as well as working to current regulations.)

The artifacts should also include records of the processes that the team needs to follow working with the regulators. For example, if the system must obtain a license before being deployed for use, then there will be a process for applying for that license. Again, this information must be kept up to date to be useful. The processes are often difficult to find or interpret, so it is helpful to maintain documents that explain the process as well as just a record of the process.

Second, systems that need licenses or certification will require applications to regulators. The application information should be maintained, including copies of any application forms (with dates!) and any supporting documents generated as part of putting the application together. For example, I helped one team apply for a license to operate a small spacecraft in low Earth orbit. The license application included an orbital debris assessment report, which was sent to the regulator as part of the application packet. The assessment report included information generated by a debris assessment tool [NASA19]. The database used by the assessment tool was an artifact to be maintained, along with the report itself.

Correspondence with regard to the applications also needs to be maintained. This should include any information that shows how the team took steps to follow the application processes.

Next, the project must keep records of licenses or certificates that have been issued.

Finally, the project will need to maintain evidence that the system it has built complies with regulation, whether a license application is involved or not. This evidence takes the form of a mapping from a table of regulatory requirements to the evidence of compliance with each of the requirements. The evidence can be complex: for example, showing that the probability of a particular hazard occurring is below a mandatory threshold.
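Such a mapping is often maintained as a simple table or matrix. The fragment below is a hypothetical sketch of that structure; the regulation descriptions and evidence entries are invented for illustration.

    # Hypothetical compliance matrix: each regulatory requirement maps to
    # the evidence the project will point to during licensing or audit.
    compliance_matrix = [
        {
            "regulation": "Orbital debris mitigation: passivation at end of mission",
            "applies_to": "EPS",
            "evidence": [
                "EPS-REQ-40 and its verification results",
                "Orbital debris assessment report, section 5",
            ],
            "status": "complete",
        },
        {
            "regulation": "Casualty risk from reentry below mandated threshold",
            "applies_to": "Spacecraft structure",
            "evidence": ["Debris assessment tool output, run of 2024-04-02"],
            "status": "in review",
        },
    ]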

17.4 Managing artifacts

Artifacts are the result of the team’s work, and thus they carry value to the team and its customers. They represent the system being built. They are used continuously to inform and manage the team. They are often used long after they are created, to audit the work and to guide modifications to the system.

The artifacts change over the duration of the project. An early design draft gets revised into a version used to build the corresponding component. Later, the design is revised for a second-generation component.

These conditions lead to three general principles for managing artifacts: security to protect integrity, organization so people can find the artifacts, and change management.

17.4.1 Security

The artifacts need to be managed in a way that preserves their value by maintaining their integrity. Losing or damaging an artifact results in a loss that could be anything from annoying (losing minutes from a status meeting) to fatal to the project (damaged implementation of a critical component). The artifacts should be protected against both accidental loss, such as a server breaking, and malicious loss. For data artifacts, this means using resilient storage systems with good cybersecurity. For physical artifacts, it means storing artifacts in storerooms that maintain a benign environment and that provide physical security.

Access to the artifacts should be limited to authorized people using access control mechanisms. These mechanisms reduce the risk of malicious damage by limiting who can get to the artifacts. For artifacts that need to be kept confidential, limiting access helps reduce knowledge leaking to unauthorized people.

17.4.2 Organization

A random jumble of artifacts is of little use to people on the team. The team members need the artifacts to be organized in a way that allows them to find the ones they need accurately and quickly.

There are two kinds of “finding” that team members will do.

In the simple case, they will know what they need: the design document for some component, or the risks associated with the project, or widget serial number X. To find something specific, they need to know where to find artifacts and how those artifacts are organized. They can use that organization to get to the specific one.

The other case is when someone knows they have a need but does not know exactly what they are looking for. This might be someone who has recently joined the project, or someone who is working in an area they aren’t familiar with. These people will need to be able to see and learn how the artifacts are organized, and will need a guide to help them understand what is available.

Finally, there should be one logical place for each artifact, and artifacts should not be duplicated. (There might be copies for redundancy, but the people looking for one artifact should see those copies as if they were one thing.) Two people looking for the same information should not end up finding two different artifacts that cover the same topic and that have diverged from each other. This leads to people building incompatible components, sometimes in ways that are hard to detect but that lead to errors in the system.

17.4.3 Change management

As I have noted, artifacts change regularly over the course of a project. However the artifacts are managed, the approach needs to account for the effects of these changes.

Some artifacts, like records of task assignments and progress, change often but at any given time there is only one accurate copy of the information.

Most system artifacts, on the other hand, evolve in more complex ways. At any given time there may be multiple versions that are works in progress—containing incomplete changes that their creators don’t believe are ready to be used by others. Some of those in-progress versions may develop to become accepted versions, ready for others to use: a design that is ready to be implemented, or an implementation ready for integration testing. A version that has been used like this may later become obsolete as an updated version comes along.

This pattern of change calls for supporting versioning on this kind of artifact. Versioning means that one can find multiple versions of the artifact, and each artifact has an identifiable status so that someone can know whether they should be using that version to build other artifacts, or just looking at the version to understand it.

The dependencies of one artifact on another, such as a design leading to an implementation and an implementation leading to verification test results, mean that mutually consistent versioning is also important. When looking at an overall version of the system, it should be clear that (for example) the design for component X has been updated, the implementation for that component is in the process of being updated to match the design, and any verification results are from an older implementation that may no longer be accurate.

Most project life cycles and procedures define different statuses that an artifact version can have, along with procedures for how that version can change status. While the details differ, the statuses generally include some sort of work in progress, proposed, approved (or baselined), and superseded. The procedures generally say what has to happen for a version to move from one status to another, such as defining that a proposed design needs a review and approval step to be accepted as a baseline.
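As a minimal sketch, assuming the four generic statuses named above, the workflow might be expressed as a set of allowed transitions; a real project would define its own statuses and the review steps attached to each transition.

    # Generic artifact-version workflow: statuses and the transitions the
    # project's procedures allow. The status names follow the generic
    # statuses described above; the transition rules are illustrative.
    ALLOWED_TRANSITIONS = {
        "in-progress": {"proposed"},
        "proposed":    {"approved", "in-progress"},   # approval, or rework after review
        "approved":    {"superseded"},                # replaced by a newer baseline
        "superseded":  set(),
    }

    def change_status(current: str, new: str) -> str:
        """Apply a status change, enforcing the project's workflow rules."""
        if new not in ALLOWED_TRANSITIONS[current]:
            raise ValueError(f"cannot move a version from {current} to {new}")
        return new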

17.4.4 Implementing artifact management

There are many tools and processes in use today for managing artifacts. At the time of writing, no one tool works well for all kinds of artifacts, and so a project must stitch together its approach to managing artifacts out of multiple different tools.

Electronic artifacts. Software development uses version control systems to manage electronic files. There are many such systems, all of which provide a storage repository with a few common features:

Other industries use document control systems to manage collections of electronic files. These systems also provide a repository for a collection of files, but they generally focus on the management of documents rather than just versioning. They commonly include features like:

In addition, tools such as CAD systems or requirements management tools often include versioning and workflow features. These tools support creating different versions of an artifact, and defining a workflow for the procedure to be followed for approving a version as a baseline.

In practice the tools for managing artifacts do not often work together, requiring a project to (for example) select one tool for managing software artifacts, one for CAD system artifacts, another for structured systems engineering artifacts (such as requirements or specifications), and another for documents that do not fit neatly into these other categories.

Hardware artifacts. Many projects will create physical artifacts—mechanical components, electronic boards, manufacturing jigs, and testing equipment. These physical components need:

Sidebar: Summary

Chapter 18: Tools

25 May 2024

18.1 Purpose

Tools are things that people use while designing and building the system. The tools are not part of the system itself; they are not delivered to an end user. Their purpose is to help the team do their job. Each project will have its own needs for tools, so this list is meant to inspire ideas rather than prescribe what may be needed for building any specific system. There are, however, some common principles for selecting and managing tools.

This chapter brings together information about many different kinds of tools, with references to the other parts of this volume that discuss details.

Please note: I do not recommend specific tools.

18.2 General considerations

There are a few general principles that apply to selecting tools.

First, most tools will be used for shared work. Tools should be evaluated on how well they help the team work together. Computer-based tools that manipulate shared data, such as CAD tools, should make it easy for multiple people to access the information concurrently. They should support the project’s approach to versioning information (Section 17.4.3). Physical tools should be accessible to those who need to use them. This is especially important to consider if people work in multiple physical locations.

Second, many tools require training to be used effectively and safely. The project must ensure that each person has been trained to use a tool safely before they are allowed to use it. That implies that tools should be evaluated on the quality of educational material available on how to use them.

Third, good tools are integrated so that they work together. Tools that can share information can provide greater value to the team than ones that cannot.

Next, tools should support the general life cycle and procedures the project uses. They should fit into the project’s procedures for managing artifacts, versioning them, and reviewing them.

Finally, tools should be secure. Good tools will support the project’s overall approach to security, including controlling access to information based on a person’s role in the project. This includes both electronic and physical security.

18.3 Kinds of tools

This section provides an overview of all the kinds of tools discussed elsewhere in this volume, with references to the sections that provide details. The overview can serve as a checklist for a team working out what tools they need.

18.3.1 Storing and managing artifacts

The tools for storing and managing artifacts are discussed in Section 17.4.4.

Electronic artifacts. Alternatives include:

Hardware artifacts. These can use:

18.3.2 Specification tools

As I will discuss in Part VII, the team will develop specifications for system components. A specification defines a component’s external interfaces—in systems terms, how the component is part of functional and non-functional relationships (Section 12.2).

There are several kinds of specifications (Section 33.4), including requirements, interface definitions, and models.

Requirements (Chapter 34). Requirements provide textual statements of things that are to be true about a system or component. Requirements can be managed using:

I list a number of considerations for selecting requirements management tools in Section 34.13.

Interface definitions. Interface definitions specify how one component can interact with others. These can be written using:

Models. Mechanical, mathematical, electronic, behavioral, and other kinds of models are used as specifications. Relevant tools include:

18.3.3 Design tools

A project’s design phase works out a set of designs for the system and its components that satisfy the corresponding specifications.

A design records the structure of each component—whether a high-level, composite component or a low-level component (Chapter 11). It also records analyses that lead to designs and rationales for how a design ended up as it did.

There are two kinds of design artifacts: the breakdown structure and the designs themselves. The model in Section 11.4 has six parts to a component design: form, state, actions or behaviors, interfaces, non-functional properties, and environment.

Breakdown structure (Chapter 36). I recommend that the component designs be organized by the component breakdown structure. This structure organizes the components into a hierarchical name space, giving each one a unique identifier and showing how one component is made out of others.

On most projects, I have used a spreadsheet to list all the components, the breakdown organization, and their names. This has worked well enough, and I am not aware of tools that explicitly support such organization.
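As an illustration of what such a spreadsheet captures, the sketch below keeps the breakdown as flat rows (identifier, parent, name), checks that identifiers are unique and that every parent exists, and prints the hierarchy. This is only a sketch; the component identifiers and names are made up for the example.

# Each row: (identifier, parent identifier or None, name). Values are illustrative only.
ROWS = [
    ("SYS",        None,     "Complete system"),
    ("SYS.AV",     "SYS",    "Avionics subsystem"),
    ("SYS.AV.PS",  "SYS.AV", "Power supply board"),
    ("SYS.AV.FC",  "SYS.AV", "Flight computer"),
    ("SYS.GSE",    "SYS",    "Ground support equipment"),
]

def check_breakdown(rows):
    """Check that identifiers are unique and every non-root component has a known parent."""
    ids = [ident for ident, _, _ in rows]
    assert len(ids) == len(set(ids)), "component identifiers must be unique"
    known = set(ids)
    for ident, parent, _ in rows:
        assert parent is None or parent in known, f"unknown parent for {ident}"

def print_tree(rows, parent=None, depth=0):
    """Print the breakdown as an indented hierarchy."""
    for ident, par, name in rows:
        if par == parent:
            print("  " * depth + f"{ident}  {name}")
            print_tree(rows, ident, depth + 1)

check_breakdown(ROWS)
print_tree(ROWS)

Keeping the breakdown in a flat, checkable form like this makes it easy to give each component a unique identifier and to spot gaps or duplicates as the structure grows.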

Form. The form represents the aspects of a component that do not change, or change only very slowly. The design of physical components is generally handled using CAD tools. These tools use notations or drawing standards appropriate to each subject.

State, actions, behaviors. This part of a design addresses the parts of a component that change readily.

Non-functional properties. These properties change slowly and are not part of the component’s form.

Environment. This is the environment in which the component is expected to operate, or in which it may be stored. This is usually recorded in text.

18.3.4 Analysis tools

These tools help the design process by providing feedback on how well a particular design works. They also are used when verifying a proposed design.

18.3.5 Build tools

These tools help translate designs or implementations into operable components that can be integrated into a running system, or used for testing.

The built artifacts will need to be stored and tracked, as discussed above.

Physical artifacts. The building of physical artifacts is, in effect, manufacturing one or a small number of those artifacts. These can be built in multiple ways.

In-house building will require maintaining a stock of the materials used in the components. This may include a stockroom of pre-acquired parts, such as metal or plastic stock and fasteners, or suppliers that can provide the needed material quickly.

The building process should be deterministic: if the team builds multiple instances of the same component, the components should all look and behave the same way. This places constraints on whatever tools and procedures are used to build the components.

Software artifacts. Software artifacts are built by translating source code into binary and packaging it in forms that can be installed on a target system.

The software build process must be repeatable: if the same software is built twice, the result should be identical in behavior (differing only in details such as version numbers, timestamps, or signatures that depend on them). This usually means that the software build tools should be under configuration management so that identical tools will be used each time.
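One simple way to put build tools under configuration management is to record the expected tool versions alongside the project and check them before each build. The following is a minimal sketch of such a check; the tool names, version strings, and commands are hypothetical examples, not a recommendation of specific tools.

import subprocess
import sys

# Pinned tool versions recorded with the project (hypothetical tools and versions).
PINNED = {
    "gcc": "12.3.0",
    "cmake": "3.27.1",
}

def installed_version(tool):
    """Return the first line of `tool --version`, or None if the tool is not available."""
    try:
        result = subprocess.run([tool, "--version"], capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.splitlines()[0] if result.stdout else None

def check_pins(pins):
    """Report any tool whose installed version does not match the pinned version."""
    ok = True
    for tool, expected in pins.items():
        found = installed_version(tool)
        if found is None or expected not in found:
            print(f"MISMATCH: {tool} expected {expected}, found {found}")
            ok = False
    return ok

if __name__ == "__main__":
    sys.exit(0 if check_pins(PINNED) else 1)

A real project would extend this idea to libraries, build scripts, and the environment the build runs in, but the principle is the same: the record of what tools were used is kept and checked, not assumed.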

18.3.6 Testing tools

Testing involves taking a component, or collection of components, and subjecting it to some sequence of activities to verify that the component behaves as specified.

Testing occurs at two different times during system development: as people are building parts of the system and when a component or the system is being verified for final acceptance. These two uses lead to somewhat different needs in the tools for testing.

Tests need to be accurately reproducible: someone should be able to run a test one time on one component, then run the same test later on the same component and get the same result. Of course some component behaviors are not fully deterministic, but accounting for that, a passing test should mean that the component really does meet the specification being tested. If a test fails, people need to be able to reproduce what happened to understand the flaw and to determine whether a fix works.

Reproducibility places constraints on testing tools. Physical tests will need to be done in consistent environments, using control and measurement tools that can be calibrated to ensure they are behaving consistently. Software tests similarly need to be run in controlled environments.
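For software, one common way to handle behavior that is not fully deterministic is to fix and record the source of randomness so that a failing run can be replayed exactly. The sketch below is a minimal, hypothetical example: the component under test (jittered_retry_delay) and the seed are illustrative, not taken from any particular project.

import random
import unittest

def jittered_retry_delay(attempt, rng):
    """Example component behavior: exponential backoff with random jitter."""
    return (2 ** attempt) + rng.uniform(0.0, 1.0)

class RetryDelayTest(unittest.TestCase):
    SEED = 20240625  # recorded with the test so a failing run can be replayed exactly

    def test_delays_increase(self):
        rng = random.Random(self.SEED)  # seeded generator makes the run repeatable
        delays = [jittered_retry_delay(attempt, rng) for attempt in range(5)]
        self.assertEqual(delays, sorted(delays))  # backoff should never shrink

if __name__ == "__main__":
    unittest.main()

Recording the seed, along with the rest of the test environment, is what makes a later re-run of a failed test meaningful rather than a roll of the dice.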

Hardware testing. Testing hardware components can range from measurements of single components to integration tests of subsystems or even the complete system. The tools available vary widely, depending on the kind of testing being done.

All hardware testing will involve:

Tools that support testing electronic components can include:

Tools for testing mechanical components include:

Integrated system testing can go well beyond the tools listed here. Flight testing a new aircraft, for example, is far more complex than suggested by these tools. I leave the design and operation of such testing to others better versed in it.

Software testing. Software testing generally involves setting up a number of test cases or scenarios, running the software being tested, and recording the results. There are many different tools that can be used, and these depend on the kind of test being performed and the environment or language being used.

Categories of tools include:

18.3.7 Operations tools

The team uses other kinds of information to manage its operations: information about the team itself, about procedures and plans, and information that supports decision-making.

Team information (Section 17.3.2). This information is organized around the roster of who is on the team, along with their roles and authority.

This information links to other tools, some of which are often outside a project’s scope. These include:

These relationships get updated whenever someone joins the team, leaves the team, or their role changes. Using tools that guide people through the procedures for these updates will make the changes more accurate.

Life cycle and procedures (Chapter 20). Teams follow a project life cycle and procedures to do their work. These consist of steps that people should follow to get specific tasks done.

Workflow management tools exist to help guide people through these procedures. These tools can help by:

Plans and tasking (Section 20.5 and Section 20.6). The project maintains plans for how the system-building work will move forward and the work currently in progress. The plan records the work that the project will be doing, at varying levels of confidence and detail, while tasking tracks the specific work that people have been assigned. This information is used both to make sure that the team do the work that is needed, without important tasks getting forgotten, and to forecast the time and resources needed to move forward.

Maintaining plans and tasking is an exercise in managing a lot of detail. Many tools are available to help with these.

In practice, many of the tools available have been designed for projects other than systems-building, and do not support systems projects well. Many project scheduling tools are based on methodologies worked out for predictable work like building construction, where the tasks can be known fairly accurately in advance. These tools are often organized around a Gantt chart of the work, prompting their users to estimate durations and make task assignments early in the project. This works poorly in systems projects that have significant uncertainty early on, and where the degree of certainty (or predictability) improves unevenly as time goes by. The result is often a false sense of confidence in the project’s schedule early on, and a lot of effort spent trying to keep the schedule up to date as work moves forward.

It is worth spending effort working out how a project will manage its planning and tasking, and ensuring that any tools chosen will support that approach.

Support. Project operations maintains other kinds of information as well, for which tools are sometimes available. These include:

18.4 Managing tools

Good tools can enhance a team’s performance. Poorly chosen or implemented tools can harm it. One must choose tools carefully and apply thought to how they are implemented and used.

A project’s tools are themselves systems, and should be treated with the same care as the system being built for a customer.

Each tool should have a purpose. Spending the time to work out who will benefit from a particular tool, both directly and indirectly, can provide useful guidance when choosing between options for that tool.

The engineering support tool industry has generated many products, so there are often many possibilities to choose from. While the team can sometimes cut a decision process short because they already have experience with one particular tool, in other cases it is worth setting out some criteria for making the choice.

Factors that can influence the choice of tool include:

Once a tool has been chosen, it will need to be purchased or built, and deployed for the team to use. This usually requires finding space for the tool, whether that is physical space in a lab or capacity on a compute server. The acquired tool will need to be deployed and integrated into the project’s systems: adding information about the tool to an inventory database, setting up a service schedule if needed, integrating software systems with the project’s security mechanisms.

Team members will need to learn how to use new tools. For some tools, this can amount to providing a written introduction or presentation on how the tool works. More complex tools will require more formal training. If there are safety or security risks in using a tool, the project should ensure that people are required to receive training before using it. It is common to track formally which people have received this kind of safety training.

Sidebar: Summary

Chapter 19: Teams

26 June 2024

Building a complex system involves a team of people to do the work. The people in the team fill many different roles: developers, managers, customer and regulatory interfaces, support staff, among others.

A team of more than perhaps three or four is not an amorphous blob of anonymous people; it is organized so that each person has a role. The way a team is organized may arise spontaneously or deliberately, but it will end up with an organization. A well-functioning project will design its team organization and take deliberate actions to maintain its good function.

In this chapter I discuss the issues to be addressed when deciding how a team should be organized, including its structure, roles, and communication.

19.1 Purpose

Building a complex system requires many people to share the work. One person cannot do all of the work: they will be overwhelmed, it will take too long to complete the system, and the project will likely require skills no one person has.

In the model of making systems (Chapter 16), the team consists of a group of people who do the tasks that create the artifacts that make up the system. The team are informed by the project’s operations—plans, procedures, life cycle—and use tools to do their tasks.

The team is a social entity. The people in the team work together and interact constantly. How well they get along with each other influences how well they get work done.

A team is, however, less than a complete society. The team’s social structure is relevant only to the work they do on the shared project. The team structure does not define how the people on the team organize the rest of their lives: that falls to community and family. This means that the social structures in a team are simpler than those of a complete society.[1]

It is generally understood that the structure of a system is homomorphic to the structure of the organization that is building the system [Conway68]. This means that people must work to ensure that the structure of the team and the structure of the system are compatible, for example by organizing the team around the system structure when possible. Doing so requires having an understanding of what the system structure is, and the hierarchical component breakdown (Chapter 36) provides part of that understanding. In the other direction, the team’s organization will inevitably bias how the system is organized and built; being aware of the two organizations helps one to see unhelpful bias reflected in the system organization.

One can look at the purpose and needs of the team from the point of view of the people in the team and of the customers, organization, and funders who want to see the system built (see Section 16.2 for a discussion of these stakeholders).

Members of the team (Section 16.2.2, Section A.2.2) generally look for satisfaction in their work, enough help to get the work done, and a working environment that gives them a secure sense of how to do their work. Team members generally want team cohesion, in which people have developed the bonds and trust that allow them to work together without friction. That is, they are motivated first by how the project affects them. The needs of the stakeholders are a secondary concern, mainly in how meeting those needs contributes to satisfaction and compensation.

Other stakeholders (Section 16.2.1 through Section 16.2.5) look to the team to build the system efficiently and accurately. They are motivated by the value that having the system completed will bring and by the cost of building it. The needs of the team members are secondary, in the ways that the well-being of the team contributes to the cost or benefit of building the system.

Meeting the stakeholder needs involves:

An effective team balances these two classes of need—those of the people on the team and those of external stakeholders. The needs can be in conflict, as when building a system efficiently and rapidly means that someone on the team has to do a task that they don’t enjoy. More often, both classes of need can be met through the way the team and its culture are organized. A team member’s satisfaction increases when they have confidence that their work is contributing to the project’s success, which comes in part from assigning tasks to the most appropriate people, avoiding duplication and rework, and ensuring that people communicate well. In general, when a team is able to use resources—people, tools, funding—effectively, the team members’ confidence in the project will increase.

19.2 Model of teams

The following is a model for reasoning about teams. I will use this model in a later section (Section 19.3) to discuss how a team’s structure and culture can be understood, and how that can be used to manage a team.

The model begins with people. Teams are fundamentally social structures, made up of a group of people, each of whom has their own skills and experience. These people share the work of building the system and of the needed supporting activities.

The role of the team in a project is to do the work of building the system. The work can be understood in terms of time-limited tasks and ongoing roles. A task is a particular piece of the work, with an intended result and a limited duration. A role is an ongoing assignment of responsibility, which leads to performing tasks within the scope of that responsibility.

Consider a team from the point of view of one team member. That team member has tasks to do, and roles for which they are responsible. They need to know what tasks they should be doing, what roles they are responsible for, and what they are not responsible for (so that they can refer instead to others who do have the appropriate role). As they do their tasks, they need input: how their task fits with other tasks, including ones that other people are doing, and how parts of the system are supposed to work. They will have questions to ask of others. In the course of doing a task, they will make decisions—about concept, about design, about implementation. These decisions will in turn affect others. From time to time they will find problems, both technical and social, and will need to identify who to work with to resolve them.

At the same time, the team member sees themselves as part of the group. They will need to understand the team’s culture and norms. They will want their social needs met, developing trustful relationships with others they work with. The personal relations that someone has with others on the team influence who they choose to work with and who they will avoid, and influence how well they work together when they need to.

How someone works in a team can be expressed in terms of the team’s basic structure. The elements of this structure include:

The objectives of the team and other stakeholders are emergent properties that arise from the low-level interactions among people on the team, following the structure.

Some of these elements deal with separating people from each other, while others deal with uniting them [Durkheim33, Chapter 3, pp. 115-122]. Authority and division of labor are about how each person has their own role, and they are expected to refrain from exceeding those bounds. Communication, groups, and trust, on the other hand, are about how people are joined together to achieve more than they could individually. A team needs both to function well: the ability to work in a group depends in part on each person knowing their role.

19.2.1 Communication

The communication elements of the model describe how people on the team share information about the project.

The work of building the system will be divided up amongst the team members. When one person, for example, designs one component, they will need to communicate with the people designing related components (using the model of relations in Chapter 12). Similarly, a team member who is handling planning and tasking (Section 20.2) will communicate with many other team members to track progress and status.

There are four general times when people will need to communicate:

  1. When they are looking for information that another person may have. For example, when someone finds they need to know how some component is going to behave.
  2. When they have information that will affect someone else’s work. For example, when one person decides on a component design, and that component interacts with another component.
  3. When they need a decision or action. For example, when someone has completed a proposed design and procedures indicate that the design should be reviewed and approved before moving to implementation, or when someone has a team problem that needs to be resolved at a higher level.
  4. When a decision or action has results. For example, when reviews are done, or when action is being taken on a team problem.

Communication can push information from where it is generated or known to people who need that information. Communication can also pull information from someone who has it, by asking them a question.

Communication can happen interactively or asynchronously. Interactive communication happens when two people are communicating directly with each other. Asynchronous communication happens when one person makes information available and another finds that information later. Documentation is a way for one person to communicate with another over long periods of time.

Communication happens when a decision or action is needed, or when a decision or action has produced a result that others need to know about.

Communication patterns can thus be characterized by:

These communication patterns are encoded in team culture, in procedures that people use to do tasks, and in how people are organized into groups.

19.2.2 Groups

Many people like to work together: interacting regularly, sharing work, building social bonds. Working in a group is helpful when people are working on closely-related tasks or have closely-related roles. How much group interaction someone wants depends on the person; some people are gregarious and gravitate toward groups while others reserve their interactions for fewer, more trusted people.

People can come together as a group when doing tasks together, or closely-related tasks requiring lots of interaction. They can do so spontaneously based on the work, or because a group is organized deliberately. People can also come together based on shared interests, experience, or work discipline.

A group is more than just people who communicate a lot. A group generally gives its members some sense of identity and shared purpose.

One person can be part of multiple groups. It is common, for example, for one person to be part of one group that has been deliberately organized to work on a collection of components, while being part of a second deliberately-organized group based on work discipline, as well as being part of ad hoc, informal groups based on social interactions.

Groups can promote trust. When the people in a group behave respectfully toward each other and demonstrate behavior in line with team norms, the high level of interaction within a group provides a way for the group members to establish trust. When trust develops within a group, it can also promote feelings of trust for people outside the group: if person A recommends person B to person C, and C trusts A, then C is more likely to assume that B is trustworthy.

Groups can also promote distrust. If two people within one group don’t get along, they can create a rift among more people. A group also runs the risk of in-group identity turning into out-group dislike, expressing itself as groups working in silos because they lack trust for people in the out-group.

Sometimes people need to form a group with people they don’t get on with. This happens when there is a need for them to work together that overrides their relations with each other.

Groups can be characterized by:

19.2.3 Trust

Trust is a condition describing part of the relations between people in the team.

Trust arises from social norms and respect. By norms, I mean standards of behavior both for interaction between people and for technical work, to which everyone on the team is expected to conform. By respect, I mean each one believing that the others have worth or value, and acting accordingly.[2] Trust is the confidence that others will follow the team’s norms, and act and communicate with respect.

Trust starts with one person learning from experience that they can trust another person. Trust arises from demonstrated behavior. People may enter into a working relationship with a predisposition to trust the other person, but that is different from demonstrated reasons for trust. A team culture that incentivizes people to behave in trustworthy ways can create that predisposition, as when someone learns that a person they trust also trusts a third person. A team, however, cannot meaningfully incentivize trusting someone; the team can only incentivize someone behaving in a way that can earn someone else’s trust.

Because trust comes out of experience working together, not everyone in a large team will know everyone else well enough to have a trusting relationship. In those cases, trust operates at a level of groups rather than individuals: person A believes that the people in group C are trustworthy based on reputation and team cultural norms. This is a weaker form of trust but just as essential for a well-functioning team.

Ideally, trust is reciprocal but it does not have to be.

When person A trusts person B, the two of them can work together more effectively compared to when they do not trust each other. A can share work with B and expect that B will follow the team’s norms about doing accurate work and communicating well. B can expect that A will delegate a task and then respect B enough to avoid micromanaging them. As long as the trust remains, both A and B have less anxiety about the work being done, both are more productive, and both get greater satisfaction than they would otherwise.

Lack of trust leads to the opposite results. If A assumes that B will not behave in ways that accord with team norms, then A will believe that they need to check on B’s work more often. A and B will share less information with each other and will be less willing to share work. Poorer communication will lead to errors in the work, and result in more work and greater anxiety for both parties.

A breakdown of trust can happen between groups as well as between individuals. When a team has a breakdown of trust, they do not communicate. Factions within the team stop coordinating their work, hiding information from each other. I was part of one large multi-company software project with teams at several sites; the teams would try to undermine each other in order to get their version of some software component accepted into the system. After a few years the project ended and the product languished. As another example, specific failures on the Boeing CST-100 Starliner crew capsule have been blamed in part on team mistrust:

Neither team trusted one another, however. When the ground software team would visit their colleagues in Texas, and vice versa, the interactions were limited. The two teams ended up operating mostly in silos, not really sharing their work with one another. The Florida software team came to believe that the Texas team working on flight software had fallen behind but didn’t want to acknowledge it. (A Boeing spokesperson denied there was any such friction.) —Eric Berger in Ars Technica [Berger24].

Trust can be characterized by:

19.2.4 Authority and responsibility

While the previous model elements—communication, groups, and trust—are about people uniting to work together, the next two elements are about how people are different from each other.

In effective teams, each person does the right work. They know what is expected of them, and what is beyond the scope of their authority.

Authority and responsibility deal with how the project’s work is split among the team members; that is, the role that each person has.

I treat authority and responsibility as two parts of the same thing. Authority is the right to make decisions or do work on some topic. Responsibility is the obligation to do that work, and to do it well. The two go together: responsibility without authority is perverse, while authority without responsibility means bad decisions.

A role is associated with some scope of work. The scope defines what subjects the person is responsible for. The scope can be defined many ways as long as its meaning is clear enough that everyone will interpret it the same way. Scope for technical work might be based on system component (“person A is responsible for the design of component X”). It might be based on discipline: “person B is responsible for all security analyses”. It can also be based on a procedure (“person C is responsible for making orders from vendor Y”), or on operational work (“person D maintains the plan for meeting the Z milestone”).

The scope defines the right to make decisions or take actions. If one person has the role of designing component X, they are responsible for ensuring that component X is well designed and they have the authority to work out what that design is.

Conversely, if some topic is outside someone’s identified role, they must refrain from making decisions or taking responsibility for that topic.

A role is different from a task. A role is a long-term, ongoing responsibility to do work associated with a scope. That work may include tasks that fall within that scope, but a task has limited duration and a concrete intended result. The person who has a role is often responsible for doing work that is not part of a specific task. For example, the person responsible for design of some component will handle the task to create the design or tasks to correct errors in the design, but they are also responsible for answering questions about that component from other people on the team.

The goal is that every element of the work has someone who is responsible for it, that every person has something they are responsible for, and that it is clear to everyone who is responsible for what.
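To make this concrete, the sketch below checks a role assignment table for the three conditions just described: work with no owner, people with nothing assigned, and scopes held by more than one person (which must be deliberate rather than accidental). The people and scopes reuse the illustrative examples above; a real project would draw them from its roster and breakdown structure.

# Illustrative work elements and assignments, echoing the examples in the text.
WORK_ELEMENTS = {"design component X", "security analyses", "vendor Y orders", "Z milestone plan"}

ASSIGNMENTS = {
    "person A": {"design component X"},
    "person B": {"security analyses"},
    "person C": {"vendor Y orders", "Z milestone plan"},
    "person D": set(),   # on the roster but with nothing assigned yet
}

def coverage_gaps(work_elements, assignments):
    """Return unowned work, people with no scope, and scopes owned by more than one person."""
    covered = set().union(*assignments.values())
    unowned = work_elements - covered
    idle = {person for person, scope in assignments.items() if not scope}
    shared = {element: [p for p, scope in assignments.items() if element in scope]
              for element in work_elements}
    shared = {element: owners for element, owners in shared.items() if len(owners) > 1}
    return unowned, idle, shared

unowned, idle, shared = coverage_gaps(WORK_ELEMENTS, ASSIGNMENTS)
print("work with no owner:", unowned)
print("people with no assigned scope:", idle)
print("scopes shared by more than one person:", shared)

Whether this lives in a script, a spreadsheet, or simply in a reviewer’s head matters less than the discipline of checking for the gaps and overlaps regularly.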

Communication. When someone has authority to make decisions, at some point they need to communicate those decisions (or their effects) to others who will use that information in making their own decisions and taking their own actions.

Sharing roles. More than one person may take on a role. For example, a role that involves providing support to a customer may have more work than one person can do. When people share a role, they have a responsibility to coordinate their work so they give consistent answers or make consistent decisions. That they share the role should be clear to all the people involved and to people who may need to work with them.

Inadvertent overlap in roles can lead to errors. If two people both believe they have the authority to do a certain bit of work, but they are not aware they are sharing the role, they can make conflicting decisions, tell others conflicting information, or produce conflicting artifacts. Each of these situations can lead to errors in the system being built or in the way the team operates.

Delegation. Authority and responsibility can be delegated. Delegation means that one person confers some part of their role to someone else, possibly for a defined period or with restrictions on the kind of authority granted. The delegation might transfer the role from one person to another, so that the first person no longer fills the role (perhaps temporarily). Alternately, the role may be shared with the other person, in which case both people are responsible for the work and for coordinating with each other. A delegated role might also be rescinded.

One way to use delegation is for one person to have the overall role for some system component, and for that person to delegate responsibility for specification to someone skilled in specification, delegate design to a designer, and so on. The person with overall responsibility for the component typically reserves the authority to review and approve work that has been delegated to others.

Sidebar: Delegation and micromanagement

Projects involving many people require sharing work. If someone doesn’t share work, then they will be overwhelmed, will take too long to get work done, and will be a single point of failure in the project.

Delegating or sharing work implies a dynamic between the two people involved. Person A, who delegates the work, defines the work that Person B, the delegatee, is to do. Person B does the work and periodically gives progress updates. Once the work is delegated, Person B can proceed independently and Person A can turn their attention to other things.

One way this can go wrong is if Person A doesn’t let Person B get on with the work independently, and instead tries to micromanage the work. Learning the habit of managing loosely takes time and effort, and it requires trust between the two people involved. That trust in turn depends on Person A having confidence that Person B will follow shared norms in doing the work.

Another way this can go wrong is if Person B isn’t able to complete the work independently. If Person B finds a problem with the work, such as a design error, that is beyond their scope, they can raise the issue to Person A and jointly resolve the problem. If Person B is unable to do the work, perhaps because they don’t understand the problem or find they lack a necessary skill, they can raise the issue and jointly handle the problem. If Person B tries to muddle through, however, they stand a good chance of not doing the work needed, leading to Person A needing to check their work in detail and possibly redo the work.

In other words, sharing work requires having clear expectations of how to define delegated work and when to raise exceptions.

Resilience. A well-functioning team is able to handle problems when they arise. A team’s resilience depends in part on how authority is structured within the team.

There are several kinds of problems a team will encounter:

There are common patterns in how many of these problems can be planned for. Providing redundancy in how authority is organized is at the core: planning in advance for someone to take over important roles when needed, building in checks of work, and assigning roles that create alternative communication paths to resolve problems. All of these in turn depend on communication so that someone can take over a role or check work.

Formally, these kinds of structures add nuance to the definitions of scope for the roles that need to be resilient. For example, three kinds of roles are defined to catch and resolve technical mistakes: the role to do the work, the role to check it, and the role to ensure that the check is done. These imply a limitation on the authority of the first role to make any arbitrary decision about the work, because the work must be checked by someone else. It adds a responsibility to ensure that the work is reviewable (for example, adequately documented) and that the relevant artifacts are communicated to reviewers. Similarly, having someone who can take over a role implies that someone who is a backup for the work is responsible for keeping current and stepping in when needed—and also refraining from acting on the role when the regular person is doing their job.

19.2.5 Division of labor

Division of labor is the principle that people do different kinds of work, meaning they have different authority and responsibility. This is desirable because different people have different skills and experience, and because work should not be duplicated unnecessarily.

Division of labor in systems-building is different from the classical usage of the term. The original usage was about a serial production system or assembly line, where one person does one step, hands the result to someone else who does a second step, and so on until the product is complete. (Smith, for example, uses the example of making pins [Smith22, Book I, Part 1].) The argument is that a worker’s specialization improves their productivity, and that avoiding the cost of switching from one task to another eliminates wasted time.

Division of labor is directly related to roles as discussed in the previous section. The roles define the units of labor to be divided among the team members.

Systems work divides labor in more ways than just serial production. Work can be divided by component, with a hierarchical structure from system to lowest-level component. It can be divided into supporting roles, such as planning and team management, versus system-building roles. Not all roles need full-time attention, leading to one person taking on multiple roles. Some roles are associated with specific procedures, such as coordinating purchasing.

Someone in the team has the role of deciding how roles are assigned. This might be one person for a small team, or the role might be divided up and distributed to multiple people. These people should follow well-understood norms and procedures for making the decisions about who is assigned what role, including communicating those decisions to everyone affected. The way roles are assigned should take advantage of the ways people differ: in their likes, their skills, their experience, and their desired growth.

The way work or roles are divided affects how people grow their skills and experience. If people are assigned work based only on their current skills, they will not grow. Giving people tasks that stretch them can lead to improved skills, but can also lead to them doing the work badly and learning bad habits. Learning works best when someone being stretched can get mentorship from someone with relevant skills or experience.

19.3 Using the model of teams

The high-level objectives such as efficient and accurate system-building or team cohesion are properties that emerge from the details of how people on the team interact. The structure and norms of the team can be designed and managed to promote these objectives, and the model above provides a way to think about the structure.

Note that these properties emerge from how people actually behave, not from how the team is designed or how it is supposed to work. That is, the outcomes depend on the mental models that each team member has of how they work in the team, and the habits that come from their mental models.

Achieving desirable outcomes therefore means getting two things right: designing and maintaining a good intended structure for the team, and the team taking that structure on board and behaving accordingly.

19.3.1 Team culture and structure as a control system

Leveson et al. [Leveson11] discuss how to design systems so that they produce desired emergent behaviors while avoiding undesired behaviors. Their approach treats the problem as a control system, where a control process monitors and shapes the behavior of a lower-level process in ways that lead to the desired high-level results.


The control system in this approach consists of a controller, which monitors the state of the team (the controlled process) and makes decisions about actions the team should take. One or more people in the team take on the role of being the controller. The controller has a process model, which includes the controller’s beliefs about what the team should achieve, how the team is structured, and how all the people in the team are doing. The controller gets feedback from the team in the form of observed behavior and of things team members tell them. Once the controller determines that it is time to act on some issue, the controller can take steps (control actions) to change the team’s behavior.
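To make the structure concrete, the following sketch expresses the controller, process model, feedback, and control actions in a few lines of Python. It is only an illustration of the loop described above; the status items and actions are invented placeholders, and a real team’s controller is of course a person, not a program.

from dataclasses import dataclass, field

@dataclass
class ProcessModel:
    """The controller's beliefs about the team: objectives, structure, and current status."""
    status: dict = field(default_factory=dict)   # e.g. {"design of X": "behind"}

@dataclass
class Controller:
    """Monitors the controlled process (the team) and decides on control actions."""
    model: ProcessModel = field(default_factory=ProcessModel)

    def receive_feedback(self, report):
        # Feedback (status reports, observed behavior) updates the process model.
        self.model.status.update(report)

    def decide(self):
        # Control decision: choose an action for any part of the team that is off track.
        return [f"follow up on {task}" for task, state in self.model.status.items()
                if state == "behind"]

controller = Controller()
controller.receive_feedback({"design of X": "on track", "test rig build": "behind"})
for action in controller.decide():
    print("control action:", action)   # prints: control action: follow up on test rig build

The point of the sketch is the shape of the loop: feedback updates the process model, and control actions follow from what that model says about the team.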

The social norms and habits of respect come from the example set by the team’s leaders: those team members who have greater scopes of authority, or who are recognized as experienced in their discipline. In practice, one team member who is working on the details of some component has little influence to create the team’s norms, but can cause disorder and disrespect that spreads. The establishment and following of positive norms is a collective action problem that requires some degree of compulsion [Olson65]. Preferably, the compulsion is in the form of rewards for following good examples of adhering to norms, but sanction is needed to back up the rewards.

The model in this chapter can serve to organize the design of this control system.

Who is responsible? The responsibility for making the team work is spread over everyone in the team.

Looking at the team as a control system, there are two reciprocal classes of roles: the roles that fill controller functions and those for everyone on the team (the controlled process).

Everyone on the team has the role of being a team member. This role has the responsibility of following team norms and procedures. In terms of a control system, each person is responsible for accepting and following instructions from the controller, and for providing feedback about work and about how the team is functioning. In particular, when anyone on the team detects that there is a problem in how the team is functioning, they are responsible for communicating about the issue with someone whose role includes resolving the issue.

The controller part of the control system can be broken down into three classes of roles. These are:

  1. The observer role: a person who receives feedback in the control system, meaning they observe how team members are doing their work and are responsible for deciding when there may be an issue to resolve.
  2. The decider role: a person who is responsible for deciding how to respond to an issue; that is, for deciding on a control action that should address whatever situation has occurred.
  3. The exceptional role: a person who is responsible for detecting problems with the normal control system roles or for receiving reports about them. This role comes into play when the normal observer and decider roles are not handling a problem. When someone reports a problem with the control system, it is sometimes called a skip-level or whistleblower report.

These roles can be used to support many different team structures. For example, a traditional hierarchical department/team structure can be represented by each department’s or team’s manager filling the observer and decider role for their department or team. The manager over a manager can fill the exceptional role to address problems with a manager’s work. Separately, many organizations create an explicit whistleblower function to address potential corruption or illegal behavior; the people in this function then fill part of the exceptional role.

Process model. This model is how the controller understands both the objectives for the team and the state of the team.

The process model includes the team’s objectives and how well the team is meeting them; its structure, and how people are working with that structure (or not); the roles each person on the team has and how they are progressing on the work associated with those roles; and generally how well each person is doing.

The observer and decider roles use this information to determine when part of the team is not working as it should, and to decide what steps to take to make things work better.

Unlike the control system for a machine, the process model for a team accounts for the well-being of the people on the team.

The process model also needs to consider what people on the team actually understand of the team’s culture and procedures, and their roles. In managing a team of skilled and well-meaning people, I have found that miscommunication or misunderstanding is the most likely source of problems.

Control decisions. Those people who have the decider and exceptional roles are responsible for deciding when there is an issue to be addressed, and what actions to take. Sometimes when there is some indication of an issue, the choice will be to wait and gather more information.

Some issues may appear in one part of the team but, on investigation, will be found to have causes in other parts of the team. If the decider or exceptional role is shared among multiple people, decisions will require deciders working together.

Problems in team execution can arise because the team has outgrown their current structure, not because any one person is behaving wrongly. I discuss this further below.

Control actions. The control actions are how people influence the team to keep its work on track.

The example set by a team’s leaders is perhaps the most important influence. If someone the team looks up to does some activity one way, others are likely to follow: if someone is seen to be careful in following a procedure to get design reviews, for example, others will be motivated to do so as well.

This raises the question of who is considered a leader. Leadership is a social construct; it is not necessarily an explicit role that someone in the team is given. People who are given roles with extensive scope of authority and responsibility are often seen as leaders. Others who are understood to have experience, who mentor other team members, and who establish social connections are also treated as leaders. Having this level of social influence in the team comes with a responsibility to model desired behavior, and should be considered when taking action to fix a team performance issue.

Instructions from one person to another based on scope of authority are a second kind of control action. If someone has a role of managing a subteam, the manager (who has a decider role) can instruct someone on the team to change their behavior. The instruction need not be hierarchical; for example, when two people are peers working on designing related components and are expected to come to agreement on how those components will interact, one of them can inform the other that they will not agree to some part of an interface design.

As I noted in the sidebar on delegation and micromanagement above, there are choices to be made about how instruction should be given. It can be directive, telling the recipient exactly what they should (or should not) do. This is appropriate when related to following a procedure that requires precision, such as operating test equipment that has the potential to cause injury. In other situations this can turn into micromanagement and inhibit the recipient’s ability to improve and work independently. On the other hand, the instruction can take the form of letting someone know that there is a problem and letting them work out how to address the problem. This approach helps the recipient learn and grow, especially if they can discuss potential solutions to get feedback. However, if the recipient is not able to figure out how to make a change, this approach will leave a problem unresolved.

Next, a decider can address an issue through education. Mentoring someone about part of their work can improve their work in the long term as well as addressing an immediate issue.

Finally, sometimes the appropriate control action to address an issue is to change the team’s structure. This can happen as the team grows and authority and communication structure reaches a scalability limit. It can also happen as a project transitions from one phase of work to another—for example, when moving from the initial design and implementation into preparing for placing the system in operation.

Team member behavior. The people on the team make up the controlled process in this approach. A controlled process receives communication from the controller that is intended to change the process’s behavior. The process then generates feedback to the controller as it goes about its behaviors.

The team is made up of people, not machines, so they hear and react in a human way to communication from people in the controller role. When the team’s structures are designed, the ways that people communicate should be worked out so that those who are giving instructions know how to communicate with people, and so that people on the team know how to tell when instructions are being delivered.

People react when they receive instructions or assignments, of whatever kind. In a well-functioning team, the team members will act to confirm the instruction they think they are receiving, then adjust their work behavior accordingly—changing the roles they are working on, adjusting some technical work, and so on.

However, real humans don’t always neatly follow this behavior. Sometimes they misunderstand the instructions. Sometimes they ignore the instructions. Sometimes they develop resentment at the instructions. The communication between people with a controller role and the people on the team must include checks to catch misunderstandings, and continuous communication so that leaders understand how people in the team are feeling about their work.

The controller needs feedback from people on the team in order to continue to make accurate control decisions. In a team, this means that people on the team are providing information. The people have a responsibility to keep those overseeing their work informed of their progress. They are also responsible for communicating when they are dissatisfied with their work situation, or when they observe issues with the project.

Feedback. The people forming the control system have several ways to get feedback and observe the team. Some of these mechanisms can be designed into the system formally; others are informal behavior by the people with control roles.

Getting explicit feedback and reporting from team members is the first formal mechanism. As people on the team make progress on different tasks, they inform the controllers of the work completed, the problems found, and the steps yet to do. These reports can take many forms: updates to a task tracking system, regular status communications, and informal discussions. Explicit reporting has the advantage that it can occur regularly and in a form that encourages documentation of status. It has the disadvantage that it can become impersonal, and the reports can become inaccurate (especially optimistic) over time because team members want to look good to their teammates.

Someone in a controller role can complement these explicit reports with regular informal communication. Some organizations have advocated “management by walking around”, in which a manager informally talks with those who they oversee, without a regular schedule. This interaction ideally happens in person, so that the manager and the team member can treat each other as people and build up social bonds. In-person communication also has the advantage of using the full range of communication methods, such as body language. These informal communications have the disadvantage of not producing a documented record of what was learned, and if done clumsily they can lead to a team member feeling like they are being constantly monitored.

A well-designed team will account for problems in communication between a team member and those who oversee their work directly. A team can build in periodic “skip level” communication, where a team member can discuss their work and their state of mind with someone other than their direct managers, in order to detect and resolve problems with a manager. A well-functioning team will also provide feedback channels for team members to report larger or more systemic problems. In many industries, organizations are required to provide a way for anyone on the team to report corrupt or illegal behavior, for example.

Whatever feedback mechanisms a team uses, the channels should be designed to address bias and sampling problems. For example, if someone only reports on progress when they complete a task, there is no way to detect when they are having problems completing some task; in other words, reporting at completion biases information toward good news. Having only one path for information to come from a team member through a manager to higher-level controllers can introduce bias as the manager digests the information they receive and passes along a summary. This is one reason to combine multiple ways to get feedback.

Working with the control system. The team’s structure has to be designed and redesigned. It is expected to result in the system being built efficiently and accurately, and in the team maintaining its satisfaction and productivity throughout.

Achieving this end only happens when the team is organized deliberately. While historically organizational choices have been made initially based on experience and then by incremental changes, it is possible to do better by explicitly designing and analyzing the team’s structure as a system.

The control system approach allows one to use techniques for designing and analyzing systems that have important emergent properties. In particular, the STAMP model of accident causes [Leveson11, §4.5] and the STPA hazard analysis technique [Leveson11, Chapter 8] provide a sound basis for analyzing how the team is organized. They provide a disciplined methodology for determining what hazards the team could face, such as duplicating work, failure to communicate, disagreements between people, or errors in the work. They also provide a structure for reasoning about how these hazards can occur and how to design the control system to eliminate the underlying causes or to handle them when they occur.

As one example, the STPA hazard analysis methodology calls for identifying and addressing cases where multiple controllers can generate control actions for one controlled process. In a team, this happens when two people have some kind of controller role for one team member. These controllers can give conflicting instructions, or they can give instructions that have unexpected side effects. The hazard analysis methodology includes identifying cases when this can occur and then defining how the multiple controllers will coordinate their decisions to avoid conflicts [Leveson11, §4.5.3].

Team structure and planning. The structures and procedures that a team follows are related to but separate from how the team plans its work. The interactions within the team’s control system are continuous and immediate. They serve to maintain the social bonds that keep a team together. By building social cohesion and team culture, a team’s structure makes the team able to plan and carry out its work.

19.3.2 When to use the team model

The purpose of having a model of teams is to provide a language for describing how a real team is organized, and to provide tools for working out how a team’s organization might need to change.

There are four times to use the team model:

  1. In ongoing team operation;
  2. When forming a new team;
  3. When a team’s structure needs maintenance; and
  4. As a team outgrows its structure.

Ongoing team operation. A team’s structure and culture should determine how the team works and interacts. The purpose of treating the team as a control system is to ensure that it continues to work well, and to provide a basis for adjusting the team’s structure or its members to meet that end.

In ordinary operation, the assignment of roles and tasks has a great effect on how the team is functioning. A good assignment of roles will have at least one appropriate person covering each needed role, and will spread the workload as evenly as possible across the team. In control system terms, to whom a task or role should be assigned is the control decision. It is based on the decider’s understanding of each person’s ability, current workload, and interests. Communicating the assignments makes up the control actions, and then team members do the work.

Once assignments have been made, those with control roles monitor progress. One person may become busier than expected, and workloads may need to be adjusted in response. Someone may have unexpected trouble doing some task, which needs to be detected so that they can be given help.

The team’s culture and social cohesion also need to be monitored and managed. In control system terms, the controller sets the expected norms and procedures and communicates those to the team. The control actions that communicate this information take many forms: documented procedures, documents of team charters and cultural norms, and the examples set by leaders. The team, as the controlled process, will observe all this input and respond in their behavior. The controller is then responsible for watching how the team members work together, learning how they feel about each other, identifying when some people aren’t meeting the expected cultural norms or getting along, and making adjustments accordingly.

Team member evaluations are a part of many organizations’ procedures. These provide an opportunity to give people feedback on how they are doing at performing tasks and how they are fitting into the team’s culture. Having clearly-defined cultural norms and work assignments enables people to give feedback that measures a team member’s work against criteria that everyone should understand in the same way. (And if the feedback process reveals that some people do not understand the criteria in the way they were intended, then this is feedback to the team that the documentation needs to be improved.)

When people detect that the team is not operating as planned, they initiate corrective action. The kind of action depends on the kind of problem. If one person is not working as expected, the actions can be focused on that person: giving them suggestions or education, changing their work assignments, or in the worst case moving them out of the team. If the problem is between multiple people, then the next step is to determine why they are not working well together in order to address the working relationship between them. Sometimes, however, investigation will reveal that the team’s structure or culture needs to be improved. I discuss that below.

Forming a new team. A new team is an opportunity to design the team’s structure and culture. While this is often left to chance, with the first team members jumping into technical work, spending effort early in the process to plan how the team will work pays off in a project that functions well as it proceeds and starts to face organizational challenges.

The way that a team starts out affects how it continues to work years later. The habits that a team forms in its early days continue to influence how people work, and changing these habits is difficult and slow (Section 8.1.5—Principle: Team habits). This means that some effort early on will pay off for a long time.

The model in this chapter provides a way to organize the thinking about how a team should work. How should authority and work be divided among people? Should the team have a hierarchical department structure, or a matrix organization? What cultural norms are expected for people’s behavior toward each other? What procedures should the team follow to do different parts of the work?

The structure of a team is not just a theoretical construct. To work, it must fit the abilities and experience of the people involved. A structure that requires perfection will not work, because nobody can work perfectly. A structure that people can’t understand, or that is too complicated, won’t be followed; or worse, people will try to follow it but do so in some odd way.

The team’s design provides an opportunity to think about how to make the team resilient. What functions should be shared? How can people working on one part of the system help each other? How do those people who are responsible for maintaining team operation keep aware of how the team is doing? What should happen when there is a serious problem within the team?

The decisions about how the team will work should be documented for everyone to read and follow. Having these documented—and brief—helps get everyone into agreement. Going through a process of building draft documents and getting team feedback helps build consensus early on. It also makes the task of adding new people to the team easier (they can read the documents) and smoother (each new person gets the same information others do).

Documenting the rationale for why the team is designed the way it is, along with analyses of how the team’s structure will meet its objectives, provides a basis for maintaining the team’s structure as it grows or as the work changes. It also helps people understand the spirit of the structure, helping them interpret the intent behind the documented structure.

A team typically grows and goes through distinct phases where it needs different kinds of structure; I discuss this below. The initial team structure will likely be simpler than what the team will need a few months later. However, the initial design for the team’s structure should include some thinking about how the team will work as it grows. Because the social organization that is a team has inertia in its habits, starting out the team’s structure in a way that can grow into what it needs to be later will help avoid reorganizations that upset team operations and affect productivity.

Maintaining team function. Sometimes the planned team structure doesn’t work the way it was expected to.

When this happens, the response should be to work out why the structure isn’t working, and then determine how to change the structure to work better.

The tools for systems accident analysis are available to analyze why a team is not functioning. The STAMP methodology provides an organized way to determine possible reasons that a team is not functioning as desired [Leveson11, Figure 4.8]: people with the decider or exceptional role are not providing the needed instructions for some reason; those people do not have an accurate understanding of the state of the team; they are not getting accurate feedback from the team; the team is not getting instructions or not acting on them as expected; or control actions conflict with each other. An analysis following this kind of methodology can reveal where the underlying problems are, and in turn suggest ways to change the team structure or culture.

For example, consider a team where people are not following the defined procedures for getting component designs reviewed and approved before moving on to implementation. An analysis might find that some people are not aware of the procedure (a problem with the control action/controlled process), suggesting that documentation and education about the procedure need to be improved. On the other hand, the problem might come from one group being under pressure to deliver quickly, so that those who are supposed to do reviews or give approval are not able to respond at the needed pace. This would suggest that streamlining reviews or adding reviewing resources would address the problem, or that the group needs schedule relief.

As a different example, consider a team organized into groups based on the component hierarchy. This team is having trouble with component integration: components that interact have passed design and implementation reviews, but when they are combined for verification they do not function as expected. This situation could arise from many sources. The organization into groups might be inhibiting communication between groups, leading to interfaces being designed that do not meet the needs of components being built by different groups. This might, in turn, come from flawed procedures that do not account for cross-group reviews; or it might come from group managers who don’t get along; or it might come from a problem with how interface design artifacts are managed. An issue like this can also reveal related problems, such as project management not detecting the problems quickly or accurately until there is a crisis. I have found that in situations like this there are often several small changes that need to be made together to address the problem.

Team growth. A team’s organization generally starts small and informal, as a very small group starting to investigate a customer’s need or a potential system project. As the project moves forward, the team grows and its needs for structure change. The team also changes as people join and leave, and as people move from role to role.

I have found that most teams go through phases as they grow—rather than showing smooth changes over time. These changes arise from the combination of complexity growth, development of group relationships, and the growth in understanding of the work ahead.

Small groups (of just a few people) have been observed to go through a development sequence [Tuckman65][Tuckman77]. These small groups begin as the group forms, and the people work out how they should relate to each other and how to get work done. As time goes by they develop into a cohesive group that gets work done and where people trust each other. (The studies do not discuss how this process can fail, leading to a group that does not cohere or disbands.)

The interpersonal complexity of a team grows with the size of the team. The number of potential connections between team members is O(n²) in the size of the team. In my anecdotal experience, the amount of time spent on coordinating work within the team grows in line with the number of connections. If there is no structure to the team, at some point the amount of time and effort spent on communication will exceed the amount spent doing work building the system.
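
As a quick illustration of that growth (with n standing for the number of people on the team; the specific team sizes below are my own example):

    \text{potential connections} = \binom{n}{2} = \frac{n(n-1)}{2}

A team of 6 has 15 potential connections; a team of 12 has 66. Doubling the team roughly quadruples the number of coordination paths that may need attention.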

When a project starts, the nature of the system to be built is not well understood. The team has to go through a process of working out the purpose of the system, developing concepts, and eventually beginning to design. Along the way, the team gets increasing understanding of the work ahead.

In practice, the combination of these causes leads the team to change its organization over time. At the beginning, the initial exploration of what the project might be (working out a purpose and finding some initial concepts) is typically done by a small group. This small group will go through a process of learning to work together, but typically it can self-organize and does not need much hierarchy. As the work progresses and a few more people join the project, they will initially try to fit into the self-organized small group. These additions will alter the interpersonal relationships, and at some point the complexity of reaching consensus will necessitate creating some initial structure. The team will settle into this structure. As the team continues to grow, it will accommodate new people into that structure for a while, but eventually it will reach another point where more structure is needed to manage complexity.

The message is that a project should expect its team organization to change over time. Almost every project I have been part of has been resistant to addressing a need for changing team structure, and has put off dealing with it until a crisis occurs. In every case this cost the organization time and money, needlessly setting back the project. A project’s leadership should be alert to the need to periodically reorganize the team so that this can be done before it causes problems.

19.3.3 Example: conflicting instructions leading to inconsistent design

I worked on two projects that had problems building their systems because someone on the team got conflicting instructions on the objectives for some component they were supposed to be building.

In one case, a software developer was tasked with implementing a particular CPU scheduling algorithm in a real-time operating system kernel. This scheduling algorithm had been chosen in order to make certain system safety properties work, and to enable some high-level control features. The developer in question did not understand the assignment, and reached out to someone else—someone not authorized to make decisions about the CPU scheduling algorithm. The developer got advice from the other source and implemented a different scheduling algorithm. The other algorithm could not provide basic safety and control features the system needed. As this project was being executed on a cost-plus contract, the developer’s organization had to pay for someone to remove the work the developer had done and implement the correct algorithm.

In another case, one senior system architect (systems engineer) was responsible for a particular feature set of the system. The system architect was working with a pair of developers to work out a design for those features. A second senior system architect, who was not responsible for that part of the system, was having a conversation with the developers and instructed them to design the features in a particular way. This conflict in instructions to the developers led to confusion that took several days to detect and resolve.

Both these problems reflect two common team design flaws. First, both are instances of conflicting control ([Leveson11, §4.5.3]), in which a controlled process (the developer) receives conflicting control actions. Second, in both cases design authority (Section 19.2.4) had been assigned, but developers got instructions from someone else. In the first case the developer sought out advice from an inappropriate source; in the second, a senior person gave instructions outside of their authority.

The techniques for addressing a potential system hazard apply to the conflicting authority: first try to eliminate the conditions that can lead to a hazard, then make it unlikely to happen, reduce the likelihood of it causing a problem, and then try to limit the damage when it does happen.

The first line of defense is thus to organize the project so that conflicting decisions and authority do not occur, or are at least unlikely. This is most easily done by having exactly one person authorized to make decisions for each part of the system, and by making that information clearly available to everyone on the team. Note that this does not mean that only one person is allowed to design; rather, it means that one person has responsibility for the design. The responsible person can and should delegate the design effort as much as possible to the people actually doing the work, and should focus on setting objectives for the design, guiding the design, and checking that the results are acceptable.

Theoretically, a team can avoid conflicting decisions or directions by having a few people operate in a way where they reach consensus before making decisions. In practice, consensus algorithms work well enough for computer systems, but people find it hard to work that way: communication happens informally, people are in a hurry, or someone gets enthusiastic about a good idea and doesn’t wait to share it with others for agreement first.

The second line of defense is to have regular review points in the project when discrepancies can be caught.

19.4 Directory

Two of the first things people on the team need to know are their own roles and who else is on the team. Once they have that information, they can communicate with others to learn other things they need to know.

Consider the following scenarios.

  1. Person A is working on some component. That component has an interface with another component, and so person A needs to coordinate how they implement their part of that interface with someone working on the other component.
  2. Person B has finished a design for an update to a component. Project procedures say that they need to have the design reviewed and approved before moving on to implementing the design. Person B needs to find out who the reviewers and approver will be.
  3. Person C discovers an ambiguity in the specification for a component, and they are concerned that this ambiguity may lead to a flaw in the designs that follow from the specification. Person C needs to find the people responsible for the specification so they can discuss the potential problem and find a resolution to the ambiguity.

For all these scenarios, the people need to determine who on the team is responsible for some part of the system beyond what they are working on themselves.

To meet this need, the project should maintain some kind of directory of people on the team. This should record:

  1. Who is on the team and how to contact them;
  2. The roles each person holds; and
  3. The parts of the system and the artifacts for which each person is responsible.

This information is generally fairly simple, but it must be kept current. If people come to believe that the directory is likely out of date they will not trust it.
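
As one way to picture how little information is involved, the sketch below records a directory entry as a small data structure. It is a minimal illustration in Python; the field names and example values are my own assumptions, not a prescribed format.

    # Sketch of a team directory entry; field names and values are illustrative.
    from dataclasses import dataclass, field

    @dataclass
    class DirectoryEntry:
        name: str
        contact: str                   # e.g., email or chat handle
        roles: list[str] = field(default_factory=list)
        responsibilities: list[str] = field(default_factory=list)  # components, specs, artifacts owned

    entry = DirectoryEntry(
        name="Person A",
        contact="person.a@example.org",
        roles=["developer"],
        responsibilities=["telemetry component", "telemetry interface specification"],
    )

Whatever form the directory takes, each entry should answer the questions in the scenarios above: who a person is, what roles they hold, what parts of the system they are responsible for, and how to reach them.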

undisplayed image
Sidebar: Summary

Chapter 20: Operations

11 February 2024

Operations covers how people on the team organize the work of building the system.

I introduced the basic ideas of operations in Section 7.3.5. In that section, I model operations as five parts: life cycle, procedures, plan, tasking, and support. In this chapter I go into more detail about each of these parts. The material in this chapter is based in part on the needs analysis reported in Appendix A.

This chapter details the model for operations in general, without recommending specific solutions.

Note that this chapter is focused on the operations directly involved in building the system. An organization has other things that fall under business operations; I defer to others to address that broader topic.

20.1 Purpose

Operations is about organizing work, in the form of tasks. It is complementary to team and artifacts, which I discussed in previous chapters. Operations ensures that people know what tasks they should be doing, similar to knowing what they should be producing (artifacts) and who they do it with (team).

I leave “task” largely undefined, relying on its colloquial meaning. It should be taken to mean some unit of work to be completed.

Operations organizes the work so that:

  1. The right tasks are done at the right time by the right person. Each task is performed by the person with the right skills to do it, and who has an appropriate role in the team. This is accomplished with tasking based on the plan.
  2. Everyone does their work in compatible ways. The team members share a single common model for how to do their work. The rules are clearly documented and understandable by the whole team. People understand what steps will be coming up after they perform a specific task. These conditions lead to people having confidence in what others are doing, and they allow detection and correction of problems. This is accomplished with the life cycle and procedures.
  3. The work is done efficiently. The team avoids work that is not actually relevant to the system being built. They minimize work that is a dead end, re-work due to quality problems, waiting time because of dependencies, and overhead for operations. Operations allows detection and correction of problems, including accountability and feedback on schedule. This is accomplished with procedures and plan, especially in how they account for uncertainty and risk.
  4. The work is of high quality. The team thinks through needs before moving forward, allowing for controlled exploration or prototyping. Work is checked independently to catch flaws, and steps do not fall through the cracks. Work is coordinated so system parts fit together, and flaws are detected and fixed. This is accomplished with procedures and tasking.
  5. The work meets deadlines and budgets. The project can project forward the time and resources required to reach milestones, allowing for uncertainty. The work is actually possible for the team to complete. Project management can detect and resolve potential problems by adapting the plan, changing system objectives, or getting more resources. Progress is visible to project management, funders, and the organization. This is accomplished with plan and tasking, especially flexibility in planning.
  6. The work adapts as needs change. The project gracefully handles requests for changes in purpose or regulation, and it handles learning more about the system as work progresses. Operations supports people making decisions that change the plan. This is accomplished by building checkpoints into the life cycle, and by using procedures to deal with change.
  7. The project supports its customer and funder. The project’s execution fits with acquisition and funding processes. This is encoded in life cycle and possibly procedures.

Each project will work out its own approach to operations. The list above provides objectives against which an approach can be measured.

20.2 Operation model

The model of operations in Section 7.3.5 has five parts:

undisplayed image

The life cycle is the overall pattern of how the project works, with phases and milestones.

Procedures are the checklists or recipes for performing key tasks.

The plan is an evolving understanding of the path forward for the project.

Tasking is the assignment of tasks to people, and figuring out what tasks each person should do next.

Support maintains tools and information needed to do the other parts.

I have ordered the parts of this model by the rate at which they change and at which decisions about them are made. The life cycle is established early in the project and changes slowly after that. Procedures change a bit more frequently, but not often. The plan is updated on a regular cadence, while tasking is continuous. Support activities go on throughout the project.

20.3 Life cycle

A project’s life cycle[1] is the set of patterns that define the big picture of how the work unfolds. It encodes ideas like the system going through phases: development, deployment, update, retirement. It can define phases within those. Within development, for example, there can be concept development, specification, initial design, detailed design and construction, integration, and acceptance.

The idea of a life cycle can apply to the whole system project, or to specific parts of the work. Each component in the system, for example, can have a life cycle for its development or for an update to the component. The life cycle can apply recursively to subcomponents.

undisplayed image

In general, life cycle patterns define when one step of work should be done before another, when steps can proceed in parallel, and the conditions that define when a step is ready to start or when it is complete.

In this way, a life cycle can be viewed as an abstraction of the steps a project or a part of a project will go through. It is expressed as a set of patterns that guide how people do the work: the order in which steps are planned and performed. The actual sequence of events in the project may not match the pattern exactly, but the patterns give people a way to talk about what should happen and compare actual work to the ideal. The life cycle is not a schedule. It is a set of patterns that guide the team as they work out a plan and schedule tasks that achieve that plan.

I use the term life cycle patterns in this text to emphasize that these are ideal sequences of events, simplified to help people organize the work they do. Some of the patterns can be used repeatedly during a project, such as using the life cycle pattern for building each system component.

The life cycle is not specific to a particular system project. A life cycle pattern can be more or less well suited to a project depending on attributes of the system being built—most especially how often there are irrevocable or expensive-to-reverse decisions. An organization can build up a library of patterns, improving them with experience and sharing the learning among many projects.

A project’s life cycle patterns help team members understand how the work they are doing fits with other work. They provide guidance on what to expect from the work others are doing that will lead into work they will do. They help people work out who is doing work related to their own, and who to talk to about that work. They also help people understand what steps will be coming up after they perform one step.

The life cycle approach is not in itself a methodology; nor does it imply a particular diagraming model or formal semantics. It is a technique for working out how the project will order its work that has evolved from a combination of examining several different life cycle standards, observing how teams use Gantt charts for scheduling, and the common practice of sketching things out on a whiteboard. I use an informal diagramming notation that is inspired by the diagrams used in NASA documents.

I introduce the general idea of life cycle here, without advocating for any particular patterns. I discuss specific examples and guidelines for building a life cycle pattern in Chapter 21 and Chapter 30.

Model of life cycle patterns. A pattern generally consists of:

  1. A set of phases, each covering one kind of work;
  2. Conditions or milestones that mark when each phase is ready to start and when it is complete; and
  3. Dependencies that order the phases relative to each other.

In other words, the life cycle can be viewed as a directed graph of phases, with annotations on each phase. (Because the dependencies are time-like, the graph is also acyclic.)

undisplayed image

For example, a simple life cycle pattern might say that a project must start with a phase where it works out and documents the customer’s purpose for the system before proceeding on to other work. That purpose-determining phase would conclude with a milestone where the customer reviews and agrees on the team’s purpose documentation. The next phase would involve developing a general concept for the system. This phase would include review milestones, checking that the concept will meet the customer’s purpose and that it can likely meet the organization’s business objectives. After those reviews, there might be a milestone where the organization makes a go-no go decision about whether to proceed with the project.
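
To make the “directed graph of phases” view concrete, the example above can be written down as data. The sketch below is in Python; the phase names, milestone descriptions, and the helper function are my own illustration of the idea, not a prescribed notation.

    # A life cycle pattern sketched as a directed acyclic graph of phases.
    # Phase names and completion milestones are illustrative only.
    lifecycle_pattern = {
        "determine purpose": {
            "depends_on": [],
            "completion": "customer reviews and agrees to the purpose documentation",
        },
        "develop concept": {
            "depends_on": ["determine purpose"],
            "completion": "concept reviews passed; go/no-go decision made",
        },
    }

    def ready_phases(pattern, completed):
        """Phases whose dependencies are all complete and that are not themselves complete."""
        return [
            phase for phase, info in pattern.items()
            if phase not in completed
            and all(dep in completed for dep in info["depends_on"])
        ]

    print(ready_phases(lifecycle_pattern, set()))                  # ['determine purpose']
    print(ready_phases(lifecycle_pattern, {"determine purpose"}))  # ['develop concept']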

A life cycle pattern can be coarse-grained or fine-grained. A coarse-grained pattern would have phases that apply to the whole project at once, and take weeks or months to complete. The NASA family of life cycles [NPR7120] is coarse-grained: it is organized around major mission events like approval to move from concept to design, or from fabrication to launch. Fine-grained patterns might apply to a single component at a time, such as a component being first specified, then designed, then implemented, then verified, as a sequence of four phases with review checkpoints at the transition between phases.

Some life cycle patterns have phases that are used many times in parallel in a project. Consider a fine-grained life cycle pattern that applies to each component. The general pattern might be:

undisplayed image

A project might apply this pattern to each component being built. When multiple components are being developed in parallel, multiple instances of this pattern will be proceeding at the same time, and different components may be at different points in their cycle.

undisplayed image

Non-linear progress and rewinding. The project’s life cycle patterns do not necessarily imply one-way linear progress; work does not actually happen perfectly linearly.

While someone is working on the specifications for a component, they may well be thinking of design approaches. Following the four-step pattern above, the design thinking falls into a phase later than the specification work. Until the specification is final (and the specification phase is completed), any design work must be considered conditional: it might be made irrelevant as new specifications are worked out. Making the phases explicit helps people understand what work can be relied on as baselined (Section 17.4.3).

A project or the work on one part of the system can potentially move through a phase, progress to another, and later rewind back to the earlier phase. A project might rewind to a design phase when a flaw is discovered. Some implementation work has been done and might still continue while the design is reworked, as long as there is someone to do it and parts of the implementation are unlikely to be affected by the redesign.

The dashed line in the following diagram shows how work proceeds on one component. It proceeds through specification and design into implementation, with accompanying reviews ensuring that both the specification and design are complete. During implementation, the need for a design change arises, and work reverts back to the design phase. Once the redesign is done and reviewed, work goes back to proceeding with implementation.

undisplayed image

Planning versus measurement. There are two ways that one can view life cycle patterns. The first way is as a path that guides the work: one must go here, then here, then there. The other is as a way to measure progress. Being in some phase means certain things are believed done, while other things are in progress and yet others will be done later. These two views are compatible, and it is useful to use both viewpoints.

The difference between the two comes when dealing with changes. If the work on some component is in phase X, what happens when an error is found in work from an earlier phase? Or when a request for a change in behavior arrives? And what if one chooses to build a component in multiple steps, creating a simple version first then adding capabilities over time?

This is where viewing the pattern as a measure of progress is helpful. Consider a component that is to go through specification, design, implementation, and verification phases. When the work is in implementation, the implication is that specifications and designs are complete and correct. If someone then finds a design problem, the expectation that the design is complete is no longer true. This situation leads to the tasks needed to make the condition “the design is complete and correct” true again. Put another way, this “rewinds” the status of the work on that component into the design phase. People will then do the tasks needed to advance back to the implementation phase by correcting the design and performing a review of at least the design changes.
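
One way to picture this bookkeeping is to track, for each component, the phase its work is currently considered to be in, and to allow that status to move backward as well as forward. The sketch below does this for the four-phase example in the text; the class and method names are my own, and a real project would attach this status to whatever tracking tool it already uses.

    # Sketch: tracking a component's phase, including "rewinding" when earlier
    # work is found to be flawed. The phase ordering is the example from the text.
    PHASES = ["specification", "design", "implementation", "verification"]

    class ComponentStatus:
        def __init__(self, name):
            self.name = name
            self.phase = PHASES[0]

        def advance(self):
            i = PHASES.index(self.phase)
            if i + 1 < len(PHASES):
                self.phase = PHASES[i + 1]

        def rewind_to(self, phase):
            # A flaw found in work from an earlier phase moves the status back there.
            if PHASES.index(phase) < PHASES.index(self.phase):
                self.phase = phase

    c = ComponentStatus("telemetry component")
    c.advance(); c.advance()       # specification done, design done: now in implementation
    c.rewind_to("design")          # a design flaw is found during implementation
    print(c.phase)                 # design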

Development methodology. The life cycle model is connected to the development methodology that a project chooses to follow. A project that uses an agile-style or spiral development methodology will use different patterns for some development steps than what a project following a waterfall methodology will use. I will discuss this further in Section 20.5 below.

Documentation. A project should clearly document the life cycle patterns it will use and make them accessible to the whole team. While the patterns are used directly for planning, making them accessible to everyone ensures that everyone knows the rules to follow and reduces misunderstandings about what is acceptable to do.

For some projects, the life cycle will be determined by an external standard. NASA defines a family of life cycles for all its projects [NPR7120]. This flow is designed to match the key decision points where the project is either given funding and permission to continue, or the project is stopped. It defines a sequence of phases A through F, with phases A-C covering development, D covering integration and launch, E covering operations, and F covering mission close out. Specific kinds of projects or missions have tailored versions of this overall life cycle.

Many companies have similar in-house project life cycle standards that revolve around decision points for approving the project for development and ensuring a product is ready for commercial release.

20.4 Procedures

Procedures define how specifically to do actions or tasks defined in the life cycle. They often take the form of checklists or flow charts.

Procedures are related to the system being built, but are generally portable between similar projects.

Having implemented clear procedures will:

Having common procedures for the whole team makes key work steps less matters of opinion and more based on shared fact. This can improve team effectiveness by removing a source of conflict between team members.

A project can realize these benefits only when the team members know what procedures have been defined, when they can find and understand the procedures, and when the team uses those procedures consistently.

Three steps help team members know what procedures are defined. First, the procedures should be defined in one place, with a way to browse the list of procedures as well as a way to find a specific procedure quickly. Second, the life cycle should indicate when one procedure or another is expected to be used. For example, when the life cycle indicates that an artifact should be reviewed, it should reference the procedure for performing the review. Finally, new team members should be shown how to find all this information for themselves.

Understanding and using procedures depends on the procedures being actionable: they should indicate the specific conditions where they apply, and provide a list of concrete steps for someone to perform. This is especially true for procedures that will be used when someone is under stress, such as in response to a safety or security accident. I have often seen “procedures” that say things like “contact the relevant people”—which is unhelpful. The procedure needs to list who the relevant people are (or at least their roles) so that a person in the middle of incident response can contact the correct people quickly.

20.5 Plan

The project life cycle and procedures define how people should get work done in the abstract, before any work actually gets done. Now I turn attention to how the actual work is organized in the plan and tasking, using the life cycle patterns and procedures as structure.

The plan is a record of the current best understanding of the path forward for the project. It contains the foreseeable large steps involved in getting the system built and delivered, and getting it to external milestones along the way. It guides the work, as opposed to people working on tasks at random.

The plan:

Plans versus schedules. I differentiate between a plan and a schedule.

A schedule is a “plan that indicates the time and sequence of each operation”.[3] In practice, a schedule is treated as an accurate and precise forecast of the tasks that a project will perform. People treat the timing information it provides as firm dates, and will count on things being done by those dates. Schedules are often part of contractual agreements. Because people outside the project use a schedule to plan their own activities, a schedule is hard to change.

Schedules are appropriate when the work to be done can be characterized with sufficient certainty. In most construction projects, for example, once the building design is complete, the site has been checked for geologic problems, and permits have been obtained, the remaining steps to actually construct the building are generally well understood and the time and effort involved can be estimated with confidence. However, before the site has been inspected one might not be able to create an accurate schedule because there could be undiscovered problems in the ground (perhaps an unmapped spring or an unstable soil layer).

The plan, on the other hand, is not a detailed schedule. It is a general indication of the steps to be taken, along with as much information about time required for different steps as can be estimated. It will reflect varying degrees of certainty about the steps and timing, from fairly certain in the near term to highly uncertain later in the work. It provides guidance, but it does not represent a promise of dates or exact sequencing of events.

undisplayed image

A plan is dynamic and constantly changing, as it is a reflection of where the project currently stands.

At the beginning of a project that requires innovation, the team is just beginning to work out what the system will be, and so they cannot build a schedule because there are too many unknowns. As the project works out the customer needs and basic concept, the flow of work becomes a little clearer but most of the work ahead is still unknown. People will continue to learn more and more about the system, and at each step there will be fewer unknowns and the certainty of plans can improve. Even so, the exact schedule is not known until the very end of the project—when there are no places left that could hide surprises.

To some degree, the difference between a schedule and a plan is an attitude. A schedule is something people treat as a contract, and so it does not accommodate uncertainty well. It includes a lot of detail; if that detail is uncertain, people will be constantly updating that semi-fictitious detail. A plan is a flexible current best estimate that doesn’t promise much except to accurately reflect what is known, and avoids information that might appear accurate but in fact is not certain. A schedule is useful to someone writing a contract to get something done. A plan is about an honest accounting of where the project stands and where it is going, and thus more useful to the people building the system.

Plan contents. A plan gathers four types of information:

  1. The set of work steps that can be foreseen to be needed. Some will be detailed; others will be vague or general.
  2. Milestones, both internally-defined and those imposed from the outside.
  3. Dependencies among the work steps, and between work steps and milestones.
  4. Estimations of uncertainty about all of these.

The chunks of work and milestones form an acyclic graph, with dependencies as edges between the work or milestones. The work can be annotated with estimates of resources or time required, to the degree those are known—and they should not be annotated if the information cannot be estimated with reasonable confidence.

In addition, some projects will give work steps a priority or deadline. Tasks that should be done soon should be scheduled early, perhaps to meet a deadline, to address uncertainty, or to account for a task that is expected to be lengthy.
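
To illustrate, a plan can be recorded with the same kind of structure used for life cycle patterns: work steps and milestones as nodes, dependencies as edges, and estimates attached only where they can be made with reasonable confidence. Everything named below is invented for illustration; it is one possible encoding, not a recommended format.

    # Sketch of a plan as an acyclic graph of work steps and milestones.
    # Names, estimates, and dependencies are illustrative only.
    plan = {
        "gather customer needs":      {"depends_on": [], "estimate_weeks": (2, 4)},
        "develop system concept":     {"depends_on": ["gather customer needs"], "estimate_weeks": (4, 8)},
        "concept review (milestone)": {"depends_on": ["develop system concept"]},
        # Too uncertain to estimate yet, so no estimate is recorded.
        "implement system":           {"depends_on": ["concept review (milestone)"]},
    }

Steps without estimates are an honest statement that the duration is not yet known; attaching a number there would only create false precision.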

There is no set format for recording a plan. I have used scheduling tools that use PERT charts and Gantt charts as user interfaces. I have used diagramming tools that help the user draw directed graphs. I have used graphs and time tables written on white boards. I have used tools meant for agile development, with task backlogs and upcoming iterations. All of these have had drawbacks—scheduling tools are not meant for constantly-changing plans; agile development tools are structured around that methodology; drawings on white boards and drawing tools are hard to update over time or to share.

Making and updating the plan. The plan starts at the beginning of a project, and is continuously revised until the project ends.

Assembling an initial plan starts with knowing the status of the project and working out the destinations. At the beginning of a project, the status is that the project is largely undefined beyond a general notion of what customer problem the system may solve. The endpoint might be delivering a working system, or it might be delivering a series of systems that grow over time.

undisplayed image
Initial plan for a new project.

If the project is already in progress, one starts on the plan by working out what is currently completed and in work.

undisplayed image
Example initial plan with milestones filled in.

The next step is to fill in major intermediate milestones and work steps. The project’s life cycle patterns should provide a guide to these. For a new project, the life cycle might indicate that the project should start with a phase to gather information about customer needs. As the first phases progress, the team will begin to develop a concept for the structure of the system. If the customer or funder has required some intermediate milestones, those can be laid into the plan, along with very general work steps for getting ready for each of those milestones.

undisplayed image
Example life cycle pattern for the overall project.

It is normal for the plan to have large work steps that amount to saying that somehow the team will get something completed or designed or whatever. In the example above, “implement system” is completely uncertain when the project starts. When one does not know how part of the system will be designed, or how to implement some component, or even how some part of the work should proceed, it is better to put in a work step that accurately reflects the uncertainty. Being accurate about what is known and not known prompts people to find answers to the unknowns, gradually leading the plan toward greater certainty.

The plan then grows according to the system design. As the team works out the components that will make up the system, each new component creates a stream of work to be done to specify, design, implement, and verify that component, as specified by the life cycle. All these add new work steps into the plan, along with dependencies from one step to the next.

undisplayed image
Example pattern for developing a component (linearly).

The plan should be revised regularly. It will change whenever there is some change to the likely structure of the system and as each component proceeds through its specification and design work. Many components will require some investigation, such as a trade study or prototyping, before they can be designed. The plan will evolve as those investigations generate results.

Part way through building the system, the plan will typically become large and show significant parallelism. This is also normal and desirable, because it reflects the true state of development. Mid-project there usually are many things that people could be working on. The plan should reflect all these possibilities so that those managing the project know the true status of the work and can make decisions with accurate information.

undisplayed image
Example plan in progress. Some steps are complete, some are in progress.

The life cycle patterns a project uses provide building blocks out of which people can construct parts of the plan, but they do not dictate the plan entirely. Maintaining the plan is not simply a mechanical process of adding a set of work steps each time someone adds a new component to the design. There are three more factors to consider, and these make maintaining the plan a task requiring some skill.

First, the various components will be integrated into the system. The steps to put the components together and then verify that they interact correctly add more work steps.

Second, a component does not necessarily proceed linearly through specification to design to implementation. Often the design will require investigation, perhaps a trade study to compare possible alternatives. In many cases it is worth building a simple prototype of one or more of these alternatives to learn more about the component before settling on a design. This turns a design step into several steps. Sometimes the outcome of an investigation is that the whole approach to designing a set of components is wrong and design needs to be revisited at a higher level. (This is the rewinding discussed in the section on life cycle above.)

Third, many system development disciplines, such as agile or spiral development, do not proceed linearly with developing a component from start to finish in one go. They often focus on building a simple version of a component or of a collection of components first, and then adding features over time.

Each project will have its own style for addressing these factors, and this will be reflected in the specific work steps included in the plan. For example, when a project follows a spiral development methodology, the plan for developing a part of the system might have several internal milestones: first a simple version of the components that can do some minimal function, then another version or two with increasing function. There might be design, implementation, and verification steps for each component involved for each milestone.

A project should document what methodology it has chosen, so that team members know what to expect and so they can plan consistently.

Plan and tasking. The plan is used to guide tasking—the assignment of specific tasks to specific people (Section 20.6). The plan includes work steps that are in progress and ready to be executed. These are the sources of tasks that people can pick up and work on.

Most of the time, there will be more tasks that are ready to be worked on than there are people to do them. The plan organizes the tasks and thereby helps the process of deciding what someone should do—whether a manager makes task assignment decisions or people pick tasks for themselves. If work steps in the plan include priorities, those will help guide task assignment decisions.

The plan and tasking together support accountability and measurement. They should allow someone to identify when a plan was changed, to see if the change was an improvement in retrospect. They should help identify when some tasks were completed faster or slower than expected, or completed with quality problems. This information can be used to improve forecasting and to identify tasks and procedures that should be restructured.

Plan and forecasting. Most projects will have deadlines they must meet. Customers want estimated delivery dates, so they can make preparations for steps they will take to put the system in operation. Funders may want intermediate milestones to show that their investment is on track. Others want to know the budget—money and time—required to get the system built, or to meet other internal milestones. The team will need to manage project execution in order to meet those deadlines.

One can look at this as a control problem. Forecasts using the plan provide the control input: based on the current plan, including its uncertainties, is the project likely to hit a deadline or not? The control actions are to rearrange the work steps in the plan, or to add and remove steps. Adding or removing steps often means adding or removing capabilities from the system, also known as adjusting the system to fit the time available.

Forecasting using the plan will always be imprecise because the plan reflects the actual uncertainty in the project. In some industries it is possible to estimate the time and effort required for work steps, within a reasonable error bound, once the system is well enough understood—for example, in many building construction projects. However, when building systems that do not have extensive comparable systems to work from, estimates will be unreliable for much of the project’s duration.
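
One common way to produce a forecast that carries this uncertainty along, rather than a single falsely precise date, is to sample ranges of estimates instead of adding up point values. The sketch below applies that idea (a generic Monte Carlo technique, not a method this book prescribes) to a simple chain of steps with made-up three-point estimates.

    # Sketch: forecasting the duration of a chain of work steps by sampling
    # (optimistic, likely, pessimistic) estimates. All numbers are illustrative.
    import random

    steps = {
        "specify":   (2, 3, 6),    # weeks
        "design":    (3, 5, 10),
        "implement": (6, 10, 20),
    }

    def sample_total(steps):
        return sum(random.triangular(lo, hi, likely) for lo, likely, hi in steps.values())

    samples = sorted(sample_total(steps) for _ in range(10_000))
    print("median:", round(samples[len(samples) // 2], 1), "weeks")
    print("90th percentile:", round(samples[int(len(samples) * 0.9)], 1), "weeks")

Reporting a range or a percentile, rather than a single date, keeps the forecast honest about what the plan actually knows.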

There are ways to manage a project’s plan to reduce uncertainty as quickly as possible. I discuss those later in this book.

20.6 Tasking

Tasking is about the day-to-day management of what tasks people are working on and what tasks are ready to be worked.

The choice of what tasks are ready is based on the plan, along with bugs that have been found, management tasks that need to be done right away, and ongoing tasks that do not show up in the plan.

Tasking builds on the plan. The plan should account for which tasks need to be done sooner than others in order to meet deadlines or to avoid stalling because of a dependency between tasks.

The objectives for tasking are:

One can treat tasking as a decision or control process that works to meet those objectives. Other scheduling disciplines, such as job-shop scheduling and CPU scheduling, can provide useful ideas for how to make choices about who should work on what.

There are many different choices about when, who, and how much. Each project will need to define its own approach, usually following whatever development methodology the team has selected. The approach should be documented as a procedure that the team follows.

Decisions about tasking can happen at many different times. It can happen reactively, when one task is completed, when a task someone is working on is stalled waiting for something else to happen, or when some urgent new task arrives (such as a high-priority bug or an external request). It can also happen proactively or periodically, putting together a set of tasks for someone to do ahead of time.

Tasking can be done by different people as well. In the teams model (Section 19.2.4), the authority to make tasking decisions is a role that can be assigned. One person can have a scheduler role and make these decisions. A group can divide up tasks by discussing and reaching consensus. Each person can take on tasks when they are ready for more. Combinations of these also work.

Finally, tasking decisions can occur one task at a time, or they can focus on giving each person a queue of tasks to perform.
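
As a sketch of the simplest of these arrangements (one task assigned at a time from the ready set), the code below picks the highest-priority ready task and gives it to the least-loaded person with the needed skill. The fields, priorities, and skills are invented for illustration; this is one possible policy, not a recommendation.

    # Sketch: assigning one ready task at a time. Task and person attributes
    # are illustrative; a real project would draw them from its plan and directory.
    tasks = [
        {"name": "fix telemetry bug",   "priority": 1, "skill": "software"},
        {"name": "review power budget", "priority": 2, "skill": "electrical"},
    ]
    people = [
        {"name": "A", "skills": {"software"},               "open_tasks": 2},
        {"name": "B", "skills": {"software", "electrical"}, "open_tasks": 1},
    ]

    def assign_next(tasks, people):
        for task in sorted(tasks, key=lambda t: t["priority"]):
            candidates = [p for p in people if task["skill"] in p["skills"]]
            if candidates:
                person = min(candidates, key=lambda p: p["open_tasks"])
                return task["name"], person["name"]
        return None

    print(assign_next(tasks, people))   # ('fix telemetry bug', 'B')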

A large project will have a very large number of tasks—potential, in progress, and completed—to keep track of. Using a shared task tracking tool of some kind is vital. Without one, tasks will be forgotten, or there will be confusion about how they have been assigned. The tracking tool is another one of the tools that the project should maintain (Chapter 18).

Each task must be defined clearly enough that the person doing the work can properly understand what is to be done, and so that everyone can agree when the task is complete.

20.7 Support

The decisions made in planning and tasking need supporting information.

Risk and uncertainty affect choices of what should be done sooner or deferred. I have often chosen to prioritize work that will reduce risk or clarify uncertainty, in order to make the project more predictable down the road. Many projects maintain a risk register, which lists matters that could put the project at risk. These risks are often programmatic, such as the risk that a delayed delivery from a vendor will cause the project to miss a deadline. I have on some projects maintained a separate, informal list of the technical uncertainties yet to be worked out; for example, how should a particular subsystem work?

Project management will also need to manage budgets. Programmatic budgets, most often funding, affect how project execution can proceed. Technical budgets, such as mass, power, or bandwidth, are aspects of the system being built. For both types of budgets, the amount of the resource that has been used and the amount left need to be tracked. The project will also need to estimate how much more will be needed to finish the project. If there isn’t sufficient resource left, then project management has a decision to make—whether to reallocate resources, reduce demand, or find more resources.
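
A minimal sketch of the bookkeeping involved, for either kind of budget, is to compare what remains against an estimate of what is still needed. The names and numbers below are invented for illustration.

    # Sketch: checking a budget (programmatic or technical) against the
    # estimate to complete. Values are illustrative.
    budget = {"allocated": 120.0, "used": 85.0}   # e.g., kg of mass, or funding
    estimate_to_complete = 50.0

    remaining = budget["allocated"] - budget["used"]
    shortfall = estimate_to_complete - remaining
    if shortfall > 0:
        print(f"Short by {shortfall}: reallocate, reduce demand, or find more resources")
    else:
        print(f"Margin of {-shortfall} remaining")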

Almost every project will need to report on how the work is progressing, relative to deadlines and available resources. The plan mechanisms should help people obtain and organize this information.

20.8 Using the operations model

20.8.1 Managing operations

Managing operations has much in common with managing the team structure (Section 19.3.1). The approach to operations is defined early in a project, leads to habits that the team uses, and is maintained as the project moves along. Operations fits into the control system model proposed for teams.

The team decides on its initial approach to operations when the project starts. The initial approach might be simple and worked out on the fly, appropriate for a project that must first sort out what kind of system it is going to build, as happens in a startup company. Other projects will inherit an existing approach to operations based on the organization they are part of or from the team’s previous experience.

A team establishes habits of how to do their work early on in a project. People mostly act on their internalized understanding of how they are expected to work, not by referring to documents or by working out what they should do from first principles. When these habits match the behaviors needed for the project to meet its objectives—such as those listed in Section 20.1—then the project can proceed smoothly. If there is a mismatch, then the project will have problems.

This implies that a project should establish its operations approach early, meaning the life cycle patterns and procedures it uses, along with assigning roles to handle plans, tasking, and support. The earlier that a project makes these decisions, the sooner the team can learn them and begin building habits around the intended approach. I have supported projects that prioritized writing code over working out operations; when those projects later tried to confront the operational problems they were having, they were unable to get people in the team to change the habitual and dysfunctional behaviors they had worked out in the absence of direction.

At the same time, no approach to operations will be perfect, and a project’s needs often change over time. A well-functioning project has people responsible for watching how well project operations are working, detecting when there are problems, and adjusting the life cycle, procedures, or assignments from time to time. The control system approach introduced for teams is helpful for thinking about how to maintain a project’s operations.

20.8.2 Development methodology and operations

Each project will at some point choose a development methodology to follow. There are several popular methodologies, such as waterfall development, spiral development, or agile development, along with a great many variants of each.

The operations model I have presented can support any of these methodologies. The methodologies affect the life cycle patterns, how the plan is structured, and how tasking is done.

Waterfall development is characterized by developing the system linearly, starting with a concept and working through design and implementation of each of the pieces, then integrating those pieces together to form the final system. The life cycle pattern for waterfall development will reflect this linear ordering, and plans will follow the life cycle pattern.

Spiral development is organized around a set of intermediate milestones. The system becomes a bit more complete at each of these milestones (or iterations). Each milestone adds some set of capabilities, and the system, or some part of it, is integrated and operable at each milestone. The life cycle pattern for spiral development will define the way each spiral or iteration proceeds. The plan will reflect how the team will reach each of the upcoming milestones.

Agile development is organized around short cycles (called sprints in some versions of the methodology). Each cycle typically lasts a few weeks, and adds a small number of capabilities to the system. The system is expected to be integrated and operable at the end of each cycle. Unlike spiral development, the objectives for each cycle are typically decided at the beginning of the cycle based on the set of tasks that are ready to execute, and priorities for each task. This means that agile development is primarily about tasking, and it relies on a plan that defines what all the ready tasks are. Agile development can be complemented by a life cycle pattern that imposes discipline on the order of tasks—such as doing specification and design before implementation, or setting a review and approval step to ensure work quality.

In practice, most projects end up using a combination of methods.

The cost or difficulty of changing a decision usually drives a project to combine methods. The easier it is to change a decision, meaning undoing the work of some tasks already completed, the more agile the methodology can be. The more costly it is, the more care that should be taken to ensure that changes downstream are unlikely. (I discuss this further in Section 21.10, using the idea of reversible and irreversible decisions [Bezos16].)

The cost of change is significantly lower near the beginning of a project, when there is less work to be redone and when one change will not cause a cascade of changes to other work already completed. As work progresses, a particular change will become increasingly costly.

The cost of change also depends on the kind of work involved. Software and similar artifacts are malleable. The cost of changing a line of software source code or changing one line in a checklist is, in itself, tiny, though a change in the software may cause a cascade of changes in other parts of the system and may cost time and effort to verify. Changing a built-up aircraft airframe, on the other hand, is costly in itself—in both materials and in effort.

These differences in the cost of change lead to differences in life cycle patterns and planning related to potentially-expensive decisions. For example, the NASA family of life cycles [NPR7120] follows a linear pattern in its early phases so that key aspects of the project can be worked out before the agency commits to large amounts of funding, especially for building aircraft or spacecraft hardware. Parts of some of these projects follow a more agile process after they have passed the Critical Design Review milestone (Section 23.2.1).

20.8.3 Practical considerations

Some people will look at the life cycle and procedure parts of this operations model and say that it is “process”—a term that has acquired a negative connotation. Yes, the life cycle and procedures do define processes that are supposed to guide the team. Process, when done well, helps a team work more effectively and more happily. Such process is simple: it is a guide for how to do common sequences of events, or tasks that are critical to be done a certain way. It provides a checklist to make sure things get done and aren’t missed. It encodes checks to make sure technical work is done correctly.

I have outlined the advantages that life cycle patterns and procedures can bring to operations in Section 20.1 above.

In my experience, the potential disadvantages, and the reasons people have come to dislike the idea of process, arise from three misuses of operations: making it too heavy, making it too complicated, and defining something the team is unwilling to use.

As an example, a colleague told me about a project they had worked on where getting approval to order a fairly simple part (for example, a cable) took multiple approvals and potentially weeks to complete (heavy process). Indeed, nobody was even sure exactly how to go through the process to get an approval to get the part ordered (complicated process). The processes were, presumably, put in place to ensure that only parts of sufficient quality were used and to manage the spending on parts acquisition. In practice the amount of money spent on people’s time far outweighed potential cost savings, and the amount of work required for people to review an order over and over meant that the reviewers did not have the time needed to perform meaningful quality checks.

A “heavy” life cycle or procedure is one that takes more effort or more time than is warranted for the value it provides. This works against the objective of doing work efficiently. Each part of a life cycle pattern or procedure should have a clear reason for being included. The effort and time involved should be compared to that reason, and the procedure or pattern should be redesigned if the comparison shows it is too heavy. To avoid this, each procedure and life cycle pattern should be scrutinized to eliminate any steps that are not actually needed.[2]

A complicated life cycle or procedure is one that involves many steps, often with complex conditions that have to be met before some step can proceed. In the example from my colleague, nobody on the team could figure out all the steps that needed to be done. This can be avoided by, first, ensuring that procedures are as simple as possible, and second, by documenting them and making that documentation easy for people on the team to find and understand.
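
As a concrete illustration of keeping a documented procedure simple and findable, the sketch below encodes a checklist-style procedure as plain data. The format, names, and example steps are my own invention for illustration, not a prescribed tooling choice.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    """One step in a documented procedure."""
    action: str        # what to do
    reason: str        # why the step exists; steps with no reason are candidates for removal
    done: bool = False

@dataclass
class Procedure:
    """A documented procedure: a named checklist with a stated purpose."""
    name: str
    purpose: str
    steps: list[Step] = field(default_factory=list)

    def remaining(self) -> list[Step]:
        """Steps not yet completed, so anyone can see what is left to do."""
        return [step for step in self.steps if not step.done]

# Hypothetical example: a deliberately light parts-ordering procedure.
order_part = Procedure(
    name="Order a simple part",
    purpose="Ensure part quality and spending control without weeks of approvals",
    steps=[
        Step("Check the part against the approved-parts list", "quality control"),
        Step("Get one approval from the responsible engineer", "spending control"),
        Step("Place the order and record it", "traceability"),
    ],
)
print([step.action for step in order_part.remaining()])
```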

Teams are generally willing to follow procedures, as long as a) they know what the procedures are; b) they understand the value of following them; and c) following procedures has been made a part of the team’s norm. This means that the life cycle patterns and procedures should be documented, and their purpose or objectives should be spelled out. Normalizing following the procedures, however, is not something that can be accomplished by just writing something down. This has to be practiced by the team from the start, with leadership setting examples (see Section 19.3). Involving the team in setting up the life cycle patterns and procedures can help people understand and buy into the process.

Bear in mind that when a project adopts a particular life cycle pattern, the project is making an implicit commitment about staffing. If the pattern indicates that certain reviews must happen before key events happen, like ordering an expensive piece of equipment or beginning a complex implementation effort, then the project must ensure that there are enough people with enough time to perform those reviews. If the project does not staff enough, people on the team will quickly learn the (correct) message that the project or its organization does not actually care about the reviews and will begin to work around the pattern.

How all of this is handled for a particular team depends a lot on the team’s size. It’s common for a project to start with a simple life cycle and simple planning while the team is small and the project is uncertain. The project will need to shift strategies at times as the team grows and as the work becomes more complex and interconnected (see team growth in the previous chapter).

Sidebar: Summary

Appendixes

Appendix A: From stakeholder need to model purposes

8 January 2024

A.1 Introduction

In Chapter 16, I presented an approach for determining what features and capabilities the project should support in order to do a good job of building a system and meeting stakeholder needs. In this appendix, I present the details of that derivation.

Bear in mind that this derivation results in a set of objectives for a project. It does not say how any particular project should meet these objectives; each project must decide those things in ways that meet the specific needs of that project and that system. The objectives can be seen as a set of considerations that each project should examine as they decide how to run the project.

The derivation only addresses matters that are related to the project’s approach to building a system. There are many other factors outside this scope: matters of project management, or of policy in the organization that hosts the project. Where appropriate I have made notes of these matters external to the system-building scope.

A.1.1 Stakeholders

The set of stakeholders is:

  1. The customer for which the system is being built;
  2. The team that builds the system;
  3. The organization(s) of which the team members are part;
  4. Funders who provide the investment to build the system; and
  5. Regulators who oversee the system and its building.

I introduced each of these in Section 16.2.

A.1.2 Model elements

I introduced the model for making systems in Section 7.3. This model is organized around the tasks that need to be performed to build the system, and has the following elements:

  1. Artifacts that are created by performing tasks, and represent the system and records about it;
  2. The team that builds the system by performing tasks and making artifacts;
  3. The tools that people on the team use in doing tasks; and
  4. The plan that organizes what tasks need to be done, in what order, and using what resources.

In addition to these elements, I have included an element for matters external to the system-building project: things that stakeholders need but that are not about building the system itself.

A.1.3 Derivation

The derivation maps stakeholder needs onto objectives for parts of the model.

(Figure: the derivation mapping stakeholder needs onto objectives for parts of the model.)

The result is a set of objectives or capabilities that people should consider when working out how the project should operate.

I discuss each stakeholder in the sections that follow, along with tables of the needs or objectives of each. The objectives that support these stakeholder objectives are annotated with a right-pointing arrow: →.
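
To make the notation in the tables below concrete, here is a minimal sketch (my own illustration, not part of the derivation) of how an objective, its statement, and its supporting cross-references might be represented and checked for dangling identifiers. The two entries are transcribed from the Customer and Artifacts tables; everything else is assumed.

```python
from dataclasses import dataclass, field

@dataclass
class Objective:
    """One entry in a derivation table: an identifier, a statement, and the
    identifiers of the objectives that support it (the arrow annotations)."""
    ident: str                            # e.g. "model:2.1.1"
    statement: str
    supported_by: list[str] = field(default_factory=list)

# A small fragment transcribed from the tables in this appendix.
objectives = {
    "model:2.1.1": Objective(
        "model:2.1.1",
        "The project must know what the customer's purpose for the system is",
        ["model.artifacts:2.1", "model.plan:3.2.1", "model.team:2.1.1.1"],
    ),
    "model.artifacts:2.1": Objective(
        "model.artifacts:2.1",
        "The artifacts must include documentation of the customer's purpose for the system",
    ),
}

def dangling_references(objs: dict[str, Objective]) -> set[str]:
    """Identifiers that are referenced but not defined -- a useful check when
    maintaining a derivation like this one by hand."""
    referenced = {ref for obj in objs.values() for ref in obj.supported_by}
    return referenced - objs.keys()

# The plan and team entries are not defined in this fragment, so they are reported.
print(dangling_references(objectives))
```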

A.2 Stakeholders

A.2.1 Customer

The customer (see Section 16.2.1) is a stakeholder who wants the system built because they are going to use the system. They may or may not be funding system development directly—if they are, then they are also a funder below.

model:2 Customer
2.1 Fill purpose
The project must deliver a system that meets the customer’s purpose
2.1.1 Know purpose
The project must know what the customer’s purpose for the system is
model.artifacts:2.1
model.plan:3.2.1
model.team:2.1.1.1
2.1.2 Build to purpose
The project must produce a system that meets the customer’s purpose
model.artifacts:1.1, 2.1, 4.2, 4.4, 4.5, 5.1, 5.2
model.plan:1.2, 2.1, 2.2, 2.3, 3.3, 3.3.2
model.team:2.2.1, 2.2.2, 2.5.1
model.tools:3.1, 3.2, 4.1
2.1.3 Know requirements
The project must know the customer’s reliability, safety, and security requirements
model.artifacts:2.1.2
model.plan:3.2.1
model.team:2.1.1.1
2.1.4 Meet requirements
The project must produce a system that meets the customer’s reliability, safety, and security requirements
model.artifacts:2.1.2, 4.5, 5.1, 5.2
model.plan:3.3, 3.3.5
model.team:2.2.1, 2.2.2
2.1.5 Free of errors
The project must produce a system that is free of errors
model.artifacts:4.5
model.plan:3.3.5
model.team:2.2.2
2.2 On time and budget
The project must deliver a system by the required deadline and within the needed budget
model.plan:1.2.5, 4.1, 4.2
2.2.1 Know budgets
The project must know the budgets and deadlines for delivering the system
model.plan:3.2.2, 4.1
2.2.2 Know consumption to date
The project must know the resources and time that have been used to date and that count against budgets or deadlines
model.plan:4.1.1
2.2.3 Project forward usage
The project must be able to project the resources and time required to complete the system or meet other deadlines
model.plan:1.2
2.2.3.1 Uncertainty
The project must be able to estimate the uncertainty in any forward projections of resources or time
model.plan:1.2.1
2.2.4 Control execution
The project must be able to control execution to adjust resource and time consumption
model.plan:1.2.4
2.3 Certifications
The project must deliver a system that has appropriate certifications or approvals
2.3.1 Know regulations
The project must know the regulations or standards that apply to certification/approval
model.artifacts:8.1
model.plan:3.2.5
2.3.2 Follow process
The project must follow any processes required to get certification/approval
model:2.5.2
model.artifacts:8.2, 8.3
model.plan:3.3.1.1, 3.3.2.1, 3.3.3.1, 3.3.7
2.4 Release and deployment
The project must be capable of releasing a version of the system and deploying it to a customer
model.artifacts:1.1, 6.1
model.plan:3.4
model.team:2.5.1
model.tools:3.5, 4.3
2.5 Evolve system
The project must evolve the system in response to changes in customer or other needs
2.5.1 Receive requests for change
The project must be able to receive and process requests for change from the customer
model.plan:5.1, 5.3
model.team:2.1.1.2
2.5.2 Receive regulatory changes
The project must be able to receive and process changes in regulatory requirements
model.plan:5.2
model.team:2.3.1.2
2.5.3 Know purpose of change
The project must know the purpose of the change (and the change in system purpose that results)
model.artifacts:2.2
2.5.4 Build to meet change
The project must be able to produce a system that meets the changed purpose while maintaining the system’s other purposes and requirements
model.artifacts:1.1, 2.1, 2.2, 4.2.1
model.plan:1.2, 2.1, 2.2, 2.3, 3.3.6
model.team:2.1.1.2, 2.2.1, 2.2.2, 2.5.1

A.2.1.1 Filling purpose

A customer has some purpose for the system, meaning something they want to achieve by deploying and using the system. This is the problem that the customer wants solved, which is a higher-level concern than the specific features that the system will provide.

A customer may have additional requirements on the system. They likely have a need for a minimum level of reliability. They likely have needs related to safety and security of the system.

The project needs to build a system that can meet this purpose and the requirements.

The project can meet these needs by:

A.2.1.2 On time and budget

The customer likely has a deadline by which they would like the system delivered. They likely also have a budget for how much they want to invest in acquiring the system. At minimum, customers generally want the result as soon as possible and for as low a price as possible.

To meet these needs, the project should:

A.2.1.3 Certifications

In many industries, some kind of certification or approval is necessary to operate the system. An aircraft, for example, needs a type certification from the local aviation authority as well as approval for a specific instance of the aircraft. Even if there is no overt certification required, there are often regulatory standards to be met.

The project must build the system in compliance with regulations. When certification is needed, the project must follow the process to get that certification.

To achieve this, the project should:

A.2.1.4 Release and deployment

The customer needs the system actually to be delivered and put into operation. The project must deliver the system, and provide or support its deployment.

To do this:

A.2.1.5 Evolve system

If the system is successful, the customer often finds that it can be made even better with some changes. Or the customer’s needs may change, and they will want the system to adapt to meet their changed needs. The project should be able to maintain and evolve the system to support the customer’s changing needs.

A system may also need to change when regulations change.

The project can support an evolving system by:

A.2.2 Team

The team (see Section 16.2.2) is the collection of stakeholders who build the system. These people need the things that skilled, technical workers generally need: satisfaction, security, confidence, compensation.

Meeting these needs is mostly outside the scope of systems-building itself. These needs are largely met by the project and organization management who create the environment in which the team works. Still, there are aspects of systems-building that can help (or hinder) meeting the team’s needs.

The analysis of a team’s needs presented here is somewhat idealistic. It focuses on skilled workers who are not readily interchangeable, whose value to a project derives in part from the knowledge they carry about the system being built. It assumes workers who are motivated largely by work satisfaction and whose essential material needs are met by their compensation. These assumptions lead to a particular balance of power between the team and the organizations that employ them. This would not apply to a team of interchangeable workers or workers whose material needs are not well met by their employment.

model:3 Team
3.1 Satisfaction in the work
The team must have work that challenges them and results in satisfaction in what they produce
3.1.1 Positive outcome of work
The team must have confidence that their work will have a positive outcome
model.external:1.1
model.plan:1.1, 1.2
3.1.2 Challenging work
The team must find that the project’s work challenges them and makes use of their skills while remaining achievable
model.external:1.1, 1.2
3.1.3 Avoid irrelevant work
The team must believe that they are not being asked to do irrelevant work as part of the project
model.artifacts:1.3, 1.3.1
model.external:1.2
model.plan:1.2.5
model.tools:1.1
3.2 Appropriate staffing
The team must be staffed with the right people to do the work
3.2.1 Sufficient staffing
The project must have sufficient staff, with the right skills, to build the system
model.external:1.3
model.plan:1.2.3, 6.1, 6.3
3.2.2 Not overstaffed
The project must not be overstaffed in a way that leaves some unable to make meaningful contributions
model.external:1.3
model.plan:1.2.3, 6.1, 6.4
3.3 Sufficient supporting resources
The project must provide the team with sufficient resources to do the work
model.tools:3.2, 3.3, 4.1, 4.2, 5.1
3.4 Secure position
The people in the team must feel secure in their position in the team
3.4.1 Understanding of fit
The team members must understand how they fit into the organization
model.external:1.5
model.team:1.1
3.4.2 Clear expectation
The team members must have a clear and correct understanding of their responsibilities in the project
model.plan:1.2.7, 2.4, 3.2.3, 6.2, 6.3
model.team:1.2
3.4.3 Fair evaluation
The team members must have an expectation that their work will be fairly evaluated
model.external:1.7
model.team:1.2.1
3.4.4 Clear lines of authority
The team members must have a clear understanding of the authority of others in the project
model.artifacts:3.1
model.plan:3.2.3, 6.2, 6.3
model.team:1.1.1, 1.1.2
3.4.5 Ability to raise issues
The team members must have the ability to raise issues about the team and about the system, without retribution
model.external:1.4
model.plan:2.4, 3.3.5
model.team:4.1
3.5 Fair compensation
The team must be fairly compensated for their time and effort
model.external:1.6
3.6 Belief in project
The team must be able to believe in the project, its purpose, and its leadership
3.6.1 Belief in objective
The team must have confidence that the organization is accurately working with the customer
model.plan:1.1
model.team:2.1.2
3.6.2 Ethics
The team members must believe that the system will be used in ways that accord with their ethical beliefs
model.artifacts:2.1.3
model.external:1.8

A.2.2.1 Satisfaction in the work

Team members are expected to need satisfaction arising from the work they are doing on the project.

The satisfaction comes in part from believing that the work they are doing will have some positive outcome. That outcome might be that they see the system deployed and having a positive effect on the world. It might be that they see their work acknowledged, publicly or privately, even if the system ultimately is not deployed. It could come from social standing among their peers improving because of their association with the work.

Skilled workers also want work that makes use of their skills—which leads to a sense that they, as a specific individual, are making a contribution to the work. Work that challenges them or from which they learn things contributes to that satisfaction.

Doing work that is seen as not relevant or not likely to have value decreases their satisfaction. If asked to do something that is not achievable, they will lose enthusiasm. If they are asked to do work that they perceive as irrelevant, they will feel that their individual contribution has little value.

To support team satisfaction, the project can:

Other aspects are outside of the project’s scope.

A.2.2.2 Appropriate staffing

A team needs to have enough of the right people to do the work—but not too many people. Having enough people on the team who can do the work contributes to a team member’s sense that the project has a good chance of having a positive outcome.

Having too few people, or too many people who lack necessary skills, leads to an overworked and burnt out team.

Having too many people leads to team members who don’t have useful work to do. It can lead to people making up new work just to feel like they are contributing.

The “right” staffing level is dynamic. It changes over time as the project moves forward: a particular skill in designing electronics boards may be important for one period in the project, but once the necessary hardware has been designed and built, the need decreases. It changes over time as people change. As a team member learns new things, they may find that they should move on to a different project. Life events occur that change a person’s motivations and needs. The key is not to always have the perfect cohort working on the project, but to have a pretty good group and work to address changes as they happen. If the team trusts that management is able to address team composition, people will generally stay satisfied.

Ensuring appropriate staffing involves:

Much of this is outside the scope of the project itself. The organization holds the funds used to pay staff. It also provides the ability to hire and fire people.

A.2.2.3 Sufficient supporting resources

As with staffing, the team needs resources to do their work: a place to work and the tools to do the work, for example. They may need consumable resources as well. For example, a team might need a ready supply of liquid nitrogen in order to test a hardware component that is supposed to operate at low temperature.

If the team lacks these resources, they can’t do their work. This affects their satisfaction.

The project needs to have:

A.2.2.4 Secure position

Team members need to have a sense of security in their position. This means that they need to believe that they understand their position in the project and organization, believe that they will be treated fairly, and believe that issues they raise will be addressed. The opposite of this is when they have a sense of insecurity—because they do not understand what is expected or how they are evaluated, or because they believe that problems will not be resolved, even if they raise an issue.

The sense of security allows people to put their effort into their work, rather than spending their time and energy on worry. The sense also helps keep people on the team so that their knowledge of the system continues to benefit the project.

This comes from the project:

The organization also needs to:

A.2.2.5 Fair compensation

A technical worker needs to believe they are being fairly compensated for their time and effort. They need to be compensated well enough that they are not distracted by want. That compensation may be monetary, but it may take other forms as well.

Setting compensation policy is usually a responsibility of the organization, not the project.

A.2.2.6 Belief in project

Skilled workers often have choices about what project they work on. Many of them are motivated by a belief in the work they do: that it will help its users, or that it will result in some good for the world. If they come to believe that either or both is not true, they will be demotivated.

The project should:

The organization should also maintain an ethics policy that details:

A.2.3 Organization

The people in the team work for the organization, which provides a home for the project (see Section 16.2.3.) The organization is responsible for finding funding for the project and providing a legal entity for doing the work. The organization provides the business operations that make the project possible.

There is no one kind of “organization” that fits all situations. The organization might be anything from a single person, to a company, to a consortium of organizations, depending on the project. The organization might exist to return profit in exchange for the work, or it might be a non-profit or a governmental organization that looks for non-monetary benefits from the project. Some organizations exist only to build and deliver one system; others expect to deliver the system to many customers and to build more systems in the future.

Many of an organization’s needs are not to be met by the system-building project itself; they are met by how well the organization’s management and business operations function. Nonetheless, how the system is built can help or hinder business management or operations.

The diversity of kinds of organizations means that the list of needs below has to be tailored for each project and each organization.

model:4 Organization
4.1 Ability to deliver
The organization must have the ability to deliver the working system to the customer
4.1.1 Ability to communicate with customer
The organization must be able to communicate with the customer
model.team:2.1.1
4.1.1.1 Conflict resolution
The organization must be able to negotiate and resolve conflicts between the team and the customer
model.team:2.1.1.3
4.1.2 Support for the team
The organization must have the infrastructure to support the team
4.1.2.1 Leadership
The organization must have leadership that can run the organization in a way that enables the team
model.external:1.4, 1.5, 1.7, 1.8
4.1.2.2 Infrastructure
The organization must have the ability to staff and finance the team
model.external:1.3, 3.1
4.1.2.3 Resources
The organization must have resources to hire the team and for them to operate
model.external:1.3
4.1.2.4 Workplace regulation
The organization must provide a workplace that meets regulation
model.external:1.9
4.2 Ability to sell
The organization must have the ability to sell the system produced (when appropriate)
model.external:2.1
4.2.1 Articulate value
The organization must be able to articulate the value of the system product being sold
model.artifacts:2.1
4.2.2 Market
There must be a market for the system being sold
model.artifacts:2.1
model.external:2.2
model.plan:3.2.1
4.2.3 Sales and marketing team
The organization must have a sales or marketing capability, with the resources to do its job
model.external:2.3
4.3 Profit
The organization must get enough profit from the project to fund overhead and to support future projects
model.external:3.2
model.plan:1.2.5, 1.2.6, 3.2.6
4.4 Positioning for future work
The organization must be positioned for future projects and/or maintenance of this system
4.4.1 Reputation
The organization must have a reputation for being able to build systems well
model:2.1, 2.2, 2.5
4.4.2 Reusable capability
The organization must have capabilities in processes, teams, and tools that will apply to future projects
model.external:4.1
4.4.3 Ongoing improvement
The organization must be able to learn and improve its capabilities over time
model.external:4.2

A.2.3.1 Ability to deliver

The purpose for the organization pursuing a system-building project is to deliver a system to the customer. If the project does not deliver something, the organization will see little return on its investment and effort.

Of course, an organization might get a contract from a customer and get started, only for the customer to cancel the contract. (Hopefully the organization has taken this into account in its planning.) The organization still needs to have been able to deliver the system, even if the work was stopped.

Beyond the general ability to build a system for the customer, the ability to deliver has two aspects: communication with the customer and support for the team.

When the system being built has a specific customer, the organization needs to be able to talk to them, keep them updated on progress, and hear concerns or issues from the customer. When there is disagreement, the organization needs people who can negotiate and resolve issues.

The project can help this by maintaining the interface with the customer, including having people assigned to work with the customer, documenting what they learn from the customer, and negotiating with the customer as issues arise.

The project team can do little without the organization supporting them. The team needs leadership; it needs workspace and other infrastructure; it needs human resources and payroll and accounting support. The organization needs to:

A.2.3.2 Ability to sell

If the system is expected to be delivered to multiple customers over time, the organization needs to be able to find those customers, make the case to them that the system will benefit them, and work out a deal to deliver the system.

I have written this need in terms of selling, but the same needs apply when something is delivered without a monetary return. An open-source project that is freely available to users does not sell the system for money, but the project only has value if users pick up, deploy, and use the system. The project may want to attract developers to build up an ecosystem of related products or services. Meeting these needs involves making potential users aware of the system and making the case that they will benefit from it.

To be able to sell the system, the organization needs to:

A.2.3.3 Profit

The organization will be expecting to get some kind of return on its effort. That may be a monetary return, but a non-profit or government agency may look for a non-monetary return, such as a community benefit.

The project can support this in two ways. First, the organization can set business objectives for the project, such as expected profit. The project can keep records of these objectives, and take them into account in the system’s design. Second, the project can organize its work as efficiently as possible so that investment goes as far as possible (consistent with deadlines). The project’s management can monitor the time and money being spent and work out how to adjust the project if it looks likely that the project will not meet the organization’s expectations on return or profit.

A.2.3.4 Positioning for future work

Many organizations build multiple systems over their existence—whether this is building multiple bespoke systems for customers, or building multiple products that are delivered to many customers. The ability to continue to deliver profitable systems is a major part of a company’s stock performance: the stock price is determined by the market expectation of future returns to the investors.

An organization’s reputation affects its ability to attract customers and investment, as well as its ability to hire talented staff. The reputation depends in part on its ability to deliver good systems.

An organization can become more productive over time—and thus improve its reputation, its ability to deliver, and its profitability. This comes from learning and improving. If the organization builds up a staff that knows how to run a system-building project well, future projects can be executed more efficiently. Better tools will help the next projects. However, learning and improvement do not often happen by chance; they happen when an organization sets out to learn from its performance.

The project can:

The organization should:

A.2.4 Funders

The funders provide the investment that funds the team building the system (see Section 16.2.4.) The funder provides these resources in the expectation of some kind of return, be that monetary or not. A venture capital funder is most likely to look for a monetary return from future profits of the organization it is funding. A company providing internal funding is more likely looking for the project to add to the company’s capabilities, which will in turn enable the company to increase its future profits. A government agency is likely looking for something that will benefit the public in some way.

As noted earlier, there are many different kinds of funders, from venture capital to company internal funding to customers paying for development.

model:5 Funders
5.1 Return on investment
The funder must get at least the expected return on its investment
model:4.2, 4.3
5.1.1 Visibility
The funder must have sufficient visibility into the organization’s behavior and progress to determine when the project is at risk of not providing a return on investment
model.external:4.3
model.plan:1.2, 4.1, 4.2
5.1.2 Influence
The funder must have influence on the organization or project in order to address performance that will jeopardize return on investment
model.external:4.3.1
5.2 Ability to attract future investment
The project must help the funder attract investment for future projects
model:5.1

A.2.4.1 Return on investment

Funders provide capital to run the project on the expectation that they will get some return on that investment.

In some cases, the return will come from profit realized in building the system (Section A.2.3.3) or from an increase in the value of the organization (Section A.2.3.4). In other cases the return will come from the value of the system after it is delivered and deployed (Section A.2.1.1, Section A.2.3.1).

The funders will also expect to be able to track the organization’s and project’s progress, and to raise issues when they find that there may be a problem that could jeopardize the funder’s return. The organization needs to have people whose responsibility includes interfacing with the funders.

The project can support the interface with funders by maintaining a realistic plan for the work, managing its budget, and keeping the organization informed of progress. The project should also have the processes in place to respond when the funders raise an issue that leads to a potential change to the system.

The project may also need to maintain accurate records and artifacts that allow the funder to audit the project—verifying that the information the funder has received is accurate and complete.

A.2.4.2 Ability to attract future investment

The funders get the capital they invest from somewhere. In many cases, the investment capital comes from their customers: individual and institutional investors for venture capital, legislatures and the public for government investors. The funders will keep their investor customers satisfied if they can show that their investments produce the expected returns, leading to a reputation for using capital wisely. At the same time, funders want to avoid bad press from projects that have problems, which can reflect on the funders’ ability to select organizations or projects.

The ways that the project can address this funder need are all included in the previous section, on the funder’s need for return on investment.

A.2.5 Regulators

Regulators (Section 16.2.5) provide an independent check on work to ensure that it meets regulations or standards, thus ensuring that some public good is maintained that the organization or project might not otherwise be incentivized to meet.

The interaction between the project and regulators depends on the countries involved and the nature of the project. Some industries require licensing or certification of some kinds of system: most aircraft, for example, must obtain type certification from the local civil aviation authority before that aircraft is allowed to fly or be sold. Spacecraft require a set of licenses for launch and communication. Other industries, such as consumer electronics or automobiles in the United States, depend on compliance with regulation but compliance is only checked after the fact.

I include voluntary standards as part of regulation. Non-governmental organizations set interoperability standards; the standards for USB (set by the USB Implementers Forum) and WiFi (set by the Institute of Electrical and Electronics Engineers 802.11 working group) are examples. Other organizations set safety standards that help to ensure consumer products are checked to be safe.

The regulators perform multiple tasks:

model:6 Regulators
6.1 Compliance and certification
The regulator must be able to work with the project to ensure regulatory compliance and (when appropriate) certify the system
6.1.1 Available regulation
The regulator must make information about regulations available to the organization and possibly user
model.artifacts:8.1
model.plan:3.2.5
model.team:2.3.1.1, 2.3.1.2
6.1.2 Application
The project must apply to the regulator for certification and then follow the certification process
model.artifacts:8.2
model.plan:3.3.7
model.team:2.3.1.5
6.1.3 System auditability
The regulator must be able to audit that the system complies with regulations
model.artifacts:4.2, 4.4, 4.5, 8.4
model.plan:3.3.3.1
model.team:2.3.1.3
6.1.4 Process auditability
The regulator must be able to audit that the organization has followed required processes in building the system
model.artifacts:8.3
model.team:2.3.1.3
6.2 Monitoring
The regulator must be able to monitor the organization, project, and/or system for compliance with regulation
model.team:2.3.1.3
6.2.1 Accurate information available
The organization and/or user must make available to the regulator accurate and complete information about the system and organization behavior
model.artifacts:4.2, 4.4, 4.5, 8.3, 8.4
6.2.2 Notify regulator
The organization or user must proactively provide information to the regulator when a potential regulatory problem is detected, as required by regulation
model.plan:3.3.7.1
model.team:2.3.1.4
6.3 Problem resolution
The regulator must be able to work with the project and/or user to identify and resolve potential regulatory problems
6.3.1 Communicate with organization or user
The regulator must be able to communicate with the organization or user about potential regulatory problems
model.plan:7.1
model.team:2.3.1.3
6.3.2 Accurate information
The regulator must obtain cooperation and accurate information from the organization or user to investigate a potential regulatory problem
model.artifacts:4.2, 4.4, 4.5, 8.3, 8.4
6.3.3 Respond to remedy
The organization or user must be able to respond to a regulator’s remedy
model.plan:7.1
model.team:2.3.1.5

A.2.5.1 Compliance and certification

The regulator makes regulations (or standards), and makes them available to teams building affected systems.

The project responds to the regulations by designing and building the system so that it meets the regulations, maintaining records needed to show that the regulations have been met, and beginning a process for getting certifications or licenses when appropriate.

The project is responsible for:

A.2.5.2 Monitoring

In some cases, the regulator must monitor the project’s work—for example, during aircraft certification, which is generally a joint effort between the aviation authority and the company building the aircraft. A regulator might also need to monitor the project after a violation has been found and the team is working on remedial action.

Accurate and timely information is paramount when this occurs. The project must maintain good records to be able to provide that information to the regulators. The information potentially covers everything about the project: the design and analyses of the system, its implementation, records of design rationales, and logs of the processes followed.

The team must also be prepared to notify the regulator proactively as situations arise. The team should have people who will watch for situations and communicate with the regulator.

A.2.5.3 Problem resolution

I have never observed a licensing or certification process go entirely without problems. The processes and regulations are often complex, and unless a team has done the process several times before, there will almost certainly be things the team needs to learn to get through it.

This means that there will be problems to resolve. Sometimes the team will discover the problem and need guidance from the regulator. Other times the regulator will raise the issue.

The team can smooth the problem resolution process by:

A.3 Model elements

All of the objectives in the previous section map to objectives related to artifacts, team, tools, and plan. Some of them also map to things other than the system-building that goes on in the project.

This section lays out tables of the objectives for each element of the model. Each objective is annotated with its parents; that is, the objectives that are the reason that this objective is included. These are annotated in the tables with an arrow pointing down and right: ↘. If one of the objectives has children, those are annotated with a right-pointing arrow: →.

A.3.1 Artifacts

model.artifacts:1 Artifact management
1.1 Store artifacts
The project must have a place to store artifacts
model:2.1.2, 2.4, 2.5.4
model.artifacts:2.1, 2.2, 3.1, 4.2, 4.4, 4.5, 5.1, 5.2, 6.1, 6.2, 7.1, 7.2, 7.3, 7.4, 8.1, 8.2, 8.3, 8.4, 9.1
model.tools:2.1, 3.2.1, 3.4
1.1.1 Consistent versioning
The artifact storage must be able to maintain versions of all artifacts that are consistent with each other
1.2 Finding artifacts
Team members must be able to find artifacts that they are looking for
model.artifacts:2.1.1, 3.1.3, 4.2, 4.4, 5.2, 6.1, 8.1
1.2.1 Discovery
Team members must be able to learn about artifacts they need to know of that they didn’t previously know existed
1.3 Status
Anyone looking at an artifact must be able to determine the status of that artifact (work in progress, proposed, approved, complete, and its version)
model:3.1.3
model.artifacts:2.1, 2.2, 3.1, 4.2, 4.4, 4.5, 5.1, 5.2, 6.1, 6.2, 7.1, 7.2, 7.3, 7.4, 8.1, 8.2, 8.3, 8.4, 9.1
1.3.1 Support workflow
model.artifacts:2 Purpose-related
2.1 System purpose
The artifacts must include documentation of the customer’s purpose for the system
model:2.1.1, 2.1.2, 2.5.4, 4.2.1, 4.2.2
model.plan:3.3.1
model.team:2.1.1.1
model.artifacts:1.1, 1.3
2.1.1 Discoverable purpose
The artifacts that document the system’s purpose must be readily and accurately discoverable by members of the project team
model.artifacts:1.2
2.1.2 System requirements
The artifacts must include documentation of the customer’s reliability, safety, and security requirements for the system
model:2.1.3, 2.1.4
2.1.3 System usage
The documentation of the customer’s purpose must include accurate information about what the system will be used for
model:3.6.2
2.2 Changes in purpose
The artifacts must include records of requests made for changes to the system’s purpose, including the status of that request and any artifacts resulting from an approved change
model:2.5.3, 2.5.4
model.plan:5.1, 5.2, 5.3
model.team:2.1.1.2, 2.3.1.2
model.artifacts:1.1, 1.3
2.3 Reasons for building system
The artifacts must include documentation of why the team has chosen to build the system
model.plan:1.1
model.artifacts:3 Team-related
3.1 Team structure
The artifacts must include documentation of the structure of the team
model:3.4.4
model.plan:3.2.3, 6.3, 6.4
model.team:1.2
model.artifacts:1.1, 1.3
3.1.1 Team membership
The documentation of team structure must include accurate records of who is on the team
3.1.2 Roles and authority
The documentation of team structure must include accurate records of the roles and authority that each team member has
3.1.3 Navigability
Members of the team must be able to conveniently and accurately navigate the records of team structure
model.artifacts:1.2
model.artifacts:4 System-related
4.1 Technical uncertainty
The artifacts must include records about the uncertainties or risks identified for the system being built
model.plan:8.2
4.2 Specification and design artifacts
The artifacts must include accurate records of the specification and design of the system components and structure
model:2.1.2, 6.1.3, 6.2.1, 6.3.2
model.artifacts:1.1, 1.2, 1.3
4.2.1 Rationales
The design-related artifacts must include rationales for the design choices made
model:2.5.4
4.4 Implementation artifacts
The artifacts must include the implementation of the system
model:2.1.2, 6.1.3, 6.2.1, 6.3.2
model.artifacts:1.1, 1.2, 1.3
4.5 Analysis artifacts
The artifacts must include accurate analyses of the system component and structure design or implementation
model:2.1.2, 2.1.4, 2.1.5, 6.1.3, 6.2.1, 6.3.2
model.artifacts:1.1, 1.3
model.artifacts:5 Verification-related
5.1 Verification tests
The artifacts must include implementations of tests used for verifying the system implementation
model:2.1.2, 2.1.4
model.artifacts:5.2
model.artifacts:1.1, 1.3
5.2 Verification results
The artifacts must include accurate results of performing verification tests, reviews, and analyses
model:2.1.2, 2.1.4
model.artifacts:1.1, 1.2, 1.3, 5.1
model.artifacts:6 Release/deployment-related
6.1 Release/deployment procedures
The artifacts must include the procedures to be used to release or deploy the system
model:2.4
model.artifacts:1.1, 1.2, 1.3
6.2 Release/deployment records
The artifacts must include records of each release and deployment made of the system
model.plan:3.4
model.artifacts:1.1, 1.3
model.artifacts:7 Management-related
7.1 Budget records
The artifacts must include records tracking resource budgets
model.plan:4.1
model.artifacts:1.1, 1.3
7.2 Roadmap and plan
The artifacts must include the plan for completing the system
model.plan:1.2
model.artifacts:1.1, 1.3
7.3 Tasking
The artifacts must include records about the tasks currently being performed, or that are scheduled to be performed in the near future
model.plan:2.1, 2.2
model.artifacts:1.1, 1.3
7.4 Project uncertainty
The artifacts must include records about the uncertainties or risks identified for project execution
model.plan:8.1
model.artifacts:1.1, 1.3
model.artifacts:8 Regulatory-related
8.1 Regulations
The artifacts must include all the relevant regulations that the system must meet (or references to them)
model:2.3.1, 6.1.1
model.plan:5.2
model.team:2.3.1.2
model.artifacts:1.1, 1.2, 1.3
8.2 Certification process
The artifacts must include information on the certification process
model:2.3.2, 6.1.2
model.artifacts:1.1, 1.3
8.2.1 Application
The artifacts must include records of applications made for certification
8.2.2 Certifications
The artifacts must include records of certifications that have been granted or denied for the system
model.plan:3.3.7
8.3 Regulatory process records
The artifacts must include records of the process being followed to meet regulation or obtain certification
model:2.3.2, 6.1.4, 6.2.1, 6.3.2
model.artifacts:1.1, 1.3
8.4 Regulatory verification records
The artifacts must include records showing that the system has been verified against regulatory requirements
model:6.1.3, 6.2.1, 6.3.2
model.plan:3.3.3.1
model.artifacts:1.1, 1.3
model.artifacts:9 Audit
9.1 Approvals
The artifacts must include records of reviews and approvals of designs and implementations
model.artifacts:1.1, 1.3, 1.3.1
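
As one way to picture the storage, status, and versioning objectives near the top of this table (1.1, 1.1.1, and 1.3), the sketch below models an artifact record using the statuses named in objective 1.3. It is an illustration under assumed names, not a required implementation.

```python
from dataclasses import dataclass
from enum import Enum

class Status(Enum):
    """The artifact statuses named in objective 1.3."""
    WORK_IN_PROGRESS = "work in progress"
    PROPOSED = "proposed"
    APPROVED = "approved"
    COMPLETE = "complete"

@dataclass(frozen=True)
class ArtifactRecord:
    """A stored artifact with enough metadata to support objectives 1.1-1.3:
    it can be found by name, it carries a version, and its status is visible."""
    name: str
    version: str
    status: Status

# Consistent versioning (objective 1.1.1): a baseline is a set of artifact
# versions that are meant to go together.
baseline = {
    ArtifactRecord("system purpose", "0.3", Status.APPROVED),
    ArtifactRecord("design description", "0.3", Status.PROPOSED),
}
print(all(artifact.version == "0.3" for artifact in baseline))
```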

A.3.2 Team

model.team:1 Organization
1.1 General structure
Each team member must be able to find and understand the structure of the team
model:3.4.1
1.1.1 Finding team members
Each team member must be able to find essential information about other team members
model:3.4.4
1.1.2 Reporting
Each team member must be able to accurately determine the reporting structure of the team
model:3.4.4
1.2 Roles and responsibilities
Each team member must be able to accurately find and understand their roles and responsibilities
model:3.4.2
model.artifacts:3.1
1.2.1 Clear responsibilities
Each team member must be able to accurately find and determine the responsibilities on which their performance will be evaluated
model:3.4.3
model.team:2 Capabilities
2.1 Customer-related
2.1.1 Customer interface
The team must have people whose responsibility is to work with the customer
model:4.1.1
2.1.1.1 Learn and communicate the customer’s purpose
The team must have people whose responsibility is to work with the customer to identify the customer’s purpose and requirements for the system and to communicate that to the rest of the team
model:2.1.1, 2.1.3
model.plan:3.2.1
model.artifacts:2.1
2.1.1.2 Learn and process change requests
The team must have people whose responsibility is to receive requests for changes, document those requests, and drive the process to decide on and resolve the requests
model:2.5.1, 2.5.4
model.artifacts:2.2
2.1.1.3 Ability to negotiate
The people on the team who interface with the customer must be able to raise issues to the customer and negotiate resolutions of issues or conflicts
model:4.1.1.1
2.1.2 Internal communication
The team must have people whose responsibility includes regularly informing the rest of the team about the status of working with the customer
model:3.6.1
2.2 Technical capabilities
2.2.1 Ability to build system
The team must have the skills needed to build a system that meets the customer’s purpose
model:2.1.2, 2.1.4, 2.5.4
2.2.2 Ability to verify system
The team must have the skills needed to verify that the designed or implemented system meets the customer’s purpose, regulatory requirements, and other constraints
model:2.1.2, 2.1.4, 2.1.5, 2.5.4
2.2.3 Track technical uncertainty
The team must have people whose responsibility includes identifying risk or uncertainty related to the system being built, documenting those uncertainties, and ensuring that the uncertainties are resolved
model.plan:8.2
2.2.4 Ability to release/deploy system
The team must have the skills needed to release or deploy the system
model.plan:3.4
2.3 Regulator interface
2.3.1 Regulator interface
The team must have people whose responsibility is to work with regulator(s)
2.3.1.1 Identify regulation
The team must have people whose responsibility includes identifying relevant regulation or certification requirements and documenting them
model:6.1.1
model.plan:3.2.5
2.3.1.2 Detect and handle regulatory changes
The team must have people whose responsibility includes detecting that relevant regulations have changed, documenting those changes, and driving changes to plans to address the changes
model:2.5.2, 6.1.1
model.artifacts:2.2, 8.1
2.3.1.3 Handle regulator requests
The team must have people whose responsibility includes receiving and responding to requests for information from the regulator
model:6.1.3, 6.1.4, 6.2, 6.3.1
2.3.1.4 Handle regulator notification
The team must have people whose responsibility includes notifying the regulator at identified events that affect certification or regulatory compliance
model:6.2.2
2.3.1.5 Perform certification/approval process
The team must have people whose responsibility includes working with the regulator(s) to obtain certification or approval
model:6.1.2, 6.3.3
model.plan:3.3.7
2.3.2 Process oversight
The team must have people whose responsibility is to ensure that the project is following processes that will result in a system that meets regulations and/or obtain certification
model.plan:3.3.1.1, 3.3.2.1, 3.3.3.1
2.4 Management capabilities
2.4.1 Track plan
The team must have people whose responsibility includes building and maintaining the project plan
model.plan:1.2, 5.4.1
2.4.2 Detect and respond to problems
The team must have people whose responsibility includes detecting when the project may not meet deadlines and oversee the response
model.plan:1.2.4
2.4.3 Prioritize and assign tasks
The team must have people whose responsibility includes determining which tasks should be performed next according to some prioritization, and assigning those tasks to team members
model.plan:2.1, 2.2
2.4.4 Maintain team information
The team must have people whose responsibility includes maintaining information about the team
model.plan:6.2, 6.3, 6.4
2.4.5 Track staffing levels
The team must have people whose responsibility is to detect when staffing levels need to change, and then ensure that needed changes are made
model.plan:6.1
2.4.6 Track project uncertainty
The team must have people whose responsibility includes identifying risk or uncertainty related to project execution, documenting those uncertainties, and ensuring that the uncertainties are resolved
model.plan:8.1
2.5 Support capabilities
2.5.1 Maintain tools
The team must have people whose responsibility includes maintaining the tools used for building the system
model:2.1.2, 2.4, 2.5.4
model.team:4 Exceptions
4.1 Raise technical issues
Each team member must know how to raise technical issues when they find them
model:3.4.5
model.team:5 Other
5.1 Tracking
The team must be able to track time spent building the system
model.plan:4.1.1

A.3.3 Tools

model.tools:1 General
1.1 Automate simple tasks
Where possible, the project should use tools to automate simple or repetitive tasks
model:3.1.3
model.tools:2 Artifact management
2.1 Digital artifact storage
The tools must provide for storing and managing digital artifacts
model.artifacts:1.1
model.tools:3 Hardware support
3.1 Design tools
The project must have the design tools needed to design hardware components
model:2.1.2
3.2 Manufacturing
The project must have the tools needed to manufacture hardware parts
model:2.1.2, 3.3
model.tools:3.4
3.2.1 Stock storage
The tools must provide for maintaining stock materials used to build hardware components
model.artifacts:1.1
3.3 Hardware test
The project must have tools to perform verification tests on hardware components
model:3.3
model.plan:3.3.3, 3.3.5
3.4 Inventory management
The project must have the facilities and tools to maintain hardware parts inventory and track its content
model.artifacts:1.1
model.tools:3.2, 3.5
3.5 Hardware deployment
The project must have tools for delivering and distributing hardware components
model:2.4
model.tools:3.4
model.tools:4 Software support
4.1 Software build
The project must have tools to build software components
model:2.1.2, 3.3
4.2 Software test
The project must have tools to perform verification tests on software components
model:3.3
model.plan:3.3.3, 3.3.5
4.3 Software release
The project must have tools for making, maintaining, and distributing software releases
model:2.4
model.tools:5 Facilities
5.1 Team facilities
The project must have facilities in which the team can work to develop the system
model:3.3

A.3.4 Plan

model.plan:1 General roadmap
1.1 Overall objective
The roadmap must include a clear statement of the objective(s) of the system-building effort
model:3.1.1, 3.6.1
model.artifacts:2.3
model.plan:3.2.4
1.2 Plan to completion
The project must maintain a plan that shows an estimation of the time and effort required to complete the system
model:2.1.2, 2.2.3, 2.5.4, 3.1.1, 5.1.1
model.artifacts:7.2
model.team:2.4.1
1.2.1 Reflect uncertainty
The plan must accurately reflect the degree of uncertainty (or risk) in what is known about the steps needed to complete the system
model:2.2.3.1
model.plan:8.1, 8.2
1.2.2 Update plan
The plan must be updated as work is completed or uncertainty changes
model.plan:3.3.6, 5.4.1
1.2.3 Include resource estimate
The plan must include estimates of the time and resources required to complete steps in the plan
model:3.2.1, 3.2.2
1.2.4 Detect and handle deadline problems
The project must include processes and milestones that will detect when project deadlines will not be met, determine how to respond, and ensure the response is executed
model:2.2.4
model.team:2.4.2
1.2.5 Only project-relevant work in plan
The project must include only work that is relevant to building the system, or managing and supporting that development, in the plan
model:2.2, 3.1.3, 4.3
1.2.6 Efficient execution
The project must organize and plan the work in an efficient way, minimizing time to completion and/or cost without sacrificing quality, customer purpose, or requirements
model:4.3
model.plan:2.2, 8.1, 8.2
1.2.7 Navigability
The project plan information must be accessible to team members in a way that allows them to understand how the work will proceed and how it will affect their assignments
model:3.4.2
model.plan:2 Sequencing and prioritization
2.1 Track current effort
The project must include processes to track what tasks are currently being worked on or have been assigned to people to be worked on in the near future
model:2.1.2, 2.5.4
model.artifacts:7.3
model.team:2.4.3
2.2 Assign next tasks
The project must include processes to determine what tasks should be worked on in the near future, and by whom
model:2.1.2, 2.5.4
model.plan:1.2.6
model.artifacts:7.3
model.team:2.4.3
2.3 Detect and handle tasking problems
The project must include processes and milestones to detect when there are problems performing one or more tasks, determine how to respond, and ensure the response is executed
model:2.1.2, 2.5.4
2.4 Navigability
The scheduling information must be accessible to team members in a way that allows them to determine what work they should be doing, and who is doing work related to theirs
model:3.4.2, 3.4.5
model.plan:6.1
model.plan:3 Process and life cycle
3.1 Defined life cycle
The project must have a defined life cycle that defines the processes people must follow to build and deploy the system
model.plan:3.2, 3.3
3.2 Project startup
The project life cycle must include the processes involved in starting the project
model.plan:3.1
3.2.1 Learn and verify customer purpose
The project life cycle must include an initial step to learn the customer’s purpose for the system and ensure that the team properly understands that purpose
model:2.1.1, 2.1.3, 4.2.2
model.team:2.1.1.1
3.2.2 Learn and verify available resources
The project life cycle must include an initial step to determine the resources initially available for building the system
model:2.2.1
model.plan:4.1
3.2.3 Establish organization structure
The project life cycle must include an initial step to decide on and document the structure of the team that will be working on the project
model:3.4.2, 3.4.4
model.artifacts:3.1
3.2.4 Establish reasons for building the system
The project life cycle must include an initial step to determine whether the team should build the system, and if so, why
model.plan:1.1
3.2.5 Establish regulatory constraints
The project life cycle must include an initial step to determine what regulations apply to the system, including the potential need for certification
model:2.3.1, 6.1.1
model.team:2.3.1.1
3.2.6 Establish organization expectations
The project life cycle must include an initial step to determine what expectations the organization has of the project, including definition of constraints on the project and system
model:4.3
model.external:3.2
3.3 System building
The project life cycle must include the processes involved in building the system
model:2.1.2, 2.1.4
model.plan:5.4
model.plan:3.1
3.3.1 Design to purpose
The project life cycle must provide processes and milestones that ensure that the system design accurately reflects the customer’s purpose for the system
model.artifacts:2.1
3.3.1.1 Design to meet regulation
The project life cycle must provide processes and milestones that ensure the system design meets regulatory requirements
model:2.3.2
model.team:2.3.2
3.3.1.2 Design for release/deployment
The project life cycle must provide processes and milestones that ensure that the resulting system can be released or deployed
model.plan:3.4
3.3.2 Build to purpose
The project life cycle must provide processes and milestones that ensure that the built system accurately reflects the customer’s purpose for the system
model:2.1.2
3.3.2.1 Build to meet regulation
The project life cycle must provide processes and milestones that ensure the built system meets regulatory requirements
model:2.3.2
model.team:2.3.2
3.3.2.2 Build for release/deployment
The project life cycle must provide processes and milestones that ensure that the built system can be released or deployed
model.plan:3.4
3.3.3 Verify against purpose
The project life cycle must provide processes and milestones that verify that the system meets the customer’s purpose
model.tools:3.3, 4.2
3.3.3.1 Verify against regulation
The project life cycle must provide processes and milestones that verify that the system meets regulation
model:2.3.2, 6.1.3
model.artifacts:8.4
model.team:2.3.2
3.3.4 Verify no extraneous behavior
The project life cycle must provide processes and milestones that verify that the system does not include functions or behavior that is outside the customer’s purpose
3.3.5 Identifying and fixing errors
The project life cycle must provide processes and milestones that ensure errors will be detected with high likelihood, and that detected errors will be fixed
model:2.1.4, 2.1.5, 3.4.5
model.tools:3.3, 4.2
3.3.6 Adaptation
The project life cycle must provide processes and milestones to adapt the plans and design as the team learns about the system or as uncertainties are resolved
model:2.5.4
model.plan:1.2.2
3.3.7 Perform certification/approval
The project life cycle must provide processes to result in certification or approval, if required for the system
model:2.3.2, 6.1.2
model.artifacts:8.2.2
model.team:2.3.1.5
3.3.7.1 Regulatory notification
The project life cycle must define events at which the project must notify regulators, and processes by which the notification information is gathered and delivered
model:6.2.2
3.4 System release and deployment
The project life cycle must include the processes involved in releasing and deploying the system
model:2.4
model.artifacts:6.2
model.plan:3.3.1.2, 3.3.2.2
model.team:2.2.4
model.plan:4 Budgets
4.1 Resources
The budgets must include the amount of various resources allocated to the project
model:2.2, 2.2.1, 5.1.1
model.artifacts:7.1
model.plan:3.2.2
4.1.1 Amount used
The budget must accurately record the amount of resources already used in the project
model:2.2.2
model.team:5.1
4.1.2 Amount remaining
The budget must provide accurate measures of how much resource remains available
4.2 Deadlines
The budgets must include milestones or other deadlines that the project must meet
model:2.2, 5.1.1
model.plan:5 Change handling
5.1 Receive change request
The project must have a process for receiving and documenting a request for change in purpose or features
model:2.5.1
model.external:4.3.1
model.plan:7.1
model.artifacts:2.2
5.2 Receive regulatory change
The project must have a process for detecting and receiving changes in regulatory requirements
model:2.5.2
model.artifacts:2.2, 8.1
5.3 Decision process
The project must have a process for determining whether to proceed with a change request or not
model:2.5.1
model.external:4.3.1
model.artifacts:2.2
5.4 System change
The project life cycle must provide processes and milestones for building changes to the system
model.external:4.3.1
model.plan:7.1
model.plan:3.3
5.4.1 Plan change
The project life cycle must provide processes for updating plans when change requests are accepted for building
model.plan:1.2.2
model.team:2.4.1
model.plan:6 Team-related
6.1 Determine need for staffing changes
The project life cycle must provide processes and milestones for detecting when a change in staffing is appropriate, and for following through on the needed changes
model:3.2.1, 3.2.2
model.external:1.3
model.plan:2.4
model.team:2.4.5
6.2 Maintain team information
The project life cycle must provide processes and milestones for maintaining information about the structure, roles, responsibilities, and authority in the team
model:3.4.2, 3.4.4
model.team:2.4.4
6.3 Adding team members
The project life cycle must provide a process for adding new team members to the processes and tools, and educating them about the project
model:3.2.1, 3.4.2, 3.4.4
model.external:1.3.1
model.artifacts:3.1
model.team:2.4.4
6.4 Removing team members
The project life cycle must provide a process and definitions of triggering events for removing a member from the team
model:3.2.2
model.external:1.3.2, 1.4
model.artifacts:3.1
model.team:2.4.4
model.plan:7 Regulatory-related
7.1 Receive and process regulatory issues
The project life cycle must provide a process by which the project can receive notification of issues from the regulator, identify remedies, implement the remedies, and respond to the regulator
model:6.3.1, 6.3.3
model.plan:5.1, 5.4
model.plan:8 Risk and uncertainty
8.1 Project uncertainty
The project life cycle must track and manage uncertainties or risks related to project execution
model.plan:1.2.1, 1.2.6
model.artifacts:7.4
model.team:2.4.6
8.2 Technical uncertainty
The project life cycle must track and manage uncertainties or risks related to the system being built
model.plan:1.2.1, 1.2.6
model.artifacts:4.1
model.team:2.2.3
8.2.1 Efficient burn-down
The project life cycle must provide a process and milestones that lead to efficiently reducing technical uncertainty as the project progresses

A.3.5 External responsibilities

model.external:1 Team-related
1.1 Satisfaction
Project and organization management must take steps to give team members the information they need to be confident that the project is on track and that their work will challenge them
model:3.1.1, 3.1.2
1.2 Appropriate assignments
Project management must take steps to match team members with tasks that are needed and that challenge the team member
model:3.1.2, 3.1.3
1.3 Appropriate staffing level
Project management must manage the makeup of the team to ensure that the project has the right people to do the work, including having processes to hire, contract, or let go of staff
model:3.2.1, 3.2.2, 4.1.2.2, 4.1.2.3
model.plan:6.1
1.3.1 Hiring
Project management and the hosting organization must be able to hire or move in staff when needed for the project
model.plan:6.3
1.3.2 Firing or transfer out
Project management and the hosting organization must be able to move out or let go of staff who are no longer needed for the project
model.plan:6.4
1.4 Respond to team issues
Project management must respond appropriately to issues raised by team members, both technical issues and non-technical issues
model:3.4.5, 4.1.2.1
model.plan:6.4
1.5 Fit to organization
The organization must provide the team with an understanding of how the project and the team members fit into the organization
model:3.4.1, 4.1.2.1
1.6 Compensation
Project and organization management are responsible for setting compensation levels for team members at a level that is acceptable to both the staff and the project/organization
model:3.5
1.7 Evaluation
Project and organization management are responsible for setting and documenting the evaluation process that will be used to judge each team member’s performance
model:3.4.3, 4.1.2.1
1.8 Ethics policy
Project and organization management are responsible for setting and documenting an ethics policy that applies to all team and organization activities, along with mechanisms for reporting and resolving potential ethics violations
model:3.6.2, 4.1.2.1
1.9 Workplace regulation
Organization management must provide a workplace that meets regulation
model:4.1.2.4
model.external:2 Customer-related
2.1 Ability to sell
The organization must have the ability to sell the system produced (when appropriate)
model:4.2
2.2 Ability to define market(s)
The organization must have the ability to define plausible market(s) for the system
model:4.2.2
2.3 Sales and marketing capability
The organization must have a capability to perform sales and marketing of the system
model:4.2.3
model.external:3 Resource-related
3.1 Sufficient resources
The organization and project management are responsible for obtaining funding and other resources sufficient to perform the project
model:4.1.2.2
3.2 Define expected return
The organization must define the expected return on investment for the project to build the system
model:4.3
model.plan:3.2.6
model.external:4 Organization-related
4.1 Reusable capability
The organization must take steps to build up reusable capabilities that can apply to multiple projects
model:4.4.2
4.2 Ongoing improvement
The organization must have the capability to learn and improve its capabilities over time
model:4.4.3
4.3 Communication with funder
The organization must communicate with the funder in a way that gives the funder visibility into the organization’s behavior and progress on the project
model:5.1.1
4.3.1 Receive and process funder concerns
The organization must have the capability to receive concerns from the funder, discuss the concerns, and take steps to address the concerns
model:5.1.2
model.plan:5.1, 5.3, 5.4

Appendix B: The Heilmeier questions

22 July 2024

The Heilmeier questions, also known as the Heilmeier Catechism, were developed by George Heilmeier, an early director of the US Defense Advanced Research Projects Agency. These questions, adapted from the DARPA web site [Heilmeier24], are:

1. What are you trying to do? Articulate your objectives using absolutely no jargon.
2. How is it done today, and what are the limits of current practice?
3. What is new in your approach, and why do you think it will be successful?
4. Who cares? If you are successful, what difference will it make?
5. What are the risks?
6. How much will it cost?
7. How long will it take?
8. What are the mid-term and final “exams” to check for success?

Some variants add additional questions:

Further reading and inspiration

The content in this book has been inspired by many others. The following are a few of the key sources I have used; they expand on the ideas discussed here.

Works of Christopher Alexander. Christopher Alexander was an architect who advocated for human-centered design of buildings and cities. His approaches to designing and building buildings and towns influenced more general systems development practices, including the ideas of pattern languages and agile development.

There are three Alexander books that have influenced my thinking. In these works, he addresses questions of how structure and pattern apply to complex systems—in particular, how the relationships between things are a necessary part of understanding how a system works. (The system organization in Chapter 12 is informed by his ideas.)

Engineering a Safer World. Nancy Leveson and colleagues have developed a set of methodologies for designing and understanding safe and secure systems. The book Engineering a safer world [Leveson11] makes the case that safety and security must be treated using a systems approach, and then presents a causality model (STAMP), a safety and security design analysis technique (STPA), and an incident analysis technique (CAST).

This work affected how I think about designing safe and secure systems. Before encountering it, I had cobbled together a set of techniques to support designing secure systems, based on collaboration with a number of people on a series of DARPA projects. I used basic fault tolerance techniques. I worked with other safety standards, notably ISO 26262 [ISO26262], and found that the methods in those standards were missing many of the hazards I was finding. Leveson’s book provided me with a way to articulate the overall systems aspect of safety and security work, as well as better tools for the job.

Scaling People. Claire Hughes Johnson has translated her experience building teams and companies into a book that does for team structure and operations what I am trying to achieve for systems building. Her book, Scaling People [Johnson22], sets out basic models for how to build an organization and grow it. The book begins with a few key behavioral principles that apply to engineering work just as well as to business operations. The book proceeds to develop a model for operations, organized as four “core frameworks”.

Two specific ideas from the first core framework resonate with the engineering work I set out in this book. First, she stresses the importance of writing down founding documents: a record of what an enterprise is for, its goals, philosophy, mission, and principles. Second, she makes the case for defining an “operating system” for the organization. It documents “a set of norms and actions that are shared with everyone in the company”; it defines the basic structures and processes that the company follows. The book is notable for not just presenting these ideas, but making the case for why each idea is important based on her experience and on the experience of others in the industry. It also provides examples of the documents that companies have actually used—so that it’s clear how to put the ideas into practice.

This book has helped me articulate ideas about team organization, communication, and the value of documenting processes.

Drucker. XXX

Acknowledgments

Bibliography

[14CFR23] “Part 23—Airworthiness standards: normal category airplanes”, in Title 14, Code of Federal Regulations, United States Government, December 2023, https://www.ecfr.gov/current/title-14/chapter-I/subchapter-C/part-23, accessed 14 February 2024.
[ARP4754] “Guidelines for Development of Civil Aircraft and Systems”, SAE International, Standard 4754 rev. A, December 2010.
[Agile] Wikipedia contributors, “Agile software development”, in Wikipedia, the Free Encyclopedia, https://en.wikipedia.org/w/index.php?title=Agile_software_development&oldid=1198512141, accessed 14 February 2024.
[Alexander02] Christopher Alexander, The Nature of Order, Berkeley, California: The Center for Environmental Structure, 2002.
[Alexander15] Christopher Alexander, A City is not a Tree, Portland, Oregon: Sustasis Press, 2015.
[Alexander77] Christopher Alexander, A Pattern Language, Oxford University Press, 1977.
[Bain99] David Haward Bain, Empire Express: Building the First Transcontinental Railroad, New York: Viking, 1999.
[Berger24] Eric Berger, “The surprise is not that Boeing lost commercial crew but that it finished at all”, Ars Technica, 6 May 2024, https://arstechnica.com/space/2024/05/the-surprise-is-not-that-boeing-lost-commercial-crew-but-that-it-finished-at-all/, accessed 28 May 2024.
[Bezos16] Jeffrey P. Bezos, “2015 Letter to Shareholders”, Amazon.com, Inc., 2016, https://s2.q4cdn.com/299287126/files/doc_financials/annual/2015-Letter-to-Shareholders.PDF, accessed 22 February 2024.
[Brand94] Stewart Brand, How Buildings Learn: What Happens After They’re Built, New York: Penguin Books, 1994.
[CMMI] ISACA, “What is CMMI?”, https://cmmiinstitute.com/cmmi/intro, accessed 24 March 2024.
[Collins74] Michael Collins, Carrying the Fire: An Astronaut’s Journeys, New York: Farrar, Straus and Giroux, 1974.
[Conway68] Melvin E. Conway, “How do committees invent?”, Datamation, vol. 14, no. 4, April 1968, pp. 28–31, http://www.melconway.com/Home/pdf/committees.pdf.
[DOD10] DoD Deputy Chief Information Officer, “DoD Architecture Framework Version 2.02”, Department of Defense, United States Government, August 2010, https://dodcio.defense.gov/Library/DoD-Architecture-Framework/.
[Durkheim33] Emile Durkheim, The Division of Labor in Society, George Simpson, translator, Glencoe, Illinois: The Free Press, 1933.
[Fowler05] Martin Fowler, “Code as Documentation”, 22 March 2005, https://martinfowler.com/bliki/CodeAsDocumentation.html, accessed 15 March 2024.
[Garmin13] G3X Installation Manual, Garmin, 190-01115-01, Revision K, July 2013.
[Hardin20] Russell Hardin, and Garrett Cullity, “The Free Rider Problem”, in The Stanford Encyclopedia of Philosophy, Edward N. Zalta, editor, Metaphysics Research Lab, Stanford University, 2020, https://plato.stanford.edu/archives/win2020/entries/free-rider/, accessed 28 March 2024.
[Heilmeier24] George H. Heilmeier, “The Heilmeier Catechism”, in DARPA, https://www.darpa.mil/work-with-us/heilmeier-catechism, accessed 13 July 2024.
[Horney17] David Craig Horney, Systems-theoretic process analysis and safety-guided design of military systems, M.S. thesis, Department of Aeronautics and Astronautics, Massachusetts Institute of Technology, June 2017, https://dspace.mit.edu/handle/1721.1/112424.
[ISO26262] “Road vehicles — Functional safety”, Geneva, Switzerland: International Organization for Standardization, Standard ISO 26262:2018, 2018.
[ISO42010] “Systems and software engineering—Architecture description”, Geneva, Switzerland: International Organization for Standardization, Standard ISO/IEC/IEEE 42010, December 2011, http://www.iso-architecture.org/42010/index.html.
[J1939] “On-Highway Equipment Control and Communication Network”, SAE International, Standard J1939/1_202109, September 2021.
[JO711065] “Air Traffic Control”, Federal Aviation Administration, Department of Transportation, United States Government, Order JO 7110.65AA, April 2023, https://www.faa.gov/regulations_policies/orders_notices/index.cfm/go/document.information/documentid/1029467.
[JPL00] JPL Special Review Board, “Report on the Loss of the Mars Polar Lander and Deep Space 2 Missions”, Jet Propulsion Laboratory, Report JPL D-18709, March 2000, https://smd-cms.nasa.gov/wp-content/uploads/2023/07/3338_mpl_report_1.pdf.
[Jacobson88] Van Jacobson, “Congestion Avoidance and Control”, Proc. SIGCOMM, vol. 18, no. 4, August 1988.
[Johnson22] Claire Hughes Johnson, Scaling People: Tactics for Management and Company Building, South San Francisco, California: Stripe Press, 2022.
[Kalra16] Nidhi Kalra, and Susan M. Paddock, “Driving to safety: How many miles of driving would it take to demonstrate autonomous vehicle reliability?”, Santa Monica, CA: RAND Corporation, Report RR-1478-RC, 2016, https://www.rand.org/pubs/research_reports/RR1478.html.
[Klein14] Gerwin Klein, June Andronick, Kevin Elphinstone, Toby Murray, Thomas Sewell, Rafal Kolanski, and Gernot Heiser, “Comprehensive formal verification of an OS microkernel”, ACM Transactions on Computer Systems, vol. 32, no. 1, February 2014, https://trustworthy.systems/publications/nicta_full_text/7371.pdf.
[Kruger00] Justin Kruger, and David Dunning, “Unskilled and unaware of it: how difficulties in recognizing one’s own incompetence lead to inflated self-assessments”, Journal of Personality and Social Psychology, vol. 77, no. 6, January 2000, pp. 1121–1134.
[Leveson00] Nancy G. Leveson, “Intent specifications: an approach to building human-centered specifications”, IEEE Transactions on Software Engineering, vol. 26, no. 1, January 2000, http://sunnyday.mit.edu/papers/intent-tse.pdf.
[Leveson11] Nancy G. Leveson, Engineering a safer world: systems thinking applied to safety, Engineering Systems, Cambridge, Massachusetts: MIT Press, 2011.
[Lynch89] Nancy A. Lynch, and Mark R. Tuttle, “An introduction to input/output automata”, Cambridge, Massachusetts: Massachusetts Institute of Technology, Technical memo MIT/LCS/TM-373, 1989, https://www.markrtuttle.com/data/papers/lt89-cwi.pdf.
[NASA16] “NASA Systems Engineering Handbook”, National Aeronautics and Space Administration (NASA), Report NASA SP-2016-6105 Rev2, 2016.
[NASA19] “Debris Assessment Software User’s Guide, Version 3.0”, National Aeronautics and Space Administration (NASA), Report NASA TP-2019-220300, 2019.
[NPR7120] “NASA Space Flight Program and Project Management Requirements”, National Aeronautics and Space Administration (NASA), NASA Procedural Requirement NPR 7120.5F, 2021.
[NPR7123] “NASA Systems Engineering Processes and Requirements”, National Aeronautics and Space Administration (NASA), NASA Procedural Requirement NPR 7123.1D, 2023.
[OConnor21] Timothy O’Connor, “Emergent Properties”, in The Stanford Encyclopedia of Philosophy, Edward N. Zalta, editor, Metaphysics Research Lab, Stanford University, 2021, https://plato.stanford.edu/archives/win2021/entries/properties-emergent/, accessed 13 February 2024.
[OMG17] “Unified Modeling Language”, Object Management Group, Standard version 2.5.1, December 2017, https://www.omg.org/spec/UML/2.5.1/PDF.
[Olson65] Mancur Olson, The Logic of Collective Action: Public Goods and the Theory of Groups, Harvard Economic Studies, Cambridge, Massachusetts: Harvard University Press, 1965.
[Parnas72] D. L. Parnas, “On the criteria to be used in decomposing systems into modules”, Communications of the ACM, vol. 15, no. 12, December 1972, pp. 1053–1058.
[Smith22] Adam Smith, An Inquiry into the Nature and Causes of the Wealth of Nations, Edwin Cannan, editor, Third ed., London: Methuen and Co., 1922.
[Spiral] Wikipedia contributors, “Spiral model”, in Wikipedia, the Free Encyclopedia, https://en.wikipedia.org/w/index.php?title=Spiral_model&oldid=1068244887, accessed 14 February 2024.
[TTSB21] “Major Transportation Occurrence—Final Report, China Airlines Flight CI202”, Taiwan Transportation Safety Board, Report TTSB-AOR-21-09-001, September 2021, https://www.ttsb.gov.tw/media/4936/ci-202-final-report_english.pdf.
[Thompson99] Adrian Thompson, and Paul Layzell, “Analysis of unconventional evolved electronics”, Communications of the ACM, vol. 42, no. 4, April 1999, pp. 71–79.
[Tolstoy23] Leo Tolstoy, Anna Karenina, Constance Garnett, translator, Project Gutenberg, 2023, https://www.gutenberg.org/ebooks/1399.
[Tuckman65] Bruce W. Tuckman, “Developmental sequence in small groups”, Psychological Bulletin, vol. 63, no. 6, 1965, pp. 384–399.
[Tuckman77] Bruce W. Tuckman, and Mary Ann C. Jensen, “Stages of small-group development revisited”, Group and Organization Studies, vol. 2, no. 4, 1977, pp. 419–427.
[Wilson05] Simon P. Wilson, and Mark John Costello, “Predicting future discoveries of European marine species by using a non-homogeneous renewal process”, Journal of the Royal Statistical Society Series C: Applied Statistics, vol. 54, no. 5, November 2005, pp. 897–918, https://academic.oup.com/jrsssc/article/54/5/897/7113002.
[Zhang90] Lixia Zhang, and David D. Clark, “Oscillating behavior of network traffic: a case study simulation”, Internetworking: Research and Experience, vol. 1, 1990, pp. 101–112.