Making systems

Richard Golding
Copyright ©2024 by Richard Golding

Chapter 1: Introduction

9 May 2024

This book began as many presentations and short documents that I put together for different projects over the years. Those presentations covered topics from basic requirements management to good distributed system design to how to plan and operate a project that was regularly in flux. A few of the documents were retrospectives about why a project had run into trouble or failed. Others were written in an attempt to head off a problem that I could see coming.

I have worked on many projects. Most of these have been about building a complex system, or one that required high assurance—ones where safety or security are critical to their correct operation. Some have gone well, but all have had problems. Sometimes those problems led to the project failing. More often they have cost the project time and money, or resulted in a system that was not as good as it should have been. In every case the problems have required unnecessary effort and pain from the team working on the project.

This raised the question: what could be learned from these projects? How can future projects go better?

I began to sense that there were some common threads among all the education and advice I was putting together for these teams, and the problems they were having. With the help of some colleagues who were working their own challenging projects, I began to sort through these impressions in order to articulate them clearly and gather them in one place.

I have found that many of the problems I have observed have come at the intersection of systems engineering, project management, and project leadership. Building a complex system effectively requires all three of these disciplines working together. Most of the problems I have seen have arisen from a breakdown in one or more of them: where there is capable project management, for example, but poor systems engineering, or vice versa.

The intersection is about how each of these disciplines contributes to the work of building a system. The intersection is where people maintain a holistic view of the project. It is where technical decisions about system structure interact with work planning; where project leadership sets the norms for how engineers communicate and check each other’s work. It is where competing concerns like cost versus rigor get negotiated. And it is where people take a long view of the work, addressing how to prepare for the work a year or more in the future.

I’ve worked with many people who were good at one of these disciplines, but didn’t understand how their part fit together with others to create a team that could build something complex while staying happy and efficient. I have worked with well-trained systems engineers who knew the tools of their craft, but did not know how or, more importantly, why to use them and how they fit together. I have worked with project managers who had experience with scheduling and risk management and other tools of their craft, but lacked the basic understanding of what was involved in the systems part of the work they were managing. I have also worked with engineers and managers who were tasked with assembling a team, but did not understand what it means to lead a team so that it becomes well-structured and effective.

In other words, they were all good at their individual disciplines but they lacked the basic understanding of how their discipline affects work in other disciplines, and how to work with people in the other disciplines to achieve what they set out to do.

And that brings me to the basic theme of this book: that making systems is a systems problem, an integration problem. The system that is being built will be made of many pieces that, in the end, will have to work well together. This requires at least some people having a holistic, systemic view of the thing being built. The team that builds the system is itself a system, and its parts (its people, roles, and disciplines) need to work together. The team is something to be engineered and managed, and it needs people who maintain a holistic view of how its parts work together.

This book is not a book on systems engineering or project management per se. Rather, it provides an overarching structure that organizes how the systems engineering, project management, and leadership disciplines contribute to systems-building. While I reference material from these disciplines as needed, do not expect (for example) to learn the details of safety analyses here. I do discuss how those analyses fit with the other work needed for building a system, and provide some references to works by people who have specialized in those topics.

This book is for people who are building complex systems, or are learning how to do so. I provide a structure to help think about the problems of building systems, along with ways to evaluate different ways one can choose to solve problems for a specific project. I relate experience and advice where I have some.

Using this book. This book covers two topics: the system being built, and how to go about building that system. These topics are intertwined, because the point of going to the effort to build a system is to build a well-functioning system.

The first two parts of this book are meant for everyone, and to be read first. They provide a general foundation for talking about making systems: a simplified but holistic view of the subject. Part I presents a short set of case studies to motivate what I’m talking about. Part II presents models for thinking about systems and the making of systems at a high level, along with recommended principles for both.

The two parts that follow provide more detailed discussions of what systems are (Part III) and what systems-building is (Part IV). These parts expand on the material in Part II, providing more structure for talking about each of their subjects. These parts are meant to be read after the foundational parts, but need not be read in order or all at once.

The remaining parts go into depth on concepts and tools that help with building complex systems. These include topics like project life cycles, system design, team organization, and planning. These parts use the language built up in the first parts. The later chapters are meant to be read as needed: when you find you need to know about a topic, dip into relevant chapters.

Part I: System stories

A set of case studies illustrating what can go right and wrong in a project to build a system.

Chapter 2: Making a simple system

30 April 2024

This book is about both what a well-built system is and how to make that happen. To begin, I’ll start with a simple story: building a small cottage model out of Lego™ bricks.

This story is made up, but it reflects some of the situations I have found in real projects I have worked on. It deliberately illustrates problems in a simplified and perhaps exaggerated way to make them clear. The simplifications include: a very small team, and one that doesn’t need to grow during the project; customer “needs” that are simple; and a project that does not need to consider real emergent properties like safety, security, or even mechanical strength.

2.1 The request

A customer wants a small cottage model, built out of Lego™ bricks. They would like the cottage to be white. They would like it to have a window. They have a base plate they would like it to fit on.

Someone on the team works with the customer to get this information and understand the needs. This results in a sketch, which the customer agrees reflects what they have asked for (Figure 2.1).

Figure 2.1: Sketch of customer needs

2.2 Building the cottage

The project gets its team together, and they begin discussing how to design and build the cottage. Based on the sketch concept, they decide to split the work: one person for each of four walls, and one person for the roof.

The team discusses some basic design parameters. They decide on the length and height of each of the walls, based on the size of the base plate and the rough ratio of the sides in the sketch. They also decide which wall will get the window.

Each person on the team then begins designing and building their part, based on the sizes they have agreed on. The result is a set of five assemblies (Figure 2.2).

Figure 2.2: Initial components

Right away there are some visible problems.

The team then tries to integrate the assemblies to make the cottage. The result is not good (Figure 2.3).

Figure 2.3: Initial integration, with problems

There are integration problems.

At this point, the team addresses some of these issues. They add roof supports to the front and back walls, and redesign all the walls to interlock at the corners.

The result is a structure that integrates all the components (Figure 2.4).

Figure 2.4: Initial integrated cottage

There are still problems with the integrated cottage.

The problems with the side wall come from one of the team members rushing to rebuild that wall after they were reminded that the cottage was to be all white and not have red stripes.

The missing door is a specification problem that came to light when the customer saw the completed cottage. The original sketch developed with the customer didn’t include a door—it only had an annotation about a window. People implicitly know that cottages need doors but builders may miss out on the door if it isn’t explicitly specified.

After some systems work, the team corrects the problems, fixing the side wall and adding a door. Correcting the problems involved taking the cottage more than halfway apart and rebuilding it. The result meets what the customer wanted (Figure 2.5).

Figure 2.5: Final integrated cottage

2.3 Retrospective on building

There were several problems that the team encountered building the cottage.

The team did not work with the customer to develop a thorough understanding of the customer’s needs. The team only had a minimal writeup of the needs, and that writeup left an important need implicit (the need for a door).

Next, the team did not develop a concept of the system (the cottage) and check that concept with the customer. For example, the team could have made a more realistic drawing of the cottage, and talked with the customer about how the cottage would be used. Checking a concept would have probably caught the missing implicit requirement for a door.

To their credit, the team decomposed the cottage into components (walls and roof), defined some dimensional requirements each would meet, and assigned someone to design each component. Unfortunately the team did not work out and document the interfaces between components. This meant that no one looked at how the walls would be joined (interlocking or not), and no one looked at how the roof would be supported on some walls.
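One lightweight way to work out interfaces is to have each builder write down what they assume about every joint their component touches, and then compare the answers. The following is a minimal sketch of that idea in Python, not a description of what the team in the story did. The `InterfaceAssumption` record, the `find_interface_conflicts` helper, and the specific joints and approaches listed are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InterfaceAssumption:
    """What one component's builder assumes about a shared interface."""
    component: str   # who holds the assumption, e.g. "front wall"
    interface: str   # which joint it concerns, e.g. "front-left corner"
    approach: str    # how the builder expects the joint to work


def find_interface_conflicts(assumptions):
    """Group assumptions by interface and return those where builders disagree."""
    by_interface = {}
    for a in assumptions:
        by_interface.setdefault(a.interface, {})[a.component] = a.approach
    return {
        interface: views
        for interface, views in by_interface.items()
        if len(set(views.values())) > 1
    }


# Hypothetical assumptions for the cottage; the specific values are illustrative.
assumptions = [
    InterfaceAssumption("front wall", "front-left corner", "interlocking"),
    InterfaceAssumption("left wall", "front-left corner", "butt joint"),
    InterfaceAssumption("front wall", "roof seat", "wall provides supports"),
    InterfaceAssumption("roof", "roof seat", "wall provides supports"),
]

conflicts = find_interface_conflicts(assumptions)
for interface, views in conflicts.items():
    print(f"Conflict at {interface}: {views}")
```

In this made-up data the two builders disagree about the corner joint, so that interface is flagged before anything is built. The value lies less in the code than in forcing each assumption to be written down where others can see it.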

One of the team members building a wall did not follow requirements about color—or perhaps the color requirement was missing or unclear.

Finally, the team members did not communicate with each other. Ideally, each one would have shared abstract designs for their component with the people building components connecting to that component. Sharing these designs would likely have caught that each team member had different understandings of how their components would be joined together.

The outcome was that the project took longer than it should have, because there was rework that could have been avoided.

Of course, this story is simpler than building a real building would be. A real building has multiple internal components, such as electrical, plumbing, or HVAC systems, that would create many more interfaces among components. A real building has to be designed to be mechanically sound; this requires systematic analysis to ensure that the building will stay up even in unusual events like storms or earthquakes. A real building also has safety concerns, like fire safety. Finally, building a real building is regulated in most places, requiring permits, inspections, and approvals from external authorities to ensure regulatory compliance.

2.4 Adding to the cottage

Some time passes, and the customer decides that they would like a larger model cottage, and they make a request to add on to the initial version. The team that built the original cottage has moved on to other projects.

A new team talks with the customer to learn what the customer wants. How much larger do they want the extended cottage to be? Should it be extended horizontally or vertically? The customer indicates that an extension adding between 50% and 100% of the original floor area would be sufficient, and the customer prefers a horizontal extension.

The team next has decisions to make about the overall design of the extension. They settle on an approach that matches the style of the original part and adds a little over half the floor area. They suggest to the customer that a window in the extension would be a good idea, and the customer agrees.

The new team does not have access to the team that made the original design decisions. They have to reverse engineer the design approach used by examining the cottage as built.

The original cottage was located toward the back of the base plate, and the team has decided that the extension should be at the back of the original. This implies that the team will have to move the original cottage forward. The team examines the original structure and determines that it can be moved on the base plate without problems.

The new team works together to design and build the extension. They have learned about the problems that the original team had, and so they manage the interfaces between walls and with the roof better. However, they don’t have access to the decisions that the original team made about interlocking the walls for strength, and so they build the extension as a separate unit.

Figure 2.6: Cottage with addition

2.5 Retrospective on addition

This illustrates a common scenario: changes are made to a system long after it was originally built. The changes can be complex projects on their own. The original team may be long gone, or they may no longer remember details that were not written down. Knowing the design decisions and the rationales behind them affects how the changes are designed.

The changes do not just add features (new space); they also add interfaces between new parts and the original, and can change the interfaces within the original.

The team that built the original cottage did not document the design decisions they made. The team building the addition had to reverse engineer the design from the built cottage. The lack of information about the rationale for how walls were connected led to a different, less structurally sound approach for connecting the addition to the original structure.

The project to build the addition took longer than it could have if the team had not had to reverse engineer the design. The lack of design rationale led to a structural solution that is sufficient for plastic bricks but would not work in a real structure.

In this story, the new team did learn from the original experience that they should do systems-level work. They worked through the interfaces between new parts, and this led the new project to go more smoothly than the original. The lesson is that learning over time matters.

Once again, this story is a simplification of a real building project. A real building would have far more interfaces: electrical circuits and plumbing might need to be extended. The structure of a real extension would have to be integrated into the original structure.

This story did not show the value of designing the original to be expanded. In the example, the original cottage could have been placed forward on the base plate so there was space for a later addition. In a real building, by analogy, designing the electrical main panel to have space for additional circuits and enough capacity to add more usage would make an addition easier.

2.6 Principles

As I present these stories, I will link them to the principles in Chapter 7 that can provide solutions.

Project leadership. Some of the problems in this story relate to how the cottage-building project was led. The most relevant principle is Section 7.1.3—Principle: Systems view of the system. The original team’s work would have gone more smoothly if they had had someone responsible for ensuring that the system made sense as a system.

System-building tasks. Some of the problems related to how the original team went about its work—which resulted in problems with the final system product.

The team. This story does not illustrate many problems with the team itself. However, the team building the original cottage built each of the components in isolation, and did not discover that their parts would not integrate until the parts had been built.

Chapter 3: Stories about building systems

This chapter presents some case studies of how people have built complex systems.

3.1 Developing a spacecraft mission without engineering the system

10 April 2024

The project. I worked on a NASA small spacecraft project. The project’s objective was to fly a technology demonstration mission to show how a large number of small, simple spacecraft could perform science missions. The mission objectives were to demonstrate performing coordinated science operations on multiple spacecraft, and to demonstrate that the collection of spacecraft could be operated by communicating between one spacecraft and ground systems, and the spacecraft then cross-linking commands and data to perform the science operations.

The problem. The mission had one set of explicit, written mission objectives to perform the technology demonstration. It also had a number of implicit, unwritten constraints placed on it, primarily to re-use particular spacecraft hardware and software designs.

Those two sets of objectives resulted in conflicts that made the mission infeasible. There were three key technical problems: power consumption was far in excess of what the spacecraft’s solar panels could generate; the legacy radios could not communicate effectively over the distances involved; and the design had insufficient computing capability to accurately compute how to point spacecraft for cross-link communication.

Conflicts like these are not uncommon when first formulating a system-building project, and NASA processes are structured to catch and resolve them. The NASA Procedural Requirements (NPRs), a set of several volumes of required processes, require projects to formalize mission objectives and analyze whether a potential mission design is feasible. This work is checked at multiple formal reviews, most importantly the Preliminary Design Review (PDR).

At the PDR, expected project maturity is:

Program is in place and stable, addresses critical NASA needs, has adequately completed Formulation activities, and has an acceptable plan for Implementation that leads to mission success [italics mine]. Proposed projects are feasible with acceptable risk within Agency cost and schedule baselines. [NPR7120, Table 2-4, p. 30]

This project, however, failed at three of the necessary steps. First, the project did not perform top-down systems engineering, such as a proper documentation of mission objectives, a concept of operations, and a refinement of those into system-level and subsystem-level specifications. In particular the implicit and undocumented constraints were never documented as requirements; they were tacitly understood by the team and rarely analyzed. Those requirements that were gathered were developed by subsystem leads, and they were inconsistent and did not derive from the mission objectives. Second, individual team members did analyses that showed problems with the ability of the radios, their antennas, and the ability to point the spacecraft in such a way that cross-link communications would work. The people involved repeatedly tried to find a solution in their individual domain of expertise to fix the problem, and the problems were never raised as a systemic problem. Finally, the PDR was the final check where these problems should have been brought to light, as the refinement of mission objectives and the concept of operations would fail to show communication working. Instead, the team focused on making the review look good rather than addressing the purpose of the review.

Outcome. The project proceeded to build the hardware for multiple spacecraft, began developing the ground systems and developing the flight software. After several months, the project neared the end of its budget, and the spacecraft design was canceled. Something like two years’ worth of investment was lost, and the capability of performing a multi-spacecraft science mission was never demonstrated.

The agency later found some funds to develop a much simplified version of the flight software and relaxed the mission objectives substantially to only performing some minimal cross-link communications. A version of that mission was eventually flown.

Solutions. The project made four mistakes. Each one of them could have been corrected if the project had followed good practice and NASA required procedures.

First, the conflicting mission objectives and constraints should have been resolved early in the project. NASA has a formal sequence of tasks for defining a mission and its objectives, leading to a mission definition that is approved and signed by the mission’s funder. If the project had followed procedure, the implicit constraints would have been recorded as a part of this document. Documentation would have encouraged evaluation of the effects of those constraints.

Second, the project did not do normal systems engineering work. The systems engineering team should have documented the mission objectives, developed a concept of operations for the mission, and performed a top-down decomposition and refinement of the mission systems. In doing so, problems with conflicting objectives would have been apparent. The systems leadership would have been involved in analyses of the concept, and thus been aware of where there were problems.

Third, the team lacked effective communication channels that would have helped someone working one individual problem raise the issues they were finding up to systems and project leadership, so that the problems could be addressed as systems issues. For example, one person found that the flight computer would not be able to perform good-enough orbit propagation of multiple spacecraft so that one spacecraft would know how to point its antenna to communicate with another. A different person found problems with the ability of the radios to communicate at the ranges (and relative speeds) involved.

Finally, the PDR should have been the safety net to find problems and lead to their resolution. The NASA procedural requirements have a long list of the products to be ready at the PDR. More than 30 are specifically the responsibility of systems engineering [NPR7123, Table G-6, p. 81], and the project overall has a similar number of products [NPR7120, Appendix I]; there is some overlap between these lists. The team took a checklist approach to these products, putting together presentations for each topic in a way that highlighted progress in the individual topics but failed to address the underlying purpose: showing that there was a workable system design.

Had any of these mechanisms worked, the systems and project leadership would have detected that the conflicting mission objectives were infeasible and led the project to negotiate a solution.

Principles. This example is related to several principles for a well-functioning project.

3.2 Marketing and engineering collaboration

12 April 2024

The project. I worked at a startup company that was building a high-performance, scalable storage system. The ideas behind the system came from a university research project, which had developed a collection of technology components for secure, distributed storage systems.

The company had developed several proof-of-concept components and was transitioning into a phase where it was getting funding and establishing who its customers were. The company hired a small marketing team to work out what potential customers needed and to begin building awareness of the value that the new technology could bring.

The problem. The marketing team had experience with computer systems, but not with storage in particular. They could identify potential market segments, but they did not have the background needed to talk with potential customers about their specific needs.

The engineering team was similarly not trained in marketing. Some of the team members had, however, worked at companies that used large data storage systems and so had experience at being part of similar organizations.

Solutions. The marketing team set up a collaboration with some of the technical leads. This collaboration left each team in charge of their respective domains, with the technical leads helping the marketing team do their work and the marketing team providing guidance about customer needs to the engineering team.

One of the technical leads acted as a translator between the marketing and engineering teams, so that information flowed to each team in terms they understood. Technical leads joined the marketing team on customer visits, helping to translate between the customers’ technical staff and the marketing team. The marketing team conducted focus group meetings, and some of the technical leads joined in the back room to help frame follow-up questions to the focus groups and to help interpret the results.

Outcome. The collaboration helped both teams. The marketing team got the technical support and education they needed. The engineering team got proper understanding of what customers needed, so that the system was aimed at actual customer needs.

Principles. This example is related to the following principles:

3.3 Missing implicit requirements

13 April 2024

The project. This occurred at the startup I worked at that was building a scalable storage system.

The problem. The team had a focus on making the system highly available, to the point where we had an extensive infrastructure for monitoring input power to servers and providing backup power to each server. If the server room lost mains power, our servers would continue on for several minutes so that any data could be saved and the system would be ready for a clean restart when power came back on. We did a good job meeting that objective.

What we forgot is that people sometimes want to turn a system off. Sometimes there is an emergency, like a fire in a server room, and people want the system powered off right away. Sometimes preventing the destruction of the equipment is more important than losing a few minutes’ worth of data. We had no power switches in the system and no way to quickly power it down.

Outcome. In practice this wasn’t too serious a problem because emergencies don’t happen often, but it meant that the system couldn’t pass certain safety certifications.

Solutions. We made two mistakes that led to the problem.

The first mistake was that everyone on the team saw high availability as a key differentiator for the product, and so everyone put effort into it. This created a blind spot in how we thought about necessary features.

The second mistake was that we did not work through all of the use cases for the system, and so missed implicit features, including power off. Building up a thorough list of use cases can serve as a way to catch blind spots like this, but the team did not build such a list.

Principles. This is related to one principle:

3.4 Building at a mismatch to purpose

15 April 2024

The project. I consulted on a project to build a technology demonstration of a constellation of LEO spacecraft for the US DOD. This constellation was to perform persistent, worldwide observations using a number of different sensors. It was expected to operate autonomously for extended periods, with users worldwide making requests for different kinds of operations. The constellation was expected to be extensible, with new kinds of software and spacecraft with new capabilities being added to the constellation over time.

One company organized the effort as the prime contractor. That company built a group of other companies of various sizes and capabilities as subcontractors. The team won a contract to develop the first parts of the system.

The problem. The constellation had to be able to autonomously schedule how its sensors would be used, and where major data processing activities would be done. For example, someone could send up a request for an image of a particular geographic region, to be taken as soon as possible. The constellation would then determine which available spacecraft would be passing over that region soon. Some of the applications required multiple spacecraft to cooperate: taking images from different angles at the same time, or persistently monitoring some region, handing off monitoring from one spacecraft to another over time, and performing real-time analysis on the images gathered on those spacecraft.

The prime contractor selected its team of other companies and wrote the contract proposal for the system before doing systems engineering work. This meant that neither a detailed concept for the system’s operation nor a high-level design had been done.

After the contract was awarded, the team had to rapidly produce a system design. This effort went poorly at first because the system’s concept had not been worked out, and different companies on the team had different understandings of how the system would be designed. The team had to deliver an initial system concept of operations and requirements quickly after the contract was awarded. The requirements were developed by asking someone associated with each expected subsystem to write some requirements. Needless to say, the concept, high-level design, and requirements were all internally inconsistent.

After the team brought me on to help sort out part of the design problems, we began to do a top-down system design and establish real specifications for the components of the system. We were able to begin to work out general requirements for the autonomous scheduling components.

The project team had determined that they needed to use off-the-shelf software components as much as possible, because the project had a short deadline. One of the subcontractor companies was invited onto the team because they had been developing an autonomous spacecraft scheduling software product, and so the contract proposal was written to use that product.

However, as we began to work out the actual requirements for scheduling, it became apparent that the off-the-shelf scheduling product did not match the project’s requirements. The requirements indicated, for example, that the system needed to be able to schedule multiple spacecraft jointly; the product only handled scheduling each spacecraft independently. The system also had requirements for extensibility, adding new kinds of sensors, new kinds of observations, and new kinds of data processing over time. This suggested that strong modularity was needed to make extensibility safe, but the off-the-shelf product was not at all modular.
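The comparison described above, checking what the system requires against what a candidate product provides, can be written down very simply once the requirements exist. The following Python sketch is only an illustration under stated assumptions: the `unmet_requirements` helper and the capability labels are hypothetical, and the product's capability list is an assumed summary drawn from the description in this section rather than from any real product documentation.

```python
def unmet_requirements(required, provided):
    """Return the required capabilities that a candidate product does not provide."""
    return sorted(set(required) - set(provided))


# Capability names are hypothetical labels for the requirements described above.
scheduling_requirements = {
    "schedule single spacecraft",
    "schedule multiple spacecraft jointly",
    "modular extension for new sensors",
}

# Assumed summary of the off-the-shelf product, based only on the text above.
off_the_shelf_product = {
    "schedule single spacecraft",
}

gaps = unmet_requirements(scheduling_requirements, off_the_shelf_product)
for gap in gaps:
    print("Not met:", gap)
```

The check itself is trivial. The hard part, and the part that was skipped, is having a requirements list to compare against before committing to the product.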

Outcome. The mismatch between the decision to use the off-the-shelf scheduling product and the system’s requirements led to both technical and contractual problems.

The technical problem was that the scheduling product could not be modified to work differently and thus meet the system requirements. The project did not have the budget, people, or time to do detailed design of a new scheduling package that would meet the need.

The contractual problem was that the subcontractor had joined the project specifically because they saw a market for their product and were expecting to use the mission to get flight heritage for it. When it became clear that their product did not do what the system needed, they discussed withdrawing from the project.

In the end, the customer decided not to continue the contract and the project was shut down.

Solutions. This project made three mistakes that, had they been avoided, could have changed the project’s outcome.

First, the team did not do the work of early stage systems engineering to work out a viable concept and high-level design before committing to partners and contracts. This would have made it clear what was needed of different system components. It would also have provided a sounder basis for the timelines and costs in their contract proposal.

Second, the team made design and implementation choices for some system components without understanding the purpose that those components needed to fill.

Finally, the team made commitments to using off-the-shelf designs without determining whether those designs would work for the system.

Principles. The solutions above are related to the following principles:

3.5 The persistence of team habits

6 May 2024

The project. I consulted for a company that was working to build an autonomous driving system that could be retrofitted into certain existing road vehicles.

The company had started with veterans from a few other autonomous driving companies. They began their work by prototyping key parts of a self-driving system, to prove that they had a viable approach to solving what they saw as the key problems. This resulted in a vehicle that could perform some basic driving operations, though it was always tested with a safety driver on board.

The team focused only on what they saw as the most important problems in an autonomous driving system. They believed that it was important to demonstrate a few basic self-driving functions as rapidly as possible, in part because they believed that this would help them get funding, and in part because they believed that this would help them forge partnerships with other companies. They focused on a simplified set of capabilities, including sensing, guidance, and actuation mechanisms for driving on a road.

The problem. This focus meant that the team developed a culture, along with a few somewhat documented processes, that was focused on building a prototype-style product, even as they began to fit their system into multiple vehicles and test them on the road (with safety drivers). When they found a usage situation in their testing that their driving system did not handle as they felt it should, they added features to handle that situation to the sensing and guidance components and to simulation tests they used on those components. In other words, the engineering work was driven largely reactively.

The team did not spend effort on analyzing whether the new features would interact correctly with existing features, relying on simulation testing to catch regressions. They did not develop a plan for features that they would need, and for how they would integrate other systems with the core functions they had already prototyped.

Some of the team members had some awareness that they needed to improve the safety of the driving system and the rigor with which the team designed and built the system. These team members, some of whom were individual engineers and some who were leaders, tried from time to time to define some basic individual processes—like defining requirements before design, or conducting design reviews. Their goal was always to move the team incrementally toward sound engineering practice.

None of these attempts worked: each time, a few people would try a new procedure, task, or tool, but a critical mass of the team would keep working the way they had been in order to keep adding new features in response to immediate needs.

Outcome. After nearly two years, the team had not changed its practices and continued to work as if they were building a prototype. The team in general did not define or work to requirements; they did not analyze the systems implications of potential new features before implementing them. The team was making little progress on developing a safety case for the system.

Solutions. The fundamental problem was a misalignment between the incentives that drove the team in the short term and long-term practices needed to build a safe and reliable system.

The team as a whole, from the leadership down, developed habits focused on developing a proof of concept that would let the company get additional funding, as well as attract good staff and help the company build partnerships. This was the right choice for the company in its early days, because a company that cannot get funding does not get to move on to the long term. This short-term focus drove the habits and culture of the early company.

Later, as the company got funding and built up a team to build the system, they needed to change their practices. Changing a team’s culture and habits is hard, especially when the team’s practices have worked well so far. The team’s habit of focusing on short-term results, in particular, defined how they organized all their work.

In order to become a company building a product that is viable in the long term, teams like this must make a deliberate change to their culture, habits, and practices. A disruptive change like this does not happen spontaneously: a team’s culture defines the stable environment in which people can do what they understand to be good work. This creates a disincentive to make a change that disrupts how everyone works together.

Deliberate and pervasive changes come from the team’s leadership. The leadership must first recognize that a change is needed and work out a plan for what to change, how quickly, and in what way. The leadership then have to explain the changes needed, create incentives that will overcome the disincentives to change, and hold people on the team accountable for making the changes.

Principles. This case reflects some more of the principles outlined in Chapter 7.

3.6 Heavyweight, understaffed processes

24 April 2024

The project. A colleague was an engineer working on an electronics-related subsystem at a large New Space company that was building a new launch vehicle.

The team in question was responsible for designing one of the avionics-related subsystems and acquiring or building the components. This required finding suppliers for some components and ordering the necessary parts.

The problem. The company had processes in place for both vendor qualification and parts ordering. They included centralized software tools to organize the workflow.

The vendor qualification process began with submitting a request into the tools. The request was then reviewed by a supplier management team; once they approved a supplier, the avionics team could start submitting purchase requests to buy parts. Each purchase request would similarly be routed to an acquisition team that would make the actual purchase from the supplier.

The intents of this process were, first, to take the work of qualifying potential vendors and managing purchases off the engineering team, and second, to ensure that the vendors were actually qualified and that parts orders were done correctly.

From the point of view of the engineers building the avionics, the processes were opaque and slow. They would put in a request, and not know if they had done so properly. Responses took a long time to come back. At one point, my colleague reverse engineered the vendor qualification process in order to figure out how to use it; the result was a revelation to other engineers.

It also appeared that the positions responsible for processing these requests were understaffed for the workload. In practice these people did not have the time to do proper reviews of the vendors most of the time.

Outcome. Having supply chain processes was a good thing: if they worked, they would increase the likelihood that the acquired parts would meet performance and reliability requirements, that the vendors would deliver on schedule and cost, and that the cost of acquiring parts remained within budget.

However, getting vendors qualified to supply components and then getting the components took a long time, delaying the system’s implementation and then delaying testing and integration.

The suppliers and the parts did not get the intended scrutiny, which may have let problem suppliers or parts through.

The company acquired a reputation with its employees of being slow and difficult to work for.

Solutions. There are four things that could have been done to make these processes work as intended.

First, the processes should be documented in a way that everyone involved knows how the process works. In this situation, it seems that people playing different parts in the process knew something about their part, but they did not understand the whole process; if there was documentation, the people involved did not find it. The process documentation should inform all the people involved what all of the steps are, so they understand the work. It should make clear the intent of the process. It should also make clear what would make a request successful or not.

Second, the processes should be evaluated to ensure that every step adds value to the project, compared to not doing that step or doing the process another way.

Third, the supporting roles—in this case, those tasked with reviewing and approving requests—should be staffed at a level that allows them to meet demand.

Finally, the project should regularly check whether its processes are working well, and work out how to adjust when they are not working.

Principles. The following principles apply:

3.7 Planning the transcontinental railroad

24 April 2024

The project. The first transcontinental railroad to cross North America was built between 1862 and 1869. [Bain99] It involved two companies building the first rail route across the Rocky Mountains and the Sierra Nevada, one starting in the west and the other in the east. It was built with US government assistance in the form of land grants and bonds; the government set technical and performance standards that had to be met in order to get tranches of the assistance. The technical requirements included worst-case allowable grades and curvature. The performance requirements included establishing regular freight and passenger service to certain locations by given dates.

The problem. The companies building the railroad had limited capital available to build the system. They had enough to get started, but continuing to build depended on receiving government assistance and selling stock. Government assistance came once a new section of continuous railroad was accepted and placed into service. In addition, the two companies were in competition to build as much of the line as possible, since the amount of later income depended on how much each built.

This situation meant that the companies had to begin building their line before they could survey (that is, design) the entire route. They operated at some risk that they would build along a route that would lead to someplace in the mountains where the route was uneconomical, perhaps because of slopes, necessary tunneling, or expensive bridges.

Because the building began before the route was finalized, the companies could not estimate the time and resources needed for construction beyond some rough guesses. The companies worked out a general bound on cost per mile before the work started, and government compensation was based on that bound. In practice the estimate was extravagantly generous for some parts of the work.

Solutions. The initial design risk was limited because there were known wagon routes. People had been traveling across the Great Plains and the mountains in wagons for several years. While the final route did not exactly follow the wagon routes, the early explorations ensured that there was some feasible route possible.

The companies built their lines in four phases: scouting, surveying, grading, and track-laying. (In some cases they built the minimal acceptable line with the expectation that the tracks would be upgraded in the future once there was steady income.) Scouting defined the general route, looking for ways around bottlenecks like canyons, rivers, or places where bridges or tunnels would be needed. Surveying then defined the specific route, putting stakes in the ground. The surveyed route was checked to ensure it met quality metrics, such as grade and curvature limits. After that, grading crews leveled the ground, dug cuts through hills, and tunneled where necessary. Finally, track-laying crews built bridges and culverts where needed, then laid down ballast, ties, and rail. After these phases, a section of track was ready for initial use.

Scouting ran far ahead of the other phases, sometimes up to a year ahead. Survey crews kept weeks or months ahead of grading crews. The grading and track-laying crews proceeded as fast as they could. All this work was subject to the weather: in many areas, work could not proceed during winter snows.

Outcome. The transcontinental railroad was successfully built, which opened up the first direct rail links from one coast of North America to the other. The early risk reduction—through knowledge of wagon routes—accurately showed that the project was feasible.

The companies were able to open up new sections of the line quickly enough to keep the construction funded. The companies received bonds and land grants quickly enough, and revenue began to arrive.

The approach of scouting and surveying worked. The scouting crews investigated several possible routes and found an acceptable one. While there were instances of tentatively selecting one route and then changing to another (sometimes for internal political reasons rather than technical or economic ones), no section of the route was changed after grading started. In later decades other routes were built, generally using tunneling technology that was not available for the first line. Many parts of the original line are still in regular use.

Principles. The transcontinental railroad project was an example of planning a project at multiple horizons, where the work of implementing begins before the design is complete, and where the plan and design are continuously refined.

Part II: Systems background

Foundational definitions used throughout the rest of this book, including:

Chapter 4: What making systems is

9 May 2024

This book is about the work involved in making a system—what a system is, and how to do a good job making one.

Part I presented a set of case studies that showed how system-building projects can go well, or not. This leads to two questions: How does one build a system well? And how does one avoid the problems?

To start finding answers to these questions, consider three aspects of making a system: what a system is; the activities involved in making it; and the people who do the activities that make the system.

A system. A system is “a regularly interacting or interdependent group of items forming a unified whole”.[1] Other definitions speak to a set of items or components that work together to fulfill a purpose.

This definition includes some of the key aspects of a system.

For artificially built systems, the system is the outcome of all the work that people do to make the system.

Having a purpose distinguishes a system built by people from systems in nature. A natural system often just exists, and any meaning or purpose is assigned to it after the fact by people. A human-built system, on the other hand, does something for someone. The purpose of a human-built system can be described in terms of what it does for someone, and why it is worth the effort to make a system do that.

Most systems are not static: they will evolve rapidly as they proceed from concept through design; once they are in operation, they will continue to evolve as their users’ needs evolve.

The next chapter, Chapter 5, discusses more about what a system is.

Making a system. The work of making a system can be seen as a string of activities, the life cycle of the system. It begins with an idea. That idea might be a user’s need, or it might be an idea for a new way to do something that might fill an as-yet-unidentified user’s need. The work proceeds to translate that idea into designs and then into a working system. This work goes through a number of steps, such as developing a concept, specifying its pieces, designing and implementing them, integrating the parts and verifying the assembly. Once a system has been built, it can be placed into operation. A system that has been in operation may change over time: users’ needs change, or technology changes. Eventually, every system is retired and disposed of.

All these activities are done by a team of people who are building the system, and the point of spending the effort is for the system, once built and in operation, to fulfill its purpose.

Chapter 6 discusses more about how to make a system.

Who does the work. A team of people working together does all the activities involved in making the system. For complex systems, the team can get large and may involve people at different companies and with different skills.

The team of people is itself a system: a set of people, whose purpose is to build the objective system, who interact with each other through discussion, documentation, and artifacts. A team that is functioning well is able to focus its efforts on the purpose of the system it is building. The team is organized so that its members each have the information they need to do their part, and can communicate so that the pieces of the system that they create work together.

Key roles. A team that functions well, like any human-built system, does not happen by accident; it happens because someone takes the effort to design and implement it so that it works well.

In practice, there are three roles that do this work of organizing and running the team. These roles may be divided among team members in many different ways, but every team building a complex system needs the three roles filled somehow. The roles are:

The intersections. Having teased apart the ideas of system, system-building, and people, and the ideas of systems engineering, project management, and project leadership, the next step is to acknowledge that none of these things are in fact separate.

The objective of a project is to produce a system. The way to produce it is to do system-building work. The people in a team do that work. All three must fit together: the way that the work gets done determines whether the resulting system meets its purpose. How the team is organized, along with its culture and habits, governs how the people will do the work.

While systems engineering, project management, and project leadership are different roles and involve different skills, they work together. Leadership by itself gets nothing done; that comes from engineering and management. Leadership and management without systems engineering might produce a system but it probably won’t work. Leadership and engineering without management usually means a lot of engineers running around doing cool things but also wasting time and resources and not actually getting things done. Management and engineering without leadership isn’t able to make decisions or take responsibility.

The people filling each of these three roles also need to understand their counterparts’ roles. A systems engineer who designs something that would require more time or resources than the project has is not going to be effective. A project leader who does not understand the work the team does is not going to model good work practices. A project manager who does not understand the engineering is not going to build a plan or schedule that makes sense.

Systems work, in the end, is about doing work that makes a coherent whole out of the parts it has to work with. The work of making a system is just as much a systems effort as its product is. Only when the parts fit together does the work get done as it should.

[1] Per Merriam-Webster Online Dictionary.

Chapter 5: Elements of systems

21 May 2024

5.1 Introduction

Working with systems is about working with the whole of a thing. It is a bit ironic that to make the whole accessible to rational design, we need to talk about the parts that make up systems work.

That is one of the first points about systems. Most systems are too complex for a human mind to remember and understand as a whole at one time. To work on these systems, we must find ways to abstract and to subset the problem. This book discusses some of the techniques for slicing a system into understandable parts, along with ways to use those techniques and why to use them. In the end, however, everything in here deals with carefully-chosen subsets of a system.

This chapter covers some of the essential concepts and building blocks that are the foundation for the techniques discussed in the rest of this book.

The subjects for systems work can be divided into five groups:

The first four subjects are connected by a reductive approach to explaining complex systems, in which the high-level purpose is explained by reducing it to simpler constituent parts and structure, and conversely expressing the purpose as emergent from these simpler parts. The final subject is about ensuring that the system does what it is supposed to do (and only that).

5.2 System purpose

Every system that is designed and built has a purpose. That is, someone has an expectation of the benefits that will come from building the system, and they believe that those benefits will outweigh the costs (in resources, time, or opportunities) that will be incurred building the system.

Every system must be designed and built to address its purpose, and no other purposes, at the lowest cost practically achievable. This point may seem uncontroversial on its surface, but I have observed that the majority of projects fail to work to this standard, and incur unnecessary costs, schedule slips, or missed customer opportunities. Every design choice must be weighed according to how well each option helps satisfy the purpose or not; if an option does not, it should not be chosen.

Making design decisions guided by a system’s purpose means that the team must understand what that purpose is. The purpose must be recorded in a way that all the team members can learn about it. It also needs to be accurate: based on the best information available about what the system’s users need, and as complete as can be achieved at the time. The record of the purpose should avoid leaving important parts implicit, expecting that people will know that systems of a particular kind should (for example) meet certain safety or profitability objectives; people who specialize in one area will know some of these implicit needs but not others. The purpose documentation should also include secondary objectives, such as meeting regulatory requirements or leaving space in the design for anticipated market changes.

The understanding of a system’s purpose and costs will shift over time, both as the world changes and as people learn more accurately what the value or cost will be. When the idea for the system is first conceived, the purpose may be accurate for that time but the understanding of the cost is likely to be rough. As design and development progress, the understanding of cost improves, but the needs may change or a customer may realize they misunderstood some part of the value proposition.

A system’s purpose also changes over longer periods of time. People add new features to an existing product to expand the market segment to which it applies or to help it compete against similar products. The technology available for implementing a system changes, creating opportunities for a faster, cheaper, or otherwise better system.

Systems leadership has to balance the need for a clear and complete statement of a system’s purpose with the fact that the understanding of purpose will change over time. The agile [Agile] and spiral [Spiral] management methodologies arose from this need for balance between opposing needs. Later chapters address how systems engineering methodologies can help address this need.

Working in a way that is driven by system purpose requires discipline in the team and its leadership. Many junior- and mid-level engineers are excited about their specialist discipline, and want to get to designing and building as quickly as possible—after all, those are the activities they find fulfilling. I have observed team after team proceed to start building parts of a system that they are sure will be the right thing, without spending the effort to determine whether those parts are actually the right ones. Those design decisions may end up being correct many times, which leads to a false confidence in decisions taken this way (“I’m experienced; I’m almost always right!”). The flaw is that the wrong decisions can have a high cost, high enough to outweigh any benefit from the rapid, unstudied decision.

I have seen many teams say, rightly, that they need to make some design decision quickly, see whether it works, and then adjust the design based on what they learn. This line of reasoning is both a good idea and dangerous. If a team actually does the later steps of evaluating, learning from, and changing the design then this approach can result in good system design. (This is discussed more in later chapters on prototyping.) However, most teams lack the leadership discipline to perform to this plan: once there is some design in place, pressures to keep moving forward drive teams to live with the bad initial design and accept complexity and errors. It requires discipline and commitment from the highest levels of an organization to take the time needed to learn from an early design and change what they are doing. The leadership must be prepared to push back against pressures to just live with a poor design and instead to require their team to take the time to learn and adjust, and to be clear with external parties, such as investors, that the plan is a necessary and positive way to realize a good product.

In a later chapter, we discuss techniques that can help to keep a system’s development grounded in its purpose, while adapting to changes in purpose and learning about the system’s design choices over time.

5.3 System boundary

A system has a boundary that defines what is within the system and what is not. What the system does (its functions) and what it uses to do them (its components) are within the system.

The rest of the world is outside the system. The outside world includes the system’s environment: the part of the world with which the system interacts.

The boundary defines the interface between the system and its environment.

What is inside the system and where the boundary lies are within the control of the project building the system. The project must adapt its work to everything else outside the system boundary.

5.4 System parts and views

Systems are designed and built by people. The methods used to build them must account for two human issues. First, most systems today are too complex for one person to keep in mind all the parts at one time, leading to a need to work with subsets of the system at any given time. Second, most systems also require multiple people to design or build, either because of specialties or the total amount of work involved. This leads to the need to break the work up into parts for different people to work on.

There are two techniques used to address this need. First, systems are divided into component parts, typically in a hierarchical relationship: the system is divided into subsystems, which are in turn subdivided, until they reach component parts that are simple enough not to require further subdivision. Second, people approach the system through narrow views, each of which covers one aspect of the system but across multiple component parts—such as an electrical power view, an aerodynamics view, or a data communications view.

Dividing the system into component parts creates pieces that are small enough to reason about or work on in themselves. The description of the part must include its interfaces to other parts, so that the design or implementation can account for how it must behave in relation to other parts. However, the interface definitions abstract away the details of other parts, so that the person can concentrate their attention on just the one part.

Dividing up the system also allows different people to work on different parts, as long as both parts honor the interfaces between them. The division into parts, and the definition of interfaces, create divisions of responsibility and scope for communication for the different people. This is addressed further in the Teams section (Section 6.3.3).

The hierarchical breakdown of the system into components and subcomponents provides a way to identify all of the parts that make up the system, ensuring that all can be enumerated. It also defines a boundary to the system: the system is made up of the named parts, and no others.
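The enumeration and boundary ideas above can be illustrated with a small tree structure. This is only a sketch: the Component class, the walk and leaves methods, the in_system helper, and the spacecraft breakdown are all invented for illustration and do not come from any particular project or tool.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Component:
    """One node in a hypothetical system breakdown."""
    name: str
    children: List["Component"] = field(default_factory=list)

    def walk(self) -> Iterator["Component"]:
        """Yield this component and everything beneath it, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()

    def leaves(self) -> List[str]:
        """Names of the parts that are not subdivided any further."""
        return [c.name for c in self.walk() if not c.children]


# Illustrative breakdown only; the names are assumptions made for this sketch.
spacecraft = Component("spacecraft", [
    Component("power", [Component("solar array"), Component("battery")]),
    Component("avionics", [Component("flight computer"), Component("radio")]),
    Component("structure"),
])

all_parts = [c.name for c in spacecraft.walk()]
leaf_parts = spacecraft.leaves()


def in_system(name: str) -> bool:
    """Boundary test: a part is inside the system only if the breakdown names it."""
    return name in all_parts
```

Walking the tree gives the complete list of named parts, and anything that the walk does not produce (a ground station, in this invented example) is by definition outside the system.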

Reasoning about views of a system provides a similar and complementary way of managing the complexity of reasoning about a system by focusing on one aspect across multiple parts, and abstracting away the other aspects. This allows different people to address different aspects, as long as the aspects do not interact too much. For example, specialist knowledge, such as about electrical system design, can be brought to bear without the same person needing to understand the aerodynamics of the aircraft in which the electronics will operate.
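One way to picture a view is as a projection: each part carries properties for several aspects, and a view selects one aspect across all parts while hiding the rest. The sketch below assumes a hypothetical Part record, a hypothetical view function, and made-up property values; it is meant only to show the shape of the idea.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Part:
    name: str
    # Aspect name -> properties relevant to that aspect (all values hypothetical).
    aspects: Dict[str, dict] = field(default_factory=dict)


def view(parts: List[Part], aspect: str) -> Dict[str, dict]:
    """Project the parts onto one aspect, abstracting away all other aspects."""
    return {p.name: p.aspects[aspect] for p in parts if aspect in p.aspects}


parts = [
    Part("flight computer", {"electrical": {"load_w": 15.0},
                             "data": {"links": ["radio"]}}),
    Part("radio", {"electrical": {"load_w": 8.0},
                   "data": {"links": ["flight computer"]}}),
    Part("wing", {"aerodynamic": {"area_m2": 16.2}}),
]

# An electrical specialist sees only the parts and properties that matter to them.
electrical = view(parts, "electrical")
total_load_w = sum(props["load_w"] for props in electrical.values())
```

The electrical view here spans two parts and ignores the wing entirely, while the aerodynamic view would show only the wing. The approach holds up only while the aspects stay reasonably independent of each other, as the text notes.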

Sidebar: Non-reductive systems

This approach of defining a system in reductive terms—using parts and structure—is not a formal necessity of systems in general. Rather, this approach is used as a way for ordinary people to define, build, and check systems.

There are numerous examples of non-human processes that have developed complex systems that are not easily explained reductively. Many of these were developed using evolutionary methods, both biological and electronic. Others arise from other optimization and machine learning techniques. These generative design tools have been demonstrated in mechanical and electronic design.

Consider the circuit discussed by Thompson and Layzell [Thompson99]. This circuit was developed by evolving a design on an FPGA, so that the result would distinguish between inputs at two different frequencies. The resulting circuit design achieves its objective, but is not readily understandable by decomposing the design into individual elements on the FPGA; indeed, some cells that did not seem to be used directly nonetheless appeared to be essential to the circuit’s function. Further, the circuit only worked well on the specific FPGA chip on which it was evolved; when moved to another FPGA of the same model, it was reported to work poorly.

While these designs are not readily understood by decomposition, they still must be verified for conformance with their purpose. This starts with a clear definition of purpose, from which the fitness or objective function used in optimization can be derived. For critical systems or components, the objective function must not only specify what the desired behaviors are, but also the undesired behaviors and the behaviors when the system is outside its intended performance environment. In some methods, the objective “function” can be an adversarial neural network that must itself be trained based on the system’s purpose. The result of the generative or optimization method must also be verified against the purpose to check that the result is in fact correct—which can catch errors in building the objective function, or subtle dependencies on environment.
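The distinction between the objective function and the later verification step can be sketched in a few lines. Everything here is an assumption made for illustration: the stated purpose (a tone discriminator with a safe default output outside its operating envelope), the frequencies, the penalty weight, and the two toy candidates. It is not the fitness function used in the cited work.

```python
from typing import Callable, List, Tuple

Candidate = Callable[[float], int]  # input frequency in Hz -> output 0 or 1

# Hypothetical purpose: output 1 near 10 kHz, 0 near 1 kHz, and 0 (a safe
# default) for any input outside the 500 Hz to 20 kHz operating envelope.
DESIRED: List[Tuple[float, int]] = [(1_000.0, 0), (10_000.0, 1)]
OUT_OF_ENVELOPE: List[float] = [10.0, 100_000.0]


def fitness(candidate: Candidate) -> float:
    """Reward desired behavior and penalize undesired behavior."""
    score = 0.0
    for freq, expected in DESIRED:
        if candidate(freq) == expected:
            score += 1.0
    for freq in OUT_OF_ENVELOPE:
        if candidate(freq) != 0:
            score -= 2.0  # assumed penalty weight for unsafe output
    return score


def verify(candidate: Candidate, cases: List[Tuple[float, int]]) -> bool:
    """Check the result against the purpose, using cases the optimizer never saw."""
    return all(candidate(freq) == expected for freq, expected in cases)


def naive(freq: float) -> int:
    return 1 if freq > 5_000.0 else 0


def guarded(freq: float) -> int:
    return 1 if 5_000.0 < freq <= 20_000.0 else 0


HELD_OUT = [(1_200.0, 0), (9_000.0, 1), (50_000.0, 0)]
```

A fitness function that scored only the two desired cases would rate both candidates equally. The out-of-envelope penalty separates them, and the independent verification cases catch the naive candidate even if the objective function had been built incorrectly.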

5.5 Structure and emergence

Decomposing a system into component parts is one part of the system’s design; the other part is how those components relate to each other. The relations between parts define the structure of the system. These relations include all the ways that components can interact with each other, at different levels of abstraction. At low levels, this might be interatomic forces at the molecular level; at medium levels, mechanical, RF, force, or energy transfers; at higher levels, information exchange, redundancy, or control.

The structure needs to lead to the system’s desired aggregate properties, such as performance, safety, reliability, or specific system functions like moving along the desired path or providing reliable electrical service.

The aggregate properties are emergent, and arise from the way the structure combines the properties of individual components.[1] The structure must be designed so that the system has the desired emergent properties and avoids undesired ones. For example, a simple reliable system has a reliability property that arises from two things. One is the combination of two or more components that can perform the same function. The other is the set of interaction patterns among them: each component receives the same inputs, each component generates consistent outputs, the two or more results are combined in a defined way, and each component responds to failure in a defined way.
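To make the redundancy example concrete, the sketch below shows one common way of combining results (majority voting) and the reliability that emerges from the structure. It assumes independent failures and a perfect voter, which is an idealization. Both function names are hypothetical. Real components often share failure causes, so a figure computed this way is typically optimistic.

```python
from collections import Counter
from math import comb
from typing import Hashable, Optional, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Combine redundant results: return the value that more than half
    of the components agree on, or None if there is no majority."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None


def k_of_n_reliability(r: float, k: int, n: int) -> float:
    """Probability that at least k of n components work, assuming each
    works with probability r and that failures are independent."""
    if not 0.0 <= r <= 1.0 or not 0 <= k <= n:
        raise ValueError("need 0 <= r <= 1 and 0 <= k <= n")
    return sum(comb(n, i) * r**i * (1 - r)**(n - i) for i in range(k, n + 1))
```

Under these assumptions, three components that each work 90% of the time yield a majority (at least two working) about 97.2% of the time. That figure is a property of the structure, not of any single component.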

The structure must be designed to avoid unanticipated emergent properties, especially when those properties are undesirable. In a safe or secure system, for example, it is necessary to show that the system cannot be pushed into some state where it will perform an unsafe action or provide access to someone unauthorized. Avoiding unanticipated emergent properties is one of the hardest parts of correctly designing a complex system.

The structure must be well-designed for the system to meet its purpose, and for people to be able to understand, build, and modify it. In particular the structure needs to be:

There are good engineering practices that should be followed to achieve these aims, as we discuss elsewhere in this book.

Finally, the structure determines the interfaces that each component part must meet. Those interfaces in turn determine a component’s functions and capabilities, which guide the people working on the component, as discussed in the previous section.

5.6 Evidence

It is not enough to design and build the system; the team must also show that the system meets its purpose.

The team developing or maintaining the system must be able to show several audiences that the system complies with its purpose: customers, who need to know that the system will do what they expect; investors, who need evidence that their investment is being used to create what they agreed to fund; and regulators, especially for safety- or security-critical systems, who are charged with ensuring that systems function within the law.

The team also needs to ensure that pieces of the system meet the system’s purpose as they are developing or modifying those pieces. They must be able to judge alternative designs against how well they meet the purpose, and once built they must be able to check that the result conforms to purpose.

The process of showing that a system or a component part fulfills its purpose involves gathering evidence for and against that proposition, and combining the evidence in an argument to reach an overall conclusion about compliance. There are many kinds of evidence that can be gathered: results of testing, results of analyses, results of expert analysis, or results from performing a demonstration of the system. These individual elements of evidence are then combined to show the conclusion. The combination usually takes the form of an argument: a tree of logic propositions starting with the purpose and devolving hierarchically into many lower-level propositions that can be evaluated using evidence. The process must show that the structure of the argument is both correct and complete in order to justify the final conclusion.
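The tree-shaped argument described above can be sketched as a small data structure. This is a minimal illustration under stated assumptions: the `Claim` class, the representation of each element of evidence as a single pass/fail value, and both function names are hypothetical simplifications. Real arguments weigh evidence rather than reducing it to booleans.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Claim:
    """One proposition in the argument tree."""
    text: str
    children: List["Claim"] = field(default_factory=list)
    # Evidence attached to a leaf claim: True supports it, False contradicts it.
    evidence: List[bool] = field(default_factory=list)


def is_supported(claim: Claim) -> bool:
    """A leaf holds if it has evidence and none of it is contrary;
    an intermediate claim holds if all of its sub-claims hold."""
    if claim.children:
        return all(is_supported(child) for child in claim.children)
    return bool(claim.evidence) and all(claim.evidence)


def unsupported_leaves(claim: Claim) -> List[str]:
    """Completeness check: list the leaf claims that have no evidence at all."""
    if claim.children:
        return [gap for child in claim.children for gap in unsupported_leaves(child)]
    return [] if claim.evidence else [claim.text]
```

The second function reflects the point that the argument must be complete as well as correct: a leaf with no evidence is a gap in the argument, even if every other proposition holds.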

Pragmatically, arguments about meeting purpose usually follow a common pattern, as shown below. The primary argument that the implementation meets the purpose consists of a chain of verification steps. The implementation complies with a design, which complies with a specification, which complies with an abstract specification, which complies with the original purpose. As long as each step is correct, then the end result should meet the original purpose—but at each step there is the possibility of misinterpretation or missing properties, or that the verification evidence at each step is not as complete as believed. In practice this approach leaves plenty of uncaught errors in the final implementation. To catch some of these errors in the chain of verification steps, common practice is to perform an independent validation, in which the final implementation is checked directly against the original purpose.

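[Figure not displayed: the chain of verification steps from purpose to implementation, with independent validation of the implementation against the purpose]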

Some industries, particularly those dealing with safety-critical automotive and aerospace systems, add an additional kind of evidence-based correctness argument. This is often called the safety case or security case, and consists of an explicit set of propositions, starting with the top level proposition “the system is adequately safe” (or secure) and showing why that conclusion is justified using a large hierarchy of propositions. The lowest-level propositions in the hierarchy consist of concrete evidence; intermediate propositions combine them to show that more abstract safety or security properties hold.

Finally, evidence takes many forms, depending on what needs to be shown. Some correctness propositions can be supported by testing. These typically show positive properties: the system does X when Y condition holds. Some of these conditions are hard to test, and are better shown by analysis or human review of design or implementation. Negative conditions are harder to show: the system never does action X or never enters state Y, or does so at some very low rate. These require analytic evidence, and cannot in general be shown by testing.

We discuss matters of correctness, verification, validation, and the related arguments elsewhere in this book.

5.7 Using this model

The model in this chapter provides a way to think and talk about systems work. As a team begins a systems-building project, it will be gathering information or making decisions that can be organized using this model. The model can help guide people as they work through some part of the system. For example, the system’s purpose is reflected in the emergent behavior of the system, which in turn depends on the structure of how components interact. When the system is believed to be complete, the team should be able to verify that all of the relations indicated by this model are defined and correct. Later, as the system needs to evolve and the team makes changes to the system, this model helps them reason about what is affected by some change.

This model of systems provides a foundation for organizing the work that needs to be done to build the system. The next chapter presents a model for this work of building a system or component. The information about one component is represented in a set of artifacts, and there are tasks that make those artifacts. The structure of the artifacts, and thus of the tasks, is based on the model of systems and components in this chapter.

Part III goes into greater detail about each part of this model.

[1] While there is extensive literature on the philosophical basis for emergent properties (see e.g. [OConnor21]), the case for emergence in human-designed systems is altogether simpler. Supervening properties, such as “safety” or “redundancy” or “performance” can be treated as real, and human engineering follows a model of physicalism in which higher-level emergent properties arise wholly from properties of the lower levels of the system—regardless of whether that interpretation is fundamentally true or not.

Chapter 6: Elements of making a system

29 March January 2024

6.1 Introduction

The previous chapter defined what a system is. In this chapter, I turn attention to how to make that system. “Making” includes the initial design and building of the system, as well as modifications after the initial version has been implemented.

Making the system is a human activity. Building a system correctly, so that it meets its purpose, requires a team of people to work together. Building systems of more than modest complexity will involve multiple people, usually including specialists who can work on one topic in depth and people who can manage the effort. It involves people with complementary skills, experiences, and perspectives. Such systems take time to build, and people will come and go on the team. Systems that have a long life that leads to upgrades or evolution will involve people making modifications who have no access to the people who started the work.

This chapter provides a model to organize and name the things involved in the making of a system—the activities, the actors, and what they work with. Later chapters provide details on each part of this model. This model includes both elements that are technical, such as the steps to design some component, and elements that are about managing the effort, such as organizing the team doing the work or planning the work. Note that this model does not attempt to cover all of managing a system project—there is much more to project management than what I cover here.

The model presented in this chapter only serves to name and organize. I do not recommend here different approaches one can take for each of the elements of the model; only attributes that good approaches should have. Later parts of this book address ways to achieve many of these things. For example, the team that is designing a system should have an organization (a desirable attribute), but I do not address which organizational structures one can choose from.

The assembly of all the parts involved in making a system is itself a system. In those terms, this chapter presents the purpose (Chapter 8) of the system-making system and a high-level concept for how to organize the high-level components (Chapter 10) in that system.

6.2 Objective

This model of making captures the activities and elements involved in executing the project to make or update a system.

The approach used for making the system should:

6.3 Model

The making model has five main elements:

  1. Artifacts: the things created that make up the system and its records
  2. Tasks: the activities that are performed to make artifacts
  3. Team: the people who perform tasks
  4. Tools: things that the team uses in performing tasks
  5. Operations: how the team manages the work to be done

6.3.1 Artifacts

The artifacts are the things that are created or maintained by the work to make the system.

The artifacts have three purposes. First, the artifacts include the system’s implementation—the things that will be released or manufactured and put in users’ hands. The artifacts should maintain the implementation accurately, and allow people to identify a consistent version of all the pieces for testing or release. Second, the artifacts are a communication channel among people in the team, both those in the team in the present and those who will work on the system later. These people need to understand both what the system is, in terms of its design and implementation, and why it is that way, in terms of purpose, concept, and rationales. Finally, the artifacts are a record that may be required for future customer acceptance, incident analysis, system certification, or legal proceedings. Those evaluating the system this way will need to understand the system’s design, the rationales for that design, and the results of verification.

The artifacts should be construed broadly. They include:

Artifacts other than the implementation are valuable for helping a team communicate. Accurate, written documentation of how parts of the system are expected to work together (their interfaces and the functions they expect of each other) is necessary for a team to divide work accurately.

Many engineers focus solely on the implementation artifacts, especially in startup organizations that are trying to move quickly, and do not produce documents recording purpose, design, or rationales. If the organization is successful and the system they are building enters service, at some point this other information will be required: as the team membership turns over, as the complexity of the system grows, or as the team finds flaws that need to be corrected. The startups I have observed have all had to reconstruct such information after the fact; the reconstructed information is less accurate and costs more than it would have if it had been recorded from the beginning.

Finally, the artifacts should be under some kind of configuration management. Artifacts will evolve as work progresses. One artifact may be a work in progress, meaning others may want to review or comment but that they should not count on the artifact’s contents being stable. An implementation artifact may reflect some design artifact; when the design artifact is revised, people must be able to see that the implementation reflects an older version of the design. When the implementation artifacts are packaged up and released, the resulting product needs to have consistent versions of all the implementation parts.
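One of the checks described above, seeing that an implementation reflects an older version of a design, can be sketched as follows. The `Artifact` record, its field names, and the simple integer version numbers are assumptions for illustration; an actual project would normally rely on its configuration management tool for this.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Artifact:
    name: str
    version: int = 1
    status: str = "work in progress"  # or "released" (hypothetical states)
    # For each artifact this one is based on, the version it was built against.
    reflects: Dict[str, int] = field(default_factory=dict)


def out_of_date(artifact: Artifact, repository: Dict[str, Artifact]) -> List[str]:
    """Names of artifacts that have been revised since `artifact`
    was last brought in line with them."""
    return sorted(
        name
        for name, seen_version in artifact.reflects.items()
        if repository[name].version > seen_version
    )
```

For example, if a firmware artifact records that it reflects version 2 of a controller design and the design is now at version 3, `out_of_date` reports the design by name, so that people can see the mismatch before relying on the implementation.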

6.3.2 Tasks

These are the individual activities that team members perform. The tasks use and generate artifacts. I rely on the colloquial definition of “task” and do not try to formalize the term here.

Systems projects usually have vast numbers of tasks. These include tasks for designing, building, and verifying the system; they also include tasks for managing the project, reviewing and analyzing parts of the system, and approving designs and implementations.

There are usually far more tasks to be worked on than people to do them. Tasks also usually have dependencies: something needs to be designed before it is implemented, or one part of the system should be designed before another.

Tasks, in themselves, need to be known and tracked. People on the team need to know what they can be working on, and who is doing other tasks that might relate to their work. Managers need to be able to track what is being done, what tasks are having problems, and ensure that tasks are coordinated and completed.
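A minimal sketch of such tracking, assuming a simple record per task, is shown below. The `Task` fields, the three status values, and the `ready_tasks` helper are hypothetical; the point is only that recording owners, status, and dependencies is enough to answer the question “what can I work on now?”

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Task:
    name: str
    owner: Optional[str] = None
    status: str = "open"  # "open", "in progress", or "done" (assumed states)
    depends_on: Set[str] = field(default_factory=set)


def ready_tasks(tasks: Dict[str, Task]) -> List[str]:
    """Open tasks whose dependencies have all been completed."""
    done = {name for name, task in tasks.items() if task.status == "done"}
    return sorted(
        name
        for name, task in tasks.items()
        if task.status == "open" and task.depends_on <= done
    )
```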

Operations, discussed below, addresses questions of what tasks are needed and which ones should be performed in what order.

6.3.3 Team

These are the people who do the tasks. They are not an amorphous group of indistinguishable and interchangeable parts; each person will have their own abilities and specialties. Each person will also have their own authority, scope, and responsibilities.

The team should be organized. This means:

In addition, the team needs to be staffed with enough of the right people to get work done. This means that people with management responsibility need to know who is on the team and their respective strengths, as well as the workload each one has and the overall plan for moving the project forward.

6.3.4 Tools

These are things that the team uses to get its tasks done. The tools are not part of the system being produced, though they are often systems in their own right. An end user of the system being produced will not use these tools, either directly or indirectly.

The tools include things like:

6.3.5 Operations

Operations is about organizing the work that the team does. Its primary function is to ensure that the right tasks are done by the right people at the right time.

Operations sets up “a set of norms and actions that are shared with everyone” in the project [Johnson22, Chapter 2]. It gives people in the team a shared set of rules and procedures for doing their work, and it uses those procedures to manage a plan and tasks that coordinate that work. When people share a set of rules and procedures, they can each have confidence in how others are working and in the results that others produce.

There are two primary objectives for operations: making sure the work proceeds efficiently, and ensuring product quality. Operations has secondary objectives, including keeping the organization informed of progress and needs.

Ensuring the project runs efficiently implies several things.

Ensuring quality means:

I look at operations through the lens of the tasks that people on the team will do. Operations is about tracking what tasks need to be done, who is working on them, and how those tasks are going. Operations is, in a way, a feedback control system that keeps the flow of tasks running smoothly.

Operations is more than overseeing tasks, however. It is equally about guiding the team through its work, especially in how people should coordinate their efforts. This starts with setting out the guidelines for how work should get done: procedures and process. That leads to planning, which sets the longer-term direction for the project’s work and allows project management to check whether the work is proceeding well. Planning leads to managing the work being done at the moment. All of these depend on information that supports decisions that have to be made.

I use the following model to define the parts that make up operations. This model has a flow from a project life cycle, which is established early in the project and changes rarely, through parts that organize the work, onward to day-to-day tasking. I explain this model in more detail in Chapter 18.

Life cycle. This defines the overall patterns of actions that the team will perform as it does the project. It defines phases of work and how one phase should happen before another. A typical phase is made up of many tasks; it covers (for example) the work of designing some component. The life cycle also defines milestones, which provide planned times when checks on work are done in a phase.

A life cycle pattern says things like: “First work out purpose, then specifications, then design, then implementation. At the end of each of these phases, have a review with one person designated to approve moving forward.”

There are many different life cycle patterns, and usually an organization or a project will need to pick one—and then customize the life cycle to meet its specific needs. Sometimes the life cycle will be determined by external requirements; for example, NASA defines a common life cycle for all its projects [NPR7120].

Procedures. While the life cycle defines in general what to do, the procedures define how to do some tasks. They provide specific instructions for how to do particular actions or tasks. The instructions might take the form of a checklist, a flow chart, or a narrative.

People on the team need to know how to do things that require coordination. While team members should be able to do most of their work independently, at some point they will need to work together. The work will go more smoothly if everyone understands when they need to work together and how to do it.

There are also some tasks that are procedurally complex, even when only one person is involved. For these tasks it is helpful to have written down the steps to perform—which serve in effect as a checklist.

Procedures should be defined for tasks where getting the actions right is critical or where the task is complex. In the example below, checking a document artifact into a repository is simple, but needs to be done correctly. Performing a design review and approval has potentially many steps to go through: communicating the design to others for review, an approval decision by a designated team member, and changing the status of design documents to show that the design has been released. When the life cycle defines a point in the project when something should be checked, such as during a review, procedures ensure that all the needed checks actually happen.

[Figure not displayed: example procedures for checking in a document artifact and for design review and approval]

Documented procedures help the team perform tasks accurately, helping to make sure that steps aren’t missed. They also help the team do those tasks in compatible ways so that one person’s work can build on another’s.

I have seen teams that try to operate without some ground rules for working together. This can work quite well for teams up to three or four people, and when the artifacts they produce do not need high assurance (that is, when what they produce is not safety- or security-critical). On larger teams that have not written down their basic process rules, I have always seen failures to communicate or consult. These failures sometimes led to errors in the system that had to be corrected later once found. Sometimes they led to one person damaging another person’s work, requiring time and effort to recreate overwritten designs.

Documenting procedures also provides a way for the project to learn and improve. If some procedure is not working well, the team can identify which procedure is the problem and then change it. As long as team members then follow the revised procedure, the team’s ability to work should improve over time. Contrast this to not documenting a procedure: some people may have opinions on how to do it better, and they may start doing it the new way, but not everyone will know about the change, and people may forget it after a little while. This makes learning slower and less reliable.

Plan. The plan defines the overall intended path forward to a completed system, along with selected milestones along the way. It is a current best estimate of the general steps needed to move the project toward that goal.

A plan records the approach the team intends to take to build the system. It lays out the phases of work expected, in coarse to medium granularity. In doing so, it records decisions like the flow from specification to design to implementation to verification. It records when the team decides to investigate different ways to design some component, perhaps prototyping some of the ways. It documents expected dependencies and parallelism.

The plan is, therefore, a record of how parts of the life cycle pattern are applied to this specific project. Just as there are many patterns that a project can choose to use, there are many different ways to organize the project’s work. I discuss these choices in depth in Chapter 18.

A plan is not necessarily a schedule. A schedule is usually taken to mean a sequence of events with a high confidence of accuracy and completeness. A plan, on the other hand, reflects the uncertainties that come with developing a complex system. In the beginning, the plan can be specific about a few things in the near term but must be vague about the longer term until enough design has been completed to fill out later work. As a project progresses and more and more becomes known, the plan should converge to something like a schedule.

A plan is broader than a list of specific tasks. It consists of a number of work phases, and dependencies among them. This information then guides the specific tasks, as discussed in the section on tasking below.

Plans are used in prospect, in the moment, and in retrospect. They should provide guidance on what direction the work will likely go in the future, even when that direction has uncertainty. They are used in the present to track what is happening now. They provide a history of what has been done, to understand how the team’s work compares to predictions and to provide accountability for everyone responsible for working on the project.

I have never encountered a project that had a single plan for the whole duration of the work. Plans have always been dynamic. Early in the project, we knew that we needed to develop a concept for the system but did not yet know enough to sketch out the work involved in building that concept. Later we had a general structure for the system, but there were technical questions to resolve; once resolved, we would know what we were building. Later in the project, we would find defects or we would get a change order, resulting in unanticipated work.

Tasking. This is the day-to-day definition of tasks to be done, their assignment to team members to perform, and tracking their progress.

Tasking involves continuous decision-making: the choice of which tasks should be performed next, or which tasks should be interrupted to deal with higher-priority tasks. These choices merge several streams of potential tasks: ones that derive from the nearest parts in the plan; ones made newly urgent by a change in what is known about the system; ones about fixing errors that have been discovered; and tasks related to new outside requests.

The team will need to keep track of both the potential tasks and the ones that have been assigned and are being worked on. This implies record-keeping artifacts.

The criteria for deciding about tasks should be encoded in procedures, as discussed above. The procedure for choosing tasks can be viewed as a control system that responds to project events to affect the set of tasks assigned for work, with the aim of making the project’s execution run efficiently. “Efficiently” means meeting the goals set out above for operations: ensuring that the right work is done, that people aren’t blocked from getting work done, and that the work follows orders or dependencies needed for high-quality work.
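One possible shape for such a task-choosing procedure is sketched below. It merges candidate tasks from the several streams into a single priority queue and picks the most urgent tasks that are not blocked, up to the team’s available capacity. The `Candidate` record, the numeric priority scale (lower means more urgent), and the `choose_next` function are all assumptions for illustration; an actual procedure would encode the project’s own criteria.

```python
import heapq
from dataclasses import dataclass, field
from typing import FrozenSet, Iterable, List, Set


@dataclass(order=True)
class Candidate:
    priority: int  # lower number = more urgent (an assumed convention)
    name: str = field(compare=False)
    source: str = field(compare=False)  # e.g. "plan", "defect", "request"
    blocked_by: FrozenSet[str] = field(compare=False, default=frozenset())


def choose_next(candidates: Iterable[Candidate],
                done: Set[str],
                capacity: int) -> List[Candidate]:
    """Pick up to `capacity` tasks, most urgent first, skipping any
    task whose dependencies are not yet complete."""
    heap = list(candidates)
    heapq.heapify(heap)
    chosen: List[Candidate] = []
    while heap and len(chosen) < capacity:
        task = heapq.heappop(heap)
        if task.blocked_by <= done:
            chosen.append(task)
    return chosen
```

Running this function again whenever a project event occurs (a task finishes, a defect is found, a request arrives) gives the feedback loop described above. A sprint-based methodology would run it at regular intervals instead.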

How the tasking control system works depends on the development methodology used in the project. Agile development, for example, often focuses on making tasking decisions at regular intervals (for each “sprint”); other methodologies focus on making tasking decisions continuously.

Support. The decisions made during operations take into account several kinds of supporting information. These include:

Sidebar: Resource-constrained projects

Traditional project planning approaches grew out of projects, such as building construction, that focus first on time and budget. This kind of project treats the completion date as the driving factor in organizing work, and assumes that in general as many workers can be brought in as are needed to complete the work quickly, and that parallelism between tasks is limited primarily by dependencies between tasks. For example, in building a house, one contractor typically brings in a team to frame the structure, while another brings in a team to add the electrical wiring or plumbing into the structure. Each of these teams can bring in as many people as needed to get the work done, and then those people go on to another construction project elsewhere when their part is done.

This model of project planning leads to tools organized around a graph of dependencies between tasks. These tools usually provide analyses like critical path analysis, which shows the longest path through the graph of tasks and therefore the hardest constraint on how quickly the work can be completed. Planning the project well often hinges on understanding the dependencies between tasks and the critical path through them.
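Critical path analysis can be sketched in a few lines. The version below computes the earliest finish time of every task in a dependency graph and then walks back along the longest chain. The task names and durations in the usage note are invented for illustration, and the function assumes the dependency graph has no cycles.

```python
from typing import Dict, List, Optional, Tuple


def critical_path(durations: Dict[str, float],
                  depends_on: Dict[str, List[str]]) -> Tuple[float, List[str]]:
    """Return (total duration, task names) for the longest dependency chain.

    Assumes the dependencies form a directed acyclic graph and that
    `durations` is not empty.
    """
    finish: Dict[str, float] = {}
    latest_pred: Dict[str, Optional[str]] = {}

    def earliest_finish(task: str) -> float:
        if task not in finish:
            start, latest_pred[task] = 0.0, None
            # A task can start only when its slowest predecessor has finished.
            for pred in depends_on.get(task, []):
                pred_finish = earliest_finish(pred)
                if pred_finish > start:
                    start, latest_pred[task] = pred_finish, pred
            finish[task] = start + durations[task]
        return finish[task]

    for task in durations:
        earliest_finish(task)

    # Walk backward from the task that finishes last.
    node: Optional[str] = max(finish, key=lambda name: finish[name])
    total = finish[node]
    path: List[str] = []
    while node is not None:
        path.append(node)
        node = latest_pred[node]
    return total, path[::-1]
```

For a house-building example with framing (10 days), electrical (4 days) and plumbing (6 days) that both follow framing, and drywall (5 days) that follows both, the critical path is framing, plumbing, drywall, for 21 days. Shortening the electrical work would not shorten the project.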

Most complex technical system projects, on the other hand, do not fit this model well. Each person working on the project needs to understand the context of their work, and there is usually a substantial cost to add someone to the project, largely in their learning how the project works and how the system is organized. The collection of trained people on the team constitutes a valuable resource that the organization tries to keep around to maintain the system or to work on similar systems.

This difference leads to a different approach to planning work. While dependencies are certainly important, there are often many tasks that any one person can work on (and it is common to expect some degree of multitasking). In this case, getting the order of operations precisely right is not as important. It is more important to ensure that everyone can stay busy and that any major dependencies are accounted for.

6.4 Using this model

This chapter has presented a model for thinking about the work involved in making a system. This model, in itself, does not prescribe any particular way of managing the building of a system; it only names the topics that need to be addressed and provides some objectives by which an approach can be judged.

In Part IV, I go into more depth about each of the elements in this model.

Those who manage a project will need to decide how they will go about organizing their work. As I noted earlier, how a project is organized and run is itself a system, and the techniques discussed in this book apply as much to designing and operating the project’s operations as they do to designing, building, and operating the system product. Chapter 5 and Part III discuss the model for what a system is.

Part V discusses how the work of building a system can be organized around the life cycle of a project. Chapter 19 introduces the idea of a life cycle. It also introduces the idea that a life cycle model provides a basis for working out the tasks that need to be done to build the system. Subsequent chapters discuss each of the phases of a life cycle, along with the artifacts and activities that go into each one.

Part VIII discusses ways to organize the team that will do the work.

Part IX presents approaches for planning and organizing the tasks that need to be done.

Chapter 7: Principles for a well-functioning project

3 May 2024

I have been a part of many projects. These projects built a wide range of systems, including specialized small business record keeping, local government IT applications, low-level graphical user interface tools, large storage systems, spacecraft systems, and ground transportation.

Some of these projects went well. They produced systems that were useful for their customers. The systems held up over many years of use, working correctly and supporting their users in ways they needed. The projects proceeded (fairly) smoothly: no major unexpected flaws, teams that worked together well, completion within close to the expected time and resources.

Paraphrasing Tolstoy, all well-functioning projects are alike; each project that has problems has problems in its own way [Tolstoy23]. Though there are several ways for a project to go well, there are far more ways it can go wrong, and it takes deliberate effort to keep a project on the path that goes well.

I have watched many of these projects struggle through problem after problem, most of them self-inflicted. The causes have included poor team organization, lack of design or of a coherent system design, failure to take the time to think through designs, internal organization politics, and many others. The struggles led to canceled projects, startups that had to get extra funding rounds and missed their market opportunity, and unsafe systems being used in public spaces. These were often consequences not just for the people building the system but for their funders and for society at large.

This book was inspired by observing these problems and finding ways to do better the next time.

So what does a project need to do to function well? To develop a useful, safe system, on a reasonable schedule and budget? To keep its team functioning at a sustainable pace, without internal disruptions? The rest of this book seeks to provide some answers.

My general principles fall into four categories:

  1. The project or organization leadership;
  2. The tasks for building the system;
  3. The plans for building it; and
  4. The team that builds it.

For each of these, I will list some principles I have found important to making a project run well, or to keeping it running well.

7.1 Project leadership

I have watched many projects, especially in startup companies, try to create a team of the best specialists: executives who are skilled at fundraising and external relations; an HR person who has a track record at recruiting; someone with marketing skills and connections; and a few engineers who can build the key technical parts of the system. Most of the projects that were staffed only with such specialists have either failed or had serious problems with execution.

These projects had a gap at the center of the work. Everyone is responsible for some piece, but there is no one whose responsibility is to link the pieces together: to build either the team or the product as a coherent system. People in the team generally don’t really understand each others’ work. They have trouble finding how to work with each other. The executives don’t understand the work or the team, and issue instructions that don’t make sense. The team makes poor technical decisions because no one understands how the artifacts they are building must work together.

This gap leaves three needs unmet. First, there is communication and translation between the executive team and the engineering team. Second, there is organizing and running the engineering team. And third, there is maintaining a systems view of the team’s technical work.

7.1.1 Principle: Communication and translation

Have at least one person in the organization who can communicate with people in the executive team, marketing, and engineering, and translate among them.

The executive team is, in most organizations today, a collection of specialists in running the company as a whole: corporate activities, finance, legal, public relations, marketing. I have found this to hold equally for independent companies, especially startups, and for projects that are part of larger organizations. The details may differ but the roles are largely similar.

The engineering team is also mostly a collection of specialists in one area or another, according to the needs of the system being built. They will understand parts of the system, but few of them are tasked with making all the parts cohere so they work together. Most members of the engineering team contribute through specific, deep skills.

The communication need is to represent these parties to each other. The executive team is responsible for setting the overall direction for the project, and the engineering team needs this direction translated into actionable guidance. The executive team is also responsible for high-level safety and security decisions (e.g. what kinds of safety hazards the company will address in its system products), and those decisions then need to be translated into the safety and security engineering processes. In the other direction, the engineering team needs to provide feedback to the executive and marketing teams on the feasibility and cost of different possible feature or market decisions the executive team could make.[1] The project management part of the engineering team also has the information about how work is progressing and can provide information about the time and people needed to reach different milestones.

7.1.2 Principle: Provide staff to run the engineering team’s operations

Designate at least one person to oversee how the team building the system operates. This person (or these people) organizes the team, and adjusts how it operates as the team grows and the work progresses.

An organization is a system, and a team of more than a handful of people will not self-organize in a useful way. I will argue below that this system needs careful design to work well.

I consulted with a small startup that did not have someone responsible for organizing the engineering team. The startup had begun with very few people, who were figuring out the basics of what their company could build. The co-founders did not create an organization below the executive level; instead, they expected that they could all just work together and figure it out. And, predictably, they did not figure it out once they added a few more people to the team and had to specialize.

Johnson [Johnson22] discusses how to organize a growing company, and I recommend her work to the reader. She presents many ideas about what to do to organize a company’s operations. While that book focuses more on the human-oriented parts of operations, such as hiring and performance evaluation, the ideas it presents provide a solid foundation for parts specifically about engineering, such as how to organize design and implementation verification (which are as much a human activity as a technical one).

An organization that is going to successfully build a complex system will need to designate someone as having the primary responsibility for creating and maintaining the team’s structure and patterns of behavior. Either that, or they need to get improbably lucky.

7.1.3 Principle: Systems view of the system

A team building a complex system must have at least one person who is responsible for the system as a whole, not just its parts.

A coherent, working system does not occur by chance. It requires deliberate effort for a collection of parts to work together, and for the collection to fulfill the system’s purpose.

This deliberate effort can be achieved, theoretically, by a group of uncoordinated specialists. However, this amounts to the Infinite Monkey Theorem, where enough workers and enough time can produce any system. For realistic systems, the time required might be many times the projected lifetime of the universe.

In reality, the majority of the engineering team is responsible for parts of the system, not the whole thing. It is not the job of these people to be responsible for the systems view of the whole; nor is it usually their training or experience.

Building a system requires coordination so that the parts work together. This can be achieved by designating one or a few people to be responsible for the coordination, or by having the parts-builders work by consensus. Work by consensus requires skills and time that few people have, unless the team has no more than perhaps five or six members.

Building a coherent system also requires having a way to measure coherence and satisfaction of system purpose. If a team is to work by consensus, all members of the team must have a consistent understanding of these criteria. If a smaller group is responsible for the system as a whole, then fewer people are required to share this understanding.

The shared understanding starts with the purpose for the system. The definition of the system’s purpose is outside the engineering team’s scope; it comes from the customer or their proxies by way of marketing roles (Section 5.2 and Chapter 8). The translation of information about customer needs into an actionable system purpose is the responsibility of a systems role. This includes documenting the system purpose, developing a concept of the system, and writing down top-level system specifications. In doing so, the role works with the executive and marketing teams to confirm that the purpose and concept as developed match what the customer and organization actually intend.

The systems role is responsible for ensuring that the component parts of the system fit together into a coherent system. To meet this responsibility, the systems role is responsible for the design of the high-level decomposition of the system into parts, and how those parts are related—the functional and non-functional relationships (Section 5.5 and Chapter 11). While the systems role delegates the work to design and build the components, the role does check that the results match the specification of how the components interact. The systems role also guides the order of work, especially for how to plan integration.

7.1.4 Principle: The team is a system

A well-performing team is deliberately designed to have a structure that gives each member incentives and support to work together. The team’s leadership establishes the design, and monitors the team’s function to adapt the team structure when needed.

An effective team does not happen by accident. When a team is not given a structure and rules about how to work together, they will find ways to work. They will build up habits in response to a few specific early needs—and those habits will not make for a team that communicates well, cooperates well, or makes good systems decisions.

When medium to large teams try to self-organize, they react to problems they face immediately, and each person determines their response based on their own values and self-interest. The team members are not trained or incentivized to plan the team’s organization for future needs; instead, they find ways to work through individual problems as they come up. The team members in general do not have a view of the entire effort that will be needed to build the system, and so they find solutions based on their specific needs.

Teamwork exhibits variations of collective action problems. [Olson65] These problems occur when a group must work together; each member of the group must contribute in some way, and in return everyone in the group receives some benefit. The optimal strategy for an individual is often at odds with the optimal strategy for the common good. Many commonly-known cooperation problems, such as the tragedy of the commons or the prisoner’s dilemma, are kinds of collective action problems. (In fact an engineering team represents a particularly complex kind of collective action problem, because the contributions of different group members can combine non-monotonically: the value of one person’s contribution to the common good can be negated by another’s contribution.)
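
The tension between individual and group interest can be made concrete with a toy public goods game. This is only a sketch: the model, the team size of five, the effort cost of one unit, and the multiplier of two are all assumptions chosen for illustration, not measurements from any project.

```python
# Toy public goods game: a hypothetical model, not data from any real team.
# Each member either contributes one unit of effort or withholds it.
# Contributed effort is multiplied and the result is shared equally by all.

def payoff(contributes: bool, others_contributing: int,
           team_size: int = 5, multiplier: float = 2.0) -> float:
    """Net benefit to one member, given how many other members contribute."""
    total_effort = others_contributing + (1 if contributes else 0)
    share = multiplier * total_effort / team_size
    cost = 1.0 if contributes else 0.0
    return share - cost

# Whatever the others do, withholding effort pays the individual more...
for others in range(5):
    assert payoff(False, others) > payoff(True, others)

# ...yet everyone contributing leaves each member better off than no one contributing.
everyone = payoff(True, others_contributing=4)
no_one = payoff(False, others_contributing=0)
print(everyone, no_one)
```

In this simplified model contributions only ever add to the shared benefit. It does not capture the non-monotonic case noted above, where one person’s contribution can negate another’s.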

In other words, the natural tendency for a group is to form an organization that is reactive to immediate needs and to individual objectives, rather than the long-term objectives of the project as a whole.

Creating an effective team is, therefore, a deliberate act. It involves working out what the team needs to do as a whole, and then designing a structure for the team. That structure should address:

Maintaining a team’s effectiveness is also a deliberate act: good project leadership monitors how the team is doing and adapts organization or processes when needed. The team organization, or its processes, or its role assignments may work well for a while, but not fit the team’s needs as well later. The project’s leadership may set up a team organization or process and then find it doesn’t work as well as expected.

The organization of a team can be evaluated against the objectives in Section 6.3.3: how well people know how they fit into the organization and how that affects the actions they take.

I discuss matters of designing a team in Chapter 17 and in Part VIII.

7.1.5 Principle: Team habits

A team with good habits and culture can get work done. A team with poor habits will not, except by unlikely random chance.

Whether a team follows procedures and processes depends on whether following them is the norm for the team.

Teams follow habits. Habits and norms provide stability to team members: when they know what to expect, they can get on with their work. This creates an incentive to keep following habits and not change them.

Establishing habits at the beginning of a project is not difficult. Changing them later is quite hard and rarely successful. The leadership of a team has one opportunity to set up a team to follow a process without undue effort. When they squander that opportunity, a project has difficulty from then on. If people in a team do not have a de jure process to follow, they will work out ways to get things done, and those habits will be the default way they work. Those habits are likely to have been worked out in reaction to a few specific, immediate situations and won’t account for the indirect ways that one piece of work affects another, and thus will not meet the project’s needs well.

It is possible to change a team’s habits after the fact. However, it takes time (a lot of it) and effort. The transition from one way of working to another will take time, as people will follow their habits without thinking until new habits set in. People will need constant reminders and incentives to change their behavior. There will be a period when people are doing a mix of old and new, which can increase chaos for a while (and often creates extra work to clean up the differences). People will feel extra stress and often there will be a decrease in morale or civility in the team until they settle into the new norm.

Most of the projects I have worked on over the years have been about innovation. The people who start such projects do so because they are excited about what they can build, whether about the technical aspects or the market aspects. They are motivated to get moving as quickly as they can. They usually are trying to make a prototype or do a demonstration as soon as they can. They do not have excitement about the work of crafting a team; if they need that, they will get to it later when they have the prototype built, or when they have the next funding round…

This tendency is often exacerbated by the way some funders behave. They reward market opportunity and technical originality, which incentivizes a team to build the market case and technology demonstrations as quickly as possible. Funders rarely reward or even evaluate whether the project leadership has the capability to form a well-functioning team. When a team’s ability to execute effectively and efficiently is not valued by the funder, the project’s leadership will not put the effort into crafting the team.

A project’s leadership must incentivize and model following processes in order to build a team’s habits. I am aware of a company that set out anti-corruption processes, including ethical standards and a hotline for reporting violations. The leadership did not, however, make it clear to the employees how these would be acted on, and there was no demonstration of the standards being enforced. The employees correctly realized that the leadership was not serious about enforcing the standards, and it led to significant internal theft.

7.1.6 Principle: Keep it lightweight and actionable

People will use processes that they can figure out how to follow and that clearly give them benefit. Don’t make processes more difficult than what the team can do.

People will generally follow prescribed practices and procedures as long as 1) they know about them; 2) they understand them well enough to perform them; and 3) the practices have high value relative to the effort required.

The first aspect implies that processes and procedures are documented and organized in a way that team members can find them. This also implies that when people join the team, they are taught how to find and use them.

A practice or procedure must be both clearly written and actionable for people to understand it and use it. I have encountered “plans” or “procedures” on multiple projects that amounted to a list of aspirations, rather than a specific set of actions that someone could follow. In one example, a security incident response procedure said things like “we will contact the responsible parties”, without naming who the responsible parties are (or even better, listing them with contact information). Had there been an actual incident, vague statements like this one would have led to time spent figuring out who the responsible parties were, and likely coming up with a wrong answer when under the time pressure of trying to resolve a critical incident.

A process or procedure that requires too much time or effort will lead people to try to create workarounds, usually subverting the reason that a procedure was established. This is the problem of a procedure that people perceive as too “heavy”. Keeping procedures as simple as possible will help. At the same time, some work is simply complicated, perhaps needing several people involved because it affects all of them. When some work is necessarily complex, it is vital to clearly document the process so that everyone involved understands both their own role and what the others involved will be doing.

I will discuss these topics more in Chapter 18, and especially in Section 18.2.1.

7.2 System-building tasks

Most engineers understand the need to use good technical judgment as they build a part of a system, but it is just as important to follow good practices in how the team approaches the work.

7.2.1 Principle: Start with a purpose before doing work

Understand why something is being built—its purpose—before trying to design and build it.

This is one of the most important principles in this book, and it applies in a great many ways.

“Purpose” here means the objectives for some work, the need that is to be met by doing the work or the reasons that it is worthwhile to spend the time and resources involved.

If someone starts designing or building something without understanding the purpose of the work, it is unlikely that what they build will actually meet the need that caused them to start the effort. And even if they do meet the need, perhaps by focusing on the purpose part way through the work, they are likely to have spent time and resources in false directions.

When someone takes on a task, whether to build part of the system or to oversee team operations, it is that person’s responsibility to ensure that they accurately understand the purpose of the work. Ideally they will be told the purpose as part of the task, but the person is still responsible for confirming that they correctly understand the purpose. I have found that taking explicit steps to confirm understanding saves time and effort, even for small tasks.

At the same time, the person who defines a task is responsible for ensuring that there is a clear purpose to the work and communicating that purpose to whoever takes on the task. In other words, establishing the purpose of a piece of work is a communicative act between the person who defines the task and the person who takes it on.

This principle applies to building a whole system. As I discussed in Section 5.2, a system needs a purpose (a customer need, for example) that it will fulfill. This purpose originates with the customer, or whoever will use the system, and with the value that the system will provide them.

The principle also applies to building components of a system. Each component (Section 10.2) has some role in the system: functions, behaviors, or properties that it should have that contribute to the system as a whole meeting its high-level purpose.

Other work also should have purpose. Organizing the team, or maintaining the project plan, or reviewing a component design are all tasks that have purposes. Someone doing these tasks should understand why the organization or review is being done, and they should ensure that how they do the work addresses that purpose even if associated procedures don’t spell out every step involved.

I argue in an upcoming principle that successful projects perform checks to ensure that the work that is done correctly fulfills its purpose. Without a clearly-defined purpose, it isn’t possible to determine whether a design or implementation or plan is correct or accurate.

I discuss how purpose fits into a system-building project throughout the rest of this book. I address the purpose for a system in Chapter 8. Each chapter in Part IV, on how to make a system, discusses the purpose of steps in building a system. As I present more specific topics, such as specifications (Chapter 23) or designs (Chapter 25), I present the purpose for that aspect of system-building before talking about what it is or how it works.

7.2.2 Principle: Evaluate tools before adopting them

Investigate whether tools, procedures, methodologies, designs, or implementations fit the project’s purpose before adopting them.

Every complex system is different from others in some way. The differences may be technical, such as how some component must behave, or they may be operational, such as the kind of team, the organization hosting the project, or the customer’s needs.

Differences mean that things taken off the shelf may or may not address the project’s need. An off-the-shelf electronics board might be a good fit, or it might not be available within the time needed, or it may lack a key security feature, or it may have reliability features that the project’s design does not need (but that do not interfere with how the board will be used). Similarly, a development methodology might address the project’s need for moving quickly and being flexible, but it might not work for a project’s distributed team.

In many cases the off-the-shelf methodology or design can be used in many different ways. The team may need to make choices about which of those ways are helpful for this specific project. The team may also need to adapt procedures or methodologies to fit what this project needs.

A well-functioning project will evaluate something that can be adopted, whether it is a component design or a procedure or a tool, against what the project needs that thing to be. Something that might be adopted can be measured in terms of the benefits of using it, the costs of adapting it to meet the project’s needs, and the costs of using the thing without adapting it. If the benefit outweighs the costs, then the thing can be used. If the thing does not quite meet the project’s need but can be adapted, then an investigation will reveal how to adapt it.
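
As a rough sketch of this comparison, the three quantities can be written down and weighed explicitly. The structure below is one possible reading of the evaluation, not a prescribed method: the class, field names, and decision rule are hypothetical, and the numbers stand in for whatever estimates (person-days, for instance) a team would actually produce.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    """Something the project might adopt; all values are hypothetical estimates
    expressed in one common unit, such as person-days."""
    name: str
    benefit: float            # value gained from using the thing
    adaptation_cost: float    # cost of adapting it to fit the project
    cost_if_unadapted: float  # cost of living with the mismatch if used as is

def evaluate(c: Candidate) -> str:
    """Compare using the candidate as is against adapting it first."""
    net_as_is = c.benefit - c.cost_if_unadapted
    net_adapted = c.benefit - c.adaptation_cost
    if max(net_as_is, net_adapted) <= 0:
        return "do not adopt"
    return "adopt as is" if net_as_is >= net_adapted else "adapt, then adopt"

# Illustrative numbers only.
board = Candidate("off-the-shelf board", benefit=40,
                  adaptation_cost=10, cost_if_unadapted=25)
print(evaluate(board))
```

In practice the estimates are uncertain and some costs resist being put on one scale, so a calculation like this is a prompt for discussion rather than a substitute for judgment.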

Sometimes a project will be obliged to adopt a process or use a component that is not a good fit. In that case the thing should be evaluated so that the team has a clear-eyed understanding of what problems could arise, and they can work out mitigations to avoid the worst problems.

This principle has a serious risk: that it will become an excuse for the Not Invented Here syndrome. No project has the time or resources to invent everything from scratch, especially when reinventions often lose sight of the experience that has gone into building existing procedures or components. A team has to balance using tools that are pretty good but not perfect against the cost of inventing from scratch.

The idea of satisficing applies. This is when one applies a solution that is good enough to satisfy a need, without attempting to find a perfect solution. Writing about how buildings are adapted, Brand observes:

The solutions are inelegant, incomplete, impermanent, inexpensive, and just barely good enough to work. The technical term for it, which arose from decision theory a few decades back, is “satisficing”. It is precisely how evolution and adaptation operate in nature.

Even after generations of satisficing, the result is never optimal or final. […] The advantage of ad hoc, make-do solutions is that they are such a modest investment, they make it easy to improve further or tweak back a bit. [Brand94, page 165]

7.2.3 Principle: Take care with build-versus-buy decisions

Carefully evaluate each choice of whether to design or build something within the project, or acquire it from outside. Be particularly careful about the team’s ability to accurately make this evaluation.

Projects often have choices about whether to design and build something themselves, or to acquire it from somewhere outside the project.

Too often, the choice is made without deliberation. When the wrong option is chosen, the cost can be significant: spending resources to acquire something that doesn’t work well, or to build something that is not very good.

There are reasons to choose to build something inside the project. These include:

There are also reasons to acquire a component.

Sometimes there are overriding concerns in making the decision. If the team does not have someone with the skill to develop the component, it will have to be acquired. If no outside organization offers a component that fits, it will have to be built. If the time to build is too long, then it will have to be acquired, or vice versa.

Other times the decision depends on the costs and benefits of each option.

Two cost considerations are often overlooked. First, a custom-built component can be made to be a perfect match for the system’s needs, while an acquired component may have to be adapted or may have unneeded features (which can become a liability). The cost of adaptation has to be considered in addition to the cost of acquisition. Second, a custom-built component presents opportunity cost as well as the direct cost of building it. If a custom-built component is not essential to the system purpose or the related business purpose, then the resources used to build the component might be better used on something more central to the purpose.

Teams, and individual team members, need to consider their ability to make an objective build-versus-buy decision. I have observed many people who choose to build something new not for sound technical or business reasons, but because they are excited about building that thing. I have seen other cases where someone decided to acquire a component because they were not interested in the effort required to design and build it well. Worse, too often the Dunning-Kruger effect [Kruger00] applies: the person making the decision is not aware of whether they have the knowledge to make an accurate decision, or is not aware of how their biases are driving the decision.

7.2.4 Principle: Follow the spirit, not just the letter

When a project has adopted a procedure or tool, that procedure or tool has a purpose. When using it, keep the purpose in mind and make sure that purpose is met, rather than following a procedure or using a tool blindly.

A well-functioning project does not adopt its procedures or methodologies on whims; it adopts them to serve purposes. In organizations like NASA, the procedure standards represent several decades of accumulated experience. While the procedure may not be written to make the purpose and experience clear, these reasons exist behind what has been written.

I worked on a NASA project that reached its Preliminary Design Review (PDR) milestone. The team followed the long NASA checklist for what should be presented at that review. Unfortunately, the team did not keep in mind what the PDR was actually for: ensuring that the early, conceptual design coheres as a system and showing that the system is ready to proceed to steps that will involve greater investment. Instead they developed material that checked each box on the agenda, without addressing the system as a whole. The reviewers could tell that the design did not make sense; moreover, the review failed to reveal the actual problems that the design had.

A team should document the reasons or purposes for which they adopt a procedure or a tool. Similarly, each person on a team should put effort into understanding why the team has adopted procedures and tools.

7.2.5 Principle: Document things so there is a future

Document both how things work and why they work so that people can understand the system when they work with it in the future.

It is easy to want to design or implement at full speed, keeping focused on the immediate goal: getting the thing built.

That goal misses the larger purpose of building something—that the built thing meets its purpose and specification, and that it continues to do so as the system evolves.

In practice, the initial design and implementation of a component involves much less effort than is spent on checking that implementation, integrating it with other components, fixing bugs, and making changes later. A project that is building a system to succeed in the long term optimizes for all these other tasks, not just the initial design or implementation.

All these later tasks involve understanding the specification, design, or implementation of a component. Understanding means not just being able to see the design or implementation artifact, but also knowing why the component is what it is. This includes documenting the rationales that led to significant decisions about the component. It also includes providing people a guide to understanding the component’s design or implementation, especially if there are subtle aspects to the component that are easy to miss if one is looking just at a design document or an implementation.

When someone is given the code for some component and asked to change some behavior, and that person isn’t the one who initially implemented that component (or they are the same person, but it was a while ago), they begin by building up a mental model of how the component works. Once they have that mental model, they can proceed to work out how to change it. They will think of different ways they could make changes, and evaluate them to see if the changes will have the effect they intend and that the changes will not have some other undesired effect.

Building up an accurate mental model involves working out constraints that led to the component’s design, major decisions about how the component is structured, and how different parts of the component work together to achieve its functions. This information is not encoded directly in software source code or mechanical drawings or circuit designs; all those things are the products of a process that works through those constraints and decisions on the way to producing the design or implementation artifact.

The person who is tasked with changing a component, and thus with building up a model of how that component works, can get information two ways: from documentation or by reverse-engineering it from the implementation artifacts. In practice it is usually best to do both. A circuit design is the truth about how an electrical component works, and so this is the most accurate way to learn about the implementation. However, a circuit design or software source code leaves out the rationale for why the design is the way it is. Having documentation about the design, about why the design is the way it is, and a guide to the implementation will help the person understand the component more accurately and more quickly.

Of course, having documentation only helps if that documentation is accurate. If the documentation doesn’t match how the component was actually implemented, then the documentation will lead someone astray when they try to learn how a component works.

There has been a saying in agile software development that “the code should be documentation”. This is usually interpreted as “the code should be the only documentation”, which is not what the people who developed agile methodologies intended.[2] The point in the agile methodology is that software code is necessarily documentation, and it should be written so that it is clear and readable so that others can read and understand the code.

I have experienced both the advantages of having good documentation and the disadvantages of having no or inaccurate documentation. Many years ago, I developed a multithreading package for a research system. That package included a peculiar thread-synchronization primitive tuned for that specific application; correct implementation depended on some non-obvious code in one place. It took some time to analyze the design to identify that condition, and if I had not written it down I would not have remembered it correctly when I had to modify the package a year or two later. On the other hand, on a personal project I was developing a responsive, single-page web application and developed a combination of JavaScript code running in the browser and Ruby code running in the server to achieve it. I did not document the design, and when I needed to improve it after a couple of years I had to reconstruct the design. I spent much more time than I would have liked on that reconstruction.

7.2.6 Principle: Build in checks

Make independent checks of all critical specifications, designs, and implementations a normal and expected part of project work. Define in advance who will do the checks and when they will do them.

Having one person check another’s work is a basic mechanism for maintaining quality, safety, and security in a system. It applies equally to technical work, such as verifying that a design matches specifications, and to project operations, such as checking that a procedure is working as intended or that team communication is flowing.

Note that this does not mean that developers can avoid writing unit tests or performing design analyses. They should be doing those, and independent checks should be done as well.

There are many advantages to performing reviews or checks:

There are two significant disadvantages that can lead to a team skipping checks. First, checks take time and effort. When a team is pressed for time or short handed, it’s easy to let a check go by. Second, done poorly a review can feel like a lack of trust or like an attack on someone’s work.

Nonetheless, checks and reviews are important enough that a well-functioning project will find ways for checks to happen.

Having checks be a built-in norm for the team helps address the disadvantages. If everyone knows that checks are going to happen, the time and effort involved will be planned for. People will notice if checks are being skipped, and will ask why—helping to ensure that the checks actually do happen. Separately, when everyone’s work is checked, it becomes easier to convey the sense that no one is being singled out or is not trusted.

I discuss how checking can be built into a project’s life cycle patterns in Chapter 18.

7.2.7 Principle: Work against cognitive biases

Take deliberate, ongoing actions to avoid the negative effects of cognitive biases, such as confirmation bias or team echo chambers, and missing or incorrect information.

The work of building a system involves making many complex decisions. These decisions are based on the information that the person making the decision has, along with their skills, experience, and biases.

Incorrect decisions can be made when people work with beliefs or biases that are inaccurate. This leads to concepts or specifications that reflect the errors, and from there to designs and implementation that do not meet system needs. There are many terms for these various situations, including confirmation bias, echo chambers, or recency bias.

These errors arise from many different causes:

These biases can lead to serious system flaws when incorrect decisions are made about high-level system design or safety and security functions.

There is no one method that will eliminate these problems. Indeed, many of these problems are a necessary flip side to cognitive behaviors that have positive outcomes, such as group agreement and pruning a search space when making decisions.

A well-functioning team takes deliberate and ongoing steps to reduce the problems that come from cognitive bias. These address the problems from two directions: prevention, by making complete information available, and reducing occurrence, by building into the project’s procedures methods to avoid or catch problems.

A project can reduce the chances of cognitive bias issues by maintaining complete written records of key information. Information about customer needs (and how those were determined) and rationales for design decisions are most important. Completeness in designs and verification records also helps. When information changes, sharing the change widely, as well as documenting it in writing, helps avoid team members working from outdated assumptions.

Reducing occurrences of erroneous bias involves finding ways to see around the bias into information that would have been ignored or dismissed. This almost always comes from finding a way to see a problem from a different perspective. Training team members to take deliberate steps to try to falsify their hypotheses gives each team member their own improved perspective. Building in reviews where decision rationales are explained to people with different perspectives helps catch biased decisions before they cause errors. Designating someone to be a devil’s advocate in discussions about complex decisions makes it clear that the team is taking the possibility of bias seriously.

Continuous training for team members in their own disciplines and in related ones improves their skills, in addition to what they learn by experience. Greater knowledge and skills help combat the kinds of cognitive bias related to the Dunning-Kruger effect. Training in related but different subjects improves their open-mindedness, giving the team members new perspectives to use in thinking through decisions.

Project leadership has an important role in avoiding problems that arise from bias. Good leadership models behaviors where the leader explicitly looks for falsifying evidence and alternative perspectives. The leadership has the ability to allocate effort to investigating decision alternatives and being the devil’s advocate in discussions. The leadership sets expectations for the rest of the team by inspecting decision rationales to ensure that steps have been taken to address possible biases.

7.3 Plan for building the system

Complex systems, with dense graphs of relationships between their parts, cannot be built without a plan. A project cannot get such a system built by following a random walk through the space of possible tasks. However, plans have often been over-done, trying to lay out a definite schedule where in fact there are unknowns and then having scheduling crises when something runs long or over budget. A middle ground that remains honest about what is known and what is not, that allows flexibility as the project moves forward, and that also guides the work in a consistent direction works better.

7.3.1 Principle: Prioritize work by risk or uncertainty

Put effort into work that carries risk or uncertainty as early as possible.

Common project management practices advocate paying attention to the critical path: the set of tasks that must be completed on time in order for the project as a whole to complete on time. If any one of these critical tasks runs late, the project as a whole will be late. Each task has some measure of slack, the amount that it can start early or run late without delaying the end of the project. If a task has no slack, meaning it must start and finish on time, it is part of the critical path. Most projects have at least one sequence of critical tasks from the start (or from the present) to the end of the project.

This definition of critical path is useful but overly simplistic. It is useful because it gives a way to identify work that can put the project at risk, and once identified that work can get extra attention to make sure it goes as planned. The definition is simplistic because, at least in the basic formulation, it assumes that the graph of tasks and the duration and dependencies of each task are all known.
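
The basic formulation, in which every task, duration, and dependency is known, can be sketched in a few lines. The following is a minimal illustration rather than a planning tool; the function name and the task names in the usage note are hypothetical, and it assumes each task starts as soon as all of its predecessors finish.

```python
from graphlib import TopologicalSorter


def slack_by_task(durations, depends_on):
    """Return {task: slack} for a task graph whose durations are all known.

    durations:  {task: duration}
    depends_on: {task: [tasks that must finish first]}
    """
    graph = {task: depends_on.get(task, ()) for task in durations}
    order = list(TopologicalSorter(graph).static_order())

    # Forward pass: the earliest each task can finish.
    earliest_finish = {}
    for task in order:
        start = max((earliest_finish[d] for d in graph[task]), default=0)
        earliest_finish[task] = start + durations[task]
    project_end = max(earliest_finish.values())

    # Backward pass: the latest each task can finish without delaying the end.
    successors = {task: [] for task in durations}
    for task, deps in graph.items():
        for dep in deps:
            successors[dep].append(task)
    latest_finish = {}
    for task in reversed(order):
        latest_finish[task] = min(
            (latest_finish[s] - durations[s] for s in successors[task]),
            default=project_end,
        )

    # Tasks with zero slack form the critical path.
    return {task: latest_finish[task] - earliest_finish[task] for task in order}
```

For a hypothetical four-task project where "design" (3 units) precedes both "build_radio" (5) and "build_frame" (2), and both precede "integrate" (2), only "build_frame" has slack (3 units); the other three tasks form the critical path. The sketch has no way to express a task whose duration or dependencies are unknown, which is the limitation noted above.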

The critical path method is a special case of the general principle of using risk and uncertainty to inform project planning. In general, what work could lead to the project being delayed, or to the project failing?

There are at least four kinds of risk or uncertainty to consider.

First, there is the risk that some external event will affect the project. A customer might change their needs. Regulation might change, affecting how the system must be designed. A supplier might go out of business and thus not deliver components. Weather might delay an essential testing operation. Some geopolitical event might happen that changes the ability to manufacture an essential part.

Second, there is uncertainty about how to build part of the system. At the beginning of a project, there is neither concept nor design for the system and so the time required to build it is uncertain. As the design begins to develop, there will be some parts of the system that have low technical risk because they involve well-understood problems, such as wheels for a road vehicle. There will be other parts that cannot be built using available designs, such as a spacecraft that needs a low-mass, low-power radio subsystem that can communicate with another spacecraft. If the team can find or develop an appropriate radio, then the project can move forward. If it cannot, then the system design or the mission will require significant re-work. It may not even be possible to meet the customer needs within the time and budget they require.

Third, there is uncertainty about the time and effort required to build something. There may be a likely technical solution for some component, but the difficulty of constructing it may have hidden surprises. The time needed for a supplier to provide a purchased component might not be known until a contract is signed with them. The complexity of testing the integration of certain components and fixing bugs might not be understood.

Finally, there is schedule risk from a “long lead” task or sequence of tasks that will take a long time to complete.

A well-functioning project searches out risks and uncertainties like these and puts attention and effort on them. Deliberately spending effort addressing technical and schedule risks early in a project means that potential problems are addressed when it is cheapest to handle them. Consider finding out halfway through a project that there simply is no component available to fill some need. Addressing this might require a redesign of much of the system—but much effort has already been spent building parts of the system that now must be discarded. This is a waste of resources; more seriously, it presents a problem that all too often leads project management to decide to fudge the solution and build a system that does not work as needed.

This principle requires dedication to examining the state of the project thoroughly and without bias.

7.3.2 Principle: Prioritize integration

Integrate components as early as possible. When possible, integrate mockups or skeleton components before building out the component details.

There is common wisdom that the cost of fixing an error in a complex system generally increases over time, up to the release into production. While the hard evidence for this is lacking, I find general acceptance that this occurs, though with plenty of exceptions.[3] The idea of increasing cost over time has led to methods that successfully catch errors early, including concept, requirement, and design reviews, test-driven development, and automated checking tools.

Studies such as those reported by Leveson [Leveson11, Sections 2.1 and 2.5] suggest that the greatest cause of system failures now comes from design errors related to the interaction of separate components: the robustness of individual components is not the problem, but instead how components work together. This appears to be the case even with requirement and design reviews, which certainly catch many errors before they are implemented.

I have found two methods help reduce integration-related errors.

The first method is to use semi-formal, top-down design analysis methods in conjunction with design reviews. I recommend the STPA method that Leveson presents. [Leveson11] The Mars Polar Lander loss review called out the lack of such analyses as a significant contributor to the loss of the spacecraft. [JPL00, Section, p. 16]

The other method is to organize development around integration, so that the component interactions can be tested (not just analyzed) as soon as possible. This principle means focusing on how components will work together before implementing fully detailed components. This leads to building the system in increments, starting with a collection of stub or skeleton components that implement a few parts of the component behaviors and integrating them together into a partial system with limited capabilities. This partial system is then tested, with an emphasis on seeing if the interactions work correctly. Once problems with the integration are sorted out, another tranche of functionality can be added and tested. Along the way, one always has a partial system that runs.

Integration first has two benefits. First, if the component interactions do not work well, multiple components will be affected by a redesign. Detecting the problem before investing in all the details of the components means less re-work. Second, it is usually easier to test interactions with mockup or skeleton components than with “real” components. One can instrument the mockups to observe detailed states that are harder to observe in a complete implementation. One can also add fault injection points to make it easier to create off-nominal test scenarios.
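
As a rough illustration of the second benefit, the sketch below shows two skeleton components integrated before either has real functionality. All class and attribute names are hypothetical. The radio stand-in records every message it sends (instrumentation) and exposes a flag that forces the next send to fail (a fault injection point), so that the interaction between the two components can be tested in an off-nominal scenario.

```python
class SkeletonRadio:
    """Stand-in for a radio component: implements only send()."""

    def __init__(self):
        self.sent = []                # instrumentation: log of delivered messages
        self.fail_next_send = False   # fault injection point for tests

    def send(self, message):
        if self.fail_next_send:
            self.fail_next_send = False
            raise ConnectionError("injected link failure")
        self.sent.append(message)


class SkeletonGuidance:
    """Stand-in for a guidance component: queues reports and retries on failure."""

    def __init__(self, radio):
        self.radio = radio
        self.pending = []

    def report_position(self, position):
        self.pending.append(("position", position))
        self.flush()

    def flush(self):
        # Send queued messages in order; stop at the first failure and
        # keep the unsent messages for the next attempt.
        while self.pending:
            try:
                self.radio.send(self.pending[0])
            except ConnectionError:
                return
            self.pending.pop(0)
```

A test can set `fail_next_send`, call `report_position`, and check that the message stays queued and is delivered on the next `flush`. The question being tested is whether the two components agree on what happens when the link fails, which is an interaction property rather than a property of either component alone.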

This principle is not one to apply blindly, however. The purpose of integration-first development is to address uncertainty or risk that comes from potential component interaction problems. Some components may have their own internal technical risks, and sometimes it is more important to sort out that risk before addressing component interaction risks. Of course, the ideal would be to address both in parallel.

7.3.3 Principle: Have a long-term plan

Maintain a plan for how to get from the present to a completed system. Detail out the near future; have a concrete but less detailed plan in the medium term; and have a general approach beyond that. Evolve the plan as understanding about the work changes.

Consider the task of planning a route for walking from one place to another. If one has a map of roads or trails connecting the locations, one can search out a path by using a standard shortest-path graph algorithm, which evaluates various parts of paths in an orderly way until it finds a “best” path.
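
For concreteness, here is a minimal sketch of one such algorithm, Dijkstra's algorithm, applied to a small trail map. The function name and the place names in the usage note are made up for illustration; the map is assumed to be a dictionary giving the distance from each place to its neighbors.

```python
import heapq


def shortest_path(trails, start, goal):
    """Return (distance, path) for the shortest route, or None if no route exists.

    trails: {place: {neighboring place: distance}}
    """
    frontier = [(0, start, [start])]   # candidate routes, ordered by distance so far
    visited = set()
    while frontier:
        distance, place, path = heapq.heappop(frontier)
        if place == goal:
            return distance, path
        if place in visited:
            continue
        visited.add(place)
        for neighbor, step in trails.get(place, {}).items():
            if neighbor not in visited:
                heapq.heappush(frontier, (distance + step, neighbor, path + [neighbor]))
    return None
```

With a map in which "trailhead" connects to "ridge" (4) and "creek" (2), "creek" connects to "ridge" (1) and "summit" (7), and "ridge" connects to "summit" (3), the search returns a distance of 6 along trailhead, creek, ridge, summit. The algorithm depends entirely on having the map: every place and every distance must be known before the search begins.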

This is analogous to building a system with few unknowns. One can start by designing the system on paper and checking it out. This approach is a low-risk way to build a system, as long as one can be sure that all of the components can be built as designed and that their integration into a system will work as planned. This situation applies when building a system that has strong similarity to other systems, so that there is an existing body of knowledge about what works. This is the basis for repeatable engineering methods, as evaluated by standards such as CMMI. [CMMI] It is also the situation that led to the waterfall system development methodology.

What if there is no map? What if the terrain in between is unknown, and the distance is far enough that one can’t do something like climb a hill and look?

Most projects that are working to build an innovative complex system have a situation like this. At the beginning, there is no obvious path to follow to get to the desired system; indeed, there may not be any path that gets there if the desired system is not feasible.

The team working on the project needs a plan that will guide their work, giving it a general direction for the long term, some concrete plan for the medium term, and details in the short term. As the work progresses, some of the medium-term work will turn into specific, detailed tasks. Some of the tasks will provide information that fleshes out parts of the general, long-term work into more concrete medium-term work. Sometimes bug reports or change requests create new short-term tasks that change the medium- and long-term parts of the plan.

A plan like this benefits the team. It helps ensure that people get all the tasks done, without some getting missed. It conveys decisions about how work is prioritized, which helps the team work independently. It gives a basis for measuring progress and predicting whether milestones will be reached on time.

The act of maintaining the plan provides the opportunity to think about priorities (such as those in the previous principles) and the dependencies between parts of the work.

A flexible, evolving plan strikes a middle ground between a fixed schedule and a purely reactive tasking approach. A fixed schedule, of the kind often associated with the waterfall development methodology, often either becomes a fiction after a few weeks when unknowns intrude onto the planned perfection, or becomes flexible and then takes effort to maintain, without any discipline for doing the maintenance. A purely reactive approach, which can be seen in Agile methodologies taken to an extreme, risks the team wandering around chasing whatever immediate priority comes along, and then having execution difficulties when some work requires more planning than one sprint’s duration.

Of course real projects rarely take either extreme approach; in practice real projects adjust schedules over time. Having a discipline for maintaining a plan from the beginning helps the evolution proceed smoothly.

7.3.4 Principle: Set up intermediate internal milestones

Define regular internal milestones for showing a part of the system working in an integrated way.

Internal milestones that demonstrate some system function give the team a focus for their work in the medium term.

Each milestone demonstrates a set of system capabilities working, especially if those capabilities involve integrating functionality in multiple components. The milestones include a demonstration of the new capability working, in order to prove that the system is working and to give the team a concrete success to celebrate. Internal milestones like these put the team’s focus on a part of the system, leading to capabilities that are integrated together early. (This approach supports the principle of prioritizing integration, above.)

The functionality for each milestone should represent some significant amount of work. I have scheduled such milestones about two or three months apart. If a project is using Agile-style sprints, the milestone should include the effort from several sprints.

I have often focused these milestones on some high-level system function or on some pathway through the system. In the software effort on one multi-spacecraft project, the first milestone demonstrated that the basic software and communication frameworks functioned in a testing environment. The next milestone showed simple control loops in the flight software working; the milestone after that, collective guidance for the collection of spacecraft. Each milestone built on the work of the ones before it.

Of course, not all of the team need be involved in one of these milestones. Part of the team may be working in parallel on other functions. In the multi-spacecraft system example, other parts of the team were working on spacecraft hardware design, mission design, ground systems, and so on.

There is a risk in this approach: that the team takes too narrow a focus and fails to account for the larger system. Any focused effort, whether for an internal milestone or for something else, must be balanced by consideration of the whole system. In the project above, the systems engineering team kept working in parallel to the software teams in order to ensure that the software designs continued to meet mission needs and would integrate properly with the spacecraft hardware and ground systems.

7.3.5 Principle: Use prototyping safely

Use prototyping to validate a concept or determine if an approach is technically feasible. Never let a prototype escape and become treated as a part of the real system.

Building a prototype of a component or a part of the system is an excellent way to learn about how the component or part can be built, and how it will work. It is also a good way to check that a potential design will meet its needs.

Building a prototype is also one of the more dangerous activities that a team can do while building a system. The risk is that a prototype will appear to function in the way needed and will be treated as if it is an initial version of the “real” component, even though it is not.

A prototype has value when it can be developed quickly, at lower cost than its “real” counterpart. Taking shortcuts, implementing only some parts of functionality, not performing much verification—these are all positive approaches to building a prototype and negative for building a component to be used in the final, deployed system.

One example of what can happen comes from a colleague. He was tasked with building some sample software code that would show developers how one could construct a particular kind of application on a new operating system product. The sample code was intentionally simple; it illustrated a particular flow of activity that an application would need to do. It was not a full application in itself. He took some shortcuts in non-essential parts of the code, making the primary part of the application robust but (for example) making some helper functions non-reentrant because they were not an essential part of what was being illustrated. Unfortunately, after this code was published as part of a tutorial, people began blindly copying the helper functions—even though the example was labeled as illustrative only. This led to other organizations releasing buggy applications because they took the easiest and fastest route to building their application by just copying the helper functions.

I observed another example in an ambitious autonomous vehicle system. The company in question began development of their vehicle by building prototypes of several key systems, both hardware and software. In doing so they learned a lot about the problems they were trying to solve. The prototyping effort did what it should: it provided information about how the system should be designed as well as a platform for experimenting with algorithms (such as some of the control systems). Unfortunately, the company did not label or treat these artifacts as prototypes; they saw them as early versions of the real system. The prototypes allowed them to demonstrate vehicles that could perform some operations to investors. This led to increasing pressure to get more features implemented, and to correct problems they found with the vehicle operations as soon as possible. The prototypes had never been designed for reliability, safety, or security, and early safety analyses found significant flaws. Interestingly, the company did treat their hardware platforms as prototypes, and built a hardware platform that was designed to meet safety and security requirements to replace the early prototype boards.

These examples point to both the positive and negative sides of prototyping. On the positive side, in both examples, developing a simplified version of the system in question helped people understand the problem at hand. The effort to develop the prototype went faster because it focused on only the essential element of what needed to be learned, and omitted aspects that would be needed for a production system. On the negative side, in both cases the prototypes ended up being treated as production ready. The prototypes, having been built without the rigor needed for correct, safe, or secure function, led to flaws in the system products. These flaws increase the cost of building a working system, and they tend to be discovered late in development when it is far more costly to correct them. (One startup company I worked with had to rebuild a third of its project when they realized how much they were spending to try to patch up the prototype-quality software they had written; they had to go through extra venture funding rounds to get their product released.)

Using prototyping, thus, is a necessary and helpful part of building a complex system, but it must be done with discipline that keeps prototypes separate from the “real” system components.

Some project managers have talked with me about solving this by policy: they will have their team build a prototype but they will ensure that the prototype is not used for production, and they will put building a real component into the schedule. Unfortunately I have then seen this resolve fade away quickly as the project begins to run late or have funding issues or have an important demonstration coming soon. These imperatives have always, in my experience, taken precedence over system correctness and even over the longer-term cost and schedule to build a working system.

Prototypes are used more safely when they cannot be used in the real system. For example, people often construct storyboards or slides of the user interface for an application. These storyboards allow the developers and potential users to explore how the interface will work, but they cannot be made into an executable application. Similarly, building a software prototype using languages or tools that cannot be integrated fully into the production system helps keep that software from being used in production. Using prototype hardware that is similar but perhaps in a different form factor allows a team to see if a hardware design can work without risking the prototype being put into production.[4]

7.3.6 Principle: Analyze for feasibility

Analyze a system concept for feasibility before committing large amounts of resources to it.

I have worked on multiple projects that were, in retrospect, infeasible. Project A was trying to build a collection of cubesats to perform a demonstration of cross-link communication between the spacecraft. No radio or flight computer was available that could achieve communication between spacecraft except for a brief period at the start of the mission. Project B involved designing a commercial system for which no commercial business case existed: the system was fundamentally a public good that would not generate a commercial return on investment. A third, Project C, depended on multiple competing government contractors voluntarily developing a shared system architecture, when the rational behavior for all the contractors was to focus only on their own work. Yet another, Project D, depended on secure operating system technology that did not yet exist.

In all these cases, large amounts of money and effort were spent before the projects were canceled.

With hindsight, it is clear that the problems with all but one of these projects could have been detected early. In Project A, basic systems engineering could have created a mission concept of operations and modeled whether available radio and computing hardware was up to the task. The incentive for competing contractors in Project C not to collaborate was clear from the beginning, but the management overseeing the project chose to continue anyway. The missing technology in Project D was identified early but the customer insisted on proceeding.

Project B was the exception. It was defined as a two-year, time-limited exploration of the problem. At the beginning of the project, no one involved knew whether the system was feasible or if there was a business case. Over the course of the project we learned about the nature of the system, including that it produced a public good [5] rather than a private good, and thus it was not a sensible commercial product.

7.4 The team

A project’s people do the work of building the system. The team is itself a system made up of complex parts, and how effectively it works depends on how well it is organized and led. Supporting a team with the structure it needs, and in particular with the communication channels it needs, gives the team a fighting chance of working effectively and working through the difficult problems that will come along.

7.4.1 Principle: Document team structure

Define clear roles and responsibilities for each member of the team. Document and share that information so everyone has an accurate understanding.

As I noted earlier (Section 7.1.4), the team is itself a system. As a system, it has structure—who is on the team, what their roles and authority are, and how people should communicate (Section 6.3.3).

There are many ways projects can structure their teams. The specific choices depend on the nature of the project—the number of people, the range of disciplines involved, whether there is one organization or many.

In a well-functioning project, everyone on the team will have a common understanding of what that structure is. Each person will know who they should communicate with and when. Each person will know what their areas of responsibility and authority are, so that they know when they can make a decision and when they should work with someone else. They also will know who to go to for answers to questions about other parts of the system.

A shared understanding of team structure becomes most important when people find problems to address. If one person finds a problem with the design of a component, they will need to work with the people who are responsible for components sharing functional or non-functional relationships (Section 11.2). If there are interpersonal problems between two team members, the responsibility for escalating problem resolution should be clear.

Clear team structure enables delegation. In a project of more than trivial complexity, the work must be shared among multiple people. Sharing responsibility only works when both parties trust each other: that both will do their part of the work, that both will communicate what should be done and the progress that has been made, and that both will communicate when they find a problem with the planned work. This trust depends on a shared understanding of the rules about responsibility and communication.

7.4.2 Principle: Plan on reorganizing the team as it grows

Adapt the structure of the team as it grows, to reflect the increased coordination needed as the number of interactions increases.

A very small team, of up to around five people, needs little formal structure, because all the people can interact directly with all the others to coordinate the work. A large team needs formal structure, with defined scopes of responsibility and communication paths. In between, the team needs some degree of structure.

As a team grows, it will move gradually from the size where it needs little structure to needing more and more structure. It will reach points where it is outgrowing the structure it has had and needs to change to have a more formal structure. I have observed that teams need to change at around 5, 30, and 70 people.
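
One simple way to see why coordination becomes harder is to count the possible one-to-one communication paths in a team where everyone may need to talk with everyone else. The sketch below uses the pairwise count n(n-1)/2; this is a simplifying model introduced here for illustration, not a measurement, and the function name is hypothetical.

```python
def pairwise_paths(team_size):
    """Number of distinct pairs of people in a team of the given size."""
    return team_size * (team_size - 1) // 2


# Team sizes at which a change in structure has been observed to be needed.
for size in (5, 30, 70):
    print(f"{size} people: {pairwise_paths(size)} possible pairs")
```

Under this model a team of 5 has 10 possible pairs, a team of 30 has 435, and a team of 70 has 2415. A formal structure reduces the number of paths that each person must actually maintain.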

In a well-functioning project, the leadership monitors the team’s performance to detect when the team is reaching a size where it needs a change in structure.

Some of the signs that a team needs to move to a more formal structure include:

7.4.3 Principle: Have shared procedures

Document procedures that everyone on the team will use for important tasks.

Procedures define how people perform certain tasks (Section 6.3.5 and Section 18.4). These procedures should be documented and easy for everyone on the project to find. The team should have a cultural norm of following the procedures—not just the letter of the procedure, but the spirit of it as well.

People working together means one person does part of the work, then another builds on their work. For this to succeed, people need confidence that the work they build on has been done properly. Part of that assurance comes from having shared procedures and having a team norm that everyone is following those procedures.

Some procedures are simple lists of steps or checklists. For example, if a team is using a shared artifact repository like git, everyone needs to follow conventions about how to check in work, maintain branches, and baseline versions (such as by merging into a main branch). If someone does not follow the procedures, then the state of the repository can become damaged.

Other procedures are more complex. Completing a Preliminary Design Review (PDR) in the NASA lifecycle (Section 20.2.1) means that the project is ready to commit money and resources to begin detailed design and, later, implementation. This is a check on the whole project, not just on the design of one part. Passing the review implies that many project artifacts are completed, at least to a preliminary level: cost and schedule baselines, security and export control plans, orbital collision and debris avoidance plans, specifications to at least three levels, technical concepts, operational concepts, and many others. If the project continues when some of these artifacts are not actually complete, then the project is likely to have serious problems later. (This was the case on a NASA project I worked on.)

7.4.4 Principle: Define regular communication paths

Document regular times and media for team members to communicate with each other.

The work the team does is interconnected. A decision about one part of the system affects other parts, following the system structure relationships. The decisions are based on information that, in turn, comes in part from the other parts of the system. Others on the team are responsible for ensuring that the project is making progress, including detecting when something is not going as expected.

Regular communication ensures that this information is pushed to the people who need it. A well-functioning team knows when to share information (such as times when decisions are being made), and who to share it with (the people whose work it will affect). Such a team will also avoid pushing information to those who do not need it. This avoids inundating people with useless information and thereby obscuring information they do need.

To achieve this, make sure that the project’s operational procedures include defined points when team members are expected to communicate. These points might include when someone starts on the design for a component, when changes are proposed for an interface, and when a component’s design or implementation is ready for review and approval.

Other team members need regular communication for other purposes. Status updates provide information to update the project’s plan. Other communication ensures that the team is working well, helps project leadership keep a finger on the team’s productivity and satisfaction, and provides a way for everyone on the team to learn the project’s overall goals. Johnson discusses communication as a foundation for team functioning [Johnson22, Chapter 2] and how communicating feedback is essential for keeping team members working at their best. [Johnson22, Chapter 5]

7.4.5 Principle: Define exceptional communication paths

Define and document clear expectations about when and how someone will raise issues with others. Make this an essential part of the team’s cultural norms.

Delegating and sharing work are essential to a team that is building a complex system, and both are based on mutual trust. One part of that trust comes from each party doing their work well, following the project’s procedures and the team’s norms. The other part is being able to trust that people will communicate when there is a problem. (See Section 17.1 for more on this.)

There are many things that can go wrong. Someone can find an error in a specification or design. They can find that they don’t have the resources or skills to complete some task. People can have disagreements that they cannot resolve. A supplier can be late providing some component.

When these things happen in a well-functioning team, people will communicate—not keep the problem to themselves. The project’s operational procedures should make it clear how to handle some of these cases. For example, when someone finds a design error, they work with the person responsible for the design to find a solution, and they let others doing work that could be affected by the design change know. Ideally, they will ask for feedback from these other people to make proposed changes work for related parts of the system.

Communicating about exceptional situations only works if both the person raising an issue and the recipients can trust that the message will be heard and acted upon, and that all the parties involved will handle the matter respectfully. Much has been written about how to create an environment where this happens—see Johnson [Johnson22], for example—and I will not try to add to what others have written.

7.4.6 Principle: Train team in communication skills

Communication is only effective when information passes accurately among the participants, and when everything that needs to be communicated gets heard. Effective communication is a skill that can be learned.

There are many ways communication can go wrong. One person can say something and the other person can understand something different. Something can be said that causes the hearer to have an emotional reaction that interferes with hearing and understanding. Two people can be trying to exchange multiple pieces of information, but things interfere and some key information doesn’t get shared. Someone can have something important to say, but withhold the information out of fear of an inappropriate reaction from the person who needs to hear it.

In safety-critical environments, such as air traffic control, pilots and controllers talk using a pre-defined vocabulary, follow pre-arranged patterns for who can talk when, and each party always reads back key information to confirm correct understanding. [JO711065] These rules have been developed over the years to ensure that each party can speak when they need to, that everyone involved will understand what is said in the same way, and that key information is checked.

A well-functioning team has a shared culture of communication practices. These practices include many of the principles found in ATC communication, such as careful definition of terms and reading back or paraphrasing to confirm what has been heard (sometimes called active listening). In addition, people will have uncomfortable things to say and hear while working to build a system and the team’s communication practices will have to handle messages that could trigger emotional reactions without breaking trust within the team. The communication practices also should encourage regular communication to actually happen rather than people forgetting to talk to each other.

There is a lot of useful information available in books, courses, and classes on how to improve communications within a team.

7.4.7 Principle: Provide independent resources for checks

Explicitly organize the team so that people have responsibilities for checking others’ work, including through reviews and by doing testing. Manage relationships in the team to keep the checking from being taken personally.

Building checks into the work plan is a principle listed above. That principle requires having team members available to do those tasks. Having someone who did not do the design or implementation perform the checks improves the odds of finding a problem, because they do not share the implicit assumptions or biases of the designer or developer. This implies that a well-functioning team will be staffed to provide for independent checks, and that some team members know they will be responsible for checks.

It is easy to underestimate the effort required for reviews and tests. Doing a meaningful design review takes significant effort, because the reviewers need to actually understand the design—not just look for particular easy-to-find markers that might indicate a problem.

I have heard many opinions about how much of a team’s effort should be allocated to reviews and checks, anywhere from half the effort to a small fraction. My own experience has been that the teams where about one-third of total effort was allocated to reviews and testing had better outcomes than the teams with less effort available. The appropriate fraction of resources is likely dependent on many factors not yet appreciated.

Reviewers and testers can end up having an adversarial relationship with designers and implementers, and so the way reviewing and testing tasks are allocated requires some care. In one organization I worked with that had permanent testing teams separate from developer teams, the developers looked down on the testers and relations between the teams were sometimes difficult. While some tension is useful so that the work remains independent, careful management will monitor the relationships and work to ensure that the interactions between developer and checker do not become personal and that the skills required for both roles are honored.

[1] In one company, I worked closely with the marketing team. This included going to meetings with customers and participating in the back room during focus groups. Working together was fruitful: I could answer technical questions from the marketing team in real time, and I could hear what the customers were saying first hand. At the same time, I got a perspective from the marketing team of how the product we were designing needed to be positioned. This collaboration was one of the best examples I have of how a translation role improves a company’s work.
[2] Martin Fowler states: “[…] I feel the need to rebut a common misunderstanding. Such a principle is not saying that code is the only documentation. Although I’ve often heard this said of Extreme Programming – I’ve never heard the leaders of the Extreme Programming movement say this. Usually there is a need for further documentation to act as a supplement to the code.” [Fowler05]
[3] I have tried to find studies that back up the common claim that the effort required to fix errors increases by an order of magnitude with each phase of development. I found several on-line discussions that said that while this claim seemed to be more or less true, all citations appeared to lead back to one IBM-internal presentation that could not be verified. Many of the discussions ended with the claim that it was too hard to define “error” and “cost” in consistent ways, and so there might never be quantitative evidence.
[4] Spacecraft avionics development often involves building a “flat sat”: a collection of electronics boards laid out on a test bench. Sometimes these boards are development samples from a vendor, or early prototype boards before the team has finalized a design. The flat sat can be used to determine whether the boards can communicate as expected; they can have test probes attached in ways that production hardware might not allow; they can be connected to computer systems that emulate external inputs and outputs. Nobody ever expects the flat sat to get flown, because it’s just a collection of boards and cables; it doesn’t look like the real thing.
[5] A public good is something where the benefit accrues to everyone; it is not possible to exclude anyone, and providing the good to one person does not decrease the good available to others. [Hardin20, Section 2] Fire protection and lighthouses are examples. A private good is one where access to the good can be limited and providing it to one person means it is not available to another. Owned goods are an example.

Part III: Systems

A detailed model of what systems are. This includes

Chapter 8: Purpose

17 August 2023

8.1 Introduction

Creating a system requires time, effort, and many other resources. The result of spending those resources should be worth the expenditure: the system should do something useful for someone.

This is another way of saying that the system should have a purpose, and that the purpose should be expressed in terms of what the system can do for the people or organizations that will depend on it. This definition of a system’s purpose means that it depends both on what the system does and who it does it for; both must be worked out to be able to accurately reason about a system’s purpose.

The list of who the system is for should be expansive, including everyone who has an interest in the system. This includes the system’s users, who will need to benefit directly from what the system does. It also includes the people or organizations who build and maintain the system and their investors, who will need to get benefit from the effort and resources they put into making the system. It includes others, such as regulators or industry groups, who represent the public interest in avoiding dangerous activities. This list amounts to the (often-abused) term stakeholder, interpreted broadly.

Each of these stakeholders will have a different interest in the system. The needs of each stakeholder must be discovered and recorded. Users derive benefit from the system’s explicit behaviors. Builders and funders derive benefit from compensation for the system, and in the longer term from the potential opportunity to evolve the system, provide it to others, or develop new systems. Regulators, industry groups, and the public derive benefit from how the system affects the world at large in terms like safety, fairness, or security. All these needs must be satisfied, and they cannot be satisfied reliably unless they are known.

8.2 Why purpose matters

Purpose provides a basis for decisions about whether something is worth doing, or to choose among different ways to do something. It guides the design and implementation: each part of the design can be judged on whether it adds to meeting the purpose or not. The sum can be judged on whether it meets all or enough of the purpose to justify building or deploying the system.

This principle applies to parts of the system as well as to the system as a whole. Each part has a purpose that it needs to fulfill in order for the system to fulfill its purpose.

Purpose matters because of what happens when one does not give it enough consideration. I illustrate this with two examples, from among the dozens I have encountered.

Early in my career, I was tasked with building software that would be used by machine shop workers to process repair work orders and manage parts inventory. This system would be installed on minicomputer systems with terminals around the shop. I had what I thought was a clever idea for the user interface, based on the ideas of non-modal UIs that were beginning to enter the world in the early 1980s. The result met all of the functional requirements needed—and was completely unusable. I had focused on building something I thought surely would be good without doing the work to understand the needs of the shop workers who would use the system.

More recently, I worked with a startup that was building a software system to control a small vehicle. The software designer had decided that the foundational software infrastructure should provide an event loop mechanism, where the infrastructure would cycle at some frequency, and in each cycle would call functions to read sensor data, perform computations, and write commands to actuator devices. This is a common design pattern for this kind of system, and a reasonable starting point. However, when the designer was asked how they envisioned this being used to implement PID controller logic, it turned out that they had not ever considered what a controller would need and many necessary capabilities were missing. By the time the first version of the system was released for deployment, the vehicle had no control systems implemented.
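To make the pattern concrete, here is a minimal sketch in Python of a fixed-rate loop hosting a PID controller. It is not the startup’s design: the names (PID, run_cycles, read_sensor, write_actuator), the gains, and the toy plant are all hypothetical. The sketch only illustrates the kinds of things a controller needs from the infrastructure, such as a known time step for each cycle, state that persists between cycles, and a setpoint.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PID:
    """Textbook PID controller. Gains here are illustrative, not tuned."""
    kp: float
    ki: float
    kd: float
    integral: float = 0.0      # accumulated error, carried between cycles
    prev_error: float = 0.0    # last error, carried between cycles

    def update(self, setpoint: float, measured: float, dt: float) -> float:
        error = setpoint - measured
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def run_cycles(controller: PID, setpoint: float,
               read_sensor: Callable[[], float],
               write_actuator: Callable[[float], None],
               dt: float, cycles: int) -> None:
    """Hypothetical event loop: read, compute, write, once per cycle.

    A real loop would wait for the next tick; here dt is passed in so
    the sketch runs instantly.
    """
    for _ in range(cycles):
        measured = read_sensor()
        command = controller.update(setpoint, measured, dt)
        write_actuator(command)


# Toy plant (an assumption): velocity changes by command * DT each cycle.
DT = 0.1
state = {"velocity": 0.0}


def read_sensor() -> float:
    return state["velocity"]


def write_actuator(command: float) -> None:
    state["velocity"] += command * DT


run_cycles(PID(kp=2.0, ki=0.5, kd=0.1), setpoint=1.0,
           read_sensor=read_sensor, write_actuator=write_actuator,
           dt=DT, cycles=500)
print(f"velocity after 500 cycles: {state['velocity']:.4f}")
```

Even this toy version shows that the loop must hand the controller a time step and let it keep state across cycles. Those are the sorts of needs that surface only when someone works through how the infrastructure will actually be used.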

The common thread in these examples is that in neither case did the person responsible work through the system’s purpose in order to ensure that what was built would be useful. Instead, the designs were based on an unvalidated belief about the right design, and the choices resulted in unusable implementations.

In both cases, a significant amount of time was spent building a system that did not work. In both cases the resulting system could potentially have been redesigned and reimplemented, but building the wrong thing had used up the available time and delivery deadlines were close by the time they were finished. In the case of the shop management system, the project subsequently failed as a result. In the vehicle control system, at the time of writing it remains to be seen if the team can get funding and time to correct the errors.

Both examples would probably have turned out better if effort had been put into a proper articulation of what the system needed to provide before anyone went into depth on design.

I discuss gathering information about purpose and documenting it in ! Unknown link ref.

8.2.1 Not monolithic or fixed

While it would be nice if purpose could be defined once and then remain fixed for the life of the system, this rarely happens.

First, a system’s purpose is rarely fully understood, especially in the beginning of a project to build the system. A team can begin by talking to potential stakeholders and finding out what they need, but inevitably someone will realize some important system behavior well after design or implementation are in progress. Not all of the stakeholders may be apparent at the beginning: for example, in one project I worked on, insurers turned out to be an important stakeholder, but we didn’t appreciate that for quite some time. A team must expect that their understanding of a system’s purpose will be rough at the start and become more accurate over time.

Second, purpose is not usually monolithic: there are many things that could be part of the system’s purpose, and usually people want many more things than are practical to build. The list of potential features usually has to be narrowed down from a long list of user or stakeholder wishes to a short list of the most important features—perhaps with a plan to add more capability over time. This means being able to separate the different features or properties and rank them by importance and achievability. A team must expect that items will be added to and removed from a system’s agreed-upon purpose as time goes by.

Finally, needs change. If a project to build and deploy a system takes a few years, the world in which it is deployed will likely be different from the world when the project started. Available technology may have changed, the users’ market may have shifted, or new regulations may have come into force.

The result of these conditions is that a system’s purpose is not fixed, and the team building the system must be prepared for these changes. Being prepared means regularly checking for changes in stakeholder value and recording what is learned. It means using design and development processes that can adapt to these changes when they happen. And it means a management commitment to managing change honestly, pushing back on user requests when needed and supporting the development team when changes need to be made. It also means that an organization must be prepared to end a project when the system’s intended purpose no longer has enough value to its stakeholders.

! Unknown link ref discusses how to gather information about purpose, and how to work with that as the understanding changes.

8.2.2 Inconsistent or conflicting purposes

Having multiple stakeholders usually means that two or more stakeholders will have incompatible needs or desires. Even a single stakeholder may have conflicting desires.

This can cause two problems. First, conflicting needs make it hard to design a system that meets its purpose. Second, conflicting objectives make it harder to rank and choose among potential system objectives.

There is no simple recipe for handling such inconsistencies. One first has to recognize when an inconsistency or conflict exists, which requires understanding what all the stakeholders are saying and understanding the implications of that information. Then one has to work with the stakeholders to find a resolution—be that a negotiation that produces a compromise, or a realization by one party that their needs cannot be met. This can lead to difficult discussions, especially with customers: it is hard to tell a customer that current regulations make some feature they strongly desire illegal.

8.3 Explicit purpose

A system’s or component’s purpose can be separated into explicit and implicit parts. I use a simplified eVTOL aircraft as an example to explain these.

The explicit part is what stakeholders who will directly use the system say they need. This includes:

The stakeholder can only rarely specify exactly what they want. They may have a general idea, but it often requires several discussion sessions for them to express the idea clearly. The team eliciting the purpose from them usually needs to employ active feedback techniques, providing the stakeholder with an interpretation of what the team thinks they have said in order to validate that they have correctly understood the needs.

See ! Unknown link ref for more about different kinds of stakeholders and projects, and what must be done to learn about each kind.

8.4 Implicit purpose

A system’s implicit purpose comes from stakeholders who are involved in the system but are not its direct users.

8.5 Using purpose

A system’s purpose must guide its design and development. This means that the purpose provides the standard on which design and management decisions can be made. There are several activities in system development that depend on purpose.

First, a project must actively gather and validate its understanding of the system’s purpose. This activity must be explicitly planned for, and sufficient time and resources provided. The resulting information should be validated with the customer and recorded in artifacts that can be referenced throughout the life of the system.

Second, the desired purpose is almost always more complex than what can be developed feasibly at first. The initial desires need to be ranked and pared down to what is essential.

Third, every project has a “go-no go” decision checkpoint, when the team decides whether to proceed with building a system or not. The fundamental question is whether a system can be built that meets all its important purposes, and this requires an analysis to determine whether that is feasible. Is it likely that a system can be built that meets the customer needs? And that will provide necessary compensation to the organization that builds it? Will other stakeholders agree to it? If the answer is no to any of these, then the team should not proceed further in building the system.

Next, purpose should guide design and implementation decisions. Each part of the system must play a role in meeting a stakeholder need, and the team should be able to articulate how it does so. If some part does not support the system purpose, it should not be built. If there is a choice to be made between different design or implementation approaches, the one that best meets the system’s purpose should be the choice. Moreover, the team must be able to explain how each of these choices was made. Chapters ! Unknown link ref present methods for ensuring this happens.

Finally, the system’s design and implementation should be checked against the decided purpose. ! Unknown link ref discusses system validation and acceptance.

Sidebar: Summary

Chapter 9: System scope

20 May 2024

9.1 Introduction

A system’s purpose defines why it exists—the reasons it might be built.

What the system is comes next. This is a high-level view of what the system is and will do—and not how it does those things.

The definition of what a system is starts by defining the boundary between the system and the rest of the world. There are things that are part of the system, which I will call the system’s scope. The rest of the world provides an environment in which the system operates. Interactions between the system and its environment take place at the boundary between them.

The things within the system’s scope are what is being built, and thus under the control of the team making the system. This includes the functions, behaviors, and qualities of the system that are visible from outside the system. These are interactions between the system and its environment across the boundary between the two. These interactions should fulfill the system’s purpose.

What is in the environment is not under the builders’ control. The team building the system should understand these things, but cannot change them.

The environment includes the things that interact with or use the system. This includes things that go in and out of the system, physical places where the system operates, and the ambient environment (atmosphere, electromagnetic forces, dust, vibration, or radiation). The environment also includes those who will use the system, and thus define the purpose for the system to exist.

A caution: the system’s scope covers what the system does, and does not address how the system does that. Matters of how the system is designed to meet its scope are separate.

Sidebar: Keeping specification and design separate

One often reads statements like “good practice is to keep specification separate from design” or “requirements should not address the how, only the what”.

Why is this good practice?

The separation comes from the difference in the tasks involved in working out what something should do and how it should do it. Working out a system’s purpose or scope is a matter of working with a customer, real or potential, to learn about needs in the world outside the system-building project. Everything relating to the purpose or scope should come from other people and organizations—the team may choose which needs they will try to meet, but they cannot in general act as if the customer wants something different from what the customer actually wants. The design, on the other hand, is about figuring out what kind of system design will meet those needs. All the decisions about the system’s design are within the control of the team, as long as those decisions end up supporting the customer’s needs.

In other words:

Purpose and scope come from the outside, not from within the project.

Design and implementation are for the project team to work out.

When design decisions are mixed into scope or specifications, it is often a sign that the team has skipped over some of the deliberative steps of working out why some design is the best choice and jumped directly to a conclusion. This also can impose false constraints on the team: I have seen people avoid looking at design alternatives because they believe that some design decision came from a customer and can’t be changed.

Mixing scope or specification and design also can cause problems later, when the system is modified. Someone working out how to change a system needs to know why certain design decisions were made in order to understand the effects of changing the design. When specification and design are mixed, people often don’t record the rationale behind the design decision and the people who later need to understand the rationale have to guess.

9.2 Why scope and boundary matter

Building a system starts with working out the system’s scope. All of the specifications of the system are a refinement of the scope, and all of the design follows from that.

At some point early in a project, one will want to know how big the effort to build the system will be. This depends on knowing what the system will be.

Knowing what is in the environment—and thus not changeable by the team building the system—defines constraints on building the system.[1]

Finally, defining a system’s scope, boundary, and environment provides a way to check that the team understands the customer’s purpose properly by asking the customer to review the scope and boundary.

9.3 Content

The what of a system is the root of all the design of the system and its pieces. As discussed in the next chapter, the model of the whole system is the root of a hierarchy of component parts that define how the system is made. That chapter provides a model for how to define each component, including the system as a whole.

The team will use documentation of the system’s scope and boundary over and over as they build the system, meaning that the information should be organized in artifacts that people can readily find and understand. (See Chapter 15 for discussion of what this implies.)

The system’s scope includes a few kinds of information: a concept, objectives, constraints, assumptions, and environment.[2]

A concept for the system provides a description of what the system will do. The concept is generally narrative, telling stories about the system. The concept should include major usage scenarios, for how the system’s customer will interact with the system and how the system will interact with its environment. I have often used a few diagrams to illustrate the concept. People looking at the concept should come away with an understanding of generally what the system will do and, equally important, what is not in the system’s scope.

Objectives (or goals) are a more organized way to present similar information. This takes the form of a list of the things people want out of the system: its behaviors or functions, and its properties. These will be general statements, and the process of developing a specification of the system will refine these into something more precise.

Constraints list limitations on acceptable system designs. The constraints do not establish what the system does, but only constrain how well it does those things. Many constraints relate to safety or security. For example, the system might need to meet some safety standard. Initial constraints will be general, and they will be refined as the system’s specification and design are worked out. Many constraints lead to analyses that work out in greater detail what these constraints imply.

The assumptions record information that affects how the system can be designed but that might be forgotten or might be missed. (This is similar to making objectives explicit or leaving them implicit.) This is often organized as a list. The assumptions guide later design decisions.

Finally, the environment lists information about the world in which the system will operate. The environment constrains how the system can operate or how it must be designed: the system may need to accommodate a certain level of vibration, for example, or to cope with cellular radio coverage that varies over the region where a vehicle will operate.
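As an illustration only, the five kinds of scope information could be held in a simple record so that gaps are easy to spot. The sketch below is in Python; the class name SystemScope, the method missing_sections, and the example entries are hypothetical and are not a format prescribed in this book.

```python
from dataclasses import dataclass, field, fields


@dataclass
class SystemScope:
    """Hypothetical record of the five kinds of scope information."""
    concept: str = ""                                      # narrative description
    objectives: list[str] = field(default_factory=list)    # desired behaviors and properties
    constraints: list[str] = field(default_factory=list)   # limits on acceptable designs
    assumptions: list[str] = field(default_factory=list)   # things easily forgotten
    environment: list[str] = field(default_factory=list)   # facts about the outside world

    def missing_sections(self) -> list[str]:
        """Names of sections that are still empty, in declaration order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Example entries are invented for illustration.
scope = SystemScope(
    concept="A small vehicle that carries parcels between two sites.",
    objectives=["Deliver a parcel without operator intervention"],
    constraints=["Meets the applicable safety standard"],
    environment=[
        "Vibration while driving over rough ground",
        "Cellular radio coverage varies over the operating region",
    ],
)

print(scope.missing_sections())
```

In this example the assumptions have not yet been written down, so a review of the scope would flag that section as still needing work.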

9.4 Using scope and boundary

The scope and boundary are a realization of the system’s purpose. The record of the scope should be traceable to the purpose. The team uses the scope and these traces to check that the definition of scope meets all of the system’s purpose, and that there aren’t significant parts of the scope that are not based on some part of the purpose.
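The two checks described here can be sketched as simple set operations. The Python below assumes, purely for illustration, that each purpose item and each scope item has an identifier and that each scope item records the purpose items it traces to. The function name check_traces and the identifiers are hypothetical.

```python
def check_traces(purpose_ids: set[str],
                 scope_traces: dict[str, set[str]]) -> tuple[set[str], set[str]]:
    """Return (purpose items no scope item covers,
               scope items that trace to no known purpose item)."""
    # Every purpose item referenced by at least one scope item.
    covered = set().union(*scope_traces.values())
    uncovered_purpose = purpose_ids - covered
    untraced_scope = {
        scope_id
        for scope_id, traces in scope_traces.items()
        if not (traces & purpose_ids)
    }
    return uncovered_purpose, untraced_scope


# Invented identifiers: P* are purpose items, S* are scope items.
purpose = {"P1", "P2", "P3"}
traces = {
    "S1": {"P1"},
    "S2": {"P1", "P2"},
    "S3": set(),          # scope item with no recorded basis in the purpose
}

uncovered, untraced = check_traces(purpose, traces)
print("purpose not yet covered by scope:", sorted(uncovered))
print("scope not based on purpose:", sorted(untraced))
```

Either result being non-empty is a prompt for discussion rather than an automatic failure. An uncovered purpose item may have been deliberately deferred, and an untraced scope item may point to a stakeholder need that has not yet been recorded.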

The act of defining the system’s scope helps reveal the details of the system’s purpose and constraints. Discussions with a customer or other stakeholder are usually informal and incomplete. The discussions result in notes and drawings, but they are rarely directly usable for working out the system’s specification. The tasks of working through records of those discussions and organizing a model of the system’s purpose will reveal ambiguities in what the customer has said, or gaps or inconsistencies. The team can then work with the customer to resolve those issues so that the definition of scope is more complete.

The team will use the definition of the system’s scope to document top-level specifications for the system, which then inform the system’s design and its decomposition into component parts.

As the project moves forward, the team will work out the design for high-level system properties such as safety, security, or reliability. The tasks that build the designs for these emergent properties (Section 11.4) begin with the definitions of what safety or security the system is expected to provide. Those definitions are part of the scope.

Sidebar: Summary
[1] A real example: in an early system concept discussion for a mission that needed to slow its velocity near the lunar surface, one person was quite confused that the spacecraft had to use thrusters and could not use a parachute. Obviously, the lunar environment was out of the mission’s control and we could not use an atmosphere that wasn’t there.
[2] This list is informed by a discussion on specifications by Leveson. [Leveson00]

Chapter 10: Component parts

20 May 2024

10.1 Introduction

In this book, I describe a system in terms of its parts and its structure. The system overall has a purpose, which can be described in terms of things it should do or properties it should have. The system meets this purpose by combining the parts together with the structure of how the parts interact. One should be able to show that the desired system behavior and properties follow from the combination of parts and structure.

In this chapter, I start by discussing components, the term I will use for parts. In the next chapter I will discuss structure, and how the combination of parts and structure leads to emergent properties that meet system needs.

Terms. I use the term “component” as a generic term for a part of the system. Some approaches use different terms, such as “element” or “item”. Other approaches use different terms depending on the level of encapsulation: system, subsystem, component, subcomponent, for example. I use the term “component” throughout, with “system” reserved for the system as a whole, and “subcomponent” used to denote a component that is part of another component.

10.2 Definition of component

A component is something that is part of a system and that people can think of as a unit. “Unit” implies some kind of singular aspect to the component: one purpose, one implementation, or one boundary, for example.

The unitary nature of a component means that the world can be divided into that which is within the component, that which is at the boundary, and that which is outside the component—its environment and the rest of the world.

This definition implies that different people will see different things as unitary components, often depending on the level of abstraction one wants to work with. One person may think of “the electrical storage system” as a unitary component, while another person may think of battery cells and power regulator chips as components, and the electrical storage system as a collection of components. Both views are correct, and both are useful at different times or for different people.

The focus on unitary purpose or boundary is a way to address complexity in a system. The focus is meant to help humans organize and understand the system they are working with by taking a divide-and-conquer approach. It means that some people can focus their attention solely on the component, making sure that it is designed and implemented to meet its purpose while not having to think about the rest of the system. The focused attention on the component must be complemented by attention on the system structure that connects the component to others, as described in the next chapter.

There are three related principles that can help identify what is a component and what is not. (Some of this is based on principles presented by Parnas [Parnas72].) These are only guides, and there are exceptions to each of them.

The goal of the first principle is to organize components around their purpose. If a thing has multiple purposes, that suggests that it might be divisible into smaller parts, each with a sharper focus, or that part of the thing might be better combined into something else with a similar purpose. On the other hand, if there is some feature that is implemented by more than one component, then those components are candidates to merge together. This is particularly true when those components contribute to some important emergent property (Section 11.4).

The second principle addresses how independent a thing is from other things. Independence can be viewed in terms of causal relationships with other components, as covered in the next chapter. The more tightly two things are related, the more they will have to be designed, implemented, and tested together; the less they are related, the more they can be worked on independently. If two things are strongly related, one should consider merging them into a single component; if they are loosely related, they can be more readily treated as separate components.

The final principle is also related to independence. If the design or implementation of a thing can be replaced with little or no effect on the design of the rest of the system, then that is evidence that the thing is independent and can be treated as a component. Having clear and narrow interfaces between the thing and the rest of the system is a sign that the component is independent. More broadly, replaceability is often an indication that something should be considered a separate component.

There is one additional indication that something should be treated as a component: when it is something that is usually sold or acquired as a unit. Electronic chips, antennas, motors, and batteries are all generally bought as units. Software packages are often acquired as units, whether bought or acquired as open source. A person hired as a contractor to fill a commonly-defined role can be seen as a component in a system.[1]

10.2.1 Component purpose

Every component has a purpose, which defines how that component contributes to the system as a whole. “Purpose” is a broad term, including behaviors that the component should have, properties it should exhibit, or functions it should provide. A component’s purpose is not necessarily defined precisely; sometimes, the purpose is a somewhat ambiguous prose statement of what a human wants the component to do or be. Turning that ambiguous statement of purpose into a precise and actionable definition is part of the engineering process. I discuss this in ! Unknown link ref.

I discussed the purpose of the whole system in Chapter 8. The system purpose is the purpose for the top-level component in a hierarchy, which represents the whole system.

Most human-designed components have a single primary purpose or property, possibly with multiple secondary purposes. Consider a battery: its primary purpose is to store electrical energy and make it available to other parts of the system. The battery may have a number of secondary purposes, such as providing mechanical structural rigidity, providing thermal mass to help maintain a constant temperature in the system, or contributing to the location of the system center of mass.

Each component has a number of properties that derive from its purpose: its state and behaviors, the inputs and outputs it can provide, and constraints on how it should be used. The documentation of these properties provides an unambiguous and precise specification of the component.

People working on the component need to have the purpose (and the specification that derives from it) available as they do their work. This information guides how they design the component, and how they verify that a design or implementation meets its needs. It is important that all of the purpose is available to them in one place so that they know they are considering everything they need to consider, without hidden surprises they couldn’t find.

10.2.2 Limits of the component approach

Components help human engineering and understanding—but when humans aren’t doing the design, there are limits on how the approach applies.

Consider a mechanical structure designed with a generative design tool. The tool can take in a specification of what the structure should achieve (forces, attachment points, and so on) and will find a design that optimizes for given criteria such as weight or cost. These structures often do not resemble ones people design because the tool can explore a more complex design space than a person can, and as a result it often produces substantially better results than human designs. Such designs can also potentially co-optimize multiple functions, such as a mechanical structure that includes channels for coolant flow within the structure or that meets RF reflection properties. While a person could make such a design, generative tools can do so at far lower cost.

As a second example, consider a neural network trained to recognize elements in a visual scene. The neural network is designed by performing a training process that uses a large number of examples of the kind of recognition the system should perform. The resulting network is typically much more accurate than a manually-designed algorithm. However, it is difficult to investigate the network itself to determine how the connections in the network lead to accurate (or inaccurate) image recognition. It is difficult to look at a specific connection in the network and explain how it affects the result, or how changing that setting will change recognition properties.

Both these examples are components that will be part of a larger system. As components, they have a defined purpose, from which a specification can be derived defining what the component should do. From there, automated methods take over to produce the design (for the mechanical part) or directly produce the implementation (for the visual recognition component). If these components were designed by people, we would expect that we could review and understand the component’s design as a check on its correctness. Because these components are machine-generated, however, we can only verify that the design or the implementation complies with its specification.

There is one significant difference between the two examples: how they can be verified for compliance with their specification. A mechanical component’s specification is generally complete: all of the conditions in which the component should function and the component’s behavior in each environment can be specified. This means that compliance can usually be checked using finite element analysis software tools, and example components can be built and subjected to their intended loads. Components implemented using neural network methods, on the other hand, usually are expected to function in a complex environment that is too large to fully enumerate. The training methods use a number of example cases, and induce from those examples an implementation that should properly generalize to all, or enough, real cases. The compliance of the component therefore cannot be completely verified; it must instead be assessed statistically.
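One way to make the statistical view concrete is to turn a test result into a bound on the true accuracy. The sketch below is only an illustration, not a prescribed method: the function name and its parameters are assumptions, and it uses Hoeffding's inequality, which is valid only if the test cases are independent samples of the conditions the component will meet in operation.

```python
import math


def accuracy_lower_bound(correct: int, total: int, delta: float = 0.05) -> float:
    """Lower bound on true accuracy that holds with probability >= 1 - delta.

    Assumes the test cases are independent draws from the operating
    distribution (an assumption that must be justified separately).
    """
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need total > 0 and 0 <= correct <= total")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must be strictly between 0 and 1")
    observed = correct / total
    # Hoeffding: P(observed - true >= margin) <= exp(-2 * total * margin**2)
    margin = math.sqrt(math.log(1.0 / delta) / (2.0 * total))
    return max(0.0, observed - margin)


# Hypothetical test campaign: 9,900 correct recognitions out of 10,000 cases.
bound = accuracy_lower_bound(9900, 10000, delta=0.05)
```

The bound tightens only slowly as test cases are added, which is one reason statistical verification of such components is expensive.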

10.3 Divide and conquer: the component breakdown structure

The component approach involves breaking down the system into unitary component parts, in order to make each part manageable by a person. However, as we have seen, different people use different levels of abstraction to understand the parts of the system.

In practice, people divide up a system first into major subsystems, and those into smaller components, and so on until the components are simple enough to deal with. This recursive division defines components at varying levels of abstraction: the electrical power system as a whole, with the power storage, power distribution, and power generation components as parts of the overall power system.

The following is an (intentionally) partial breakdown structure for a spacecraft, illustrating how the spacecraft as a whole (the “space segment” of the whole system) is organized into multiple trees of components.

undisplayed image

This recursive division creates a tree-structured component breakdown structure of the parts of the system. The breakdown structure organizes components in a way that helps people find components, including both finding a specific component that they are looking for and discovering related components that they do not already know about. The structure also defines levels of abstraction that allow people working at different system levels to focus their attention.
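A breakdown structure of this kind maps naturally onto a tree data structure. The following is a minimal sketch under assumed names (the `Component` class and its methods are hypothetical), using the electrical power example from earlier in this section. It shows the two uses described above: looking up a specific component, and browsing to discover related ones.

```python
from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass
class Component:
    """One node in a component breakdown structure."""
    name: str
    children: list["Component"] = field(default_factory=list)

    def add(self, name: str) -> "Component":
        """Create a subcomponent and return it so calls can be chained."""
        child = Component(name)
        self.children.append(child)
        return child

    def walk(self, prefix: str = "") -> Iterator[tuple[str, "Component"]]:
        """Yield (path, component) pairs for this node and all descendants."""
        path = f"{prefix}/{self.name}" if prefix else self.name
        yield path, self
        for child in self.children:
            yield from child.walk(path)

    def find(self, name: str) -> Optional[str]:
        """Return the path of the first component with this name, if any."""
        for path, component in self.walk():
            if component.name == name:
                return path
        return None


# Illustrative, intentionally partial breakdown.
spacecraft = Component("spacecraft")
power = spacecraft.add("electrical power")
for part in ("power storage", "power distribution", "power generation"):
    power.add(part)

storage_path = spacecraft.find("power storage")
siblings = [child.name for child in power.children]
```

Here `storage_path` is `spacecraft/electrical power/power storage`, and `siblings` lists the related components a person might discover while browsing that branch.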

The breakdown structure organizes components, but it does not define the system structure, which I will discuss in the next chapter. The system structure defines how components interact with each other, which generally crosses different parts of the breakdown structure tree.

The system and high-level components should be broken down into subcomponents that have a strong internal relatedness and weaker relationships between subcomponents, as I discussed earlier. In doing so, the high-level component provides an abstraction of its subcomponents. This usually means breaking into subcomponents either by function or physical location. Most people think first of dividing by function: the electrical system, the hydraulic system, the communication system. Location is often more implicit. For example, a space flight mission is organized first by ground system, launch system, and flight system (physical locations) and then by function in each location.

A system will not necessarily have a single optimal breakdown structure. When that happens, one must pick some approach and stick with it. Some systems will have lower-level components that contribute to multiple high-level functions. If the system is organized according to the high-level functions, then the low-level components could fit into multiple branches of the hierarchy. I will discuss this further in the next chapter, when I cover how one uses hierarchy to organize the structure of the system.

Keeping components together that are functionally related is important. Part of the purpose of the hierarchy is to help designers and implementers: the hierarchy should guide them toward the information they need and should not hide or lead them away from that information. I have worked on some projects where the team decided to consider the esthetics of the hierarchy and tried to balance the depth of branches. While the resulting hierarchy was easier to draw, actually using the organization became more complex and error-prone. High-level components no longer provided an abstraction of a collection of subcomponents as a whole. Instead, the collection of related subcomponents was split between two or three high-level components; nowhere was the one abstraction of the whole set represented. Building specifications, tests, and project plans became harder because related things were no longer related in the hierarchy.

10.4 Component characteristics

Each system component is defined by a number of characteristics. These characteristics define an external view of the component: information about the component that can be observed without knowledge of how the component is designed internally. The characteristics constrain the component’s internal design, but should only include those aspects that will affect how the component fits with other components to make up the system.

There are six kinds of characteristics in the component model used in this book:

Form. The “shape” of the component. The component does not typically change its form over time. For physical components, form is obvious: the geometry of the volume or area that the component occupies. Form might include the material of which a physical component is made. For electronic or data components, form is how it is packaged: a data file in some format, or a software component in the form of an executable application.

Examples include:

State. This is the mutable “condition” of the component at a particular point in time. More formally, state is the information that is necessary and sufficient to encapsulate the past history of the component, so that any reaction that the component performs to some input is fully determined by the input and the state. State can be discrete (such as binary-encoded digital data) or continuous (such as the angle and angular momentum of a rigid body at a point in time).

Practical examples include:

Actions or behaviors. These are the state changes that the component can perform. Some behaviors are reactive, meaning they are initiated by some input. Other behaviors are continuing, meaning that they continue to be performed without further input.

Examples of reactive behaviors:

Examples of continuing behaviors:

Interfaces. These are the ways in which a component is connected to other components in the external world, and they are the only way to observe the component’s behavior from outside. Inputs can be given to a component, and output can be received from it. Inputs and outputs create a causal relationship between actions in one component and another. Inputs trigger reactive behaviors in the component that receives the input. Outputs can be a result of a reactive behavior, or an observation of a continuing behavior. Outputs are the only way another component can observe information about a component.

Examples of inputs:

Examples of outputs:

Non-functional properties. Components often have some properties that do not change over time (or change very slowly). These properties are not state per se, but they create important constraints on the component’s design and implementation and affect how the component should behave.

Some non-functional properties:

Environment. A component is also characterized by the expected environment in which the component will operate. This can be viewed formally as part of the component’s interface, but in practical terms it is useful to call it out separately. The environment specification typically includes information like the storage and operating temperature range, humidity, atmosphere, gravitation or acceleration, electronic signal environment, or radiation.
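The six kinds of characteristics can be captured as a simple record, which also gives a cheap completeness check. The following is a sketch only: the `ComponentSpec` class, its field names, and the battery values are all hypothetical, chosen for illustration rather than taken from a real specification.

```python
from dataclasses import dataclass, field, fields


@dataclass
class ComponentSpec:
    """External view of a component: one field per kind of characteristic."""
    name: str
    form: str = ""
    state: dict[str, str] = field(default_factory=dict)
    behaviors: list[str] = field(default_factory=list)
    # Interfaces are split into "inputs" and "outputs".
    interfaces: dict[str, list[str]] = field(default_factory=dict)
    non_functional: dict[str, str] = field(default_factory=dict)
    environment: dict[str, str] = field(default_factory=dict)

    def unspecified(self) -> list[str]:
        """Names of the characteristics that have not been filled in yet."""
        return [
            f.name
            for f in fields(self)
            if f.name != "name" and not getattr(self, f.name)
        ]


# Hypothetical, partially specified battery component.
battery = ComponentSpec(
    name="battery",
    form="rectangular pack",
    state={"charge": "stored electrical energy"},
    behaviors=["discharge when a load draws current"],
    interfaces={"inputs": ["charging current"], "outputs": ["supply current"]},
)

gaps = battery.unspecified()
```

In this example `gaps` reports that the non-functional properties and the environment have not yet been written down, which is the kind of omission that is easy to miss in prose documents.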

! Unknown link ref details more about components and how to specify them.

10.4.1 Characteristics and hierarchy

A high-level component provides an abstraction for the subcomponents that make it up. This implies that each of the characteristics of a high-level component (its form, state, behaviors, and so on) needs to be reflected in the subcomponents. For example, if the high-level component has some state A, then one or more of its subcomponents must have some state that, when aggregated, implements A. If the high-level component has form B, then the subcomponents when put together must have that same shape.

Consider a radio communications component. The purpose of the component is to send and receive data packets with another radio somewhere else. The radio component has interfaces to communicate data with another local component, an interface to emit and receive RF signals, and other interfaces for control, configuration, power, and heat transfer. This example radio component, similar to those that might be used on a small spacecraft, has an antenna that is initially retracted but can be deployed on command.

undisplayed image

The radio is built of a number of subcomponents. These subcomponents must implement the state of the radio overall, as well as all its interfaces. The diagram below shows a simplified possible implementation.

undisplayed image

The set of subcomponents implements each of the interfaces named in the high-level radio component. Many of them are provided by the transceiver component, but the antenna handles the RF signal sending and receiving.

The state of the high-level radio is divided over the subcomponents. Again, much of the state is contained in the transceiver component, as it performs the data manipulation. The deployment state is a physical property of the antenna: it is either retracted or extended.

In the example implementation, however, there are multiple powered components—the sensor and actuator related to deploying the antenna in addition to the transceiver. This results in a more complex power state than defined in the higher-level radio component: some of the components could be powered on while others could be powered off, rather than a binary on/off overall state. During design, discrepancies like this should lead to improving the specification of the state of the high-level component.
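The discrepancy in power state can be seen by writing out how the high-level state would be computed from the subcomponents. This is a minimal sketch with assumed names (`Power`, `radio_power_state`, and the subcomponent labels are hypothetical). The point is that a third value is needed once more than one subcomponent is powered.

```python
from enum import Enum


class Power(Enum):
    OFF = "off"
    ON = "on"
    PARTIAL = "partial"  # not expressible in a binary on/off specification


def radio_power_state(powered: dict[str, bool]) -> Power:
    """Aggregate the power state of subcomponents into a radio-level state."""
    if not powered:
        raise ValueError("at least one powered subcomponent is required")
    if all(powered.values()):
        return Power.ON
    if not any(powered.values()):
        return Power.OFF
    return Power.PARTIAL


# Transceiver running while the deployment hardware is unpowered.
state = radio_power_state(
    {"transceiver": True, "deployment sensor": False, "deployment actuator": False}
)
```

Whether a partially powered radio should be reported as a distinct state, or forbidden by design, is a specification decision. The sketch only shows that the question has to be answered.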

10.5 Downsides

As I have noted, breaking a system into separate and independent components benefits the people who need to understand the components. This advantage generally outweighs other considerations, but there are downsides to this approach.

The first downside is that a reductive approach doesn’t allow for many kinds of system optimization. Having two separate components means that the two are not jointly optimized.

Software language compilers illustrate this. If each program statement is considered independent, the compiler translates each statement into a block of low-level machine code. However, optimizing compilers break this independence, and gain large speed improvements in the generated code. For example, a code optimizer can detect when two statements perform redundant computations and merge them. An optimizer can detect that a repeated computation (in a loop, for example) can be moved out of the loop and performed only once.
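The loop example can be illustrated by writing both forms out by hand. This is a hand-written sketch of the kind of transformation an optimizing compiler might apply, with hypothetical function names. It is not a claim about what any particular compiler does.

```python
import math


def scale_naive(values: list[float], angle: float) -> list[float]:
    """Each loop pass is treated independently, as a simple translator would."""
    out = []
    for v in values:
        # math.cos(angle) does not depend on v, yet it is recomputed every pass.
        out.append(v * math.cos(angle))
    return out


def scale_hoisted(values: list[float], angle: float) -> list[float]:
    """The loop-invariant computation is moved out and performed once."""
    factor = math.cos(angle)
    return [v * factor for v in values]


samples = [1.0, 2.0, 3.0]
same_result = scale_naive(samples, 0.5) == scale_hoisted(samples, 0.5)
```

Both functions produce the same values, but the second no longer treats the statements inside the loop as independent of the statements around it.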

Software optimizers allow a developer to write understandable code, and they perform optimizations that can be proven to maintain correctness but that make the resulting machine code hard for a person to understand and verify. There remains the possibility of system optimizers that perform similar translations, but they are not generally available today.

The second downside is that breaking a system into many components creates an organizational problem: how does one name or find a particular part? A hierarchical component breakdown can help organize the pieces.

10.6 Why components matter

We split complex systems into component parts in order to make parts that are understandable by the people who have to work on the parts. The approach also makes it easier to manage parallel design, implementation, and verification of the parts. If one wants to acquire a component from an outside source, having a definition of what the component is helps the acquisition process.

Each of the people working on the system needs information to work on their parts. Defining a component provides a locus around which to organize the information related to a component. Having a model of what a component is provides a basis for designing artifacts that will contain the right information.

Different people will need to work at different levels of abstraction in the system. Organizing the components hierarchically provides these different levels of abstraction.

The people working on the system need to find pieces of the system, both when they are looking for information about a specific piece and when they are trying to learn what the pieces are. The hierarchical structure provides a way to name and find information about a component, and provides a structured index to help people browse and discover.

Finally, it is generally understood that the structure of a system is related to the structure of the team that builds it [Conway68]. I discuss this further in Chapter 17.

[1] With the obvious note that the person is not, in themselves, the component; the role they play is the component. The person still should be treated as a person, and not as a cog in a machine.
Sidebar: Summary

Chapter 11: Structure and emergence

3 November 2023

11.1 Introduction

Component parts of a system define the building blocks out of which a system can be built, but by themselves they do not create the complex, high-level behaviors that systems are built to exhibit. System behaviors and properties arise from how the component parts work together. How the components are connected, and how they interact over those connections, is the structure of the system.

In this chapter, I define what is meant by system structure and provide examples of how behavior can emerge from the combination of components and their interactions.

To build a system, one generally has to build a model of what the system is and does. This model will play essential roles in designing a system and analyzing its design. Enquiry into how to organize information about a system’s structure helps one develop a useful model, and so in this chapter I present an informal way to model a system’s structure.

11.2 Definition

The meaning of “system structure” has been debated, but I use the following definition, chosen for its engineering utility:

Structure is how each component part’s behavior relates to each other component part’s behavior.

This structure can be expressed as the graph of how components affect each other.

Components can be related in two different ways:

Functional relationship. The functional relationship is a relation from one component to another that maps how some output on an interface of one component can potentially be received on an interface of another component, and thereby cause a reaction in the receiving component. That is, the functional relation is a map of possible interactions that can be viewed as a directed graph, with components as nodes and directed edges showing how causation can flow between them.

undisplayed image

Consider two electronic components connected by a signaling line, similar to those used in several serial communication standards. One component is able to send a signal on the line by changing the voltage relative to a common ground; the other component is able to observe the voltage and determine what signal was sent. By sending a sequence of different voltage levels, the sender can transmit a series of zero and one bits over the line to the receiver. The receiver can decode the bits into a message, perhaps containing a number, and act on the message it has received.

This functional relation is separate from and mostly independent of the component breakdown, defined in the previous chapter. The component breakdown is primarily about organizing the parts so they are identifiable, and does not imply a causal relationship. Functional relationships show how components in different parts of the component hierarchy work together. The component breakdown is helpful for defining levels of abstraction; we deal with those in the next section.
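Because the functional relation is a directed graph, ordinary graph algorithms answer questions about it. The sketch below assumes a small set of hypothetical components and a helper named `can_affect`. It computes every component that a given component could causally influence, directly or through intermediaries.

```python
from collections import deque

# Directed edges: sender -> components that can receive its output.
# The component names are illustrative only.
FUNCTIONAL = {
    "sensor": ["controller"],
    "controller": ["actuator", "telemetry"],
    "actuator": [],
    "telemetry": [],
}


def can_affect(graph: dict[str, list[str]], source: str) -> set[str]:
    """Components reachable from source by following causal edges."""
    seen: set[str] = set()
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for receiver in graph.get(node, []):
            if receiver not in seen:
                seen.add(receiver)
                queue.append(receiver)
    return seen


downstream_of_sensor = can_affect(FUNCTIONAL, "sensor")
downstream_of_actuator = can_affect(FUNCTIONAL, "actuator")
```

In this example the sensor can influence the controller, the actuator, and telemetry, while the actuator influences nothing. The graph records possible interactions, not ones that necessarily occur.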

Non-functional relationship. A non-functional relationship between two components indicates how their behaviors may be related in non-causal ways, such as two components being independent of each other or showing correlated behaviors. These effects do not depend on interaction between the components, but instead are based on inherent characteristics or history of each component.

undisplayed image

Independence and correlation are typical non-functional relationships. These terms are defined in the usual statistical sense. Informally, events on two components are independent if the probability of both occurring is the same as the product of the probabilities of each event occurring on its own. Events on two components exhibit some degree of dependence if the probability of both occurring is different from the product of the probabilities of each event occurring on its own. For a positive correlation, when one event occurs the other is more likely to occur. For a negative correlation, when one occurs the other is less likely to occur. At the extremes, one event occurring means the other is certain to occur, or that the two events never occur together.

Many non-functional relationships are the result of common-cause events. This can occur when two otherwise-independent components A and B have functional relationships with a third component C. When an event occurs in C, it interacts with both A and B so that both change their states. After such an event, the states of A and B are no longer independent.

undisplayed image

System reliability is often built on a foundation of failure independence. For example, data can be stored in two copies, so that if one copy fails the other remains available. A scheme like this fails when both copies fail, and so the copies are designed to be independent to minimize the chances of both failing together. Independence can be a result of using different technologies to store each copy, or using devices from different manufacturers. Two devices from the same manufacturing batch might share a common manufacturing defect, which would increase the probability that both will fail.
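The effect of a common cause on a redundant pair can be worked through with a little arithmetic. The following sketch is an illustration under stated assumptions: the function names and probabilities are hypothetical, and the model is deliberately simple. Each copy fails on its own with the same probability, and a single shared defect, when present, fails both copies.

```python
def single_fail_probability(p_own: float, p_common: float) -> float:
    """P(one given copy fails): shared defect, or else its own failure."""
    return p_common + (1.0 - p_common) * p_own


def both_fail_probability(p_own: float, p_common: float) -> float:
    """P(both copies fail): shared defect, or else two independent failures."""
    return p_common + (1.0 - p_common) * p_own ** 2


# Without a common cause, the joint probability equals the product,
# which is the informal definition of independence.
independent_joint = both_fail_probability(0.01, 0.0)
independent_product = single_fail_probability(0.01, 0.0) ** 2

# With a small shared-defect probability, the joint probability exceeds
# the product: the two failures are positively correlated.
shared_joint = both_fail_probability(0.01, 0.001)
shared_product = single_fail_probability(0.01, 0.001) ** 2
```

With these assumed numbers, a shared defect that is ten times less likely than an individual failure still dominates the probability of losing both copies.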

11.2.1 Examples of functional relationships

Here is a list of some kinds of functional relationships that I have encountered in systems I have worked on. The first few relationships are simple and primitive from an engineering point of view, while the later examples are built as abstractions on top of simpler relationships.

11.2.2 Examples of non-functional relationships

Non-functional relationships capture ways that components can behave in coordinated ways without a direct causal relationship between them. These are typically states or behaviors that occur because two components share some common state (or do not share such state).

The following examples all relate to the independence or dependence of different components that are being used redundantly to improve reliability.

11.3 Abstraction

An abstraction is a summary or reduced form of a more complex thing, usually focused on the essential or intrinsic aspects of the complex thing. The abstraction is separate from any concrete instantiation of it.[1]

People use abstraction to manage the complexity of a large system. In an airplane, people talk about “the electrical system” or “the powerplant”: things that are built out of thousands of subcomponents, but which are usefully thought of as whole things in themselves. While the component breakdown structure, in the previous chapter, is one example of abstracting the details of multiple components into one larger, abstract component (or subsystem), most complex systems have multiple, overlapping ways to abstract and simplify views of the system.

In general, abstracting structure is about taking a relation between two (or more) high-level components and breaking it down into relations between subcomponents. In the example below, two high-level components A and B have a functional relationship. A and B are both abstractions of a set of subcomponents. The relationship between A and B is an abstraction of the relationship between the A.2 and B.1 subcomponents.

undisplayed image

As a concrete example, consider software on two microcontrollers that communicate over a serial line. The software on each breaks down into an application software component and a serial driver. The serial drivers communicate (over a serial cable) directly.

undisplayed image

Non-functional relationships can follow a similar pattern. If two high-level components A and B exhibit some kind of correlated behavior without direct causation, and those high-level components decompose into lower-level components, then at least one of the subcomponents of A must have a corresponding non-functional relationship with a subcomponent of B.
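The pattern for functional relationships described above can be checked mechanically: given the relations between subcomponents and a record of which high-level component each belongs to, the high-level relations follow. This is a sketch with assumed names (`lift_relations` is hypothetical, and the subcomponents A.1 and B.2 are added only for illustration).

```python
def lift_relations(
    sub_edges: set[tuple[str, str]], parent_of: dict[str, str]
) -> set[tuple[str, str]]:
    """Derive high-level functional relations from subcomponent relations."""
    lifted = set()
    for sender, receiver in sub_edges:
        a, b = parent_of[sender], parent_of[receiver]
        if a != b:  # relations inside one parent are hidden by the abstraction
            lifted.add((a, b))
    return lifted


parent_of = {"A.1": "A", "A.2": "A", "B.1": "B", "B.2": "B"}
sub_edges = {("A.1", "A.2"), ("A.2", "B.1"), ("B.1", "B.2")}

high_level = lift_relations(sub_edges, parent_of)
```

Here `high_level` contains only the relation from A to B, which is the abstraction of the A.2 to B.1 relationship. A design review could compare such a derived set against the relations declared at the high level to find inconsistencies.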

11.3.1 Overlapping abstractions

Abstraction is not necessarily purely hierarchical: some high-level abstractions overlap. Two different people can look at the same component and need to work with different aspects of it, and see it as part of different high-level abstractions. This is common in systems of even moderate complexity.

Consider an aircraft with modern avionics and engine systems. The avionics provide many functions: flight deck displays, pilot inputs, navigation, radio communications, autopilot, among many others. The powerplant provides thrust to move the aircraft and electrical power to run other systems, but in a modern aircraft it also includes an engine controller (FADEC) that provides autonomous management of engine operations.

undisplayed image

The avionics and powerplant overlap. The flight deck display will display engine status: thrust, temperature, thrust reverser deployment, and alerts when there are engine problems. The pilot thrust levers are connected to avionics, but provide commands to the engine controller. The autopilot needs to know the capabilities of the engines and how to provide them with control settings.

This overlap leads to a question: is the engine display function part of avionics or part of the powerplant? The answer is that it is part of both, depending on who is looking at that part of the system.

Consider a specific avionics unit for general aviation aircraft: the Garmin G3X display [Garmin13]. It can connect to an engine interface adapter, which in turn connects to sensors or a digital engine controller on the engine. The display is a general-purpose component, which can provide a pilot with many different kinds of information; engine status is just one function. The G3X unit contains a configuration database that defines what engine information it will be receiving, how to display that information to the pilot, and the conditions when it should issue alerts. This database resides within the avionics display unit, implying that someone designing the avionics system will be concerned with it. However, the database is specific to the powerplant installed on the aircraft—changing an engine model requires changing the database—and so it is of concern to people designing the powerplant.

undisplayed image

This pattern is common in systems that have multiple functions: some particular component will contribute to multiple high-level functions, and different people will see that component as part of one abstraction or another based on what functions they are working on. Models of the system must accommodate these overlaps.

When two abstractions overlap, the shared components must implement behaviors and properties that accurately support both higher-level abstractions. In the G3X avionics example, the configuration database needs to address the configuration of the powerplant as well as the interface to support pilot information displays. This can add complexity to designing the shared component, since behavior that supports one abstraction must not interfere with behavior necessary to support the other.

11.3.2 Abstracting a relationship

Some relationships between high-level, abstract components are themselves abstractions.

Consider once again the example of two microcontrollers that communicate with each other, as in the earlier section, but this time they communicate using a wired Ethernet rather than a serial cable. At the abstract level, there is a functional relationship from A to B where A sends data to B.

undisplayed image

The data communication relationship, however, is an abstraction. The microcontrollers communicate using an Ethernet, which might consist of a pair of cables and a switch. The cables and switch reify the abstract relationship, meaning they take the abstract and make it into something real.

undisplayed image

The inputs to and outputs from the reified data communication link are the same (at the abstract level) as the high-level abstract relationship: data gets transferred from microcontroller A to B.

This is an example of a general pattern. Two components at a high level may have a functional relationship, and both the components and the relationship between them decompose into a number of subcomponents. The consistency between the high-level abstraction and the lower-level details must be maintained, of course, but there is nothing that requires that a high-level relationship can only decompose into lower-level relationships.

In fact this pattern continues recursively down to the lowest observable levels. In the example, microcontroller A passes data into the Ethernet cable as a set of low-level electrical signals. Those signals, in turn, are made up of yet lower-level electromagnetic behaviors of the atoms in the conductors that join the microcontroller to the cable.

11.3.3 Consistency

A high-level abstraction and a lower-level implementation of the abstraction need to be consistent with each other. Speaking broadly, the high and low levels are consistent with each other if the low level implements everything in the high-level abstraction, and everything in the low-level implementation is reflected in the abstraction; that is, neither level adds anything to or removes anything from the other.

Abstraction does imply simplification, however. The high-level abstraction of a distributed software component might have a “logging” relationship to a centralized monitoring system. The decomposition of that relationship might involve a logging subcomponent within the software that uses a network connection to send log records to a receiver component within the monitoring system. The high-level logging relationship focuses on the ability to reliably and securely send log information to the monitoring system. To be consistent, the lower-level details must provide a way to transfer that information—using the network to move the data, for example. The statement that the information is sent securely—which would need to be better defined at the high level—might be matched by state and behaviors of the endpoint software components to authenticate each other and encrypt data in transmission.

Continuing this example, the lower-level implementation would not be consistent with the high-level abstraction if the network communication mechanism provided a way to send information in the other direction, from the monitoring system to the distributed software component.

We can put this in somewhat more formal terms as follows.

This definition of consistency means that an abstract component or relation has to reflect all of the states, behaviors, or interactions that the lower-level components or relations can have, so that the abstract things model all of what the lower levels will do, and it cannot add to what the lower-level parts do. In reverse, the lower-level components or relations must implement all of what the abstract components or relations do, without adding other behaviors or interactions.
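As an informal illustration of this two-way condition, the sketch below checks the logging example for directed relations only. It is not the formal definition; the component names, the `maps_to` table, and the functions `lift` and `check_consistency` are all hypothetical names chosen for the illustration.

```python
# Hypothetical sketch: check that abstract relations and lower-level flows
# neither add to nor remove from each other (directed relations only).

# Abstract level: the "logging" relation from the software to monitoring.
abstract_relations = {("software", "monitoring")}

# Which abstract component each lower-level endpoint belongs to (assumed names).
maps_to = {
    "logger": "software",
    "log_receiver": "monitoring",
}


def lift(flows):
    """Map end-to-end lower-level flows up to abstract (source, sink) pairs."""
    return {(maps_to[src], maps_to[dst]) for src, dst in flows}


def check_consistency(abstract, flows):
    """Return (missing, extra): abstract relations with no implementation,
    and implemented flows that the abstraction does not reflect."""
    lifted = lift(flows)
    return abstract - lifted, lifted - abstract


# A one-way implementation is consistent with the abstraction.
one_way = {("logger", "log_receiver")}
# Adding a reverse channel introduces a flow the abstraction does not have.
two_way = one_way | {("log_receiver", "logger")}
```

Calling `check_consistency(abstract_relations, two_way)` reports the reverse flow as extra, which matches the inconsistency described in the logging example above.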

11.4 Emergent system properties

Emergence is the complement of abstraction: it is how high-level properties or behaviors arise from the properties or behaviors of a collection of lower-level components and their interactions. Put another way, one designs the emergent properties in a system to make abstractions true. Previously, in Chapter 5, I introduced the idea that system properties and behaviors are emergent from the properties and behaviors of the components that make it up, combined with the way those components interact. This idea continues recursively through a system, where each high-level abstraction is achieved by designing how subcomponents work and interact.

Emergent behaviors or properties are usually things that cannot be sensibly talked about at lower levels: these are properties that the individual components do not have on their own, but that the aggregation does when the components are combined. In physics, concepts such as gas pressure are emergent: no individual gas molecule has meaningful pressure, but the collection of a large number of molecules in an enclosed space gives rise to measurable pressure. Similarly, the shape of a leaf is emergent. No cell making up the leaf has, in itself, the property of the shape of the leaf, but the aggregation of all the cells, together with how those cells interact as the leaf grows (that is, morphogenesis), leads to a consistent shape for the leaf as a whole.

In engineered systems, properties such as safety or “correct behavior” are emergent from the design of components and their interactions [Leveson11]. Consider an automobile: it has a property that the driver must be able to control its speed. The driver’s ability to control arises from the driver’s ability to give commands to regulate speed and the vehicle’s correct response to those commands. The vehicle’s speed arises from a combination of motor behavior, brake behavior, wheel interfaces to the road surface, vehicle inertia, and external forces like wind or gravity. One can talk about the rotational rate of the motor, or the degree to which brakes are applied, but driver control over speed arises from the combination of all these things.

There is a rigorous discipline of systems theory that provides a foundation for this discussion.

An emergent, high-level property is said to supervene on the low-level properties of components: a change in the high-level property can only occur when there is a change to the low-level properties. This principle implies that one can in general design low-level properties in order to achieve a desired high-level property. It may be difficult to do this design, of course, but it is possible; properly-designed low-level properties do not necessarily create undesired emergent behavior.

Designing a system so that a desired property or behavior emerges from components involves placing constraints on how lower-level components behave and interact. This is a top-down approach to handling emergent behavior. Reliability properties, for example, are often met using redundant components; for those redundant components to provide reliability, they must be connected in a way where one component can provide service when another fails—a property arising from how the redundant components interact with other components. The redundant components must also exhibit a non-functional relation of some degree of failure independence. I will discuss several more examples in the coming sections.

It is generally more effective to work top down, from a desired emergent property of an abstraction to the components and relations that will make it up, than to work bottom up, starting with a set of component behaviors and hoping a desired abstract property will emerge. Component properties combine in unexpected ways, and determining whether they combine in a way that produces the desired result and at the same time avoids unintended consequences is most often a nearly-intractable problem. Working top down means determining the constraints that must apply to the components and structure that implement the abstraction; analyzing (or designing) the components to determine if they meet those constraints is a simpler and more tractable problem.

For example, the software components inside most operating systems cannot be evaluated for good evidence that they provide the operating system’s intended features in all usage scenarios—and practical experience with popular operating systems shows that most contain large numbers of undiscovered errors. Those operating systems were generally built from the bottom up, with new components being developed on their own following only a minimal goal of function, and then added to an existing system. Only a very few operating systems or software systems of comparable complexity have been analyzed to prove that they actually implement their stated function correctly, and those examples have all started with clear definitions of the abstract behavior and worked from there to design the lower-level components and structure. [Klein14]

11.4.1 Examples of emergent properties

Emergent properties can be simple or complex; what they share is that the combination of properties or behaviors from multiple components yields something of a nature that would not apply to the individual components. Here is a set of examples illustrating different kinds of emergent properties or behavior, ranging from the almost trivial that one might not ordinarily think about as emergent to the complex, and including both desired and undesired emergent behaviors.

Reliable data communication

Reliable communication happens when information is sent from one place to another, with the information received matching the information sent. “Reliable” is usually qualified: a maximum probability that any arbitrary bit or message that is received does not match what was sent, and qualifications on the environmental circumstances such as distance between sender and receiver, or the absence of deliberate interference.

At a high level, communication involves an information source and an information sink. The source and sink have a functional relation of sending information from one to the other.

At the lower level, communication involves a set of components. The information source and sink remain. The functional relationship between them is reified by a chain of components: a transmitter, a receiver, and the medium between transmitter and receiver. It also involves various encodings used in sending from transmitter to receiver over the intermediate medium. The components have functional relations from one to the next, for moving information along this chain of components. The transmitter and receiver have a non-functional relationship: agreement on the encodings to be used to move information over the medium between them.
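The chain of transmitter, medium, and receiver can be sketched in a few lines. The sketch below assumes a simple repetition code as the agreed encoding; that choice, the function names, and the corrupted positions are illustrative assumptions, not something the text specifies.

```python
# Hypothetical sketch: reliability emerges only when the transmitter and
# receiver agree on the encoding (here, an assumed repetition code).

def transmit(bits, repeat):
    """Transmitter: encode each bit by repeating it `repeat` times."""
    return [b for b in bits for _ in range(repeat)]


def medium(signal, flipped_positions):
    """Medium: deliver the signal, corrupting the symbols at given positions."""
    return [s ^ 1 if i in flipped_positions else s for i, s in enumerate(signal)]


def receive(signal, repeat):
    """Receiver: decode by majority vote over each group of `repeat` symbols."""
    groups = [signal[i:i + repeat] for i in range(0, len(signal), repeat)]
    return [1 if sum(g) * 2 > len(g) else 0 for g in groups]


message = [1, 0, 1, 1]
# One corrupted symbol in three of the four groups.
noisy = medium(transmit(message, repeat=3), flipped_positions={1, 5, 9})

matched = receive(noisy, repeat=3)      # receiver agrees with the transmitter
mismatched = receive(noisy, repeat=2)   # receiver assumes a different encoding
```

With matched encodings the original message is recovered despite the corrupted symbols; with a mismatched receiver it is not. No single function here provides reliable delivery by itself.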

undisplayed image

Neither the transmitter nor the receiver by itself moves information reliably from source to sink. Instead, reliable transmission is a simple emergent property of combining all the lower-level components and their relations. The reliability comes from properly matching the designs of the transmitter and receiver, including how they encode signals for transmission and reception, so that they can achieve the desired reliability on the medium that connects them.

Door closing and latching

Consider a door, perhaps to a cabinet. The door can be open or closed. When open, it can be closed by a person acting to close it. If no one acts on the door, it might remain open or close on its own. When the door is closed, it remains closed until a person takes a specific action to open it. “Remaining closed” means that the door stays closed even when force up to some defined limit is applied to the door. These behaviors should occur reliably for at least some number of open-and-close cycles. They only need to hold reliably in some benign environment (no deforming forces, no corroding atmosphere, and so on).

This is an example of an emergent property of a high-level component that can be achieved by properly designing the subcomponents that make it up.

undisplayed image

One possible implementation of the door that would meet this high-level property uses a latch to hold the door closed. When the door swings closed, the latch engages and keeps the door closed. The latch can be connected to a knob or lever that a person can use to release the latch, allowing the person to perform a two-part action to open the door (release the latch, apply force to the door to move it open).

The high-level door thus decomposes into three subcomponents: the basic door, a latch, and a knob. These three subcomponents, plus the door’s user, have four functional relationships:

  1. Latch to door: the latch holds the door closed when engaged.
  2. Knob to latch: the knob can be moved to disengage the latch.
  3. User to door: apply force to open or close the door.
  4. User to knob: apply force to turn the knob.
undisplayed image

The high-level opening action that a user can apply to open the door decomposes into a sequence of lower-level actions: a turn action applied to the knob, an opening force applied to the door, probably followed by a release action on the knob. The high-level closing action decomposes into, first, ensuring that the knob is released, then applying a closing force to the door.

The implementation admits states that do not directly map to the high-level states of the door. For example, the implementation allows the user to turn the knob and then take no further action. This leads to a state of the system where the door is in the closed position and the latch is disengaged. If the environment applies an opening force to the door, the door is not restrained and will swing open. A designer will have to work out what these intermediate states are, and determine whether they are acceptable or not. (In this case, the situation might be resolved by saying that the high-level “open” condition maps to any implementation state where the door position is not closed or the latch is disengaged. Handling intermediate implementation states is not always so simple.)
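One way to work out the intermediate states is to enumerate every implementation state and record which high-level state it maps to. The sketch below does this for the door, using the resolution suggested in the parenthetical above. The two-valued `Position` and `Latch` types and the function name are simplifying assumptions.

```python
from enum import Enum
from itertools import product


class Position(Enum):
    CLOSED = "closed"
    AJAR = "ajar"


class Latch(Enum):
    ENGAGED = "engaged"
    DISENGAGED = "disengaged"


def high_level_state(position, latch):
    """Map an implementation state to the high-level door state: "open" if
    the door is not in the closed position or the latch is disengaged."""
    if position is Position.CLOSED and latch is Latch.ENGAGED:
        return "closed"
    return "open"


# Enumerate every implementation state and its high-level interpretation.
state_table = {
    (p.value, l.value): high_level_state(p, l)
    for p, l in product(Position, Latch)
}
```

The table has four implementation states but only two high-level states. The entry for a closed position with a disengaged latch is the intermediate state discussed above, and the table makes explicit that it is being treated as "open".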

The knob and latch will have properties that, together, support the high-level property that the door will remain reliably closed through some number of open-and-close cycles. These properties likely involve constraints on the wear imposed on each of them each time the door opens or closes, and the amount of wear before they begin to be unreliable. Similarly, the property that a closed door stays closed when some amount of force is applied to the door decomposes into properties on the latch and knob to ensure they will hold the door in position.

The overall property of remaining closed is an emergent property of the design of the latch and knob. The latch on its own is neither closed nor open; that is a property of the door that arises when the latch is engaged and the door is in a closed position.

Failure resilience

A failure resilient component is one that can mask one or more failures of its parts while continuing to provide correct behavior. This is one way to meet a goal that a component is reliable or available; the other way is to make the fundamental reliability of the component higher.

For a concrete example, consider a control system for an autonomous road vehicle. The control system takes in commands from a user or other outside system, then must provide correct, active control of the vehicle’s attitude and movement to travel on the commanded path. Typical acceptable failure rates are 10⁻⁷ to 10⁻⁹ failures per operational hour. The vehicle should fail safely where possible when the control system fails, but I will leave that aside in this discussion.

Many systems achieve this level of failure resilience using redundancy and voting. In this approach, multiple independent processors run the control algorithm synchronously, each receiving the same sensor input and generating actuation output. The actuation output from each processor is fed to voting logic, which determines whether a majority of the processors are generating consistent output and if so applies that output to the plant being controlled. If one of the processors fails by stopping, or by generating different outputs, the voting logic masks out the presumed failure.

undisplayed image

The combined components will generally perform the same operations as one single computing component by itself, but the combination will fail less frequently. This improvement is an emergent property of the combination. It depends on two non-functional relationships between the redundant components: that they all exhibit the same behavior, and that they generally fail independently.
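The voting logic can be sketched as a small function. This is a minimal illustration under assumed conventions: a stopped processor is represented by `None`, outputs are compared for exact equality, and the name `vote` is hypothetical. Real voters usually have to deal with timing and approximate agreement as well.

```python
from collections import Counter


def vote(outputs):
    """Return the output agreed on by a strict majority of all processors,
    or None if there is no majority. A stopped processor reports None."""
    live = [o for o in outputs if o is not None]
    if not live:
        return None
    value, count = Counter(live).most_common(1)[0]
    # Require a majority of all processors, not just of those still running.
    return value if count * 2 > len(outputs) else None
```

For example, `vote([5, 5, 7])` and `vote([5, None, 5])` both yield `5`, masking one faulty or stopped processor. Note that `vote([7, 7, 7])` yields `7` even if `7` is wrong: the voter cannot mask a failure that all processors share, which is why the independence relationship matters.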

For the example vehicle control system, I found that the approach of using three identical embedded computers was—based on reliability analysis, not measurement—likely to provide only a modest improvement to overall vehicle safety. The redundant computers were not fully independent: they ran the same code, they shared the same power source, and they were subject to heat and vibration in the same environment, all of which increased the chances two or more computers would fail together. They had a greater degree of independence with respect to matters like a cable vibrating out of its connector or dust shorting out traces on the boards. In other applications, such as spacecraft, there are more sources of independent failure, such as radiation upsets. For spacecraft and aircraft, the cost of unreliability is also higher than for a road vehicle, making this approach to redundancy worthwhile.
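A rough calculation shows why a lack of independence limits the benefit. The sketch below uses a simple beta-factor model, in which a fraction `beta` of each unit's failures are common-cause failures that take out all units together. The model, the function name, and the numbers are illustrative assumptions; they are not the analysis or the figures from the project described above.

```python
def two_of_three_failure(p, beta):
    """Probability that a 2-of-3 voted system fails, given per-unit failure
    probability p, under an assumed beta-factor common-cause model."""
    p_common = beta * p       # a failure that takes out all units together
    p_ind = (1 - beta) * p    # a failure of one unit on its own
    # Two or more of the three units fail independently.
    p_ind_system = 3 * p_ind**2 * (1 - p_ind) + p_ind**3
    # The system fails on a common-cause failure or on independent failures.
    return 1 - (1 - p_common) * (1 - p_ind_system)


single_unit = 1e-4                                       # assumed value
fully_independent = two_of_three_failure(single_unit, beta=0.0)
partly_dependent = two_of_three_failure(single_unit, beta=0.1)
```

With these assumed values, fully independent units make the voted system thousands of times less likely to fail than a single unit. With one tenth of failures being common-cause, the common-cause term dominates and the improvement drops to roughly a factor of ten.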

An incident involving an Airbus A330 landing on 14 June 2020 illustrates how lack of independence between supposedly-redundant computer systems leads to failure [TTSB21]. The Airbus A330 uses three flight control primary computers; on landing, these control the braking, thrust reversers, and spoilers that slow the aircraft on the runway. In this incident, there was an error in the flight control law implemented in all three flight computers. On touchdown, the flaw was triggered in one flight computer after another until all three had failed, leaving the pilots only manual control of the brakes. The pilots were able to apply manual braking to stop the aircraft before running off the end of the runway. The failure occurred because there was a design flaw common to all three flight computers, meaning that there was no redundancy in the face of the particular condition that occurred on that landing.

Undesired emergent properties

Components are usually designed and organized so that together they achieve the desired emergent system properties. However, the same design can exhibit other emergent properties that are undesirable.

Network congestion is a commonly-cited example of undesirable emergent behavior. In its simplest manifestation, when multiple streams of data meet and cross at some router in the network, the streams can overwhelm the router’s capacity to process and forward data. The router typically drops some packets in order to try to keep up, which causes some of the streams in turn to detect missing packets and retransmit them—causing even more traffic through the router. This was first observed in the Internet in October 1986, when a particular congested network link was moving about 0.1% of the data it normally could when not congested [Jacobson88].

This has led to congestion avoidance and congestion control mechanisms in Internet protocols, which aim to either keep stream data rates below the level when congestion starts or recover quickly when congestion does occur. The sender and receiver behaviors in the congestion control mechanisms, however, have been found to lead to behavior synchronization across multiple senders, leading to oscillating loads that repeatedly overwhelm a bottleneck, then back off, wasting resources for a while, until the cycle leads to another period of congestion [Zhang90].

These behaviors are similar to other situations where behavior is unstable, and once it starts to behave poorly it gets progressively worse. In many of these cases, congestion or overloaded conditions make it more difficult for mechanisms that would address the situation to work.

The lesson to draw from the possibility of undesirable emergent behavior is that system designs need to be analyzed to look for such negative behavior—not just analyzed to ensure that desired behaviors happen. This is related to a kind of confirmation bias, where one is motivated, usually unintentionally, to look for evidence that confirms what one wants or expects. It often requires deliberate effort to look for evidence of negative behavior.

Spacecraft imaging a ground location

The final example takes the basic principles in the previous, simpler examples and combines them into a realistically complex case.

Consider a spacecraft system that is intended to take images of ground locations and send those images to users on the ground. The system includes many different parts:

The process of taking an image involves every one of these parts, as well as others omitted from the example to keep the list from getting too long to read. It includes:

If any one of those steps fails to happen properly, the system as a whole will fail to achieve its objective. At the same time, no one component involved in these steps achieves the system objective by itself. In other words, the system behavior of taking an image of a ground location is an emergent property of the system as a whole.

This example is typical of most system properties and behaviors, in that achieving the desired behavior involves many components working properly together. This implies that all these components have been designed to have their individual properties, and that the components have been wired together with the right functional and non-functional relations to work together.

This example also illustrates a common issue: that components depend on other components for their function. For example, the ability for the spacecraft to communicate with the ground depends on the spacecraft being able to determine when it is coming in range of a ground station. This means that the spacecraft must be able to tell where it is, which might rely on the GPS system. If there were to be a problem with the GPS constellation, the spacecraft would not be able to communicate correctly. This kind of dependency creates non-functional relationships. In this case, there is a non-functional relationship between ground station systems and spacecraft communications: communications will function only when the GPS constellation is working properly.

Safety and security

Leveson argues that safety is a fundamentally emergent property:

Safety, on the other hand, is clearly an emergent property of systems: Safety can be determined only in the context of the whole. Determining whether a plant is acceptably safe is not possible, for example, by examining a single valve in the plant. In fact, statements about the “safety of the valve” without information about the context in which that valve is used are meaningless. Safety is determined by the relationship between the valve and the other plant components. As another example, pilot procedures to execute a landing might be safe in one aircraft or in one set of circumstances but unsafe in another. [Leveson11, §3.3]

I argue that related properties, including security, are similarly emergent and must be understood, designed, and analyzed in terms of how components are related.

11.5 Working with structure

The notions of components, structure, and emergence form a foundation for the work to be done when designing and building a system. Upcoming chapters will define the tasks, artifacts, and processes involved in terms of this basic model of how systems can be organized.

For example, the design of a system consists of artifacts that document what the components are in the system, and the desired properties and relations that connect them. Verifying the design involves gathering evidence for and against whether the behaviors that emerge from the components and their relations match the desired system behaviors. A design can be evaluated based on properties of the graph of relations between components, and the graph of relations can guide investigations into whether subtle non-functional relations (such as expected component independence) will hold.

In addition, there are common design patterns of components and relations that provide guidance for implementing complex behaviors. These design patterns can be expressed in general terms of components and relations, making the patterns broadly applicable rather than specialized to a particular use case.

Sidebar: Emergence all the way down

I have taken a pragmatic approach to abstraction and emergence, focusing on the kinds of relations and abstractions one actually encounters in building most real systems. This means only drilling down into lower layers of abstraction as far as is needed, and not as far as it could go.

Consider data that is exchanged between two electronic components. Data is an abstract component that has no direct physical reality; it is an emergent property of lower-level components and relations. The data itself is dependent on mechanisms for observation and interpretation by people—including agreement between sender and receiver on what the data “mean”. The data are transmitted from one component to another using low-level electrical signals over wires; the signals are designed to move the data from one component to the other. The low-level electrical signals are themselves an emergent property of yet lower level atomic and electromagnetic behaviors in the transmitter, wires, and receiver. These may in turn be emergent properties of yet lower level structures and forces, some of which may not yet be understood.

It is intriguing to think about how far one can take this approach. Luckily we can usually stop at some practical level and take the rest for granted.

Sidebar: Summary

[1] Following the definition of “abstract” in the Merriam-Webster online dictionary.

Chapter 12: System views

20 May 2024

12.1 Introduction

Systems are too big for one person to understand all the facts at once. It’s necessary to focus on subsets to manage the scale.

At the same time, different people have different interests as they are working on a system. They need a particular kind of information about part of the system, but do not need to be distracted by other kinds of information.

These needs for subsetting lead to developing multiple views on a system. Each view defines a subset of the information on a system, with the subset defined to support a particular person’s needs and interests. Ideally, each person can do their work using one view or another, and when all the work has been done using many different views the work has addressed all of the system.

Some of these views have a technical focus, being about the function or properties of the system and its parts. These views support those who design, analyze, implement, or verify parts of the system. Other views are non-technical, supporting people who manage the project, organize the teams doing the work, handle scheduling, and similar tasks.

Views highlight some information and hide other information in order to help someone perform a task. If the view shows too much information, then the person using the view will have trouble finding the specific pieces of information they need. They may, indeed, be distracted by irrelevant information. On the other hand, if the view is hiding information that the person needs, they are likely to work with the incomplete information they have and infer that the system does not include the missing information.[1]

The view concept I am defining here is a general mechanism for subsetting information about the system. There are several architecture framework standards that define “view” and “viewpoint” concepts, including DODAF [DOD10] and ISO 42010 [ISO42010]. The view concepts in those framework standards arise from ideas about the processes that should be used to build systems well, and are thus more specific than the general idea presented here. These standards focus on developing models of a system’s design, with subset views that are motivated by exploring the objectives that system stakeholders have in the system. The approach in these standards is one way to use the general idea of subsetting information about a system based on some focus; I will discuss this further in later chapters when I turn attention to how to build systems using the foundational concepts presented now.

12.2 Technical views

Technical views are ones that subset the contents of a system in a way useful to the designers, implementers, or verifiers of the system. These views focus on how a part of the system functions or is organized in some technical sense.

These views can focus in different ways, depending on the specific need:

A view focused on a set of components is useful to someone responsible for a particular subsystem or abstraction. The view can collect all the components, at varying levels of abstraction, related to one part of the system. This might be defined as one or more subtrees in the component hierarchy (Section 10.3)—for example, all the components that make up an electrical power system for a spacecraft. This might also start from some other abstraction. Views like this can be used when working out how an abstraction is to be realized in concrete subcomponents (Section 11.3). They can also be useful for checking whether certain design properties hold, like total mass.
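A component-focused view of this kind can be sketched as a subtree query over the component hierarchy, with a design property such as total mass computed over the result. The hierarchy, component names, and masses below are invented for illustration.

```python
# Hypothetical component hierarchy: parent -> list of child components.
children = {
    "spacecraft": ["power", "comms"],
    "power": ["battery", "solar_array", "power_controller"],
    "comms": ["transceiver", "antenna"],
}

# Assumed masses in kilograms for the leaf components only.
mass_kg = {
    "battery": 4.0,
    "solar_array": 2.5,
    "power_controller": 0.5,
    "transceiver": 1.0,
    "antenna": 0.3,
}


def subtree(root):
    """Return all components in the view rooted at `root`, root included."""
    found = [root]
    for child in children.get(root, []):
        found.extend(subtree(child))
    return found


def total_mass(root):
    """Sum the mass of every component in the subtree view; components
    without a recorded mass (the abstract ones) contribute zero."""
    return sum(mass_kg.get(name, 0.0) for name in subtree(root))
```

Here `subtree("power")` gives the view for the electrical power system, and `total_mass("power")` checks a mass property over just that view.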

undisplayed image

A view focused on a path through the system is useful for working out or checking how behaviors are realized. Such a view might start with an event in one component, then trace how one event causes events in adjacent components, onward until the high-level behavior is complete. Views like this are useful when checking where a path might have gaps that need to be addressed. They are also useful for checking that a causal path among abstract components and relations is properly realized in concrete subcomponents.

Looking at a path can help reveal what conditions need to hold for each step in the path to occur properly. For example, in the spacecraft commanding example in the previous chapter, a ground pass has to happen successfully if a command message is to be received at the spacecraft. A successful ground pass requires a functioning and available ground station, accurate ground knowledge of where the spacecraft will be, knowledge in the spacecraft of where a ground station is and when it will be in range, and the ability to operate the communications subsystem.

undisplayed image

The third kind of view focuses on trees or graphs of dependencies. This information is useful to someone who is verifying that some safety or security condition holds. It is also useful for revealing where there are unexpected vulnerabilities in a system. In particular, looking at the transitive closure of dependencies can reveal unexpected shared dependencies between two components. In the spacecraft commanding example above, a spacecraft’s ability to know when it should operate its transceiver for a ground pass might be based on the spacecraft knowing its location through GPS. This creates a dependency on a GPS receiver on board and the correct function of the GPS constellation. Further, it may require the spacecraft to maintain an attitude where GPS antennas can see the GPS constellation; this may conflict with other demands on spacecraft attitude (like pointing an antenna toward a ground station). Both the communications transceiver and GPS receiver may rely on a shared electrical power system.
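Finding shared dependencies through the transitive closure can be sketched as a small graph traversal. The dependency graph below is a hypothetical rendering of the ground pass example; the component names and edges are assumptions made for illustration.

```python
# Hypothetical dependency graph: component -> components it depends on.
depends_on = {
    "ground_pass": ["transceiver", "pass_schedule"],
    "pass_schedule": ["position_knowledge"],
    "position_knowledge": ["gps_receiver"],
    "gps_receiver": ["gps_constellation", "power", "attitude_control"],
    "transceiver": ["power", "attitude_control"],
}


def closure(component):
    """Return every component that `component` depends on, directly or not."""
    seen = set()
    stack = list(depends_on.get(component, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(depends_on.get(dep, []))
    return seen


def shared_dependencies(a, b):
    """Dependencies that appear in the transitive closure of both a and b."""
    return closure(a) & closure(b)
```

In this sketch, `shared_dependencies("transceiver", "pass_schedule")` returns the power system and attitude control, neither of which is a direct dependency of the pass schedule. `closure("ground_pass")` also shows that a ground pass ultimately depends on the GPS constellation.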

undisplayed image
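As a minimal sketch of the transitive-closure idea, the following computes everything a component depends on, directly or indirectly, and then intersects two such sets to expose shared dependencies. The component names and the dependency graph are hypothetical, loosely modeled on the spacecraft commanding example; they are not taken from a real design.

```python
from collections import deque

# Hypothetical direct dependencies: each component maps to what it needs.
DIRECT_DEPS = {
    "ground_pass": {"transceiver", "pass_schedule"},
    "pass_schedule": {"position_knowledge"},
    "position_knowledge": {"gps_receiver"},
    "gps_receiver": {"gps_constellation", "attitude_control", "power"},
    "transceiver": {"attitude_control", "power"},
}


def transitive_deps(component, graph):
    """Return everything `component` depends on, directly or indirectly."""
    seen = set()
    queue = deque(graph.get(component, ()))
    while queue:
        dep = queue.popleft()
        if dep not in seen:  # the seen set also guards against cycles
            seen.add(dep)
            queue.extend(graph.get(dep, ()))
    return seen


def shared_deps(a, b, graph):
    """Dependencies that both `a` and `b` rely on, possibly indirectly."""
    return transitive_deps(a, graph) & transitive_deps(b, graph)


print(sorted(shared_deps("transceiver", "pass_schedule", DIRECT_DEPS)))
# ['attitude_control', 'power']
```

In this toy graph the pass schedule never mentions power or attitude control directly, yet both appear once the closure is taken, which is the kind of unexpected coupling a dependency view is meant to reveal.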

These three kinds of views are not mutually exclusive. Often someone can benefit from starting in one view, such as a path through the system, and then using other views to explore or refine the system, such as checking on dependencies.

12.3 Non-technical uses

Some views are useful for managing project execution. As a manager or lead, I have been responsible for working out what tasks people need to do to develop the system to some milestone, along with potential dependencies among tasks and estimates of the time and resources needed. I have needed to understand the system in order to derive this information about tasks.

For example, I have often started with a high-level design for a part of a system, containing a few abstract components and relations and a few paths through them for performing key behaviors. I have used one or two paths through those components to sketch out milestones that the team can design and develop toward; at each milestone, the designs or implementations will be integrated to demonstrate some level of functionality. This management step uses views of a few paths through the system. After that, I have worked from the view of components and relations that feed into each milestone to work out a set of design and development tasks that will get each part ready for its milestone. These steps use information about the components and relations involved to work out both the individual tasks and how those tasks might depend on each other, leading to constraints in how the effort can be scheduled. I expand on these techniques elsewhere in this book.
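The step from task dependencies to scheduling constraints can be sketched with a topological sort. This is only an illustration under assumed inputs: the task names are hypothetical, and the grouping into "waves" of tasks that could proceed in parallel is one simple way to read the constraints, not a planning method prescribed here.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks for one milestone; each maps to the tasks it depends on.
TASK_DEPS = {
    "design_command_format": set(),
    "implement_ground_encoder": {"design_command_format"},
    "implement_spacecraft_decoder": {"design_command_format"},
    "integrate_command_path": {
        "implement_ground_encoder",
        "implement_spacecraft_decoder",
    },
}


def schedule_waves(task_deps):
    """Group tasks into waves; tasks within a wave do not depend on each other."""
    sorter = TopologicalSorter(task_deps)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


for number, wave in enumerate(schedule_waves(TASK_DEPS), start=1):
    print(number, wave)
```

A circular dependency among tasks surfaces as an error rather than a schedule, which is itself useful information about the design.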

Following paths through a system, as well as tracing through the ways that abstractions are decomposed, allows one to find gaps in the current understanding. These gaps represent uncertainty, which can lead to risk. Further, following the paths that connect an uncertain part of the system to other components or relations helps one work out how much of the rest of the system may be affected by that uncertainty. This allows one to judge the potential effects of changes that may arise from the uncertainty; the magnitude of those effects is part of determining how much developmental risk a gap poses. I discuss how to use this kind of analysis elsewhere in this book.

Sidebar: Specifying a view

The descriptions above may seem focused on extracting subsets of a defined system, but the view concept is intended more generally.

In set theory, subsets are often specified one of three ways: by listing the elements of the subset; by constructing the subset through combinations of set operations such as intersection and union; and by specifying a characteristic function—essentially, a description of a query on the set.

All of these have been useful to me at one time or another. While a system is being designed, the population of components and relations that make it up will be changing constantly. The path through components and relations to achieve some function will be steadily refined; in many cases, there may be two or more alternative designs for parts of the same path to compare. This case lends itself to a query-like formulation of views, which are updated as the system’s contents change. On the other hand, tasks to verify that a design or implementation is correct and complete benefit from working on an unchanging snapshot. This way someone can step through each part of the system, verifying each piece and each integration, without having that work change as people make changes to the system.
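The difference between a query-like view and a snapshot can be made concrete with a small sketch. The `Component` and `SystemModel` classes below are hypothetical stand-ins for whatever holds the system description; relations are omitted to keep the example short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    subsystem: str
    safety_critical: bool = False


class SystemModel:
    """A changing population of components (relations omitted for brevity)."""

    def __init__(self):
        self.components = set()

    def add(self, component):
        self.components.add(component)

    def query_view(self, predicate):
        """Live view: re-evaluates the characteristic function on each call."""
        return lambda: {c for c in self.components if predicate(c)}

    def snapshot_view(self, predicate):
        """Frozen view: contents fixed at the moment the snapshot is taken."""
        return frozenset(c for c in self.components if predicate(c))


def in_comms(c):
    return c.subsystem == "comms"


def is_safety_critical(c):
    return c.safety_critical


model = SystemModel()
transceiver = Component("transceiver", "comms", safety_critical=True)
model.add(transceiver)
model.add(Component("gps_receiver", "navigation", safety_critical=True))

# 1. Listing the elements explicitly.
listed = {transceiver}
# 2. Combining other views with set operations.
combined = model.snapshot_view(in_comms) & model.snapshot_view(is_safety_critical)
# 3. A characteristic function, kept as a live query or frozen as a snapshot.
live_comms = model.query_view(in_comms)
frozen_comms = model.snapshot_view(in_comms)

model.add(Component("antenna", "comms"))  # the design changes
print(len(live_comms()), len(frozen_comms))  # 2 1
```

The live view picks up the new antenna, which suits design work; the snapshot does not, which suits a verification task that should not shift underfoot.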

Sidebar: Summary
[1] “In experiments where some problem solvers were given incomplete representations while others were not given any representation at all, those with no representation did better. An incomplete problem representation actually impaired performance because the subjects tended to rely on it as a comprehensive and truthful representation—they failed to consider important factors deliberately omitted from the representations. Thus, being provided with an incomplete problem representation (specification) can actually lead to worse performance than having no representation at all.” [Leveson00, Section 3.2]

Chapter 13: Evidence of meeting purpose

20 May 2024

13.1 Introduction

An implemented and operational system needs to meet its purpose (Chapter 8). After all, that purpose is the reason that resources have been spent on developing the system and using it. Meeting purpose means two things: that the system does all the things it is supposed to, and that it does not do things it is not supposed to.

One cannot assume that a system meets its purpose. Each system needs to be evaluated to determine whether it actually does or not, and if not, how and where it does not. The evaluations catch design and communication errors that occur when one party thinks they have specified what is needed, and another party does not understand what was meant or makes a mistake in translating the specification into practice.

How a system works changes over time as well, and regular re-evaluation catches cases where operational behavior diverges from what is needed for correct or safe operation. This includes wear and tear on the system that must be corrected with maintenance. It also includes changes in how the system is operated—from operator practice to management organization and environmental context.

In this work I talk about gathering evidence of a system meeting its purpose.

Parts of a system’s purpose can be specified quantitatively or qualitatively. Quantitative purposes can lead to deterministic ways to check that the system meets the purpose. Complex quantitative purposes, however, aren’t necessarily so easily evaluated: computational complexity or the difficulty in actually measuring system behavior can lead to quantitative properties that cannot be easily or definitively evaluated.[1] For these complex quantitative problems, one must be satisfied with statistical evidence that indicates whether the property is likely true. Qualitative purposes are not amenable to proof one way or the other. These purposes are evaluated by human judgment, which again leads to evidence but not proof of satisfaction.

Systems engineering processes often use the terms verification and validation (or just V&V). These are both special cases of the general need to gather evidence for and against whether a system meets its purpose or not. In this chapter I focus on the general matter of checking a system, and I will note in this chapter and in later chapters when these specific uses of evaluating a system apply.

13.2 When to evaluate a system

Checking whether a system meets its purpose is an ongoing need, starting from when the system is first conceived, through system design, implementation, and operation. In general, a system should be evaluated any time its purpose changes, or any time its design, implementation, or operation changes.

In practice, there are five times in a system’s lifecycle when the system—whether in design, in implementation, or in operation—gets checked against its purpose.

  1. At each of the individual steps from initial concept, through specification, design, and implementation.
  2. At the time when the system is accepted for deployment.
  3. Periodically and regularly while the system is in operation, to monitor for drift.
  4. At each step when a change is requested, from concept through design and implementation.
  5. At the time when a changed system is accepted for deployment.
undisplayed image

During development, systems are checked in two ways: step by step, and a separate evaluation of the whole system when implementation is complete. The step by step checking occurs at each development step, including generating a concept for the system, generating a specification, designing, and then implementing the design. The expectation is that if each of these steps is correct, then the concept will follow the purpose, specification will follow concept, and so on, and the resulting implementation will properly meet the system’s purpose. (See the figure in Section 5.6.) In practice something gets missed or misinterpreted at some step of development, and so the argument that each step is correct does not hold. Separately evaluating the implementation at the end directly against the original statement of purpose allows one to cross-check the step-by-step evaluation. It helps one find which step had a mistake and thus where to make corrections.

Evaluations are part of the process of working out components’ specifications and designs. The idea of safety- or security-guided design [Leveson11, Chapter 9][Horney17] is to start with safety or security objectives as part of a component’s purpose (or the system’s purpose), refine those objectives into parts of the component’s specification, and then use this to help guide design work. Using safety or security objectives means conducting analyses of specifications or designs to see if they address the objectives, and adjusting the specification or design until there is evidence that they do meet the objectives.

Any time the system’s purpose changes, the system must be re-evaluated in light of the change. This involves repeating steps in the life cycle shown above. Re-evaluation is inexpensive early in initial design; the later in the life cycle, the more expensive re-evaluation gets. The scope of what parts of the system need to be re-evaluated can be limited by examining the structure of the system and how a change propagates from one component or relation to another.

A system should be evaluated regularly while in operation. In practice, systems drift over time from how they are originally designed and implemented. People who are part of the system, whether as operators, oversight, or management, can shift in their understanding of what they need to do, and often find shortcuts for their role as they adapt to what the system is like to work with. Mechanical parts of the system can wear, changing their behavior or increasing the chances of failures. The environment in which a system operates can change, perhaps with people moving near an installation that was previously isolated or maintenance budgets being cut. As a simple example, in one early software system I built, the software included a billing module that would create itemized invoices to be sent to insurance companies that were expected to reimburse for medical expenses. Over time, the people who should have been running the module and creating invoices forgot to do it as regularly as they should have, leading to revenue problems for the business. Leveson discusses several other examples [Leveson11, Chapter 12].

Finally, a system’s purpose usually changes over time. The users need new features, or some assumption about how they will use the system will be found to be wrong. Regulations or security needs may change. All of these lead to a need to change the system’s design and implementation. The team will recapitulate the development process to make these changes, including evaluating the updated concept, design, and implementation against the new purpose.

13.3 Kinds of evidence

There are two kinds of evidence: positive evidence and negative evidence. Both are needed to evaluate whether a system meets its purpose.

Positive evidence is an indication that the system properly implements some desired property or behavior. Positive evidence is what most people think of first: that the mass of system hardware is within some maximum amount, or that the system performs action X when condition Y occurs.

Negative evidence is an indication that the system does not do something it is not supposed to. Safety and security evaluations are fundamentally about collecting this kind of evidence: that the system will not do some unsafe action or enter into some unsafe state. Negative evidence is therefore vital to determining whether a system meets its objectives, but negative evidence is generally much harder to establish than positive evidence. In practice, analytic methods are the only ways we currently have to establish the absence of a condition.

Bear in mind that, as the saying goes, absence of evidence is not evidence of absence; that is, no amount of testing that fails to find an undesired condition can establish with certainty that a realistic system is free of that undesired condition. Negative evidence through testing requires testing every possible scenario, which is infeasible for anything other than trivial behaviors. Testing a very large number of scenarios can potentially generate a statistical argument for the absence of an undesired condition, but only if the scenarios chosen can be proven to provide sufficient, unbiased coverage of all possible scenarios, including rare scenarios. I have never found an example of someone being able to construct an argument for the significance of the test scenarios in a real-world system. Kalra and Paddock [Kalra16] present an analysis for testing autonomous road vehicle safety, and show that it would require an infeasible number of driving miles to show the absence of unsafe behaviors—and they conclude that alternate means are needed to determine whether autonomous road vehicles are sufficiently safe.

Many undesirable behaviors or conditions cannot be completely eliminated from a system, and instead the standard is to show that the rate at which these behaviors occur is sufficiently low. For example, aircraft are expected to experience failures at no more than some rate per flight hour in order to be certified for operation. These safety conditions lead to a need for evidence of statistical bounds on rate of occurrence at a given confidence level.[2] If these bounds are sufficiently loose, then a carefully-designed test campaign can provide statistically significant evidence. However, statistical significance and confidence rely on the test scenarios either being selected without bias, or with a way to correct for selection bias. This means, for example, ensuring that there is no class of scenarios that are avoided in selection. It also means understanding the probability of rare but important scenarios occurring and accounting for that rarity in the number of scenarios tested or in the way scenarios are selected.
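To give a feel for how the required test effort grows as the bound tightens, here is a small sketch of a standard zero-failure binomial calculation. It rests on strong assumptions that the text above warns about: trials are independent, selected without bias, and each fails with the same unknown probability. Under those assumptions, observing n trials with no failures supports the claim "failure probability is at most p" at confidence C when (1 - p)^n <= 1 - C, so n must be at least ln(1 - C) / ln(1 - p). The function name is hypothetical.

```python
import math


def trials_needed(max_failure_rate, confidence):
    """Failure-free trials needed to support 'rate <= max_failure_rate'.

    Assumes independent, unbiased trials with a constant failure probability.
    """
    if not (0 < max_failure_rate < 1 and 0 < confidence < 1):
        raise ValueError("rate and confidence must be strictly between 0 and 1")
    # log1p keeps precision when the rate is very small.
    return math.ceil(math.log1p(-confidence) / math.log1p(-max_failure_rate))


print(trials_needed(0.01, 0.95))  # 299
print(trials_needed(1e-9, 0.95))  # roughly 3 billion
```

A loose bound of one failure in a hundred trials is within reach of a test campaign; a bound of one in a billion is not, and any single observed failure or any bias in scenario selection weakens the argument further.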

13.4 Methods of gathering evidence

There are three general methods for gathering evidence about systems satisfying their purpose:

Experimentation tests an operational system (or part of a system) to show positive evidence about some desired capability. This is the gold standard for gathering positive evidence.

Experimentation is usually divided into two categories: testing and demonstration. Testing involves setting the system into a defined condition and providing it defined inputs, measuring the system’s response, and comparing that response to expectations. Tests are expected to be repeatable. Demonstration is more open-ended, where the system is operated for a while, possibly by people, and not always in a fully-scripted, repeatable way. Demonstrations can address some non-quantitative conditions, such as whether people like something or not.

Inspection or review is a way to check a design or system for things that cannot be readily measured by experimentation. These methods use human expertise to check the system for specific conditions. It is primarily used to gather positive evidence, but it can be useful for gathering negative evidence when other methods don’t apply. In the simplest form, inspection checks simple conditions that would be difficult to automate; for example, that a physical car has four wheels. For more complex reviews, humans observe and think about what they observe in the system to determine whether what they observe meets expected behavior.

Analysis can be used to collect both positive and negative evidence. Indeed, it is generally the most useful way to gather negative evidence, which is often about thoroughness: analytic methods are better at ensuring that all possibilities have been examined. Analysis takes as input a model of the system, extracted from its design or its implementation. It then applies algorithms that work to prove or disprove statements about that design, such as whether there exists some sequence of state transitions that would cause a component to enter an undesired state. The evaluation is usually performed using automated computational tools, though it can sometimes be done by hand for analyses of modest complexity. I have used analytic methods occasionally, usually for foundational components or abstractions on which the system depends for its correct operation. The first time I used them, on the design of a synchronization mechanism in a multi-threaded computing environment, the analysis caught a subtle flaw that would have occurred rarely but would have been difficult to detect. On another project, colleagues and I proved the correctness of the design of a family of distributed consensus algorithms, which helped us implement the algorithms accurately. The seL4 operating system kernel [Klein14] has a formally proven design and implementation, showing that its implementation provides key confidentiality, integrity, and isolation properties as well as functioning according to its specification.
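The question "does some sequence of state transitions reach an undesired state?" can be illustrated with a tiny explicit-state search. The sketch below is a toy, not the mechanism from the projects mentioned above: it models two threads sharing a lock, once with a flawed design that tests the lock and sets it in separate steps, and once with an atomic test-and-set. All names and the state encoding are assumptions made for the example.

```python
from collections import deque
from functools import partial


def lock_steps(state, atomic):
    """Yield (action, next_state) for every thread step possible in `state`.

    state = (pc of thread 0, pc of thread 1, lock); pc 0 = waiting,
    1 = saw the lock free, 2 = in the critical section.
    """
    pcs, lock = state[:2], state[2]
    for tid in (0, 1):
        pc = pcs[tid]
        if pc == 0 and lock == 0:
            # Atomic test-and-set enters directly; the flawed design only tests.
            new_pc, new_lock = (2, 1) if atomic else (1, 0)
        elif pc == 1:
            new_pc, new_lock = 2, 1  # set the lock after an earlier test
        elif pc == 2:
            new_pc, new_lock = 0, 0  # leave the critical section, release
        else:
            continue  # waiting while the lock is held
        new_pcs = list(pcs)
        new_pcs[tid] = new_pc
        yield f"t{tid}: {pc}->{new_pc}", (new_pcs[0], new_pcs[1], new_lock)


def find_violation(initial, is_bad, steps):
    """Breadth-first search of all reachable states.

    Returns the shortest action trace to a bad state, or None if no
    reachable state of the model is bad.
    """
    parent = {initial: None}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if is_bad(state):
            trace = []
            while parent[state] is not None:
                state, action = parent[state]
                trace.append(action)
            return trace[::-1]
        for action, nxt in steps(state):
            if nxt not in parent:
                parent[nxt] = (state, action)
                queue.append(nxt)
    return None


def both_critical(state):
    return state[0] == 2 and state[1] == 2


flawed = find_violation((0, 0, 0), both_critical, partial(lock_steps, atomic=False))
fixed = find_violation((0, 0, 0), both_critical, partial(lock_steps, atomic=True))
print(flawed)
print(fixed)
```

For the flawed design the search returns a four-step interleaving in which both threads see the lock free before either sets it. For the atomic design it returns None, which is negative evidence about the model only: the result is as good as the model's fidelity to the real design.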

13.5 Completeness and minimality

Separate from these methods for gathering evidence, one also needs evidence of completeness and minimality.

When a system is believed to be complete, one doesn’t want only to show that one or a few purposes are met; eventually one needs to provide evidence that all purposes are met. This does require knowing what the purpose is, and then being able to provide evidence showing each part of it has been satisfied.

One also needs to show that the system as designed or implemented does not do things that don’t derive from and support the purpose. This includes showing that safety and security properties (of bad things not happening) are met. It also includes ensuring that people have not inserted features that the end users do not need or want, which would imply that development resources have been mis-spent and that the system can potentially do things the users will find undesirable.
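Part of this bookkeeping can be sketched as a two-way traceability check: every part of the purpose should have some evidence behind it, and every feature should trace to some part of the purpose. The identifiers and data below are hypothetical, and the check covers only the tracing; it says nothing about whether the evidence itself is sound or whether safety properties actually hold.

```python
def check_completeness_and_minimality(purposes, evidence, features):
    """Return (purposes lacking evidence, features supporting no purpose).

    purposes: set of purpose identifiers
    evidence: dict mapping a purpose identifier to a list of evidence items
    features: dict mapping a feature name to the set of purposes it supports
    """
    unmet = sorted(p for p in purposes if not evidence.get(p))
    unjustified = sorted(
        name for name, supported in features.items()
        if not (set(supported) & set(purposes))
    )
    return unmet, unjustified


# Hypothetical project data.
PURPOSES = {"accept_ground_commands", "stay_in_licensed_band"}
EVIDENCE = {"accept_ground_commands": ["end-to-end command test"]}
FEATURES = {
    "command_decoder": {"accept_ground_commands"},
    "band_limiter": {"stay_in_licensed_band"},
    "debug_backdoor": set(),
}

unmet, unjustified = check_completeness_and_minimality(PURPOSES, EVIDENCE, FEATURES)
print(unmet)        # ['stay_in_licensed_band']
print(unjustified)  # ['debug_backdoor']
```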

[1] For example, consider a system property that is equivalent to the halting problem or first-order logic validity. Both these problems are formally recursively enumerable, meaning that if a property is true, that fact can be established in a finite time, but if the property is false, it may take an infinite amount of time to determine that it is.
[2] Estimating the total number of species in an environment is a similar problem. One way to generate an estimate is to look at the rate at which new species are discovered. When most species in an environment have been discovered, the rate at which new ones are found decreases and in the limit goes to zero when all species have been discovered. See the work of Wilson and Costello [Wilson05] as an example of performing such estimation. It has been argued that the rate of discovery of undesirable system behaviors should follow a similar model.
Sidebar: Summary

Part IV: Making a system

A detailed model for how to go about building a system:

Chapter 14: Approach

21 May 2024

Making a system is about the activities to build the system and the people who do that work. In Chapter 6, I laid out a basic model for these activities and what they involve. The model involves five elements (repeated from that chapter):

undisplayed image

The model is organized around the tasks that are performed to build the system. The tasks generate artifacts, including design and implementation. The team is the people who do these tasks. The people use tools to do some of the tasks. And finally, the plan organizes the work.

This model provides a template for thinking about how to set up the processes and policies for a system-building project. That is, when it comes time to do a project, one can use this model to help guide the decisions about how the project will be run. In this book I do not specify how one should make these decisions—each project has its unique needs, and no one recommendation will be a good solution for every project. Instead, the model provides a framework for understanding what decisions need to be made, and in later chapters I provide menus of choices for different parts of the model.

All the pieces of running a project are themselves a system, whose purpose is in general to get the system built. In this part, I follow a general approach for designing any system in order to lay out a set of functions that each part of the model can have. In doing so, this lays out a framework for the criteria by which someone can judge potential designs for their project’s organization.

undisplayed image

The approach, then, begins with working out the purpose of the system for running the project. The purpose in turn derives from the stakeholders who must be satisfied with the execution. In the rest of this chapter, I lay out a template list of stakeholders and the needs each of them might have. This set of needs will then provide guidance for what each component part of the model—artifacts, team, plan, and tools—should have in order to satisfy the stakeholders.

14.1 Purpose

The primary purpose of the system that is the project is:

Get the system built, accepted, in operation; maintain and evolve it.

There are also secondary objectives that different stakeholders will have, which I discuss next. These include, for example, the needs of the organization hosting the team that does the work: the organization in most cases expects at least to be able to cover the cost of development. If the organization doesn’t believe that it can cover the cost, it may well decide not to pursue the project.

In the next step, I identify potential stakeholders. Following that I will identify potential needs each can have, what different kinds of each there can be, and how each stakeholder relates to the organization that runs the project.

14.2 Stakeholders and needs

The first step in working out a system’s purpose is to identify the stakeholders who define the purpose (or put constraints on the project that are, in effect, part of the purpose).

I group stakeholders into five classes:

  1. The customer for which the system is being built;
  2. The team that builds the system;
  3. The organization(s) of which the team members are part;
  4. Funders who provide the investment to build the system; and
  5. Regulators who oversee the system and its building.

Each of these is meant to be a role, rather than a single entity. For example, when a system is built under contract for an organization that is paying for the work, that organization is both the customer (it will be using the system) and the funder (it is paying for the system-building).

14.2.1 Customer

The customer is the person or organization(s) that will use the system once it has been built and deployed. The system’s value in the world in the end derives from what the customer can do using the system.

The customer primarily cares about the system meeting some need they have. In addition, they care that the system:

Variations. The simplest kind of customer is when one organization contracts with another organization to build the system for the first organization. In this case, it is clear who needs to be satisfied with the system (the one paying for it).

Other times the customer is internal: an organization determines that it needs some system for its own use. Who defines the purpose of the system is then usually clear, though not always, because the separation between the “customer” and the builders is less distinct.

Finally, the most complex situation occurs when the customer is hypothetical. This occurs when an organization builds a system product in hopes of providing it to future (paying) customers. In this case, there is no one person or organization who can dictate the system’s purpose. Instead, the team designing the system must build up an idea of who potential customers are and what they might want.

I discuss the different kinds of customers further in Section 22.5.

Relation to broader organization. Most organizations have someone or a team responsible for finding and working with customers. This might be a business development group, or a sales and marketing group. These people will be responsible for actually working with the customer, and they should stand in as a proxy for the customer during internal discussions. The systems aspects that I discuss here support the interface between the marketing or business development people and the people who build the system that is delivered to the customer.

14.2.2 Team

The team is the collection of all the people who do the work to design and build the system. This group includes developers and engineers, managers, contracting specialists, marketing, and everyone else who does tasks related to getting the system built.

Many of the things that the team needs are not directly related to building the particular system, but are aspects of the organization for which they work. An organization’s policies and management have the most effect on whether the team are satisfied, but there are aspects of systems work that can support (or hinder) the organization.

The people in the team need, in general:

Variations. The team can be as simple as one or two people, or it can range up to a large team of hundreds. The team can be all within one organization, or it can be spread over multiple organizations (such as when multiple organizations collaborate on a project). A team can also be viewed as including external vendors who provide parts of the system or essential services.

Relation to broader organization. Most of a team’s needs are matters of project management and business operations, not of systems-building in itself. The organization defines its human resources policies, for example, which address matters of how people are evaluated or paid, and how they can report problems.

However, the organization of systems work can help to meet these needs. Accurate staffing depends on understanding the work to be done, which in turn depends on the system’s design. Well-defined job descriptions and processes help people understand how to get their job done, contributing to people feeling secure in their position.

14.2.3 Organization

The organization is the entity or entities for whom the people in the team work, and which provide a legal entity for the project. I use the term “organization” rather than “business” or “company” because there are many kinds of organizations that can run a project: a government, a consortium of other organizations, a non-profit organization, or an informal group of people can all run a project.

All organizations share one concern: the ability to deliver the system. This includes having the ability to communicate with the customer (or model potential customers) and the ability to hire and support the team doing the work.

Organizations also share a need to maintain their reputation. If an organization has a reputation for delivering good systems, on time and on budget, they will be more likely to be able to keep going.

Some organizations have additional needs, focused around how the project will position them to deliver other things to other people. An organization may need to show a profit—enough to fund the organization’s overhead and to deliver returns to funders. An organization may need to be able to sell the system to potential customers. And an organization may need the project to position the organization for future work, based on improving the organization’s capability and maintaining its reputation.

Variations. There are many different kinds of organizations. These include:

Relation to broader organization. Obviously, most of an organization’s needs are addressed not by the team building a system, but by the organization’s management and operations. The systems project supports these needs, however. The organization needs to be able to estimate the cost and time involved in a project in order to ensure that it has the funding needed to complete the project. The organization’s reputation depends in no small part on its ability to execute the systems-building project, so things that help the project move ahead efficiently and smoothly will be good for the organization.

14.2.4 Funders

Funders provide the capital or other resources needed to build the system.

A funder has one primary need: the return on their investment. The return may be monetary (profit from sales of the system) or it may be more intangible (a business ecosystem, regional economic development).

Some funders will have secondary needs, such as enhancing their reputation and positioning themselves for funding future projects.

Variations. Funders can be external to the organization building the system, providing investment in the expectation of a monetary return. Venture capital funding is one example of this kind of funder.

The customer can be a funder when the customer pays for building the system. This can be a commercial customer funding the project through a contract. This can also be a government organization providing a development contract. The expected return in these cases is primarily the system itself, and secondarily less tangible benefits like the development of capacity to build similar systems.

A project can also use internal funding. This occurs when an organization has the capital to develop a system itself. The organization generally expects a return on its investment either by improving the organization’s own capabilities, such as by building a tool that helps the organization run better, or by providing a product that the organization can sell for a monetary return.

Relation to broader organization. While the organization has the primary responsibility for working with funders, a systems-building project can help meet the funders’ needs by building the system efficiently, using the investment well, and by producing a good system, which helps ensure that the expected return will occur.

14.2.5 Regulators

Regulators in general are people or organizations independent from the team and project. The regulators provide an external check on organizations and products to ensure they meet safety and security regulations, or that they provide legally-required public value.

Regulators need compliance with regulation in the system and in the work the team does to build the system. The regulator may verify that regulations have been met by inspecting the final system or by auditing records of the system’s creation. The regulator may block a system’s deployment until the system can be certified as meeting these requirements, as happens with aircraft. Alternatively, the regulator may depend on the team to know and follow the regulations and only check the system’s compliance when something goes wrong. The US automotive industry is an example of this.

The systems-building process, at minimum, supports regulators’ needs by knowing and following the regulations. This often can involve dialog with the regulatory organization to ensure that the team has all the information it needs, and to ask for clarification or guidance when the team is unsure about the regulation. The team also needs to maintain records that can be checked to show how it has complied with regulations. When the system requires certification before being deployed, the team usually needs to engage with the regulators to ensure the process goes smoothly.

Variations. A government organization is the obvious regulator. They have the charter to look after the public interest, especially when a project has incentives that would work against that interest.

Industry organizations can act as de facto regulators. A group of companies can come together to set voluntary standards for the systems they make. The groups that standardize the Internet (the Internet Corporation for Assigned Names and Numbers, ICANN) or WiFi (the IEEE Standards Association and the Wi-Fi Alliance) for interoperability are examples. These organizations do not have authority to penalize systems that do not comply, but a system that does not comply is not allowed to claim compatibility.

Finally, there are non-governmental organizations that set safety or security standards, often for a particular industry. ISO, SAE, and others provide safety standards (such as [ISO26262] or [ARP4754]) and companies have grown up around them to help other organizations comply with the standards. These organizations also have no authority to penalize non-compliant systems directly, but compliance is usually evidence used to show that government regulations are met, or provide a defense against lawsuits.

14.3 Mapping needs to model

The previous section introduced a set of stakeholders that have an interest in how the project operates, and a summary of each of their needs. The next step is to work out how the model for performing the project can support meeting those needs (see the diagram above). This involves mapping the stakeholder needs to each of the parts of the model (artifacts, team, tools, plan).

I developed this detailed mapping. Appendix A reports the details of each stakeholder and their needs, along with the full derivation from needs to the requirements for the pieces of the system-building model. The mapping has the form of tables of requirements or objectives, with each stakeholder need mapped to one or more objectives for each part of the system-building model. The result is that every stakeholder need is either supported by aspects of the system-building model, or is explicitly labeled as the responsibility of others outside the system-building project. The derivation also shows that every objective listed for the system-building model is justified by helping meet some stakeholder need.
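
The two properties claimed for the mapping lend themselves to a mechanical check. The following is a minimal sketch in Python, not the derivation in Appendix A. The need and objective names are hypothetical placeholders, and the use of None to mark a need that belongs to someone outside the project is an assumption made for illustration.

```python
# Hypothetical data: each stakeholder need maps to a list of objectives.
# None marks a need explicitly labeled as the responsibility of others.
NEED_TO_OBJECTIVES = {
    "regulator: compliance records": ["artifacts: keep auditable records"],
    "funder: efficient use of investment": ["plan: track budget"],
    "funder: negotiate funding": None,  # handled by the organization
}

# Hypothetical set of all objectives listed for the system-building model.
OBJECTIVES = {
    "artifacts: keep auditable records",
    "plan: track budget",
    "tools: support versioning",
}


def unsupported_needs(mapping):
    """Needs that have no objective and are not labeled as external."""
    return sorted(
        need for need, objs in mapping.items()
        if objs is not None and len(objs) == 0
    )


def unjustified_objectives(mapping, objectives):
    """Objectives that no stakeholder need points to."""
    used = {obj for objs in mapping.values() if objs for obj in objs}
    return sorted(objectives - used)
```

With this sample data, every need is either supported or labeled external, while the objective "tools: support versioning" is reported as unjustified because no need leads to it.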

The remaining chapters of this part of the text explain what each part of the model should be or do. These chapters are based on the derivation in the appendix.

Chapter 15: Artifacts

25 May 2024

15.1 Purpose

Artifacts are all the things created in the process of making a system. They start with records of the purpose of the system and the requirements it must fulfill. They include the implementation of the system ready to deploy, such as hardware inventory in a stock room and software ready for installation. The artifacts include everything in between, including design, source code, verification records, rationales for decisions, records of reviews and approvals, and many, many more. The artifacts also include information used by the team to help do its job, such as information about who is on the team, processes to follow, and how the team operates.

The objectives for artifacts are documented in Section A.3.1.

The artifacts have three functions: as deliverables, as communication, and as a record of the project for auditing.

As deliverables, the implementation artifacts are the actual system to be deployed. It should be possible to take a set of implementation artifacts, assemble them (following instructions that are themselves artifacts) and have a working instance of the system. These artifacts are joined by things like records of regulatory approval and information associated with serial numbers or versions showing the history of the specific artifacts deployed in the system.

Most of the artifacts, however, are for communication: between people working on one task and another, between the customer and system designers, between those who implement and those who verify. Sometimes those people are working concurrently, such as when two people design two components that are expected to work together. Sometimes the communication is between someone who specifies attributes for a part of the system and someone who implements that part. The communication is also between someone who made a design decision and someone who, years later, must understand that decision in order to make changes to the system.

Audit is a special case of communication. It is between the project and someone outside who will be checking the project’s work. In many cases the external party will have an adversarial role, looking to find mistakes or violations. Regulators, for example, may look through records to check that the team has followed processes that meet regulatory requirements.

Note that there are many ways to achieve the objectives laid out in this chapter. Each project will need to determine how to handle its own artifacts. The specific solution will depend on the complexity of the project, the size of the team, and requirements from the organization or industry. The appropriate solution may change over time: as a team grows, it may need more formal mechanisms.

I have seen a range of working approaches for handling artifacts. Two projects kept track of planning information on designated whiteboards. Others maintained plans in project management tools. (The whiteboard approach had a problem: one time someone erased the board. Luckily there was a recent picture of its contents.)

I have also been on projects that had an overly complicated solution. One project was a joint venture between multiple companies on multiple continents. That project used multiple repository tools for different kinds of information. There was a process for proposing design and implementation changes, but no one knew quite what it was or how to follow it. After a few years that joint venture fell apart, in part because the teams could not figure out how to work together.

Whatever solution you adopt, it is important that it fit your project and team. It should be capable enough to manage the kinds of artifacts your team will use, and simple enough for the team to use.

The objectives in this chapter can help you work out what capabilities your solution should handle.

15.2 General principles

The artifacts are meant to be shared, at least within the team and sometimes to people outside. The people using these artifacts will come and go, so supporting people who will use them in the future is as important as sharing in the moment.

This leads to some general principles about artifacts.

People should be able to find the artifacts they need. An artifact is not useful if the people who need it don’t know it exists, or if they don’t know how to find it. The artifacts should be organized in some way that helps people find them.

“Finding” has multiple aspects. It can mean that when they know something exists, they can get to that artifact conveniently. It can mean that they know that a general kind of thing probably exists, and they need to be able to navigate through to the artifacts of that kind. They may not know what is out there, and need to be able to browse or discover artifacts in order to learn about the system. Or it might mean that they need to have confidence that they can itemize all of a certain kind of artifact, without missing any.

People should have confidence that they have found the correct artifact. In the worst case, someone will look for a particular thing and find three or four potentially-relevant artifacts. Which, if any, of those should they believe? What if they disagree with each other?

This principle generally means, first, that any particular piece of information or artifact should be in one place. There should not be two different artifacts that appear to be authoritative sources for the same piece of information. It also means, second, that when there are legitimately multiple versions of an artifact, those versions should be clearly identified and that a user should see consistent versions of different artifacts unless they take explicit actions to see different versions.

The artifacts should be maintained securely. The system that the customer will ultimately use is based on many artifacts that the project maintains. If someone subverts or damages some of those artifacts, the resulting system will be compromised. If someone destroys some of the artifacts, some of the team’s work will be lost.

This argues at minimum for maintaining the integrity of the artifacts, meaning that the artifacts or the collection of them cannot be modified in an unauthorized way. (Good practice is that any change to an artifact can be traced reliably to the person who made the change.)

Some of the artifacts may need to be kept confidential, if they contain secret information. Almost every project has some information to be kept confidential, at minimum as part of maintaining the integrity of artifacts. (Login credentials, for example.)

15.3 Kinds of artifacts

This section lists the kinds of artifacts that the analysis in Appendix A showed contribute to meeting stakeholder needs. The artifacts are listed in the order in that analysis.

15.3.1 Purpose and constraints

These artifacts include clear documentation of the customer’s purpose for the system. Every feature of the system derives, directly or indirectly, from this purpose. If that purpose is not written down, the team is unlikely to accurately design to meet those needs—and is likely to add features that the customer does not want (so-called “gold plating”). These artifacts should be visible to most of the team in order to guide them as they design, build, or verify the system.

The customer’s non-functional constraints should be included. This includes the safety, security, and reliability they expect.

Constraints from other stakeholders should also be documented. The organization may place constraints on the project, such as expected profitability. Regulators can place many constraints that must be met to license or certify the system.

The understanding of the purpose or constraints will change over time. A customer will find they have needs they did not initially realize, or they will discover that whatever purpose was agreed with the team is not quite what they meant. An organization or regulators may change their constraints as time goes by.

There should be an explicit record of the changes requested or identified. If a change is accepted—and the project may choose to reject some changes—then it should lead to a new version of the purpose and constraints. It should be possible to determine whether other artifacts, such as requirements or design, are consistent with a particular version of the purpose and constraints.

The specific kinds of artifacts include:

15.3.2 Team information

Maintaining information about the team helps the team work together.

I worked on one project where the management did not want to put together an org chart or a list of team members. I ended up talking to the wrong person about a particular technical subject—that person was happy to talk about it, but it turned out they were not actually on the part of the team working in that area. Their opinions turned out not quite to agree with those of the person actually in charge, but I hadn’t been able to find the person I should have been talking to.

This kind of confusion is more common than people expect, and it results in people getting the wrong information, or in people not getting information they should.

Information about the team is only valuable if it is accurate, however. The team should have someone responsible for keeping it up to date. Ideally, updating the information is a normal part of the processes for bringing in a new team member or changing assignments.

The specific kinds of artifacts that will help include:

15.3.3 System artifacts

These artifacts are the system that is being built—the majority of the work of a project.

The system artifacts include:

The exact set of these system artifacts depends on the process and life cycle (Section 18.3) that the project uses. If the life cycle has some review milestone that a part of the system is supposed to meet, then there may be documents or analyses specific to that review.

That said, good system building practice involves some core kinds of artifacts: specifications, designs, and implementation.

The artifacts should include some items that are more about the system building process than about the deliverable system itself. These include:

How the team maintains these artifacts can vary widely. Many software efforts use version control systems, which maintain versioned software artifacts in a repository server. Many hardware design tools either provide their own versioning repository, or are designed to work with a separate repository system. For hardware artifacts—not their design—one must work out where to store and how to track each physical artifact.

15.3.4 Verification artifacts

Verification artifacts support verifying that the system (or components in it) meet their intended purpose and specification, and that they are free of errors.

These artifacts include:

These constitute a record of what parts of the system have been checked and found to meet their verification criteria.

Verification should be repeatable. The artifacts maintained for doing verification checks should be complete enough that different people can perform the checks in the same way. The instructions for performing checks should be clear. The test equipment should be maintained and people should have instructions on how to use it. Software test environments should be controlled so that when a test is run twice, it is in the same environment both times.

The verification results are generated by people performing checks, and used by people reviewing part of the system to ensure it has been checked before it is accepted as working. They may also be audited by regulators or other outsiders who will be checking whether the project has built the system properly.

15.3.5 Release, manufacture, and deployment

Releasing and deploying a system are complementary steps. Releasing involves taking implementation artifacts and making them available for manufacture or distribution.[2] Manufacturing the system follows if needed—involving producing and assembling hardware, or packaging software into a deployable form. Deployment takes the manufactured system and sets it up for a customer to operate.

The artifacts should include the procedures used to release, manufacture, and deploy the system. The release procedures define the sequence of steps involved in taking a version of the implementation artifacts, checking that they have been verified and meet the intent of a release (such as the features implemented or bugs fixed), and placing copies of those artifacts in a separate area as a release. The manufacturing procedures define how to take those released artifacts and manufacture products that are ready for deployment: assembling hardware according to a released hardware design, for example, and giving them serial numbers. The deployment procedures tell how to take those manufactured artifacts and install them so that they are a working customer system.

There are different variations on this flow of operations depending on whether one is releasing and deploying a whole system or an update, whether the artifacts are electronic (software or data) or physical (hardware components), and whether the system will be mass produced or not.

Hardware components will generally start with a release of a hardware design. That hardware design is the basis for manufacturing instances of the component. Whether it is a single unit made in house or many units produced in a dedicated facility, the manufacturing procedure determines how the hardware products are made. Before finishing manufacture, hardware components are typically given an identity, often recorded as a serial number, that identifies the specific component instance and associates it with records like which design release version was used, what subcomponent parts were used, date of manufacture, and so on. Then the part is placed in inventory from which it can be deployed.

Software components most often follow a different path. Because software is electronic information rather than a physical object, there is no physical “manufacturing” step. The release procedure gathers implemented software and creates a deployable package from it. The step that stands in for manufacturing gives the package an identity (a release number) and signs it or otherwise sets up security protections. The package can then be copied to a server that makes it available for distribution and deployment.
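
Giving a software package an identity and a protective signature can be sketched in a few lines. This is a minimal illustration in Python under stated assumptions, not a recommended release process: the record layout and function names are hypothetical, and the keyed hash (HMAC) with a shared key stands in for the public-key signatures that real release signing generally uses.

```python
import hashlib
import hmac


def package_identity(package_bytes: bytes, release_number: str) -> dict:
    """Build an identity record: release number plus a content digest."""
    digest = hashlib.sha256(package_bytes).hexdigest()
    return {"release": release_number, "sha256": digest}


def sign(record: dict, key: bytes) -> str:
    """Sign the identity record. HMAC is an assumption made for brevity."""
    message = f"{record['release']}:{record['sha256']}".encode()
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify(package_bytes: bytes, release_number: str,
           signature: str, key: bytes) -> bool:
    """Recompute the signature from the package and compare it safely."""
    expected = sign(package_identity(package_bytes, release_number), key)
    return hmac.compare_digest(expected, signature)
```

A deployment check would call verify on the downloaded package before installing it. Any change to the package contents or to the release number makes the check fail.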

Deployment procedures take hardware from inventory and software from a distribution server and put them into use for a customer. This could be as simple as letting customers know that a software update is available for download. It could involve moving a number of physical components to a customer site, setting them up, and performing deployment checks to ensure that the installed system is working. It could be as complex as delivering a spacecraft to its launch provider, preparing it for launch, and having the spacecraft start up on orbit.

The whole process of producing deployed systems often generates a lot of records. Hardware devices have associated records about what specific design was used, what subcomponents were used, when and where it was manufactured, and then accumulate service records: when deployed, what defects were reported, what repairs made, how the device was disposed at end of life. Software has similar records: the identity of the software image, the versions it contains, how it was built, when it was made available for deployment, where it has been deployed, and its service history.

15.3.6 Project operations

Artifacts that support operations can be broken down in the same way that operations itself is (Section 6.3.5 and Chapter 18).

The project’s life cycle and procedures can be maintained in simple documents. Because these documents serve as a reference for team members, it is important that people be able to find easily the parts of the documentation they need for a particular situation: for example, if someone is setting up a design review for a particular component, they need to find the procedure for design reviews. The documents also need to support people reading through the life cycle or procedures to learn how the project operates in general. Having a good table of contents or index and accurate summaries can help them understand the breadth of operations before they need to learn about some specific procedure.

I have worked on several projects—especially including NASA projects—that develop complex “management plans” and “systems engineering management plans”. I have found that few people in the team actually use these documents. The management plans often follow a template that speaks to the team’s aspirations (“the team will do X”) but does not lay out the actual procedures (“do X by doing Y and Z”). The information in these plans is also often organized for a management reviewer, rather than for the people who need to follow the procedures. As a result, the documents sit unread after being approved and the team operates on shared lore about how to do one task or another, and the plans become increasingly out of date as the team’s practice diverges from the original intent.

Instead, the life cycle and procedure documentation should:

Beyond the life cycle and procedures, planning and tasking activities involve creating and maintaining records. These artifacts are often maintained using specialized tools, such as project planning tools and task management (or issue management) systems.

Operations also maintains records of supporting information, such as budgets, risk registers, and lists of technical uncertainty.

15.3.7 Regulatory artifacts

Working with regulators typically involves a lot of records. The team uses some of these to guide how it builds the system. Other records form a legally-binding record of what the project has done and how the team has interacted with the regulators.

First, the artifacts should include records of the regulations that the project must comply with. This might be as simple as references to publicly-available reference sources (such as web sites that make current government regulations available). It may also include documents that explain what these regulations mean. This information is only of value if it is accurate; this means it must be kept up to date as regulations change. (In some fields, it is worthwhile having someone who tracks likely upcoming regulatory changes so that the team can anticipate those as well as working to current regulations.)

The artifacts should also include records of the processes that the team needs to follow working with the regulators. For example, if the system must obtain a license before being deployed for use, then there will be a process for applying for that license. Again, this information must be kept up to date to be useful. The processes are often difficult to find or interpret, so it is helpful to maintain documents that explain the process as well as just a record of the process.

Second, systems that need licenses or certification will require applications to regulators. The application information should be maintained, including copies of any application forms (with dates!) and any supporting documents generated as part of putting the application together. For example, I helped one team apply for a license to operate a small spacecraft in low Earth orbit. The license application included an orbital debris assessment report, which was sent to the regulator as part of the application packet. The assessment report included information generated by a debris assessment tool [NASA19]. The database used by the assessment tool was an artifact to be maintained, along with the report itself.

Correspondence with regard to the applications also needs to be maintained. This should include any information that shows how the team took steps to follow the application processes.

Next, the project must keep records of licenses or certificates that have been issued.

Finally, the project will need to maintain evidence that the system it has built complies with regulation, whether a license application is involved or not. These take the form of a mapping from a table of regulatory requirements to the evidence of compliance with each of the requirements. The evidence can be complex: for example, showing that the probability of a particular hazard occurring is below a mandatory threshold.
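
A mapping from regulatory requirements to evidence can be checked for gaps automatically. The sketch below is a minimal illustration in Python; the requirement identifiers and evidence names are hypothetical, and a real project would draw them from its own records.

```python
# Hypothetical regulatory requirement identifiers.
REQUIREMENTS = ["REG-1", "REG-2", "REG-3"]

# Hypothetical mapping from requirement to the evidence artifacts for it.
EVIDENCE = {
    "REG-1": ["hazard-analysis-v3", "test-report-17"],
    "REG-2": [],  # listed, but no evidence collected yet
}


def missing_evidence(requirements, evidence):
    """Requirements with no evidence, whether unlisted or listed empty."""
    return [req for req in requirements if not evidence.get(req)]
```

Here missing_evidence(REQUIREMENTS, EVIDENCE) reports REG-2 and REG-3. The check only shows that evidence exists for each requirement; judging whether that evidence is adequate still takes review by people.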

15.4 Managing artifacts

Artifacts are the result of the team’s work, and thus they carry value to the team and its customers. They represent the system being built. They are used continuously to inform and manage the team. They are often used long after they are created, to audit the work and to guide modifications to the system.

The artifacts change over the duration of the project. An early design draft gets revised into a version used to build the corresponding component. Later, the design is revised for a second-generation component.

These conditions lead to three general principles for managing artifacts: security to protect integrity, organization so people can find the artifacts, and change management.

15.4.1 Security

The artifacts need to be managed in a way that preserves their value by maintaining their integrity. Losing or damaging an artifact results in a loss that could be anything from annoying (losing minutes from a status meeting) to fatal to the project (damaged implementation of a critical component). The artifacts should be protected against both accidental loss, such as a server breaking, and malicious loss. For data artifacts, this means using resilient storage systems with good cybersecurity. For physical artifacts, it means storing artifacts in storerooms that maintain a benign environment and that provide physical security.

Access to the artifacts should be limited to authorized people using access control mechanisms. These mechanisms reduce the risk of malicious damage by limiting who can get to the artifacts. For artifacts that need to be kept confidential, limiting access helps reduce knowledge leaking to unauthorized people.

15.4.2 Organization

A random jumble of artifacts is of little use to people on the team. The team members need for the artifacts to be organized in a way that allows them to find the ones they need accurately and quickly.

There are two kinds of “finding” that team members will do.

In the simple case, they will know what they need: the design document for some component, or the risks associated with the project, or widget serial number X. To find something specific, they need to know where to find artifacts and how those artifacts are organized. They can use that organization to get to the specific one.

The other case is when someone knows they have a need but does not know exactly what they are looking for. This might be someone who has recently joined the project, or someone who is working in an area they aren’t familiar with. These people will need to be able to see and learn how the artifacts are organized, and will need a guide to help them understand what is available.

Finally, there should be one logical place for each artifact, and artifacts should not be duplicated. (There might be copies for redundancy, but the people looking for one artifact should see those copies as if they were one thing.) Two people looking for the same information should not end up finding two different artifacts that cover the same topic and that have diverged from each other. This leads to people building incompatible components, sometimes in ways that are hard to detect but that lead to errors in the system.
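
One way to notice divergent duplicates is to scan a catalog of artifacts for topics that appear in more than one place. This is a minimal sketch in Python that assumes such a catalog exists as a list of (location, topic) pairs; the locations and topics in the usage note are hypothetical.

```python
from collections import defaultdict


def conflicting_sources(catalog):
    """Return topics that are covered by more than one artifact location.

    catalog is an iterable of (location, topic) pairs.
    """
    by_topic = defaultdict(set)
    for location, topic in catalog:
        by_topic[topic].add(location)
    return {
        topic: sorted(locations)
        for topic, locations in by_topic.items()
        if len(locations) > 1
    }
```

For example, a catalog holding ("wiki/power", "power budget") and ("docs/power.xlsx", "power budget") would be reported as a conflict on the topic "power budget", prompting the team to pick one authoritative location.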

15.4.3 Change management

As I have noted, artifacts change regularly over the course of a project. However artifacts are managed, the approach needs to account for the effects of these changes.

Some artifacts, like records of task assignments and progress, change often but at any given time there is only one accurate copy of the information.

Most system artifacts, on the other hand, evolve in more complex ways. At any given time there may be multiple versions that are works in progress—containing incomplete changes that their creators don’t believe are ready to be used by others. Some of those in-progress versions may develop to become accepted versions, ready for others to use: a design that is ready to be implemented, or an implementation ready for integration testing. A version that has been used like this may later become obsolete as an updated version comes along.

This pattern of change calls for supporting versioning on this kind of artifact. Versioning means that one can find multiple versions of the artifact, and each artifact has an identifiable status so that someone can know whether they should be using that version to build other artifacts, or just looking at the version to understand it.

The dependencies of one artifact on another, such as a design leading to an implementation, and an implementation leading to verification test results, mean that mutually consistent versioning is also important. When looking at an overall version of the system, it should be clear that (for example) the design for component X has been updated, the implementation for that component is in progress of being updated to match the design, and any verification results are from an older implementation that may no longer be accurate.
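
The consistency question in this example can be expressed as a small check. The following Python sketch assumes each artifact records its own version and the version of the upstream artifact it was built from; the field names and the three-artifact chain are hypothetical.

```python
# Hypothetical records for component X. Each artifact notes which
# upstream artifact it depends on and which upstream version it used.
ARTIFACTS = {
    "design": {"version": 2, "upstream": None, "built_from": None},
    "implementation": {"version": 1, "upstream": "design", "built_from": 1},
    "verification": {"version": 1, "upstream": "implementation", "built_from": 1},
}


def is_current(name, artifacts):
    """True if the artifact and everything upstream of it are consistent."""
    artifact = artifacts[name]
    upstream = artifact["upstream"]
    if upstream is None:
        return True  # nothing upstream to disagree with
    matches = artifact["built_from"] == artifacts[upstream]["version"]
    return matches and is_current(upstream, artifacts)
```

With these records the design is current, the implementation is not (it was built from design version 1), and the verification results are not current either: they match the implementation, but that implementation is itself out of date.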

Most project life cycles and procedures define different statuses that an artifact version can have, along with procedures for how that version can change status. While the details differ, the statuses generally include some sort of work in progress, proposed, approved (or baselined), and superseded. The procedures generally say what has to happen for a version to move from one status to another, such as defining that a proposed design needs a review and approval step to be accepted as a baseline.
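
Statuses and the procedures for moving between them amount to a small state machine. The Python sketch below shows one possible encoding; the particular transitions and the "review" step are assumptions for illustration, since each project defines its own.

```python
from enum import Enum


class Status(Enum):
    WORK_IN_PROGRESS = "work in progress"
    PROPOSED = "proposed"
    APPROVED = "approved"
    SUPERSEDED = "superseded"


# Hypothetical rules: allowed transitions and the step each one requires.
# None means the transition needs no recorded step.
ALLOWED = {
    (Status.WORK_IN_PROGRESS, Status.PROPOSED): None,
    (Status.PROPOSED, Status.WORK_IN_PROGRESS): None,  # sent back for rework
    (Status.PROPOSED, Status.APPROVED): "review",
    (Status.APPROVED, Status.SUPERSEDED): None,
}


def change_status(current, new, completed_steps=()):
    """Return the new status, or raise ValueError if the move is not allowed."""
    if (current, new) not in ALLOWED:
        raise ValueError(f"cannot move from {current.value} to {new.value}")
    required = ALLOWED[(current, new)]
    if required is not None and required not in completed_steps:
        raise ValueError(f"{required} must be completed first")
    return new
```

A proposed design moves to approved only when the review step has been recorded, as in change_status(Status.PROPOSED, Status.APPROVED, {"review"}). An attempt to jump directly from work in progress to approved is rejected.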

15.4.4 Implementing artifact management

There are many tools and processes in use today for managing artifacts. At the time of writing, no one tool works well for all kinds of artifacts, and so a project must stitch together its approach to managing artifacts out of multiple different tools.

Electronic artifacts. Software development uses version control systems to manage electronic files. There are many such systems, all of which provide a storage repository with a few common features:

Other industries use document control systems to manage collections of electronic files. These systems also provide a repository for a collection of files, but they generally focus on the management of documents rather than just versioning. They commonly include features like:

In addition, tools such as CAD systems or requirements management often include versioning and workflow features. These tools support creating different versions of an artifact, and defining a workflow for the procedure to be followed for approving a version as a baseline.

In practice the tools for managing artifacts do not often work together, requiring a project to (for example) select one tool for managing software artifacts, one for CAD system artifacts, another for structured systems engineering artifacts (such as requirements or specifications), and another for documents that do not fit neatly into these other categories.

Hardware artifacts. Many projects will create physical artifacts—mechanical components, electronic boards, manufacturing jigs, and testing equipment. These physical components need:

[1] “I kept a seven-by-ten-inch black notebook divided into six sections, as follows: (1) Schedule, (2) Systems Briefings, (3) Experiments, (4) Flight Plan, (5) Miscellaneous, and (6) Open Items. Section 6 meant problems of which I became aware as we went along, and which were duly listed by number. As long as they remained unsolved, or open, I reviewed them periodically and bugged the appropriate people for solutions. As they were solved, they were closed, and I drew a line through that number. By the morning of launch, I had 138 items, and all 138 had been crossed out. If this process was a bit scary and time-consuming, it was also immensely satisfying. It was going to be one hell of a flight, if only I could figure out… Whip out the notebook and write it down before I forgot it.” —Michael Collins, writing about the preparation for the Gemini 10 flight [Collins74].
[2] The term “release” has different meanings in different contexts. The term here could be taken as “release to manufacturing” in those situations where “release” requires qualification.

Chapter 16: Tools

25 May 2024

16.1 Purpose

Tools are things that people use while designing and building the system. The tools are not part of the system itself; they are not delivered to an end user. Their purpose is to help the team do their job. Each project will have its own needs for tools, so this list is meant to inspire ideas rather than prescribe what may be needed for building any specific system. There are, however, some common principles for selecting and managing tools.

This chapter brings together information about many different kinds of tools, with references to the other parts of this volume that discuss details.

Please note: I do not recommend specific tools.

16.2 General considerations

A few general principles apply to selecting tools.

First, most tools will be used for shared work. Tools should be evaluated on how well they help the team work together. Computer-based tools that manipulate shared data, such as CAD tools, should make it easy for multiple people to access the information concurrently. They should support the project’s approach to versioning information. Physical tools should be accessible to those who need to use them. This is especially important to consider if people work in multiple physical locations.

Second, many tools require training to be used effectively and safely. The project must ensure that each person has been trained to use a tool safely before they are allowed to use it. That implies that tools should be evaluated on the quality of educational material available on how to use them.

Third, good tools are integrated so that they work together. Tools that can share information can provide greater value to the team than ones that cannot.

Next, tools should support the general life cycle and procedures the project uses. They should fit into the project’s procedures for managing artifacts, versioning them, and reviewing them.

Finally, tools should be secure. Good tools will support the project’s overall approach to security, including controlling access to information based on a person’s role in the project. This includes both electronic and physical security.

16.3 Kinds of tools

This section provides an overview of all the kinds of tools discussed elsewhere in this volume, with references to the sections that provide details. The overview can serve as a checklist for a team working out what tools they need.

16.3.1 Storing and managing artifacts

The tools for storing and managing artifacts are discussed in Section 15.4.4.

Electronic artifacts. Alternatives include:

Hardware artifacts. These can use:

16.3.2 Specification tools

As I will discuss in Part VI, the team will develop specifications for system components. A specification defines a component’s external interfaces—in systems terms, how the component is part of functional and non-functional relationships (Section 11.2).

There are several kinds of specifications (Section 23.4), including requirements, interface definitions, and models.

Requirements (Chapter 24). Requirements provide textual statements of things that are to be true about a system or component. Requirements can be managed using:

I list a number of considerations for selecting requirements management tools in Section 24.13.

Interface definitions. Interface definitions specify how one component can interact with others. These can be written using:

Models. Mechanical, mathematical, electronic, behavioral, and other kinds of models are used as specifications. Relevant tools include:

16.3.3 Design tools

A project’s design phase works out a set of designs for the system and its components that satisfy the corresponding specifications.

A design records the structure of each component—whether a high-level, composite component or a low-level component (Chapter 10). It also records analyses that lead to designs and rationales for how a design ended up as it did.

There are two kinds of design artifacts: the breakdown structure and the designs themselves. The model in Section 10.4 has six parts to a component design: form, state, actions or behaviors, interfaces, non-functional properties, and environment.

Breakdown structure (Chapter 26). I recommend that the component designs be organized by the component breakdown structure. This structure organizes the components into a hierarchical name space, giving each one a unique identifier and showing how one component is made out of others.

On most projects, I have used a spreadsheet to list all the components, the breakdown organization, and their names. This has worked well enough, and I am not aware of tools that explicitly support such organization.
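
As a minimal sketch of what such a spreadsheet encodes, the Python below treats each identifier as a dotted path in a hierarchical name space and checks the two properties described above: identifiers are unique, and every component other than the root sits inside another listed component. The dotted identifier scheme, the component names, and the function names are all assumptions made for illustration; a project may use a different naming convention.

```python
from typing import List, Optional, Tuple

# Hypothetical rows, as they might appear in a spreadsheet: (identifier, name).
COMPONENTS = [
    ("1", "Vehicle"),
    ("1.1", "Power subsystem"),
    ("1.1.1", "Battery pack"),
    ("1.1.2", "Power distribution unit"),
    ("1.2", "Avionics"),
]


def parent_of(identifier: str) -> Optional[str]:
    """Return the identifier of the enclosing component, or None for a root."""
    head, sep, _ = identifier.rpartition(".")
    return head if sep else None


def children_of(rows: List[Tuple[str, str]], identifier: str) -> List[str]:
    """List the identifiers of the components directly inside the given one."""
    return [i for i, _name in rows if parent_of(i) == identifier]


def check_breakdown(rows: List[Tuple[str, str]]) -> List[str]:
    """Return descriptions of duplicate identifiers and missing parents."""
    problems = []
    seen = set()
    for identifier, _name in rows:
        if identifier in seen:
            problems.append(f"duplicate identifier {identifier}")
        seen.add(identifier)
    for identifier in seen:
        parent = parent_of(identifier)
        if parent is not None and parent not in seen:
            problems.append(f"{identifier} has no parent {parent}")
    return sorted(problems)
```

A check like this can be run whenever the component list changes, so that gaps in the hierarchy are found before other artifacts start referring to the identifiers.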

Form. The form represents the aspects of a component that do not change, or that change only very slowly. The design of physical components is generally handled using CAD tools. These tools use notations or drawing standards appropriate to each subject.

State, actions, behaviors. This part of a design addresses the parts of a component that change readily.

Non-functional properties. These properties change slowly and are not part of the component’s form.

Environment. This is the environment in which the component is expected to operate, or in which it may be stored. This is usually recorded in text.

16.3.4 Analysis tools

These tools help the design process by providing feedback on how well a particular design works. They also are used when verifying a proposed design.

16.3.5 Build tools

These tools help translate designs or implementations into operable components that can be integrated into a running system, or used for testing.

The built artifacts will need to be stored and tracked, as discussed above.

Physical artifacts. The building of physical artifacts is, in effect, manufacturing one or a small number of those artifacts. These can be built in multiple ways.

In-house building will require maintaining a stock of the materials used in the components. This may include a stockroom of pre-acquired parts, such as metal or plastic stock and fasteners, or suppliers that can provide the needed material quickly.

The building process should be deterministic: if the team builds multiple instances of the same component, the components should all look and behave the same way. This places constraints on whatever tools and procedures are used to build the components.

Software artifacts. Software artifacts are built by translating source code into binary and packaging it in forms that can be installed on a target system.

The software build process must be repeatable: if the same software is built twice, the result should be identical in behavior (differing only in things like version numbers, timestamps, or affected signatures). This usually means that the software build tools should be under configuration management so that identical tools will be used each time.
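
One way to check that identical tools were used is to record a manifest of tool versions with each build and compare manifests between builds. The Python sketch below assumes such a manifest is a simple mapping from tool name to version string. The manifest format and the function names are hypothetical, and the sketch does not capture everything that can affect a build, such as environment variables or library versions.

```python
import hashlib
import json
from typing import Dict, List


def manifest_digest(tool_versions: Dict[str, str]) -> str:
    """Hash a toolchain manifest in canonical (sorted-key) form.

    Two builds that record the same tools at the same versions produce
    the same digest, regardless of the order in which tools were listed.
    """
    canonical = json.dumps(tool_versions, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def toolchain_differences(a: Dict[str, str], b: Dict[str, str]) -> List[str]:
    """List the tools whose recorded versions differ between two builds."""
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

Storing the digest alongside each built artifact gives a quick way to tell whether two artifacts came from the same toolchain, and the list of differences shows where to look when they did not.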

16.3.6 Testing tools

Testing involves taking a component, or collection of components, and subjecting it to some sequence of activities to verify that the component behaves as specified.

Testing occurs at two different times during system development: as people are building parts of the system and when a component or the system is being verified for final acceptance. These two uses lead to somewhat different needs in the tools for testing.

Tests need to be accurately reproducible: someone should be able to run a test one time on one component, then run the same test later on the same component and get the same result. Of course some component behaviors are not fully deterministic but, accounting for that, a passing test should reliably mean that the component really does meet the specification being tested. If a test fails, people need to be able to reproduce what happened to understand the flaw and to determine whether a fix works.

Reproducibility places constraints on testing tools. Physical tests will need to be done in consistent environments, using control and measurement tools that can be calibrated to ensure they are behaving consistently. Software tests similarly need to be run in controlled environments.
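
For software, one common source of irreproducibility is randomly generated test input. As a small sketch of a controlled test, the Python below drives a test from an explicit seed so that a failing run can be repeated exactly. The component under test (`clamp`) and the test function are hypothetical stand-ins chosen only to keep the example short.

```python
import random
from typing import List


def clamp(value: float, low: float, high: float) -> float:
    """Hypothetical component under test: limit a value to a range."""
    return max(low, min(high, value))


def run_randomized_test(seed: int, cases: int = 100) -> List[float]:
    """Run the same randomized test cases every time for a given seed.

    A private random.Random instance is used so that other code drawing
    random numbers cannot change the sequence of inputs.
    """
    rng = random.Random(seed)
    inputs = [rng.uniform(-10.0, 10.0) for _ in range(cases)]
    for value in inputs:
        result = clamp(value, -1.0, 1.0)
        # Report the seed and input so that a failure can be reproduced.
        assert -1.0 <= result <= 1.0, f"seed={seed} input={value}"
    return inputs
```

Recording the seed with the test result means that someone investigating a failure later can rerun exactly the same inputs, and can confirm that a fix works against them.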

Hardware testing. Testing hardware components can range from measurements of single components to integration tests of subsystems or even the complete system. The tools available vary widely, depending on the kind of testing being done.

All hardware testing will involve:

Tools that support testing electronic components can include:

Tools for testing mechanical components include:

Integrated system testing can go well beyond the tools listed here. Flight testing a new aircraft, for example, is far more complex than suggested by these tools. I leave the design and operation of such testing to others better versed in it.

Software testing. Software testing generally involves setting up a number of test cases or scenarios, running the software being tested, and recording the results. There are many different tools that can be used, and these depend on the kind of test being performed and the environment or language being used.

Categories of tools include:

16.3.7 Operations tools

The team uses other kinds of information to manage its operations: information about the team, its procedures, and its plans, and information to support decision-making.

Team information (Section 15.3.2). This information is organized around the roster of who is on the team, along with their roles and authority.

This information links to other tools, some of which are often outside a project’s scope. These include:

These relationships get updated whenever someone joins the team, leaves the team, or their role changes. Using tools that guide people through the procedures for these updates will make the changes more accurate.

Life cycle and procedures (Chapter 18). Teams follow a project life cycle and procedures to do their work. These consist of steps that people should follow to get specific tasks done.

Workflow management tools exist to help guide people through these procedures. These tools can help by:

Plans and tasking (Section 18.5 and Section 18.6). The project maintains plans for how the system-building work will move forward and the work currently in progress. The plan records the work that the project will be doing, at varying levels of confidence and detail, while tasking tracks the specific work that people have been assigned. This information is used both to make sure that the team do the work that is needed, without important tasks getting forgotten, and for forecasting the time and resources needed to move forward.

Maintaining plans and tasking is an exercise in managing a lot of detail. Many tools are available to help with these.

In practice, many of the tools available have been designed for projects other than systems-building, and do not support systems projects well. Many project scheduling tools are based on methodologies worked out for predictable work like building construction, where the tasks can be known fairly accurately in advance. These tools often are organized around a Gantt chart of the work, prompting their users to estimate duration and make task assignments early in the project. This works poorly in systems projects that have significant uncertainty early in the project, and where the degree of certainty (or predictability) improves unevenly as time goes by. This often results in a false sense of confidence in the project’s schedule early on, and requires a lot of effort to try to keep the schedule adapted as work moves forward.

It is worth spending effort working out how a project will manage its planning and tasking, and ensure that any tools chosen will support that approach.

Support. Project operations maintains other kinds of information as well, for which tools are sometimes available. These include:

16.4 Managing tools

Good tools can enhance a team’s performance. Poorly chosen or implemented tools can harm it. One must choose tools carefully and apply thought to how they are implemented and used.

A project’s tools are themselves systems, and should be treated with the same care as the system being built for a customer.

Each tool should have a purpose. Spending the time to work out who will benefit from a particular tool, both directly and indirectly, can provide useful guidance when choosing between options for that tool.

The engineering support tool industry has generated many products that can be used, meaning there are often many possibilities to choose from. While sometimes the team can cut a decision process short because they already have experience with one particular tool, in other cases it is worth setting out some criteria for making the choice.

Factors that can influence the choice of tool include:

Once a tool has been chosen, it will need to be purchased or built, and deployed for the team to use. This usually requires finding space for the tool, whether that is physical space in a lab or capacity on a compute server. The acquired tool will need to be deployed and integrated into the project’s systems: adding information about the tool to an inventory database, setting up a service schedule if needed, integrating software systems with the project’s security mechanisms.

Team members will need to learn how to use new tools. For some tools, this can amount to providing a written introduction or presentation on how the tool works. More complex tools will require more formal training. If there are safety or security risks in using the tool, the project should ensure that people are required to receive training before using the tool. It is common to track formally which people have gotten this kind of safety training.
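
Formal tracking of safety training can be as simple as a table of who has been trained on which tool. The Python sketch below assumes each training record carries an expiry date and that a person may use a tool only while their record is current. The expiry rule, the names, and the record layout are assumptions for illustration, not a description of any particular project's procedure.

```python
from datetime import date
from typing import Dict, Tuple

# Hypothetical training records: (person, tool) -> date the training expires.
TRAINING: Dict[Tuple[str, str], date] = {
    ("avery", "laser cutter"): date(2025, 6, 30),
    ("blake", "laser cutter"): date(2023, 1, 15),
}


def may_use(person: str, tool: str, today: date,
            records: Dict[Tuple[str, str], date] = TRAINING) -> bool:
    """True only if the person has an unexpired training record for the tool."""
    expires = records.get((person, tool))
    return expires is not None and today <= expires
```

With these records, on 25 May 2024 `may_use("avery", "laser cutter", date(2024, 5, 25))` is true, while the same check for `"blake"` (expired) or for someone with no record is false.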

Chapter 17: Teams

29 March 2024

Building a complex system involves a team of people to do the work. The people in the team will fill many different roles: developers, managers, customer and regulatory interfaces, support staff, among others.

In this chapter I discuss the issues to be addressed when deciding how a team should be organized, including its structure, roles, and communication.

17.1 Purpose

When many hands do the work, the team needs to be organized so that the work is coordinated. Each person needs to know their responsibilities, and how to find the other people they may need to interact with. In general the team needs clarity about each person’s responsibilities, about communications within the team, and about who is on the team. (See Section A.3.2 for details.)

The work must be coordinated so that different pieces of work are compatible, that all the pieces of the system are built, and that the work is done efficiently. For this to happen, people on the team will need to communicate with each other, and that means they need to know who they should be talking with. They need to know who is responsible for which pieces, so that they do not duplicate work. And when something goes wrong, they need to know who to work with to find a solution.

The ability to share work is key to a project being able to scale up to build a complex system. The people on the team need to be able to trust that others will follow the same rules they do: that they will share important information, that they will consult when needed, that they will limit their decisions to their scope of authority. When a team has this kind of trust, each person can do their portion of the work with confidence that the others are doing their own parts as well.

Sidebar: Delegation and micromanagement

Projects involving many people require sharing work. If someone doesn’t share work, then they will be overwhelmed, will take too long to get work done, and will be a single point of failure in the project.

Delegating or sharing work implies a dynamic between the two people involved. Person A, who delegates the work, defines the work that Person B, the delegatee, is to do. Person B does the work and periodically gives progress updates. Once the work is delegated, Person B can proceed independently and Person A can turn their attention to other things.

One way this can go wrong is if Person A doesn’t let Person B get on with the work independently, and instead tries to micromanage the work. Learning the habit of managing loosely takes time and effort, and it requires trust between the two people involved. That trust in turn depends on Person A having confidence that Person B will follow shared norms in doing the work.

Another way this can go wrong is if Person B isn’t able to complete the work independently. If Person B finds a problem with the work, such as a design error, that is beyond their scope, they can raise the issue to Person A and jointly resolve the problem. If Person B is unable to do the work, perhaps because they don’t understand the problem or find they lack a necessary skill, they can raise the issue and jointly handle the problem. If Person B tries to muddle through, however, they stand a good chance of not doing the work needed, leading to Person A needing to check their work in detail and possibly redo the work.

In other words, sharing work requires having clear expectations of how to define delegated work and when to raise exceptions.

17.2 Directory

Two of the first things people on the team need to know are their own role and who else is on the team. Once they have that information, they can communicate with others to learn other things they need to know.

Consider the following scenarios.

  1. Person A is working on some component. That component has an interface with another component, and so person A needs to coordinate how they implement their part of that interface with someone working on the other component.
  2. Person B has finished a design for an update to a component. Project procedures say that they need to have the design reviewed and approved before moving on to implementing the design. Person B needs to find out who the reviewers and approver will be.
  3. Person C discovers an ambiguity in the specification for a component, and they are concerned that this ambiguity may lead to a flaw in the designs that follow from the specification. Person C needs to find the people responsible for the specification so they can discuss the potential problem and find a resolution to the ambiguity.

For all these scenarios, the people need to determine who on the team is responsible for some part of the system beyond what they are working on themselves.

To meet this need, the project should maintain some kind of directory of people on the team. This should record:

This information is generally fairly simple, but it must be kept current. If people come to believe that the directory is likely out of date they will not trust it.
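
To show how simple this information can be, here is a minimal Python sketch of a directory that answers the questions in the scenarios above: who is responsible for a given component, and who fills a given role. The fields (name, contact, roles, responsible components) and all the names are assumptions for illustration; a real directory would record whatever the project has decided it needs.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set


@dataclass
class Person:
    name: str
    contact: str
    roles: Set[str] = field(default_factory=set)
    # Identifiers of the components this person is responsible for.
    responsible_for: Set[str] = field(default_factory=set)


class Directory:
    """A hypothetical team directory supporting two simple lookups."""

    def __init__(self, people: Iterable[Person]):
        self.people = list(people)

    def responsible(self, component: str) -> List[str]:
        """Names of the people responsible for a component."""
        return sorted(p.name for p in self.people if component in p.responsible_for)

    def with_role(self, role: str) -> List[str]:
        """Names of the people who fill a role, such as design reviewer."""
        return sorted(p.name for p in self.people if role in p.roles)
```

For example, Person B in the second scenario could call `with_role("design reviewer")` to find reviewers, and Person C in the third could call `responsible(...)` with the identifier of the specification in question.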

17.3 Structure

A team of more than perhaps three or four is not an amorphous blob of anonymous people; it is organized so that each person has their own specialized roles, and authority is not duplicated. The team’s structure is the pattern of this organization.

The structure may arise spontaneously or deliberately, but teams that are able to deliver complex systems will have some degree of organization. A team grows because it makes use of specialized skills or because it has work that should be done in parallel. In both cases, avoiding duplicated work matters. If one person has a needed specialization, then they should be the one doing the work that uses that skill. If the project needs parallelism, doing duplicate work fails to meet that need. Further, duplicating work often results in team members who believe their work is not valued, when that work duplicates something another person is doing.

At the same time, complex systems require people to collaborate. Sometimes one component will need people with multiple skills to get it designed and built.

A few aspects of team structure address part of these dynamics of collaboration and avoiding duplication: how people are grouped into sub-teams, and how decision authority—for both technical and management decisions—is distributed in the team.

Groups of people within the team will work closely together when they are working on closely-related components, or on different aspects of one component. Sometimes these groups form ad hoc, when the people involved find that their work is interdependent. Other times this grouping is worth establishing formally and maintaining for an extended time. This might happen when all the people working in one discipline, such as contract management or electronics design, are organized so that they share skills across the work for different components. This might also happen when the people who are working on one high-level component (that is, subsystem) work together to maintain the consistency of all the pieces that make up the high-level component. Note that one person might be part of more than one group.

Each group should have some reason to exist. People on the project should know how the groups are organized and the purposes for each. Generally speaking, each technical part of the system should be associated with exactly one group. People in the project should know who to talk to about any part of the system, and they should not get conflicting information about who to talk to or how some part of the system works.

Technical decision authority defines who has the final responsibility for ensuring that the design and implementation of some part of the system meets its specifications and objectives, including safety and quality of work specifications. The person who has that responsibility must have the corresponding authority to approve the design or implementation, based on verification checks. The verification checks should provide an independent view of the work that will catch errors that the designer or implementer could not see because they were directly involved in the work. The approver may delegate some or all of the reviewing and decision, but in the end one person must be responsible. (See sidebar below.)

There are several ways that technical decision authority can be assigned, as I discuss elsewhere in this volume.

Management decision authority defines who makes decisions about work assignments and about resolving conflicts within the team. While this is largely a matter of project management, the design of the team’s structure affects how people will resolve management issues. This can include making decisions about who will be part of which sub-teams, and about staffing in general. Perhaps most importantly, the people with management decision authority also have responsibilities when conflicts or problems are reported.

Both technical and management authority are generally hierarchical. For most projects, there is one person or small group of people who have overall responsibility for the entire project. The authority that others have derives from delegation from this top-level authority. A project can choose different kinds of hierarchy, with deep or shallow chains of authority.

There are many ways projects can organize their teams; I discuss some of these ways elsewhere in this volume. Whatever approach one chooses for a project, that approach should be evaluated against these needs.

Sidebar: Team structure and system structure

It is generally understood that the structure of a system is homomorphic to the structure of the organization that is building the system [Conway68]. This means that people must work to ensure that the structure of the team and the structure of the system are compatible, possibly by organizing the team around the system structure when possible. Doing so requires having an understanding of what the system structure is, and the hierarchical component breakdown provides part of that understanding. In the other direction, the team’s organization will inevitably bias how the system is organized and built; being aware of the two organizations helps one to see unhelpful bias reflected in the system organization.

17.4 Communication

Following the model in Chapter 11, the component parts of the system are interconnected. When one person works on a component that has a relationship with another component, they need to ensure that the related components have compatible designs and implementations. Doing so means that the people working on each component need to communicate with each other.

People communicate when they want or need to. Creating an environment and procedures that help them realize when communication is needed is part of the art of organizing a team.

To design the procedures and team structure, one needs answers to two questions: when do people need to communicate, and with whom should they communicate?

I identified some scenarios above for events that create a need for people to communicate. There are, of course, a great many other cases, but these give an idea of the breadth of events that define when people will need to communicate.

There are four general times when people will need to communicate:

  1. When they are looking for information that another person may have. For example, when someone finds they need to know how some component is going to behave.
  2. When they have information that will affect someone else’s work. For example, when one person decides on a component design, and that component interacts with another component.
  3. When they need a decision or action. For example, when someone has completed a proposed design and procedures indicate that the design should be reviewed and approved before moving to implementation, or when someone has a team problem that needs to be resolved at a higher level.
  4. When a decision or action has results. For example, when reviews are done, or when action is being taken on a team problem.

Some of these times can be encoded in procedures that the team will be following. Many others will occur in the moment, when someone realizes they need to know something or need to ensure that someone else knows something.

When a person finds that they do have a need to communicate, they then need to figure out who to communicate with. If they are looking for information about a part of the system, they should be able to use directory information (Section 17.2 above) to determine who should know about that part. If they need to push out technical information that affects other parts of the system, they can use the functional relations in the model (Section 11.2) to determine the affected parts, then use team directory information to find out who to talk with for each of those parts. If someone is asking for an action to be done, procedures can indicate the responsible role, and team information will direct them to the people filling that role.
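
The second case, pushing out information that affects other parts of the system, can be sketched as a two-step lookup: follow the relations from the changed component to the components it touches, then look up who is responsible for each. The Python below is a minimal illustration that considers only direct relations. The component names, the relation set, and the owner table are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical functional relations between components (unordered pairs).
RELATIONS: Set[Tuple[str, str]] = {("pump", "controller"), ("controller", "display")}

# Hypothetical directory information: component -> responsible people.
OWNERS: Dict[str, List[str]] = {
    "pump": ["Avery"],
    "controller": ["Blake"],
    "display": ["Casey", "Blake"],
}


def affected_components(component: str, relations: Set[Tuple[str, str]]) -> Set[str]:
    """Components that have a direct relation with the given component."""
    result = set()
    for a, b in relations:
        if a == component:
            result.add(b)
        elif b == component:
            result.add(a)
    return result


def people_to_inform(component: str, relations: Set[Tuple[str, str]],
                     owners: Dict[str, List[str]]) -> List[str]:
    """People responsible for affected components, excluding those who are
    already responsible for the changed component itself."""
    people: Set[str] = set()
    for other in affected_components(component, relations):
        people.update(owners.get(other, []))
    people.difference_update(owners.get(component, []))
    return sorted(people)
```

With the data above, a change to the controller leads to Avery and Casey; Blake is left out only because Blake is already responsible for the controller.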

17.5 Team organization and size

A team’s organization generally starts small and informal, as a very small group starting to investigate a customer’s need or a potential system project. As the project moves forward, the team grows and its needs for structure change. The team also changes as people join and leave, and as people move from role to role.

I have found that most teams go through phases as they grow—rather than showing smooth changes over time. These changes arise from the combination of complexity growth, development of group relationships, and the growth in understanding of the work ahead.

Small groups (of just a few people) have been observed to go through a development sequence [Tuckman65][Tuckman77]. These small groups begin as the group forms, and the people work out how they should relate to each other and how to get work done. As time goes by they develop into a cohesive group that gets work done and where people trust each other. (The studies do not discuss how this process can fail, leading to a group that does not cohere or disbands.)

The interpersonal complexity of a team grows with the size of the team. The number of potential connections between team members is O(n²) in the size of the team. In my anecdotal experience, the amount of time spent on coordinating work within the team grows in line with the number of connections. If there is no structure to the team, at some point the amount of time and effort spent on communication will exceed the amount spent doing work building the system.
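
The quadratic growth can be made concrete by counting pairs: a team of n people has n(n-1)/2 potential connections. The Python sketch below computes that count, and compares it with a deliberately simplified structured team in which people connect only within their sub-team and one lead per sub-team connects with the other leads. The sub-team model is an assumption for illustration; real teams have more cross-connections than this.

```python
from typing import List


def potential_connections(n: int) -> int:
    """Number of distinct pairs among n people: n(n-1)/2."""
    return n * (n - 1) // 2


def connections_with_subteams(team_sizes: List[int]) -> int:
    """Connections when people connect only within their own sub-team and
    one lead per sub-team connects with the leads of the other sub-teams."""
    within = sum(potential_connections(size) for size in team_sizes)
    between_leads = potential_connections(len(team_sizes))
    return within + between_leads
```

Under these assumptions, an unstructured team of 40 has 780 potential connections, while the same 40 people in four sub-teams of 10 have 4 × 45 + 6 = 186.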

When a project starts, the nature of the system to be built is not well understood. The team has to go through a process of working out the purpose of the system, developing concepts, and eventually beginning to design. Along the way, the team gets increasing understanding of the work ahead.

In practice, the combination of these causes leads the team to change its organization over time. At the beginning, the initial exploration of what the project might be (working out a purpose and finding some initial concepts) is typically done by a small group. This small group will go through a process of learning to work together, but typically the group can self-organize and does not need hierarchy for much. As the work progresses and a few more people join the project, they will initially try to fit into the self-organized small group. These additions will alter the interpersonal relationships, but at some point the complexity of using consensus will necessitate creating some initial structure. The team will settle into this structure. But as the team continues to grow, it will initially accommodate people into the structure but eventually reach another point where more structure is needed to manage complexity.

The message is that a project should expect its team organization to change over time. Almost every project I have been part of has been resistant to addressing a need for changing team structure, and has put off dealing with it until a crisis occurs. In every case this cost the organization time and money, needlessly setting back the project. A project’s leadership should be alert to the need to periodically reorganize the team so that this can be done before it causes problems.

Sidebar: Unitary decision authority

I worked on two projects that had problems building their systems because someone on the team got conflicting instructions on the objectives for some component they were supposed to be building.

In one case, a software developer was tasked with implementing a particular CPU scheduling algorithm in a real-time operating system kernel. This scheduling algorithm had been chosen in order to make certain system safety properties work, and to enable some high-level control features. The developer in question did not understand the assignment, and reached out to someone else—someone not authorized to make decisions about the CPU scheduling algorithm. The developer got advice from the other source and implemented a different scheduling algorithm. The other algorithm could not provide basic safety and control features the system needed. As this project was being executed on a cost-plus contract, the developer’s organization had to pay for someone to remove the work the developer had done and implement the correct algorithm.

In another case, one senior system architect (systems engineer) was responsible for a particular feature set of the system. The system architect was working with a pair of developers to work out a design for those features. A second senior system architect, who was not responsible for that part of the system, was having a conversation with the developers and instructed them to design the features in a particular way. This conflict in instructions to the developers led to confusion that took several days to detect and resolve.

Problems like these are instances of a common design flaw pattern: conflicting control. This is a common source of accidents in control systems [Leveson11, Section 4.5.3], and it applies just as much to the system of building a system.

The techniques for addressing a potential system hazard apply to the conflicting authority: first try to eliminate the conditions that can lead to a hazard, then make it unlikely to happen, reduce the likelihood of it causing a problem, and then try to limit the damage when it does happen.

The first line of defense is thus to organize the project so that conflicting decisions and authority do not occur, or make it unlikely. This is most easily done by having for each part of the system exactly one person authorized to make decisions, and making that information clearly available to everyone on the team. Note that this does not mean that only one person is allowed to design; rather, it means that one person has responsibility for the design. The responsible person can and should delegate the design effort as much as possible to the people actually doing the work, and the responsible person should focus on setting objectives for the design, guiding the design, and checking that the results are acceptable.
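
The rule of exactly one decision authority per part lends itself to a mechanical check. As a minimal sketch, the Python below takes a list of system parts and a list of (part, person) authority assignments, and reports any part that has no authority or more than one. The part names, people, and function name are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def authority_problems(parts: Iterable[str],
                       assignments: List[Tuple[str, str]]) -> Dict[str, str]:
    """Report parts that do not have exactly one decision authority.

    assignments is a list of (part, person) pairs. Parts that appear only
    in assignments are checked as well.
    """
    by_part: Dict[str, Set[str]] = {part: set() for part in parts}
    for part, person in assignments:
        by_part.setdefault(part, set()).add(person)

    problems = {}
    for part, people in by_part.items():
        if not people:
            problems[part] = "no decision authority"
        elif len(people) > 1:
            problems[part] = "conflicting authorities: " + ", ".join(sorted(people))
    return problems
```

Running a check like this against the team directory whenever roles change is one way to make the information about authority both correct and visible to everyone on the team.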

Theoretically, a team can avoid conflicting decisions or directions by having a few people operating in a way where they reach consensus before making decisions. In practice consensus algorithms work well enough for computer systems but people find it hard to work that way: communication happens informally, people are in a hurry, or someone has a good idea they get enthusiastic about and don’t wait to share it with others for agreement first.

The second line of defense is to have regular review points in the project when discrepancies can be caught.

Chapter 18: Operations

11 February 2024

Operations covers how people on the team organize the work of building the system.

I introduced the basic ideas of operations in Section 6.3.5. I model operations as five parts: life cycle, procedures, plan, tasking, and support. In this chapter I go into more detail about each of these parts. The material in this chapter is based in part on the needs analysis reported in Appendix A.

This chapter details the model for operations in general, without recommending specific solutions.

This chapter is focused on the operations directly involved in building the system. This is a subset of the larger matter of organizational operations.

18.1 Purpose

Operations is about organizing work, in the form of tasks. It is complementary to team and artifacts, which I discussed in previous chapters. Operations ensures that people know what tasks they should be doing, similar to knowing what they should be producing (artifacts) and who they do it with (team).

I leave “task” largely undefined, relying on its colloquial meaning. It should be taken to mean some unit of work to be completed.

Operations should organize the work so that:

  1. The right tasks are done at the right time by the right person.
  2. Everyone does their work in compatible ways.
  3. The work is done efficiently.
  4. The work is of high quality.
  5. The work meets deadlines and budgets.
  6. The work adapts as needs change.
  7. The project supports its customer and funder.

Each project will work out its own approach to operations. The list above provides objectives against which an approach can be measured.

18.2 Operation model

The operations model in Section 6.3.5 has five parts:

undisplayed image

The life cycle is the overall pattern of how the project works, with phases and milestones.

Procedures are the checklists or recipes for performing key tasks.

The plan is an evolving understanding of the path forward for the project.

Tasking is the assignment of tasks to people, and figuring out what tasks each person should do next.

Support maintains tools and information needed to do the other parts.

These are ordered by how quickly they change and how often decisions about them are made. The life cycle is established early in the project and changes slowly after that. Procedures change a bit more frequently, but not often. The plan is updated on a regular cadence, while tasking is continuous.

18.2.1 Making the model work

Some people will look at the life cycle and procedure parts of this operations model and say that it is “process,” a term that has acquired a negative connotation. Yes, the life cycle and procedures do define processes that are supposed to guide the team. Process, when done well, helps a team work more effectively and more happily. Done well, process is simple: it is a guide for how to do common sequences of events, or tasks that must be done a certain way. It provides a checklist to make sure things get done and aren’t missed. It encodes checks to make sure technical work is done correctly.

I have outlined the advantages that life cycle patterns and procedures can bring to operations in Section 18.1 above.

In my experience, the potential disadvantages, and the reasons people have come to dislike the idea of process, arise from three misuses of operations: making it too heavy, making it too complicated, and defining something the team is unwilling to use.

As an example, a colleague told me about a project they had worked on where getting approval to order a fairly simple part (for example, a cable) took multiple approvals and potentially weeks to complete (heavy process). Indeed, nobody was even sure exactly how to go through the process to get an approval to get the part ordered (complicated process). The processes were, presumably, put in place to ensure that only parts of sufficient quality were used and to manage the spending on parts acquisition. In practice the amount of money spent on people’s time far outweighed potential cost savings, and the amount of work required for people to review an order over and over meant that the reviewers did not have the time needed to perform meaningful quality checks.

A “heavy” life cycle or procedure is one that takes more effort or more time than is warranted for the value it provides. This works against the objective of doing work efficiently. Each part of a life cycle pattern or procedure should have a clear reason for being included. The effort and time involved should be compared to that reason, and the procedure or pattern should be redesigned if the comparison shows it is too heavy. To avoid this, each procedure and life cycle pattern should be scrutinized to eliminate any steps that are not actually needed.[1]

A complicated life cycle or procedure is one that involves many steps, often with complex conditions that have to be met before some step can proceed. In the example from my colleague, nobody on the team could figure out all the steps that needed to be done. This can be avoided by, first, ensuring that procedures are as simple as possible, and second, by documenting them and making that documentation easy for people on the team to find and understand.

Teams are generally willing to follow procedures, as long as a) they know what the procedures are; b) they understand the value of following them; and c) following procedures has been made a part of the team’s norm. This means that the life cycle patterns and procedures should be documented, and their purpose or objectives should be spelled out. Normalizing following the procedures, however, is not something that can be accomplished by just writing something down. This has to be practiced by the team from the start, with leadership setting examples. Involving the team in setting up the life cycle patterns and procedures can help people understand and buy into the process.

Bear in mind that when a project adopts a particular life cycle pattern, the project is making an implicit commitment about staffing. If the pattern indicates that certain reviews must happen before key events happen, like ordering an expensive piece of equipment or beginning a complex implementation effort, then the project must ensure that there are enough people with enough time to perform those reviews. If the project does not staff enough, people on the team will quickly learn the (correct) message that the project or its organization does not actually care about the reviews and will begin to work around the pattern.

How all of this is handled for a particular team depends a lot on the team’s size. It’s common for a project to start with simple life cycle and planning when it is small and the project is uncertain. The project will need to shift strategies at times as the team grows, as the work becomes more complex and interconnected.

For some projects, the life cycle will be determined by an external standard. NASA defines a family of life cycles for all its projects [NPR7120]. This flow is designed to match the key decision points where the project is either given funding and permission to continue, or the project is stopped. It defines a sequence of phases A through F, with phases A-C covering development, D covering integration and launch, E covering operations, and F covering mission close out. Specific kinds of projects or missions have tailored versions of this overall life cycle.

Many companies have similar in-house project life cycle standards that revolve around decision points for approving the project for development and ensuring a product is ready for commercial release.

18.3 Life cycle

A project’s life cycle is the set of general patterns of how work unfolds. They encode a few principles that the project has decided on. I introduce the idea of life cycle here, and discuss specific examples and guidelines for building a life cycle pattern in Chapter 19.

The life cycle patterns help team members understand how the work they are doing fits with other work. They provide guidance on what team members should expect from work that others are doing that will lead into work they will do. They help people work out who is doing work related to their own, and who to talk to about that work. They help people understand what steps will be coming up after they perform one step.

The life cycle is not a schedule. It is only a set of patterns, and it should guide the team as they work out a plan and schedule tasks that achieve that plan.

The life cycle is not much connected to the specific system being built. A life cycle pattern can be more or less well suited to a project depending on attributes of the system being built—most especially how often there are irrevocable or expensive-to-reverse decisions.

The pattern generally consists of:

undisplayed image

For example, a simple life cycle pattern might say that the project must start with a phase where it works out and documents the customer’s purpose for the system before proceeding on to other work. That purpose-determining phase would conclude with a milestone where the customer reviews and agrees on the team’s purpose documentation. The next phase would involve developing a general concept for the system. This phase would include review milestones, checking that the concept will meet the customer’s purpose and that it can likely meet the organization’s business objectives. After those reviews, there might be a milestone where the organization makes a go/no-go decision about whether to proceed with the project.

The life cycle model is general. It is not meant to provide a diagramming model or formal semantics; rather, it is a technique for working out how the project will order its work. It has evolved from a combination of examining several different life cycle standards, observing how teams use Gantt charts for scheduling, and following the common practice of sketching things out on a whiteboard.

The life cycle model is connected to the development methodology that a project chooses to follow. A project that uses an agile-style or spiral development methodology will use different patterns for some development steps than a project following a waterfall methodology will use. I will discuss this further in Section 18.5 below.

There are two ways that one can view life cycle patterns. The first way is as a path to be followed: one must go here, then here, then here. The other is as a way to measure progress. Being in some phase means certain things are believed done, while other things are in progress and yet others will be done later. These two views are compatible, and it is useful to use both viewpoints.

The difference between the two comes when dealing with changes. If the work on some component is in phase X, what happens when an error is found in work from an earlier phase? Or when a request for a change in behavior arrives? And what if one chooses to build a component in multiple steps, creating a simple version first then adding capabilities over time?

This is where viewing the pattern as a measure of progress is helpful. Consider a component that is to go through specification, design, implementation, and verification phases. When the work is in implementation, the implication is that specifications and designs are complete and correct. If someone then finds a design problem, the expectation that design is complete is no longer true. This situation leads to those tasks needed to make true again the condition that the design is correct. Put another way, this “rewinds” the status of the work on that component into the design phase. People will then do the tasks needed to advance back to the implementation phase by correcting the design and performing a review of at least the design changes.
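The idea of a phase as a statement about what is believed complete, and of "rewinding" when that belief turns out to be false, can be sketched as a small state tracker. This is a minimal illustration under assumed names: the `ComponentStatus` class, its methods, and the example component are hypothetical, and the four phases are simply the ones used in the example above.

```python
PHASES = ["specification", "design", "implementation", "verification"]

class ComponentStatus:
    """Tracks which phase the work on one component is in (illustrative only)."""

    def __init__(self, name):
        self.name = name
        self.phase = PHASES[0]

    def advance(self, review_passed):
        """Move to the next phase only after the current phase's review passes."""
        index = PHASES.index(self.phase)
        if review_passed and index + 1 < len(PHASES):
            self.phase = PHASES[index + 1]
        return self.phase

    def rewind(self, flawed_phase):
        """A flaw in earlier work means that phase is no longer believed complete."""
        if PHASES.index(flawed_phase) < PHASES.index(self.phase):
            self.phase = flawed_phase
        return self.phase

status = ComponentStatus("valve controller")
status.advance(review_passed=True)  # specification reviewed
status.advance(review_passed=True)  # design reviewed
print(status.phase)                 # work is now in implementation
status.rewind("design")             # a design problem is found
print(status.phase)                 # status is back in design
status.advance(review_passed=True)  # corrected design reviewed
print(status.phase)                 # implementation resumes
```

The tracker records only the status of the work. It says nothing about which tasks continue during a rewind, which remains a judgment call for the team.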

In addition, work does not actually happen perfectly linearly. While someone is working on the specifications for a component, they may well be thinking of design approaches. In the example above of rewinding to design when a flaw is discovered, some implementation work still exists while the design is reworked. Part of the implementation might continue if there is someone to do it and parts of the implementation are unlikely to be affected by the redesign.

A life cycle pattern can be coarse-grained or fine-grained. A coarse-grained pattern would have phases that apply to the whole project at once, and take weeks or months to complete. The NASA family of life cycles [NPR7120] is coarse-grained: it is organized around major mission events like approval to move from concept to design, or from fabrication to launch. Fine-grained patterns might apply to a single component at a time, such as a component being first specified, then designed, then implemented, then verified, as a sequence of four phases with review checkpoints at the transition between phases.

Some life cycle patterns have phases that can overlap and repeat. Consider a fine-grained life cycle pattern that applies to each component. The general pattern might be:

undisplayed image

A project might apply this pattern to each component being built. When multiple components are being developed in parallel, multiple instances of this pattern will be proceeding at the same time, and different components may be at different points in their cycle.

undisplayed image

Finally, the project’s life cycle patterns do not necessarily imply one-way linear progress. A project or the work on one part of the system can potentially move through a phase, progress to another, and later rewind back to the earlier phase.

Consider the situation mentioned earlier, where a design flaw is found or a feature change request arrives during implementation. The dashed line in the following diagram shows how work proceeds on this component. It proceeds through specification and design into implementation, with accompanying reviews ensuring that both the specification and design are complete. During implementation, the need for a design change arises, and work reverts back to the design phase. Once the redesign is done and reviewed, work goes back to proceeding with implementation.

undisplayed image

A project should clearly document the life cycle patterns it will use and make them accessible to the whole team. While the patterns are used directly for planning, making them accessible to everyone ensures that everyone knows the rules to follow and reduces misunderstandings about what is acceptable to do.

18.4 Procedures

Procedures define how specifically to do actions or tasks defined in the life cycle. They often take the form of checklists or flow charts.

Procedures are related to the system being built, but are generally portable between similar projects.

Having clear procedures will:

Having common procedures for the whole team makes key work steps less matters of opinion and more based on shared fact. This can improve team effectiveness by removing a source of conflict between team members.

A project can realize these benefits only when the team members know what procedures have been defined, when they can find and understand the procedures, and when the team uses those procedures consistently.

Three steps help team members know what procedures are defined. First, the procedures should be defined in one place, with a way to browse the list of procedures as well as a way to find a specific procedure quickly. Second, the life cycle should indicate when one procedure or another is expected to be used. (For example, when the life cycle indicates that an artifact should be reviewed, it should reference the procedure for performing the review.) Finally, new team members should be shown how to find all this information for themselves.

Understanding and using procedures depends on the procedures being actionable: they should indicate the specific conditions where they apply, and provide a list of concrete steps for someone to perform. This is especially true for procedures that will be used when someone is under stress, such as in response to a safety or security accident. I have often seen “procedures” that say things like “contact the relevant people,” which is unhelpful. The procedure needs to list who the relevant people are (or at least their roles) so that a person in the middle of incident response can contact the correct people quickly.

18.5 Plan

The plan is a record of the current best understanding of the path forward for the project. It contains the foreseeable large steps involved in getting the system built and delivered, and getting it to external milestones along the way. It guides the work, as opposed to people working on tasks at random.

The plan:

Plans versus schedules. I differentiate between a plan and a schedule.

A schedule is a “plan that indicates the time and sequence of each operation”.[2] In practice, a schedule is treated as an accurate and precise forecast of the tasks that a project will perform. People treat the timing information it provides as firm dates, and will count on things being done by those dates. Schedules are often part of contractual agreements.

Because people outside the project use a schedule to plan their own activities, a schedule is hard to change.

Schedules are appropriate when the work to be done can be characterized with sufficient certainty. In most construction projects, for example, once the building design is complete, the site has been checked for geologic problems, and permits have been obtained, the remaining steps to actually construct the building are generally well understood and the time and effort involved can be estimated with confidence. However, before the site has been inspected one might not be able to create an accurate schedule because there could be undiscovered problems in the ground (perhaps an unmapped spring or an unstable mud layer).

The plan, on the other hand, is not a detailed schedule. It is a general indication of the steps to be taken, along with as much information about time required for different steps as can be estimated. It will reflect varying degrees of certainty about the steps and timing, from fairly certain in the near term to highly uncertain later in the work. It provides guidance, but it does not represent a promise of dates or exact sequencing of events.

A plan is dynamic and constantly changing, as it is a reflection of where the project currently stands.

At the beginning of a project that requires innovation, the team is just beginning to work out what the system will be, and so they cannot build a schedule because there are too many unknowns. As the project works out the customer needs and basic concept, the flow of work becomes a little clearer but most of the work ahead is still unknown. People will continue to learn more and more about the system, and at each step there will be fewer unknowns and the certainty of plans can improve. Even so, the exact schedule is not known until the very end of the project—when there are no places left that could hide surprises.

To some degree, the difference between a schedule and a plan is an attitude. A schedule is something people treat as a contract, and so it does not accommodate uncertainty well. A plan is a flexible current best estimate that doesn’t promise much except to accurately reflect what is known, and avoids information that might appear accurate but in fact is not certain. A schedule is useful to someone writing a contract to get something done. A plan is about an honest accounting of where the project stands and where it is going, and thus more useful to the people building the system.

Plan contents. A plan gathers four types of information:

  1. The set of work steps that can be foreseen to be needed.
  2. Milestones, both internally-defined and those imposed from the outside.
  3. Dependencies among the work steps, and between work steps and milestones.
  4. Estimates of uncertainty about all of these.

The chunks of work and milestones form a directed acyclic graph, with dependencies as edges between the work or milestones. The work can be annotated with estimates of resources required and time, to the degree those are known; it should not be annotated if the information cannot be estimated with reasonable confidence.
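As a rough illustration of a plan held as a graph, the sketch below stores each work step or milestone with the steps it depends on, checks that the dependencies contain no loops, and lists the steps that are ready to be worked. The step names, the estimates, and the `ready_steps` helper are hypothetical; `graphlib.TopologicalSorter` is part of the Python standard library (3.9 and later).

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each step or milestone maps to the steps it depends on.
plan = {
    "gather customer needs": set(),
    "purpose review (milestone)": {"gather customer needs"},
    "develop concept": {"purpose review (milestone)"},
    "concept review (milestone)": {"develop concept"},
    "implement system": {"concept review (milestone)"},
}

# Estimates are optional (low, high) guesses in weeks. A step with no entry
# is left unannotated because it cannot yet be estimated with confidence.
estimate_weeks = {
    "gather customer needs": (2, 4),
    "develop concept": (3, 8),
}

def ready_steps(plan, done):
    """Steps not yet done whose dependencies are all done."""
    return sorted(s for s, deps in plan.items() if s not in done and deps <= done)

# static_order() raises graphlib.CycleError if the dependencies form a loop.
order = list(TopologicalSorter(plan).static_order())
unestimated = [step for step in order if step not in estimate_weeks]

print(order)
print(ready_steps(plan, done={"gather customer needs"}))
print(unestimated)
```

Leaving "implement system" without an estimate is deliberate here: it records the uncertainty instead of hiding it behind a number.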

In addition, some projects will give each work step a priority or deadline. A step might be given a high priority, and so be scheduled early, in order to meet a deadline, to address uncertainty, or to account for a task that is expected to be lengthy.

There is no set format for recording a plan. I have used scheduling tools that use PERT charts and Gantt charts as user interfaces. I have used diagramming tools that help the user draw directed graphs. I have used graphs and time tables written on white boards. I have used tools meant for agile development, with task backlogs and upcoming iterations. All of these have had drawbacks—scheduling tools are not meant for constantly-changing plans; agile development tools are structured around that methodology; drawings on white boards and drawing tools are hard to update over time.

Making and updating the plan. The plan starts at the beginning of a project, and is continuously revised until the project ends.

Assembling an initial plan starts with knowing the status of the project and working out the destinations. At the beginning of a project, the status is that the project is largely undefined beyond a general notion of what customer problem the system may solve. The endpoint might be delivering a working system, or it might involve expecting to deliver a series of systems that grow over time.

undisplayed image
Initial plan for a new project.

If the project is already in progress, one starts on the plan by working out what is currently completed and in work.

undisplayed image
Example initial plan with milestones filled in.

The next step is to fill in major intermediate milestones and work steps. The project’s life cycle patterns should provide a guide to these. For a new project, the life cycle might indicate that the project should start with a phase to gather information about customer needs. As the first phases progress, the team will begin to develop a concept for the structure of the system. If the customer or funder has required some intermediate milestones, those can be laid in to the plan, along with very general work steps for getting ready for each of those milestones.

undisplayed image
Example life cycle pattern for the overall project.

It is normal for the plan to have large work steps that amount to saying that somehow the team will get something completed or designed or whatever. In the example above, “implement system” is completely uncertain when the project starts. When one does not know how part of the system will be designed, or how to implement some component, or even how some part of the work should proceed, it is better to put in a work step that accurately reflects the uncertainty. Being accurate about what is known and not known prompts people to find answers to the unknowns, gradually leading the plan toward greater certainty.

The plan then grows according to the system design. As the team works out the components that will make up the system, each new component creates a stream of work to be done to specify, design, implement, and verify that component, as specified by the life cycle. All these add new work steps into the plan, along with dependencies from one step to the next.

undisplayed image
Example pattern for developing a component (linearly).

The plan should be revised regularly. It will change whenever there is some change to the likely structure of the system and as each component proceeds through its specification and design work. Many components will require some investigation, such as a trade study or prototyping, before they can be designed. The plan will evolve as those investigations generate results.

Part way through building the system, the plan will typically become large and show significant parallelism. This is also normal and desirable, because it reflects the true state of development. Mid-project there usually are many things that people could be working on. The plan should reflect all these possibilities so that those managing the project know the true status of the work and can make decisions with accurate information.

undisplayed image
Example plan in progress. Some steps are complete, some are in progress.

The life cycle patterns a project uses provide building blocks out of which people can construct parts of the plan, but they do not dictate the plan entirely. Maintaining the plan is not simply a mechanical process of adding a set of work steps each time someone adds a new component to the design. There are three more factors to consider, and these make maintaining the plan a task requiring some skill.

First, the various components will be integrated into the system. The steps to put the components together and then verify that they interact correctly add more work steps.

Second, a component does not necessarily proceed linearly through specification to design to implementation. Often the design will require investigation, perhaps a trade study to compare possible alternatives. In many cases it is worth building a simple prototype of one or more of these alternatives to learn more about the component before settling on a design. This turns a design step into several steps. Sometimes the outcome of an investigation is that the whole approach to designing a set of components is wrong and design needs to be revisited at a higher level. (This is the rewinding discussed in the section on life cycle above.)

Third, many system development disciplines, such as agile or spiral development, do not proceed linearly with developing a component from start to finish in one go. They often focus on building a simple version of a component or of a collection of components first, and then adding features over time.

Each project will have its own style for addressing these factors, and this will be reflected in the specific work steps included in the plan. For example, when a project follows a spiral development methodology, the plan for developing a part of the system might have several internal milestones: first a simple version of the components that can do some minimal function, then another version or two with increasing function. There might be design, implementation, and verification steps for each component involved for each milestone.

A project should document what methodology it has chosen, so that team members know what to expect and so they can plan consistently.

Plan and tasking. The plan is used to guide tasking—the assignment of specific tasks to specific people (Section 18.6). The plan includes work steps that are in progress and ready to be executed. These are the sources of tasks that people can pick up and work on.

Most of the time, there will be more tasks that are ready to be worked on than there are people to do them. The plan organizes them and thereby helps the process of deciding what someone should do—whether a manager makes task assignment decisions or people pick tasks for themselves. If work steps in the plan include priorities, those will help guide task assignment decisions.

The plan and tasking together support accountability and measurement. They should allow someone to identify when a plan was changed, to see if the change was an improvement in retrospect. They should help identify when some tasks were completed faster or slower than expected, or completed with quality problems. This information can be used to improve forecasting and to identify tasks and procedures that should be restructured.

Plan and forecasting. Most projects will have deadlines they must meet. Customers want estimated delivery dates, so they can make preparations for steps they will take to put the system in operation. Funders may want intermediate milestones to show that their investment is on track. Others want to know the budget—money and time—required to get the system built, or to meet other internal milestones. The team will need to manage project execution in order to meet those deadlines.

One can look at this as a control problem. Forecasts using the plan provide the control input: based on the current plan, including its uncertainties, is the project likely to hit a deadline or not? The control outputs are to rearrange the work steps in the plan or to add and remove steps. Adding or removing steps often means adding or removing capabilities from the system, also known as adjusting the system to fit the time available.

Forecasting using the plan will always be imprecise because the plan reflects the actual uncertainty in the project. In some industries it is possible to estimate the time and effort required for work steps, within a reasonable error bound, once the system is well enough understood—for example, in many building construction projects. However, when building systems that do not have extensive comparable systems to work from, estimates will be unreliable for much of the project’s duration.
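To make the control-input idea concrete, the sketch below produces a crude forecast from a plan whose steps carry low and high duration estimates. It follows the dependencies to find the earliest and latest plausible finish, then compares that range with a deadline. Everything here is an assumption for illustration: the step names, the numbers, the `forecast` and `assess` helpers, and above all the simplification that every step has an estimate and that the low and high cases simply add up along the longest path.

```python
from graphlib import TopologicalSorter

def forecast(plan, estimates):
    """Return (low, high) weeks until every step in the plan is finished."""
    low, high = {}, {}
    for step in TopologicalSorter(plan).static_order():
        step_low, step_high = estimates[step]
        deps = plan.get(step, ())
        # A step can start only when all of its dependencies have finished.
        low[step] = max((low[d] for d in deps), default=0) + step_low
        high[step] = max((high[d] for d in deps), default=0) + step_high
    return max(low.values()), max(high.values())

def assess(deadline, low, high):
    """Compare the forecast range with a deadline, all in weeks."""
    if high <= deadline:
        return "likely to meet deadline"
    if low > deadline:
        return "unlikely to meet deadline"
    return "uncertain"

plan = {
    "specify": set(),
    "design": {"specify"},
    "prototype": {"specify"},
    "implement": {"design", "prototype"},
}
estimates = {  # hypothetical (low, high) estimates in weeks
    "specify": (1, 2),
    "design": (2, 5),
    "prototype": (1, 6),
    "implement": (4, 10),
}

low, high = forecast(plan, estimates)
print(low, high, assess(12, low, high))
```

A wide gap between the low and high figures is itself useful information: it shows where investigation or prototyping would do the most to make the forecast trustworthy.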

There are ways to manage a project’s plan to reduce uncertainty as quickly as possible. I discuss those elsewhere in this book.

18.6 Tasking

Tasking is about the day-to-day management of what tasks people are working on and what tasks are ready to be worked.

The choice of which tasks are ready is based on the plan, along with bugs that have been found, management tasks that need to be done right away, and ongoing tasks that do not show up in the plan.

Tasking builds on the plan. The plan should account for which tasks need to be done sooner than others in order to meet deadlines or to avoid stalling because of a dependency between tasks.

The objectives for tasking are:

One can treat tasking as a decision or control process that works to meet those objectives. Other scheduling disciplines, such as job-shop scheduling and CPU scheduling, can provide useful ideas for how to make choices about who should work on what.

There are many different choices about when, who, and how much. Each project will need to define its own approach, usually following whatever development methodology the team has selected. The approach should be documented as a procedure that the team follows.

Decisions about tasking can happen at many different times. It can happen reactively, when one task is completed, when a task someone is working on is stalled waiting for something else to happen, or when some urgent new task arrives (such as a high-priority bug or an external request). It can also happen proactively or periodically, putting together a set of tasks for someone to do ahead of time.

Tasking can be done by different people as well. One person can have a scheduler role and make these decisions. A group can divide up tasks by discussing and reaching consensus. Each person can take on tasks when they are ready for more. Combinations of these also work.

Finally, tasking decisions can occur one task at a time, or they can focus on giving each person a queue of tasks to perform.
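The two styles just described, deciding one task at a time or building a queue for each person, can be sketched side by side. This is only an illustration: the task names, the people, the numeric priorities (lower meaning more urgent), and both helper functions are hypothetical, and a real project would also weigh skills and availability.

```python
DEFAULT_PRIORITY = 99  # assumed value for tasks with no stated priority

def sort_key(task, priority):
    """Order by priority first, then by name so results are repeatable."""
    return (priority.get(task, DEFAULT_PRIORITY), task)

def next_task(ready, priority):
    """One-at-a-time tasking: pick the most urgent ready task."""
    return min(ready, key=lambda task: sort_key(task, priority))

def build_queues(ready, priority, people):
    """Queue-based tasking: deal ready tasks out to people in priority order."""
    queues = {person: [] for person in people}
    ordered = sorted(ready, key=lambda task: sort_key(task, priority))
    for i, task in enumerate(ordered):
        queues[people[i % len(people)]].append(task)
    return queues

ready = ["fix telemetry bug", "design power board", "write test plan"]
priority = {"fix telemetry bug": 1, "design power board": 2}

print(next_task(ready, priority))
print(build_queues(ready, priority, ["Avery", "Blake"]))
```

Whichever style a project uses, the choice is a procedure like any other and is worth writing down.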

A large project will have a very large number of tasks—potential, in progress, and completed—to keep track of. Using a shared task tracking tool of some kind is vital. Without one, tasks will be forgotten, or there will be confusion about how they have been assigned. The tracking tool is another one of the tools that the project should maintain (Chapter 16).

Each task must be defined clearly enough that the person doing the work can properly understand what is to be done, and so that everyone can agree when the task is complete.

18.7 Support

The decisions made in planning and tasking need supporting information.

Risk and uncertainty affect choices of what should be done sooner or deferred. I have often chosen to prioritize work that will reduce risk or clarify uncertainty, in order to make the project more predictable down the road. Many projects maintain a risk register, which lists matters that could put the project at risk. These risks are often programmatic, such as the risk that a delayed delivery from a vendor will cause the project to miss a deadline. I have on some projects maintained a separate, informal list of the technical uncertainties yet to be worked out; for example, how should a particular subsystem work?

Project management will also need to manage budgets. Programmatic budgets, most often funding, affect how the project execution can proceed. Technical budgets, such as mass, power, or bandwidth, are aspects of the system being built. For both types of budgets, the amount of the resource that has been used and the amount left need to be tracked. The project will need to estimate how much more of each resource will be needed to finish the project. If there isn’t sufficient resource left, then project management will have a decision to make: whether to reallocate resources, reduce demand, or find more resources.
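The bookkeeping described here can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not a recommended tool: the `Budget` class, its field names, and the example figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    """One tracked resource, programmatic (funding) or technical (mass, power)."""
    name: str
    allocated: float             # total amount of the resource available
    used: float                  # amount consumed so far
    estimate_to_complete: float  # estimated further need to finish the project

    @property
    def remaining(self) -> float:
        return self.allocated - self.used

    @property
    def shortfall(self) -> float:
        """Positive when the estimated need exceeds what is left."""
        return max(0.0, self.estimate_to_complete - self.remaining)


def budgets_needing_decision(budgets):
    """Return the budgets where management has a decision to make."""
    return [b for b in budgets if b.shortfall > 0]


# Hypothetical figures for illustration only.
budgets = [
    Budget("funding (k$)", allocated=1200.0, used=700.0, estimate_to_complete=450.0),
    Budget("mass (kg)", allocated=80.0, used=62.0, estimate_to_complete=25.0),
]
flagged = budgets_needing_decision(budgets)
```

With these figures only the mass budget is flagged: 18 kg remain against an estimated need of 25 kg. The sketch only surfaces the shortfall; choosing among reallocating, reducing demand, or finding more of the resource remains a management decision.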

Almost every project will need to report on how the work is progressing, relative to deadlines and available resources. The plan mechanisms should help people obtain and organize this information.

Sidebar: Development methodologies and operations

Each project will at some point choose a development methodology to follow. There are several popular methodologies, such as waterfall development, spiral development, or agile development, along with a great many variants of each.

The operations model I have presented is a mechanism that can support any of these methodologies. The methodologies affect the life cycle patterns, how the plan is structured, and how tasking is done.

Waterfall development is characterized by developing the system linearly, starting with a concept and working through design and implementation of each of the pieces, then integrating those pieces together to form the final system. The life cycle pattern for waterfall development will reflect this ordering, and plans will follow the life cycle pattern.

Spiral development is organized around a set of intermediate milestones. The system becomes a bit more complete at each of these milestones (or iterations). Each milestone adds some set of capabilities to the system, and the system, or some part of it, is integrated and operable at each milestone. The life cycle pattern for spiral development will define the way each spiral or iteration proceeds. The plan will reflect how the team will reach each of the upcoming milestones.

Agile development is organized around short cycles (called sprints in some versions of the methodology). Each cycle typically lasts one to four weeks, and adds a small number of capabilities to the system. The system is expected to be integrated and operable at the end of each cycle. Unlike spiral development, the objectives for each cycle are typically decided at the beginning of the cycle based on the set of tasks that are ready to execute, and priorities for each task. This means that agile development is primarily about tasking, and it relies on a plan that defines what all the ready tasks are.
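As a rough sketch of what "primarily about tasking" means, the following Python fragment picks the objectives for one cycle from the tasks that are ready, in priority order, until the cycle's capacity is used up. The `Task` fields, the convention that a lower number means higher priority, and the greedy selection rule are assumptions made for illustration; real agile practices vary.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    priority: int       # assumption: lower number = higher priority
    effort_days: float
    depends_on: set = field(default_factory=set)


def ready_tasks(tasks, completed):
    """Tasks not yet done whose prerequisites are all complete."""
    return [t for t in tasks
            if t.name not in completed and t.depends_on <= completed]


def select_for_cycle(tasks, completed, capacity_days):
    """Greedily fill the cycle with the highest-priority ready tasks that fit."""
    chosen, left = [], capacity_days
    for task in sorted(ready_tasks(tasks, completed), key=lambda t: t.priority):
        if task.effort_days <= left:
            chosen.append(task.name)
            left -= task.effort_days
    return chosen


# Hypothetical plan contents.
tasks = [
    Task("design API", priority=1, effort_days=3),
    Task("implement API", priority=2, effort_days=5, depends_on={"design API"}),
    Task("write user guide", priority=3, effort_days=2),
    Task("set up CI", priority=2, effort_days=4),
]
first_cycle = select_for_cycle(tasks, completed=set(), capacity_days=8)
second_cycle = select_for_cycle(tasks, completed={"design API"}, capacity_days=8)
```

The selection depends entirely on the plan supplying the task list, the dependencies, and the priorities, which is the reliance on the plan noted above.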

In practice, most projects end up using a combination of methods.

The cost or difficulty of changing a decision usually drives a project to combine methods. The easier it is to change a decision, meaning undoing the work of some tasks already completed, the more agile the methodology can be. The more costly it is, the more care should be taken to ensure that changes downstream are unlikely.

The cost of change is significantly lower near the beginning of a project, when there is less work to be redone and when one change will not cause a cascade of changes to other work already completed. As work progresses, a particular change will become increasingly costly.

The cost of change also depends on the kind of work involved. Software and similar artifacts are malleable. The cost of changing a line of software source code or changing one line in a checklist is, in itself, tiny, though a change in the software may cause a cascade of changes in other parts of the system and may cost time and effort to verify. Changing a built-up aircraft airframe, on the other hand, is costly in itself, in both materials and effort.

These differences in the cost of change lead to differences in life cycle patterns and planning related to potentially expensive decisions. For example, the NASA family of life cycles [NPR7120] follows a linear pattern in its early phases so that key aspects of the project can be worked out before the agency commits to large amounts of funding, especially for building aircraft or spacecraft hardware. Parts of some of these projects follow a more agile process after they have passed the Critical Design Review milestone (Section 20.2.1).

[1] In many cases I have seen, steps get added to procedures because someone wants to make sure they have a voice in any decisions made. This is a legitimate concern, but blindly adding review or approval steps to a procedure often does not really solve the problem. In most cases, the need to have a voice or to check something can be met with ensuring that regular communication happens, along with providing the person doing the main work of the procedure with the tools to perform most checks themselves.
[2] Per Merriam-Webster Online Dictionary.

Part V: Life cycles and project phases

Chapter 19: Introduction to life cycle patterns

23 February 2024

19.1 Introduction

System building in general follows a common story.

A project to develop a new system begins when someone has an idea that people should make the system. At this initial moment, the system is largely undefined. There is a vague concept in a few minds, but all the details are uncertain.

The project then moves the system from this initial concept through to an operational system, and through the operational life and eventual disposal of the system. During development, the team will need to ensure steps are taken in order to produce a correct, safe system. Designs will be checked. Implementations will be tested. The system as a whole will be verified before being deployed into service. At the same time, the resources spent on building the system must be used efficiently, doing the work that needs to be done and avoiding the work that doesn’t need to be done.

Many projects continue system development beyond the first operational version, with ongoing development or problem fixes. Some projects include the steps to shut down and dispose of the system once it has completed its functions.

The life cycle is how a project organizes the way the team moves through this story. It is a pattern that defines the phases and steps in the work: what will come first, what will be done before something else, and when checks will happen. It provides checklists to know when some step is ready to be done, and when it should wait for prerequisites. It provides checkpoints and milestones for reviewing the work, so that problems are found and dealt with in a timely way. It provides an overall checklist to ensure that all the work that needs to be done is in fact done.

Section 18.3 introduced the basic ideas for life cycle patterns. These include:

Each project will use its own life cycle patterns. The patterns may incorporate a framework that is standard for the industry or the parent organization. Selecting and documenting the patterns is an essential part of starting up a project, and people in the project should review how well the patterns are working for them from time to time and may want to improve the patterns.

19.2 Key ideas

Almost all project life cycle patterns, for both whole systems and for components, follow a similar overall flow. Abstracting from the story in the introduction, there are phases:

  1. Identifying purpose
  2. Developing a concept
  3. Refining concept into specification and design
  4. Implementation
  5. Verifying the result
  6. Operating the system or component
  7. Evolving it over time
  8. Disposing of the system or component at end of life

For a whole system, this looks like:

undisplayed image

Note that this flow starts with the system or component’s purpose. Good engineering always begins with having a clear understanding of what a thing is for. I have watched many engineers rush into designing and building a component without putting time into understanding what the component is going to be used for. By random chance their design has occasionally worked out to match what the component actually needed to do, but only rarely.

Understanding a system’s purpose or a component’s purpose also provides a way to bound the work. If one doesn’t know what a component is for, it is easy to keep working on a design without stopping because there isn’t a clear way to know when the design is good enough to be called done.

There are many points in this flow where one might add checks. At these times one can check on the correctness of the work. These checks improve system quality by building in the opportunity to discover and correct flaws before other work builds on the flawed work. Finding minor problems quickly usually means the cost of correction remains low.

This general pattern applies recursively. One can start by creating a specification and design for the system. The system design will decompose the system into high-level components (Section 5.4). The act of defining a set of components implies identifying a purpose for each one, then specifying and designing each high-level component. The design of a high-level component might in turn decompose into a set of lower-level components, which in turn need a purpose, then specification and design.

The overall flow shows a move from high uncertainty at the beginning to lower uncertainty as the work proceeds. I will address managing using uncertainty in ! Unknown link ref.

Finally, a project’s life cycle patterns will reflect the development methodology that the team has selected. Waterfall, spiral, and agile development all affect the contents of the patterns. I discuss this more in ! Unknown link ref.

The life cycle provides a general set of patterns for how work should proceed, but it should not define exactly how each work step should be done. That is left to procedures (Section 18.4), which should provide step-by-step instructions for how to do key parts of the life cycle. For example, if a life cycle phase indicates that a design review and approval should occur before the end of a design phase, then there should be a corresponding procedure for design reviews. That procedure should indicate who should be involved in a review, what they should look for, how those people will communicate about the results, who is responsible for approving the design, and how they indicate approval.

The life cycle patterns are the basis for the project’s plan (Section 18.5). The patterns are a set of building blocks that people in the project can use to develop the plan. The plan, in turn, guides tasking: the selection of which tasks (as defined in the plan) people should be working on next.

19.3 Purpose

Life cycle patterns address problems that projects have. They can help the team have a predictable and reproducible flow to how work should be done, so that everyone shares the same understanding of how the team works.

There are six ways that life cycle patterns should help a project.

  1. Quality of work. The team must build a system that addresses the customer’s purpose, and in doing so must meet quality, safety, security, and reliability objectives.
  2. Efficiency. The project will be expected to deliver the final system as quickly as possible, at the lowest reasonable cost, while meeting the quality objectives. This means that the team needs to be kept busy doing useful work.
  3. Team effectiveness. People on the team need to know how to work together. Building trust depends, in part, on having shared expectations of how each person will do their work.
  4. Management support. Project management will need to plan and track the work in order to ensure the team meets deadlines and that they have sufficient resources to do the work.
  5. Customer and regulatory support. The customer may have specific milestones they expect the project to meet in support of the customer’s acquisition processes. Regulators often have similar expectations if a system must be certified or licensed for operation.
  6. Auditing support. The project’s work may be audited to check that the processes followed meet regulatory requirements, certification requirements, or as part of a legal review.

Gaining these benefits is not a result of using life cycle patterns per se; rather, it comes from using patterns that are designed to provide the benefits. For example, if the customer has an acquisition process that specifies certain milestones, then the top-level life cycle pattern for the project should incorporate those milestones. If the project is likely to have auditing requirements, then the patterns should include tasks to generate and maintain auditing records.

Quality of work. The purpose of a project’s approach to operations is, in the end, to produce a system for the customer that meets their objectives. This means it should do what they need, meet safety and security needs, and be sustainable as the system evolves in the future. In other words, the team’s work needs to produce a system with good quality.

Neither the life cycle patterns by themselves nor the plan that derives from them directly result in good product quality. System quality comes from all of the detailed work steps that everyone on the team performs. If they do their work well, and if mistakes they make are caught and corrected, then the system can turn out well. If some work is not done well, nothing in the life cycle patterns can prevent that.

However, the life cycle patterns can create an environment that will more likely lead to good quality. They can proactively make flaws less likely by ensuring that steps happen in order: identifying purpose and concept before design and implementation, for example. They can insert points in the work that encourage people to think through what they should design or implement. They can also avoid problems by providing a checklist for what should be complete at the end of a work step. They can ensure that when a system is delivered, all the work needed to put it into operation is complete. They can build in checkpoints for reviews and verification to catch problems early. They also help project management organize the work so that it is complete, that is, so that no parts of the system and no work steps are overlooked.

Sometimes the value of a life cycle pattern will come from slowing down work. Most of the work done on a project is done by people who are focused on a particular part of the system; it is not their job to manage how the project goes as a whole. Their job is to get that one part designed and built, according to the specifications they have been given. If the specialists start building before the context for their work has been established, they are likely to design or implement something that does not meet system needs. I have been part of more than one project where the resulting rework caused the project to be canceled or required a company to get additional funding rounds to make up for the resources spent on the mistakes.

Efficiency. Most systems projects will be resource-bound, with more work than there are people on the team to do the work. In this kind of project, it is important to keep each person busy with useful work. This means that nobody on the team is blocked with no tasks they can usefully perform. It also means that almost all the tasks that people perform contribute to the final system—that there is little work that has to be thrown out and redone because it had flaws that made it unusable.[1]

As project management builds the project’s plan, using the life cycle patterns as building blocks, they must detect where there are dependencies between work steps and plan the work steps so that later steps are unlikely to get blocked. For example, if some part will require an unusually long time to specify and acquire from an outside vendor, then management will need to ensure that work on that part starts early. The life cycle patterns provide part of the structure on which the plan is based, and provide a template for some of the dependencies.
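One way to see why a long-lead part must start early is to compute the earliest finish time of each work step over the dependency graph. The sketch below does this with Python's standard `graphlib` module. The step names and durations in weeks are hypothetical, and the calculation assumes every step starts as soon as its prerequisites finish and that staffing is never a constraint.

```python
from graphlib import TopologicalSorter


def earliest_finish(durations, depends_on):
    """Earliest finish time of each step, assuming each starts once its prerequisites finish."""
    finish = {}
    # static_order() yields every step after all of its prerequisites.
    for step in TopologicalSorter(depends_on).static_order():
        start = max((finish[d] for d in depends_on.get(step, ())), default=0)
        finish[step] = start + durations[step]
    return finish


# Hypothetical durations in weeks.
durations = {
    "specify vendor part": 2,
    "acquire vendor part": 20,
    "design housing": 4,
    "build housing": 6,
    "integrate": 3,
}
# Each step maps to the steps that must finish before it can start.
depends_on = {
    "acquire vendor part": {"specify vendor part"},
    "build housing": {"design housing"},
    "integrate": {"acquire vendor part", "build housing"},
}
finish = earliest_finish(durations, depends_on)
```

In this example the housing could be ready at week 10, but integration cannot finish before week 25 because the vendor part arrives at week 22. Any delay in starting the vendor specification delays the whole project, while the housing work has 12 weeks of slack.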

Life cycle patterns can also help avoid unnecessary rework. This comes partly from the ways that the patterns help improve the quality of work. In particular, a good life cycle pattern can lead people to take the time to think through the purpose and specification of something before they jump into design and implementation unprepared, and then build something that does not meet the system’s needs.

Finally, the patterns can help bound the work to be done. When a project does not define the scope of work to be done, it is likely that someone will start working on something that exceeds or is unrelated to the customer’s needs. Good patterns help avoid this by defining an orderly and thoughtful process for identifying what work needs to be done.

Team effectiveness. Members of an effective team respect and trust each other. Having shared norms and understandings for how work is done and how people communicate is important as part of the environment that allows the team to develop respect and trust.

A defined life cycle for a project addresses part of this by defining a common understanding of how work should be done. Good patterns define expectations of what will be done in different work steps. Everyone on the team can agree when a work step has been completed. Good patterns also create times when people know they are expected to communicate about some work step. This makes it easier for someone to trust that they will be consulted at appropriate points about work that might affect what they are doing, so that they do not need to create separate, ad hoc communication channels or try to micromanage something that is not their direct responsibility.

As I have noted elsewhere (Section 18.2.1), the life cycle patterns can only have this benefit if the team actually follows them.

Management support. The team, or designated parts of it, will be responsible for making a plan (Section 18.5) for the project’s work, then coordinating and tracking the resulting tasks. The life cycle patterns provide templates for the tasks that will go into the plan, and the key milestones that anchor the work. The life cycle sets the pattern for phases that the project will go through, such as initial conception, initial customer acceptance, concept exploration, implementation, and verification. The cycle also sets the pattern for milestones that gate the progression from one phase to another, such as a concept review, a design review (and approval), or an operational readiness review.

The plan will change from time to time, both in response to external change requests and as the project progresses and the team learns more about the work ahead. Sometimes the need for change occurs gradually, with an issue slowly manifesting itself but causing no acute problem that causes people to recognize there is a need for change. A good life cycle will build in times for people to step back to get perspective and detect when there is a slow-building problem to address. Review milestones are often a good time to plan for this.

Having life cycle patterns and corresponding procedures that apply when these changes occur will help the team adjust their work in an orderly way. It will help them ensure that steps don’t get missed as they work out how to change the plan (and the system being built).

Good life cycle patterns can help a project steadily decrease its uncertainty and risk as work proceeds. Most of the time, a project will start with high uncertainty about what the system will look like, and early project phases result in increasing understanding of what the system will need to be. This process will repeat at smaller scales: once the general breakdown of the system into major components is decided on, each of those components will start with high uncertainty about how it will be structured. The uncertainty about the major components will then gradually resolve, and so on. However, this occurs only when the project is guided so that uncertainty is addressed systematically, not haphazardly.

Customer and regulatory support. Many customers will have a process they go through to decide whether to build a system and to track its development process. For US governmental customers, much of the process is encoded in law or regulation, such as the Federal Acquisition Regulation (FAR) [FAR] or Defense Federal Acquisition Regulation Supplement (DFARS) [DFARS]. The process governs matters like which design proposal is selected for contract, providing evidence of good progress, providing information that determines periodic contract payments, accepting the finished system, and determining whether the project should continue or be terminated.

These customers will expect deliverables from the project from time to time. The life cycle process must ensure that there are milestones when these are assembled and delivered. (It is then the job of project management to ensure that these milestones, and the tasks for preparing deliverables, can be completed on the timeline that the customer requires.)

Whether the customer requires explicit intermediate deliverables or not, formally involving the customer may be important for keeping the project on track.

Similarly, regulatory bodies have processes by which a system that must be certified or licensed before operation can apply for that approval. Those processes will define activities that the team must perform, along with milestones and deadlines by which applications must be submitted or approvals received.

Auditing support. A project’s development practices may be audited for many reasons. Auditors may perform a review as part of an appraisal or certification against standards, such as CMMI ! Unknown link ref. They may review processes to ensure compliance with regulatory standards, especially for security-sensitive projects. The processes may also be audited as part of a legal review. These reviewers need to see both the entire definition of processes, including the life cycle patterns, as well as evidence of how well the team has followed these practices.

19.4 A model for patterns

Each project will have several life cycle patterns, each covering a different part of the work.

Each pattern is defined by its purpose, the circumstances in which it applies, the phases or steps involved, and the dependencies among the steps. It should also include rationale that explains why the pattern is structured the way it is. In the previous chapter I used the example of a simple pattern for building one component:

undisplayed image

This pattern applies to building one low-level component where the purpose of the component is already known, and the component is straightforward to design and build in house. Similar but slightly different patterns might apply when the component has to be prototyped before deciding on a design, or when the component is being acquired from a supplier outside the project. This pattern would be used as one part of a larger pattern for building a higher-level component that includes this one.

Each phase of a pattern defines a way to move part of the work forward. It should have a defined purpose that defines what work should be achieved in that phase.

undisplayed image

The details of the phase are defined by:

Each action should also indicate who is responsible for performing that work. The responsibility will usually be defined as a role, not a specific individual. For example, a component design phase might involve three actions: design the component, review the design, and approve the design. The design action would be the responsibility of the component developer; the review action would be the responsibility of the developers responsible for components that interact with the one being designed, and the approval would be the responsibility of a systems engineer overseeing some higher-level component of which this one is part.
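The elements of this model can be written down as a small data structure. The following Python sketch is one possible encoding, not a definitive schema: the class and field names are assumptions, and the example content is taken from the component design phase described above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Action:
    name: str
    responsible_role: str            # a role, not a named individual
    procedure: Optional[str] = None  # reference to a procedure, when one is defined


@dataclass
class Phase:
    name: str
    purpose: str
    actions: list = field(default_factory=list)
    rationale: str = ""


@dataclass
class Pattern:
    name: str
    purpose: str
    applies_when: str
    phases: list = field(default_factory=list)
    # (earlier phase, later phase) pairs: the later phase depends on the earlier
    dependencies: list = field(default_factory=list)
    rationale: str = ""

    def roles(self):
        """All roles the pattern expects someone on the team to fill."""
        return sorted({a.responsible_role for p in self.phases for a in p.actions})


design = Phase(
    name="component design",
    purpose="produce an approved design for the component",
    actions=[
        Action("design the component", "component developer"),
        Action("review the design", "developers of interacting components"),
        Action("approve the design", "systems engineer"),
    ],
)
pattern = Pattern(
    name="simple in-house component",
    purpose="build one low-level component",
    applies_when="purpose is known; component is straightforward to build in house",
    phases=[design],
)
```

Listing the roles a pattern requires, as `roles()` does, gives a quick check of whether the team actually has someone to fill each one.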

The rationale for this example design phase might say:

The actions defined for the phase should reference the procedures for doing those actions, when those procedures are defined. For the example design review action, the procedure might be:

The procedure might also name the tools to be used (an artifact repository for the design, a review workflow tool for the reviews).

19.5 Documenting life cycle patterns

A team needs clear documentation of the phases if they are to execute them properly. A team can’t be expected to guess at what they need to be doing, or how their work will be reviewed; it needs to be spelled out.

This documentation is assembled during the project preparation phase. The details are usually not completely worked out before any other work is begun; rather, “project preparation” more often proceeds in small increments, working out the rules shortly before the associated work begins.

Each life cycle pattern should have a purpose, and the steps or phases in the pattern should be checked to confirm that they can achieve that purpose (and that there is no extraneous work in the pattern).

A pattern should also have an explanation of when it applies and when it does not. For example, there may be multiple patterns for designing a component: one for a simple component that is built in house; one for a component that is outsourced to a supplier; one for a high-level component that is made up of several lower-level components; one for a component that requires investigation or prototyping before deciding on a conceptual approach to its design. All these patterns likely have a lot in common, but procuring an outsourced component will have contracting steps that an in-house component will not.

Someone using the documentation should be able to tell accurately whether they are using the correct version of the patterns. The life cycle patterns should be revised from time to time, as the team grows and as people find ways to improve how they work together. This means that the material that a user sees should indicate not just a revision number but also whether the version they are looking at is no longer current.

The form of the documentation is not as important as the content. It can be a written document. It can be made available electronically, with structured access and search capabilities (such as in a Wiki). Some companies offer tools that help define and document development processes or life cycle patterns, including definitions of phases. What matters is that each person who needs to use the documentation can do so conveniently and accurately.

19.6 Work steps and artifacts

Each phase or step has a number of artifacts that the team must develop. At the end of a phase, some of those artifacts need to be complete (allowing for future evolution), and others need to have reached some defined level of maturity. The work in a phase consists of the tasks that develop those artifacts.

I discussed artifacts in Chapter 15. The artifacts are the products of building the system, including the system being delivered as well as documentation of its design and rationale, records of actions taken during development, and information about how the project operates.

These artifacts are the inputs and outputs of the work specified in life cycle patterns (and the associated procedures). Using the component design step example, the work uses:

The design step produces:

In general, every artifact involved in building the system should be a product of some work phase or step, and every input or output of work steps should be included in the set of artifacts the team will develop. Ideally, the life cycle patterns will be checked for consistency with the list of artifacts the project uses.
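The consistency check described here is mechanical enough to automate. The sketch below is a minimal version in Python, assuming the artifact list is a set of names and each work step declares its inputs and outputs; the step and artifact names are hypothetical.

```python
def check_consistency(artifacts, steps):
    """Compare the project's artifact list against the inputs and outputs of work steps.

    artifacts: set of artifact names the project intends to develop.
    steps: dict mapping step name -> {"inputs": set, "outputs": set}.
    """
    produced = set().union(*(s["outputs"] for s in steps.values()))
    referenced = produced.union(*(s["inputs"] for s in steps.values()))
    return {
        # Artifacts on the list that no work step produces.
        "never_produced": sorted(artifacts - produced),
        # Inputs or outputs of steps that are missing from the artifact list.
        "not_in_artifact_list": sorted(referenced - artifacts),
    }


# Hypothetical artifact list and work steps.
artifacts = {
    "component specification",
    "interface specification",
    "component design",
    "design review record",
}
steps = {
    "specify component": {"inputs": set(), "outputs": {"component specification"}},
    "design component": {
        "inputs": {"component specification", "interface specification"},
        "outputs": {"component design"},
    },
    "review design": {"inputs": {"component design"}, "outputs": {"review notes"}},
}
report = check_consistency(artifacts, steps)
```

Here the report shows that no step produces the interface specification or the design review record, and that the review step produces "review notes", which is not on the artifact list. Each finding points to either a missing work step or a naming mismatch to resolve.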

Different artifacts are developed at different times during the course of a project. A few artifacts should be worked out as the project is started—especially those recording the initial understanding of the system’s purpose and initial documentation of how the project will operate. These will be refined over time. Other artifacts are developed during the course of development, and the life cycle patterns indicate which ones are to be worked out before others. The artifacts will be in flux during development: the team learns about the system as it designs and develops it; the customer or mission needs often change over time; flaws get discovered in designs or implementations.

Many of the project’s artifacts support how people work together, and the life cycle patterns should reflect these communication needs. For example, one person may work out the protocol that two components need to use to communicate with each other. Two other people may design and implement the two components. The interface specification that the first person develops serves to communicate the details of the interface among all three people. The patterns should record that the design and implementation work steps depend on the work to develop the interface specification. Later, if one of the component developers identifies a flaw in the interface, the people involved can work through how to revise the interface—and the revised specification artifact records exactly how each person needs to update their work to match the change. The pattern helps to show how information about a change to the interface specification triggers rework on dependent artifacts.
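If the dependencies among artifacts are recorded, working out what a change touches becomes a graph traversal. The following Python sketch finds every artifact downstream of a changed one, using the interface example above. The artifact names and the dictionary representation are assumptions for illustration.

```python
from collections import deque


def affected_by_change(depends_on, changed):
    """Return every artifact that depends, directly or indirectly, on `changed`.

    depends_on maps each artifact to the set of artifacts it depends on.
    """
    # Invert the mapping: prerequisite -> artifacts that depend on it.
    dependents = {}
    for artifact, prerequisites in depends_on.items():
        for prerequisite in prerequisites:
            dependents.setdefault(prerequisite, set()).add(artifact)

    affected, queue = set(), deque([changed])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected


# Hypothetical artifacts for the two-component interface example.
depends_on = {
    "component A design": {"interface specification"},
    "component B design": {"interface specification"},
    "component A implementation": {"component A design"},
    "component B implementation": {"component B design"},
    "user manual": set(),
}
rework = affected_by_change(depends_on, "interface specification")
```

A change to the interface specification flags both designs and both implementations for review, while the unrelated user manual is left alone. Whether each flagged artifact actually needs rework is still a judgment for the people involved.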

A good life cycle pattern must have procedures to manage the change in artifacts, and how those changes affect other artifacts downstream from them. There are two separate problems these procedures must address:

  1. Managing how changes are coordinated across multiple artifacts and through the team while a part of the system is in development
  2. Ensuring that when a part of the system is complete, all the artifacts are consistent with each other

Different life cycle patterns approach this in different ways, which we will discuss in later chapters on different patterns. The most common approach is to maintain different versions of an artifact, with at most one version being designated as a baseline or approved version, and other versions designated as works in progress. Many configuration management tools have a way to designate a baseline version, and many software repository tools provide branching and approval mechanisms to track a stable version.
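The baseline idea can be illustrated with a small Python class. This is a sketch of the concept only, not how any particular configuration management tool works, and the class and method names are hypothetical. Holding the baseline in a single attribute is what guarantees that at most one version is the baseline at any time.

```python
class VersionedArtifact:
    """An artifact with many versions, at most one of which is the baseline."""

    def __init__(self, name):
        self.name = name
        self.versions = {}    # version id -> content
        self.baseline = None  # id of the approved version, if any

    def add_version(self, version_id, content):
        if version_id in self.versions:
            raise ValueError(f"version {version_id!r} already exists")
        self.versions[version_id] = content

    def approve(self, version_id):
        """Designate a version as the baseline, replacing any earlier baseline."""
        if version_id not in self.versions:
            raise KeyError(version_id)
        self.baseline = version_id

    def works_in_progress(self):
        return sorted(v for v in self.versions if v != self.baseline)


spec = VersionedArtifact("interface specification")
spec.add_version("v1", "initial protocol")
spec.add_version("v2", "adds error codes")
spec.approve("v1")
before = spec.works_in_progress()
spec.approve("v2")  # v1 is no longer the baseline
after = spec.works_in_progress()
```

In this simple sketch a superseded baseline falls back into the list of other versions. A real tool would usually keep a record of which versions were once approved.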

19.7 Life cycle and teams

What is the team size and background? How is it expected to change over time? A small team can often be a little less formal than a large team, because the small team (meaning no more than 5-10 people) can keep everyone informed through less formal communication. A large team is not able to rely on informal communication, so more explicit processes and communication mechanisms are important. Many teams start small when the project is first conceived, but grow large over time. A team that will grow will need to communicate more formally from the beginning than they otherwise might so that as they add people to the team, the larger team works smoothly.

Conversely, if the life cycle patterns indicate that some action will be performed by some person, does the team actually have the staff to do that work? When a project says that some work is to be done and then does not staff that function sufficiently, it sends a message to the team that they should not take the process as written seriously. This undermines the team’s trust. If the function is actually needed, either the team will find an ad hoc workaround or the function will not get done adequately. Either way, there will be a disconnect between what is written down and what actually happens.

19.8 Life cycle and planning

The life cycle patterns are just patterns that provide a guide to work that goes in the project’s plan. The plan is the actual definition of the tasks to be done. When the plan needs to be updated, the patterns provide a template for the work that goes into the plan.

Assembling the plan, however, takes into account many inputs, of which the pattern is only one. Planning involves deciding on the priority and deadlines for work, which is based on project deadlines, risk or uncertainty, and the project’s development methodology.

! unknown reference XXX discusses in detail how the plan is developed and maintained, including how the life cycle patterns get incorporated.

Consider the following example of how a pattern gets incorporated into the plan. This example shows how the pattern is only a template, and there are many decisions that will depend on other information.

This pattern defines what should happen when a customer requests a change. The basic pattern is that first someone on the team should evaluate the request; this may involve working with the customer to clarify the request, and with other engineers to estimate the scope and cost of the work. The project can then make a decision whether to accept the change or not. If the decision is to make the change, work to build, release, and deploy the update will follow. If not, there is another pattern for how to communicate with the customer that the change will not be made.

undisplayed image

The activity starts when the project receives a change request. Based on this, the plan can be updated to include three tasks right away: the evaluation, review, and decision tasks.

At the same time, the planner must make decisions: who should each task be assigned to? What priority should the flow of tasks have? The pattern can indicate the roles involved in the tasks, such as there being a small team responsible for evaluating change requests and a customer representative from the marketing team, but it doesn’t determine which specific people. That’s for the planning and tasking efforts to determine. Similarly, the pattern does not specify how the work should be prioritized relative to other work the same people are doing. The planner incorporates information about how urgent the customer’s request might be and the importance of the customer into the decision. The project might have decided, for example, that there should be a queue of outstanding change requests and they should be evaluated in their order in the queue.
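The split between what the pattern supplies and what the planner decides can be sketched in a few lines of Python. This is an illustration under assumptions, not a description of any planning tool: the step names, the role names (including "project lead"), and the first-in, first-out queue policy are all hypothetical choices made for the example.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    name: str
    role: str                       # supplied by the pattern
    request_id: str
    assignee: Optional[str] = None  # decided later by the planner


# Hypothetical encoding of the first steps of the change request pattern.
# Each entry pairs a step with the role the pattern names for it.
CHANGE_REQUEST_PATTERN = [
    ("evaluate request", "change evaluation team"),
    ("review evaluation", "component engineers"),
    ("decide on change", "project lead"),
]


def instantiate(request_id: str) -> list[Task]:
    """Copy the pattern's steps into concrete, still unassigned tasks."""
    return [Task(name, role, request_id)
            for name, role in CHANGE_REQUEST_PATTERN]


class Plan:
    def __init__(self) -> None:
        self.queue: deque[str] = deque()  # change requests in arrival order
        self.tasks: list[Task] = []

    def receive(self, request_id: str) -> None:
        # Receiving a request adds the three initial tasks right away.
        self.queue.append(request_id)
        self.tasks.extend(instantiate(request_id))

    def next_request(self) -> Optional[str]:
        # One possible policy: evaluate requests in queue order.
        return self.queue[0] if self.queue else None

    def assign(self, request_id: str, task_name: str, person: str) -> None:
        # Choosing a specific person is a planning decision, not part
        # of the pattern.
        for task in self.tasks:
            if task.request_id == request_id and task.name == task_name:
                task.assignee = person
                return
        raise KeyError((request_id, task_name))
```

The pattern contributes only the step names and roles; the assignee field stays empty until the planner fills it in, and the queue policy lives in the plan rather than in the pattern.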

Determining who should be involved in a review of the evaluation might depend on the results of the evaluation. The pattern might indicate that the evaluation should be reviewed by engineers responsible for each high-level component that will be affected by the change. This means that the decision about who specifically will be tasked with the review can’t be made until the evaluation has worked out the scope of the change.

The decision to proceed with making an update will depend in part on whether the team has the time and resources to make the update. The team will need to determine whether adding the work to the plan will cause a problem with meeting deadlines that have been established already, or if it will overload a team that is already busy. This determination will involve analysis of the current plan—something that the life cycle pattern can help with only to the extent that the patterns can help with generating estimates of the work that would be involved.

When the project takes the decision to go ahead with developing an update for the request, the pattern shows that work steps follow to develop a change and then release and deploy the update. When the decision gets made, this will trigger the planning activity to add development and release work into the plan. These are high-level work steps with little detail. The planner will find patterns for these steps and populate those patterns into the plan.

Decisions about the work involved in development will depend on the development methodology that the team has selected to follow. If the update will involve extensive changes and the team is following a spiral-style methodology ! Unknown link ref, the development plan might consist of two or three development rounds. Each round would design and implement part of the changes, with a milestone at the end of each round showing how the partial changes have been integrated into the system.

Decisions about the release and deployment work will also incorporate policy decisions about how the team works. Will each change request result in a separate update release? Or will updates be bundled together into releases that combine several updates, perhaps on a schedule defined in advance?

19.9 Principles for a life cycle pattern

In this section I list some principles to consider when designing a life cycle pattern.

The act of designing—or refining—a life cycle pattern is an opportunity to think deliberatively about how the team should get its work done. Life cycle patterns are the templates for the project’s plan, and so they should be designed to achieve the work that is needed to move the project forward well.

Designing the patterns ahead of time means having time to define good work patterns. The pattern does not have to be worked out under pressure, as a reaction to something unexpected happening in the project. It can be discussed among multiple team members to get different perspectives and to ensure everyone’s needs are met. Working in advance gives time to check that the steps in the pattern are consistent with each other. It means that there is time to think about what exceptional situations might happen and define what to do in those cases.

Note that if an organization already has an approach to life cycle patterns, whether documented or not, one should aim for continuity with that approach. Anyone already in the organization will know that approach to organizing work; making a major change would mean loss of the advantage of established team habits. On the other hand, if the current approach is not working well, then a new project is an opportunity to improve.

The life cycle patterns encode principles and methodology that encourage good work. Principles to consider include:

  1. Know the purpose for something before developing it.
  2. Build in time for and incentivize deliberative thinking.
  3. Assign decision-making authority to an appropriate level based on the nature of the decision.
  4. Build in ways to check work, and design them so they are a team norm and not prone to triggering defensive reactions.
  5. Build for the longer term.
  6. Think about exceptions that might happen, how to handle them, and when to change course.
  7. Define the work so that everyone on the team can agree when a step has been completed.
  8. Give a clear definition for each step of the quality considerations by which the work can be judged.
  9. Make the pattern as light-weight as possible without compromising quality.

Purpose. I have mentioned this principle several times already, and I believe it is a basic principle of effective system-building. The life cycle patterns encode this principle for specific parts of the team’s work.

As with anything else that is designed, a pattern itself starts with a purpose. That purpose might be “build a simple component” or “build the whole system” or “handle a customer’s change request”. A good pattern addresses its purpose thoroughly, without trying to achieve other purposes.

The pattern that results should then ensure that team members follow this approach when building parts of the system. If the pattern is for handling a customer’s change request, for example, the pattern should address understanding and documenting what the customer wants changed (and why), before starting to work out whether to agree to the change or to begin implementing the change.

Time to think. Key parts of a complex system are best served by taking some time to properly understand the purpose or need of that part, and to look at options for how it can be designed or built. A project running at too fast a pace skips this thinking and uses the first thing that someone thinks of—though there may be subtle ramifications of that decision that are not appreciated until the decision causes a problem later. Asking someone to show the alternatives they considered, and rewarding them for doing so, works to improve the quality of the system.

At the same time, people can take too long to make a decision or fixate on making it perfectly. The time spent on deliberation should be bounded to avoid this.

Decision-making authority. Bezos introduced the idea of reversible and irreversible decisions [Bezos16]. He wrote:

Some decisions are consequential and irreversible or nearly irreversible—one-way doors—and these decisions must be made methodically, carefully, slowly, with great deliberation and consultation. If you walk through and don’t like what you see on the other side, you can’t get back to where you were before. We can call these Type 1 decisions. But most decisions aren’t like that—they are changeable, reversible—they’re two-way doors. If you’ve made a suboptimal Type 2 decision, you don’t have to live with the consequences for that long. You can reopen the door and go back through. Type 2 decisions can and should be made quickly by high judgment individuals or small groups.

As organizations get larger, there seems to be a tendency to use the heavy-weight Type 1 decision-making process on most decisions, including many Type 2 decisions. The end result of this is slowness, unthoughtful risk aversion, failure to experiment sufficiently, and consequently diminished invention.

For engineering projects, many decisions fall in the middle ground between reversible and irreversible. Consider building an aircraft. As long as the designs are just drawings, the designs can be changed with low to moderate cost. Early in the design process changes can be quite low cost; as the design progresses and more and more interdependent components are designed, the cost of rework increases. Once the airframe has been machined and assembled, the cost of changing its basic design becomes high, possibly high enough in time or in money that it is in effect irreversible.

Good life cycle patterns will account for different costs of reversing decisions. They should both build in time for deliberation and consultation before making hard-to-reverse decisions and use lighter-weight decision-making for less risky decisions. Similarly, the patterns should ensure that the authority for hard-to-reverse decisions is assigned to someone with high-level responsibility in the project, while the authority for low-risk decisions should be placed as close to the work as possible.
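One way to picture this is as a lookup from the cost of reversing a decision to the decision-making process the pattern calls for. The sketch below is illustrative only: the three cost levels, the role names, and the time bounds are assumptions invented for the example, and a real project would choose its own.

```python
from dataclasses import dataclass
from enum import Enum


class ReversalCost(Enum):
    LOW = 1       # e.g. changing a drawing early in design
    MODERATE = 2  # e.g. reworking several interdependent designs
    HIGH = 3      # e.g. changing an airframe that has been machined


@dataclass(frozen=True)
class DecisionProcess:
    authority: str            # who makes the decision
    deliberation_days: int    # upper bound on time spent deliberating
    independent_review: bool  # whether consultation or review is required


# Illustrative policy only; the roles and time bounds are assumptions.
POLICY = {
    ReversalCost.LOW: DecisionProcess("engineer doing the work", 1, False),
    ReversalCost.MODERATE: DecisionProcess("subsystem lead", 5, True),
    ReversalCost.HIGH: DecisionProcess("project lead", 20, True),
}


def process_for(cost: ReversalCost) -> DecisionProcess:
    """Return the decision process a pattern would call for at this cost."""
    return POLICY[cost]
```

The bound on deliberation time appears at every level, reflecting the earlier point that time spent deliberating should be limited even for hard-to-reverse decisions.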

Checking work. Checking that work has been done well is commonly understood to improve the quality of results ! Unknown link ref. It is essential for parts of a system that require high assurance—safety- or security-critical parts.

The key to checks is that they not be subject to implicit biases that the developer might have. This can be handled either by the developer doing analyses that force a stepping back from decisions (perhaps by encoding them mathematically) and that can be checked for accuracy by someone else, or by having an independent person review the work.

Either way, the developer’s pride in their work can feel threatened. Setting out life cycle patterns in which every part of the work is checked enables the project to make checks a norm. Designating in advance that checks will happen, and who will do them, helps depersonalize the effort and in the long term contributes both to quality work and team morale.

Building for the longer term. It is easy to solve an immediate problem at hand quickly and move on, leaving a problem for the future. Taking time to think about the problem (the principle of taking time for deliberative thinking, above) will help but is not sufficient.

It is likely that someone will revisit the work sometime in the future. They may need to understand the work in order to fix a flaw or make an upgrade. They may be auditing the work as part of a critical safety review. They will need to know the rationale for decisions that were made, and they will need to understand subtle aspects of the work. If this information has been documented, these people in the future will be able to do their work accurately and relatively quickly. If they have to deduce this information by looking at artifacts built in the work, they will have to spend time reverse-engineering the work and their accuracy is generally low.

Building into the pattern checks for documentation of rationale and explanations will accelerate future work.

Exceptions. Things often do not go to plan. What then? Who needs to know? What needs to be done to respond?

Sometimes this is as simple as setting an expectation for the team. If a component’s specification is inconsistent or cannot be met, who gets informed, and how does the problem get corrected?

Sometimes the situation is time-critical. If a major piece of equipment catches fire, what is the response? What if an insecure component has been incorporated and deployed? What if a large part of the system has been built, and someone finds a fundamental flaw? The responses to situations like these are complex, and there often isn’t time in the moment to work out the details.

Good life cycle patterns include pre-planned responses to these exceptional situations. This might consist of references to procedures that should be followed, or it might reference a pattern used to respond to the situation.

Completeness. Can everyone on the team agree when a part of the work has been completed? The person assigned a task should understand their assignment, so that they can do their work independently. Others will check the work, or mentor the person doing the work—and they should have the same understanding of the assignment.

The definition of actions, as well as the list of outputs and post-conditions for a pattern, should be clear to everyone.
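A definition of completion that everyone can agree on is one that can be checked mechanically, at least in part. The following sketch assumes a step is described by a list of required outputs and a set of named post-conditions; the class, the field names, and the example step are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    name: str
    required_outputs: list[str]
    # Each post-condition is a named check over the outputs produced.
    post_conditions: dict[str, Callable[[dict], bool]] = field(default_factory=dict)

    def unmet(self, outputs: dict) -> list[str]:
        """List the reasons this step is not yet complete (empty if done)."""
        missing = [f"missing output: {name}"
                   for name in self.required_outputs if name not in outputs]
        if missing:
            # Post-conditions are only meaningful once all outputs exist.
            return missing
        return [f"post-condition not met: {name}"
                for name, check in self.post_conditions.items()
                if not check(outputs)]


# Hypothetical step: a specification that must exist and be approved.
write_spec = Step(
    name="write component specification",
    required_outputs=["specification", "review record"],
    post_conditions={
        "review approved": lambda out: out["review record"].get("approved") is True,
    },
)
```

Because the person doing the work and the person checking it evaluate the same list, a disagreement about whether the step is done becomes a disagreement about a specific, named item.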

Quality considerations. As with completeness, the people assigned to work on tasks need to have a clear definition of what makes the results of their work acceptable, or what makes one way better than another. Sometimes this is simple: when objectives or specifications, which would be inputs to a work step, are met. Other times considerations of quality arise not from specifications but from things like coding standards. In those cases the quality considerations should be spelled out explicitly so the people doing the work know to use them.

Light-weight patterns. Good patterns are lightweight enough to get their job done, and not more. Working out the pattern in advance is an opportunity to work out what parts of the work are truly needed and which can be omitted or simplified. For example, a pattern should be adapted to the possible cost of making a wrong decision (see decision-making authority above). Patterns that involve easily-reversible decisions should include streamlined decision-making steps, pushing the decision authority to as low a level in the team as possible and involving as little work as possible. On the other hand, more difficult decisions should involve a pattern that calls for greater deliberation, more checking and consultation, and places decision-making authority higher in the team’s hierarchy.

Similarly, the patterns should be achievable by the team. If the team is small, it makes no sense to mandate complex work flows for which there isn’t the staff. Each decision about what to include in a pattern should be measured against what is possible for the team to perform.

19.10 In upcoming chapters

In the chapters that follow, I discuss life cycle patterns in more detail. This includes:

XXX add to this list as the part is developed

[1] Prototyping can be a grey area, on the boundary between useful and not useful work. I will argue in ! Unknown link ref that prototyping is useful, and indeed necessary, for reducing uncertainty about how a part of a system can be designed or implemented. However, a prototyping effort can produce less value than is justified by the effort involved if the prototyping goes on too long, or if it is not focused on learning rather than building. My guidance is that prototypes must not have a path to transition directly into a real component, and the prototype artifacts must be segregated from other system artifacts.

Chapter 20: Example life cycle patterns

29 February 2024

20.1 Introduction

In this chapter I survey some of the many different life cycle patterns in use.

The patterns have different scopes. Some cover the whole life of a system, from conception through retirement. Some are concerned only with developing a system. Others focus on more narrow parts of the work.

I group the patterns in this chapter into four sets, based on scope. The first group covers the whole life of a project, without much detail in the individual steps. The second dives into the development process. The third addresses post-development processes—for releasing and deploying a system; these patterns overlap with development processes. The fourth and final group is for patterns with a narrow focus on some specific detail of building a system.

Patterns with different scopes can potentially be combined. Most patterns that cover a system’s whole life, for example, define a “development phase” but do not detail what that is. One of the patterns for developing a system can be used for the details.

Each of the examples will include a comparison against the following baseline pattern for the whole life of a project.

undisplayed image

The baseline phases are the same as in Section 19.2:

20.2 Whole project life cycle

These patterns organize the overall flow of a project, from its inception through system retirement and project end. I have selected two examples: the NASA project life cycle, which is used in all NASA projects big and small, and the Rational Unified Process, which arose from a more theoretical understanding of how projects should work.

20.2.1 NASA project life cycle

The NASA life cycle has been refined through usage over several decades. It is defined in a set of NASA Procedural Requirements (NPR) documents. The NASA Space Flight Program and Project Management Requirements document [NPR7120] defines the phases of a NASA project.

The NASA life cycle model is designed to support missions—prototypically, a space flight mission that starts from a concept, builds a spacecraft, and flies the mission.

NASA space flight missions involve several irreversible decisions, and this is reflected in how the phases and decisions are organized. Obtaining Congressional funding for a major mission can take months or years. During development, constructing the physical spacecraft, signing contracts to acquire parts, and allocating time on a launch provider’s schedule are all expensive and time-consuming to reverse. Launching a spacecraft, placing it in a disposal orbit, and deactivating it are all irreversible. These decision points are reflected in where there are divisions between phases, and when there are designated decision points in the life cycle.

There are several life cycle patterns in that document, depending on the specific kind of program or project. I focus on the most general project life cycle [NPR7120, Fig. 2-5, p. 20], which is reproduced below:

undisplayed image

The pattern includes seven phases. There is a Key Decision Point (KDP) between phases. Each decision point builds on reviews conducted during the preceding phase, and the project must get approval at each decision point to continue on to the next phase.

The key products for each phase are defined in Chapter 2 of the NPR and in Appendix I [NPR7120, Table I-4, p. 129].

Pre-Phase A (Concept studies). This phase occurs before the agency commits to a project. It develops a proposal for a mission, and builds evidence that the concept being proposed is both useful and feasible. A preliminary schedule and budget must be defined as well. If the project passes KDP A, it can begin to do design work.

Phase A (Concept and technology development). This phase takes the concept developed in the previous phase and develops requirements and a high-level system or mission architecture, including definitions of the major subsystems in the system. It can also involve developing technology that needs to be matured to make the mission feasible. This phase includes defining all the management plans and process definitions for the project.

Phase B (Preliminary design and technology completion). This phase develops the specifications and high-level designs for the entire mission, along with schedule and budget to build and complete the mission. Phase B is complete when the preliminary design is complete and consistent and feasible.

Phase C (Final design and fabrication). This phase involves completing detailed designs for the entire system, and building the components that will make up the system. Phase C is complete when all the pieces are ready to be integrated and tested as a complete system.

Phase D (Assembly, integration, test, launch, checkout). This phase begins with assembling the system components together, verifying that the integrated system works, and developing the final operational procedures for the mission. Once the system has been verified, operational and flight readiness reviews establish that the system is ready to be launched or flown. The phase ends with launching the spacecraft and verifying that it is functioning correctly in flight.

Phase E (Operations and sustainment). This phase covers performing the mission.

Phase F (Closeout). In this phase, any flight hardware is disposed of (for example, placed in a graveyard orbit or commanded to enter the atmosphere in order to destroy the spacecraft). Data deliverables are recorded and archived; final reviews of the project provide retrospectives and lessons learned.
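The gating behavior described above, where a project may not enter the next phase until it has approval at the intervening decision point, can be sketched as a small state machine. This is a simplified illustration, not a rendering of the NPR. It assumes that each decision point is named for the phase it admits the project to (KDP A precedes Phase A, and so on), and the class and method names are hypothetical.

```python
PHASES = [
    "Pre-Phase A", "Phase A", "Phase B", "Phase C",
    "Phase D", "Phase E", "Phase F",
]


class Project:
    def __init__(self) -> None:
        self.index = 0        # start in Pre-Phase A
        self.approved = set()  # decision points passed so far

    @property
    def phase(self) -> str:
        return PHASES[self.index]

    def next_kdp(self):
        """Name of the decision point guarding the next phase, if any."""
        if self.index + 1 >= len(PHASES):
            return None
        # Assumption: the KDP takes the letter of the phase it leads into.
        return "KDP " + PHASES[self.index + 1].split()[-1]

    def approve(self, kdp: str) -> None:
        # Only the decision point immediately ahead can be approved.
        if kdp != self.next_kdp():
            raise ValueError(f"{kdp} is not the next decision point")
        self.approved.add(kdp)

    def advance(self) -> None:
        kdp = self.next_kdp()
        if kdp is None:
            raise RuntimeError("project is already in its final phase")
        if kdp not in self.approved:
            raise PermissionError(f"{kdp} has not been approved")
        self.index += 1
```

The reviews that feed each decision point are omitted here; in this sketch approval is a single call, whereas in practice it builds on reviews conducted during the preceding phase.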

This pattern of phases grew out of complex space flight missions, where expensive and intricate hardware systems had to be built and extensively tested. These missions often required extensive new technology development. The NASA procedures for such missions are therefore risk-averse, as is appropriate.

I have observed that many smaller, simpler space flight projects have not followed this sequence of phases as strictly as higher-complexity missions do. Many cubesat missions, where the hardware is relatively simple and more of the system complexity resides either in operations or in software, have blurred the distinctions between phases A through C. In these projects, software development has often begun before the Preliminary Design Review (PDR) in Phase B, and the teams have used continuous integration tools to begin verifying that the software components work together as they are developed rather than waiting for a formal integration activity in Phase D.

At the same time, I have observed some of these smaller space flight projects failing to develop the initial system concept and requirements adequately before committing to hardware and software designs. This has led to projects that failed to meet the mission needs—in one case, leading to project cancellation.

The phases in the NASA life cycle compare with the baseline model presented earlier as follows.

undisplayed image

The NASA life cycle splits the system development activities across four phases. The NASA approach does this because it needs careful control of the design process, in particular so that agency management can make decisions whether to continue with a project or not at reasonable intervals. The NASA approach also places reviews throughout the design and fabrication in order to manage the risk that the system’s components will not integrate properly. Many NASA missions involve spacecraft or aircraft that can only be built once because of the size, complexity, and expense of the final product; this makes it hard to perform early integration testing on parts of the system and places more emphasis on design reviews to catch potential integration problems.

The NASA pattern is notable for some initial work on a mission concept starting before the project is officially signed off and started. There are two reasons for this. First, because all NASA missions have common processes, there is less unique work to do for each individual project. Second, NASA is continuously developing concepts for potential missions, and this exploratory work is generally done by teams that have an ongoing charter to develop mission concepts. For example, the concept for one mission I worked on was developed by the center’s Mission Design Center, which performed the initial studies until the concept was ready for an application for funding.

20.2.2 Unified Process

The Unified Process (UP) was a family of software development processes developed originally by Rational Software, and continued by IBM after they acquired Rational. Several variants followed in later years, each adapting the basic framework for more specific projects.

The UP was an attempt to create a framework for formally defining processes. It defined building blocks used to create a process definition: roles, work products, tasks, disciplines (categories of related tasks), and phases.

The framework led to the creation of tools to help people develop the processes. IBM Rational released Rational Method Composer, which was later renamed IBM Engineering Lifecycle Optimization – Method Composer [ELOMC]. A similar tool was included in the Eclipse Foundation’s process framework, which appears to have been discontinued [EPF]. These tools aimed to help people develop processes and then publish the process documentation in a way that would let people on a team explore the processes.

While the UP and its tools gained a lot of attention, their actual use appears to have been limited. I explored the composer tool in 2014, and found it remarkably hard to use. It came with a complex set of templates, which were too detailed for the project that I was working on. Another author wrote that “RUP became unwieldy and hard to understand and apply successfully due to the large amount of disparate content”, and that it “was often inappropriately instantiated as a waterfall” [Ambler23]. Certainly I found that the presentation and tools encouraged weighty, complex process definitions and that they led the process designer into waterfall development methodology.

The UP defined four phases: inception, elaboration, construction, and transition.

  1. Inception. The inception phase concerns defining “what to build”, including identifying key system functionality. It produces the system objectives and a general technical approach for the system.
  2. Elaboration. This phase is for defining the general system structure or architecture and the requirements for the system. The results of this phase should allow the customer to validate that the system is likely to meet their objectives. This phase may be short if the system is well defined or is an evolution of an existing system. If the system is complex or requires new technology, the elaboration phase may take a longer time.
  3. Construction. This involves developing detailed component specifications, then building and testing (verifying) the components. This includes integrating the components together into the whole system and verifying the result. The result is a completed system that is ready to transition to operation. RUP focuses on constructing the system in short iterations.
  4. Transition. This phase involves beta testing the system for final validation that the customer(s) agree that the system does what is needed, and deploying or releasing the final software product.

The UP does not directly address supporting production, system operation, or evolution; however, the expectation is that, for software products, there will be a series of regular releases (1.0, 1.1, 1.2, 2.0, …) that provide bug fixes and new features. Each release can follow the same sequence of phases while building on the artifacts developed for the previous release.

The four phases in UP compare with the simple model presented earlier as follows:

undisplayed image

The Unified Process provides lessons for defining life cycle patterns: keep the patterns simple, make them accessible to the people who will use them, and put the emphasis on what they are for, not on tools and forms. The basic ideas in UP are good—carefully defining a life cycle, and building tools to help with the definition. I believe that these good ideas got lost because the effort became too focused on elaborate tools and models, losing focus on the purpose of life cycle patterns: to guide the team that actually does the work.

20.3 System development patterns

Some patterns focus only on the core work of developing a system. These patterns generally begin after the project has been started and the system’s purpose and initial concept are worked out. The patterns go up to the point when a system is evaluated for release and deployment. In between, the team has to work out the system’s design, build it, and verify that the implementation does what it is supposed to.

These examples all share the common basic sequence of specifying, designing, implementing, and verifying the system or its parts. Some of the examples include similar sequences of activity to evolve the system after release.

20.3.1 Systems V model

This pattern is widely used in systems engineering work. It is organized around a diagram in the shape of a large V. It is used in many texts on systems engineering ! Unknown link ref. It has also been used to organize standards, such as the ISO 26262 functional safety standard [ISO26262, Part 1, Figure 1].

In general, the left arm of the V is about defining what should be built. The right arm is about integrating and verifying the pieces of the system. Implementation happens in between the two. One follows a path from the upper left, down the left arm, and back up the right side to a completed system.

There is no one V model. There are many variants of the diagram, depending on the message that the author is trying to convey. Here are two variants that one often encounters.

The first variant focuses on the sequence of work for the system as a whole:

undisplayed image

The second variant focuses on the hierarchical decomposition of the system into finer and finer components:

undisplayed image

The key idea is that specifications, of the system or of a component, are matched by verification steps after that thing has been implemented.

In general this model conflates three ideas that should be kept separate.

  1. Development follows a general flow of specification, then design, then implementation, then verification.
  2. System development proceeds from the top down: start with the whole system, and recursively break that into components until one reaches something that can be implemented on its own.
  3. Development follows a linear sequence from specification and design, through implementation of components, followed by bottom up integration of the components into a system (with verification along the way).

The first two ideas are reasonable. Having a purpose for something before designing and building it is a good idea. There are exceptions, such as when prototyping is needed in order to understand how to tackle design, but even that exception is merely an extension to the general flow. The second idea, of working top down, is necessary because at the beginning of a project one only knows what the system as a whole is supposed to do; working out the details comes next. Again there are exceptions, such as when it becomes clear early on that some components that are available off the shelf are likely useful—but again, that can be treated as an extension of the top down approach.

The third idea works poorly in practice. It is, in fact, an encoding of the waterfall development methodology into the life cycle pattern, and so the V model inherits all the problems that the waterfall methodology has.

In particular, the linear sequence orders work so that the most expensive development risk is deferred as late as possible, to the point where problems are the most expensive to find and fix. By integrating components bottom up, minor integration problems are discovered first, shortly after the low-level components have been implemented, when it is cheapest to fix problems in those low-level components. Higher-level integration problems are left until later, when complex assemblies of low-level components have been integrated together. These integration problems tend to be harder to find because the assemblies of components have complex behavior, and more expensive to fix because small changes in some of the components trigger other changes within the assemblies already integrated.

Development methodologies other than waterfall address these issues better, as I discuss elsewhere in this book.

20.3.2 Systems or software development life cycle (SDLC)

20.4 Post-development patterns

20.4.1 EVT/DVT/PVT

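Many electronics development organizations use a set of three development and testing phases, commonly expanded as engineering validation test (EVT), design validation test (DVT), and production validation test (PVT):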

EVT. The EVT phase is preceded by developing requirements for the hardware product. It is often also preceded by development of a proof of concept for the board.

During EVT, the team designs and builds working prototypes, often iterating from a first prototype through a few revisions as testing reveals problems with the prototype. The EVT phase ends when the team has a prototype whose design passes basic verification.

DVT. The DVT phase involves more rigorous testing of a small batch of the designed board. The design should be final enough that sample boards can be submitted for certification testing. The DVT phase ends when the sample boards pass verification and certification tests.

PVT. The PVT phase involves developing the mass manufacturing process for the board. This includes testing a production line, assembly techniques, and acceptance testing.

These three phases are all part of the system development phase of the prototype phase pattern I have presented.

This pattern addresses mass production of hardware in ways that our prototype pattern and the NASA pattern do not.

XXX references for this process

20.5 Comparisons and lessons learned

The differences between the four phase patterns I have presented illustrate how a project must adapt its phases for the specific system, vendors, and customers involved.

There is one thing in common in all the patterns: they all put effort into defining the concept and objectives for the system first, before investing too much in developing the system.

The differences between the NASA approach and the other approaches illustrate how phases are structured differently when the system involves the development of expensive components. The cost, in time and money, of building the wrong software component is relatively low. The cost of building an airframe or rocket motor is far higher, and so it is worthwhile to spend more effort ensuring the airframe or motor design is right before beginning to build and test it.

The NASA and DVT approaches show how the need to interact with customers, funders, or suppliers can change the phases. The NASA approach is influenced by the US Government fiscal appropriation and acquisition mechanisms, which require programs to have multiple points where the government can assess progress and choose to continue or cancel a program. The DVT approach is influenced by the way a team needs to work with board and chip vendors to prototype a board, and get it ready for mass production.


Chapter 21: Model life cycle patterns

22 February 2024

21.1 Introduction

Projects generally proceed in a series of phases. Each phase has a different emphasis on what kind of work is done; the focus shifts as the project moves from one phase to another. Different life cycle patterns specify different sequences of phases, as we will see in later chapters on different patterns.

All project life cycles share a common general pattern, as shown below. The patterns differ in the details of how system development proceeds.

undisplayed image

Some projects are only concerned with building a system; once the system has been implemented and tested, it goes into production or operation and is no longer the concern of the development team. Those projects only go through the first four phases. Most projects, on the other hand, have some level of involvement after the system is deployed and in operation, such as fixing bugs or enhancing the system. These projects involve all the phases.

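The phases are project preparation, concept development, system development, operational acceptance, system production, system operation, system evolution, and system disposal.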

This is a minimal set of phases. Many projects will break up some of these phases into smaller ones.

21.2 The example phases

The phases defined earlier in this chapter provide a simple model that can be used to compare and contrast other phase structures, or that can serve as a basis for defining a custom life cycle pattern.

The phases are project preparation, concept development, system development, operational acceptance, system production, system operation, system evolution, and system disposal.

Project preparation. Starts when the idea for a project first comes up, which may be the same time that someone has an idea that starts concept development. There are several decisions taken and artifacts developed in this phase:

The project preparation phase typically overlaps concept development and the early parts of the system development phase. It is complete once each member of the team knows the rules for how to do their work.

Concept development. Starts with an idea for some customer need to address, and develops the customer objectives that the system should address. There are a few artifacts developed in this phase:

System development. Starts with the customer objectives, and designs, implements, and verifies the system. This phase develops most of the system artifacts, including:

Operational acceptance. The process for a customer reviewing the implemented system and the evidence collected during verification to ensure that the built system meets their needs, as well as regulatory requirements. The phase results in customer review outputs and the customer’s approval.

System production. This is the phase where the artifacts built up during system development are turned into one or more working, deployed systems, ready for operational use. If the system is to be mass produced, this is where many systems are built and made available for operation.

System operation. In this phase, the system is placed into operation. The team supports operation by helping to identify, analyze, and fix problems. The artifacts involved include:

System evolution. This phase occurs in parallel with system operation, and involves changing or adding to the system to make it better. The evolution phase is often planned in advance of the first version of a system going into operation—for example, when an organization releases a minimum viable product (MVP) initially and plans on quarterly improvements to the system after that. The artifacts include:

System disposal. There are two parts to this phase: disposing of the operational system and disposing of all the system artifacts. The development team is sometimes responsible for defining how parts of the system should be taken out of operation and retired, such as taking hardware systems out of service and preparing them for recycling, or destroying any data that must not be preserved. The development team is also responsible for archiving all the system artifacts that were developed so that they can be re-used or audited in future.

XXX pulled from introduction

21.3 Life cycle phases

All projects begin with some kind of preparation. The team—or at least its few initial members—work out what rules, life cycle pattern, tools, processes, and standards should be used to design and build the system. This work often runs concurrently with the first technical phases.

An initial concept development phase begins the technical work. During concept development, the team works with the customer or mission to determine what the purpose and objectives of the system are. One must work out the objectives first so that the team spends development effort on building something actually relevant to what will be needed in the end.

The majority of work falls in the system development period. This typically includes activities or phases for:

Development ends with accepting the system for operation. The system is not deployed and put into general operation until it is shown to be ready, and the customer or mission has given their approval showing that they have accepted the system from the development team.

Some projects end when the system has been handed over to the customer for operation. Other projects continue to support the system while it is in use: fixing defects and evolving the system as needs change. The project may continue through the end of system operation, when system components are disposed of and information about the system and its operation is archived.

Sidebar: Canceling a project

Projects get canceled all the time. Anecdotally, it seems that more projects are canceled than go to completion—this is a consequence of using competitive approaches to programs, and the net effects of competition are generally regarded as valuable.

Consider two examples, based on projects we have worked on.

In the first project, the team was writing a proposal for a US DoD spacecraft system. In the proposal-writing phase, the team has to establish the basic architectural and management approaches for the project, show they meet the department’s needs, and establish the price at which the team proposes to build the system. The team progressed through establishing the initial concept and architecture for the system, and we began evaluating the solution to see how good a job it would do for the customer and how much it would cost to build it.

We had a checkpoint milestone where we reviewed what we had found. At that review, it became clear that while our team had a decent solution for the needs, we did not have a great solution, and that other companies we expected to propose designs would likely have better solutions (because they had more experience in a couple of key technical areas). We made the decision not to pursue the proposal.

This was a good decision. Assembling a proposal is not a small task; we had a team of about 15 people working long hours. For US government projects, the proposer generally pays for the proposal development. Choosing to spend our team’s time and money on this project meant that the team couldn’t work on some other project. We judged that the opportunity cost was not matched by the probability of successfully winning the contract, so we freed up the team to work on a different system that did prove successful. If we had continued to work on the original proposal, we would have spent the budget available to develop proposals and could not have spent it on the proposal that succeeded.

In the second example, a different US DoD spacecraft program, the team was about two years into a multi-year contract. The team had performed excellently in a competitive first prototyping phase, and was the only team to be selected to move on to a second phase for building an initial working version. A key subcontractor on the team had staffing and management problems, and was not delivering results. Within the team we were struggling to fix the execution problems or find another way to build the necessary components, all the time keeping a large staff on payroll and running through budget. While the technological solutions for many system capabilities were probably sound, the team could not deliver. The customer observed the problem and, after working with the team to try to resolve it, went through the process to cancel the project.

This was also a good decision. In hindsight, the team lacked necessary capability in the subcontractor and in the project management team. If the project had been allowed to continue, it is unlikely that the team would have solved the problem, and more money would have been spent without benefit in the end.

The takeaway from these examples is that there are many sound reasons for canceling a project. Sometimes the cancellation is designed in (as with competitive acquisition); other times it is because continuing to invest money, time, and the care of the team building the system has become unlikely to pay off.

For a more general discussion of US DoD project failures, see the report by Bogan et al. [Bogan17].

Part VI: Specifications

Chapter 22: System concept

23 August 2022

XXX rewrite to reduce use of conops term

XXX harmonize stakeholders with list from earlier chapters

XXX harmonize concept contents with earlier chapters

XXX pull out document management section

22.1 Purpose of concept development

At the very beginning of a system development project, there is generally only a rough idea of what the system should be. The understanding of the system objectives is too vague to launch into development or writing tests right at the start.

The purpose of developing the initial system concept is to get a reasonably clear initial definition of what the system should be or do. The definition should accurately reflect what the customer needs (whether the customer is an actual customer or a representative of an expected customer). The definition need not be perfect; it will be revised as the project moves forward and the initial concept gets validated or the understanding of the customer’s needs improves.

Internal to the project, the initial concept provides the information that the team needs to begin working out the structure or architecture of the system and to begin writing the high-level system specification. In the absence of a well-structured concept, the development team cannot begin working out how to implement the system without taking risks that the design is wrong and will have to be redone.

Outside the project, concept development supports the relationship between the project and the customer. Clear agreement between the customer and the development team makes for efficient and less fraught development. Agreement on the concept can support writing a contract for development that protects both the development team and the customer from feature creep or extra costs.

The documentation of the concept will be used throughout the entire life of system development and operation. The concept is the record of the big picture for the system. While it gathers the information needed to guide system design, the big picture is also important to new people joining the team who need to learn about the system they will be working on. The concept also serves management as a definition of the goal that system development is trying to reach, allowing the team to check from time to time whether they are designing and building the right thing. In that way a good definition of the concept is essential to being able to validate the system design.

22.2 Concept development work

The concept development work in a project seeks to establish a clear statement of what the system should be and do. At the end of the work, there should be a record of the customer’s objectives for the system, a concept of operations that embodies those objectives and explains what the system is, and review and approval from both the customer and the project leadership.

undisplayed image

The core of this is to determine and record what the customer wants. This is rarely an easy task: there may or may not be a well-defined customer, and the customer may or may not be able to articulate what they need and what they want.

Start with working out who the customer is. Some projects are customer driven, meaning that there is a customer for the system and the project is working in response to that customer’s needs. A customer who contracts with a development team to build a system for them is the simplest example of this kind of project. Other projects are RFP driven, meaning that there is a customer but they are asking one or more teams to propose a system design before committing to funding development. These projects differ from customer driven projects in that the customer provides a request for proposal (RFP) that should contain all the information needed about what the customer wants. (In practice the RFP is rarely sufficient.) Still other projects are visionary, meaning that there is no specific customer driving what the system should do, but instead the system is being designed in the expectation that there will be customers for the system in the future. This kind of project includes innovative systems that are expected to help create a market for themselves.

Every project should document who their customer is and how they work. Where there is an actual customer, this should include information about how they make decisions about systems, who the decision-makers and influencers are, and any contacts that might be able to provide background information or advice about the customer. For visionary projects, common practice is to develop a profile of one or more hypothetical customers, including how they are expected to decide whether to acquire a system and where information can be found to characterize these potential customers.

A document recording what the customer wants is the first major work product in the concept development phase. This document serves as the primary record of what the customer has asked for. It is used in later work to check whether the development team’s interpretation of what they heard from the customer is accurate. It is vital that the customer objectives document be free of biases that the project team brings, because this document is used to detect when the team has applied its biases.

The customer objectives document, therefore, should be written using the customer’s own language, and should only be a summary of what the customer has said. One way to achieve this is to collect primary source material from the customer—such as notes or recordings from meetings, a request for proposals, or externally-sourced market analyses—and then write a summary of their contents. If possible, the summary document should include references to the primary sources so that someone can check the summary for accuracy. Writing the objectives document as prose, or as a bulleted list, is reasonable. The customer objectives document should not be organized into formal requirements; that step comes later and formal requirements should derive from the customer objectives document.

Where possible, the customer objectives document should be shared with the customer (or representative potential customers) to validate that the document is accurate. It is normal for there to be many iterations with the customer to get the details right; indeed, that is the point of gathering the customer objectives.

The next major document, or collection of documents, records other constraints on the system’s design. This includes things like

This document of other constraints should reference the source material for each kind of constraint.

With the customer definition, customer objectives, and constraints documented, the team can develop a concept of operations, or CONOPS. This document records a simple model of how the system can be structured and operate at a high level. The contents of the CONOPS should be limited to how the system will interact with things outside itself, whether that is the function and structure that the customer will see, the interactions with other systems, or its interactions with regulatory organizations. The CONOPS should be fairly brief, and diagrams are helpful. The CONOPS should not become a specification of the system’s design; again, the specification and design are derived from the CONOPS. The CONOPS document is often written in the language that the development team uses. It is common to use graphical notation standards for some elements where appropriate.

The contents of the CONOPS should derive from the customer objectives and other constraints. A good CONOPS document will include references to those sources to show why the concept is structured the way it is.
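One way to make these references checkable is to record them as data. The following is a minimal sketch in Python, not a prescribed tool: the identifiers (`OBJ-1`, `CON-1`, `OPS-1`), the `ConopsElement` record, and the sample entries are all hypothetical. It reports CONOPS elements that cite no source, references to sources that do not exist, and objectives or constraints that no CONOPS element addresses.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class ConopsElement:
    """One function or interaction described in the CONOPS (hypothetical record)."""
    ident: str
    summary: str
    sources: Set[str] = field(default_factory=set)  # IDs of objectives/constraints


def check_traceability(elements, source_ids):
    """Return (unjustified element IDs, unknown references, uncovered source IDs)."""
    source_ids = set(source_ids)
    cited = {s for e in elements for s in e.sources}
    unjustified = [e.ident for e in elements if not e.sources]
    unknown = sorted(cited - source_ids)
    uncovered = sorted(source_ids - cited)
    return unjustified, unknown, uncovered


# Illustrative data only.
sources = ["OBJ-1", "OBJ-2", "CON-1"]
conops = [
    ConopsElement("OPS-1", "Operator schedules a data download", {"OBJ-1"}),
    ConopsElement("OPS-2", "System reports status to the regulator", {"CON-1", "CON-9"}),
    ConopsElement("OPS-3", "System sends marketing email"),
]

unjustified, unknown, uncovered = check_traceability(conops, sources)
print(unjustified)  # ['OPS-3']
print(unknown)      # ['CON-9']
print(uncovered)    # ['OBJ-2']
```

An element with no sources is a prompt to ask why it is in the concept at all; an uncovered objective is a prompt to ask whether the concept is complete.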

At the end of the concept development phase, the team will have gathered information about what the system needs to be, both to satisfy the customer and to satisfy other stakeholders, and created a conceptual version of the system’s functions in the CONOPS.

The CONOPS should be shared with the customer for their review and approval. Because the CONOPS document is written for the development team using the team’s language, it is often necessary to interpret the contents of the CONOPS to the customer. The customer should validate that the functions included in the CONOPS cover everything they are expecting. The development team’s management should also review and approve the concept material to validate that it meets organization policies and regulatory needs.

Sidebar: Stages of working out the system concept

Some people have asked: what is the difference between the customer objectives, concept of operations, requirements, and design? Why have all these different documents? Why not begin by recording requirements?

We view the process of working out the system’s concept as a multistage flow, starting with a vague idea and ending in a concrete specification, including requirements, for what the system should do. The flow is:

undisplayed image

The customer objectives and concept of operations documents deal only with the concept of the system—the abstract general functions and purpose. Most importantly, they should be understandable by the customers, who look at them to see if their ideas and needs are being reflected in the system that will be designed.

Take requirements, for example. The content of every top-level requirement that goes into the system specification should be present in the customer objectives and concept of operations, but the information should be recorded informally, accompanied by an explanation of the context and purpose of each function. These will be translated into requirements in the system specification, which have a strict format and are typically recorded in a database that is not easy to follow for customers who are not also trained in systems engineering.

22.3 When the concept is complete

The goal for the initial concept development phase is to have a clear but informal understanding of what the customer wants, what constraints other stakeholders place on the system, and agreement from all the parties that the documented understanding is correct, so that the team can move on to formalizing the design of the system.

Put another way, the system design efforts depend on knowing what the system is supposed to do. The amount of effort put into system design should be low until the team has confidence that they understand what the system should do.


22.4 Evolving the concept

The concept for the system will change over time. This can happen in the early stages of a project, when one is still working with the customer for the first time to understand their objectives. It also happens later: when the customer’s needs change, when the team realizes that they have misinterpreted the customer’s objectives, or when regulation changes.

In general terms, while working to develop the system concept for the first time, any changes can be made as needed. At some point, the initial versions of the concept documents will be “done”, reviewed, and approved. The documents are then baselined: marked as stable, meaning that people can use the information in the baselined concept documents to develop system architecture and specifications without worrying that the documents will be shifting on them all the time.

After the concept has been baselined, the team must follow a more careful process to make changes. First one identifies what has changed: a change in the customer objectives, or in constraints such as regulations. Next one analyzes the change to determine where the concept documents need to be revised. The revised document versions exist in parallel with the baseline version, but only the baseline version is official until the revised documents are reviewed, approved, and marked as the new baseline.

undisplayed image

When the baseline version is updated, changes may need to propagate to other documents. For example, the system architecture and specifications derive from the concept documents. A change that adds a function to the system concept in the CONOPS document will induce changes to the architecture (possibly adding new components) and specifications (adding functional requirements to some components’ specifications). If the change happens late in the project, when parts of the system have been implemented and verified, the change may propagate all the way to updated software, hardware designs, and test cases.
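Working out where a change may propagate amounts to following the derivation relationships between artifacts. The sketch below, in Python, assumes a small hypothetical set of artifacts and derivation links (the `DERIVATIVES` table); a real project would draw these from its own configuration management or requirements tooling. It does a breadth-first traversal from the changed artifact and lists everything downstream that may need review.

```python
from collections import deque

# Hypothetical derivation links: each key maps to the artifacts derived from it.
DERIVATIVES = {
    "customer objectives": ["CONOPS"],
    "constraints": ["CONOPS"],
    "CONOPS": ["architecture", "system specification"],
    "architecture": ["component specifications"],
    "system specification": ["component specifications", "test cases"],
    "component specifications": ["software", "hardware designs", "test cases"],
}


def affected_by(changed, derivatives):
    """Return the artifacts downstream of `changed`, in breadth-first order."""
    seen = {changed}
    order = []
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for child in derivatives.get(current, []):
            if child not in seen:  # visit each artifact once
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


print(affected_by("CONOPS", DERIVATIVES))
# ['architecture', 'system specification', 'component specifications',
#  'test cases', 'software', 'hardware designs']
```

The result is a list of candidates for review, not a list of required changes: someone still has to judge whether each downstream artifact is actually affected.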

This explanation of the process simplifies the steps somewhat: individual documents are versioned or baselined. A change to the customer objectives results in an updated version of the objectives document. This leads to an updated version of the CONOPS, which in turn can lead to a need to review and approve the updated CONOPS, as shown in the diagram above. The new versions of the documents should not be baselined until they have collectively been reviewed and approved. They should be baselined together, so that the official, stable versions of all the documents remain consistent with each other. This means that updated versions should remain work in progress until the review step has been completed.

Sidebar: Life cycle of a document

Each of the documents we have discussed will go through a sequence of steps:

  1. The development of the initial version. In this step, the document will be changing regularly as its contents get worked out.
  2. Initial baseline. This is the step where the document is treated as “finished”. This step often involves reviews and approvals. People can treat the baselined version of the document as stable; developing derived artifacts from the document is low risk because it isn’t in flux.
  3. Identification of a potential change, and development of a revised document that addresses the change. During this step, two or more versions of the document will be available to the team: the most recent baselined version, and work-in-progress versions that are in flux.
  4. Baselining a revised version. The revised version gets a review and approval, and is marked as the newest stable baseline version.

At any time, there is at most one most recent baseline of a document.
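These steps can be sketched as a small state model. The following Python is illustrative only: the `Document` class and its method names are hypothetical, and it simplifies by allowing a single work-in-progress version at a time, whereas a real project may have several in flux.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Document:
    """Tracks baselined versions plus one work-in-progress revision (a simplification)."""
    name: str
    baselines: List[str] = field(default_factory=list)  # oldest first
    draft: Optional[str] = None  # work-in-progress content, if any

    def edit(self, content: str) -> None:
        """Steps 1 and 3: create or update the work-in-progress version."""
        self.draft = content

    def baseline(self, approved: bool) -> None:
        """Steps 2 and 4: promote the draft once it has been reviewed and approved."""
        if self.draft is None:
            raise ValueError("nothing to baseline")
        if not approved:
            raise ValueError("draft has not been reviewed and approved")
        self.baselines.append(self.draft)
        self.draft = None

    @property
    def current(self) -> Optional[str]:
        """The most recent baseline, which is the only official version."""
        return self.baselines[-1] if self.baselines else None


conops = Document("CONOPS")
conops.edit("v1: initial concept")
conops.baseline(approved=True)
conops.edit("v2: adds remote operation")
print(conops.current)  # v1: initial concept (the revision is not yet official)
conops.baseline(approved=True)
print(conops.current)  # v2: adds remote operation
```

Keeping the draft separate from the list of baselines is what lets people continue to rely on the official version while a revision is in progress.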

Every project needs to have two capabilities to support these changes: configuration or document management, and a way to work out the effects of a change. We will discuss these in an upcoming section.

22.5 Concept development for different kinds of projects

Concept development is all about knowing what the customer wants. But who is the customer? How does one learn what they want? There are multiple answers to these questions.

We can divide projects into three general groups: customer-driven projects, RFP-driven projects, and visionary projects.

22.5.1 For customer-driven projects

Customer-driven projects are those that have a specific customer whose needs or desires drive what the system should do. The development team is focused on satisfying this specific organization.

In these projects, the development team can communicate with the customer to learn how the customer works and what their needs are. Ideally, the customer will be involved all through system development, so that the team can check what they are building directly with the customer.

The Agile Development approach to software development grew out of customer-driven projects. The Agile approach advocates for the customer being continuously involved, including helping to prioritize work in each development sprint. This kind of direct involvement is only possible when the development team can interact constantly with the customer. Note that we do not advocate dogmatic Agile Development (nor do we think many projects actually use it); we will discuss this more in chapters on validation and management.

Some customers will begin working with the team with only a general concept in mind. The team will need to draw out from the customer what that concept means, and explore all the corner cases with them.

Other customers will bring a partially-developed concept of what they want. In these cases, the team must, first, ensure that they properly understand what the customer is saying, and second, explore the concept with the customer to find any missing information.

Working with a customer on the concept is full of pitfalls. The most serious is that the team will interpret what the customer is saying in a way that the customer does not actually mean. The development team can bring their own interpretations to understanding what the customer says; the result can be a system design that doesn’t meet the customer’s actual needs. The team should use communication techniques that allow them to validate their understanding of what the customer is trying to say, such as active listening methods. The key ideas in these techniques are

There are many references available on these techniques.

The concept needs to include everything that the customer actually wants. Most commonly, the customer will be thinking about the most important functions they need but will not be considering all of the other functions that are needed to make the important functions work.

The team needs to work with the customer to elicit these other functions or use cases. While there is no recipe for finding all these other use cases, we have found that there are some questions that help ferret them out.

22.5.2 For RFP-driven projects

An RFP-driven project is one where a customer is asking for proposals from development teams about how they will design and build a system. The customer is usually asking for multiple, competing teams; the customer will choose one or more teams for a contract to build the system.

The customer writes a request for proposals (RFP) document that defines both the characteristics of the desired system and how the customer expects to judge between multiple competing proposals, if there are any. The RFP should thus document the customer’s objectives. In many competitive acquisition cases, the RFP must be the only official source that a team can have so that all proposing teams work from the same information—thus treating all teams equally.

When deciding to respond to an RFP, the team must learn what acquisition rules the (potential) customer is using in order to determine what restrictions to follow when communicating with the customer. The team must also learn how the customer makes decisions, including who makes the decisions, who influences the decisions, and how the decision will be made. When responding to a commercial RFP, this can be easy: there is a contact who sends out the RFP and who can answer questions as needed, there is someone they work for who reviews and decides whether to accept a proposal or not, and the decision is based on what the decision-maker thinks meets their needs at the best price. For a US Government agency RFP, on the other hand, the decision process is defined by the Federal Acquisition Regulation (FAR) and by the agency’s supplemental rules. There are formal processes for submitting questions; there is typically a defined scoring and weighting system that a formal review team must use to rate each proposal.
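The exact scoring scheme is set by each solicitation, so the team has to read it from the RFP itself. As an illustration only, a weighted scoring scheme might combine per-criterion ratings as in the Python sketch below. The criteria, weights, rating scale, and team names are made up for the example and are not drawn from any regulation or actual RFP.

```python
# Hypothetical evaluation criteria and weights (chosen to sum to 1.0).
WEIGHTS = {"technical approach": 0.5, "past performance": 0.2, "price": 0.3}


def weighted_score(ratings, weights=WEIGHTS):
    """Combine per-criterion ratings (0 to 10 here) into one weighted score."""
    if set(ratings) != set(weights):
        raise ValueError("ratings must cover exactly the weighted criteria")
    return sum(weights[c] * ratings[c] for c in weights)


# Illustrative ratings that a review team might assign.
proposals = {
    "Team A": {"technical approach": 8, "past performance": 6, "price": 5},
    "Team B": {"technical approach": 6, "past performance": 9, "price": 8},
}

ranked = sorted(proposals, key=lambda t: weighted_score(proposals[t]), reverse=True)
print(ranked)  # ['Team B', 'Team A']
```

Here Team A has the stronger technical approach, but Team B scores higher overall (about 7.2 against 6.7) because of past performance and price. Knowing the weights in advance tells a proposing team where extra effort is most likely to change the outcome.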

The information gathered about how the customer communicates and makes decisions should be included in the Customer Definition document.

When the customer is doing a competitive acquisition, the team also needs to gather information on the other teams that may be choosing to submit a proposal. This information helps shape the proposed design and the proposal itself to make them look better than the competition to the customer. This can include relative strengths and weaknesses of the other teams, such as whether this team has proprietary technology that will do a better job for the customer (a weakness of the other team), or whether the other team has more flexibility in pricing (which might be a strength of the other team). This information should be gathered into a Competition document.

In practice RFPs are rarely complete or unambiguous. This is because they are written only by the customer, and there is little opportunity for dialog so that the customer can get alternative perspectives and check that their work is clear and complete. When it is possible, the team should engage in the kind of dialog with the customer that they would in a customer-driven project in order to confirm their understanding of what the RFP says and to flesh out the request to include a more complete picture of what the customer actually needs. When this is not possible, the team should find people who can accurately represent the customer’s way of thinking and needs, such as people who have a similar position in a different organization in the same industry, or someone who has worked closely with the customer in the past and knows the business or people involved.

Whether one can get clarifying information or not, the concept documents should include documentation on where the team has made assumptions or interpretations of the RFP source material. These points are matters where there is greater than usual risk that the team’s assumption does not match what the customer is thinking. This means that there is a higher than usual risk that the concept or design that the team proposes will be interpreted differently than what the team means—and so it is worth putting extra effort into making those parts of the proposed concept or design as clear as possible.

There are two end results of the process for responding to an RFP: first, a decision whether to complete and submit a proposal, and second, submitting a proposal if the first decision is positive.

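The decision about whether to submit a proposal or not depends on three things: whether the team has the resources to develop the proposal and build the system, whether the team has a reasonable chance of winning, and whether building the system would be worthwhile for the team’s organization.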

Determining whether the team has resources requires estimating the resources needed. For the first steps of concept development, this may be small, perhaps one or a handful of people to gather information and to get an initial understanding of what the customer wants. As the work progresses, more resources will be needed—to gather more information, to do concept development, to gather competitive market data. At each step of the process, it will become clearer how many people or other resources are needed for the next step of developing the proposal. At the same time, the team must be able to estimate how much resource will be needed to build the system if they win a contract. This will be unknown to start, but as the system concept and architecture work move forward the estimates will improve. The team must develop the architecture enough to be able to determine prices to charge the customer and to be able to determine if the team will have the capacity to do the work. These analyses grow out of the concept of operations and later architecture documents.

Determining whether the team has a reasonable chance of winning is a combination of knowing how the customer will judge proposals, how strong other teams are likely to be, and how well this team can satisfy the customer. This information is gathered in the Customer Definition document, the Competition document, and in how the Concept of Operations and architecture respond to the customer’s objectives.

Finally, determining whether building the system can be worthwhile depends on knowing what the team’s organization values. Does the organization require a particular profit margin? Is there a minimum or maximum contract price that is considered “interesting”? Does the system fit within the organization’s business strategy? These kinds of questions are captured in the Business objectives document, and analyses use customer objectives, CONOPS, and architecture documents to develop an answer to them.

Developing proposals is a complex specialty, and much has been written about it. We refer the reader to that literature for further reading.

22.5.3 For visionary projects

A visionary project, as we are using the term, is one where the system being designed and built is not for a specific, existing customer. Instead, the system might be marketed to several potential customers down the line, or the system might be part of a strategy to change an existing market or create a new one, thus creating new customers who may not even exist yet.

Consider building a new commercial passenger transport aircraft. The air transportation system is mature, and so one can name who buys these aircraft: airlines, aircraft leasing companies that provide the aircraft to airlines, businesses using aircraft for private transportation, and government organizations that fly passenger aircraft. No aircraft company in recent decades has built a new large passenger aircraft to be sold only to a single customer; instead, the companies work out the needs of many potential customers and design an aircraft that will be good for many of those customers. Since airlines come and go often relative to the lifetime of an aircraft design, many of the potential airline customers do not yet exist when the aircraft company has to decide on the capabilities of the new aircraft. This is a case where the market exists but there is not a single customer to satisfy.

In contrast, consider the first generation of global satellite data and telephony networks (such as Iridium and Globalstar). When they were being designed, there was no mass market of ground-to-space mobile communications. These companies, and others that did not end up deploying their networks, had to work out who their potential customers might be and what they might need. Indeed, all of these first generation providers went bankrupt at some point as they developed their network systems and, at the same time, built up a subscriber base. This is an example of a project that was creating a new market.

In both these cases, there is not a single definition of a customer. Instead, the team must determine the market—the set of customers—who might want the system. The team looks for the set of features or capabilities that will satisfy a large enough market to be worth supporting. The plan will often be to start with a small market segment and grow over time by adding capabilities to satisfy more people, while having learned more about the first set of customers and gaining some revenue to help fund growth.

All this information will need to be collected from a number of sources, including market analysts, surveys of potential customers, and the experience of people who have worked in related industries. Finding people or organizations who can act as a proxy for a class of possible customers is helpful. It is important to gather from multiple sources in order to cross-check the information and to account for sampling bias that can happen if information comes from only one perspective.

The information about the target market segment(s) will change regularly over the course of the project as customers come and go, or as new opportunities appear. This means that the design and implementation of the system will likely need to adjust as time goes by. This also means that the team needs to continue to survey the market and talk to potential customers.

At the same time, it is a rare project that can successfully chase arbitrarily changing customer objectives. The design and implementation team needs enough stability that they can complete a version of the system. Marketing and sales teams need stability so that they know what they can actually sell to a customer. The stable version of the customer objectives should be baselined (see section on configuration management below). Changes to the baseline should occur only periodically, in one of two situations: when there is a change in the understanding about customers that is vital to reflect in the design of the system right away, even at the cost of delaying the system being ready for use; or when there is a change that does not delay or significantly change the system being designed and built right now.

The idea of a minimum viable product (MVP) has been fashionable in recent years. The general approach is to create the simplest system that will meet the needs of just a few customers, put the team’s focus on building up that first version, then plan on adding capabilities as time goes by to make the product attractive to more customers. This is an example of planning how to handle changes in understanding what customers want.

Visionary projects can expect that there will be competition with other teams’ products. Indeed, customer choice is a fundamental precept of the Western market system, and often required by regulation. A team should develop a record of what their competition might be, whether that is another organization offering a similar product (as happens with large passenger aircraft), or whether a customer could meet their needs a different way, or whether customers will choose not to buy a new product and live without its benefits (which is common with new technology trying to create a new market). The team should also build up an analysis of what sets this team’s system apart from alternatives: why a customer would choose this system over other options. Maintaining the Competition document with this information will help the team make decisions about changes to the customer objectives or business objectives around which the team is designing the system.

22.6 Artifacts

The concept development phase involves artifacts such as the customer objectives and concept of operations. We can now define each of these artifacts, but first we will address document management as a necessary supporting capability.

22.6.1 Document/configuration management

Each of the artifacts worked on and produced in the concept development effort should be placed under document management. The document management system should provide:

A project should establish a document management system early in the initial concept phase. The concept will be represented in the artifacts listed below, and when these artifacts are reviewed and approved, a baselined version of each should be available in the repository.

Organization. Ideally, a project will designate one tool for storing all electronic information, and organize the documents stored in that tool so that it is convenient to find each kind of document. In practice most projects use different tools for different kinds of artifacts—a source code management system for software, a document system for ordinary documents, design repositories for hardware designs.

A document does no good if the people who need to use it cannot find it. A project must provide a single starting point for finding documents, whether those are stored in a single tool or spread over multiple tools. The contents of each repository must be well organized; we have too often seen projects build up a long, long list of documents, each with a document number and unhelpful title, requiring users to scroll through the list or guess at search terms. Creating an index that organizes the artifacts by the relevant phase and component helps people significantly.
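
As an illustration of such an index, the following minimal Python sketch groups artifact records by phase and then by component. The record fields (`doc_id`, `title`, `phase`, `component`, `location`) and the example entries are assumptions made for this sketch, not a prescribed schema; a real project would draw them from whatever tools it uses.

```python
from collections import defaultdict

# Hypothetical artifact records; every field name and value is illustrative.
EXAMPLE_ARTIFACTS = [
    {"doc_id": "D-014", "title": "Customer objectives", "phase": "concept",
     "component": "system", "location": "document system"},
    {"doc_id": "D-002", "title": "Concept of operations", "phase": "concept",
     "component": "system", "location": "document system"},
    {"doc_id": "D-031", "title": "Radio interface notes", "phase": "design",
     "component": "communications", "location": "design repository"},
]


def build_index(artifacts):
    """Group artifact records by phase, then by component."""
    index = defaultdict(lambda: defaultdict(list))
    for artifact in artifacts:
        index[artifact["phase"]][artifact["component"]].append(artifact)
    return index


def format_index(index):
    """Render the index as indented text: phase, component, then documents."""
    lines = []
    for phase in sorted(index):
        lines.append(phase)
        for component in sorted(index[phase]):
            lines.append(f"  {component}")
            docs = sorted(index[phase][component], key=lambda a: a["title"])
            for doc in docs:
                lines.append(
                    f"    {doc['doc_id']}: {doc['title']} ({doc['location']})"
                )
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_index(build_index(EXAMPLE_ARTIFACTS)))
```

The point of the sketch is only that the index is organized by how people look for documents, with the document number as a detail and the storage location recorded so the index can span several tools.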

Repository organization takes effort. We recommend making at least one person explicitly responsible for maintaining the organization in the repository, maintaining indexes, and (if necessary) updating the organization to address how people actually use it.

This means that the repository should:

Versions. The tools for storing artifacts must be able to maintain both a baselined version and multiple working versions.

The baselined version must be clearly identified as the baseline, so that people know what the official document is. A baselined version must also be immutable: it has been approved as a stable version. The baseline version should be replaced when a new version is approved as the baseline.

Working versions, on the other hand, can be updated often. People store working versions in a repository for multiple reasons: to preserve a copy of the work in case their local copy is lost or damaged, to share work in progress with others, and to provide a version as a proposed new baseline. People may be working on different changes concurrently—one person addressing one change, while another person works to address some other issue.

This means that the repository should support:

Approvals and workflow. The team relies on the integrity of baselined artifact versions. Any updates to the baselined version should, therefore, be carefully controlled. The typical workflow is that someone develops a working version of the artifact, then proposes it for a new baseline. The proposed version then gets reviews, and is either approved to become a new baseline or is given issues that need to be addressed before it can be approved. Once approved, the proposed version is promoted to become a new baseline.
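
The workflow just described can be sketched as a small state model. This is a minimal Python illustration under assumptions of our own: the status names, the `Artifact` and `Version` classes, and the rule that a version with review issues returns to working status are all hypothetical, and a real repository tool will have its own mechanisms.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    WORKING = "working"
    PROPOSED = "proposed"
    BASELINE = "baseline"
    SUPERSEDED = "superseded"


@dataclass
class Version:
    number: int
    content: str
    status: Status = Status.WORKING
    open_issues: list = field(default_factory=list)


class Artifact:
    """One artifact with its working, proposed, and baselined versions."""

    def __init__(self, name):
        self.name = name
        self.versions = []

    @property
    def baseline(self):
        """Return the current baselined version, or None if there is none."""
        for version in self.versions:
            if version.status is Status.BASELINE:
                return version
        return None

    def new_working_version(self, content):
        version = Version(len(self.versions) + 1, content)
        self.versions.append(version)
        return version

    def edit(self, version, content):
        # Baselined and proposed versions are immutable in this sketch.
        if version.status is not Status.WORKING:
            raise ValueError("only working versions can be edited")
        version.content = content

    def propose(self, version):
        if version.status is not Status.WORKING:
            raise ValueError("only working versions can be proposed")
        version.status = Status.PROPOSED

    def review(self, version, issues):
        """Approve the proposal if there are no issues; otherwise return it."""
        if version.status is not Status.PROPOSED:
            raise ValueError("only proposed versions can be reviewed")
        if issues:
            version.open_issues = list(issues)
            version.status = Status.WORKING  # back to the author
            return False
        previous = self.baseline
        if previous is not None:
            previous.status = Status.SUPERSEDED
        version.open_issues = []
        version.status = Status.BASELINE
        return True
```

A version that fails review goes back to its author with the issues attached, and the existing baseline is untouched until a proposal is actually approved.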

Every project needs to have a clear, written procedure for this workflow. It should be clear to every team member how they go about proposing a working version to be baselined, how the review and approval steps are performed, who is responsible for approval, and the steps required to turn a proposed version into a new baseline.

Some artifact repositories provide support for these workflows. Software repositories, for example, provide functions to create branches (working versions), and to control the process where a branch is merged into the master branch (baselined version). Other tools provide a general workflow functionality that one can use to implement and enforce these steps.

We have seen some projects that do not use automated workflows, instead having a well-documented manual procedure for each of the steps. While this can be error-prone and while it does mean that one or more people must be responsible for managing the repository contents, this approach works well as long as the team is not too large and no more than a few dozen artifacts are being managed. This is especially useful when a project is starting up and has not yet determined what tools they will be using.

Finally, we noted earlier that sometimes it is important to update the baselines of several artifacts at once so that they stay consistent with each other. For example, consider when a customer requests a new function be added to the system. The new function must be added to the customer objectives document. The customer objectives and the concept of operations will then be inconsistent: the objectives will include the function, but the CONOPS will not. Someone will then need to update the CONOPS to add the function, followed by reviews and approval. It can be best to baseline the updated customer objectives and CONOPS documents at the same time, once they have both been updated and the updated CONOPS has been approved.
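
One way to picture a coordinated update is as an all-or-nothing promotion: either every artifact in the set gets its new baseline, or none does. The Python sketch below is a minimal illustration; the function name `promote_together`, the artifact names, and the version numbers are hypothetical.

```python
def promote_together(baselines, proposed, approved):
    """Return new baselines with all proposed versions promoted at once.

    baselines: dict mapping artifact name to its current baseline version
    proposed:  dict mapping artifact name to the version proposed for it
    approved:  set of artifact names whose proposed version passed review

    Raises ValueError, leaving `baselines` unchanged, if any proposed
    version has not yet been approved.
    """
    missing = sorted(name for name in proposed if name not in approved)
    if missing:
        raise ValueError("not yet approved: " + ", ".join(missing))
    updated = dict(baselines)  # copy, so the caller's record is not modified
    updated.update(proposed)
    return updated


if __name__ == "__main__":
    current = {"customer-objectives": 3, "conops": 5}
    change = {"customer-objectives": 4, "conops": 6}
    try:
        promote_together(current, change, approved={"customer-objectives"})
    except ValueError as error:
        print(error)  # the CONOPS update is still in review
    print(promote_together(current, change,
                           approved={"customer-objectives", "conops"}))
```

In the example, the updated customer objectives cannot become the baseline on their own; both documents are promoted only once the CONOPS update has also been approved.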

The repository, thus, should:

It is desirable, but not required, that the repository:

Other considerations. The previous sections have outlined the functions that a repository should provide. Effective artifact management requires some other capabilities.

These capabilities include:

In addition, the repository will work in conjunction with issue tracking or change order management tools. Those will be discussed in a later chapter.

22.6.2 Customer definition

Purpose. The customer definition captures information about who the customer is. It is used partly to help inform the process of developing the initial concept for the system, but it is also the place for recording things like who the points of contact for the customer are.

Form. The customer definition document is generally a prose document; it does not need structuring the way objectives or requirements do. Some organizations may have customer relationship management tools that will capture some of the content defined below.

Input. The customer definition document contents come from a number of informal sources. These include:

Dependents. The customer definition document affects the other artifacts developed during the concept development phase because knowing who the customer is is the first step in defining what the customer wants.

Content. The customer definition includes:

If this is a visionary project, the customer definition does not describe a single, specific customer, but instead describes the general class of customers who are expected to want the system. The definition may include information about one or more customers who are representative of the class. It will also need to include more general information—for example, the range of ways that customers in the class decide about acquiring a system, or the range of company sizes and budgets.

Completion. The customer definition document usually gets regular updates through the life of a project, such as when the points of contact at the customer change or when a new market analysis shows that the potential customers for a visionary project have changed. The customer definition should be baselined initially when most of the content has been recorded.

22.6.3 Customer objectives

Purpose. The customer objectives document is a record of what the customer wants out of a system. This document is a summary of what they have said they want and what has been drawn out from them in discussion or research. The objectives document is the source of all the rest of the artifacts developed for the system.

The objectives should record

The customer objectives document should be as close to the customer’s words and organization as possible. The document is a summary of the customer’s wants and needs. It is used as a proxy for the customer throughout system development, rather than having every developer talk directly to the customer to check their specification or design.

Other information must be kept out of the customer objectives document. The other information is captured separately in things like the regulatory and business objectives documents, which we discuss next. We have seen organizations include their business objectives, such as profitability, in the list of customer objectives. We have also seen teams include internal technical objectives, like being able to reuse parts of existing designs. Doing so creates confusion: is an objective in the document actually something the customer wants, or is this something the customer doesn’t care about? There will come times in the development process when hard decisions must be made about some part of the system design; at those moments, the team must be clear about what is actually a customer need and what is an internal need. If a customer need can’t be met reasonably, then the team needs to talk to the customer to resolve the issue. If an internal business or technical objective is proving hard to meet, the decision should be handled internally and the customer should not be involved; they don’t care or know about the issue.

Form. The customer objectives document is a predecessor to formal specification of the system, so it does not need to be a formally-structured document. A prose document with plenty of diagrams works well.

Input. Where the objectives come from depends on the kind of project and the kind of customer. If this is a customer-driven project, the information will come from discussion with the customer. For an RFP-driven project, the information will come from the request for proposals, possibly supplemented by information gathered in discussion or from market research. For a visionary project, the information must come from market research.

Dependents. The customer objectives document is the source, direct or indirect, for every technical artifact developed for the system.

The CONOPS derives directly from the customer objectives. The CONOPS is the first level of turning the informal statement of customer objectives into something more formal. The top-level specification of the system—which is developed after the initial concept phase—derives in turn from the CONOPS, with references back to the customer objectives.

The initial concept development phase ends with review and approval of the customer objectives document and the CONOPS.

The customer objectives provide input to contracting materials. If there is a contract between the development organization and the customer, there is usually a statement of work defining what the development organization should deliver. The statement of work will need to match the material in the customer objectives document. For RFP-driven projects, the development organization’s proposal to the customer must match the customer objectives document; the proposal is one of the sources that leads to a contract and its statement of work.

Content. The customer objectives document should include everything that the customer has said they want the system to be. This should include things like:

The document should organize this information in understandable ways. The information from the customer will likely come in small increments, in arbitrary order—especially if it is obtained in discussions or from market research.

The objectives document must not include material that is not directly about customer needs or wants.

Completion. The customer objectives document is ready to be baselined when it includes everything that has been obtained from the customer or other input sources, and when the customer agrees that the objectives document is complete and correct.

Determining when everything is recorded is not easy. There are three conditions that we have used to decide that the objectives are complete:

  1. All the key functions or capabilities recorded in discussions with customers or in other source documents have been checked off as included in the objectives document.
  2. There are no remaining significant loose ends, where we have questions about what something the customer has said means or where we think that there is an objective implied by something that has been said.
  3. Continuing investigation or discussion has not uncovered new information for a while.

To check these conditions, we have built up a collection of the messages or documents received from the customer and of the notes from discussions. We maintain this collection as the source material that the objectives document references. While writing the objectives document, we mark a copy of these sources to show each piece of information that should be included as an objective, and cross them off as they are incorporated. This usually leads to a final (tedious) review of all these sources to check that nothing has been missed before declaring that we have properly checked everything off.
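
The bookkeeping behind the three conditions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SourceItem` record, its fields, and the threshold of three review sessions without new information are hypothetical choices, and "for a while" remains a judgment call for each project.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourceItem:
    """One marked piece of information in the source material."""
    source: str                          # e.g. a message or meeting note
    text: str                            # what the customer said or wrote
    objective_id: Optional[str] = None   # objective that incorporates it
    open_question: bool = False          # unresolved meaning or implication


def not_incorporated(items):
    """Condition 1: items not yet checked off against an objective."""
    return [item for item in items if item.objective_id is None]


def loose_ends(items):
    """Condition 2: items that still carry an open question."""
    return [item for item in items if item.open_question]


def ready_to_baseline(items, quiet_sessions, quiet_sessions_required=3):
    """All three conditions hold.

    quiet_sessions counts consecutive discussions or investigations that
    uncovered nothing new (condition 3); the required count is an assumption.
    """
    return (
        not not_incorporated(items)
        and not loose_ends(items)
        and quiet_sessions >= quiet_sessions_required
    )
```

Listing the items returned by `not_incorporated` and `loose_ends` gives the agenda for the final review of the sources.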

Where possible, the customer should review the objectives document and approve that it correctly includes all their needs, and nothing else. If the customer cannot do such a review, then someone who is independent of the team and can be an accurate proxy for the customer should review the document. For a visionary project, where there is no customer, this could be someone who has done market research. For an RFP-driven project, this could be someone who is familiar with the customer.

Comments. For some users, working in terms of use cases will be familiar. While documenting use cases (with users and functions) is helpful, it cannot capture all of the information from the customer in their original language. Resist the temptation to document the objectives as formal use cases unless the customer is providing information that way. Formalization comes in the concept of operations document, which derives from the objectives.

22.6.4 Regulation and process definition

While the system being built needs to meet the customer’s needs, there are other stakeholders whose needs must be addressed as well. These are, broadly, internal objectives of the development organization and external objectives of third parties, such as regulatory bodies. Some of these objectives will define capabilities that the system must have. Other objectives provide constraints on how the system can function or be implemented, without defining specific capabilities.

Regulatory objectives

Many kinds of systems are subject to regulation. Some systems require licensing or certification, to prove that they meet regulations; others only need to be able to show compliance on demand.

These regulations pose constraints on the design of the system. Some are simple: aircraft emergency exits must be marked in particular locations. Some are complex: the crew of an aircraft must be able to properly determine what is happening with an aircraft even when there are complex failure situations—which involves human factors as well as the design of aircraft sensing systems.

A system will not be able to be put into operation unless it can meet these regulations. This means that the regulations must be incorporated into the design, just as the functional desires of the customer must be. One cannot do this unless one knows what the regulatory constraints are, and so one must search out and document all the regulations that apply.

The regulatory objectives document should at minimum list the source regulatory documents that apply to the system. Before design validation is complete, the information in the regulatory objectives document must translate into a detailed collection of requirements against which the system can be checked.

It is often necessary to involve either experts in the regulation of a particular industry or the regulatory agencies themselves to properly gather all of the regulations that apply.

Regulatory examples. We look at regulation of two kinds of systems: aircraft and spacecraft. These two examples show different approaches to regulation. Most (but not all) aviation regulation is typically provided by a single government organization, the national civil aviation authority (CAA). In the US, this is the Federal Aviation Administration. All the civil aviation authorities worldwide are harmonized through the International Civil Aviation Organization (ICAO). In contrast, regulation of spacecraft is spread over multiple organizations, and there is little or no international harmonization of regulations.

Aircraft regulation. Aircraft regulation is focused on managing the risk to aviation non-participants (such as people on the ground) or casual participants (passengers on board an aircraft). The body of regulation is complex, taking a number of different approaches to protect people in general while allowing those who can take responsibility for aircraft behavior the maximum feasible freedom to do as they need. This results in a combination of rules: licensing of aircraft types, constraints on where different kinds of aircraft can be flown, pilot training and certification, air traffic control over where aircraft are flying, and many others. It requires the combination of all of these rules to meet the objective of controlling risk to the public.

The regulations that apply to aircraft in particular (as opposed to the larger aviation system) begin with classifying the kind of aircraft by the risk it poses. Ultralight aircraft are lightly regulated; the category is defined primarily by limits on weight, speed, stall speed, and so on. Pilots either do not need a license or only need a limited license for ultralight aircraft, which generally can only be flown in daylight. There are intermediate kinds including those for general aviation, aerobatic and utility aircraft, commuter aircraft, and finally transport aircraft. Each category has limitations on its weight, speeds, number of passengers, acceptable pilot qualifications, and allowed maneuvers. The restrictions increase as the number of passengers, weight, and speed increase because each of these induces greater risk to the public.

CAAs throughout the world have encoded the regulations for each category of aircraft. In the US, for example, the regulations for transport aircraft (the largest category) are defined in the Code of Federal Regulations, Title 14 (the FAA), Part 25 (Transport category airplanes). Other parts of Title 14 cover topics like airports, the structure of airspace, air traffic control, carriers or operators, and navigation facilities; these other parts define the environment in which the aircraft will operate.

Most kinds of aircraft require a type certification. This is issued by a CAA to show that the CAA has verified that the aircraft’s design meets all these regulations. This is the first enforcement mechanism used to ensure that an aircraft complies with regulations. There are additional mechanisms, including registering individual aircraft and periodic inspection of the aircraft and its records by CAA-authorized auditors. The final level of enforcement comes from air traffic control granting permission to fly or not.

There are some regulations that apply to aircraft that are not typically handled by a CAA. This includes radio communication, which is typically regulated by a national communications authority (in the US, the Federal Communications Commission) and harmonized worldwide through the International Telecommunication Union.

Spacecraft regulation. Unlike aircraft, spacecraft do not have a unified regulatory regime. This is in part because there is no single unifying principle behind the regulations, as there is for aviation (safety of the public). Most spacecraft pose a negligible danger to the public during operation, as they are small enough to be destroyed when they re-enter the atmosphere. Historically, there has been concern about the military value of the information produced by spacecraft; more recently, there is increasing concern about the dangers one spacecraft poses to other spacecraft.

At the time of writing, in the US, spacecraft regulation includes:

These regulations are spread over multiple agencies, and are changing rapidly as commercial uses of space change.

Third party or service provider objectives

No systems operate in isolation. Instead, they operate within the context of a larger system of people, businesses, and organizations. This might include:

The interactions and dependencies within this larger system also create constraints on how the system being designed must function. It is important to identify each of these organizations or systems, document how the system will interact with them, and then document the more specific objectives that are involved in working with them.

This information should be collected into one or more documents that record, first, the structure of the larger system and its interfaces with the system being designed; and second, the sources of constraints or objectives for each interface.

Information about the ecosystem in which the system will operate is likely to change frequently over the course of developing a system, especially for visionary projects. This means that it is important to update information about these objectives, and when it changes, flow those changes down into the system design.

Example: communication services. Consider a system of multiple vehicles—such as cars, trucks, or small UAVs—that need to communicate continuously with a central operations facility. The system itself is the vehicles and the operations facility. The communications are likely to be provided by a third party: a cellular communications company, for example.

As the system design progresses, the team will be able to define more and more accurately what capabilities are needed from the communication system. How reliable does it need to be? Can there be areas with poor or no coverage? What data rates are needed?

At the same time, communication providers will have their own constraints and capabilities. This might include pricing—both how pricing is calculated (Flat rate? Amount per data transferred?) and what the rates are. It might include their coverage area, and their mechanisms to provide information about outages or new coverage. It might include terms of use, with restrictions on what kind of data can be transmitted and what security measures the system must take in order to be connected to the provider’s network.

Example: spacecraft launch provider. Most spacecraft launches are performed by a company different from the organization that builds and operates the spacecraft. The launch service provider is responsible for receiving the spacecraft from its builder, integrating it onto the launch vehicle, and placing the spacecraft in a designated orbit. The launch provider is in turn responsible to regulatory agencies that ensure that the launch operations are safe, and in many sites the launch provider must work with a range safety organization (in the US, the US Space Force provides range safety for the Eastern and Western Test Ranges).

There are two classes of interactions between the launch vehicle and the spacecraft: the effects that the spacecraft can have on the launch vehicle, and the effects that the launch vehicle can have on the spacecraft. The provider gives the spacecraft designers specifications of the launch vehicle, including how the spacecraft will be attached and released; what vibration, pressure, and thermal environment the spacecraft will be in during processing and launch; and what communication is possible between the launch vehicle and the spacecraft. The provider also gives constraints on what the spacecraft can do, such as constraints on the spacecraft’s mass, volume, center of gravity, or gas releases. The provider also gives safety constraints, such as the allowed propellants or toxic materials, the state of batteries or other energy storage systems, or the permitted electromagnetic radiation. These constraints usually derive in part from the launch provider’s safety certification with the appropriate regulators or range safety organizations.

Most launch providers make a Payload User’s Guide available that documents this information.

Example: safety-critical component provider. A recent project we worked on involved acquiring a number of sensors for measuring the environment around a vehicle, so that the vehicle could safely plan a path around obstacles. Some of the sensors were not yet available in production, and the team had to work with the providers to obtain evaluation units.

The interaction between the team and the sensor provider was typical of interactions with providers in general. Negotiations between the team and the provider covered topics like:

These issues do not affect the core technical function of the component. Some of them do, however, place constraints on how the team can use the component (it might not be possible to repurpose the sensor for any arbitrary function). Other issues, such as quality control or acceptance testing processes, affect the safety of the system that incorporates that component.

As a result, these constraints also need to be captured in an objectives document, and the system’s design must be validated against the terms.

Business objectives

An organization that is devoting resources to build a system must be able to obtain those resources. At the minimum, the organization must be able to hire and pay the people who design and build the system; it must be able to pay for the tools and prototypes it uses; it must be able to pay people to gather customer objectives and work with regulators and all the hundred other tasks involved.

Most organizations are also building an ongoing business, not just coming together long enough to build one system and then disbanding. Sustaining a business requires obtaining funding, getting sufficient return on the work the organization does in order to fund continuing work, and building capabilities that allow the organization to keep building or maintaining systems into the future.

All these imply that an organization needs to have a business strategy, which leads to business objectives. The organization may have a strategy of developing a product line that serves a wide variety of customers. This might translate into an objective to build a simple initial system product that is able to generate X revenue, and that can be extended over time to address the needs of more customers.

Many organizations develop these objectives at the executive level but do not feed the information downward explicitly to the team who must design a system. This is a problem because the design team knows that such objectives exist but doesn’t necessarily know exactly what they are, and thus can’t make accurate design judgements. We have seen, over and over, questions in a design team like “should we design this board with extra capability now, or design the minimal board and replace it later?” These have often led to arguments because the design team did not have the information needed to choose between a higher up-front investment for extra capability and a later cost for a redesigned board.

There are many different kinds of business objectives to document.

Some objectives are easily quantifiable:

There are general business case objectives:

Finally there are business strategy objectives:

These business objectives change continuously. When there is a proposal to change the objectives, the team must follow a disciplined process to determine what the effects of the change might be. This involves tracking down how the change will affect technical requirements and designs, which in turn determines whether the system can still satisfy customer, regulatory, safety, or security needs. Changes to the design will also affect development cost and the time required to bring the system to operation. Sometimes a change to business objectives will make sense: changing the rate at which the system should scale up after the initial operational version may not affect the development time much but will increase customer satisfaction. Other times a change will have negative consequences: setting the goals for the size of the addressable market too high too early may require a higher development budget and longer development time than is available. Making a well-informed decision about these changes is only possible if the team can determine what the effects of a potential change in business objectives are.

Safety objectives

Safety is the condition that a system, when operated in the intended way, does not produce too many events that cause harm. There are four parts to this statement:

  1. The operational system (its design and implementation)
  2. The way it is intended to be operated (the intended functions and environment)
  3. Produce too many events
  4. Events that cause harm

In the end, a system must be shown to be safe by showing that the rate at which it causes harm is below a threshold. The process of designing a system to be safe is well known to be a difficult task, and there are many books and standards that try to give guidance on how to do so. As the system is designed, it must be evaluated to show the likely rates at which harmful events will occur.

This is a complex topic, and later chapters will address the design and analysis of safe systems. For now we focus on safety objectives.

Performing these evaluations requires defining what kinds of harms are to be measured, along with the acceptable rates at which they occur. There is no way to justify calling a system “safe” or “unsafe” without defining the harms those words refer to.

A project, therefore, must define and document its high-level safety objectives in terms of the harms and the acceptable rates of those harms occurring. This is the safety objectives document.

Some industries have conventional definitions of harm and rates. The automotive industry has adopted a scale of zero to three for “severity” in the ISO 26262 standard, focused entirely on injury to persons. Severity 0 is no injuries, 1 is light to moderate injuries, 2 is severe injuries with survival probable, and 3 is severe or fatal injuries. The aviation industry has defined a five-level scheme in the ARP 4754 standard, ranging from minor (slight increase in crew workload or minor passenger inconvenience) through hazardous (serious or fatal injuries among passengers) and catastrophic (many deaths, loss of aircraft).

These two standards differ in two respects. They consider different ranges of harm: ISO 26262 has any severe or fatal injury as its highest category, while ARP 4754 considers the distinction between fatal injury and mass fatality. They also consider different kinds of harms: ISO 26262 only considers injury to persons, while ARP 4754 considers effects on the crew’s ability to control the aircraft and damage to the aircraft.

These point to deficiencies in the standards, and to the reason why a project should define its safety objectives more carefully. There are many harmful incidents that these standards do not address, such as damage to property, economic harm, or damage or injury to non-person cargo. Consider an incident involving a truck that damages an overpass, but does not injure anyone directly. The cost of repairing or replacing the bridge can run to several millions of dollars; the economic impact on the community of not being able to use the bridge can be equally high. In addition, depending on the industry, the range of severity in these standards can also be too limited: they do not account for harms that spread beyond the people and vehicles immediately involved in an incident. The use of aircraft as missiles in the 9/11 attacks showed how an aircraft safety incident can result in mass casualties or worse.

In addition to defining the harms that system design will consider, the safety objectives document sets targets for how often those harms can occur. Guidance issued for commuter aircraft, for example, gives a maximum allowed rate of incidents per flight hour:

  Minor: 10⁻³
  Major: 10⁻⁵
  Hazardous: 10⁻⁷
  Catastrophic: 10⁻⁹

The safety objectives document should define a maximum rate and the time interval over which that rate applies for each category of harm.
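As a minimal sketch of how such objectives might be recorded and checked, the following assumes a hypothetical `HarmObjective` record and a hypothetical `unmet_objectives` helper; neither comes from any standard. The limits are the per-flight-hour figures quoted above, and the estimated rates are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HarmObjective:
    """One row of a safety objectives document (hypothetical structure)."""
    category: str    # name of the harm category
    max_rate: float  # maximum acceptable number of events per interval
    interval: str    # the interval over which the rate applies


# Rates per flight hour, taken from the commuter-aircraft guidance quoted above.
OBJECTIVES = [
    HarmObjective("minor", 1e-3, "flight hour"),
    HarmObjective("major", 1e-5, "flight hour"),
    HarmObjective("hazardous", 1e-7, "flight hour"),
    HarmObjective("catastrophic", 1e-9, "flight hour"),
]


def unmet_objectives(estimated_rates, objectives=OBJECTIVES):
    """Return the categories whose estimated rate exceeds the objective.

    A category with no estimate is reported as unmet, because the
    analysis has not yet shown that the objective is satisfied.
    """
    unmet = []
    for obj in objectives:
        rate = estimated_rates.get(obj.category)
        if rate is None or rate > obj.max_rate:
            unmet.append(obj.category)
    return unmet


# Made-up estimates from a hypothetical design analysis.
estimates = {"minor": 2e-4, "major": 3e-5, "hazardous": 5e-8}
print(unmet_objectives(estimates))
```

Treating a missing estimate as unmet is a design choice in this sketch: it keeps an incomplete analysis from being read as a safe design.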

Some organizations may choose to say that the system they build should allow zero safety incidents above a certain level. This is possible only if the system can be guaranteed never to perform operations that could induce such serious events. For example, an aircraft can be guaranteed never to cause catastrophic harm, involving multiple fatalities—but only if the aircraft has a maximum weight of a few tens of kilograms, a low maximum speed before it disintegrates in the air, can only carry a single person, and so on. No transport aircraft (more than 19 seats or maximum takeoff weight greater than 19,000 lbs) that actually flies can ever have a zero rate of catastrophic harm. Similarly, many weapons systems can never have a zero rate of mass casualty harms simply because of the energy they carry. In most cases, as the conventional wisdom goes, the only way to get a system to have a zero rate of harm is not to build the system.

Safety objectives, like customer, business, or regulatory objectives, are sources that lead to the concept of operations and top-level system specifications.

Defining precise safety objectives early in a project is required for building a safe system. We have observed many projects that made aspirational statements about “safety being a first priority”. In every single instance where the definition stopped at that statement, the team designed an obviously unsafe system—often because, in the absence of an objective standard, each person took steps they thought would be safe but in aggregate the design missed even basic scenarios that resulted in hazards. Further, the absence of an objective meant that no one could perform an objective analysis of a design to determine whether it was good enough.

Security objectives

The security objectives document provides guidance for how the system should be designed and validated to ensure that it can handle a reasonable range of attacks.

Security objectives are similar to safety objectives: they define a set of harms resulting from security incidents that the system must work to avoid or contain. Unlike safety objectives, however, security incidents occur as the result of malicious, intentional actions rather than as a result of failures, accidents, or design flaws. Like safety objectives, the security objectives document names the harms that the system should avoid. It cannot, however, generally specify maximum acceptable incident rates because the rate at which attacks occur is something that attackers can deliberately control.

The approach to defining security objectives, then, is to name threat actors and the harms they can cause. A threat actor is a person or organization that can choose to initiate an attack on the system, such as a hacker, a criminal organization, or a hostile nation state. Each threat actor can be characterized by their motivations (a criminal organization for financial gain, a nation state to disable defense-relevant capabilities). The harms the threat actors can cause include disclosure of confidential information, interruption of business, death of persons, financial loss, or theft of goods. The list of harms includes every kind of harm addressed as a safety concern, plus harms that do not involve damage or injury but do involve loss of value or information.

The system must then be designed to address the different harms that different threat actors might pose. The resulting design can be analyzed to determine whether the threats are sufficiently addressed. The built system can be tested to verify that key defensive features are working as intended.

The definition of “sufficiently addressed” remains subjective. Some security analysis techniques have rationales for assigning weights to different threats. For those analyses, ensuring that all high- and medium-priority threats have been mitigated might be sufficient.
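One possible way to make that criterion checkable is sketched below. The `Threat` record, the three-level `Priority` scale, and the `open_threats` helper are all assumptions made for illustration; real security analysis techniques define their own weighting schemes, and the example threats are invented.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    """Hypothetical three-level weighting of threats."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class Threat:
    actor: str          # who can choose to initiate the attack
    harm: str           # the harm that would result
    priority: Priority  # weight assigned by the security analysis
    mitigated: bool     # whether the design addresses this threat


def open_threats(threats, floor=Priority.MEDIUM):
    """Return threats at or above `floor` that the design has not mitigated."""
    return [t for t in threats if t.priority >= floor and not t.mitigated]


# Invented entries, using the kinds of actors and harms named above.
threats = [
    Threat("criminal organization", "financial loss", Priority.HIGH, True),
    Threat("hostile nation state", "interruption of business", Priority.MEDIUM, False),
    Threat("hacker", "disclosure of confidential information", Priority.LOW, False),
]

remaining = open_threats(threats)
print([t.actor for t in remaining])
```

Under this assumed criterion, the design is sufficient only when `open_threats` returns an empty list; the low-priority threat is left open by choice.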

There are many standards related to security, and depending on the industry and geographic region compliance with some standards may be mandatory. These may define security objectives that a system must meet for regulatory or business acceptance. This information should be documented in the regulatory objectives document, and information about threat actors or harms should flow from the regulatory objectives document into the security objectives document.

22.6.5 Concept of operations

Purpose. The concept of operations (CONOPS) document is the systems team’s response to the customer’s objectives. It collects in one place a description of how the system will work, from the point of view of the people who will use it. Whereas the customer objectives come from the customer and should record their needs from the customer’s point of view, the CONOPS shows how a system could behave in a way that meets the customer needs. The CONOPS is written from the point of view of the system.

The CONOPS document organizes the ideas about how the system will behave. In doing so it gives a model for understanding how the system’s functions can be organized and how different behaviors relate to each other. It does not aim to provide every detail about the behavior; its value is in documenting the big picture.

The CONOPS document has three primary purposes:

  1. To collect a description of the system’s behavior in one place, which can then be used as the source to develop the system’s specification and design,
  2. To provide a description of the big picture for how the system will work to people on the team, especially people joining the team, and
  3. To feed back to a customer the systems team’s understanding of what the customer has asked for, in order to check that understanding.

The document is typically written as a narrative, and not as formal requirements or detailed behavioral models. Its value is in its explanation, not its precision. It is an explanation of how the system might function, without reference to how the system can be implemented to achieve those functions. The details of operations, as well as the implementation, are recorded in documents that derive from the CONOPS. It should, however, expose the users, features, functions, states, and use cases that model what the customers’ objectives mean.

The CONOPS should include functions that are implicit in the customer’s objectives. For example, the document should cover the system’s entire life cycle, from deployment or initial startup through shutdown and disposal. The document should cover major faults that might occur, and the system’s behavior when those occur. For a spacecraft, for example, it should include recovery modes that allow the spacecraft (perhaps under ground control) to re-establish normal operation after a fault. It should include not just the technical core of the system, but also how the human or organizational elements that use the system behave. For an automobile, the CONOPS should not just say that there is a driver, but include expectations like the driver being trained and licensed. For an airline, the CONOPS should include the airline’s safety management system and how that interacts with an aircraft or maintenance technician.

To produce the concept, the systems team reviews and understands all the objectives documents already described, then follows a process to extract from those objectives a model of how a system could behave. The process of analyzing all the objectives will almost always reveal things that the customer or others have not addressed—customers often focus on the main operational behaviors, for example, and don’t address how to deploy or dispose of the system. The systems team needs to find these gaps and address them. Where possible, the systems team should work with the customer to check whether the customer in fact has expectations about these topics before committing to a concept.

Form. The CONOPS is typically a narrative document, though organization is important. Diagrams are especially helpful as long as they only expand on the narrative description.

It is not recommended that the CONOPS document consist solely of diagrams, such as UML/SysML use cases. While these can be helpful as a part of the document, the CONOPS must provide the explanation for what these use cases are and how they relate to each other.

We have seen many projects that try to use a “CONOPS document” to record the specification of the system. One can recognize when this has happened because the document runs to hundreds of pages, includes lots of details, and is usually abandoned shortly after system development begins. This is bad practice.

The CONOPS document should be short. It is a high-level explanation, not the details. The details come in the system specification, which will be long, tedious, and written in stylized forms that are not easy for the uninitiated reader to understand. The CONOPS document should explain and illustrate the life cycle of the system, from deployment, through operation, to retirement. It should define the major users and the major functions they need from the system. A good CONOPS document is often anchored around the “big scary picture”, like the OV-1 overview diagram in the DODAF standard: a diagram that illustrates the main behavior of the system in one place.

Input. The CONOPS derives primarily from the customer objectives. It is the systems team’s distillation of what the customer has indicated they need, combined with the team’s exploration to define the users and behaviors.

Dependents. The CONOPS document is the primary technical output of the initial concept development effort, and all of the other technical artifacts derive from it. The CONOPS is the source for the top-level specification of the system, which is a more formal interpretation of the CONOPS. Other design, evaluation, and implementation artifacts in turn flow from the specification.

The CONOPS is provided back to the customer for their review. The customer should check whether the system described in the concept meets what they need and what they were expecting. If so, the customer’s review leads to approval of the concept.

Content. The CONOPS document is recommended to include the following information:

Again, the CONOPS document is intended to be short. Many engineers have succumbed to the temptation to make the concept of operations document be the design document; don’t do that. The CONOPS should remain tightly focused on the users, use cases, and externally-visible behaviors without going into implementation. If there is a need to provide great detail about some externally-visible behaviors, write a document with the details and reference it from the CONOPS document, or defer this to the more detailed specifications that follow on from the CONOPS.

There are many templates for CONOPS documents. Two examples are:

Completion. The document can be considered complete when three conditions hold:

22.6.6 Reviews and approval

Purpose. The initial concept development phase gathers and organizes information about what a system should be and how it should behave. It gathers this information from many sources: the customer (if there is one), third parties that impose constraints, the developing organization’s policies and standards.

The concept leads, in turn, to all of the technical artifacts that make up the system and its design: the specifications, designs, analyses, implementations, and so on. Those technical artifacts can only be as good as the concept from which they are developed, so checking that the concept is accurate and complete is vital to producing a good system.

The team is ready to proceed to system specification when two conditions hold:

Form. The review and approval steps can take many forms, but at minimum they should include providing the reviewers and approvers with the documents to be reviewed, and a mechanism for recording comments and approvals.

Input. The review and approval steps use the various documents developed in the initial concept phase.

The people who should provide reviews and the people who have approval authority should be identified before starting the review process.

Dependents. The approvals of the initial concept are a gateway to all further technical development.

Content. There can be several different reviewers and approvers for the initial concept. In general, part of the concept needs to be reviewed by the customer, in order to get feedback from them on whether the concept meets their needs. Other documents created during the initial concept development are not necessarily for the customer’s knowledge—matters like the developing organization’s business objectives. These other documents should also be reviewed, at minimum by people inside the development organization.

As always, the reviewers and approvers should be independent of the people who wrote the documents. The goal is to have people who do not necessarily bring the same preconceptions to reading the material as the people who wrote it, in order to catch assumptions that need to be detailed.

One should expect that the reviews will generate comments and questions. Some of the comments will require the team to revise the various documents, at which point the changes will need to be re-reviewed. Being able to clearly identify the changes that a reviewer needs to address will help them determine what to focus on.

Completion. The review and approval step is complete when all the approvers have formally indicated that they concur with the documents and that they believe that the project is ready to move on to designing the system.

22.6.7 Proposal (RFP-driven projects)

Purpose. A proposal, in the sense meant here, is a document that is sent to a potential customer in response to a request for proposal (RFP) that the potential customer has issued.

A proposal needs to make four cases to the customer:

The proposal derives from the work done during concept development, but usually also must include initial system specification and design work. This initial technical work is needed both to be able to explain to the customer what they would be getting if they choose this team to develop the system, and to generate a reasonable price for building the system.

Many processes and guidelines for proposal development have been published over the years, and we refer the reader to that large body of literature for details.

Form. Many proposals are required to follow a precise form. The form and contents are typically specified in the RFP, and often derive from regulatory requirements.

A typical proposal to NASA, for example, must follow a structure specified in the RFP. The form usually consists of:

The RFP also specifies the format in which the proposal must be delivered (PDF electronic form, paper), allowed numbers of pages for different sections, the font choices and sizes, and many other details.

Input. The proposal contains a summary of a lot of information about the technical system design. This means that the team must have developed a top-level system architecture, which in turn depends on being clear about the system’s concept and the objectives it has to meet.

The proposal also needs to include cost or pricing information. This depends on having a reasonably accurate idea of what work will have to be done to provide the system to the customer, and on understanding the business objectives that a contract would need to meet, such as the expected profitability.

Finally, a good proposal needs to clearly demonstrate to the customer that the system being proposed meets their needs. This is often presented in the form of a compliance matrix or compliance table. This table lists each of the customer’s major objectives and points to where in the proposal this objective is addressed, so that the customer has an easy way to check how their objectives are met.
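A compliance matrix is simple enough to sketch as data. The example below is illustrative only: the function names, the objective names, and the section numbers are all hypothetical, and a real RFP will usually dictate the table's columns.

```python
def build_compliance_matrix(objectives, addressed_in):
    """Pair each customer objective with the proposal section that addresses it.

    objectives:   list of objective names, in the order the customer gave them
    addressed_in: dict mapping objective name -> proposal section reference
    Objectives with no entry are paired with None so the gap is visible.
    """
    return [(obj, addressed_in.get(obj)) for obj in objectives]


def format_matrix(rows):
    """Render the matrix as plain text, one objective per line."""
    width = max(len(obj) for obj, _ in rows)
    lines = []
    for obj, section in rows:
        lines.append(f"{obj.ljust(width)}  {section or 'NOT ADDRESSED'}")
    return "\n".join(lines)


# Hypothetical objectives and section numbers, for illustration only.
objectives = ["Continuous coverage", "Data rate", "Disposal plan"]
addressed_in = {"Continuous coverage": "Section 3.1", "Data rate": "Section 3.4"}

rows = build_compliance_matrix(objectives, addressed_in)
print(format_matrix(rows))

# Objectives the proposal does not yet address.
gaps = [obj for obj, section in rows if section is None]
```

Building the table from the list of objectives, rather than from the proposal's sections, means that an objective the proposal never mentions still appears as a row.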

Dependents. A successful proposal leads to a contract and to system development. The contract will specify what the team is supposed to build, superseding any information that was gleaned from an RFP. The team will need to develop updated versions of the customer objectives, regulatory objectives, and safety/security objectives, then revise the concept of operations, to reflect the actual commitments they have made to the customer.

As with all development projects, the real system specification will then flow from the revised CONOPS and onward into technical implementations.

Getting from the proposal to the contract may involve negotiation, which (from the developing team’s side) will use the material generated for the initial concept and for the proposal to inform negotiating positions.

Content. A proposal, as we have said, needs to convey to the potential customer what the team proposes to provide to them, along with evidence that the team can actually do the work and do it better than any competitors.

Completion. The proposal is complete when it is delivered to the customer. When delivered, it must meet the format and content requirements that the customer has provided in their RFP.

It is common for a customer to have clarification questions or ask for revisions to a proposal. In those cases, the team may need to respond with an update to the proposal.

22.6.8 Competition documentation

Purpose. Most systems projects will be in some kind of competition—whether for a customer contract, for sales of a developed system, or for acceptance of a new technology over an existing approach. A team can develop a good concept or a good system, but then fail to get that system used.

The competition document gathers together intelligence about who and what might compete with this team’s system. It lists strengths and weaknesses of each competitor.

Knowing about competition applies to every project, not just those which must generate a competitive proposal. A customer-driven project must still satisfy its customer; the customer will be aware that they have choices about what investments they make in new systems or upgrades. A visionary project may have direct competitors who may try to build similar systems—but visionary projects can also have competition from the way problems are already being solved, as a customer can always choose not to buy the team’s new system and stick with what they already have.

Form. The competition document does not have any set form. We have often organized the document with one section per competitor, with a description of each and bulleted lists of their strengths and weaknesses.

Input. The information in the competition document comes from a number of primary sources: people who track relevant markets, interactions with the potential customers, market surveys, and so on.

Dependents. Information about competition can feed into many parts of the initial concept development:

Content. The competition document must be an unbiased presentation of the alternatives to using the system being designed, and of the advantages and disadvantages of those alternatives.

Many people will naturally want to emphasize what they see as their own strengths and try to contrast the competition to those strengths. That makes for a misleading competition document.

The competition must be presented as fairly as possible, and from the customer’s point of view. The document must be honest about the strengths that competitors have: they will have strengths and the team cannot defend against those if they do not have an accurate assessment of them. The document must be equally honest about the competitors’ weaknesses. The team cannot design a better solution if they do not accurately know what customers don’t like about what their competition offers (or might offer), or if they don’t understand what structural problems the competition might have in designing or building their own offering.

Completion. The competition document is never really complete, because other teams and other technologies will always be changing. The competition document can be complete enough to support CONOPS development or proposal development when the people who are searching out potential competition haven’t found any new competitors in a while.

Chapter 23: Specifications

8 February 2023

23.1 Purpose

Specification is about recording how a component (or system) should behave or the structure that the component should present. It only documents how the component appears from the outside, as a black box; it does not specify how the component achieves these ends. A specification derives from the less-formal concept for the system or component.

A specification provides a simplified and abstract view of a component. This abstract view allows one to reason about how the component will work with other components. Without the abstract view, one would have to analyze the details of a component’s implementation to determine whether it will interact properly with another. While that is possible, the work of figuring out how the component will behave only serves to reconstruct design information that was originally worked out when designing the component. The reconstructed information will not necessarily match the information used during design, and the effort is wasteful.

A good specification records the intent and assumptions that went into working out what the component is supposed to do. This information helps the component’s implementer and designer to check that they understand what they need to build, and to check that the specification matches the intent. These assumptions also help people understand how a component might need to change when part of the system is redesigned—to add a new feature, for example. A record of the intentions helps people who come along later to understand the system, and the particular component’s role in it.

Finally, a specification serves as a sort of contract between a component and the rest of the system in which it functions. The people building the component in question can proceed to work on their component with confidence that the result will likely integrate correctly into the system as long as they build to that contract. The people building other parts of the system can likewise proceed with reasonable confidence that when they go to use the component, it will do what they expect.

23.1.1 Good specification properties

A specification is used for several different tasks by different people over the course of a project. A good specification needs to be structured and contain the information needed to support these people.

Specifications should be clear and unambiguous. Each of the people who will read and use a specification needs to come to the intended meaning of the specification.

They should be testable. Someone using the specification should be able to look at a design or implementation and determine whether it is compliant with the specification. That does not mean that determining compliance is easy; it only means possible. Sometimes the most that is possible is to build a body of evidence that a design is very probably compliant. For a specification to be testable, however, it can’t contain terms like “approximately” or “fast” or “heavy”; it needs specific values that define what “approximately” (“+/- 10%”), “fast” (“at least 20 m/s”), or “heavy” (“greater than 5 kilograms”) mean, so that compliance is not a matter of subjective judgment that can differ between two people.
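
As a rough sketch of what testability means in practice, the fragment below encodes the three example thresholds from the paragraph above as objective pass/fail checks. The property names, the nominal length, and the measured values are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class QuantifiedProperty:
    name: str
    criterion: str                     # human-readable form of the threshold
    complies: Callable[[float], bool]  # objective pass/fail check


NOMINAL_LENGTH_M = 2.0  # hypothetical nominal value for the "+/- 10%" example

PROPERTIES = [
    QuantifiedProperty(
        "length", "within +/- 10% of nominal",
        lambda v: abs(v - NOMINAL_LENGTH_M) <= 0.10 * NOMINAL_LENGTH_M),
    QuantifiedProperty("speed", "at least 20 m/s", lambda v: v >= 20.0),
    QuantifiedProperty("mass", "greater than 5 kilograms", lambda v: v > 5.0),
]


def check(measurements):
    """Return a pass/fail verdict for each property that has a measurement."""
    return {p.name: p.complies(measurements[p.name])
            for p in PROPERTIES if p.name in measurements}


# Hypothetical measurements taken from an implementation under test.
verdicts = check({"length": 2.15, "speed": 19.5, "mass": 5.2})
```

Two people running this check against the same measurements will reach the same verdicts, which is the point of replacing subjective words with specific values.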

The specifications need to be organized. A specification is no good if the people who need to use it don’t know it exists or can’t find it. A specification is also not useful if the people who need it can’t tell whether it is currently applicable, outdated, or a speculative proposal. Specifications should be kept in one place where everyone on the project can find all of them, and they should be maintained under configuration management.

A good specification is minimal. It addresses the needs for the system or component that have been identified in the concept work leading up to the specification, but it does not add other elements that are not relevant to the identified needs. (Note, however, that the process of developing a specification can often reveal needs that were missed in building up the concept and CONOPS. When those gaps are found, the concept and CONOPS need to be updated as well as addressing the gap in the specification.)

23.1.2 Specification versus documentation

Specification and documentation play different roles. Specification is a record of what something should be, while documentation is a record of what it has been designed and implemented to actually be. Specification deals with the black-box, external behavior, while documentation deals with the internals of the component. The documentation should connect decisions about the component’s internal structure to the external behavior or structure documented in the specification.

23.1.3 Specification needed to scale a project

A small project, implemented by a very small group of people over a short time and thereafter left alone, and that does not provide safety- or security-critical functions, does not necessarily need specification.

Unless all of those conditions hold, some level of specification is necessary in order to communicate between people and across time.

The communication includes:

Sidebar: The role of experience substituting for specification

Every specification is written in terms of some level of common knowledge: language, jargon, subject matter

Have to strike a balance between what is assumed and what is explicitly recorded

In small and fast-moving teams, temptation to rely on experience rather than writing down the needs, especially when the same person specifies and implements

Works in the short term but not in the long term as people change or work becomes shared

Is error-prone (example of hysteresis)

Disadvantages early career engineers

Can be okay if this is a transient condition, and specifications and assumptions are recorded before being forgotten

23.2 Specifications and systems

A specification defines the metaphorical shape that the component should have in order to fit into and support the system.

A specification treats the component as a black box: it considers only how the component should be seen from the outside, without determining how the component’s internals should be designed or implemented. One way to look at the specification is that it defines a contract between the system and the component: if the component behaves according to the specification, the system should work correctly as a whole.

A specification may define behaviors or attributes that in effect narrow the range of possible designs, possibly to only a single design. That situation in itself does not make a specification invalid. However, the specification should not include definitions whose only purpose is to constrain the design, when they are not needed to record required external behaviors.

After a component has been specified, design of the internals of that component begins. The internal design often uses sub-components. The designers will develop specifications for the sub-components.

This process repeats recursively to lower and lower components, until one reaches components that have no further sub-components. The result is a tree (or possibly a DAG) consisting of alternating layers of specifications and designs. (This has been called the “layer cake model”.) The design of one component (or the system) responds to its specification. The specification for subcomponents depends on the design that has been selected for the component—the design determines both what subcomponents there are, and how they are to work together.

[Figure: images/spec-layering.svg]

23.3 Example

Some years ago, I worked on a rack-mounted computing system that had high reliability and uptime goals. A decision was taken to include a battery pack in each server assembly, so that if the mains power went out the servers would have enough time to record their state on storage before shutting down.

Consider the specification for the battery pack. It may seem simple—provide enough power to run the server assembly for some period of time—but the actual specification contains several subtle elements because its function is entwined with other system-wide reliability and safety behaviors.

Here are some of the system behaviors that affect the specifications for the battery pack:

These are rough objectives for the server assembly as a whole. These translate into specifications on the battery pack itself.

Addressing keeping the server assembly running:

Addressing the server assembly changing its behavior:

Addressing the server assembly lifetime:

Addressing likely failure:

Addressing safe and convenient customer installation:

Addressing fire, toxic gasses, and similar safety issues:

Addressing supply chain attacks:

Addressing fitting into a standard rack:

Addressing the environmental conditions:

These example objectives are not all of what would be needed for a server battery pack, but they illustrate several of the kinds of concerns that the battery pack’s designers will need to consider. These rough objectives must be turned into more precise specifications in order to guide the designers accurately. For example, some of the statements above use subjective words like “nominally” that need to be made precise. Other statements are too general and need to be decomposed into a set of more specific statements.

23.4 Kinds of specifications

“Specification” is a deliberately broad term, encompassing many different ways of recording what something should be or do (and why).

Many people assume that “specification” means “requirements”. While requirements are one kind of specification, they are not the only one—and requirements are not generally sufficient by themselves to record all the information needed about behavior or structure.

Kinds of specification include:

There are many kinds of models.

Example: control function for a PID controller XXX
Example: TCP session opening state machine XXX
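
A minimal sketch of how the connection-opening part of such a state machine might be recorded as a specification, written as a transition table. The state and event names follow the TCP standard (RFC 9293); the table layout and the `run` helper are assumptions made for illustration. The table deliberately omits resets, timeouts, and connection closing.

```python
# (current state, event) -> next state, for connection opening only.
OPENING_TRANSITIONS = {
    ("CLOSED", "passive open"): "LISTEN",
    ("CLOSED", "active open"): "SYN-SENT",
    ("LISTEN", "rcv SYN"): "SYN-RECEIVED",
    ("SYN-SENT", "rcv SYN,ACK"): "ESTABLISHED",
    ("SYN-SENT", "rcv SYN"): "SYN-RECEIVED",  # simultaneous open
    ("SYN-RECEIVED", "rcv ACK of SYN"): "ESTABLISHED",
}


def run(events, state="CLOSED"):
    """Follow a sequence of events; fail on any transition the table omits."""
    for event in events:
        key = (state, event)
        if key not in OPENING_TRANSITIONS:
            raise ValueError(
                f"no transition specified for {event!r} in state {state}")
        state = OPENING_TRANSITIONS[key]
    return state


client = run(["active open", "rcv SYN,ACK"])
server = run(["passive open", "rcv SYN", "rcv ACK of SYN"])
```

Because the table only describes externally visible states and events, it says nothing about how an implementation stores connection state, which is what keeps it a specification and not a design.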

23.4.1 Combining multiple kinds of specification

In practice I have found that no one kind of specification meets all needs, and have used multiple kinds of specification together.

Generally, each kind of specification we use meets the good specification objectives of being clear and testable, as defined earlier.

Mixing multiple kinds of specification, however, requires care in organizing the specifications. Different kinds are often written and stored in different tools (a tabular tool for requirements; a CAD tool for mechanical drawings). This easily leads to a situation where a practitioner cannot find all of the specifications to which they need to be paying attention.

One way we have addressed this is to use a table of textual requirement statements as a primary specification, and include requirements like “the component shall comply with state machine X”, including a reference to the drawing of the state machine. Using a tool that makes all these forms accessible through one common user interface helps make this convenient for users. Using tools that can perform configuration management across all the different forms of specification also helps.
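
The arrangement described above can be sketched as a small data structure: a table of textual requirements in which some entries refer to other specification artifacts by identifier. All identifiers, field names, and storage locations below are hypothetical. The check at the end finds requirements that point at an artifact nobody has registered, which is one way a practitioner ends up unable to find a specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    artifact_id: str
    kind: str      # e.g. "state machine", "mechanical drawing"
    location: str  # where the artifact is stored (hypothetical paths)


@dataclass(frozen=True)
class Requirement:
    req_id: str
    text: str
    references: tuple = ()  # identifiers of other specification artifacts


ARTIFACTS = {
    "SM-X": Artifact("SM-X", "state machine", "drawings/state-machine-x"),
}

REQUIREMENTS = [
    Requirement("REQ-1", "The component shall comply with state machine X.",
                ("SM-X",)),
    Requirement("REQ-2", "The component shall fit within envelope drawing E.",
                ("DWG-E",)),
]


def dangling_references(requirements, artifacts):
    """List (requirement, reference) pairs whose target is not registered."""
    return [(r.req_id, ref)
            for r in requirements
            for ref in r.references
            if ref not in artifacts]


missing = dangling_references(REQUIREMENTS, ARTIFACTS)
```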

23.5 Using specifications in a system

We first look at how specifications are developed and used from the outside: from the perspective of those who are concerned with how a component fits into the system, and not with what the specification means for the design internal to a component.

A specification for a system derives from the objectives and CONOPS developed during the system concept development phase.

The system-level specification leads, in turn, to a system design and then recursively to the concepts and specifications for components in the system.

23.5.1 Building the specification

This is the first step in using specifications. The specification developer looks through all of the conceptual material assembled for the system or for a component, and organizes and formalizes it to make a specification.

In practice this does not happen all at once. People develop the various kinds of objectives that lead to the specification iteratively, and parts of the specification will be developed as the objectives and concept become clear. As people develop the specification, they will identify gaps in the concept, which will lead to improvements in the objectives and CONOPS and in turn lead to updates to the specification.

23.5.2 Evolving the specification

The needs that a system solves change over time. New capabilities get requested. Regulations evolve. Problems with the system are found and need to be fixed. All of these can lead to changes in the concept and thus to changes in the system specification.

The concept and design of components also change, and for similar reasons. As well, a component may have a perfectly adequate design, but it may become outdated because subcomponents become unavailable. This leads to a redesign of a component, inducing new specifications for subcomponents.

It is important to follow an organized process when a specification changes. Many process standards recommend specific approaches; for example, ISO 26262 [ISO26262] specifies that any change to a system must begin with an impact analysis, which determines how a change to objectives or specification propagates through the design of the system, and downward through the hierarchy of components. Standards like that also specify that the specifications and designs be maintained under configuration control so that everyone can know whether a change is a work-in-progress proposal or has been committed to.
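
The propagation part of an impact analysis can be sketched as a walk over “derives from” links. This is only an illustration of the idea, not the procedure that any particular standard prescribes; the item names and the link structure are hypothetical.

```python
from collections import deque

# Hypothetical links: each key lists the items directly derived from it.
DERIVED_ITEMS = {
    "objective: survive mains power loss": ["system spec"],
    "system spec": ["system design"],
    "system design": ["server spec", "battery pack spec"],
    "server spec": ["server design"],
    "battery pack spec": ["battery pack design"],
    "battery pack design": ["cell spec", "charger spec"],
}


def impacted_by(changed, derived=DERIVED_ITEMS):
    """Breadth-first walk listing everything downstream of a changed item."""
    impacted = []
    queue = deque(derived.get(changed, []))
    while queue:
        item = queue.popleft()
        if item in impacted:
            continue  # already visited via another path
        impacted.append(item)
        queue.extend(derived.get(item, []))
    return impacted


impact = impacted_by("battery pack spec")
```

The result is only a list of candidates to examine. Deciding whether each listed design or specification actually needs to change remains an engineering judgment.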

23.5.3 Validating the specification

The specification must reflect all of the needs identified in the concept from which it derives, and the specification must not add needs that do not appear in the concept and objectives. Before a specification can be declared complete, someone must go through all the material in the concept to check that the specification accurately reflects each of the identified needs or objectives.

A specification validation exercise can also help identify gaps in the objectives. Checking the specification often involves someone who was not part of developing the objectives and CONOPS; a fresh perspective can lead to asking questions about the objectives or the specifications that in turn lead to discoveries of topics that are missing.

23.5.4 System consistency

As the system design grows and more and more components are defined and specified, someone needs to check that the designs and specifications are all consistent. This is especially important for “long distance” dependencies: where the correct function of one component depends on the correct function of another component in a different part of the system. (More formally, when two components A and B depend on each other for correct function, and the lowest common parent of A and B in the component hierarchy is near the top of the hierarchy.)
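
The more formal definition above can be made concrete with a short sketch that finds the lowest common parent of two components and reports how close it sits to the top of the hierarchy. The component names are hypothetical, and for simplicity a component counts as its own ancestor here.

```python
# Hypothetical component hierarchy: child -> parent (None marks the root).
PARENT = {
    "system": None,
    "power": "system",
    "compute": "system",
    "battery pack": "power",
    "charger": "power",
    "server board": "compute",
    "firmware": "server board",
}


def ancestors(component):
    """Return the path from a component up to the root, inclusive."""
    path = []
    while component is not None:
        path.append(component)
        component = PARENT[component]
    return path


def lowest_common_parent(a, b):
    """Return the deepest component that lies above both a and b."""
    above_a = set(ancestors(a))
    for candidate in ancestors(b):  # walks upward from b
        if candidate in above_a:
            return candidate
    raise ValueError("components are not in the same hierarchy")


def depth(component):
    """Depth 0 is the top of the hierarchy."""
    return len(ancestors(component)) - 1


near = lowest_common_parent("battery pack", "charger")
far = lowest_common_parent("battery pack", "firmware")
```

A dependency whose lowest common parent has depth 0 or 1 crosses most of the system, so it is a good candidate for an explicit consistency check.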

23.5.5 Safety and security design

As we will discuss in future chapters, the safety and security properties of a system must be designed top down, and they need to be defined early in system development, before too many low-level components are designed.

We advocate using the systems safety methodology, which emphasizes starting with the accidents or losses that are to be avoided, and then the conditions that must be maintained in a system to achieve safe operation. (This is different from many safety methodologies, such as functional safety, which focus on safety in the face of failure conditions and do not address safety problems arising from design or component interactions.) The categories of losses come from the safety and security objectives defined in the concept development phase.

Some example conditions:

Once these conditions are identified, systems engineers must determine how to address them in the design of the top-level system. They must then create derived specifications for each of the top-level components in the system, and show that if each of the components meets its specifications the overall system will exhibit safe or secure behavior by complying with the safety and security conditions. This process is repeated at increasingly lower levels of the system.

23.5.6 Review and approval

A specification guides the design and implementation of parts of the system. Given the importance of this role, a specification—or an update to a specification—should be reviewed before being committed to. Each specification should be checked by the people whose work it affects: system designers, the designer of the component or system that contains the thing being specified, potential implementers, and those people who are working on components that will interface with or use the component being specified.

As with other system artifacts, a specification or specification update should be under configuration management so that each user can determine whether they are using the correct version, and whether the version they are using is a proposed or work-in-progress version, the current approved (baselined) version, or a version that has become obsolete.

23.6 Using specifications for a component

We now turn our attention to those people and activities that use a specification to design and implement a component; that is, who are concerned with how the internals of a component reflect its specification.

There are two tracks of activity that use a component’s specification:

One track follows the design and implementation of the component itself, which should result in a component that complies with the specification. The other track follows the design and implementation of verification methods, such as tests or static analyses. The tracks come together when the implementation gets checked by the various verification methods, resulting in a determination of whether the implementation is in fact compliant, or whether the design and implementation need to be fixed to bring it to compliance.

23.6.1 Learning about a component

A specification is an abstracted view of what a component should be. That makes it useful as a guide for someone who needs to learn about a component, before diving into the design or implementation of that component.

Someone who is learning about a component—or about the structure of the system across many components—needs to be able to find the relevant specifications. The specifications should be organized to support them:

23.6.2 Designing and implementing to specification

The general task of a designer or implementer is to create a component that complies with its specification. In practice, of course, this is a complex activity.

The designer needs to be able to clearly identify all of the behaviors or capabilities that the component must implement. This implies that the specification must be organized in a way that helps the designer find all of these, and in a way that can serve as a checklist for tracking which features have been satisfied and which have not.

As we will discuss further in upcoming chapters, the designer or implementer should be able to identify which aspects of the component have the highest design risk or are the most technically complex. The designer and implementer will often choose to focus on these hard aspects first, before dealing with aspects that are easy to solve. The hard aspects are often candidates for prototyping, in order to determine if a design approach is feasible and can meet the specification. (See XXX for more on prototyping and risk reduction.)

Complex systems and components can benefit from the combination of incremental development and continuous integration. Incremental development involves selecting a few parts of the component’s specification and implementing those, followed by testing. Once those aspects of the component appear sound, the developers perform a second iteration by selecting a few more aspects of the specification and adding them to the design and implementation. Continuous integration, in this context, involves performing integration testing of these partial designs and implementations in a skeleton of the rest of the system. The partial implementation of this component may use mockups of subcomponents, or interact with mockups of peer components in the system. We discuss incremental development and continuous integration more in XXX.

As people work through design and implementation, they are likely to find problems or gaps with the specification. The specification may be ambiguous in some part, or the specification may not define the behavior for some condition. The developers must be able to work with those who defined the specification to sort out these issues. The developers should check the specifications in depth, asking the specifiers questions to check their understanding or to confirm that there are issues. The developers then should work with the specifier to resolve the issues.

The developers should not make an assumption about a gap or ambiguity and move forward without confirming their assumption. The people who wrote the specification are responsible for ensuring that the specifications for different components are consistent and address large-scale safety or security concerns. The behaviors needed to support correct interaction are encoded in the specification. The developers are responsible for implementing components that correctly support these behaviors so that the resulting system works correctly. The developers do not necessarily have the big-picture perspective to make changes to these critical behaviors, and do not necessarily know who else needs to know about an assumption in how a component is defined. The developers need to work collaboratively with those responsible for the specifications so that all the pieces of the system remain consistent and correct, and so that everyone involved shares a common understanding of how the components and system are to function.

A component’s implementation will need to be verified against the component’s specification. People using continuous testing or test-driven development methods have had good results producing correct component implementations efficiently by testing an implementation in small increments as functionality gets added to it. This reduces the risk that the design or implementation has made some fundamental, early mistake that becomes increasingly expensive to correct as more functionality is implemented on top of the erroneous implementation. Performing continuous testing (or verification) requires having verification cases defined and implemented concurrent with the implementation of the corresponding functionality.

Finally, each component design and implementation will need to be reviewed and approved before being accepted as finished. Verifying that the design and implementation comply with the specification is a major part of the review process. The review activities will be much easier if the specification is well organized.

23.6.3 Evolving specifications

As mentioned earlier, a component’s specification will likely change when a system remains in use for a long time. Systems engineers will need to investigate the impact of making a change to a specification before committing to the change.

The component designers and implementers are part of the investigation process. While a systems engineer can look at what will change in how a component interacts with other parts of the system, the component designers and implementers are better positioned to evaluate the effect that a change in specification will have on implementation or verification.

To change a design and implementation in response to a change in specification, the developers need to correctly determine what has changed in the specification. Having a clear mechanism for showing what requirements have been removed, added, or changed, and for showing specifically how other parts of the specification have changed, makes this task possible. In particular, being able to accurately enumerate every change is important; the developer should not have to hunt for subtle changes that may be hidden.
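
One simple mechanism of this kind is a comparison of two versions of a requirements table, keyed by requirement identifier. The sketch below assumes that identifiers are stable across versions; the identifiers and requirement texts are hypothetical.

```python
def diff_requirements(old, new):
    """Enumerate every removed, added, or reworded requirement."""
    removed = sorted(old.keys() - new.keys())
    added = sorted(new.keys() - old.keys())
    changed = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return {"removed": removed, "added": added, "changed": changed}


BASELINE = {
    "REQ-1": "The encabulator shall be colored green.",
    "REQ-2": "The encabulator shall weigh less than 5 kilograms.",
    "REQ-3": "The encabulator shall accept message A.",
}

PROPOSED = {
    "REQ-1": "The encabulator shall be colored green.",
    "REQ-2": "The encabulator shall weigh less than 4 kilograms.",
    "REQ-4": "The encabulator shall accept message B.",
}

delta = diff_requirements(BASELINE, PROPOSED)
```

A comparison like this catches the subtle one-word change in REQ-2 as reliably as it catches the added and removed requirements, so the developer does not have to hunt for it.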

The decisions that are encoded in a component’s design include how different parts of the component interact with and depend on each other. When a component’s design is to be changed in response to a change in specification, some parts of the design will be directly affected. For example, a decision to add a new input message to a component directly implies that new message reception and handling functions must be implemented. However, one change can affect other parts of the existing design, and the designer and implementer must find and address all of these effects. The new input message in this example might require changes to a database schema for storing additional information, or might affect response time behaviors that require changes to foundational concurrency control capabilities in the design. Having a clear record of how parts within one component are designed to depend on or affect each other reduces the effort involved in making this kind of change, and reduces the chances of an error stemming from some dependency being overlooked.

23.6.4 Verifying a component

The specification defines what a component should be or do; the design and implementation define how it is or does these things. Verification is the process of ensuring that the implementation produces behaviors that match the specification.

Every element of the specification should have a corresponding method for verifying compliance of the implementation. Different aspects of the specification will require different methods: some aspects can be verified by testing, such as showing that given some input A, the component responds with behavior B. Other aspects will require demonstration, such as showing that a physically representative user can see and reach control devices. Some aspects—especially safety and security—can only be verified by analysis or formal methods, such as showing that a component never performs some action identified as unsafe.

Verification methods involve design and implementation, similar to the design and implementation of the component itself.

Designing a verification method involves, first, determining how a specification property can be verified. (Sometimes a property is best verified using more than one approach in parallel.) Once the approach—testing, review, demonstration, or analysis—has been determined, the next step is to design how that specific specification property will be checked. That can involve designing a set of test cases that cover the expected behaviors, or defining a test procedure to evaluate a mechanical component, or defining who will perform a review and what they will look for.

Implementing a verification method turns the design into a specific set of tools and actions that, when used, give a yes-or-no answer to whether the component is compliant.

The verification methods can have errors. Indeed, in some cases the verification of a property can be more complex than the component implementation it is checking. This means that the verification designs and implementation need careful scrutiny to ensure that they are, in fact, checking the specified properties and not something else.

The verification methods also must be complete: if some property is worth specifying, it is worth verifying. The verification designs and implementations need to be checked to ensure that they cover all of the specification. Explicitly recording which parts of the specification any particular verification method checks helps the task of checking completeness.

Finally, it is common for project management to track what portion of a component’s specification has been completed and verified. This can be organized by identifying each property in the specification, and tracking which verification methods check each one. As verifications are done, the project managers can determine which parts of the specification correspond to verification activities that passed.
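
The completeness check and the progress tracking described above can both be sketched from one record: which specification properties each verification method checks, and whether that method has passed. The identifiers and status labels are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Verification:
    method_id: str
    approach: str                  # "test", "review", "demonstration", "analysis"
    checks: tuple                  # specification properties this method covers
    passed: Optional[bool] = None  # None means the method has not been run yet


def status(properties, verifications):
    """Report, for each specified property, where its verification stands."""
    report = {}
    for prop in properties:
        relevant = [v for v in verifications if prop in v.checks]
        if not relevant:
            report[prop] = "no verification method"  # completeness gap
        elif all(v.passed is True for v in relevant):
            report[prop] = "verified"
        elif any(v.passed is False for v in relevant):
            report[prop] = "failed"
        else:
            report[prop] = "pending"
    return report


PROPERTIES = ["REQ-1", "REQ-2", "REQ-3", "REQ-4"]

VERIFICATIONS = [
    Verification("T-1", "test", ("REQ-1",), passed=True),
    Verification("A-1", "analysis", ("REQ-2", "REQ-3")),
    Verification("T-2", "test", ("REQ-3",), passed=False),
]

report = status(PROPERTIES, VERIFICATIONS)
```

In this sketch a property checked by more than one method counts as verified only when every method covering it has passed. A project could reasonably choose a different rule.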

23.7 Specification artifacts

Specification activities take as input the objectives and CONOPS artifacts that were generated during concept development.

The specifications themselves involve:

The elements in the specification should include traces that show how each individual part of the specification derives from some part of the objectives or CONOPS, and conversely how each part of the objectives is reflected in the specification.
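
A minimal sketch of checking traces in both directions, assuming each specification element records the identifiers of the objectives or CONOPS items it derives from. All identifiers are hypothetical.

```python
OBJECTIVES = {"OBJ-1", "OBJ-2", "OBJ-3"}

# Specification element -> objectives or CONOPS items it derives from.
TRACES = {
    "REQ-1": {"OBJ-1"},
    "REQ-2": {"OBJ-1", "OBJ-2"},
    "REQ-3": set(),  # no recorded source
}


def trace_gaps(objectives, traces):
    """Find spec elements with no source, and objectives with no spec element."""
    untraced_spec = sorted(
        element for element, sources in traces.items()
        if not sources & objectives)
    covered = set().union(*traces.values())
    uncovered_objectives = sorted(objectives - covered)
    return untraced_spec, uncovered_objectives


untraced, uncovered = trace_gaps(OBJECTIVES, TRACES)
```

An untraced element suggests the specification has added a need that is not in the concept. An uncovered objective suggests the specification has missed one.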

The specification artifacts should be maintained under configuration management. That means that there should be a common repository that everyone working on the system can use to retrieve (and potentially update) the artifacts. The repository should maintain separate versions of each artifact, and clearly identify which version is the current, baselined version that people should use, which versions are outdated, and which are works in progress.

The configuration management system should support people reviewing a specification, and must support recording when a particular version has been approved to be baselined.
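
The version states described above can be sketched as a small model. This is not the interface of any particular configuration management tool; the type names, the artifact name, and the rule that approval makes the previous baseline obsolete are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    WORK_IN_PROGRESS = "work in progress"
    BASELINED = "baselined"
    OBSOLETE = "obsolete"


@dataclass(frozen=True)
class Version:
    artifact: str
    number: int
    status: Status


def current_baseline(versions, artifact):
    """Return the single baselined version that people should be using."""
    baselined = [v for v in versions
                 if v.artifact == artifact and v.status is Status.BASELINED]
    if len(baselined) != 1:
        raise LookupError(
            f"{artifact}: expected one baselined version, found {len(baselined)}")
    return baselined[0]


def approve(versions, artifact, number):
    """Record approval: baseline one version and retire the old baseline."""
    updated = []
    for v in versions:
        if v.artifact != artifact:
            updated.append(v)
        elif v.number == number:
            updated.append(Version(artifact, number, Status.BASELINED))
        elif v.status is Status.BASELINED:
            updated.append(Version(artifact, v.number, Status.OBSOLETE))
        else:
            updated.append(v)
    return updated


HISTORY = [
    Version("battery pack spec", 1, Status.OBSOLETE),
    Version("battery pack spec", 2, Status.BASELINED),
    Version("battery pack spec", 3, Status.WORK_IN_PROGRESS),
]

UPDATED = approve(HISTORY, "battery pack spec", 3)
```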

Chapter 24: Requirements

1 April 2024

24.1 What are requirements?

Requirements are one kind of specification: they say something about a property that a component or system should have, or a behavior they should exhibit.

A requirement is a specification in the form of a single, declarative textual statement. In the simplest case, a requirement is a statement of the form:

<thing> <specification mode verb, like “shall”> <do or exhibit something>

For example,

The encabulator shall be colored green.

There are many nuances and variations on this basic form, but they are all extensions of this basic idea.
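
The regularity of the basic form makes simple mechanical checks possible. The sketch below splits a statement into its three parts with a regular expression. Only “shall” comes from the text above; the other mode verbs in the pattern are assumptions, and the pattern handles only the simplest case, not the nuances and variations.

```python
import re

# <thing> <mode verb> <do or exhibit something>
# "shall" is from the text; "shall not", "should", and "may" are assumed here.
REQUIREMENT_FORM = re.compile(
    r"^(?P<thing>.+?)\s+(?P<mode>shall not|shall|should|may)\s+"
    r"(?P<behavior>.+?)\.?$"
)


def parse_requirement(text):
    """Split a requirement into its parts, or return None if it does not fit."""
    match = REQUIREMENT_FORM.match(text.strip())
    if match is None:
        return None
    return match.groupdict()


parsed = parse_requirement("The encabulator shall be colored green.")
rejected = parse_requirement("The encabulator is green.")
```

A check like this can flag statements that lack a mode verb, but it says nothing about whether the requirement is clear or testable.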

Requirements are written this way in order to maximize the simplicity and clarity of the specification.

Requirements are only one part of the specification for a component or system. They document specific facts about a system’s design, but they do not document the explanation of how that particular design came to be. They do not document the general purpose and scope of a particular component. They do not document complex interaction patterns. These other parts of a specification are documented in other design artifacts that complement requirements.

24.1.1 Why write requirements?

One of the jobs of systems engineering is to ensure that a user or consumer of some artifact (system or component) will be satisfied with the artifact once it is built and deployed.

The specifications for a system or component serve as a way to organize the information about what the user wants, and to organize the process of checking that the final result meets the user’s desires. The specification thus acts as a kind of implicit contract between the end user and the implementers: if the user agrees that the specification properly records their objectives, and the resulting system can be verified to meet the specification, then the implementers have built something that satisfies what the user agreed to. (Whether the user is actually satisfied is a separate matter.)

XXX would a couple diagrams help here? A first one might show user → conceptual artifact, conceptual artifact → developer → concrete artifact; a second one might show systems and verification in the picture?

This means that there are three main uses for requirements (and the rest of specifications):

  1. Encoding the user’s objectives in a written form, and allowing the user to validate that the specification matches what they want.
  2. Guiding implementers as they work out the design for the artifact that will meet the user’s objectives.
  3. Providing a checklist for verifying that the resulting artifact meets the specification, and thereby the user’s objectives.

A systems engineer is typically the keeper of the specifications, responsible for overseeing the writing, changing, and verification of requirements and other specifications.

Requirements—and all specifications—are therefore acts of communication between multiple groups of people with different roles in building the system.

Systems engineers are facilitators and interpreters in this communication between users and implementers. They are responsible for translating information received from users into specifications (including requirements), and for explaining the specifications back to the users for validation. The information from the user is often unstructured and incomplete. It is up to the systems engineer to work with the user to clarify their objectives and ensure that the result accurately reflects the user’s intent. The systems engineer also works to ensure that the specifications are complete. This often involves identifying use cases that the user has not thought of themselves and working with the user to define what behavior the system should have in those other cases.

The systems engineer also facilitates the implementer’s work. The systems engineer develops specifications so that the implementer has a clear guide to what they need to design and build; this requires that the systems engineer provide translation or explanation when the specification does not use the same terms or concepts that the implementers do. The systems engineer is also responsible for ensuring that the final artifact meets the customer’s objectives by overseeing the verification of the implementation against requirements (and other specifications). This involves working with verifiers to ensure that verification methods match the requirements, and checking that all requirements have been verified before the system is declared done.

A systems engineer performs other tasks using requirements, such as checking consistency or completeness. We will discuss these tasks in a later section.

A good requirement must meet several objectives in order to provide accurate communication between all these parties:

These needs lead to conventions about how requirements are written and organized, as we will discuss later.

24.1.2 What are requirements about?

Requirements are a general-purpose way of writing down facts about what something is supposed to be (or not be).

Requirements can apply to just about anything. In a typical system project, they will be used to:

24.1.3 The context for requirements

Requirements don’t stand on their own.

Most requirements in a system will apply to particular components in the system. The component breakdown structure provides the list of components that requirements can be about.

Requirements are part of more general specifications for the system and its components. The specifications include

The requirements must be consistent with these other parts of a component’s specification.

In the end, requirements are satisfied by the implementation of the components in the system. Being able to trace the connection from a component’s requirements to the pieces of the implementation matters in order to be able to show that the requirements are satisfied.

24.2 A single requirement

A requirement itself is a single statement about something that should be true about something.

More formally, a requirement has three parts:

The main winding of the encabulator shall be placed in panendermic semi-boloid slots of the stator

where “be placed” is the verb.


Some examples:

24.2.1 Example

Consider an example of a statement of what the mission manager for a small spacecraft mission wants:

A spacecraft mission wants a small spacecraft that is expected to operate in low Earth orbit (LEO) for at least three years.

This sentence has a number of problems. It mixes statements together: the mission and the spacecraft, the operating environment and the lifetime. The sentence is not very precise: what is “low Earth orbit”? What does the spacecraft have to do to “operate”? It is unachievable: nobody can give an absolute guarantee that a spacecraft will function for a particular duration; what if there is an unusual solar flare that fries its electronics?

We can improve the example sentence a bit by splitting it into three requirements statements:

These requirements improve the original statement. First, they split the original so that each requirement is about a single topic (and is written in the subject-mode-property form). Second, they make two of the requirements more achievable (“95% probability”) and more precise (altitude range given).

These three requirements in themselves are not sufficient. Before the requirements are done being written, for example, there will need to be a definition of what “operate nominally” means. Similarly, the “at least three years” requirement is not enough by itself: three years would be difficult or impossible to meet if the intended environment were the surface of Venus; it would be almost trivially easy if the intended environment were an air-conditioned clean room. Adding more information about the environment is necessary to interpret the three-year condition: for example, what is the expected radiation environment at those altitudes?

The three example requirements are not sufficient in another way: they are high-level and provide the designer of, say, a battery subsystem no guidance about how the battery must be designed so that the spacecraft meets these requirements. The derivation or flowdown is the topic of an upcoming section.

24.2.2 Rationale

A well-written requirement is concise. As such, it makes a statement about what a component should do—but the text of the requirement does not capture why the component should do that.

Good requirements should include a rationale statement that documents the thinking that went into choosing to make the requirement. The rationale does not change the requirement; it only adds explanation. The rationale helps those who must come along later, after the requirements are written, to understand or evaluate the requirements. It helps educate other engineers about considerations that may not be obvious. It helps those who later need to revise requirements understand what constraints there may be on the requirement they are changing.

24.3 Multiple requirements

Requirements actually come in groups; they are practically never singular.

The meaning of a group of requirements is the logical AND (conjunction) of all of them. If there are ten requirements, an implementation complies with the requirements only if it complies with all ten of them individually.
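
As a minimal sketch of this conjunction rule, the following Python treats each requirement as a predicate over a description of an implementation. The `Requirement` class, the identifiers, and the example property names (`color`, `heat_watts`) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical representation: an implementation is a dict of measured
# properties, and each requirement is a predicate over that dict.
Implementation = dict


@dataclass(frozen=True)
class Requirement:
    ident: str
    text: str
    check: Callable[[Implementation], bool]


def complies(impl: Implementation, requirements: list[Requirement]) -> bool:
    """A group of requirements is met only if every one of them is met."""
    return all(req.check(impl) for req in requirements)


def failures(impl: Implementation, requirements: list[Requirement]) -> list[str]:
    """List the identifiers of the requirements the implementation fails."""
    return [req.ident for req in requirements if not req.check(impl)]


requirements = [
    Requirement("widget:1", "The widget shall be painted green",
                lambda impl: impl["color"] == "green"),
    Requirement("widget:2", "The widget shall generate at most 5 watts of heat",
                lambda impl: impl["heat_watts"] <= 5),
]

good = {"color": "green", "heat_watts": 4}
bad = {"color": "green", "heat_watts": 7}
```

Here `complies(good, requirements)` holds, while `bad` fails the group because it misses a single requirement, even though it meets the other one.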

There are two issues to watch out for when there are multiple requirements: contradictions and exclusivity.

Contradiction: Two requirements contradict if complying with one of them means that it is impossible to comply with the other, and vice versa. Every collection of requirements must be checked to ensure there are no contradictions. The section on consistency below discusses this further.

Exclusivity: If a collection includes a requirement

A must do X,

it is perfectly reasonable to also have another requirement

A must do Y.

Having both of them means that there are two things that A must do.

The question then arises: if component A also does Z, is that compliant or not? In some cases it is okay if A does Z (it has a feature that isn’t used) and sometimes it is not (if it is important that A only does X and Y and nothing else ever).

The answer is that having requirements about doing X and Y means that the requirements are silent on Z. If the requirements are silent on a topic, that topic is not considered important and it doesn’t matter for compliance. (If the topic is important, it needs to be included in the requirements.)

If it is important that A only does X and Y and nothing else, that needs to be stated explicitly. This can sometimes be written directly into one requirement:

The component must be colored one of red, green, or blue

This can also be written in a general negative form:

The component must not do any activity not listed in these requirements

Explicitly listing the allowed activities is preferable to a “must not” requirement—the negative form is convoluted and easy to misread.

24.4 Organizing requirements

Even a moderately sized system will typically have thousands of requirements. Users need some kind of organization of all those requirements in order to find the requirements they will be working with.

There are three concepts to discuss: organizing by subject, organizing by sections, and hierarchical writing.

24.4.1 Levels of requirements

People use requirements for different purposes. This leads to fundamentally different kinds of requirements.

At the most abstract level, the general product or mission objectives capture what stakeholders want the system to do—its purpose. These almost always start as general, vague statements. The stakeholders, system engineers, and product managers refine these over time into a clearer definition of the system’s purpose. The exercise may or may not result in proper requirements statements, but it is worth treating the results as if they are requirements and showing how the top-level system requirements derive from these objectives.

Projects also have guiding objectives that do not specify the system directly, but instead define policy or standards that the system must adhere to. There are many kinds of policies, including:

It is helpful to organize the product/mission objectives and all the various policies and standards into separate collections, identified by the kind of policy or source of objectives. For example, one can maintain one collection for business policy and a separate one for the quality assurance standard being used to build a system.

The top-level requirements on the system as a whole are part of the formal or semi-formal definition of what the system is to do. These requirements say what the system is and does when looked at from the outside, as a black box. These requirements are best kept separate from the more vague product/mission objectives—the objectives represent desires, while the top-level requirements represent the commitments made for what the system will do. The derivation mapping from objectives to top-level requirements provides a place to record the rationale for why different decisions were made about the commitments in the system, and why the decision was made not to commit to supporting some desires, represented in objectives.

Requirements on lower-level components provide definitions of what the pieces that make up the system must do. These obviously have a different scope than the top-level requirements for the whole system.


24.4.2 Organizing by subject

The first concept is that requirements should be organized by their subject, following the component breakdown structure.

The system objectives are those requirements that apply to the system as a whole. These typically encode the CONOPS for the system, along with requirements derived from the process or design standards.

The rest of the requirements apply to specific components within the system. The component breakdown structure defines what the components are, and gives them names.

Organizing by component is important for proper verification, so that each requirement can be connected to the implementation artifacts that are expected to comply with the requirement, and so that the implementer of some component can properly determine all the requirements they need to adhere to.

24.4.3 Organizing by section

One single component or process/design standard can often have several hundred requirements. Users can find and work with all these requirements more easily if they are organized by topic as well as by subject.

This can be done by creating a set of topic sections within each component. Often these sections are the same for all components. Some sections will be empty when they are not relevant, but having the same organization across all components helps people find what they are looking for.

There is no one recommended set of sections that will apply to every system. The choice of sections is affected by the kind of system or components being developed, as well as by process and design standards. For example, if an automotive project is following the ISO 26262 Functional Safety standards [ISO26262], the Safety Goals and/or Safety Requirements should be collected into one section.

As a starting point, we have used variations on the following set of sections in several projects:

It’s a good idea to work out one or a few section structures that work for your project, then use those sections consistently across all components.

Keep in mind that some requirements will always fit into multiple sections. For example, a requirement may both be about regulatory compliance and define a function the component is supposed to provide. Try to make consistent choices about which section a requirement goes in, but don’t try to make some perfect hierarchical section scheme that would let people avoid making such choices.

24.4.4 Hierarchical versus flat requirements

There are two general structures for organizing requirements on a particular topic:

The flat organization has all requirements within a section be at the same level. Each requirement is independent of the others and can be understood by reading only the text of that requirement.

The hierarchical organization places requirements into an outline, with general requirements and more specific sub-requirements. The sub-requirements must be read and understood in the context of their parent. The sub-requirements provide details, clarification, or limitations on the general parent.

Consider a set of requirements for security on a TCP/IP communication channel. The general requirement is that the communication channel should be authenticated and encrypted. In outline form, this looks like:

  1. Communication channel X must implement security mechanisms
    1. Communication channel X must require authentication before application data is exchanged
      1. The authentication protocol must mutually authenticate both parties to each other
      2. The identity being authenticated must be granted by the organization’s security management system
      3. The authentication protocol must be resistant to man-in-the-middle attacks
      4. The authentication protocol must support revocation of either party’s credentials within X minutes
    2. Communication channel X must maintain integrity and confidentiality of the application data being exchanged
      1. The confidentiality protection must be resistant to traffic analysis

Consider requirement 1.1.1, requiring mutual authentication for the communication channel in question. The requirement for mutual authentication must be understood only to apply to communication channel X. There could well be another communication channel, called Y, that does not have the same authentication requirements.

Written in a flat style, the requirements might be expressed as:

  1. Communication channel X must require authentication before application data is exchanged
  2. The authentication protocol used for communication channel X must mutually authenticate both parties to each other
  3. The authentication protocol used for communication channel X must use identities granted by the organization’s security management system
  4. The authentication protocol used for communication channel X must be resistant to man-in-the-middle attacks
  5. The authentication protocol used for communication channel X must support revocation of either party’s credentials within X minutes
  6. The communication protocol used for communication channel X must maintain confidentiality and integrity of the application data being exchanged
  7. The communication protocol used for communication channel X must provide confidentiality protection that is resistant to traffic analysis

Each of these statements can be read on its own; each statement includes all the necessary qualifications (“the protocol for communication channel X must…”) to identify the scope of its subject without having to refer to other statements.
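
The relationship between the two styles can be sketched in code. In the sketch below, a hierarchical outline is a tree, and walking the tree produces one row per requirement together with the parent statements that a reader must keep in mind. The `Node` class and the `flatten` function are hypothetical names used for illustration; turning each row into a self-contained flat statement still requires a person to rewrite the text.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    text: str
    children: list["Node"] = field(default_factory=list)


def flatten(nodes, prefix="", ancestors=()):
    """Yield (outline number, ancestor texts, text) for every requirement."""
    for index, node in enumerate(nodes, start=1):
        number = f"{prefix}{index}"
        yield number, ancestors, node.text
        # Children inherit the scope of every requirement above them.
        yield from flatten(node.children, number + ".", ancestors + (node.text,))


outline = [
    Node("Communication channel X must implement security mechanisms", [
        Node("Communication channel X must require authentication before "
             "application data is exchanged", [
            Node("The authentication protocol must mutually authenticate "
                 "both parties to each other"),
        ]),
    ]),
]

rows = list(flatten(outline))
```

The row numbered `1.1.1` carries two ancestor statements. That context is what the flat form must spell out in the text of the requirement itself.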

There are pros and cons of each approach.

24.4.5 Requirement identifiers

Every requirement needs a unique identifier.

People use this identifier to refer to the requirement, including using it as a bookmark or link to reference the requirement in other documents. Software check-ins to a repository often use the requirement identifier to indicate what functionality is being added to the repository. Task management systems use requirement identifiers to track the progress on implementing and verifying particular requirements. In general, the requirement identifier enables the integration of requirements management with other tools and tasks.

The identifier must be stable. That is, once a requirement has been given an identifier, that identifier should not change. The text of the requirement can (and will) change, but the identifier remains a stable way to refer to the requirement in documents, email, and other messages without having to track down all the uses of the identifier and change them.

It is good practice for the identifier to convey some information about the requirement. At minimum, the identifier should make it clear what component or body of external requirements the identifier applies to. If one writes requirements hierarchically, then using the number of the requirement in the outline is a good identifier.

Having the identifier carry some information helps the user check that they are referencing the requirement they intended to reference. It also helps the reader to know generally what the writer is talking about, without going into a requirements management system to check.

For many projects, I have used the format <component id>:<hierarchical requirement number> as the identifier. For example, space.eps.panels:3.4.2 for a requirement applying to a spacecraft’s solar panels.

There are requirements management systems that use a universal, flat namespace for identifiers, such as REQ-82763. This is not a good identifier, because it makes it hard to check when one has accidentally mistyped or miscopied the identifier into another document. If one accidentally types REQ-82764 into another document, that other requirement could apply to a completely different component—and the mistake is obscured.
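
A minimal sketch of how a structured identifier supports this kind of checking follows. The exact syntax is an assumption made for illustration (a dot-separated lowercase component path, a colon, and a dot-separated outline number), and `parse_identifier` and `check_reference` are hypothetical helpers rather than part of any particular requirements management tool.

```python
import re

# Assumed syntax for illustration: dot-separated lowercase component path,
# a colon, then a dot-separated outline number, e.g. "space.eps.panels:3.4.2".
IDENTIFIER = re.compile(
    r"(?P<component>[a-z][a-z0-9]*(?:\.[a-z][a-z0-9]*)*)"
    r":(?P<number>[1-9][0-9]*(?:\.[1-9][0-9]*)*)"
)


def parse_identifier(identifier):
    """Split an identifier into (component id, outline number as ints)."""
    match = IDENTIFIER.fullmatch(identifier)
    if match is None:
        raise ValueError(f"malformed requirement identifier: {identifier!r}")
    number = tuple(int(part) for part in match.group("number").split("."))
    return match.group("component"), number


def check_reference(identifier, expected_component):
    """Return True if a cited identifier belongs to the intended component."""
    component, _ = parse_identifier(identifier)
    return component == expected_component
```

With this form, a reference that was meant to point at the solar panels but names a different component can be flagged mechanically. A flat identifier such as REQ-82763 carries nothing to check against.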

24.5 Writing good requirements

Requirements are a way of communicating between people on a project: between the customer and systems engineers, between those who look at how multiple systems work together and those who implement the pieces, between those who design and those who test. A good requirement is one understood equally well by all the people who use that requirement.

Writing good requirements takes practice, but the following guidelines will help in writing and reading requirements.

24.5.1 General form

Individual requirements have a general form:

<subject> <specification mode verb> <property>

The subject is often a component named in the component breakdown structure. It should be named explicitly:

The solar panels shall generate at minimum…

The rudder shall move between 10° left and 10° right

The majority of requirements use either the word “shall” or “must”, depending on the organization and industry. “Shall” indicates an assertion that the statement about the subject is to be true in the implemented system. “Must” expresses the obligation that the statement will be true in the system. In practice the two words mean the same thing when writing requirements.

The solar panels shall generate at minimum…

The flight computer must consume no more than X watts in any mode

The property is a predicate that should be verifiably true about the subject.

Writing the predicate is usually the complex part of writing a requirement. In some cases the predicate is simple:

The subject shall be painted green

The subject shall generate at most X watts of heat

In other cases, the predicate must have conditions added, saying when or under what conditions the predicate applies:

The subject shall generate at most X watts of heat while powered on.

Sometimes the requirement statement is easier to read if the condition clauses are presented in a different, natural order. However, the semantics remain the same: the clause is part of the property statement:

While powered on, the subject shall generate at most X watts of heat
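
The general form is regular enough that a rough check can be automated. The sketch below splits a requirement into subject, mode verb, and property, and pulls out a leading condition clause when one is present. It is a heuristic under stated assumptions, not a complete grammar: it recognizes only “shall” and “must”, it treats a leading clause ending in a comma as the condition, and the names `ParsedRequirement` and `parse_requirement` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class ParsedRequirement(NamedTuple):
    subject: str
    mode: str
    prop: str
    condition: Optional[str]  # leading condition clause, if any


# Only the two mode verbs discussed in the text are recognized here.
PATTERN = re.compile(
    r"^(?:(?P<condition>[^,]+),\s+)?"
    r"(?P<subject>.+?)\s+(?P<mode>shall|must)\s+(?P<prop>.+?)\.?$"
)


def parse_requirement(text):
    """Split a requirement into its parts, or raise if the form is not met."""
    match = PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not in subject-mode-property form: {text!r}")
    return ParsedRequirement(
        subject=match.group("subject"),
        mode=match.group("mode"),
        prop=match.group("prop"),
        condition=match.group("condition"),
    )
```

A trailing condition such as “while powered on” is left inside the property, which matches the point made above that the condition clause is part of the property statement wherever it appears in the sentence.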

24.5.2 Single topic

A requirement should specify a single property of the subject. The examples above all deal with a single property.

There are requirements that may have multiple things in their property statement that still deal only with a single property. For example:

The widget must be painted green, gray, or white

Formally, this requirement deals with a single property: what color the widget may be painted. The color is restricted to a set of three colors—but the property in question is the color.

Note that this requirement is slightly ambiguous: it is not clear whether the widget can be painted only one of those colors, or some mixture of them. This requirement could be improved by either rewriting it as:

The widget must be painted one of green, gray, or white

or by adding a second requirement:

The widget must be painted a single color

24.5.3 Clarity about subject

A good requirement must be clear about what thing it applies to. In general it is best to write down a proper name of the subject—the name of the relevant component in the breakdown structure, for example.

This rule makes for a lot of repetition in requirements. “The control system must X”, “The control system must Y”, “The control system must Z”, and so on. While it means a little more typing, using the component’s name in each requirement means that each requirement can be understood on its own.

24.5.4 Consistent language

Use consistent terms throughout requirements. Always call component X by one name; don’t change it from requirement to requirement. Always call any one function by the same name, so that it’s clear that all the relevant requirements really are talking about the same thing.

Having lists of names or terms helps those who write requirements to use consistent terms, and provides those who read requirements with definitions when they need to confirm what a term refers to. This means:

24.5.5 Plain language

Requirements (and the rest of specifications) may be written by one or a few people, but they will be read by many people. The readers need to understand correctly what the requirements mean. Many of those readers will be learning about the system by reading requirements or other documents, so they won’t enter into reading the requirements with the same context that system engineers writing the requirements will have.

This means: don’t get fancy with requirements language. There are some ways that requirements will sound stilted, like the subject-mode-property form. There is some technical jargon that is needed to make the requirement precise. But don’t make the language more complex than it needs to be.

For any words or phrases that do not have a meaning that will be obvious to all your readers, help them out by defining how those words are being used in the specifications. Start with “must” versus “shall” and any other mode words (see Advanced Requirements below). Provide a glossary of the definitions of the rest of the words.

24.5.6 Negative requirements and “only”

Many organizations prohibit requirements that say “shall not”. Negative requirements have their place, but they are tricky to get right. The problems arise with exactly how broad or narrow the requirement actually is.

Consider a component implementation that could do one of three behaviors, A, B, or C.

If the component has a requirement “the component shall do A”, the implementation satisfies the requirement (it does A). That is because the requirement, as written, allows for the implementation to do other behaviors as well.

If the component has a requirement “the component shall only do A”, then the implementation does not satisfy the requirement because the implementation might do other things.

Now consider a requirement such as “the component shall not do D”. The implementation does satisfy the requirement, but not necessarily in a helpful way. The requirement says that the component doesn’t do D, but what should the component do? Are behaviors A, B, and C all acceptable? What about behavior E?

In most cases it is clearer to name exactly the behaviors that are required, because that is unambiguous. One can write verification conditions to test exactly what is allowed.
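
The three cases above can be sketched with sets, treating an implementation as the set of behaviors it can perform. This is an illustrative model only; the function names are hypothetical, and real behaviors are rarely this easy to enumerate.

```python
def satisfies_shall(behaviors, required):
    """'The component shall do A': A must be present; extras are allowed."""
    return required in behaviors


def satisfies_shall_only(behaviors, allowed):
    """'The component shall only do A': nothing outside the allowed set.

    Assumption: this checks only the 'nothing else' part; whether A is
    actually performed would be covered by a separate 'shall' requirement.
    """
    return set(behaviors) <= set(allowed)


def satisfies_shall_not(behaviors, prohibited):
    """'The component shall not do D': D is absent; the rest is unconstrained."""
    return prohibited not in behaviors


implementation = {"A", "B", "C"}
```

Under this model the implementation satisfies “shall do A” and “shall not do D” but fails “shall only do A”. An implementation that does nothing but E also satisfies “shall not do D”, which shows how little a negative requirement pins down on its own.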

Sometimes, however, one should write a negative requirement. If there is some behavior that really, truly must never happen, then writing a “shall not” requirement calls out that important condition, and a verification test can be designed to show that the system will not do the thing it isn’t supposed to. The negative requirement should usually be paired with a positive requirement that says what the system should do instead.

Safety and security properties often require stating a negative requirement, because these properties are fundamentally definitions of things that the system is to be designed not to do. I have not been able to imagine a way to write “a robot may not injure a human being” [Asimov50] as a positive requirement.

Verifying negative requirements is more complex than verifying positive requirements. See Section 13.3.

24.5.7 Avoid “it”

Avoid the word “it” and other non-specific pronouns or modifiers (“they”, “those”, “them”, “its”). Repeat the name of a thing involved in the property, even if that seems repetitive and wordy. An example:

The control system must enter mode X when it is allowed

This is better written:

The control system must enter mode X when mode X is allowed

The “it” in the first example is ambiguous: the word could refer to the mode or to the control system.
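
A simple check for this rule is easy to automate. The sketch below flags the pronouns named above in a requirement's text; the word list and the function name `find_ambiguous_words` are assumptions for illustration, and a project would adjust the list to its own style guide.

```python
import re

# The pronouns named in the text; extend the set to suit a project's style guide.
AMBIGUOUS_WORDS = {"it", "its", "they", "them", "those"}


def find_ambiguous_words(requirement_text):
    """Return the non-specific pronouns found in a requirement, in order."""
    words = re.findall(r"[A-Za-z]+", requirement_text)
    return [word for word in words if word.lower() in AMBIGUOUS_WORDS]
```

Applied to the two examples, the first version is flagged for “it” and the rewritten version passes. A check like this only finds candidates; a person still has to decide how to rewrite the requirement.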

24.5.8 Avoid impossibly high bars

There are things that we want a system to do. When writing a requirement, it is tempting to write something like

The spacecraft shall function nominally for at least three years on orbit

Unfortunately, this three-year required property of the spacecraft is virtually impossible to meet (unless, maybe, the “spacecraft” is a large, inert chunk of rock). A spacecraft has many parts, operates in a difficult environment, and is built by fallible humans.

The problem with this requirement is that it sets a bar that is so high that no real spacecraft can meet it. The requirement does not allow for any off-nominal operation. It doesn’t allow for a spacecraft to have a temporary fault and then recover. It doesn’t allow for debris to impact the spacecraft. In fact, this requirement is met only when the spacecraft is perfect for those three years. Any real spacecraft will fail verification if it has a requirement like this.

This kind of requirement needs to be modified to something more realistic. There are many ways to do that. The NASA Systems Engineering Handbook has the rule that a requirement should specify “tolerances for qualitative/performance values (e.g., less than, greater than or equal to, plus or minus, 3 sigma root sum squares)” [NASA16, Appendix C].

Three common ways are:

Of course, these are often combined.

24.5.9 Measurable conditions

The point of a requirement is that someone can determine whether an implementation complies with the statement in the requirement. Operationally, this means that a requirement can be verified (see the section on verification below).

One way to make a requirement measurable is to specify the condition quantitatively. For example, a spacecraft’s battery must be able to store at minimum X milliamp-hours. It’s not hard for a test engineer to see how to create a test to verify that the battery complies.

Other requirements, especially those that specify an action that should be taken under some condition, aren’t quantitative, but instead are measured by observing whether the required action is taken. The verification tests will involve either creating the condition under which the action is to occur or observing that the condition has occurred, and then observing that the required action has been taken. For this kind of requirement to be useful, a test engineer must be able to understand accurately the enabling condition and be able to create or detect that condition. The test engineer must also be able to understand the action that is supposed to occur, and detect that it has occurred. If the enabling condition or action can’t be detected, then the requirement is not readily measurable.

Requirements on low-level components are often easier to make measurable than requirements on high-level components. This is why high-level requirements are often verified by looking at requirements derived from the high-level requirement rather than by trying to construct a verification test directly on the high-level requirement.

24.5.10 Unverifiable conditions

When writing requirements for human-machine interaction or user interfaces, the underlying need is that a user can understand what the system is doing, and give it the right commands so that the system does what the user wants.

How would someone verify that the system as designed or implemented actually meets this objective? The statement is too vague to test.

There are multiple ways to address this issue.

First, one needs to break the objective up into a number of more-specific objectives. This often involves putting together a list of what it means to “understand what the system is doing”. This might involve:

And so on.

This breakdown is an improvement over the original desired objective, but the conditions are still not verifiable. As we will see in the later section on requirement derivation, these can be turned into high-level requirements that are broken down further. The verification condition on each of these high-level requirements then consists of, first, verifying all of the derived requirements, and second, presenting an argument that satisfying all the derived requirements shows that the high-level requirement is satisfied.
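
This two-part verification condition can be sketched as a recursive check. The `Req` class, its field names, and the example identifiers are hypothetical; the sketch records only whether a sufficiency argument exists, since judging the quality of that argument remains a human task.

```python
from dataclasses import dataclass, field


@dataclass
class Req:
    ident: str
    verified_directly: bool = False  # result of a test, analysis, or inspection
    argument: str = ""               # why the derived requirements suffice
    derived: list["Req"] = field(default_factory=list)


def is_verified(req):
    """A leaf needs direct verification; a parent needs every derived
    requirement verified plus a recorded sufficiency argument."""
    if not req.derived:
        return req.verified_directly
    return bool(req.argument) and all(is_verified(d) for d in req.derived)


understand = Req(
    "hmi:1",
    argument="Each derived requirement covers one part of understanding "
             "what the system is doing",
    derived=[
        Req("hmi:1.1", verified_directly=True),
        Req("hmi:1.2", verified_directly=False),
    ],
)
```

In this example `hmi:1` is not yet verified, because one derived requirement is still open. It would also remain unverified with every derived requirement closed if the argument were missing.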

The derived requirements about “perceiving” or “observing” are themselves not verifiable: how does one verify that a person has observed, or can observe, some state of the system? This needs to be broken down into yet further, more specific requirements. For example,

Observing how much fuel the system has remaining

is a process, consisting of a chain:

System has fuel → system can measure how much fuel → system transmits this information → an indicator shows the amount measured → a person can see the indicator → a person can accurately observe the indication XXX

If all these steps are satisfied and work correctly, then the person should be able to see the amount of fuel remaining.

Focus on the last two functions in the chain: that a person can see the indicator and that they can observe the indication. Seeing the indicator can be in turn broken down into further requirements, primarily on the physical structure around the person. For example, some of these might be:

There is some prerequisite information needed to verify these examples. For example, what range of sizes will the users be? In order to check for unobstructed line of sight, one must know where the user’s head will be. What visual acuity or color perception abilities are required of the users? A colorblind user will not be able to perceive some color differences that might be used to convey necessary information. What expectations will a user bring to the task? If a user is socially conditioned that green means good and red means bad or stop, using different colors to indicate good or stop will be hard for a user to interpret.

How would one go about verifying these requirements? There are multiple techniques that will help—and usually the techniques must be used together to really check whether a requirement is satisfied. These techniques are a combination of analysis using models and real-world measurement.

The experimental approaches are often the most expensive in time and money, but they are the gold standard for verifying a human interface requirement. Conforming to standards can help address expectations that users will bring to tasks.

In summary, there are several tools for addressing requirements that are too vague or complex to verify:

XXX revisit this section to bring it into line with the Leveson viewpoint on user interaction as control

24.5.11 Detail appropriate to the level

Requirements should be written as a description of what one sees in a component when looking at it from the outside—a black box view. A good requirement does not go into how the feature or behavior is implemented inside the black box.

Put another way, the requirements for a component are documentation of how the component fits into the system around it. If component A is part of a larger component B, the requirements on A document what the implementation of B needs for A to do its part correctly. If components C and D are peers, the requirements document what they will need from each other for both to do their job.

This matter connects directly to requirements derivation from component to subcomponent, which is discussed in the next section.

There are four reasons to follow this rule.

  1. Requirements aren’t the only specification of the system. There are design documents whose job is to document how a component will be implemented internally.
  2. Many requirements are written before a component’s internal implementation is understood. The requirements serve as a record that the component designer can return to in order to make sure they have designed or built a component that meets the needs of the components or users that will interact with it.
  3. Things change. Components get redesigned. If a component’s features don’t change but its implementation does, the requirements defining the component shouldn’t change.
  4. Saying what a component is supposed to do leaves room to document the thinking or the rationale that led from the external what to the internal design of how the component provides the whats. This helps others who come along much later to understand the system—in particular, it helps when a requirement needs to change and a new person has to work out what effects that change in requirement will have on an implementation.

It is tempting to skip right to the details of how a component is built. Don’t do it; provide other people the benefit of your understanding of the problem, not just the final design answer.

24.6 Requirement derivation

XXX revisit this to bring it in line with system model terms

No requirement stands entirely on its own. Almost all requirements have some reason that they have been included in a system, starting with: this requirement is necessary so that the system meets some objective. In lower-level components, the reason often is: this requirement is necessary so that this component provides some feature that other components depend on.

These are examples of requirement derivation. Derivation encodes the relationship between requirements.

Almost all requirements are derived from other requirements, and the requirements in a system must keep track of how one requirement leads to another, or how one is dependent upon another.

undisplayed image

There are several kinds of relationships that people record. Some of these are:

Let’s look at each of these kinds of derivation.

24.6.1 Subcomponents providing features for parent

A parent component has a requirement that the component provide some feature. The requirement in the parent specifies what the parent must do, but does not specify how to implement that feature. The design of the parent component, and later, the implementation, document how the parent component will satisfy that requirement.

When the designer decides on the implementation, they will decide (among other things) how the parent component will use subcomponents to implement the feature. These decisions create requirements on the subcomponents so that they provide the features that the parent component will use.

The reason for these requirements on subcomponents is that they are necessary to satisfy the requirement on the parent component. A derivation relationship between the parent requirement and the subcomponent requirements documents why the subcomponents have the requirements they do.

Consider a spacecraft example. The spacecraft as a whole has a requirement that it be able to point at a ground location, with some number of degrees of accuracy. To implement that feature, the spacecraft designer chooses to use the spacecraft’s attitude control system to point the spacecraft toward a ground location, and then slowly rotate the spacecraft as it passes over the ground location. The parent component—the spacecraft—has the high-level requirements for what it needs to do. The subcomponent—the attitude control system—must be able to slew accurately to an initial pointing vector, and then be able to slew slowly and accurately until the spacecraft is done with an observation. The slewing accuracy and speed are the derived requirements on the attitude control system.

The process continues recursively. The attitude control system designer decides to use reaction wheels as the primary attitude control mechanism. The requirements for slewing accuracy and speed create requirements on the reaction wheels for how quickly or slowly they can turn the spacecraft.

24.6.2 Internal derivation

Some components will have a requirement that specifies a very high-level capability the component must provide. For example, in a section on disposing of a component that is being discarded:

The component shall have a procedure for disposal that ensures that no confidential information is leaked to unauthorized parties

There are several ways this requirement could be met: destroying the retired component in house, crashing the component into the atmosphere or ground in a way that will assure the component is destroyed, or erasing the data on the component before giving the component to an outside entity for recycling.

Whatever the implementation decision is, it creates more requirements on the component, and those requirements derive from the decision on how to satisfy the requirement on protecting confidential information. If, for example, the implementation decision is to recycle a retired part, then this might lead to requirements like:

The component shall provide an interface by which an authorized user can command the erasure of all data stored in the component

The component shall provide a function that erases all data stored in the component

In some organizations, the practice is only to record derivation from one component to another. Sometimes that works out; in the example, the requirement for an erasure command could be on a command handling subcomponent, and the erasure requirement could be on a memory component. However, some components do not break down into subcomponents easily, for example when the component is being implemented by an outside vendor. In other cases, it is simply clearer to document the implementation requirements for the component directly and then pass the requirements through to subcomponents, so that a user can see the totality of the functional interface to the component in one place rather than having to search through subcomponents for something they don’t know exists.

24.6.3 Pass through

External objectives and standards often impose general requirements on “all components of type X”, or the like. For example, an automobile might have a requirement that all electronic components function nominally across a temperature range of -40º C to +125º C. (See the section on Sets as subjects below for more on this.)

This requirement can be placed on the automobile as a whole; the requirement might read

All electronic components in the automobile shall function nominally across the temperature range of -40º C to +125º C

If the automobile includes engine, braking system, and entertainment systems as parts, the temperature range requirement can be passed down to those subcomponents:

All electronic components in the engine system shall function nominally across the temperature range of -40º C to +125º C

All electronic components in the braking system shall function nominally across the temperature range of -40º C to +125º C

The braking system controller unit shall function nominally across the temperature range of -40º C to +125º C

But the entertainment system, which is not safety critical and operates in the more benign environment of the passenger cabin, might have the requirement:

All electronic components in the entertainment system shall function nominally across the temperature range of -10º C to +50º C

In these examples, the general requirement is copied down into lower-level subcomponents until it reaches some component (such as the braking controller in the example) that does not have further subcomponents. Sometimes the requirement is copied verbatim, just changing the scope of the subject; other times, some component will have a variant on the general requirement.

This kind of derivation is sometimes referred to as allocating requirements to subcomponents.
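The pass-through pattern above can be sketched in a few lines of code. This is a minimal illustration, not a description of any particular requirements tool: the `Component` class, its `is_unit` flag, and the `temp_range_override` field are assumptions introduced here to model a component tree in which one subtree carries a variant of the general requirement.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    name: str
    children: list = field(default_factory=list)
    # True for a component with no further subcomponents (assumed flag).
    is_unit: bool = False
    # Variant temperature range for this subtree, if any (assumed field).
    temp_range_override: tuple = None


def pass_down(component, temp_range):
    """Return one requirement per component, copying the range downward."""
    low, high = component.temp_range_override or temp_range
    if component.is_unit:
        subject = f"The {component.name}"
    else:
        subject = f"All electronic components in the {component.name}"
    reqs = [f"{subject} shall function nominally across the "
            f"temperature range of {low:+d} C to {high:+d} C"]
    for child in component.children:
        # A child inherits whatever range applies to its parent.
        reqs.extend(pass_down(child, (low, high)))
    return reqs


automobile = Component("automobile", [
    Component("engine system"),
    Component("braking system", [
        Component("braking system controller unit", is_unit=True),
    ]),
    Component("entertainment system", temp_range_override=(-10, 50)),
])

for requirement in pass_down(automobile, (-40, 125)):
    print(requirement)
```

Each generated requirement would also carry a derivation link back to the requirement on its parent; the sketch omits that bookkeeping to keep the copying step visible.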

24.6.4 Mutual dependency

Sometimes two components are peers of each other, and need to interact. A fuel tank provides fuel to an engine; a spacecraft communicates with a ground station to send telemetry and receive commands; client and server applications send messages to each other.

These interactions involve requirements on each of the components involved, showing how the components support each other. The fuel tank must send fuel; the engine must consume fuel. The spacecraft must be able to communicate with the ground station; the ground station must be able to communicate with the spacecraft.

This leads to pairs of requirements that record this mutual dependency. At a high level,

The spacecraft must be able to communicate with ground stations using protocol standard X


Ground stations must be able to communicate with the spacecraft using protocol standard X

These two requirements should show a two-way relationship with each other. (Formally, this introduces a cycle in the derivation graph.)

24.6.5 Using derivation

Derivation shows how requirements are related to each other.

Systems engineers use the record of these relationships for several tasks.

A derivation relationship between requirements on two different components helps to document the implementation approach for meeting a higher-level requirement. When a designer looks at the high-level requirement, they can see what features are used to implement the high-level requirement. The lower level requirements and their rationale allow the designer to see the argument that the implementation will be sufficient to meet the high-level requirement. This makes the design rationale available to people who didn’t create the design in the first place, but need to understand it to evaluate it or to make changes.

The section on analyzing requirements, below, goes into more detail on how one can look at the requirement derivation relationships to evaluate completeness or sufficiency, to argue whether low-level features are actually necessary, and to trace out the effects of making a change in requirements.

24.6.6 Viewing derivation

There are two ways that a user should be able to see derivation relationships. First, when looking at any one requirement, the user should be able to see what requirements this one is derived from directly, and what requirements derive directly from this one.

Second, good requirement management tools will provide a graphical view of the derivation graph, which lets the user see multiple levels of derivation at once. The graph is typically mostly a tree or DAG, but there are legitimate reasons that the graph will sometimes have cycles (between peer components, for example).

Here is an example showing how a top-level requirement is the source for a number of other requirements.

undisplayed image
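The two views described in this section can be sketched with a small data structure. This is only an illustration under assumptions: the `DerivationGraph` class and the requirement identifiers (`SC-1`, `ACS-1`, and so on) are hypothetical names loosely based on the spacecraft example, and a real tool would store far more than identifiers.

```python
from collections import defaultdict


class DerivationGraph:
    def __init__(self):
        self.parents = defaultdict(set)   # requirement -> what it derives from
        self.children = defaultdict(set)  # requirement -> what derives from it

    def derive(self, parent, child):
        """Record that `child` is derived from `parent`."""
        self.parents[child].add(parent)
        self.children[parent].add(child)

    def direct_view(self, req):
        """First view: the direct parents and direct children of one requirement."""
        return sorted(self.parents[req]), sorted(self.children[req])

    def tree_view(self, req, depth=0, seen=None):
        """Second view: an indented, multi-level listing that tolerates cycles."""
        seen = set() if seen is None else seen
        line = "  " * depth + req
        if req in seen:
            return [line + " (already shown)"]
        seen.add(req)
        lines = [line]
        for child in sorted(self.children[req]):
            lines.extend(self.tree_view(child, depth + 1, seen))
        return lines


graph = DerivationGraph()
graph.derive("SC-1", "ACS-1")         # pointing -> slew accuracy
graph.derive("SC-1", "ACS-2")         # pointing -> slew rate
graph.derive("ACS-1", "RW-1")         # slew accuracy -> reaction wheels
graph.derive("ACS-2", "RW-1")         # slew rate -> reaction wheels
graph.derive("SC-COMM", "GS-COMM")    # peer components: a two-way
graph.derive("GS-COMM", "SC-COMM")    # relationship, hence a cycle

print(graph.direct_view("ACS-1"))
print("\n".join(graph.tree_view("SC-1")))
```

The `seen` set is what keeps the multi-level view from looping forever on the cycle between the two peer requirements.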

24.7 Advanced requirements

All the requirements discussed so far are simple requirements. Simple requirements have a single, clearly specified subject component. Each simple requirement expresses one property about that subject that must be true.

Simple requirements are not sufficient to express every need that real systems encounter. There are two more complex kinds that we have seen many times: requirements on sets of components, and requirements for standards.

24.7.1 Sets as subjects

Consider a system where all code is expected to adhere to a published coding standard. The implied requirement does not apply to any single component; it applies to all of them that include software.

This expectation can be written as a top-level requirement on the system as a whole:

All subcomponents of <the system> that include software shall adhere to the XYZ Coding Standard.

The subject of this requirement is the set of all software components in the system. The property is that their implementation adheres to the named coding standard.

This kind of requirement is placed on the top-level system, and then each first-level subcomponent includes a derived requirement that propagates the requirement downward:

All subcomponents of component X that include software shall adhere to the XYZ Coding Standard.

A component Y that has software as part of its implementation then has:

The software in component Y shall adhere to the XYZ Coding Standard.

If component Y has subcomponents, Y should also have a second requirement that continues to pass the requirement down to Y’s subcomponents.

This is an example of a general technique:

24.7.2 Writing for standards

Many texts on requirements approach the subject from an assumption that there is one system being built: these are the requirements for System X. System X will be built in its entirety as specified; any and all requirements must be satisfied.

Writing standards is a different problem. A standard is specifying requirements on multiple hypothetical systems that may exist at some point. Those systems will not be identical, but the systems that adhere to the standards must adhere to the requirements in the standard.

Standards often provide options. The standard has a set of optional features. If the system chooses to implement those features, the features must conform to the standard. However, the system does not have to implement those features. This means that the system does not have to satisfy every requirement in the standard.

Some standards also present best practices. For some feature, it is recommended that the feature conforms to a part of the standard, but it is not absolutely required to do so.

The vocabulary of “shall” or “must” does not accommodate these situations well. The Internet Engineering Task Force (IETF) has defined a richer set of requirement modes. For example:

MAY. This word, or the adjective “OPTIONAL”, means that an item is truly optional. One vendor may choose to include the item because a particular marketplace requires it or because the vendor feels that it enhances the product while another vendor may omit the same item. An implementation which does not include a particular option MUST be prepared to interoperate with another implementation which does include the option, though perhaps with reduced functionality. In the same vein an implementation which does include a particular option MUST be prepared to interoperate with another implementation which does not include the option (except, of course, for the feature the option provides.) [BCP14]

The words used to indicate these more complex conditions must be defined just as carefully as “must” or “shall”, and must be used consistently.
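One way to see why the modes need careful definitions is to sketch how a conformance check might treat them. The following is a simplified illustration, not part of BCP 14: the tuple layout for a standard's requirements, the option names, and the function `check_conformance` are all assumptions made for this example, and only three of the BCP 14 key words are modeled.

```python
from enum import Enum


class Mode(Enum):
    MUST = "MUST"
    SHOULD = "SHOULD"
    MAY = "MAY"


def check_conformance(standard, implemented_options, satisfied):
    """Classify unmet requirements of a standard for one system.

    standard: list of (req_id, mode, option), where option is None for a
        requirement that always applies, or the name of an optional feature.
    implemented_options: optional features the system chose to implement.
    satisfied: ids of requirements the system has been shown to meet.
    """
    violations, advisories = [], []
    for req_id, mode, option in standard:
        if option is not None and option not in implemented_options:
            continue  # the system did not choose this optional feature
        if req_id in satisfied or mode is Mode.MAY:
            continue
        if mode is Mode.MUST:
            violations.append(req_id)   # conformance fails
        else:
            advisories.append(req_id)   # recommended practice not followed
    return violations, advisories


standard = [
    ("R1", Mode.MUST, None),
    ("R2", Mode.MUST, "compression"),  # applies only if the option is chosen
    ("R3", Mode.SHOULD, None),
    ("R4", Mode.MAY, None),
]

print(check_conformance(standard, set(), {"R1"}))
print(check_conformance(standard, {"compression"}, {"R1"}))
```

A system that omits the hypothetical compression option is not charged with R2, which is the sense in which a system does not have to satisfy every requirement in a standard.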

24.8 Analyzing requirements

Many people think of requirements only as a contract for guiding implementation and a checklist for performing verification tests later. However, requirements—along with other specifications—are useful in themselves for helping build a design and making sure the design is good.

There are four kinds of analysis that systems engineers do on the requirements themselves:

  1. Ensuring that the requirements (and specification) are complete
  2. Ensuring that the design is minimal, meaning that the design only contains features that it actually needs and nothing extraneous
  3. Ensuring that the requirements are consistent
  4. Understanding the effects of making a change to one part of a system design

These are all analyses that should be done on the specifications of a system, including the requirements, and not delayed until implementation. Some of these tasks are easier to perform on the abstracted and simplified view of the system that specifications give. Performing these tasks before implementation will reduce the amount of re-implementation needed when one finds that the requirements aren’t sufficient or minimal.

24.8.1 Complete design

The expectation is that if a system is built to conform to its specification, including requirements, the system will do the job that its users need and do it correctly. (Of course, this assumes that the top-level specifications are themselves a correct and complete record of the users’ objectives; we discuss this more in the section on validating requirements below.)

To meet this expectation, the system’s requirements need to be complete and correct. This means that when one looks at any given top-level requirement, one can trace out the features on other components that will be used to implement the requirement and argue that those features will combine correctly to produce the desired result.

There are two parts to this analysis:

Having tools that allow one to view parts of the derivation graph in visual, graphical form is invaluable to performing this analysis.

Consider an example. A UAV (drone) is supposed to receive and process commands from an operator on the ground. This leads to requirements:

undisplayed image

These requirements are not complete, because they leave out a critical step: when a command is sent from the ground operator to the UAV, the message first goes to the transceiver. The transceiver extracts the message, and then sends the message to the command and data handling component. The example omits the part about the transceiver and command handler passing information to each other. This means that one could build an aircraft that had a radio and had a flight computer, but the two would never talk to each other. Obviously, the UAV would not be acting on commands with that design.

This leads to a more complete set of requirements:

undisplayed image

In the example, the communication between the transceiver and command handling components should be documented in some other specification for the UAV, perhaps an activity diagram showing how commands flow through components. The requirements then need to be checked against these other parts of the specification to make sure that all of the functions in each of the steps are reflected in the functions each component is required to implement.

Sometimes determining whether a set of requirements is complete or not will require further analyses. As a simple example, the maximum mass for an aircraft might be X kg. Making sure that the aircraft’s overall mass comes in under that limit means enumerating all the components in the aircraft that have mass, adding up their mass, and determining that the result is below X kg. For that analysis to be complete, it cannot leave out, say, the mass of the motors; all components must be considered.
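The mass example can be made concrete with a short sketch. The component names, masses, and the 9 kg limit below are invented for illustration; the point is that the check reports components with no recorded mass instead of silently leaving them out of the sum.

```python
def check_mass_budget(components, limit_kg):
    """components maps a component name to its mass in kg, or None if unknown."""
    missing = sorted(name for name, mass in components.items() if mass is None)
    total = sum(mass for mass in components.values() if mass is not None)
    return {
        "total_kg": total,
        "missing": missing,
        # The budget is only shown to be met if every component was counted.
        "within_limit": not missing and total <= limit_kg,
    }


aircraft = {"airframe": 4.0, "battery": 2.5, "motors": None, "payload": 1.0}
print(check_mass_budget(aircraft, limit_kg=9.0))

aircraft["motors"] = 1.2
print(check_mass_budget(aircraft, limit_kg=9.0))
```

In the first call the partial total is under the limit, yet the check does not pass, because the mass of the motors has not been accounted for.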

As a more complex example, a system might have a maximum acceptable failure rate target. Being able to argue that the system is reliable enough involves performing a fault tree analysis, enumerating all the ways that failures in components can lead to system failures. The analysis cannot leave out components and be complete; nor can it leave out some failure modes of some of those components.

Checking whether the design is complete is not a simple task that can be performed just by inspecting the graph of requirements. The analysis is helped by being able to see the requirements, but it requires imagination and effort to actually check the result.

XXX Sidebar: relationship to Goal Structuring Notation and safety cases

24.8.2 Minimal design

Every feature and every requirement on a component should have a reason for being there.

At the top level, for the system as a whole, only features that address customer needs or business objectives should be included. At lower levels, the only requirements that should be placed on components should be ones that are actually needed to make the system work properly, meaning that the system meets those top-level objectives.

Tracking the purpose for every requirement

The derivation relationships between requirements encode the reasons for a requirement to exist. This leads to a condition that should hold across all requirements:

Every requirement for a system and all its components should derive from one or more customer or business objectives

This is straightforward to check using the derivation graph: every requirement should derive from at least one parent requirement, and it should be possible to trace upward through the derivations to reach a customer or business objective.
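A minimal sketch of that check follows. It assumes the derivation links are available as a mapping from each requirement to the requirements or objectives it derives from; the identifiers (`OBJ-1`, `STORE-7`, and so on) are hypothetical.

```python
def untraceable(requirements, derives_from, objectives):
    """Return the requirements from which no objective can be reached
    by following derivation links upward."""

    def reaches_objective(req, seen):
        if req in objectives:
            return True
        if req in seen:          # guards against cycles between peers
            return False
        seen.add(req)
        return any(reaches_objective(parent, seen)
                   for parent in derives_from.get(req, ()))

    return sorted(r for r in requirements if not reaches_objective(r, set()))


objectives = {"OBJ-1"}
derives_from = {
    "SYS-1": ["OBJ-1"],
    "COMP-1": ["SYS-1"],
    "STORE-7": [],           # placed on a component with no derivation set up
}

print(untraceable(["SYS-1", "COMP-1", "STORE-7"], derives_from, objectives))
```

The check only flags candidates. Whether a flagged requirement should be linked, removed, or used as a clue to a missing higher-level requirement remains a judgment call.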

Often while requirements are being developed, a requirement will be placed on some component without setting up the derivation. This requirement will not have a parent, and so the checking method will flag it. But what to do then?

In most cases, there was a good reason that someone wrote that component requirement. When one finds a requirement that is not documented as supporting some higher-level reason, it is worth exploring why that requirement is valuable. In some cases, the parent requirement(s) are present, and the requirement just needs to be linked to them. In other cases, the requirement can be a clue that there is some higher-level principle that the writer had in mind, and that higher-level principle should be added into the requirements higher up in the system.

For example, consider a data storage component where an engineer placed a requirement that all data be stored in an encrypted form. As written, that requirement doesn’t derive from any other requirement. But why did the engineer believe that encryption was necessary?

One answer is that encryption isn’t necessary. In that case the encryption requirement can be removed. Another answer is that the engineer wrote that requirement because they believed that the component would be storing confidential data that should be protected against disclosure. In that case, it is worth checking: does the system have requirements—or business objectives—about protecting confidential data? If not, then this exercise will have found a topic that has not been adequately addressed, and new requirements need to be added to make a correct specification. Those requirements should be added throughout the system, and the requirement we started with should show that it derives from those new features.

Many such requirements result from external standards that are supposed to be met, such as regulatory, safety, or security standards. Those standards should be included in the external objectives for the system, and their requirements should flow down through the system to the components where the standards apply. This produces a record of how the system’s design complies with those standards.

Finding unnecessary requirements

Some requirements that show how they are derived from some parent requirement are still not actually necessary.

There is no simple, mechanical way to find these unnecessary requirements. However, the analysis used to determine whether a collection of requirements is complete is also useful for finding these unneeded requirements.

Consider this example:

undisplayed image

The requirement about encryption is not actually needed for the system in question. That is because the connection between the transceiver and command handling components is physically contained within the UAV, and the physical encapsulation provides enough security to protect the messages passing between the two. The encryption requirement can be removed with no loss of capability.

However, in this example, the engineer who wrote the encryption requirement had a good idea but expressed it wrongly. The engineer understood that the integrity of communication between the two components was important; a command that was properly received but garbled in being sent to the command handling component could be a problem. The encryption requirement should be replaced by a less costly requirement: that the channel must protect the messages it carries against corruption.

24.8.3 Consistency

A body of requirements is consistent when the requirements don’t contradict each other. If requirements do contradict each other, the system as specified isn’t implementable and the specification needs to be fixed.

Broadly speaking, there are three kinds of consistency that one should check:

  1. Consistency among requirements for one component
  2. Consistency between requirements on either side of an interface between two or more components
  3. Consistency between requirements on a higher-level component and the requirements on subcomponents that should show how the higher-level requirements will be implemented

As long as requirements are written as text, and not in a formal notation, consistency checking will be manual. It involves reading through each requirement, finding other requirements that address related topics, and checking that they are consistent with each other.

Some inconsistencies are fairly easy to detect. If one requirement says component X shall be blue and another says component X shall be red, it’s obvious—one must just read through all the requirements on component X and see that two requirements both deal with the color property and they say opposing things.

Other inconsistencies are harder to spot because they do not use the same language in the properties they are specifying. As an example, one requirement might say component X shall use encryption algorithm Y while another requirement says component X shall use protocol standard Z. If protocol standard Z allows encryption algorithm Y, this is fine. But if the standard does not allow that particular encryption algorithm (perhaps because the algorithm is outdated and no longer considered secure enough) then there is an inconsistency.

Another class of inconsistency comes from the states a component can take on. Elsewhere in the specification of a component, there should be a definition of the state machine that the component is supposed to follow. The requirements translate that state machine into individual actions that the component is expected to take in response to particular inputs. It is easy—especially when editing or updating the component’s specification—to have two requirements: when condition A occurs, component X must transition to state Y and when condition A occurs, component X must transition to state Z. The inconsistency can be more subtle, such as leaving out some transition, or using inconsistent definitions of the condition that causes the transition. This class of problem can be addressed by having a single, clear definition of the state machine the component is expected to follow, and then checking the requirements against the state machine.
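Checking requirements against a single state machine definition can be sketched as follows. The representation is an assumption made for illustration: each transition requirement is reduced to a tuple of identifier, source state, condition, and target state, and the state machine is a mapping from (state, condition) to the next state. The state and condition names are hypothetical.

```python
def check_transitions(requirements, state_machine):
    """Compare transition requirements with the defined state machine.

    requirements: list of (req_id, state, condition, next_state).
    state_machine: dict mapping (state, condition) -> next_state.
    """
    problems = []
    first_seen = {}
    for req_id, state, condition, next_state in requirements:
        key = (state, condition)
        if key in first_seen and first_seen[key][1] != next_state:
            problems.append(
                f"{req_id} conflicts with {first_seen[key][0]} on {key}")
        first_seen.setdefault(key, (req_id, next_state))
        if state_machine.get(key) != next_state:
            problems.append(
                f"{req_id} disagrees with the state machine on {key}")
    for key in state_machine:
        if key not in first_seen:
            problems.append(f"no requirement covers transition {key}")
    return problems


state_machine = {("idle", "A"): "Y", ("Y", "done"): "idle"}
requirements = [
    ("R1", "idle", "A", "Y"),
    ("R2", "idle", "A", "Z"),   # same condition, different target state
]

for problem in check_transitions(requirements, state_machine):
    print(problem)
```

The sketch reports three kinds of problem: two requirements that contradict each other, a requirement that contradicts the state machine, and a transition that no requirement covers.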

Finally, another class of inconsistency that can be hard to detect has to do with timing. Two requirements can impose timing constraints that cannot both be satisfied. For example:

When event A happens on component X, event B must happen within 10 milliseconds

When event C happens on component X, event B must happen no sooner than 15 milliseconds later

Component X must perform the events A, C, and B in that order

There is no way for component X to meet the timing requirements given the order that events must occur. Building a timing model of the component in question, and performing a timing feasibility analysis using that model, can help find this kind of inconsistency.
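A timing feasibility analysis of this kind can be sketched as a check on difference constraints, where each constraint has the form "time of v minus time of u is at most w" and the set is infeasible exactly when the constraints form a negative cycle. The technique used here is Bellman-Ford negative-cycle detection. The constraint set is an illustrative assumption: event B within 10 ms of A, event B no sooner than 15 ms after C, and the events in the order A, C, B.

```python
def timing_feasible(events, constraints):
    """constraints: list of (u, v, w) meaning t[v] - t[u] <= w (milliseconds).

    Returns True if some assignment of event times satisfies every constraint.
    """
    # Starting every distance at zero acts as a virtual source node
    # connected to all events.
    dist = {event: 0.0 for event in events}
    for _ in range(len(events)):
        changed = False
        for u, v, w in constraints:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            return True    # relaxation converged: no negative cycle
    return False           # still changing after len(events) passes


events = ["A", "B", "C"]
constraints = [
    ("A", "B", 10),    # B happens within 10 ms of A
    ("B", "C", -15),   # B happens no sooner than 15 ms after C
    ("C", "A", 0),     # A happens before C
    ("B", "C", 0),     # C happens before B
]
print(timing_feasible(events, constraints))

relaxed = [("A", "B", 20)] + constraints[1:]   # allow 20 ms instead of 10
print(timing_feasible(events, relaxed))
```

In the first set, B must be at least 15 ms after C and C is no earlier than A, so B cannot be within 10 ms of A. Loosening the first bound to 20 ms removes the contradiction.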

This is by no means an exhaustive list of the kinds of inconsistency one must look for.

24.8.4 Effects of changes

Systems change. This can happen because customer needs change, or because technology changes, or because someone has found a better design for part of the system. A good development process supports constant evolution and change of the design and implementation of a system.

Not every change that is proposed will be performed. When someone proposes a change, someone else will analyze the proposal to determine the effects of the change. Based on this analysis, people may decide to go ahead, postpone the change, or not make the change.

The analysis must accurately determine:

This analysis makes use of all the specifications in the system, but requirements are a major contributor. In particular, the derivation relationships help show how component features depend on each other, and thus help guide an analysis of how far some change will spread.

Effects of top-level changes

Top-level changes include adding a new feature to the system, removing a desired feature, or changing a standard or other external source of requirements.

If the change changes a top-level requirement, look at the derived requirements from that changed requirement and see if the derived requirements are still necessary and sufficient to satisfy the newly-changed requirement. If they are, then no further action is needed. If they are not, then the derived requirements must be revised, possibly adding or removing some of them. The process then needs to repeat with these changed derived requirements. If the change affects a requirement that supports a different top-level requirement, then one must check that the other top-level requirement is still satisfied by the changed derived requirements.

If the change adds a new top-level requirement, work out what derived requirements are necessary and sufficient to satisfy the new requirement. Look for lower-level requirements that already exist that can also support the new requirement. This may involve a change in design, not just requirements; this will cause more changes to propagate out.

If the change removes a top-level requirement, see if any lower-level derived requirements are no longer needed or can be relaxed. If so, work downwards to propagate the effects of those changes.

Effects of lower-level changes

Many more changes will come to lower-level components in the system. There are many reasons this can happen: because people have found that a design in process is infeasible or too costly; because a vendor’s part specification or availability has changed; or because someone has found a better design for some lower-level component.

Evaluating a lower-level change involves all the checks for a top-level change above, along with the need to see how the change will affect higher-level requirements. Will the change leave the higher-level requirement unsatisfied? Will this change make some other sibling requirement redundant (that is, the parent is satisfied without the sibling)?

Tracking down these effects is much easier if the derivation relationships among requirements are accurate.

Tools

Good tools help the process of evaluating changes. There are three features in particular to look for:

  1. The ability to create an independent working version of the requirements, in order to try out changes before committing them to a baseline. The ability to see what has changed between the baseline and the working version, and to selectively merge changes into the baseline, allows reviewers to understand the full effects of the change and to accept changes accurately.
  2. A feature to mark some requirements as potentially changing and others as needing evaluation. This feature helps ensure that the change evaluation does not miss some important change.
  3. The ability to record a rationale for a derivation relationship between requirements helps the people evaluating changes determine why a set of derived requirements was considered necessary and sufficient.
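The marking feature in the second item can be sketched as a walk over the derivation graph. This is a rough illustration under assumptions: the function `change_impact`, the mapping from each requirement to the requirements derived from it, and the identifiers are all hypothetical, and a real tool would present the marks for human evaluation instead of deciding anything itself.

```python
from collections import defaultdict, deque


def change_impact(changed, derived):
    """Find requirements to re-evaluate when `changed` is modified.

    derived: dict mapping a requirement to the requirements derived from it.
    Returns (downstream, upstream): everything derived from the changed
    requirement, and other parents that rely on any affected requirement.
    """
    parents = defaultdict(set)
    for parent, children in derived.items():
        for child in children:
            parents[child].add(parent)

    downstream = set()
    queue = deque([changed])
    while queue:
        req = queue.popleft()
        for child in derived.get(req, ()):
            if child != changed and child not in downstream:
                downstream.add(child)
                queue.append(child)

    affected = downstream | {changed}
    upstream = set()
    for req in affected:
        # Parents outside the affected set must be checked to see that
        # they are still satisfied.
        upstream |= parents[req] - affected
    return downstream, upstream


derived = {
    "TOP-1": ["A-1", "A-2"],
    "TOP-2": ["A-2"],        # A-2 supports two top-level requirements
    "A-1": ["B-1"],
}

print(change_impact("TOP-1", derived))
print(change_impact("A-1", derived))
```

Changing TOP-1 marks its derived requirements for evaluation and also flags TOP-2, because A-2 supports both. Changing the lower-level A-1 flags its parent TOP-1 as well as its own derived requirement.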

24.9 Validating requirements

XXX rewrite this to bring into line with introductory language on deriving verification

Validation is the process of determining whether a set of requirements accurately reflects the needs of the system. This can mean that the system will meet customer needs, or mission needs, or other external objectives.

It is important to keep validation separate from verification, which is discussed below. Validation is about seeing if the requirements (and the rest of the specification) are an accurate reflection of external needs. Verification is about seeing if the implementation is an accurate reflection of requirements. (Some software engineering texts focus validation on consistency, completeness, and similar properties. Systems engineering has generally kept those kinds of checks separate from validating customer or mission satisfaction.)

The validation process starts with checking the system objectives, business objectives, security and safety objectives, and regulatory objectives to see if they are an accurate reflection of the customer or mission needs. Presumably appropriate care was taken while these objectives were being gathered and written down, but mission understandings and desires change over time, and an independent check on the objectives helps avoid discovering problems late, when changes are expensive to make.

At the top level, one should check:

At lower levels, one is checking whether the derived requirements from a parent are necessary and sufficient. The analyses for complete and minimal design, discussed above, cover those checks.

There are many different ways to validate a system’s specifications. They generally fall into two groups: analysis and simulation.

XXX improve language: analysis as formal method vs review as informal

Validation by analysis involves people reviewing the requirements and using their judgment to check the specifications. This can involve performing joint reviews with stakeholders so that they can check the requirements.

Validation by simulation involves stakeholders somehow seeing a model of the system in action. There are many ways to do this. Stakeholders can be invited to define some scenarios that represent how they will use the system, and then try out those scenarios using a model of the system. Some ways we have done this include:

These validation exercises should be completed and the stakeholders should concur that the specifications are correct before one baselines the specifications, including requirements.

24.9.1 Connecting requirements and implementation artifacts

People must be able to navigate from a requirement to its associated implementation artifacts and vice versa. The people implementing a part of a system according to requirements need to be able to quickly and accurately find the requirements that they need to comply with. In the other direction, the people verifying requirements must be able to find the artifact or artifacts that implement a particular requirement.

The approach to organizing systems artifacts that I advocate here, which organizes much of the systems work around a hierarchical component breakdown structure, is designed to meet this need conveniently. The set of requirements that apply to some component are implicitly connected to other specifications and the implementation of that component because they are all organized by the same component names and identifiers.

One can also explicitly label artifacts with component identifiers or requirement ids. For example, verification test specifications are associated with specific requirements, so the test specification needs to be labeled with the requirement ids that it applies to.
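
A minimal sketch of such labeling follows, assuming each artifact carries a list of the requirement ids it relates to. The file paths, requirement ids, and function names are hypothetical. The index supports navigation from a requirement to its artifacts, and a simple check reports requirements that no artifact is labeled with.

```python
from collections import defaultdict

# Hypothetical labels: artifact -> requirement ids it implements or verifies.
ARTIFACT_LABELS = {
    "tests/battery_capacity_spec.md": ["BAT-1"],
    "tests/power_bus_spec.md": ["PWR-1", "PWR-2"],
    "src/battery_monitor.c": ["BAT-1", "PWR-2"],
}

def index_by_requirement(labels):
    """Invert the labels so one can navigate from a requirement to its artifacts."""
    index = defaultdict(list)
    for artifact, req_ids in labels.items():
        for rid in req_ids:
            index[rid].append(artifact)
    return {rid: sorted(artifacts) for rid, artifacts in index.items()}

def unlabeled_requirements(all_req_ids, labels):
    """List requirements that no artifact claims to address."""
    covered = {rid for req_ids in labels.values() for rid in req_ids}
    return sorted(set(all_req_ids) - covered)

by_requirement = index_by_requirement(ARTIFACT_LABELS)
missing = unlabeled_requirements(["BAT-1", "PWR-1", "PWR-2", "SYS-1"], ARTIFACT_LABELS)
print(by_requirement["BAT-1"], missing)
```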

24.10 Verification

Verification is the process of showing that the implementation of the system, or parts of it, complies with the requirements.

Verification involves gathering evidence that every requirement is satisfied by the implementation.

There are four general methods used to verify the implementation’s compliance: inspection, test, demonstration, and analysis.

Inspection is verification by having people review parts of the implementation to check that it complies with a requirement. The inspection review should be performed by people who did not implement that part of the system, so that the reviewers are not misguided by preconceptions (“I’m sure I implemented this correctly”).

Some inspections are particularly simple. Consider a high-level requirement that is the source for a few lower-level requirements. In many cases, the high-level requirement is satisfied when the lower-level derived requirements are all satisfied. In these cases inspection becomes a simple matter of checking that the derived requirements are all satisfied. The rationale associated with the derivation or with the high-level requirement should indicate when this situation applies.

Test and demonstration are similar. Testing is generally more exhaustive, and necessary for lower-level components. A single electronic component, for example, might be operated across all the specified thermal, vibration, and atmospheric environments it must handle. Demonstration is less exhaustive, and used to verify top-level system objectives. A prototype spacecraft radio transceiver might demonstrate that it can communicate with ground stations from a similar orbit to where the final spacecraft system will operate.

Some requirements cannot effectively be verified by test or demonstration, and must be verified using analysis. This occurs when one is verifying a negative condition: the verification must show that the system will not perform some action or be in some condition at any time. Providing evidence of the absence of some condition is a long-standing scientific and engineering problem. Proving the presence of some condition is relatively easy (demonstrate that it happens in one case and that’s sufficient), but showing absence often requires exhaustive search. These verification problems often arise in safety and security requirements, where unsafe failures must be rare (e.g. no more than once in 10^9 operating hours) or a system must resist a class of attacks (showing that no attack of that class will succeed).
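
A back-of-the-envelope calculation shows why testing alone cannot establish a rare-failure requirement. The sketch below assumes a constant failure rate, so that the probability of observing zero failures in T hours is exp(-rate × T), and asks how many failure-free hours would be needed to bound the rate at a chosen confidence level. The function name and the 95% confidence level are assumptions made for this illustration, not part of any requirement discussed here.

```python
import math

def failure_free_hours_needed(max_rate_per_hour, confidence=0.95):
    """Hours of failure-free operation needed to bound a constant failure rate.

    Assumes failures arrive at a constant rate, so that
    P(no failures in T hours) = exp(-rate * T). Requiring this probability
    to be at most (1 - confidence) when the true rate equals the limit gives
    T >= -ln(1 - confidence) / rate.
    """
    return -math.log(1.0 - confidence) / max_rate_per_hour

hours = failure_free_hours_needed(1e-9)   # no more than one failure per 10^9 hours
years = hours / (24 * 365)                # for a single unit running continuously
print(f"{hours:.3g} hours, about {years:,.0f} unit-years")
```

Under these assumptions the answer is roughly three billion failure-free operating hours, or hundreds of thousands of years for a single unit. This is why such requirements are verified by analysis of the design rather than by test.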

Each requirement should have an associated verification specification. The specification should lay out what steps must be taken to determine whether the implementation is correct or not. A verification specification is often complex—many pages of documentation for a three-line requirement.

Verification status is a measure of how well the implementation matches the specification, including requirements. In practice this means how well a version of the implementation complies with a version of the specification, as both implementation and specification evolve over time. This means that, during design or implementation, there is no one single “verification status” that can be tracked: with each new update to the implementation, the verification status changes. Some practitioners and tools make the mistake of tracking verification status only in terms of requirements: which requirements have been satisfied by the implementation? This leads to project management errors when a change is made to the implementation that improves the implementation in one area but causes other parts of the system to go out of compliance—a common occurrence while in the middle of implementation using iterative approaches.
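
One way to picture this is to record each verification result against the pair of implementation version and specification version it was obtained for, and then compare results between implementation versions. This is a minimal sketch; the `Result` record, the version labels, and the requirement ids are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Result:
    impl_version: str
    spec_version: str
    requirement: str
    passed: bool

# Hypothetical verification results for two implementation versions.
RESULTS = [
    Result("impl-7", "spec-3", "PWR-1", True),
    Result("impl-7", "spec-3", "PWR-2", True),
    Result("impl-8", "spec-3", "PWR-1", True),
    Result("impl-8", "spec-3", "PWR-2", False),
]

def status(results, impl_version, spec_version):
    """Verification status for one (implementation, specification) pair."""
    return {
        r.requirement: r.passed
        for r in results
        if r.impl_version == impl_version and r.spec_version == spec_version
    }

def regressions(results, old_impl, new_impl, spec_version):
    """Requirements that passed on the old implementation but fail on the new one."""
    before = status(results, old_impl, spec_version)
    after = status(results, new_impl, spec_version)
    return sorted(rid for rid, ok in before.items() if ok and after.get(rid) is False)

regressed = regressions(RESULTS, "impl-7", "impl-8", "spec-3")
print(regressed)
```

A tool that kept only one pass or fail flag per requirement would still show PWR-2 as satisfied after the change, until someone happened to re-verify it.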

24.11 Limitations of requirements

Requirements have limitations. Writing a good specification for a system means understanding these limitations and addressing them in one way or another.

One limitation is that requirements are written in natural language. Human language is notoriously difficult for pinning down precise meanings, even within a single group of people. Specifications, including requirements, are used to communicate between different groups of people with different outlooks, experiences, and jargon. This makes it hard to write requirements that will be interpreted the same way by all of the people involved.

The limitation of natural language can be partly mitigated using a couple of techniques. One is to maintain a glossary that defines words or phrases that have specific meanings in the specification beyond common understanding. The second is through social cohesion: having enough people from different groups interacting and discussing the system so that they evolve a common understanding of the meanings of things.

Precision is another limitation. Some specifications can be clear and simple in mathematical notation, while they are hard to follow in prose. (Consider expressing Newton’s law of gravitation as an equation versus in prose.)

A third limitation comes from requirements being single statements. Sometimes the specification needs to encode a complex, multistep activity. Each of the steps might be encoded as an individual requirement, but the result is awkward and hard to understand. Sometimes the better answer is to write part of the specification in a different form: a flowchart, a state machine, or a set of equations.

As a result, requirements are only one part of the total specification. They cannot do the entire job of recording the full specification of the artifact in question—but they are often the most flexible way to organize most of the specifications. Be prepared to supplement textual requirements with other kinds of specification to get the whole job done.

24.12 Working with requirements

This chapter has mostly covered what requirements are. This section touches on what one does with them and how they evolve over time.

Requirements will change continuously over the life of a project. The rate of change will be high at the project’s beginning, when the team is trying to sort out what the system should be. The rate of change will increase after the high-level system purpose is sorted out and as the design work proceeds in parallel on different components in the system. The rate will taper off as the design and implementation become more mature, with occasional bumps as people find problems with the specifications, or as stakeholders request changes. Ideally the rate will reach zero when the system is ready to go operational, but even while in use people will find changes they would like to make.

Detailed requirements are expensive to develop and maintain. They encapsulate the complexity of how all the parts of a system are interconnected. They require effort to develop in the first place, involving checking for consistency and feasibility across large parts of the system. Changes later involve even more effort, especially if the changes involve reorganizing specifications that have already been developed.

This leads to a tension: changes will always happen, especially with modern, flexible systems, but the cost incentivizes developing all the requirements at once and then freezing them to minimize the cost of change.

This tension is unavoidable, but there are things one can do to reduce the difficulty.

24.12.1 Supporting the life cycle

The requirements for a system, and indeed all the specifications for the system, grow and evolve over time. The times and ways in which requirements change depend on the development process a project is using. However, all these processes share some tasks in common.

Collaborative development. In some phases of developing the specifications and requirements for a system, there will be many unknowns and the possible specifications will be in constant flux. In periods like this, many people will be involved in writing down possible requirements, often collaboratively. In phases like this, what matters to people is the ability to quickly sketch out some requirements, and the ability to share and collaborate on these sketches.

Incremental change. At other times, when the requirements and specifications are more stable, there will be incremental changes to the requirements. When someone makes a request for a change to the system, a systems person will need to evaluate the effects of that change. The ability to trace out the implications of a change using derivation relationships helps make the analysis process accurate. As the systems person works out the effects of the change, they need to be able to create an independent working version of the requirements where their updates will not affect an official, baselined version of all the specifications.

Baseline. While the requirements and specifications will be in some degree of flux all the time, the people who use those requirements need stability. The most common approach is to designate a version of the requirements as the current stable version, and then control updates to that stable version. The stable version goes by different names in different fields: baseline, release, plan of record, committed version. For the purposes of this document, we use the term baseline.

A project should use a configuration management or version management process to maintain the baseline requirements. There are many tools that automate such processes. The key features needed are that

Review and approval. People will propose updates to the system’s design as a project moves forward. This occurs often at the beginning of a project, as the design goes from vague ideas to concrete specifications; it continues during the life of the project, as stakeholders ask for changes and as engineers find problems or improvements with the current design; and it can continue after a system is released to operation, as people find problems in actual use. These changes will result in specific proposed updates to the requirements. The proposed updates need to be checked before they are accepted and applied to the baseline. Once an update is applied to the baseline, everyone developing the system implementation will need to revise their part of the implementation to match, verification steps will be required, and so on. It is therefore important to control changes to the baseline, to be sure that they are sound and within the project’s scope before committing to them.

Projects generally use a review and approval process to decide whether to apply an update to the baseline or not. In the review part, systems engineers check the updates to ensure they meet guidelines, including consistency, completeness, and minimality. People who will be affected by the update are asked to review the update, to evaluate whether it is technically correct from their point of view and whether the change is feasible. Project managers are asked to evaluate the update to determine whether the change is in scope and whether there are resources to accommodate the change. If all those parties agree, then the update is approved and someone creates a new requirements baseline that incorporates the changes.

Verification. The implementation of the system needs to be verified from time to time to ensure that what is being constructed complies with specifications. Verification can happen at many different times and with different scopes. As someone implements a feature into a component, verification tests can provide immediate feedback to the implementer. In software development, this is related to test-driven development. Regular verification activities can detect whether a change in the implementation in one place has had an unexpected consequence that causes something else to go out of compliance. This is sometimes called continuous integration testing. When a vendor supplies a prototype component, the prototype needs to be verified for acceptance testing. And when the system is believed to be complete, final verification checks are required before the system enters into operation.

24.12.2 Who works with requirements

Many people generate or use requirements during the lifetime of a project. These include:

24.13 Tools

The right tools make working with requirements much easier and more accurate. However, different requirements management tools are designed to support different styles of requirement writing and use, so you need to choose tools that match how you will write, organize, and use requirements.

Here are some questions that can help you evaluate requirements management tools.

People will use the requirements management tools to perform a number of tasks. You should evaluate how well requirements tools support these activities.

Part VII: Design

Chapter 25: Design introduction

14 March 2023

25.1 Purpose

Previous chapters introduced how to work out what a system or a component should do, by determining what the objectives are for it and then turning those objectives into a specification.

The next step is to design the system or component that will fill those needs.

A design for a component provides a simplified model of how the component will achieve the behaviors, qualities, and structure laid out in its specification. The design is not the full details of how it will achieve those things, or a detailed implementation. The design is a plan for how the component will be built, at a high level; it records the high-level decisions about how the component will be implemented without actually being the implementation.

“Design” is an activity that lacks sharp boundaries from other development activities. On the one hand, it responds to the objectives and specifications that have been developed for the thing being built; on the other hand, the act of designing usually reveals gaps in the specifications that lead to feedback that causes people to update the specification. Specification and design proceed recursively as a system is built, where the act of designing one component leads to writing specifications for its subcomponents.

“Design” also lacks a distinct boundary with “implementation”. Indeed, the boundary between the two varies by convention in different disciplines.

25.1.1 Defining “design”

Given the diversity of ways the word “design” is used, we define what we mean in general by the term.

A design is:

A design is not:

In some projects we have used the term “design model” for the design, to emphasize that the design is a simplification and explanation of the most important aspects of the component’s implementation.

25.1.2 Contents of design

There are several kinds of information that should be recorded in a design.

All of this information should be annotated with a rationale for the decisions that led to the particular design.

25.1.3 Purposes of a design

Why should one take the deliberate and separate step of putting together a design for a system or component, rather than just implementing a component directly based on its specification?

For an exceptionally simple component, one can skip design and just implement the component—but the component must be truly simple, completely understandable from its implementation, involving no significant design choices, and with no future need to change the component, for this to pay off in the long run.

The value of an explicit design comes partly from its abstraction and simplification, and partly from being done mostly before putting together the detailed implementation.

Time to reflect. This is perhaps the most important reason to take the time to build a design before implementing a component. Modern systems are deeply interconnected. The design choices for one component have effects not limited to that component, and the design choices must usually reflect the needs that many other components place on the one being designed. It takes time to find and understand all these interdependencies.

Many components can be designed in multiple different ways. It is often useful to spend some time developing multiple design approaches before settling on one of them. In many cases it is useful to carry two or three design approaches forward, because one of them may impose requirements on some subcomponent that are difficult to achieve. That difficulty may not reveal itself until people have proceeded into the specification and design of that subcomponent. Only then may one realize that an alternative design for the original component is better.

Finally, the design needs to support all of the component’s or system’s specification. Rushing through the design increases the likelihood that some essential requirement will get missed, leading to problems later when the component is integrated with others, or when the system goes into operation, and a subtle failure occurs.

Balanced and incremental design. Modern, complex systems involve many different kinds of constraints on components. A component may need to meet all of structural, safety, functional, security, reliability, environmental, maintainability, user interface, and budget constraints to meet its specification and thus to function correctly in the system as a whole.

We have found that focusing too much on any one of these aspects leads to an unbalanced design that does not meet some other aspect. This can lead to repeated partial design followed by redesign after redesign, each time focusing on a different aspect.

The alternative is to consider a little of each aspect at the same time, working to find a rough design that looks like it will be going in a feasible direction for all of these aspects. After there is a rough design, one can go into greater depth on individual aspects with lower risk that the dive into one area will result in not meeting constraints on another aspect.

As one example, reliability and safety often work against each other. The safer choice is often to shut down a component rather than trying to keep it in operation after a failure. Conversely, the redundancy needed to increase reliability increases the complexity of the component, leading to more conditions that could lead to a safety violation.

Guide and explanation. Multiple people will use a design over the course of a project. While one person may develop the first design, others will analyze it for safety or security; still others will review the design for completeness or correctness; one or more people will use it to implement the component; other people will use it to develop and perform verifications. Later, other people will use the design to understand a component that may need a bug fix or feature change.

In other words, the design is for communicating among many different people and over potentially long periods of time, when the people who originally made the design are no longer available to answer questions from their memory.

For all those people who work on the component later, the design provides a guide to understand how the component is organized.

All too often, an engineer is asked to figure out why some existing software component is not working as expected. There is no design, just the source code. The engineer has to try to extract the design from the source code in order to figure out where the component is not behaving as it should. Extracting the design takes time and effort that could be avoided if the design could just be consulted. An extracted design is rarely accurate: the source code does not have a record of where there are subtle, unobvious aspects of the design; nor does it record why the design is what it is. The result is greater cost and time required to update the component, and a higher risk of a change introducing more problems than it fixes.

Decision rationales. A good design includes an explanation of why particular decisions were made. This information helps those who review and analyze the design to determine whether good choices were made. More important, the rationale informs the people who later need to update or redesign the component.

Any electronic board that stays in production for more than a handful of years is likely to run into a situation where some chip is no longer available. The manufacturer has stopped making the original chip X, but another manufacturer is making a chip Y that is supposed to be pin-compatible with chip X. Is it okay to substitute chip Y for chip X? That depends on what it was about chip X that led to it being chosen. If the choice was based only on the basic chip function, the substitution is probably okay. However, if the choice was based on something unobvious, like chip X’s radiation tolerance resulting from a particular lithography technique, chip Y may not be an acceptable replacement. The only way to know that the radiation tolerance was a key part of the decision is if someone wrote down that rationale.

Supporting analysis. Many key component properties, especially those related to safety, security, or reliability, are emergent from the design. It is increasingly evident that these properties are difficult to retrofit into a completed design: they involve the fundamental organization of elements of the design.

This leads to approaches of security-guided or safety-guided design. In these approaches, the security or safety properties are considered from the start and included in the design. As the design progresses from a rough sketch to something more detailed, it can be analyzed with progressively greater accuracy to determine whether these properties are being met.

This approach is relatively inexpensive and easy when it is being done as part of the original design effort. A safety analysis can determine what high-level aspects of a control loop are essential for safe operation; a security analysis can determine what information flow properties must be met to maintain security. These analyses help early pruning of potential design approaches that would not meet safety or security needs.

The alternative is to proceed without including safety or security considerations, then have to go back and work out control or data flow on a more complex design, and then repeat parts of the design process while undoing earlier decisions. Repeating work like this takes more time and effort, and is more likely to result in an implementation that has safety or security flaws.

Alternative designs. In the early stages of designing a complex component, there are likely to be multiple different approaches for the component. The choice among the approaches is often not immediately evident. Which one uses chips that will be available on the needed schedule at the needed quantity? Which one uses a subcomponent that will require significant research to make work? Which one will require a significant up-front investment in acquiring long lead time parts? Which one will be acceptable to regulatory agencies? It may take quite some time and effort to find answers to these issues: prototyping a subcomponent, making legal arrangements with suppliers to find out about availability, and so on.

When there are these kinds of risks in the designs, it is helpful to explicitly keep multiple designs open during the investigations, and to delay investing in detailed implementation effort on any one design that would not be useful if that design turns out not to be feasible.

25.2 How designs are used

As noted above, a design enables communication among multiple people, across different times, and for different purposes.

Developing the initial design. One or more people take the objectives, CONOPS, and specification for a component and eventually produce one or more potential designs for that component.

Developing the design is not a single, monolithic activity. It almost always proceeds incrementally, evolving the design from a rough sketch through multiple ideas that turn out not to be quite right until reaching a design that looks like it will meet the component’s specification. The designers will need to try out multiple ideas along the way, meaning that what they document will need to evolve as they try different approaches.

The process of assembling a design can be characterized as working through each of the elements of the specification, while at the same time matching the specification against the possible building blocks for the component. As a simple example, this might involve matching a specification for an electrical energy storage system to store X mAh of energy against a catalog of available battery products.

Actual component specifications involve multiple aspects, some of which will work against each other. A realistic electrical energy storage system must meet performance specifications such as the amount of storage, maximum safe current, reliability constraints, and a number of constraints related to safety. This leads to the recommendation that a designer consider many specification aspects at once, but only at a high level, before going into greater detail.

In the end, the designers must either show that the design they have created fulfills the corresponding specification, or show that the specification is flawed in some way and feed that information back to the people responsible for the specification to get it changed.

Tracking alternative designs. There are usually many ways to design some component, with pros and cons to each. Early in design, there may be multiple promising approaches that require more investigation before a decision can be made among them.

This means that each of the alternatives needs to be documented, along with the investigations needed for each of them, until a decision can be made. It must also be clear to everyone working with the alternatives which one is which. When one alternative is selected, that choice must be clear to everyone working with the designs.

Evolving a design. Every design will evolve, both during the initial system development and over time as the system is used or upgraded or fixed. Any change to the design needs to be evaluated for its scope, its effects, and its correctness.

Evaluating scope and effects means determining what effects the change will have in addition to the specific change being considered. A change in one part of a component might affect some safety property of the component as a whole, for example. A change might also affect some behavior or structure that some other component depends upon, possibly indirectly across multiple intervening components. Substituting one chip for another in a board design might change the timing of some signal, which leads to a subtle change in the sequence of operations performed by software on another board, which in turn invalidates a monitor watching for faults.
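
The indirect effects in this example can be traced mechanically if the dependencies between design elements are recorded. The sketch below is a plain breadth-first traversal over a hypothetical "depends on" graph built from the chip substitution example; the element names and the `affected_by` function are illustrative only.

```python
from collections import deque

# Hypothetical dependency records: element -> elements that depend on it.
DEPENDENTS = {
    "chip": ["board-A signal timing"],
    "board-A signal timing": ["board-B software sequencing"],
    "board-B software sequencing": ["fault monitor"],
    "fault monitor": [],
}

def affected_by(changed, dependents):
    """List everything reachable from `changed`, nearest dependents first."""
    seen = {changed}
    order = []
    queue = deque([changed])
    while queue:
        element = queue.popleft()
        for dependent in dependents.get(element, []):
            if dependent not in seen:
                seen.add(dependent)
                order.append(dependent)
                queue.append(dependent)
    return order

to_review = affected_by("chip", DEPENDENTS)
print(to_review)
```

The traversal only finds effects along dependencies that someone has written down, which is one reason to record relationships between components explicitly in the design.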

Evaluating correctness involves checking that any analyses done on the previous design to show that safety, security, or other properties hold either continue to hold or that the analyses can be adjusted to show that the updated design still meets those criteria.

Analyses. Complex systems will have a number of properties they must exhibit to be correct. These include safety, reliability, and security properties; they also include meeting business objectives and other more mundane properties.

Safety- and security-guided design methods involve incrementally building up these analyses as design progresses, so that a simple, preliminary analysis can provide input to an evolving design.

When a design is believed to be complete enough to select and baseline, it will need review to ensure that it meets all of its specification. Part of this review involves checking the analyses that show the design is compliant. The reviewers need to have the analysis in order to check it.

When a change is being made to a component’s design, the analyses provide a starting point for analyzing the effects of the changes to check that the safety, security, or other properties will continue to hold if the change is made.

Generate specifications for lower-level components. The choice of what subcomponents will be part of a component is a major part of the design effort. The choice of subcomponents means that the role each subcomponent will play has to be worked out; this amounts to developing a specification for each subcomponent.

A subcomponent’s specification is a reflection of the component design. The subcomponent will only work properly as a part of the component if it meets that specification. This leads to the layering principle discussed earlier.

Navigating through the system. Many people will need to find things in the system over time—developers, reviewers, auditors, and many others. Virtually none of them will come in with a complete understanding of the system and its structure, so they will need a guide that helps them learn the structure of the system and to find where some behavior or feature is implemented.

The system design can support such users in three ways. First, the design can provide the breakdown structure, showing how the system is divided into components, those into subcomponents, and so on. The breakdown structure also groups related components together, so that a user can narrow down where they are looking. Second, the design can show how components are related to each other. If one component in one part of the system is providing feedback signals to a component in a different part of the system, making these relationships explicit provides a way for a user to trace out these interactions. And third, including explanations or rationales for why the design is the way it is helps educate the user about subtleties that are not going to be apparent from just reading about the structure, interactions, or behaviors.

Guiding project management. As the design progresses, there will be more components to design than there are people to work on them, and some components will be ready to implement or verify. Project management must make decisions about where to put effort.

Project managers will need information like how risky some potential component designs are, as opposed to those component designs that are fairly certain and thus reasonable to implement. They will need to know which component designs have significant uncertainty, and will benefit from investing resources to prototype a potential design.

These decisions benefit from information that can be gathered and maintained in the overall system design, such as:

Progress tracking. Project management needs to be able to track the development progress of different parts of the system, in order to determine whether a project is on track for completion or is having problems that need to be addressed.

Being able to name each of the components that need to be developed, and being able to determine the development progress on each of them, enables project tracking.

25.2.1 Design leading into implementation

As well as all the uses listed above, the developer uses the design as a guide for the implementation. The resulting implementation must be consistent with the design: having the same structure and behavior, including all the functions in the design, and including no functions not in the design.

The developer or implementer must be able to understand the design to build a component that matches the design. The developer must also be able to check that they understand the design properly, so that there is a way to catch misunderstandings. A good design uses consistent structure, terminology, and diagrams to aid understanding. It provides a glossary of terms that may have multiple meanings to define how they are used in the design.

Developers will find problems with the design as they proceed through implementation. They may find ambiguities, where the design is unclear or where the design does not address some important condition. The developer may find errors, where the design is inconsistent internally or with its specification. The developer may find that parts of the design aren’t feasible to implement. All of these problems need to be fed back to designers for clarification or correction.

When the design changes, the developer needs to be able to identify what parts of the design have changed so they can change the corresponding implementation. The change might come in response to feedback from the developer, or evolution of the design to address changing needs or broader system fixes. This can be supported by using tools that track design versions and highlight design changes between versions.
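As a minimal sketch of what such tooling does, the following uses Python's standard difflib module to highlight what changed between two versions of a component's design text. The design text, the version labels, and the component names (including sc.eps.ctrl) are hypothetical and only for illustration; a real project would more likely rely on its version control system or design tool for this.

```python
import difflib


def design_changes(old_text: str, new_text: str,
                   old_label: str, new_label: str) -> list[str]:
    """Return unified-diff lines describing what changed between two versions."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile=old_label,
            tofile=new_label,
            lineterm="",
        )
    )


# Hypothetical design text for an illustrative battery component.
V1 = """Component: sc.eps.batt
Capacity: 40 Wh
Interface: reports state of charge to sc.eps.ctrl
"""

V2 = """Component: sc.eps.batt
Capacity: 60 Wh
Interface: reports state of charge to sc.eps.ctrl
Interface: reports cell temperature to sc.eps.ctrl
"""

if __name__ == "__main__":
    # Lines starting with "-" were removed; lines starting with "+" were added.
    for line in design_changes(V1, V2, "design v1", "design v2"):
        print(line)
```

The developer can read the added and removed lines as a worklist of implementation changes to consider.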

Finally, the developer must be incentivized to follow the design (or provide correction feedback) as they implement the component. This includes having the designer and independent people review the implementation to compare it to the design. If they find that the design and implementation are not consistent, they must decide on how to change the design, the implementation, or both in order to achieve consistency. The component implementation should not be accepted as complete until they match.

25.3 Design artifacts

The artifacts that record the design enable all the usage cases listed above. The key functions they need to fill include:

25.3.1 Supporting infrastructure

The designs for a system need to be available to everyone associated with the project, so that they can use the design to learn about the system and navigate through it.

An ideal solution provides a “single source of truth”: a user can go to one place and see all of the information about the system. The ideal solution also ensures that the user always sees a single consistent version of all the information. To the best of our knowledge, at present there are no systems that completely meet this ideal. However, there are ways to come close by integrating multiple tools and applying conventions to how they are used.

The infrastructure for maintaining designs needs to, at minimum:

25.3.2 The artifacts

The following sections list the key artifacts that should be part of a design. Later chapters will detail these artifacts.

Breakdown structure

The breakdown structure consists of the hierarchical relationship of system, components, and their subcomponents recursively. It gives a name or identifier to each component, and provides the index or table of contents to the parts that make up the system.

Control structure and other large-scale behaviors

A complex system will have behaviors or structures that cross multiple parts of the system, and don’t neatly fit within a single hierarchy of components. There are two important examples of these behaviors to document.

The first example is behavior or activity sequences that show how different parts interact with each other. These are sometimes documented as UML or SysML activity diagrams, which show how control or data pass among components, and how different components take actions in response to those. The point of these patterns is to show how components work together, which informs the interfaces, actions, and states that the components involved in the activity must support.

The second example is the hierarchies of control that operate in the system. These document how one part of the system controls the functions in other parts, including how some components provide sense data to drive the control logic, and how the control logic in turn sends commands to other components to effect control actions. Documenting and analyzing these control systems is an essential part of some safety and security process methodologies, such as STPA [Leveson11].

Details of each component

Each component in the system should have its own design. This is the primary content about individual components, as opposed to how components work together.

A component’s design can be represented in many different ways. However, it is easiest for users if all designs follow the same general format so that they know how to find particular kinds of information within every design.

All designs should include:

Each component’s design should include rationale: the reasons why different design choices were made. This information helps those who must come along later to review or update the design.

In some cases, not all of this information can be represented in one way or in one tool. For example, for electronics designs the best way to represent some information will be in a CAD drawing that is maintained in a separate tool from the rest of the design information. In these cases, there should be unambiguous references from the main design to the CAD drawing and vice versa, and the versioning in the main design should be reflected in versioning in the CAD tool.

Safety, security, and other analyses

Part of the reason for developing a design—as a simplified model of what will be implemented—is to enable analysis of the design’s essentials. These analyses address whether the design will meet aspects of the component’s specification. These can include safety and security, as well as meeting business objectives, regulatory requirements, performance specifications, or resource budgets.

As we will discuss in the next section, it is recommended practice to develop these analyses incrementally in parallel with the design itself. In this way, a rough analysis of a rough design can provide quick, early feedback that will guide the design toward meeting its specified properties as it is developed in more detail.

These analyses become an important part of the record of a design once complete. They provide an extended rationale for why the design is the way it is. They may be needed to answer to external stakeholders, including regulators or courts of law, when it becomes necessary to provide evidence why the design is acceptable. The analyses also help people who must later evolve the designs to understand both the constraints on what they can change, and where they have freedom to make changes without invalidating the safety or other properties of the design.

25.4 Developing designs

As a matter of principle, the design for a system or component should be done after its objectives and specification are done, and before its implementation. Similarly, the design for the components in a system should proceed top down, starting with the system as a whole and proceeding to lower and lower level components. When the design of one component depends on the design of another, the two should be designed together.

These principles often lead people to conclude that systems should be built using a waterfall-like process, where everything is specified before design, everything designed before implementation, and so on.

Real projects are not so simple. We have never observed a project that actually used such a process, even when they tried to. This is because every complex system we have encountered is not fully and accurately knowable in advance. One can write a set of specifications that turn out to require some impossible component design. One might miss some important system objective when developing the initial system concept because the customer was not able to conceive of system operation until they could see part of the system in operation, or because the customer’s needs change. An initial design may be invalidated because a supplier discontinues an essential part. Some part of the system may require significant investigation or research before one can find a feasible way to approach its design.

All of these situations lead to cases where the specification, design, and implementation of the system do not proceed in a tidy one-way sequence through the waterfall stages. Instead, part of a component’s specification gets worked out, and some tentative design goes ahead using that part of the specification. Or multiple possible design approaches are defined, and then someone proceeds to build simple prototype implementations of two or more of them to compare their feasibility. Or the design for a component must change, leading to a change in implementation. All of these may be happening in multiple parts of the system at once.

At the worst, all this change happening all over a system can lead to chaos where people working on different components are working to incompatible specifications or designs and building parts that will not integrate into a system. Project management may not be able to determine how much progress has actually been made on any part of the system, and thus be unable to detect when there are schedule or resource problems.

Therefore, while the simple waterfall model is not a feasible way to organize the work on a system, there is still a need to organize development work.

25.4.1 Applying general principles, flexibly

The principles we started with are good ideas in general, when used flexibly.

Develop specifications, then design. When one designs a component without first working out what the rest of the system needs that component to do, one usually ends up with a design that doesn’t actually meet needs (once those are worked out). When a specification gets developed, the people involved will tend to look at the effort that has already been spent on designing (and possibly implementing) the component and will try to adjust the specification to fit that sunk cost—after all, that work has already been done, why should it be discarded? Unfortunately this tends over time to produce safety and security problems, and to dramatically increase the cost of the system as people try to integrate the wrong component into the rest of the system.

It is better to explicitly defer some design decisions until the specification is firm—but not avoid doing any design. (Doing no design until specification is done is not possible when the design activity can reveal problems with a specification.) Do a minimal amount of design, bearing in mind the risk that design may need to change as the specification changes, as well as the risk that the specification may need to change as design reveals problems.


Develop design, then implement. Similar to the way design reflects specification, the implementation reflects design. Proceeding with implementing a component before it has been designed is not really possible: doing so means that design is done implicitly and is left unrecorded. This leads to components that fail to meet functional, safety, or security constraints because those constraints have not been properly considered and analyzed before committing effort to implementation.

At the same time, deferring all implementation until all design is complete is a recipe for an infeasible system. It is all too easy to create a design that involves impossible feats of implementation, from requiring metals that do not currently exist (“unobtainium”) to algorithms that have not been invented.

We have found that a middle ground often works well. As we will discuss in future chapters on implementation, we have used a software implementation approach that emphasizes continuous integration (by which we do not mean continuous testing) and skeleton building for implementation, where the implementation proceeds in many small iterations. Using this approach we can build a simplified implementation of the general structure of a component, focusing on those aspects where the design appears either to be relatively certain or where there is higher risk in the design that needs to be checked with a rough implementation.

We have also made a point of prototyping implementations of parts of a design in order to validate whether the design is feasible. We will also discuss prototyping in a future chapter.

There is a high risk with any implementation done before specification and design are solid, even when the implementation is done for good reasons (like prototyping to validate a design approach). The effort spent on implementing something is a sunk cost: it cannot be recovered. As the design evolves, there is a strong incentive to try to continue to reuse the implementation that has already been completed, as the incremental cost or time of modification is almost always perceived to be less than starting a new implementation from scratch. This leads to a sequence of incremental changes, each of which by itself can be perceived as the lower-cost way of handling a design change. However, it is often the case that after a few of these incremental changes, it would have been more cost-effective to throw away the initial prototype or implementation and start over with better information. This sequence of incremental changes also tends to result in an implementation that has many vestiges of implementations that are no longer applicable, but which continue to present a source of bugs, security flaws, or safety problems.

The cost of incrementalism is often apparent only in retrospect. It is also driven by basic business imperatives to minimize cost at each step, or to get features implemented as rapidly as possible. This is an example of an online optimization problem, which is often hard to solve well theoretically and even harder when human incentives are involved. The techniques used to solve similar online optimization problems (notably the ski rental problem) apply. Limiting the amount of implementation effort that may be at risk for incrementalism by deferring as much implementation as possible until the design is solid helps avoid this situation.
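To make the ski rental analogy concrete, here is a minimal sketch of the classic break-even rule applied to incremental changes: keep patching the existing implementation until the cumulative patch cost would reach the cost of starting over, then start over. The function name and the cost figures are hypothetical, and the model is deliberately simplified: it assumes the costs can be estimated in common units and that a rewrite absorbs the pending change. Under those assumptions, the rule never spends more than about twice the rewrite cost, whereas unbounded patching has no such limit.

```python
from typing import Optional, Sequence


def rewrite_point(patch_costs: Sequence[float],
                  rewrite_cost: float) -> Optional[int]:
    """Return the index of the change at which to start over, or None.

    Break-even rule: keep patching while the cumulative patch cost,
    including the next patch, stays below the cost of a rewrite.
    """
    spent = 0.0
    for index, cost in enumerate(patch_costs):
        if spent + cost >= rewrite_cost:
            return index  # cheaper from here on to start over
        spent += cost
    return None  # patching never reached the break-even point


if __name__ == "__main__":
    # Hypothetical costs, in person-days, of four successive design changes.
    patches = [3, 4, 5, 6]
    print(rewrite_point(patches, rewrite_cost=10))
```

In this example the first two patches cost 7 person-days in total; the third would bring the total to 12, past the rewrite cost of 10, so the rule says to start over at the third change.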

Thus we:

Design top down and coordinate the design of interdependent components. Many aspects of a system’s design can only be developed effectively when they are developed from the top down, notably safety and security properties. That is because these properties apply to the system as a whole and are emergent from the designs of the components that make up the system. (See [Leveson11] for an in-depth discussion of this effect.)

However, designing from the top down creates risk, similar to the previous principles, that a high-level design may create unachievable specifications for lower-level components. There is also a risk that during high-level design the cost or time involved in developing some lower-level parts of the system is unknown. This can lead to effort being spent, unknowingly, on subcomponents that are simple to design and build while subcomponents that will take far longer to develop are left for later, leading to a drawn-out schedule.

Our recommendation for managing this risk is to sketch the design for multiple layers, creating a rough outline of a design for a component and some layers of its subcomponents, then proceeding to add detail to the high-level component and fleshing out the specification for its subcomponents. Proceeding incrementally in this way allows one to obtain some information about the feasibility and complexity of a particular design approach before committing all of one’s effort to the detail of the top-level component. This approach is similar to our recommended implementation approach of building skeletons or prototypes of components rather than immediately progressing to detailed implementation.

The same issues about the cost of incrementalism apply to top-down design as they do to implementation. It can be useful to make sketch designs that are not in the final form needed, to reduce the temptation to turn sketches that have been changed over and over directly into the design for a component.

25.4.2 Additional principles

Balance design work. We have found that focusing on one aspect of a component’s design to the exclusion of others often leads to dead-end designs, where a work in progress becomes too biased toward one aspect and is not readily evolved as other aspects begin to be considered. Focusing on primary features first, and leaving security or safety for later, is a common example of this pattern.

We have found it more useful to consider many different aspects of a component’s design at a high level, sketching out different rough possible designs and making simple comparisons as we learn about the design problem. This approach has the advantage of investing relatively less effort on detail design and analysis while the design has a higher degree of uncertainty, and focusing effort on those approaches that pass the first simple evaluations.

This approach to design has its pitfalls. Some components’ designs are constrained by particular aspects—such as a need for high performance or the ability to operate in an extreme environment. These aspects are sometimes called design drivers: they have a disproportionate effect on the final design. Recognizing when some aspect drives the design in this way, and putting more effort earlier into understanding these drivers, is part of the art of designing well.

Plan for updates. Nearly every design in a successful system will be updated as time goes by. Over time, the effort spent on these updates will dwarf the effort spent on the initial design. This means that if one is developing a system for the long run, the processes, tools, and artifacts used in the design effort should be organized in a way that supports those who will come along to learn about, evaluate, and redesign parts of the system—long after those who initially designed it have moved on.

This necessitates documenting more than just the structure of the implementation. For these people to understand a design, they need to know the thinking behind the choices that were made and the subtle aspects that are not necessarily apparent from looking at the implementation. These people will need guidance for how components relate to each other. They will need to understand the analyses that determined whether the component’s design was sufficiently safe or secure. This documentation takes more effort than proceeding through a one-time design, building an implementation, and then moving on, but it provides a project with a future.

Making updates effective also involves creating a team structure and human processes that can handle updates. This involves giving the team a clear way to understand how design changes happen, and how to distinguish proposals or work in progress from a design they should work from, or how to determine what design applies to a specific deployed system. It also involves developing a team culture that incentivizes good design and good documentation, giving them enough time to document enough design that their successors can build on their work and avoiding creating unnecessary time pressures that disincentivize people from doing good design.

Use appropriate infrastructure. Finally, effective design relies on having the tools, processes, and standards that people need to do design work. The key principles we recommend include:

Chapter 26: Breakdown structure

9 June 2022

The component breakdown, or breakdown structure, is the way to name and organize all the components that make up a system.

26.1 What is the component breakdown for?

The component breakdown organizes and names all the pieces in the system. It serves three main purposes:

These purposes lead to a few objectives that a breakdown should meet.

26.1.1 Component breakdown versus work breakdown

Some institutions, notably NASA [NPR7120][NASA18] and other parts of the US Federal government [DOD22], specify the use of a work breakdown structure (WBS) in project management and systems engineering. A WBS as used in those projects is different from a component breakdown structure as defined here.

A WBS is oriented toward project management, not systems engineering. It is focused on defining the work to be done (hence the name) rather than the items or components being built by the work. From the NASA WBS Handbook [NASA18, p. 35]:

The WBS is a project management tool. It provides a framework for specifying the technical aspects of the project by defining the project in terms of hierarchically-related, product-oriented elements for the total project scope of work. The WBS also provides the framework for schedule and budget development. As a common framework for cost, schedule, and technical management, the WBS elements serve as logical summary points for insight and assessment of measuring cost and schedule performance.

This difference in intent leads to two major differences in the contents of a WBS compared to a component breakdown. The first is that a WBS includes work items that are not product artifacts. The standard NASA WBS, for example, includes project management, systems engineering, and education and public outreach branches of the work breakdown tree [NASA18, p. 47]. Given that part of the goal of the WBS is to organize resources and budget for a project, that’s an appropriate choice. The other difference is that some people break a task for building a component down into multiple revisions or releases. For example, a “motor control software” component might have subitems “prototype”, “release 1”, and “release 2”, recording the phases of work done to develop that software package.

The component breakdown structure presented in this chapter is narrower in focus than a WBS. The component breakdown lists only the things that are being built. It must be complemented by other engineering and management artifacts to provide everything needed to run a project.

26.1.2 Component breakdown versus other views

The component breakdown is one of several views into the system’s design and specification. The component breakdown has only two purposes: listing all the components and giving them unique names, and providing a structure that people can use to navigate through the components to find one they are looking for.

The component breakdown is not for expressing other facts about components and relationships between them. There are other views and other breakdowns for representing that information—and for doing so in ways that are better suited to the specific information that needs to be explained. For example, a network or wiring diagram does a better job of illustrating how multiple hardware components are connected together. Mechanical drawings are a better way to show how components relate to each other physically. Data and control flow diagrams, perhaps realized as SysML activity and sequence diagrams, are better suited to expressing relationships between software components.

26.2 Basic concepts

When developing a component breakdown, the first question to be settled is: what is a component?

First, a component is something that people think of as a unit. Terms like “system”, “subsystem”, or “module” are all clues that people think of a thing as a unit. More generally, a component is something

Components do not have to be atomic units. Systems have subsystems; components have subcomponents. For example, the electrical power system (EPS) in a spacecraft is a medium-level component in a typical breakdown structure. It is part of the spacecraft as a whole. It is made up of several subcomponents: power generation, power storage, power distribution, and power system control. Each of those subcomponents in turn has constituent components of its own: for example, power generation has solar cells, perhaps arrays that hold the cells, perhaps some other power generation mechanism.

This illustrates the general pattern for the breakdown structure. The structure is a tree, with the highest-level component being the system as a whole. The system as a whole is typically not just a vehicle or box; it is the entire mission or business of which a vehicle is part. Underneath the whole system come the major component systems. For a spacecraft mission, this might be the spacecraft, ground systems, launch systems, and related assembly and test systems. The next level of components consists of the major subsystems. The structure continues recursively until reaching components that are the smallest that are sensible to model using systems tools.

The recursive process of defining smaller and smaller components ends when there is a judgment that further subdivision won’t help the systems engineering process. In practice, for example, continuing the breakdown structure all the way to individual resistors and capacitors on a printed circuit board is too detailed to be useful for systems engineering tasks.

Some criteria I have used for deciding when to continue subdividing a component into subcomponents include:

Some examples:

26.2.1 Satisfying the objectives

  1. Completeness. Completeness depends on carrying the exercise of identifying all the components through to the end. The hierarchical approach does not directly inhibit or support this objective. However, the hierarchical approach makes it easier to approach completeness iteratively: one can start with a high-level breakdown, and incrementally expand parts of the breakdown tree when one finds that some components need to be refined.
  2. Supporting navigation. People generally talk about the structure of systems in a hierarchical way: system and subsystems and components and subcomponents and so on. This means that a hierarchical breakdown structure matches common usage (as long as the refinement into smaller and smaller components follows the common usage).
  3. Usable identifiers. The hierarchical structure does support unique names for each component, as will be discussed later. The identifiers are usually not the most compact possible, because the identifiers reflect a path of names through the breakdown structure tree, similar to the way file systems and URLs organize hierarchical names. However, the hierarchical identifiers in practice have worked well as a readable and writable form of identifiers in other domains, including URLs.

26.2.2 Alternatives

The approach laid out here is fundamentally hierarchical, and reflects the way people usually approach breaking down a complex system—by a reductive approach that organizes parts into a hierarchy.

That is not the only approach to organizing the components. Mechanical and electrical engineering systems often use a more-or-less flat space of part numbers to identify components. The specifications for each part can have attributes, and the attributes allow one to search for a desired part.

A flat part number approach works well for low-level, physical components. A 100 ohm resistor can be used in many different components; there is little value in giving a different name for its use in one place on one board and a different name for a second place on that board, or on a different board. Similarly, when manufacturing many instances of a vehicle, using a part number to identify the part in an assembly works well.

I have generally not used a part number approach for higher-level systems activities, however, because the uses are not the same. During design, each component that systems engineering deals with is generally unique.

26.3 Component identifiers

A component’s identifier provides a unique way to refer to that component. It is like the address for a building: it allows one to find the component (or its specifications), but does not by itself convey much more information. The keys are that the identifier be unique, and that people can use the identifier to find what they are looking for.

The pathname is the long-standing practice for creating identifiers for elements in a hierarchy. This is familiar from file systems and URLs: the path /a/b/c/d refers to a file or object named “d”, which is contained in “c”, which is in turn contained in “b”, which is part of “a”, which is one of the top-level objects or folders in the system. While the object name “d” is not necessarily unique (there can be another object /a/f/d, for example), the path as a whole does give a unique identifier for the object or file.

This approach applies to the identifiers for components in a breakdown structure as well. The names in the path are typically separated by a slash (/) or period (.).

The names of each component in the tree can be abbreviations or short words describing the component. Both work well; the choice is primarily a matter of style. When there are commonly used abbreviations for some components, it is reasonable to mix and match abbreviations and longer names. For example, a spacecraft’s computing system is often called the CDH (command and data handling); attitude control is the ACS (attitude control system); and the electrical system is the EPS (electrical power system).

Some examples from a fictitious spacecraft system:

Abbreviations    Short names
sc               spacecraft
sc.eps           spacecraft.power
sc.eps.batt      spacecraft.power.battery
sc.cdh.fp        spacecraft.cdh.flightprocessor
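The path-based identifiers above can be derived mechanically from the breakdown tree. The following is a minimal sketch, not a description of any particular tool: the Component class and its add, identifier, and find methods are hypothetical names chosen for illustration. Requiring names to be unique among siblings is enough to make every full path unique.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Component:
    """One node in a breakdown structure tree (illustrative only)."""
    name: str
    parent: Optional[Component] = field(default=None, repr=False)
    children: dict[str, Component] = field(default_factory=dict, repr=False)

    def add(self, name: str) -> Component:
        """Add a subcomponent; sibling names must be unique."""
        if name in self.children:
            raise ValueError(f"duplicate name {name!r} under {self.identifier()}")
        child = Component(name, parent=self)
        self.children[name] = child
        return child

    def identifier(self, sep: str = ".") -> str:
        """Build the path from the root of the tree down to this component."""
        parts = []
        node: Optional[Component] = self
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return sep.join(reversed(parts))

    def find(self, identifier: str, sep: str = ".") -> Component:
        """Look up a component by its full path, starting at this node."""
        first, *rest = identifier.split(sep)
        if first != self.name:
            raise KeyError(identifier)
        node = self
        for name in rest:
            node = node.children[name]  # raises KeyError if not present
        return node


# The fictitious spacecraft from the table above, using abbreviations.
sc = Component("sc")
eps = sc.add("eps")
batt = eps.add("batt")
fp = sc.add("cdh").add("fp")

if __name__ == "__main__":
    print(batt.identifier())   # path with the default "." separator
    print(fp.identifier("/"))  # same scheme with "/" as the separator
```

Note that the name "batt" could be reused under a different parent without conflict, just as the file name "d" can appear under both /a/b/c and /a/f.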

Long component identifiers can become a problem. Long identifiers are harder to type than shorter ones. Sometimes there are limits on how long an identifier can be; for example, if one is recording information about components in a spreadsheet and putting each different component on a different sheet, most spreadsheet packages have a limit on how long a sheet name can be.

The length of an identifier is driven by how deeply the breakdown structure tree goes. The path name for a component six layers down in the hierarchy will be much longer than the path name for a component in the third layer. This suggests that one should try not to make the component hierarchy any deeper than it needs to be.
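A simple check can flag identifiers that have grown too long or too deep before they cause trouble in other tools. This is a minimal sketch; the function names are hypothetical, and the default limit of 31 characters is an assumption taken from the sheet-name limit in Microsoft Excel. The last identifier in the example is invented to show a deep path.

```python
def depth(identifier: str, sep: str = ".") -> int:
    """Number of levels in a path-style component identifier."""
    return len(identifier.split(sep))


def too_long(identifiers: list[str], max_length: int = 31) -> list[str]:
    """Return the identifiers that exceed the given length limit."""
    return [ident for ident in identifiers if len(ident) > max_length]


# Identifiers from the fictitious spacecraft, plus one invented deep path.
IDENTIFIERS = [
    "sc.eps.batt",
    "spacecraft.power.battery",
    "spacecraft.cdh.flightprocessor",
    "spacecraft.power.battery.heater.thermostat",  # hypothetical
]

if __name__ == "__main__":
    for ident in too_long(IDENTIFIERS):
        print(f"{ident}: {len(ident)} characters, depth {depth(ident)}")
```

Abbreviated names leave much more room: the same five-level path written with short abbreviations would fit comfortably within the limit.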

26.4 Viewing the breakdown structure

Many people find a visual representation of the breakdown structure helpful for understanding it. Here is a drawing of an incomplete breakdown structure for a simple spacecraft:

undisplayed image

It is worth finding tools that can show this kind of visual representation of the breakdown structure.

26.5 Context and relationships

The breakdown structure provides the fundamental organization for most systems engineering artifacts. This means that the structure chosen for the breakdown will affect how most other parts of a specification are organized.

undisplayed image

Each component named in the breakdown has a specification. The specification includes information like

When two components interact, the interface between them must name which components are involved. The specifications for each component must indicate what data or control they will be sending and receiving in the interaction.

The identifier for a component provides a way to express a reference between implementation and test artifacts, like source code or drawings, and the specifications to which they should comply.

The breakdown structure affects almost everyone working on the project. This includes:

26.6 Advice

26.6.1 Evolution

The understanding of the system evolves gradually from the initial concept to the time that a final product is delivered (if indeed there is a final product). At each step of this evolution, the understanding of what should be in the breakdown structure and how it should be organized will change.

Because the breakdown structure is central to many other processes and artifacts, a change to the breakdown structure will result in changes to potentially many other artifacts. The cost of the change grows as the size of the breakdown structure tree grows.

Don’t try to build an elaborate and complete breakdown structure too early. At the beginning, while still working out the basic concepts of the system and its structure, just sketch out the first level of the structure—and try out several potential structures until one appears to match the system’s objectives. Often the main structure will be suggested by common practice for similar projects: the automobile industry has a common, vernacular breakdown of cars and trucks into common subsystems, for example.

In general, it is best to keep a branch of the breakdown structure shallow as long as there is significant uncertainty about how that part of the system will be designed. In an aircraft, for example, the propulsion system should be left unrefined in the breakdown structure until the team has settled on the general approach to propulsion—will it use turbofans, turboprops, propfans, electric rotors, or some combination? The broad choice can typically be settled early in concept development by working out the concept of operations and determining what capabilities, performance, and physical layout will meet the aircraft’s operational needs. Once the general architecture has been decided, then one can refine the propulsion system by adding a layer of components for each engine or other major unit involved in propulsion.

26.6.2 Depth

The point of the breakdown structure is to help people find and refer to components. The breakdown structure should reflect common ideas of how a system breaks down into components, and should result in short, easy-to-use identifiers. The breakdown structure should focus on these capabilities and not be drafted into serving other purposes.

Consider the breakdown structure for all the sensors that provide information to an autonomous vehicle. One way to organize the sensors is to create a general “sensors” component, and then include all the sensors as children of the general sensors component. Another way is to break the sensors down first by general type (camera, lidar, radar, sonar, microphone), then by general location of the sensor on the vehicle (front, left, right, top, back), and then by the specific sensor unit. The first approach leads to a shallow and broad breakdown structure; the second leads to a narrow and deep structure.

In general, a shallow, broad breakdown structure will meet these objectives better than a narrow and deep structure. There are a few reasons for this.

This leads to a general principle. The breakdown structure should be used only for providing a unique name, and not for embedding a taxonomy or search attributes. The tools that people use to navigate through the breakdown structure and its related artifacts, like specifications, should provide search mechanisms that let someone find a component by attributes. Embedding extraneous information in the name, like a location attribute, model number, or power requirement, will just make the names longer, harder to use, and less resilient to change.
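One way to picture this separation is a small registry in which the identifier carries only the name and everything searchable lives in attributes. This is a sketch under stated assumptions: the `Component` class, the `find` function, the `vehicle.sensors.*` identifiers, and the attribute keys are all hypothetical, chosen to match the autonomous vehicle sensor example.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    identifier: str                      # short, unique name only
    title: str
    attributes: dict = field(default_factory=dict)  # searchable details


# Hypothetical sensors in a shallow, broad breakdown.
components = [
    Component("vehicle.sensors.cam1", "Forward camera",
              {"type": "camera", "location": "front"}),
    Component("vehicle.sensors.cam2", "Rear camera",
              {"type": "camera", "location": "back"}),
    Component("vehicle.sensors.lidar1", "Roof lidar",
              {"type": "lidar", "location": "top"}),
]


def find(components, **criteria):
    """Return identifiers of components whose attributes match all criteria."""
    return [c.identifier for c in components
            if all(c.attributes.get(k) == v for k, v in criteria.items())]


print(find(components, type="camera"))
print(find(components, location="top"))
```

If a camera is later moved from the front to the side of the vehicle, only its `location` attribute changes; its identifier, and every artifact that refers to it, stays the same.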

26.6.3 Multiple fit

The hierarchical, tree-structured approach recommended here makes each component part of exactly one parent component. It does not accommodate components that have more than one natural affinity to parent groupings.

Consider a radio transceiver that is used to communicate between aircraft, such as the ADS-B systems used for collision avoidance. This transceiver could be categorized multiple ways. It is part of the aircraft, but it is also part of an air traffic management safety system. The transceiver within the aircraft is part of a communication system, but it is also a part of the flight control system and intimately connected with human interface components on the flight deck. The transceiver, in other words, is part of several different groupings of components, depending on who is looking and for what purpose.

There is a fundamental tension between simple organizing structures, like a tree, and the richer relationships that elements of a system have with each other. For an excellent discussion of this, see Alexander’s essay on trees as a structuring approach for cities [Alexander15]. In that essay, Alexander proposes that a lattice structure is a more appropriate model for organizing urban structures. In his account, a tree-oriented description of a city fails to account for the ways that a house can be both a place for a family to live as well as a node in a social network and a place of work; in each of these roles, the house is related to different buildings or locations in the city.

The systems engineering approach presented here addresses this problem by separating naming or identity from the complex relationships that each component actually has. The breakdown structure only tries to give a name to each thing, like the address for a building. The relationships, functions, requirements, and everything else that goes into defining a component are all left to other artifacts, such as the component’s specification and models of the components.

This means: don’t try to make the breakdown structure do too much. When a component fits into multiple categories, pick the one that seems most natural for most users and leave it at that. Other artifacts and tools will address greater complexity.

26.6.4 Not by function

The breakdown structure is for organizing components: things that are built and that can be seen or touched (possibly virtually).

There is sometimes a temptation to try to organize system functions into the breakdown hierarchy. Don’t do that. The breakdown of function—and of the allocation of function to component—is a separate task that needs to be addressed by a structure that focuses on how functions are organized.

A better approach is to maintain the component breakdown and a functional breakdown separately, and maintain an allocation mapping that shows how different subfunctions are achieved by different components. The functional breakdown is often better reflected in the structure of how specifications or requirements derive from each other. See the chapter on requirements for more on this.
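An allocation mapping of this kind can be sketched as a simple table from functions to the components that achieve them. The function names, the component identifiers, and the particular allocation shown here are illustrative assumptions, not taken from a real project.

```python
# Hypothetical allocation: each subfunction maps to the set of components
# that together achieve it. The component breakdown and the functional
# breakdown remain separate; only this mapping connects them.
allocation = {
    "pointing.sense-attitude": {"space.acs.sun"},
    "pointing.compute-command": {"space.acs.control"},
    "pointing.apply-torque": {"space.acs.control", "space.acs.wheels"},
}


def components_for(function):
    """Components that contribute to a given function."""
    return sorted(allocation.get(function, set()))


def functions_for(component):
    """Functions that a given component helps to achieve."""
    return sorted(f for f, comps in allocation.items() if component in comps)


def unallocated(functions):
    """Functions that no component has been assigned to yet."""
    return sorted(f for f in functions if not allocation.get(f))


print(functions_for("space.acs.control"))
print(unallocated(["pointing.sense-attitude", "pointing.hold-target"]))
```

Because the mapping is many-to-many, a component can serve several functions and a function can span several components without either hierarchy having to bend to fit the other.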

26.6.5 Keep related things together

Some projects have proposed organizing components primarily by some fundamental, nonfunctional attribute. One project was considering separating hardware from electronics from software from operational procedures at the top level, and then organizing components within each of those categories by subsystem. Another project organized components first by the vendor organization that was to implement the component.

These approaches make it harder for people to use the breakdown structure to find things. Consider an electrical power controller on a spacecraft. This has an electronic component (the board and processor that runs the power control function) and a software component (that makes the decisions about what to power on and off, and to report information to a telemetry function). Someone working on the power controller will generally want to know about both aspects. Requiring them to look in two widely-separated parts of the breakdown structure is inconvenient, and (more seriously) it increases the chances that someone will miss a component that they need to know about to do their work.

As a general principle, it is better to group components by how people naturally think of them as being grouped. Keep functionally-related components close together in the breakdown structure so that people can find everything they need about something by looking in one place.

As noted above, this doesn’t always work. The breakdown structure will not be perfect because not everything in a system naturally falls into a hierarchical organization. But the more that like things can be grouped, the easier it will be for people.

26.6.6 Generic and reusable components

There is one special case of a component fitting into multiple places in a breakdown structure that deserves special treatment: generic and reusable components.

Consider an operating system. There may be multiple processors within a system that may all run instances of the same operating system. It is useful to have one specification for that operating system: there’s one product that is acquired from a vendor, there is one master copy kept somewhere, and so on. At the same time, that operating system will be loaded onto many different processor components in different subsystems.

One way to address this is to have a part of the breakdown structure for generic components, and then put an instance of that component in the places where it is used. The specification of each instance component can refer to the specification for the generic, with those functions or requirements that are specific to the instance added. This is an example of using the class-instance model from object-oriented programming to solve the problem.
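The generic-and-instance arrangement can be sketched as follows. The `Spec` class, the identifiers `generic.os` and `space.cdh.main.os`, and the requirement texts are hypothetical and serve only to show how an instance specification might refer to its generic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Spec:
    identifier: str
    requirements: list
    generic: Optional["Spec"] = None   # the generic component, if any

    def all_requirements(self):
        """Requirements from the generic spec, then instance-specific ones."""
        inherited = self.generic.all_requirements() if self.generic else []
        return inherited + self.requirements


# One specification for the operating system as a product...
generic_os = Spec("generic.os", ["Provide preemptive task scheduling"])

# ...and one instance placed where it is used, adding only what is specific.
main_os = Spec("space.cdh.main.os",
               ["Boot within the main processor startup budget"],
               generic=generic_os)

for requirement in main_os.all_requirements():
    print(requirement)
```

The instance keeps its own place and name in the breakdown structure, while the shared requirements are written once in the generic specification.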

26.7 Examples

26.7.1 NASA Work Breakdown Structure

The NASA project management process and systems engineering standards use a common WBS structure across all NASA projects. The use of the WBS is codified in a Procedural Requirement document [NPR7120], with details in an accompanying handbook [NASA18].

The NASA WBS is used as a project management artifact to organize work tasks, resources and budget, and report progress. The hierarchy must “support cost and schedule allocation down to a work package level” [NPR7120, p. 113]. A “work package” means one task or work assignment that is tracked, budgeted, and assigned as a single unit.

A NASA project’s WBS tree is rooted in the official NASA project authorization, with its associated project code.

The first level of elements is defined by NASA standards, and each element has a standard numbering. The standard elements for a space flight project are [NPR7120, Fig. H-2, p. 113]:

undisplayed image

Note how this organization mixes technical artifacts (payloads, spacecraft, ground systems) and management activities (project management, safety and mission assurance, public outreach).

The NASA WBS is intended to be one part of an overall project plan document. The project plan also contains information like:

26.7.2 MIL-STD-881 Work Breakdown Structure

This breakdown structure standard aims to provide a “consistent and visible framework” [DOD22] for communicating and contracting between a government program manager and contractors that perform the work. It addresses needs such as “performance, cost, schedule, risk, budget, and contractual” issues [DOD22, p. 1]. This kind of WBS is thus focused on supporting contractual relationships with suppliers.

The standard defines a number of different templates for different kinds of projects. It includes templates for aircraft systems, space systems, unmanned maritime systems, missiles, and several others.

The template for an aircraft system includes the following Level 2 items:

As should be clear from this example, this WBS template aims to address not just the design and building of a system but rather the operation of the entire program, including testing, deployment, and initial operation.

26.7.3 A simple spacecraft system

This is an example component breakdown for a simplified imaging spacecraft. The spacecraft uses solar panels to collect energy; it has a single imaging camera to collect mission data; it has a flight computer to run the system; an attitude control system to point the imager where needed; and a radio to communicate to ground. (The graphical version of this breakdown structure is included earlier in this chapter.)

Id Title
space Space segment
space.acs Attitude control system
space.acs.control Control logic
space.acs.sun Sun sensor
space.acs.wheels Reaction wheels
space.cdh Command and data handling avionics
space.cdh.gps GPS receiver
space.cdh.gps.ant Antenna
space.cdh.main Main processor Data storage
space.comm Communications system
space.comm.ant Antenna
space.comm.ant-tran Cable
space.comm.trans Transceiver
space.eps Electrical power system
space.eps.battery Battery
space.eps.controller Power controller
space.eps.panels Solar panels
space.eps.sep Separation switch
space.harness Harnesses
space.harness.canbus Data CAN bus Payload harness
space.harness.power Power cabling Radio harness Payloads Imager payload
space.prop Propulsion system
space.prop.lines Fuel lines
space.prop.tank Fuel tank
space.prop.tank.pressure Pressurization system
space.prop.tank.sensor Fuel pressure sensor
space.prop.thruster Thruster
space.structure Structure
space.thermal Thermal management system
space.thermal.propheat Prop tank heater
space.thermal.radiator Thermal radiator

This example only goes four levels deep. The actual breakdown structure would likely include at least two more levels, to represent, for example, different parts of the flight control software or subcomponents of the radio transceiver.
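A flat list of identifiers like the one in the table is enough to recover the tree and to check it for gaps. The sketch below uses a subset of the identifiers from the table; the function names `build_children` and `missing_parents` are hypothetical.

```python
from collections import defaultdict

# A subset of the identifiers from the table above.
ids = [
    "space", "space.acs", "space.acs.control",
    "space.cdh", "space.cdh.gps", "space.cdh.gps.ant",
    "space.thermal", "space.thermal.propheat",
]


def build_children(ids):
    """Map each parent identifier to the sorted list of its children."""
    children = defaultdict(list)
    for ident in sorted(ids):
        head, sep, _ = ident.rpartition(".")
        if sep:
            children[head].append(ident)
    return dict(children)


def missing_parents(ids):
    """Parent identifiers that are referred to but not themselves listed."""
    known = set(ids)
    parents = {i.rpartition(".")[0] for i in ids if "." in i}
    return sorted(parents - known)


children = build_children(ids)
max_depth = max(len(i.split(".")) for i in ids)

print(children["space"])
print("deepest level:", max_depth)
print("missing parents:", missing_parents(ids))
```

A check like `missing_parents` is useful when the breakdown structure is edited by hand, because it catches a component that has been added without the parent it belongs to.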

This breakdown includes a component that could fit in multiple places in the structure: the propellant tank heater. It is part of the thermal management system, since its function is to keep the fuel in the propellant tank within a certain temperature range, but it is also part of the propulsion system. In this example the choice was to categorize it as part of the thermal management system.

Part VIII: Team organization

Chapter 27: Team introduction

9 March 2024

Consider, for example, meetings that involve too many people, and accordingly cannot make decisions promptly or carefully. Everyone would like to have the meeting end quickly, but few if any will be willing to let their pet concern be dropped to make this possible. And though all of those participating presumably have an interest in reaching sound decisions, this all too often fails to happen. When the number of participants is large, the typical participant will know that his own efforts will probably not make much difference to the outcome, and that he will be affected by the meeting’s decision in much the same way no matter how much or how little effort he puts into studying the issues. […] The decisions of the meeting are thus public goods to the participants (and perhaps others), and the contribution that each participant will make toward achieving or improving these public goods will become smaller as the meeting becomes larger. It is for these reasons, among others, that organizations so often turn to the small group; committees, subcommittees, and small leadership groups are created, and once created they tend to play a crucial role. [Olson65, p. 53]

that needs careful design just as much as the system product

Part IX: Project plan

Part X: Appendixes

Appendix A: From stakeholder need to model purposes

8 January 2024

A.1 Introduction

In Chapter 14, I presented an approach for determining what features and capabilities should be supported in the project in order to do a good job of building a system, and meeting stakeholder needs. In this appendix, I present the detail of that derivation.

Bear in mind that this derivation results in a set of objectives for a project. It does not say how any particular project should meet these objectives; each project must decide those things in ways that meet the specific needs of that project and that system. The objectives can be seen as a set of considerations that each project should examine as they decide how to run the project.

The derivation only addresses matters that are related to the project’s approach to building a system. There are many other factors outside this scope: matters of project management, or of policy in the organization that hosts the project. Where appropriate I have made notes of these matters external to the system-building scope.

A.1.1 Stakeholders

The set of stakeholders is:

  1. The customer for which the system is being built;
  2. The team that builds the system;
  3. The organization(s) of which the team members are part;
  4. Funders who provide the investment to build the system; and
  5. Regulators who oversee the system and its building.

I introduced each of these in Section 14.2.

A.1.2 Model elements

I introduced the model for making systems in Section 6.3. This model is organized around the tasks that need to be performed to build the system, and has the following elements:

  1. Artifacts that are created by performing tasks, and represent the system and records about it;
  2. The team that builds the system by performing tasks and making artifacts;
  3. The tools that people on the team use in doing tasks; and
  4. The plan that organizes what tasks need to be done, in what order, and using what resources.

In addition to these elements, I have included an element for matters external to the system-building project: things that stakeholders need but that aren’t about building the system itself.

A.1.3 Derivation

The derivation maps stakeholder needs onto objectives for parts of the model.

undisplayed image

The result is a set of objectives or capabilities that people should consider when working out how the project should operate.

I discuss each stakeholder in the sections that follow, along with tables of the needs or objectives of each. The objectives that support these stakeholder objectives are annotated with a right-pointing arrow: →.

A.2 Stakeholders

A.2.1 Customer

The customer (see Section 14.2.1) is a stakeholder who wants the system built because they are going to use the system. They may or may not be funding system development directly—if they are, then they are also a funder below.

model:2 Customer
2.1 Fill purpose
The project must deliver a system that meets the customer’s purpose
2.1.1 Know purpose
The project must know what the customer’s purpose for the system is
2.1.2 Build to purpose
The project must produce a system that meets the customer’s purpose
model.artifacts:1.1, 2.1, 4.2, 4.4, 4.5, 5.1, 5.2
model.plan:1.2, 2.1, 2.2, 2.3, 3.3, 3.3.2, 2.2.2, 2.5.1, 3.2, 4.1
2.1.3 Know requirements
The project must know the customer’s reliability, safety, and security requirements
2.1.4 Meet requirements
The project must produce a system that meets the customer’s reliability, safety, and security requirements
model.artifacts:2.1.2, 4.5, 5.1, 5.2
model.plan:3.3, 3.3.5, 2.2.2
2.1.5 Free of errors
The project must produce a system that is free of errors
2.2 On time and budget
The project must deliver a system by the required deadline and within the needed budget
model.plan:1.2.5, 4.1, 4.2
2.2.1 Know budgets
The project must know the budgets and deadlines for delivering the system
model.plan:3.2.2, 4.1
2.2.2 Know consumption to date
The project must know the resources and time that have been used to date that count against budgets or deadlines
2.2.3 Project forward usage
The project must be able to project the resources and time required to complete the system or meet other deadlines
model.plan:1.2 Uncertainty
The project must be able to estimate the uncertainty in any forward projections of resources or time
2.2.4 Control execution
The project must be able to control execution to adjust resource and time consumption
2.3 Certifications
The project must deliver a system that has appropriate certifications or approvals
2.3.1 Know regulations
The project must know the regulations or standards that apply to certification/approval
2.3.2 Follow process
The project must follow any processes required to get certification/approval
model.artifacts:8.2, 8.3
model.plan:,,, 3.3.7
2.4 Release and deployment
The project must be capable of releasing a version of the system and deploying it to a customer
model.artifacts:1.1, 6.1
model.plan:3.4, 4.3
2.5 Evolve system
The project must evolve the system in response to changes in customer or other needs
2.5.1 Receive requests for change
The project must be able to receive and process requests for change from the customer
model.plan:5.1, 5.3
2.5.2 Receive regulatory changes
The project must be able to receive and process changes in regulatory requirements
2.5.3 Know purpose of change
The project must know the purpose of the change (and the change in system purpose that results)
2.5.4 Build to meet change
The project must be able to produce a system that meets the changed purpose while maintaining the system’s other purposes and requirements
model.artifacts:1.1, 2.1, 2.2, 4.2.1
model.plan:1.2, 2.1, 2.2, 2.3, 3.3.6, 2.2.1, 2.2.2, 2.5.1
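The references in the table above can be read as a traceability mapping from customer needs to the model objectives that support them. The sketch below encodes three rows of the table (2.2, 2.2.1, and 2.4) as a dictionary; the data structure and the function name `needs_supported_by` are assumptions made for illustration, not a prescribed format.

```python
# Three rows from the customer table: need id -> supporting model objectives.
trace = {
    "2.2": ["model.plan:1.2.5", "model.plan:4.1", "model.plan:4.2"],
    "2.2.1": ["model.plan:3.2.2", "model.plan:4.1"],
    "2.4": ["model.artifacts:1.1", "model.artifacts:6.1",
            "model.plan:3.4", "model.plan:4.3"],
}


def needs_supported_by(objective):
    """Customer needs that depend on a given model objective."""
    return sorted(need for need, objectives in trace.items()
                  if objective in objectives)


# Which customer needs would be affected if plan objective 4.1 were dropped?
print(needs_supported_by("model.plan:4.1"))
```

Reading the mapping in this reverse direction shows which stakeholder needs are at risk when a project decides not to address a particular objective.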

A.2.1.1 Filling purpose

A customer has some purpose for the system, meaning something they want to achieve by deploying and using the system. This is the problem that the customer wants solved, which is a higher-level concern than the specific features that the system will provide.

A customer may have additional requirements on the system. They likely have a need for a minimum level of reliability. They likely have needs related to safety and security of the system.

The project needs to build a system that can meet this purpose and the requirements.

The project can meet these needs by:

A.2.1.2 On time and budget

The customer likely has a deadline by which they would like the system delivered. They likely also have a budget for how much they want to invest in acquiring the system. At minimum, customers generally want the result as soon as possible and for as low a price as possible.

To meet these needs, the project should:

A.2.1.3 Certifications

In many industries, some kind of certification or approval is necessary to operate the system. An aircraft, for example, needs a type certification from the local aviation authority as well as approval for a specific instance of the aircraft. Even if there is no overt certification required, there are often regulatory standards to be met.

The project must build the system in compliance with regulations. When certification is needed, the project must follow the process to get that certification.

To achieve this, the project should:

A.2.1.4 Release and deployment

The customer needs the system actually to be delivered and put into operation. The project must deliver the system, and provide or support its deployment.

To do this:

A.2.1.5 Evolve system

If the system is successful, the customer often finds that it can be made even better with some changes. Or the customer’s needs may change, and they will want the system to adapt to meet their changed needs. The project should be able to maintain and evolve the system to support the customer’s changing needs.

A system may also need to change when regulations change.

The project can support an evolving system by:

A.2.2 Team

The team (see Section 14.2.2) is the collection of stakeholders who build the system. These people need the things that skilled, technical workers generally need: satisfaction, security, confidence, compensation.

Meeting these needs is mostly outside the scope of systems-building itself. These needs are largely met by the project and organization management who create the environment in which the team works. Still, there are aspects of systems-building that can help (or hinder) meeting the team’s needs.

The analysis of a team’s needs presented here is somewhat idealistic. It focuses on skilled workers who are not readily interchangeable, whose value to a project derives in part from the knowledge they carry about the system being built. It assumes workers who are motivated largely by work satisfaction and whose essential material needs are met by their compensation. These assumptions lead to a particular balance of power between the team and the organizations that employ them. This would not apply to a team of interchangeable workers or workers whose material needs are not well met by their employment.

model:3 Team
3.1 Satisfaction in the work
The team must have work that challenges them and results in satisfaction in what they produce
3.1.1 Positive outcome of work
The team must have confidence that their work will have a positive outcome
model.plan:1.1, 1.2
3.1.2 Challenging work
The team must find that the project’s work challenges them and makes use of their skills while remaining achievable
model.external:1.1, 1.2
3.1.3 Avoid irrelevant work
The team must believe that they are not being asked to do irrelevant work as part of the project
model.artifacts:1.3, 1.3.1
3.2 Appropriate staffing
The team must be staffed with the right people to do the work
3.2.1 Sufficient staffing
The project must have sufficient staff, with the right skills, to build the system
model.plan:1.2.3, 6.1, 6.3
3.2.2 Not overstaffed
The project must not be overstaffed in a way that leaves some unable to make meaningful contributions
model.plan:1.2.3, 6.1, 6.4
3.3 Sufficient supporting resources
The project must provide the team with sufficient resources to do the work, 3.3, 4.1, 4.2, 5.1
3.4 Secure position
The people in the team must feel secure in their position in the team
3.4.1 Understanding of fit
The team members must understand how they fit into the organization
3.4.2 Clear expectation
The team members must have a clear and correct understanding of their responsibilities in the project
model.plan:1.2.7, 2.4, 3.2.3, 6.2, 6.3
3.4.3 Fair evaluation
The team members must have an expectation that their work will be fairly evaluated
3.4.4 Clear lines of authority
The team members must have a clear understanding of the authority of others in the project
model.plan:3.2.3, 6.2, 6.3, 1.1.2
3.4.5 Ability to raise issues
The team members must have the ability to raise issues about the team and about the system, without retribution
model.plan:2.4, 3.3.5
3.5 Fair compensation
The team must be fairly compensated for their time and effort
3.6 Belief in project
The team must be able to believe in the project, its purpose, and its leadership
3.6.1 Belief in objective
The team must have confidence that the organization is accurately working with the customer
3.6.2 Ethics
The team members must believe that the system will be used in ways that accord with their ethical beliefs

A.2.2.1 Satisfaction in the work

Team members are expected to need satisfaction arising from the work they are doing on the project.

The satisfaction comes in part from believing that the work they are doing will have some positive outcome. That outcome might be that they see the system deployed and having a positive effect on the world. It might be that they see their work acknowledged, publicly or privately, even if the system ultimately is not deployed. It could come from social standing among their peers improving because of their association with the work.

Skilled workers also want work that makes use of their skills—which leads to a sense that they, as a specific individual, are making a contribution to the work. Work that challenges them or from which they learn things contributes to that satisfaction.

Doing work that is seen as not relevant or not likely to have value decreases their satisfaction. If asked to do something that is not achievable, they will lose enthusiasm. If they are asked to do work that they perceive as irrelevant, they will feel a lack of their individual value.

To support team satisfaction, the project can:

Other aspects are outside of the project’s scope.

A.2.2.2 Appropriate staffing

A team needs to have enough of the right people to do the work—but not too many people. Having enough people on the team who can do the work contributes to a team member’s sense that the project has a good chance of having a positive outcome.

Having too few people, or too many people who lack necessary skills, leads to an overworked and burnt out team.

Having too many people leads to team members who don’t have useful work to do. It can lead to people making up new work just to feel like they are contributing.

The “right” staffing level is dynamic. It changes over time as the project moves forward: a particular skill in designing electronics boards may be important for one period in the project, but once the necessary hardware has been designed and built, the need decreases. It also changes over time as people change. As a team member learns new things, they may find that they should move on to a different project. Life events occur that change a person’s motivations and needs. The key is not to always have the perfect cohort working on the project, but to have a pretty good group and work to address changes as they happen. If the team trusts that their management is able to address team composition, people will generally stay satisfied.

Ensuring appropriate staffing involves:

Much of this is outside the scope of the project itself. The organization holds the funds used to pay staff. It also provides the ability to hire and fire people.

A.2.2.3 Sufficient supporting resources

As with staffing, the team needs resources to do their work: a place to work and the tools to do the work, for example. They may need consumable resources as well. For example, a team might need a ready supply of liquid nitrogen in order to test a hardware component that is supposed to operate at low temperature.

If the team lacks these resources, they can’t do their work. This affects their satisfaction.

The project needs to have:

A.2.2.4 Secure position

Team members need to have a sense of security in their position. This means that they need to believe that they understand their position in the project and organization, believe that they will be treated fairly, and believe that issues they raise will be addressed. The opposite of this is when they have a sense of insecurity—because they do not understand what is expected or how they are evaluated, or because they believe that problems will not be resolved, even if they raise an issue.

The sense of security allows people to put their effort into their work, rather than spending their time and energy on worry. The sense also helps keep people on the team so that their knowledge of the system continues to benefit the project.

This comes from the project:

The organization also needs to:

A.2.2.5 Fair compensation

A technical worker needs to believe they are being fairly compensated for their time and effort. They need to be compensated well enough that they are not distracted by want. That compensation may be monetary, but it may take other forms as well.

Setting compensation policy is usually a responsibility of the organization, not the project.

A.2.2.6 Belief in project

Skilled workers often have choices about what project they work on. Many of them are motivated by a belief in the work they do: that it will help its users, or that it will result in some good for the world. If they come to believe that either or both is not true, they will be demotivated.

The project should:

The organization should also maintain an ethics policy that details:

A.2.3 Organization

The people in the team work for the organization, which provides a home for the project (see Section 14.2.3). The organization is responsible for finding funding for the project and providing a legal entity for doing the work. The organization provides the business operations that make the project possible.

There is no one kind of “organization” that fits all situations. The organization might be anything from a single person, to a company, to a consortium of organizations, depending on the project. The organization might exist to return profit in exchange for the work, or it might be a non-profit or a governmental organization that looks for non-monetary benefits from the project. Some organizations exist only to build and deliver one system; others expect to deliver the system to many customers and to build more systems in the future.

Many of an organization’s needs are not met by the system-building project itself; they are met by how well the organization’s management and business operations function. Nonetheless, how the system is built can help or hinder business management or operations.

The diversity of kinds of organizations means that the list of needs below has to be tailored for each project and each organization.

model:4 Organization
4.1 Ability to deliver
The organization must have the ability to deliver the working system to the customer
4.1.1 Ability to communicate with customer
The organization must be able to communicate with the customer
Conflict resolution
The organization must be able to negotiate and resolve conflicts between the team and the customer
4.1.2 Support for the team
The organization must have the infrastructure to support the team
Leadership
The organization must have leadership that can run the organization in a way that enables the team
model.external:1.4, 1.5, 1.7, 1.8
Infrastructure
The organization must have the ability to staff and finance the team
model.external:1.3, 3.1
Resources
The organization must have resources to hire the team and for them to operate
model.external:1.3
Workplace regulation
The organization must provide a workplace that meets regulation
4.2 Ability to sell
The organization must have the ability to sell the system produced (when appropriate)
4.2.1 Articulate value
The organization must be able to articulate the value of the system product being sold
4.2.2 Market
There must be a market for the system being sold
4.2.3 Sales and marketing team
The organization must have a sales or marketing capability, with the resources to do its job
4.3 Profit
The organization must get enough profit from the project to fund overhead and to support future projects
model.plan:1.2.5, 1.2.6, 3.2.6
4.4 Positioning for future work
The organization must be positioned for future projects and/or maintenance of this system
4.4.1 Reputation
The organization must have a reputation for being able to build systems well
model:2.1, 2.2, 2.5
4.4.2 Reusable capability
The organization must have capabilities in processes, teams, and tools that will apply to future projects
4.4.3 Ongoing improvement
The organization must be able to learn and improve its capabilities over time

A.2.3.1 Ability to deliver

The purpose of the organization in pursuing a system-building project is to deliver a system to the customer. If the project does not deliver something, the organization will see little return on its investment and effort.

Of course, an organization might get a contract from a customer and get started, only for the customer to cancel the contract. (Hopefully the organization has taken this into account in its planning.) The organization still needs to have been able to deliver the system, even if the work was stopped.

The ability to deliver has two aspects: communication with the customer and support for the team, in addition to the general ability to build a system for the customer.

When the system being built has a specific customer, the organization needs to be able to talk to them, keep them updated on progress, and hear concerns or issues from the customer. When there is disagreement, the organization needs people who can negotiate and resolve issues.

The project can help this by maintaining the interface with the customer, including having people assigned to work with the customer, documenting what they learn from the customer, and negotiating with the customer as issues arise.

The project team can do little without the organization supporting them. The team needs leadership; it needs workspace and other infrastructure; it needs human resources and payroll and accounting support. The organization needs to:

A.2.3.2 Ability to sell

If the system is expected to be delivered to multiple customers over time, the organization needs to be able to find those customers, make the case to them that the system will benefit them, and work out a deal to deliver the system.

I have written this need in terms of selling, but it also applies when the system is delivered for something other than monetary return. An open-source project that is freely available to users does not sell the system for money; the project has value when users pick up, deploy, and use the system. The project may also want to attract developers to build up an ecosystem of related products or services. Meeting these needs involves making potential users aware of the system and making the case that they will benefit from it.

To be able to sell the system, the organization needs to:

A.2.3.3 Profit

The organization will be expecting to get some kind of return on its effort. That may be a monetary return, but a non-profit or government agency may look for a non-monetary return, such as a community benefit.

The project can support this in two ways. First, the organization can set business objectives for the project, such as expected profit. The project can keep records of these objectives, and take them into account in the system’s design. Second, the project can organize its work as efficiently as possible so that investment goes as far as possible (consistent with deadlines). The project’s management can monitor the time and money being spent and work out how to adjust the project if it looks likely that the project will not meet the organization’s expectations on return or profit.
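One simple way to monitor spending against the organization’s expectations is a linear projection of cost at completion. The sketch below is only an illustration under stated assumptions: the function names, the single `fraction_complete` estimate, and the assumption that remaining work will cost the same per unit as the work done so far are all hypothetical, and a real project would likely need a richer model.

```python
def projected_cost_at_completion(spent: float, fraction_complete: float) -> float:
    """Extrapolate total cost, assuming remaining work costs the same
    per unit of progress as the work completed so far (an assumption)."""
    if not 0.0 < fraction_complete <= 1.0:
        raise ValueError("fraction_complete must be in (0, 1]")
    return spent / fraction_complete


def projected_return(expected_revenue: float, spent: float,
                     fraction_complete: float) -> float:
    """Projected return: expected revenue minus projected total cost."""
    return expected_revenue - projected_cost_at_completion(spent, fraction_complete)


def needs_adjustment(expected_revenue: float, spent: float,
                     fraction_complete: float, target_return: float) -> bool:
    """True when the projection falls short of the organization's target."""
    return projected_return(expected_revenue, spent, fraction_complete) < target_return


# Hypothetical figures: half the work is done and 400,000 has been spent.
shortfall = needs_adjustment(
    expected_revenue=1_000_000,
    spent=400_000,
    fraction_complete=0.5,
    target_return=250_000,
)
```

With these made-up figures the projected cost is 800,000 and the projected return is 200,000, which is below the 250,000 target, so the check signals that the plan needs attention.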

A.2.3.4 Positioning for future work

Many organizations build multiple systems over their existence—whether this is building multiple bespoke systems for customers, or building multiple products that are delivered to many customers. The ability to continue to deliver profitable systems is a major part of a company’s stock performance: the stock price is determined by the market expectation of future returns to the investors.

An organization’s reputation affects its ability to attract customers and investment, as well as its ability to hire talented staff. The reputation depends in part on its ability to deliver good systems.

An organization can become more productive over time, and thus improve its reputation, its ability to deliver, and its profitability. This comes from learning and improving. If the organization builds up a staff that knows how to run a system-building project well, future projects can be executed more efficiently. Better tools will help the next projects. However, learning and improvement do not often happen by chance; they happen when an organization sets out to learn from its performance.

The project can:

The organization should:

A.2.4 Funders

The funders provide the investment that funds the team building the system (see Section 14.2.4). The funder provides these resources in the expectation of some kind of return, be that monetary or not. A venture capital funder is most likely to look for a monetary return from future profits from the organization it is funding. A company providing internal funding is more likely looking for the project to add to the company’s capabilities, which will in turn enable the company to increase its future profits. A government agency is likely looking for something that will benefit the public in some way.

As noted earlier, there are many different kinds of funders, from venture capital to company internal funding to customers paying for development.

model:5 Funders
5.1 Return on investment
The funder must get at least the expected return on its investment
model:4.2, 4.3
5.1.1 Visibility
The funder must have sufficient visibility into the organization’s behavior and progress to determine when the project is at risk of not providing a return on investment
model.plan:1.2, 4.1, 4.2
5.1.2 Influence
The funder must have influence on the organization or project in order to address performance that will jeopardize return on investment
5.2 Ability to attract future investment
The project must help the funder attract investment for future projects

A.2.4.1 Return on investment

Funders provide capital to run the project on the expectation that they will get some return on that investment.

In some cases, the return will come from profit realized in building the system (Section A.2.3.3) or from an increase in the value of the organization (Section A.2.3.4). In other cases the return will come from the value of the system after it is delivered and deployed (Section A.2.1.1, Section A.2.3.1).

The funders will also expect to be able to track the organization’s and project’s progress, and to raise issues when they find that there may be a problem that could jeopardize the funder’s return. The organization needs to have people whose responsibility includes interfacing with the funders.

The project can support the interface with funders by maintaining a realistic plan for the work, managing its budget, and keeping the organization informed of progress. The project should also have the processes in place to respond when the funders raise an issue that leads to a potential change to the system.

The project may also need to maintain accurate records and artifacts that allow the funder to audit the project—verifying that the information the funder has received is accurate and complete.

A.2.4.2 Ability to attract future investment

The funders get the capital they invest from somewhere. In many cases, the investment capital comes from their customers: individual and institutional investors for venture capital, legislatures and the public for government investors. The funders will keep their investor customers satisfied if they can show that their investments produce the expected returns, leading to a reputation for using capital wisely. At the same time, funders want to avoid bad press from projects that have problems, which can reflect on the funders’ ability to select organizations or projects.

The ways that the project can address this funder need are all included in the previous section, on the funder’s need for return on investment.

A.2.5 Regulators

Regulators (Section 14.2.5) provide an independent check on work to ensure that it meets regulations or standards, thus ensuring that some public good is maintained that the organization or project might not otherwise be incentivized to meet.

The interaction between the project and regulators depends on the countries involved and the nature of the project. Some industries require licensing or certification of some kinds of system: most aircraft, for example, must obtain type certification from the local civil aviation authority before that aircraft is allowed to fly or be sold. Spacecraft require a set of licenses for launch and communication. Other industries, such as consumer electronics or automobiles in the United States, depend on compliance with regulation but compliance is only checked after the fact.

I include voluntary standards as part of regulation. Non-governmental organizations set interoperability standards; the standards for USB (set by the USB Implementers Forum) and Wi-Fi (set by the Institute of Electrical and Electronics Engineers 802.11 working group) are examples. Other organizations set safety standards that help to ensure consumer products are checked to be safe.

The regulators perform multiple tasks:

model:6 Regulators
6.1 Compliance and certification
The regulator must be able to work with the project to ensure regulatory compliance and (when appropriate) certify the system
6.1.1 Available regulation
The regulator must make information about regulations available to the organization and possibly user
6.1.2 Application
The project must apply to the regulator for certification and then follow the certification process
6.1.3 System auditability
The regulator must be able to audit that the system complies with regulations
model.artifacts:4.2, 4.4, 4.5, 8.4
6.1.4 Process auditability
The regulator must be able to audit that the organization has followed required processes in building the system
6.2 Monitoring
The regulator must be able to monitor the organization, project, and/or system for compliance with regulation
6.2.1 Accurate information available
The organization and/or user must make available to the regulator accurate and complete information about the system and organization behavior
model.artifacts:4.2, 4.4, 4.5, 8.3, 8.4
6.2.2 Notify regulator
The organization or user must proactively provide information to the regulator when a potential regulatory problem is detected, as required by regulation
6.3 Problem resolution
The regulator must be able to work with the project and/or user to identify and resolve potential regulatory problems
6.3.1 Communicate with organization or user
The regulator must be able to communicate with the organization or user about potential regulatory problems
6.3.2 Accurate information
The regulator must obtain cooperation and accurate information from the organization or user to investigate a potential regulatory problem
model.artifacts:4.2, 4.4, 4.5, 8.3, 8.4
6.3.3 Respond to remedy
The organization or user must be able to respond to a regulator’s remedy

A.2.5.1 Compliance and certification

The regulator makes regulations (or standards), and makes them available to teams building affected systems.

The project responds to the regulations by designing and building the system so that it meets the regulations, maintaining records needed to show that the regulations have been met, and beginning a process for getting certifications or licenses when appropriate.

The project is responsible for:

A.2.5.2 Monitoring

In some cases, the regulator must monitor the project’s work—for example, during aircraft certification, which is generally a joint effort between the aviation authority and the company building the aircraft. A regulator might also need to monitor the project after a violation has been found and the team is working on remedial action.

Accurate and timely information is paramount when this occurs. The project must maintain good records to be able to provide that information to the regulators. The information potentially covers everything about the project: the design and analyses of the system, its implementation, records of design rationales, and logs of the processes followed.

The team must also be prepared to notify the regulator proactively as situations arise. The team should have people who will watch for situations and communicate with the regulator.

A.2.5.3 Problem resolution

I have never seen a licensing or certification process go without problems. The processes and regulations are often complex, and unless a team has done the process several times before, there will almost certainly be things the team needs to learn to get through it.

This means that there will be problems to resolve. Sometimes the team will discover the problem and need guidance from the regulator. Other times the regulator will raise the issue.

The team can smooth the problem resolution process by:

A.3 Model elements

All of the objectives in the previous section map to objectives related to artifacts, team, tools, and plan. Some of them also map to things other than the system-building that goes on in the project.

This section lays out tables of the objectives for each element of the model. Each objective is annotated with its parents; that is, the objectives that are the reason that this objective is included. These are annotated in the tables with an arrow pointing down and right: ↘. If one of the objectives has children, those are annotated with a right-pointing arrow: →.
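The parent and child annotations amount to a directed graph over objective identifiers. As a rough illustration, here is a minimal sketch of how such links could be recorded and queried. The class names `Objective` and `ObjectiveIndex` are hypothetical, and the direction of the sample link (model:4.3 recorded as a child of model:5.1) is an assumption made only for the example.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Objective:
    """One objective: an identifier, a title, and its parents' identifiers."""
    ident: str
    title: str
    parents: list[str] = field(default_factory=list)


class ObjectiveIndex:
    """Stores objectives and derives child links from recorded parent links."""

    def __init__(self) -> None:
        self._objectives: dict[str, Objective] = {}
        self._children: dict[str, list[str]] = defaultdict(list)

    def add(self, objective: Objective) -> None:
        self._objectives[objective.ident] = objective
        for parent in objective.parents:
            self._children[parent].append(objective.ident)

    def parents_of(self, ident: str) -> list[str]:
        return list(self._objectives[ident].parents)

    def children_of(self, ident: str) -> list[str]:
        return list(self._children.get(ident, []))

    def dangling_parents(self) -> list[str]:
        """Parent identifiers that are referenced but were never added."""
        return sorted({
            parent
            for objective in self._objectives.values()
            for parent in objective.parents
            if parent not in self._objectives
        })


# Sample data; the link direction is assumed for illustration only.
index = ObjectiveIndex()
index.add(Objective("model:5.1", "Return on investment"))
index.add(Objective("model:4.3", "Profit", parents=["model:5.1"]))
```

Recording only the parent links and deriving the children keeps the two directions from drifting apart, and a check like `dangling_parents` can flag an objective that cites a parent that does not exist.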

A.3.1 Artifacts

model.artifacts:1 Artifact management
1.1 Store artifacts
The project must have a place to store artifacts
model:2.1.2, 2.4, 2.5.4
model.artifacts:2.1, 2.2, 3.1, 4.2, 4.4, 4.5, 5.1, 5.2, 6.1, 6.2, 7.1, 7.2, 7.3, 7.4, 8.1, 8.2, 8.3, 8.4, 9.1, 3.2.1, 3.4
1.1.1 Consistent versioning
The artifact storage must be able to maintain versions of all artifacts that are consistent with each other