Making systems

Volume 2: Life cycles
Richard Golding

Copyright ©2025 by Richard Golding

Release: 0.4-review

Table of contents

Part V: Development methodology and life cycles

This part explains what life cycles and development methodologies are. It covers what goes into a life cycle pattern and how the patterns relate to other parts of how a project operates. It defines development methodologies and how these relate to life cycles. The last chapter lists several example life cycle patterns, which lead into a comprehensive reference life cycle in Part VI.

Chapter 21: Introduction to life cycle patterns

2 October 2024

21.1 Introduction

System building in general follows a common story.

A project to develop a new system begins when someone has an idea that people should make the system. At this initial moment, the system is largely undefined. There is a vague concept in a few minds, but all the details are uncertain.

The project then moves the system from this initial concept through to an operational system, and through the system’s operational life and eventual retirement. During development, the team will need to ensure steps are taken in order to produce a correct, safe system. Designs will be checked. Implementations will be tested. The system as a whole will be verified before being deployed into service. At the same time, the resources spent on building the system must be used efficiently, doing the work that needs to be done and avoiding the work that doesn’t need to be done.

Many projects continue system development beyond the first operational version, with ongoing development or problem fixes. Some projects include the steps to shut down and dispose of the system once it has completed its functions.

The life cycle is how a project organizes the way the team moves through this story. It is a pattern that defines the phases and steps in the work: what will come first, what will be done before something else, when checks will happen. It provides checklists to know when some step is ready to be done, and when it should wait for prerequisites. It provides checkpoints and milestones for reviewing the work, so that problems are found and dealt with in a timely way. It provides an overall checklist to ensure that all the work that needs to be done is in fact done.

Section 20.3 introduced the basic ideas for life cycle patterns. These include:

Each project will use its own life cycle patterns. The patterns may incorporate a framework that is standard for the industry or the parent organization. Selecting and documenting the patterns is an essential part of starting up a project, and people in the project should review how well the patterns are working for them from time to time and may want to improve the patterns.

In this part, I discuss life cycles in general. In Part VI, I present a reference life cycle pattern.

21.2 Life cycle and development methodology

Life cycle patterns are related to, but separate from, the development methodology that a team chooses to use, such as waterfall, spiral, or agile methodologies. I address these methodologies in the next chapter (Chapter 22).

Speaking broadly, the development methodology determines how the work is organized in time: whether it proceeds in a single sequence or iteratively, whether tasks are synchronized or run separately, and how far ahead to plan. The life cycle patterns reflect some of those methodology decisions and encode how to do different tasks.

Put another way, the life cycle patterns help organize what work the project has to do, and what dependencies there are among different steps in the work. The development methodology organizes how that work is planned and scheduled. As a result, the two go hand-in-hand but are distinct from each other.

21.3 Key ideas

Almost all project life cycle patterns, for both whole systems and for components, follow a similar overall flow. Abstracting from the story in the introduction, there are phases:

  1. Working out how the project will operate
  2. Identifying purpose
  3. Developing a concept
  4. Refining concept into specification and design
  5. Implementation
  6. Verifying the result
  7. Operating the system or component
  8. Evolving it over time
  9. Retiring the system or component at end of life
  10. Shutting down the project

For a whole system, this looks like:

undisplayed image
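
To make the flow concrete, here is a minimal sketch (in Python, with names of my own choosing rather than anything standard) of the phases as an ordered sequence, with a check that a phase starts only once its predecessor is complete. A real project would tailor both the phases and their dependencies.

```python
# A minimal sketch of the overall flow: each phase may start only when the
# phase before it is complete. Phase names follow the list above.

PHASES = [
    "work out how the project will operate",
    "identify purpose",
    "develop a concept",
    "specify and design",
    "implement",
    "verify",
    "operate",
    "evolve",
    "retire",
    "shut down the project",
]

# In this simplified linear flow, each phase depends on the previous one.
PREREQUISITES = {PHASES[i]: {PHASES[i - 1]} for i in range(1, len(PHASES))}

def ready_to_start(phase: str, completed: set) -> bool:
    """A phase is ready once all of its prerequisites have been completed."""
    return PREREQUISITES.get(phase, set()).issubset(completed)

done = {"work out how the project will operate", "identify purpose"}
print(ready_to_start("develop a concept", done))    # True
print(ready_to_start("specify and design", done))   # False: no concept yet
```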

Note that this flow starts with the system or component’s purpose. Good engineering always begins with having a clear understanding of what a thing is for. I have watched many engineers rush into designing and building a component without putting time into understanding what the component is going to be used for. Occasionally their design has happened, by chance, to match what the component actually needed to do, but only rarely.

Understanding a system’s purpose or a component’s purpose also provides a way to bound the work. If one doesn’t know what a component is for, it is easy to keep working on a design without stopping because there isn’t a clear way to know when the design is good enough to be called done.

There are many points in this flow where one might add checks. At these times one can check on the correctness of the work. These checks improve system quality by building in the opportunity to discover and correct flaws before other work builds on the flawed work. Finding minor problems quickly usually means the cost of correction remains low.

There are also points where a project might have project-wide decisions: go/no-go decisions or key decision points. These provide opportunities to check the entire project's progress, sometimes occurring in the middle of other work, or at times when irrevocable actions are to be taken, such as funding, launch, or public announcements.

This general pattern applies recursively. One can start by creating a specification and design for the system. The system design will decompose the system into high-level components (Section 6.4). The act of defining a set of components implies identifying a purpose for each one, then specifying and designing each high-level component. The design of a high-level component might in turn decompose into a set of lower-level components, which in turn need a purpose, then specification and design.

The overall flow shows a move from high uncertainty at the beginning to lower uncertainty as the work proceeds. I will address managing work under uncertainty in a later chapter.

Finally, a project’s life cycle patterns will reflect the development methodology that the team has selected. Waterfall, spiral, and agile development all affect the contents of the patterns. I discuss this more in Chapter 22.

The life cycle provides a general set of patterns for how work should proceed, but it should not define exactly how each work step should be done. That is left to procedures (Section 20.4), which should provide step-by-step instructions for how to do key parts of the life cycle. For example, if a life cycle phase indicates that a design review and approval should occur before the end of a design phase, then there should be a corresponding procedure for design reviews. That procedure should indicate who should be involved in a review, what they should look for, how those people will communicate about the results, who is responsible for approving the design, and how they indicate approval.
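
For example, such a design-review procedure could be captured as structured data that the life cycle phase simply references. The fields below are a hypothetical sketch mirroring the questions just listed (who reviews, what they look for, how results are communicated, who approves); they are not a prescribed format.

```python
from dataclasses import dataclass

@dataclass
class Procedure:
    """A step-by-step procedure that a life cycle phase can reference."""
    name: str
    reviewers: list          # roles, not specific individuals
    checklist: list          # what the reviewers should look for
    communication: str       # how review results are reported back
    approver: str            # role responsible for approving the design

design_review = Procedure(
    name="component design review",
    reviewers=["developers of components that interact with this one"],
    checklist=[
        "design meets the component specification",
        "interfaces match the agreed interface specifications",
        "rationale for major design decisions is recorded",
    ],
    communication="written comments attached to the design artifact",
    approver="systems engineer for the parent component",
)
```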

The life cycle patterns are the basis for the project’s plan (Section 20.5). The patterns are a set of building blocks that people in the project can use to develop the plan. The plan, in turn, guides tasking: the selection of which tasks (as defined in the plan) people should be working on next.

21.4 Purpose of life cycle patterns

Life cycle patterns address problems that projects have. They can help the team have a predictable and reproducible flow to how work should be done, so that everyone shares the same understanding of how the team works.

There are six ways that life cycle patterns help a project.

  1. Quality of work. The team must build a system that addresses the customer’s purpose, and in doing so must meet quality, safety, security, and reliability objectives.
  2. Efficiency. The project will be expected to deliver the final system as quickly as possible, at the lowest reasonable cost, while meeting the quality objectives. This means that the team needs to be kept busy doing useful work.
  3. Team effectiveness. People on the team need to know how to work together. Building trust depends, in part, on having shared expectations of how each person will do their work.
  4. Management support. Project management will need to plan and track the work in order to ensure the team meets deadlines and that they have sufficient resources to do the work.
  5. Customer and regulatory support. The customer may have specific milestones they expect the project to meet in support of the customer’s acquisition processes. Regulators often have similar expectations if a system must be certified or licensed for operation.
  6. Auditing support. The project’s work may be audited to check that the processes followed meet regulatory requirements, certification requirements, or as part of a legal review.

Gaining these benefits is not a result of using life cycle patterns per se; rather, it comes from using patterns that are designed to provide the benefits. For example, if the customer has an acquisition process that specifies certain milestones, then the top-level life cycle pattern for the project should incorporate those milestones. If the project is likely to have auditing requirements, then the patterns should include tasks to generate and maintain auditing records.

Quality of work. The purpose of a project’s approach to operations is, in the end, to produce a system for the customer that meets their objectives. This means it should do what they need, meet safety and security needs, and support future system evolution. In other words, the team’s work needs to produce a system with good quality.

Neither the life cycle patterns by themselves nor the plan that derives from them directly result in good product quality. System quality comes from all of the detailed work steps that everyone on the team performs. If they do their work well, and if mistakes they make are caught and corrected, then the system can turn out well. If some work is not done well, nothing in the life cycle patterns can prevent that.

However, the life cycle patterns can create an environment that will more likely lead to good quality. They can proactively make flaws less likely by ensuring that steps happen in order: identifying purpose and concept before design and implementation, for example. They can insert points in the work that encourage people to think through what they should design or implement. They can also avoid problems by providing a checklist for what should be complete at the end of a work step. They can ensure that when a system is delivered, all the work needed to put it into operation is complete. They can build in checkpoints for reviews and verification to catch problems early. They also help project management organize the work so that it is complete, that is, so that no part of the system and no work step is overlooked.

Sometimes the value of a life cycle pattern will come from slowing down work. Most of the work done on a project is done by people who are focused on a particular part of the system; it is not their job to manage how the project goes as a whole. Their job is to get that one part designed and built, according to the specifications they have been given. If the specialists start building before the context for their work has been established, they are likely to design or implement something that does not meet system needs. I have been part of more than one project where the resulting rework caused the project to be canceled or required a company to get additional funding rounds to make up for the resources spent on the mistakes.

Efficiency. Most systems projects will be resource-bound, with more tasks than there are people on the team to do them. In this kind of project, it is important to keep each person busy with useful work. This means that nobody on the team is blocked with no tasks they can usefully perform. It also means that almost all the tasks that people perform contribute to the final system—that there is little work that has to be thrown out and redone because it had flaws that made it unusable.[1]

As project management builds the project’s plan, using the life cycle patterns as building blocks, they must detect where there are dependencies between work steps and plan the work steps so that later steps are unlikely to get blocked. For example, if some part will require an unusually long time to specify and acquire from an outside vendor, then the management will need to ensure that work on that part starts early. The life cycle patterns provide part of the structure on which the plan is based and a template for some of the dependencies.
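
A small sketch of that kind of dependency check: given rough lead times and dependencies between work steps, compute the earliest week each step can start, so that long-lead items (such as the outside purchase below) become visible early. The steps and numbers are invented for illustration.

```python
# Hypothetical work steps: name -> (lead time in weeks, prerequisite steps)
steps = {
    "specify sensor": (2, set()),
    "procure sensor": (16, {"specify sensor"}),   # long-lead outside purchase
    "design mount": (3, {"specify sensor"}),
    "integrate sensor": (1, {"procure sensor", "design mount"}),
}

earliest = {}

def earliest_start(name: str) -> int:
    """Earliest start = latest finish time among the step's prerequisites."""
    if name not in earliest:
        prereqs = steps[name][1]
        earliest[name] = max(
            (earliest_start(p) + steps[p][0] for p in prereqs), default=0)
    return earliest[name]

for name in steps:
    print(f"{name}: can start at week {earliest_start(name)}")
```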

Life cycle patterns can also help avoid unnecessary rework. This comes partly from the ways that the patterns help improve the quality of work. In particular, a good life cycle pattern can lead people to take the time to think through the purpose and specification of something before they jump into design and implementation unprepared, and then build something that does not meet the system’s needs.

Finally, the patterns can help bound the work to be done. When a project does not define the scope of work to be done, it is likely that someone will start working on something beyond or unrelated to the customer's needs. Good patterns help avoid this by defining an orderly and thoughtful process for identifying what work needs to be done.

Team effectiveness. Members of an effective team respect and trust each other. Having shared norms and understandings for how work is done and how people communicate is important as part of the environment that allows the team to develop respect and trust.

A defined life cycle for a project addresses part of this by defining a common understanding of how work should be done. Good patterns define expectations of what will be done in different work steps. Everyone on the team can agree when a work step has been completed. Good patterns also create times when people know they are expected to communicate about some work step. This makes it easier for someone to trust that they will be consulted at appropriate points about work that might affect what they are doing, so that they do not need to create separate, ad hoc communication channels or try to micromanage something that is not their direct responsibility.

As I have noted elsewhere (Section 20.8.3), the life cycle patterns can only have this benefit if the team actually follows them.

Management support. The team, or designated parts of it, will be responsible for making a plan (Section 20.5) for the project’s work, then coordinating and tracking the resulting tasks. The life cycle patterns provide templates for the tasks that will go into the plan, and the key milestones that anchor the work. The life cycle sets the pattern for phases that the project will go through, such as initial conception, initial customer acceptance, concept exploration, implementation, and verification. The cycle also sets the pattern for milestones that gate the progression from one phase to another, such as a concept review, a design review (and approval), or an operational readiness review.

The plan will change from time to time, both in response to external change requests and as the project progresses and the team learns more about the work ahead. Sometimes the need for change emerges gradually, with an issue slowly manifesting itself without causing an acute problem that would prompt people to recognize the need for change. A good life cycle will build in times for people to step back to get perspective and detect when there is a slow-building problem to address. Review milestones are often a good time to plan for this.

Having life cycle patterns and corresponding procedures that apply when these changes occur will help the team adjust their work in an orderly way. It will help them ensure that steps don’t get missed as they work out how to change the plan (and the system being built).

Good life cycle patterns can help a project steadily decrease its uncertainty and risk as work proceeds. Most of the time, a project will start with high uncertainty about what the system will look like, and early project phases result in increasing understanding of what the system will need to be. This process will repeat at smaller scales: once the general breakdown of the system into major components is decided on, each of those components will start with high uncertainty about how it will be structured. The uncertainty about the major components will then gradually resolve, and so on. However, this happens only when the project is guided so that uncertainty is addressed systematically, not haphazardly.

Customer and regulatory support. Many customers will have a process they go through to decide whether to build a system and to track its development process. For US governmental customers, much of the process is encoded in law or regulation, such as the Federal Acquisition Regulation (FAR) [FAR] or Defense Federal Acquisition Regulation Supplement (DFARS) [DFARS]. The process governs matters like which design proposal is selected for contract, providing evidence of good progress, providing information that determines periodic contract payments, accepting the finished system, and determining whether the project should continue or be terminated.

These customers will expect deliverables from the project from time to time. The life cycle process must ensure that there are milestones when these are assembled and delivered. (It is then the job of project management to ensure that these milestones, and the tasks for preparing deliverables, can be completed by the time line that the customer requires.)

Whether the customer requires explicit intermediate deliverables or not, formally involving the customer may be important for keeping the project on track.

Similarly, regulatory bodies have processes by which a system that must be certified or licensed before operation can apply for that approval. Those processes will define activities that the team must perform, along with milestones and deadlines by which applications must be submitted or approvals received.

Auditing support. A project’s development practices may be audited for many reasons. Auditors may perform a review as part of an appraisal or certification against standards, such as CMMI [CMMI]. They may review processes to ensure compliance with regulatory standards, especially for security-sensitive projects. The processes may also be audited as part of a legal review. These reviewers need to see both the entire definition of processes, including the life cycle patterns, as well as evidence of how well the team has followed these practices.

21.5 A model for patterns

Each project will have several life cycle patterns, each covering a different part of the work.

Each pattern is defined by its purpose, the circumstances in which it applies, the phases or steps involved, and the dependencies among the steps. It should also include rationale that explains why the pattern is structured the way it is. In a previous chapter I used the example of a simple pattern for building one component:

undisplayed image

This pattern applies to building one low-level component where the purpose of the component is already known, and the component is straightforward to design and build in house. Similar but slightly different patterns might apply when the component has to be prototyped before deciding on a design, or when the component is being acquired from a supplier outside the project. This pattern would be used as one part of a larger pattern for building a higher-level component that includes this one.

Each phase of a pattern defines a way to move part of the work forward. It should have a defined purpose stating what work is to be achieved in that phase.

undisplayed image

The details of the phase are defined by:

Each action should also indicate who is responsible for performing that work. The responsibility will usually be defined as a role, not a specific individual. For example, a component design phase might involve three actions: design the component, review the design, and approve the design. The design action would be the responsibility of the component developer; the review action would be the responsibility of the developers responsible for components that interact with the one being designed, and the approval would be the responsibility of a systems engineer overseeing some higher-level component of which this one is part.

The rationale for this example design phase might say:

The actions defined for the phase should reference the procedures for doing those actions, when those procedures are defined. For the example design review action, the procedure might be:

The procedure might also name the tools to be used (an artifact repository for the design, a review workflow tool for the reviews).
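
Pulling these elements together, one way to record a pattern is as structured data: its purpose, when it applies, its phases, the actions within each phase with their responsible roles and procedure references, and its rationale. The sketch below is hypothetical and deliberately abbreviated; a project would adapt the fields to its own documentation conventions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Action:
    name: str
    responsible_role: str            # a role, not a specific person
    procedure: Optional[str] = None  # reference to a documented procedure

@dataclass
class Phase:
    name: str
    purpose: str
    actions: list
    exit_criteria: list              # checklist for declaring the phase done

@dataclass
class LifeCyclePattern:
    name: str
    purpose: str
    applies_when: str                # circumstances in which the pattern applies
    phases: list
    rationale: str

design_phase = Phase(
    name="design",
    purpose="produce an approved design for the component",
    actions=[
        Action("design the component", "component developer"),
        Action("review the design",
               "developers of interacting components",
               procedure="component design review"),
        Action("approve the design",
               "systems engineer for the parent component"),
    ],
    exit_criteria=["design baselined", "review comments resolved"],
)

simple_component_pattern = LifeCyclePattern(
    name="build a simple in-house component",
    purpose="take a low-level component from known purpose to verified implementation",
    applies_when="the component's purpose is known and it is built in house",
    phases=[design_phase],           # implementation and verification phases omitted
    rationale="independent review catches flaws before later work builds on them",
)
```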

21.6 Documenting life cycle patterns

A team needs clear documentation of the phases if they are to execute them properly. A team can’t be expected to guess at what they need to be doing, or how their work will be reviewed; it needs to be spelled out.

This documentation is assembled during the project preparation phase. The details are usually not completely worked out before any other work is begun; rather, “project preparation” more often proceeds in small increments, working out the rules shortly before the associated work begins.

Each life cycle pattern should have a purpose, and the steps or phases in the pattern should be checked to confirm that they can achieve that purpose (and that there is no extraneous work in the pattern).

A pattern should also have an explanation of when it applies and when it does not. For example, there may be multiple patterns for designing a component: one for a simple component that is built in house; one for a component that is outsourced to a supplier; one for a high-level component that is made up of several lower-level components; one for a component that requires investigation or prototyping before deciding on a conceptual approach to its design. All these patterns likely have a lot in common, but procuring an outsourced component will have contracting steps that an in-house component will not.

Someone using the documentation should be able to tell accurately whether they are using the correct version of the patterns. The life cycle patterns will be revised from time to time—as the team grows and as people find ways to improve how they work together. This means that the material a user sees should carry not just a revision number but also a clear indication of whether the version they are looking at is no longer current.

The form of the documentation is not as important as the content. It can be a written document. It can be made available electronically, with structured access and search capabilities (such as in a Wiki). Some companies offer tools that help define and document development processes or life cycle patterns, including definitions of phases. What matters is that each person who needs to use the documentation can do so conveniently and accurately.

21.7 Work steps and artifacts

Each phase or step has a number of artifacts that the team must develop. At the end of a phase, some of those artifacts need to be complete (allowing for future evolution), and others need to have reached some defined level of maturity. The work in a phase consists of the tasks that develop those artifacts.

I discussed artifacts in Chapter 17. The artifacts are the products of building the system, including the system being delivered as well as documentation of its design and rationale, records of actions taken during development, and information about how the project operates.

These artifacts are the inputs and outputs of the work specified in life cycle patterns (and the associated procedures). Using the component design step example, the work uses:

The design step produces:

In general, every artifact involved in building the system should be a product of some work phase or step, and every input or output of work steps should be included in the set of artifacts the team will develop. Ideally, the life cycle patterns will be checked for consistency with the list of artifacts the project uses.
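
That consistency check can be done mechanically. The sketch below assumes the patterns' inputs and outputs and the project's artifact list are available as plain collections; the step and artifact names are illustrative.

```python
# Hypothetical pattern steps and their input/output artifacts
pattern_steps = {
    "design component": {
        "inputs": {"component specification", "interface specifications"},
        "outputs": {"component design", "design rationale"},
    },
    "implement component": {
        "inputs": {"component design"},
        "outputs": {"component implementation"},
    },
}

# The project's declared artifact list
project_artifacts = {
    "component specification", "interface specifications",
    "component design", "design rationale", "component implementation",
}

# Check 1: every input or output named in a step appears in the artifact list.
for step, io in pattern_steps.items():
    for artifact in (io["inputs"] | io["outputs"]) - project_artifacts:
        print(f"{step}: references unlisted artifact '{artifact}'")

# Check 2: every listed artifact is used or produced by at least one step.
touched = set().union(
    *(io["inputs"] | io["outputs"] for io in pattern_steps.values()))
for artifact in sorted(project_artifacts - touched):
    print(f"artifact '{artifact}' does not appear in any work step")
```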

Artifacts are developed at different times during the course of a project. A few artifacts should be worked out as the project is started—especially those recording the initial understanding of the system’s purpose and initial documentation of how the project will operate. These will be refined over time. Other artifacts are developed during the course of development, and the life cycle patterns indicate which ones are to be worked out before others. The artifacts will be in flux during development: the team learns about the system as it designs and develops it; the customer or mission needs often change over time; flaws get discovered in designs or implementations.

Many of the project’s artifacts support how people work together, and the life cycle patterns should reflect these communication needs. For example, one person may work out the protocol that two components need to use to communicate with each other. Two other people may design and implement the two components. The interface specification that the first person develops serves to communicate the details of the interface among all three people. The patterns should record that the design and implementation work steps depend on the work to develop the interface specification. Later, if one of the component developers identifies a flaw in the interface, the people involved can work through how to revise the interface—and the revised specification artifact informs each person how to update their work to match the change. The pattern helps to show how information about a change to the interface specification triggers rework on dependent artifacts.

A good life cycle pattern has procedures to manage the change in artifacts, and how those changes affect other artifacts downstream from them. There are two separate problems these procedures must address:

  1. Managing how changes are coordinated across multiple artifacts and through the team while a part of the system is in development; and
  2. Ensuring that when a part of the system is complete, all the artifacts are consistent with each other.

Different life cycle patterns approach this in different ways, which we will discuss in later chapters on different patterns. The most common approach is to maintain different versions of an artifact, with at most one version being designated as a baseline or approved version, and other versions designated as works in progress. Many configuration management tools have a way to designate a baseline version, and many software repository tools provide branching and approval mechanisms to track a stable version.
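
As a small illustration of the baseline idea: each artifact keeps several versions, at most one of which is designated the baseline, and approving a new baseline flags the artifacts that depend on it for possible rework. The names and fields here are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Artifact:
    name: str
    versions: list
    baseline: Optional[str] = None           # at most one approved version
    depends_on: list = field(default_factory=list)

artifacts = {
    "interface spec": Artifact("interface spec", ["v1", "v2-draft"], baseline="v1"),
    "component A design": Artifact("component A design", ["v1"], baseline="v1",
                                   depends_on=["interface spec"]),
    "component B design": Artifact("component B design", ["v1"], baseline="v1",
                                   depends_on=["interface spec"]),
}

def rebaseline(name: str, version: str) -> list:
    """Approve a new baseline version and list downstream artifacts to revisit."""
    artifacts[name].baseline = version
    return [a.name for a in artifacts.values() if name in a.depends_on]

print(rebaseline("interface spec", "v2-draft"))
# ['component A design', 'component B design']
```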

21.8 Life cycle and teams

What is the team size and background? How is it expected to change over time? A small team can often be a little less formal than a large team, because the small team (meaning no more than 5-10 people) can keep everyone informed through less formal communication. A large team is not able to rely on informal communication, so more explicit processes and communication mechanisms are important. Many teams start small when the project is first conceived, but grow large over time. A team that will grow will need to communicate more formally from the beginning than they otherwise might so that as they add people to the team, the larger team works smoothly.

Conversely, if the life cycle patterns indicate that some action will be performed by some person, does the team actually have the staff to do that work? When a project says that some work is to be done and then does not staff that function sufficiently, it sends a message to the team that they should not take the process as written seriously. This undermines the team’s trust. If the function is actually needed, either the team will find an ad hoc workaround or the function will not get done adequately. Either way, there will be a disconnect between what is written down and what actually happens.

21.9 Life cycle and planning

The life cycle patterns are just patterns that provide a guide to work that goes in the project’s plan. The plan is the actual definition of the tasks to be done. When the plan needs to be updated, the patterns provide a template for the work that goes into the plan.

Assembling the plan, however, takes into account many inputs, of which the pattern is only one. Planning involves deciding on the priority and deadlines for work, which is based on project deadlines, risk or uncertainty, and the project’s development methodology.

Chapter 61 discusses in detail how the plan is developed and maintained, including how the life cycle patterns get incorporated.

Consider the following example of how a pattern gets incorporated into the plan. This example shows how the pattern is only a template, and there are many decisions that will depend on other information.

This pattern defines what should happen when a customer requests a change. The basic pattern is that first someone on the team should evaluate the request; this may involve working with the customer to clarify the request, and with other engineers to estimate the scope and cost of the work. The project can then make a decision whether to accept the change or not. If the decision is to make the change, work to build, release, and deploy the update will follow. If not, there is another pattern for how to communicate with the customer that the change will not be made.

undisplayed image

The activity starts when the project receives a change request. Based on this, the plan can be updated to include three tasks right away: the evaluation, review, and decision tasks.

At the same time, the planner must make decisions: who should each task be assigned to? What priority should the flow of tasks have? The pattern can indicate the roles involved in the tasks, such as there being a small team responsible for evaluating change requests and a customer representative from the marketing team, but it doesn’t determine which specific people. That’s for the planning and tasking efforts to determine. Similarly, the pattern does not specify how the work should be prioritized relative to other work the same people are doing. The planner incorporates information about how urgent the customer’s request might be and the importance of the customer into the decision. The project might have decided, for example, that there should be a queue of outstanding change requests and they should be evaluated in their order in the queue.
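
To sketch how the pattern turns into plan entries: the first three tasks can be generated from the pattern as a template, with roles carried over from the pattern and assignees and priority left for the planner to decide. The identifiers and roles below are illustrative.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PlanTask:
    name: str
    role: str                        # comes from the pattern
    assignee: Optional[str] = None   # decided by the planner, not the pattern
    priority: Optional[int] = None   # decided from queue position, urgency, etc.

# The first three steps of the change-request pattern, as (task, role) pairs
CHANGE_REQUEST_PATTERN = [
    ("evaluate the change request", "change evaluation team"),
    ("review the evaluation", "engineers for affected high-level components"),
    ("decide whether to proceed", "customer representative"),
]

def instantiate(request_id: str, queue_position: int) -> list:
    """Create plan tasks from the pattern template; assignment comes later."""
    return [
        PlanTask(name=f"{request_id}: {task}", role=role, priority=queue_position)
        for task, role in CHANGE_REQUEST_PATTERN
    ]

tasks = instantiate("change request 42", queue_position=3)
tasks[0].assignee = "a member of the evaluation team"   # the planner's decision
```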

Determining who should be involved in a review of the evaluation might depend on the results of the evaluation. The pattern might indicate that the evaluation should be reviewed by engineers responsible for each high-level component that will be affected by the change. This means that the decision about who specifically will be tasked with the review can’t be made until the evaluation has worked out the scope of the change.

The decision to proceed with making an update will depend in part on whether the team has the time and resources to make the update. The team will need to determine whether adding the work to the plan will cause a problem with meeting deadlines that have been established already, or if it will overload a team that is already busy. This determination will involve analysis of the current plan—something that the life cycle pattern can help with only to the extent that the patterns can help with generating estimates of the work that would be involved.

When the project decides to go ahead with developing an update for the request, the pattern shows that work steps follow to develop the change and then release and deploy the update. That decision triggers the planning activity to add development and release work into the plan. These are high-level work steps with little detail. The planner will find patterns for these steps and populate those patterns into the plan.

Decisions about the work involved in development will depend on the development methodology that the team has selected to follow. If the update will involve extensive changes and the team is following a spiral-style methodology [Spiral], the development plan might consist of two or three development rounds. Each round would design and implement part of the changes, with a milestone at the end of each round showing how the partial changes have been integrated into the system.

Decisions about the release and deployment work will also incorporate policy decisions about how the team works. Will each change request result in a separate update release? Or will updates be bundled together into releases that combine several updates, perhaps on a schedule defined in advance?

21.10 Principles for a life cycle pattern

In this section I list some principles to consider when designing a workflow pattern.

The act of designing—or refining—a life cycle pattern is an opportunity to think deliberatively about how the team should get its work done. Life cycle patterns are the templates for the project’s plan, and so they should be designed to achieve the work that is needed to move the project forward well.

Designing the patterns ahead of time means having time to define good work patterns. The pattern does not have to be worked out under pressure, as a reaction to something unexpected happening in the project. It can be discussed among multiple team members to get different perspectives and to ensure everyone’s needs are met. Working in advance gives time to check that the steps in the pattern are consistent with each other. It means that there is time to think about what exceptional situations might happen and define what to do in those cases.

Note that if an organization already has an approach to life cycle patterns, whether documented or not, one should aim for continuity with that approach. Anyone already in the organization will know that approach to organizing work; making a major change would mean losing the advantage of established team habits. On the other hand, if the current approach is not working well, then a new project is an opportunity to improve.

The life cycle patterns encode principles and methodology that encourage good work. Principles to consider include:

  1. Know the purpose for something before developing it.
  2. Build in time for and incentivize deliberative thinking.
  3. Assign decision-making authority to an appropriate level based on the nature of the decision.
  4. Build in ways to check work, and design them so they are a team norm and not prone to triggering defensive reactions.
  5. Build for the longer term.
  6. Build in project-wide decision points.
  7. Think about exceptions that might happen, how to handle them, and when to change course.
  8. Define the work so that everyone on the team can agree when a step has been completed.
  9. Give a clear definition for each step of the quality considerations by which the work can be judged.
  10. Make the pattern as light-weight as possible without compromising quality.

Purpose. I have mentioned this principle several times already, and I believe it is a basic principle of effective system-building. The life cycle patterns encode this principle for specific parts of the team’s work.

As with anything else that is designed, a pattern itself starts with a purpose. That purpose might be “build a simple component” or “build the whole system” or “handle a customer’s change request”. A good pattern addresses its purpose thoroughly, without trying to achieve other purposes.

The pattern that results should then ensure that team members follow this approach when building parts of the system. If the pattern is for handling a customer’s change request, for example, the pattern should address understanding and documenting what the customer wants changed (and why), before starting to work out whether to agree to the change or to begin implementing the change.

Time to think. Key parts of a complex system are best served by taking some time to properly understand the purpose or need of that part, and to look at options for how it can be designed or built. A project running at too fast a pace skips this thinking and uses the first thing that someone thinks of—though there may be subtle ramifications of that decision that are not appreciated until it causes a problem later. Asking someone to document the alternatives they considered, and rewarding them for doing so, improves the quality of the system.

At the same time, people can take too long to make a decision or fixate on making it perfectly. The time spent on deliberation should be bounded to avoid this.

Decision-making authority. Bezos introduced the idea of reversible and irreversible decisions [Bezos16]. He wrote:

Some decisions are consequential and irreversible or nearly irreversible—one-way doors—and these decisions must be made methodically, carefully, slowly, with great deliberation and consultation. If you walk through and don’t like what you see on the other side, you can’t get back to where you were before. We can call these Type 1 decisions. But most decisions aren’t like that—they are changeable, reversible—they’re two-way doors. If you’ve made a suboptimal Type 2 decision, you don’t have to live with the consequences for that long. You can reopen the door and go back through. Type 2 decisions can and should be made quickly by high judgment individuals or small groups.

As organizations get larger, there seems to be a tendency to use the heavy-weight Type 1 decision-making process on most decisions, including many Type 2 decisions. The end result of this is slowness, unthoughtful risk aversion, failure to experiment sufficiently, and consequently diminished invention.

For engineering projects, many decisions fall in the middle ground between reversible and irreversible. Consider building an aircraft. As long as the designs are just drawings, the designs can be changed with low to moderate cost. Early in the design process changes can be quite low cost; as the design progresses and more and more interdependent components are designed, the cost of rework increases. Once the airframe has been machined and assembled, the cost of changing its basic design becomes high, possibly high enough in time or in money that it is in effect irreversible.

Good life cycle patterns will account for different costs of reversing decisions. They should both build in time for deliberation and consultation before making hard-to-reverse decisions and use lighter-weight decision-making for less risky decisions. Similarly, the patterns should ensure that the authority for hard-to-reverse decisions is assigned to someone with high-level responsibility in the project, while the authority for low-risk decisions should be placed as close to the work as possible.
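
One way a pattern could encode this principle is a simple mapping from the estimated cost of reversing a decision to the level of authority and amount of deliberation required. The thresholds and role names below are purely illustrative assumptions, not a recommendation of specific values.

```python
def decision_process(reversal_cost_days: float) -> dict:
    """Map the estimated cost of reversing a decision (in person-days of
    rework) to a decision-making approach. Thresholds are illustrative."""
    if reversal_cost_days < 5:
        return {"authority": "component developer",
                "steps": ["decide and record the decision"]}
    if reversal_cost_days < 50:
        return {"authority": "component lead",
                "steps": ["consult affected developers",
                          "decide and record the decision"]}
    return {"authority": "project engineering board",
            "steps": ["written analysis of options",
                      "formal review and consultation",
                      "decision at a project-wide decision point"]}

print(decision_process(2)["authority"])     # component developer
print(decision_process(200)["authority"])   # project engineering board
```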

Checking work. Checking that work has been done well is commonly understood to improve the quality of results. It is essential for parts of a system that require high assurance—safety- or security-critical parts.

The key to checking is that the checks not be subject to implicit biases that the developer might have. This can be handled either by the developer doing analyses that force a stepping back from decisions (perhaps by encoding them mathematically) and that can be checked for accuracy by someone else, or by having an independent person review the work.

Either way, the developer’s pride in their work can feel threatened. Setting out life cycle patterns in which every part of the work is checked enables the project to make checks a norm. Designating in advance that checks will happen, and who will do them, helps depersonalize the effort and in the long term contributes both to quality work and team morale.

Building for the longer term. It is easy to solve an immediate problem at hand quickly and move on, leaving a problem for the future. Taking time to think about the problem (the principle of taking time for deliberative thinking, above) will help but is not sufficient.

It is likely that someone will revisit the work sometime in the future. They may need to understand the work in order to fix a flaw or make an upgrade. They may be auditing the work as part of a critical safety review. They will need to know the rationale for decisions that were made, and they will need to understand subtle aspects of the work. If this information has been documented, these people in the future will be able to do their work accurately and relatively quickly. If they have to deduce this information by looking at artifacts built in the work, they will have to spend time reverse-engineering the work and their accuracy is generally low.

Building checks into the pattern for documenting rationale and explanations will accelerate future work.

Project-wide decision points. Most projects have times when there is a decision whether to proceed or to cancel or to redirect the project. These include whether to start, times when funding is needed, public announcements, and irrevocable steps like launch. These decision points generally require work to prepare for them, which should be accounted for.

Exceptions. Things often go not to plan. What then? Who needs to know? What needs to be done to respond?

Sometimes this is as simple as setting an expectation for the team. If a component’s specification is inconsistent or cannot be met, who gets informed, and how does the problem get corrected?

Sometimes the situation is time-critical. If a major piece of equipment catches fire, what is the response? What if an insecure component has been incorporated and deployed? What if a large part of the system has been built, and someone finds a fundamental flaw? The responses to situations like these are complex, and there often isn’t time in the moment to work out the details.

Good life cycle patterns include pre-planned responses to these exceptional situations. This might consist of references to procedures that should be followed, or it might reference a pattern used to respond to the situation.

Completeness. Can everyone on the team agree when a part of the work has been completed? The person assigned a task should understand their assignment, so that they can do their work independently. Others will check the work, or mentor the person doing the work—and they should have the same understanding of the assignment.

The definition of actions, as well as the list of outputs and post-conditions for a pattern, should be clear to everyone.

Quality considerations. As with completeness, the people assigned to work on tasks need to have a clear definition of what makes the results of their work acceptable, or what makes one way better than another. Sometimes this is simple: when objectives or specifications, which would be inputs to a work step, are met. Other times considerations of quality arise not from specifications but from things like coding standards. In those cases the quality considerations should be spelled out explicitly so the people doing the work know to use them.

Light-weight patterns. Good patterns are lightweight enough to get their job done, and not more (Section 20.8.3). Working out the pattern in advance is an opportunity to work out what parts of the work are truly needed and which can be omitted or simplified. For example, a pattern should be adapted to the possible cost of making a wrong decision (see decision-making authority above). Patterns that involve easily-reversible decisions should include streamlined decision-making steps, pushing the decision authority to as low a level in the team as possible and involving as little work as possible. On the other hand, more difficult decisions should involve a pattern that calls for greater deliberation, more checking and consultation, and places decision-making authority higher in the team’s hierarchy.

Similarly, the patterns should be achievable by the team. If the team is small, it makes no sense to mandate complex work flows for which there isn’t the staff. Each decision about what to include in a pattern should be measured against what is possible for the team to perform.

Sidebar: Summary

Chapter 22: Development methodologies

2 October 2024

22.1 Purpose

A project should choose life cycle patterns that fit with its development methodology.

A development methodology is the overall style of how a project decides to organize the steps in developing the system. This includes decisions like whether to develop the system in increments of functionality, whether to design everything before building, whether to synchronize everyone’s efforts to a common cycle, and so on. These decisions are reflected in obvious ways in the life cycle patterns a project uses.

There are many methodologies named in the literature: waterfall, spiral, agile, and so on. Different sources interpret each of these differently, and they are rarely compared on a common basis. Some of these, like waterfall methodology, have evolved over time and do not have a single clear source or definition. Others, such as agile development, have a defining document (manifesto) to reference.

All of the methodologies I know of have come to be treated as dogma, and are more often caricatured than treated thoughtfully. This is unfortunate because each of the methodologies has something useful to offer, while all of them are harmful to project effectiveness if taken as dogma or used without thoughtful understanding.

These methodologies can be organized and compared based on a few characteristics.

Rather than try to argue for or against any specific methodology—which is difficult, since most methodologies are hard to pin down among many published variants—I focus on these characteristics, and argue for choosing a methodology that has the characteristics that a project needs.

Size of design-build cycle. Methodologies like waterfall use “big design up front”, where the entire system is specified and designed before implementation begins. Other methodologies break up development into many specify-design-implement cycles.

undisplayed image
Size of design-build cycle.

The argument for doing as much design up front as possible is that errors are easier and cheaper to catch and correct before implementation than after. The arguments against are that in some complex systems the design work is exploratory and requires implementing part of the system to learn enough to know how to design—or not design—critical system parts.

Many iterative methodologies claim to be better at supporting adaptation as system purposes change.

Coupled or decoupled design-build. Some iterative methodologies plan to complete adding a feature to the system in one iteration, by executing an entire specify-design-build-integrate cycle for that feature. Other methodologies break up that cycle into multiple steps, and allow those steps to spread across multiple iterations.

undisplayed image
Coupled or decoupled design-build.

Advance planning. Some methodologies emphasize planning out work activities as far as possible into the future, while others focus on planning as little as possible in order to adapt as needs change.

undisplayed image
Planning to different horizons.

The argument for planning as far as possible into the future is that it gives the team stability: they have a reasonable expectation of what they should be working on now and have a sense of how that work will flow into other tasks soon after.

The argument for planning to shorter horizons is that someone will come along and change priorities or system purpose, and so the work will need to be changed to adapt. Planning too far ahead is wasted effort, it is argued, and gives teams a false sense of stability.

Regular release or integration. When a methodology uses many design-implement cycles, at the end of each cycle it can require that new implementations be integrated into a partially-working system, or it can go farther and require that the partially-developed system be releasable. Most iterative methodologies recognize that very early partial systems may not be releasable because they are too incomplete.

Regular release is feasible for products that are largely software, where a new release can be put into operation for low effort. It is less feasible for products that involve a large, complex hardware manufacturing step between development and putting a system into operation.

The choice of whether to release regularly or not is often dictated by the relationship with the customer(s) and whether the system is still being implemented the first time, or is in maintenance. Once the system has been deployed, development is likely either for fixes or for new features; these are often released and deployed as soon as possible.

Synchronization across project. Some methodologies that break up development into multiple iterations align all the work being done at one time so that the iterations begin and end together. Other iterative methodologies allow some work iterations to proceed on different timelines from other work.

undisplayed image
Synchronized versus unsynchronized tasks.

Synchronizing work iterations across the whole project can provide common points to check that work is proceeding as it should and to share information about progress. However, it can also break up tasks that run far longer than others and result in a perception that the synchronization is wasteful management overhead rather than something useful.

Shared short-term purpose across project. Iterative methodologies can focus the entire team on one set of features across all the work going on at one time, or they can allow different streams of work to have different focuses in the short term.

The argument for this practice is that the more people share a common goal, the more they will be motivated to work together to meet that goal and to defer work that does not address that common goal. The argument for having multiple work streams with different focuses is that too often a project will involve work from different specialties and on different timelines: mechanically assembling an airframe and building a flight control algorithm have little in common.

undisplayed image
Shared purpose.

22.2 Methodologies

I present three of the most commonly discussed development methodologies in order to illustrate how they can be characterized. Each of these methodologies has many variants, and all are the subjects of debates comparing tiny details of each variant. The purpose of this section is to illustrate how they can be analyzed, not to capture all nuances of every methodology in use.

Waterfall development. This approach to development follows the major life cycle phases in sequential order. It begins with concept development, moves through specification to design, and only then begins implementation.

Waterfall development is well suited to building systems that have decision points that are difficult or expensive to reverse. The NASA project life cycle (Section 23.2.1) follows a waterfall-like sequence for its major phases because there are three decision points that do not allow for easy adjustment: getting government funding approval; building an expensive vehicle; and spacecraft launch.

This methodology can be inefficient when the system cannot be fully specified up front. When the system’s purpose changes mid-development, or when some early design decision proves to have been wrong, the methodology does not have support built in for how to respond. Projects using this kind of methodology are known to have difficulty sticking to schedules and costs that were developed early in the project, usually because something happened that was not anticipated at the start.

In one spacecraft design project I worked on (Section 4.1), the team assembled a giant schedule for the whole project on a 20-foot-long whiteboard. This schedule detailed all the major tasks needed across the entire system. That schedule ended up requiring constant modification as the work progressed.

Waterfall development requires great care when building a system with significant technical unknowns. The serial nature of execution means that some important decisions must be made early on, when little information is available on which to base that decision. When those unknowns are understood, the project can put investigation or prototyping steps into the specification or design phases in order to gather information for making a good decision. On the other hand, if the team does not learn that some technical uncertainty exists until the project is into the implementation phase, the cost of correcting the problem can be higher than with other methodologies. In addition, the sequential nature of execution can create an incentive for a team to muddle through without really addressing the unknown, resulting in a system that does not work properly.

In the spacecraft design project I mentioned, there were technical problems with the ability of the spacecraft to communicate with each other. These problems were not properly identified and investigated in the early phases of the project. As the team designed and implemented parts of the system, different people tried to find partial solutions in their own areas of responsibility, but the team overall continued to try to move ahead. In the end the problems were not solved and the spacecraft design was canceled.

Waterfall characteristics
Cycle size: One design-build cycle for the whole project
Coupled design-build: One cycle, so implicitly coupled
Planning: Plan as far as possible, especially after design
Release and integration: At end of project
Synchronization: n/a
Short-term purpose: n/a

Iterative and spiral development. This development methodology is characterized by building the system in increments. Each increment adds some amount of capability to the system, applying a specify-design-build-integrate cycle. Typically the whole team works together on that new capability.

Early increments in such a project often build a skeleton of the system. The skeleton includes simple versions of many components, along with the infrastructure needed to integrate and test them. Later increments add capabilities across many components to implement a system-wide feature.

Teams using iterative development often plan out their work at two levels: a detailed plan for the current iteration, and a general plan for the focus of the iterations that will follow.

This methodology builds in more flexibility to handle change than the waterfall methodology does.

Iterative development can be used to prioritize integration (Section 8.3.2), in order to detect and resolve problems with a system’s high-level structure as early as possible. This involves integration-first development, where the team focuses on determining whether the high-level system structure is good ahead of putting effort into implementing the details of the components involved.

Iterative and spiral characteristics
Cycle size: One set of features crossing the whole system
Coupled design-build: Generally add to design and implementation for the feature(s) in the iteration
Planning: At the beginning of each iteration; variants maintain a roadmap of iterations or spirals
Release and integration: Either; every iteration ends with an integrated working system
Synchronization: All work synchronized to the iteration
Short-term purpose: Shared within the iteration

Agile development. The agile methodologies—there are many variants—focus the team on time-limited increments, often called sprints. The approach is to maintain a list of potential features to build or tasks to perform (the backlog). At the beginning of a sprint, the team selects a set of features and tasks to do over the course of that sprint. By the end of the sprint, the features have been designed, implemented, verified, and integrated into the system. In other words, there is a life cycle pattern that applies to building each feature within a sprint.
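
A minimal sketch of that sprint-planning step: pick items from the backlog in priority order until the team's capacity for the sprint is filled. The backlog fields and the capacity measure are assumptions for illustration, not part of any particular agile method.

```python
# Hypothetical backlog items: (name, priority, estimated effort in person-days)
backlog = [
    ("fix login timeout", 1, 3),
    ("add export to CSV", 2, 5),
    ("refactor storage layer", 3, 8),
    ("update user guide", 4, 2),
]

def plan_sprint(items, capacity_days):
    """Select backlog items in priority order until capacity is used up."""
    selected, remaining = [], capacity_days
    for name, _priority, effort in sorted(items, key=lambda item: item[1]):
        if effort <= remaining:
            selected.append(name)
            remaining -= effort
    return selected

print(plan_sprint(backlog, capacity_days=10))
# ['fix login timeout', 'add export to CSV', 'update user guide']
```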

Agile development aims to be as responsive to changes as possible. The start of each sprint is an opportunity to adjust the course of the project as problems are found or the team gets requests for changes. The agile methodologies arose from projects that were trying to keep the customer as involved as possible in development, so that the team’s work would stay grounded in customer needs and so that the customer could give feedback as their own understanding of their needs changed.

At their worst, the agile methodologies have been criticized for three things: an excess of meetings, drifting focus, and difficulty handling long-duration tasks. Note that these critiques come from people in teams who claim to be using agile methodologies, and reflect problems with the way teams implement agile approaches and not necessarily problems with the definition of the methodology itself.

Agile development emphasizes continuous communication within a team. In practice, this can lead to everyone on the team having multiple meetings each day: daily stand up meetings, sprint planning, sprint retrospectives, and so on. This likely comes from teams using meetings as the primary way to communicate, and from democratizing planning decisions that could be made the responsibility of fewer people.

Some agile projects have been characterized as behaving like a particle in Brownian motion: taking a random new direction in each iteration or sprint. This can happen when the team only looks at its backlog of needed tasks each iteration, or when new outside requests are given priority over continuing work. The focus on agility and constant re-evaluation of priorities can lead teams to this behavior, but it is not integral to the ideal of agile development. A team can develop a longer-term plan and use that plan as part of prioritizing work for each new sprint.
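
One way to keep sprint planning anchored to a longer-term plan is to score backlog items against the current roadmap theme as well as their standalone value. The sketch below is only illustrative; the fields, weights, and item names are assumptions for the example, not a prescription.

    # Illustrative sketch: scoring backlog items against a longer-term roadmap so
    # that sprint planning is not driven only by whatever request arrived most
    # recently. Fields, weights, and item names are assumptions for the example.

    from dataclasses import dataclass

    @dataclass
    class BacklogItem:
        title: str
        value: int          # standalone value to the customer (1-5)
        roadmap_theme: str  # which roadmap theme this item advances, if any
        effort: int         # rough cost in ideal days

    def sprint_priority(item: BacklogItem, current_theme: str) -> float:
        """Higher scores are planned first; items on the current theme get a boost."""
        theme_bonus = 3 if item.roadmap_theme == current_theme else 0
        return (item.value + theme_bonus) / max(item.effort, 1)

    backlog = [
        BacklogItem("Export report as CSV", value=2, roadmap_theme="reporting", effort=2),
        BacklogItem("Harden login flow", value=4, roadmap_theme="security", effort=5),
        BacklogItem("Fix crash on startup", value=5, roadmap_theme="", effort=1),
    ]

    # Sprint planning for a sprint whose roadmap theme is "security".
    for item in sorted(backlog, key=lambda i: sprint_priority(i, "security"), reverse=True):
        print(f"{sprint_priority(item, 'security'):.1f}  {item.title}")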

Finally, many complex systems projects involve long-running tasks that do not fit the relatively short timeline of sprints or iterations. Acquiring a component from an outside vendor or manufacturing a large, complex hardware component do not really fit the model of short increments.

Agile characteristics
Cycle size: One short iteration with many independent features and tasks, bounded in duration
Coupled design-build: Some agile practices focus on features, with a design-build cycle within one sprint to implement a feature; other agile practices decouple designing, building, and verifying, allowing those to be spread over multiple iterations
Planning: At the beginning of each iteration; variants have a longer-term general plan
Release and integration: Either; every iteration ends with an integrated working system
Synchronization: All work synchronized to the iteration or sprint
Short-term purpose: Each task has its own purpose
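
The characteristic rows in these three tables can also be captured as a small record that a project can fill in when documenting its own hybrid methodology. The sketch below simply mirrors the rows above; the field names and the condensed values are mine, not part of any standard.

    # The characteristic rows from the tables in this chapter, captured as a record
    # that a project can fill in for its own hybrid methodology. The field names
    # mirror the table rows; the condensed values are mine, not part of any standard.

    from dataclasses import dataclass

    @dataclass
    class MethodologyProfile:
        cycle_size: str
        coupled_design_build: str
        planning: str
        release_and_integration: str
        synchronization: str
        short_term_purpose: str

    waterfall = MethodologyProfile(
        cycle_size="One design-build cycle for the whole project",
        coupled_design_build="One cycle, so implicitly coupled",
        planning="Plan as far as possible, especially after design",
        release_and_integration="At end of project",
        synchronization="n/a",
        short_term_purpose="n/a",
    )

    agile = MethodologyProfile(
        cycle_size="One short, time-bounded iteration with many independent tasks",
        coupled_design_build="Per-feature within a sprint, or spread over iterations",
        planning="At the start of each iteration; optionally a longer-term plan",
        release_and_integration="Every iteration ends with an integrated working system",
        synchronization="All work synchronized to the iteration or sprint",
        short_term_purpose="Each task has its own purpose",
    )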

22.3 Practical considerations

Most projects actually use a hybrid of the different methodologies. They may start from one of the generally available methodology definitions, but they adapt that template based on the needs of their project and their own experience.

I have been part of several projects that had hard decision points, reflected in project milestones. At these points, the project was expected to provide information that would lead to the project continuing or being canceled: the decision to award the team a contract to build the system, or decisions to continue funding. These decision points impose a degree of waterfall-like structure on the work. For example, one project had to present a proposal to a government agency in order to get a contract to perform detail design and prototype implementation. That proposal involved developing the concept for the system, showing how it met the customer’s objectives, and showing that there was a likely feasible design.

However, once the contract had been awarded, the team used a spiral development methodology to build a sequence of increasingly capable versions of the system, and used those to demonstrate the completion of defined features at the end of each spiral. Within each spiral, the software team used an agile-like approach based on two-week sprints.

Every successful project I have been part of has done some degree of planning and high-level design well ahead of detail design and implementation. When the project used spiral or iterative development, the general flow from one spiral to another provided guidance to keep the work on track to reach the defined end point. When the project used agile methods to manage tasks in the short term, the longer-range plan kept the tasking decisions in each sprint from going off track by providing a basis for prioritizing the work.

No matter what methodology the projects have followed, they have all had some kind of regular cycle for checking in so that project leadership could find out when there were problems, and so that team members could maintain awareness of progress across the whole project. In some cases this looked a lot like agile practice, with short daily stand up meetings and regular, more in-depth discussions. In other cases the nature and schedule for checking in depended on the part of the system people were working on—from continuous interaction in some parts of software development to weekly updates from people doing safety analysis.

All the projects I have worked on have also had tasks that varied widely in duration. Many software-related tasks were short, in the range of hours to a few days, while testing or hardware implementation often required weeks to complete. Most of the projects did not try to make all these tasks fit into a synchronized schedule across the project.

In practice, therefore, the projects I have seen succeed have applied common sense in choosing the development methodology for their specific project.

Sidebar: Summary

Chapter 23: Example life cycle patterns

2 October 2024

23.1 Introduction

In this chapter I survey some of the many different life cycle patterns in use.

The patterns have different scopes. Some cover the whole life of a system, from conception through retirement. Some are concerned only with developing a system. Others focus on more narrow parts of the work.

I group the patterns in this chapter into four sets, based on scope. The first group covers the whole life of a project, without much detail in the individual steps. The second dives into the development process. The third addresses post-development processes—for releasing and deploying a system; these patterns overlap with development processes. The fourth and final group is for patterns with a narrow focus on some specific detail of building a system.

Patterns with different scopes can potentially be combined. Most patterns that cover a system’s whole life, for example, define a “development phase” but do not detail what that is. One of the patterns for developing a system can be used for the details.

Each of the examples will include a comparison against the following baseline pattern for the whole life of a project.

undisplayed image

The baseline phases are the same as in Section 21.3:

23.2 Whole project life cycle

These patterns organize the overall flow of a project, from its inception through system retirement and project end. I have selected two examples: the NASA project life cycle, which is used in all NASA projects big and small, and the Rational Unified Process, which arose from a more theoretical understanding of how projects should work.

23.2.1 NASA project life cycle

The NASA life cycle has been refined through usage over several decades. It is defined in a set of NASA Procedural Requirement (NPR) documents. The NASA Space Flight Program and Project Management Requirements document [NPR7120] defines the phases of a NASA project.

The NASA life cycle model is designed to support missions—prototypically, a space flight mission that starts from a concept, builds a spacecraft, and flies the mission.

NASA space flight missions involve several irreversible decisions, and this is reflected in how the phases and decisions are organized. Obtaining Congressional funding for a major mission can take months or years. During development, constructing the physical spacecraft, signing contracts to acquire parts, and allocating time on a launch provider’s schedule are all expensive and time-consuming to reverse. Launching a spacecraft, placing it in a disposal orbit, and deactivating it are all irreversible. These decision points are reflected in where there are divisions between phases, and when there are designated decision points in the life cycle.

There are several life cycle patterns for NASA projects, depending on the specific kind of program or project. I focus on the most general project life cycle [NPR7120, Fig. 2-5, p. 20], which is reproduced below:

undisplayed image

The pattern includes seven phases. There is a Key Decision Point (KDP) between phases. Each decision point builds on reviews conducted during the preceding phase, and the project must get approval at each decision point to continue on to the next phase.

The key products for each phase are defined in Chapter 2 of the NPR and in Appendix I [NPR7120, Table I-4, p. 129].

Pre-Phase A (Concept studies). This phase occurs before the agency commits to a project. It develops a proposal for a mission, and builds evidence that the concept being proposed is both useful and feasible. A preliminary schedule and budget must be defined as well. If the project passes KDP A, it can begin to do design work.

Phase A (Concept and technology development). This phase takes the concept developed in the previous phase and develops requirements and a high-level system or mission architecture, including definitions of the major subsystems in the system. It can also involve developing technology that needs to be matured to make the mission feasible. This phase includes defining all the management plans and process definitions for the project.

Phase B (Preliminary design and technology completion). This phase develops the specifications and high-level designs for the entire mission, along with schedule and budget to build and complete the mission. Phase B is complete when the preliminary design is complete, consistent, and feasible.

Phase C (Final design and fabrication). This phase involves completing detailed designs for the entire system, and building the components that will make up the system. Phase C is complete when all the pieces are ready to be integrated and tested as a complete system.

Phase D (Assembly, integration, test, launch, checkout). This phase begins with assembling the system components together, verifying that the integrated system works, and developing the final operational procedures for the mission. Once the system has been verified, operational and flight readiness reviews establish that the system is ready to be launched or flown. The phase ends with launching the spacecraft and verifying that it is functioning correctly in flight.

Phase E (Operations and sustainment). This phase covers performing the mission.

Phase F (Closeout). In this phase, any flight hardware is disposed of (for example, placed in a graveyard orbit or commanded to enter the atmosphere in order to destroy the spacecraft). Data deliverables are recorded and archived; final reviews of the project provide retrospectives and lessons learned.

This pattern of phases grew out of complex space flight missions, where expensive and intricate hardware systems had to be built. These missions often required extensive new technology development. The projects involved building hardware systems that required extensive testing. The NASA procedures for such missions are therefore risk-averse, as is appropriate.

I have observed that many smaller, simpler space flight projects have not followed this sequence of phases as strictly as higher-complexity missions do. Many cubesat missions, where the hardware is relatively simple and more of the system complexity resides either in operations or in software, have blurred the distinctions between phases A through C. In these projects, software development has often begun before the Preliminary Design Review (PDR) in Phase B.

At the same time, I have observed some of these smaller space flight projects failing to develop the initial system concept and requirements adequately before committing to hardware and software designs. This has led to projects that failed to meet the mission needs—in one case, leading to project cancellation.

The phases in the NASA life cycle compare with the baseline model presented earlier as follows.

undisplayed image

The NASA life cycle splits the system development activities across four phases. The NASA approach does this because it needs careful control of the design process, in particular so that agency management can make decisions whether to continue with a project or not at reasonable intervals. The NASA approach also places reviews throughout the design and fabrication in order to manage the risk that the system’s components will not integrate properly. Many NASA missions involve spacecraft or aircraft that can only be built once because of the size, complexity, and expense of the final product; this makes it hard to perform early integration testing on parts of the system and places more emphasis on design reviews to catch potential integration problems.

The NASA pattern is notable for some initial work on a mission concept starting before the project is officially signed off and started. There are two reasons for this. First, because all NASA missions have common processes, there is less unique work to do for each individual project. Second, NASA is continuously developing concepts for potential missions, and this exploratory work is generally done by teams that have an ongoing charter to develop mission concepts. For example, the concept for one mission I worked on was developed by the center’s Mission Design Center, which performed the initial studies until the concept was ready for an application for funding.

23.2.2 Unified Process

The Unified Process (UP) was a family of software development processes developed originally by Rational Software, and continued by IBM after they acquired Rational. Several variants followed in later years, each adapting the basic framework for more specific projects.

The UP was an attempt to create a framework for formally defining processes. It defined building blocks used to create a process definition: roles, work products, tasks, disciplines (categories of related tasks), and phases.

The framework led to the creation of tools to help people develop the processes. IBM Rational released Rational Method Composer, which was later renamed IBM Engineering Lifecycle Optimization – Method Composer [IBM23]. A similar tool was included in the Eclipse Foundation’s process framework, which appears to have been discontinued [EPF]. These tools aimed to help people develop processes and then publish the process documentation in a way that would let people on a team explore the processes.

While the UP and its tools gained a lot of attention, their actual use appears to have been limited. I explored the composer tool in 2014, and found it remarkably hard to use. It came with a complex set of templates, which were too detailed for the project I was working on. Another author wrote that “RUP became unwieldy and hard to understand and apply successfully due to the large amount of disparate content”, and that it “was often inappropriately instantiated as a waterfall” [Ambler23]. Certainly I found that the presentation and tools encouraged weighty, complex process definitions and that they led the process designer into a waterfall development methodology.

The UP defined four phases: inception, elaboration, construction, and transition.

  1. Inception. The inception phase concerns defining “what to build”, including identifying key system functionality. It produces the system objectives and a general technical approach for the system.
  2. Elaboration. This phase is for defining the general system structure or architecture and the requirements for the system. The results of this phase should allow the customer to validate that the system is likely to meet their objectives. This phase may be short if the system is well defined or is an evolution of an existing system. If the system is complex or requires new technology, the elaboration phase may take longer.
  3. Construction. This involves developing detailed component specifications, then building and testing (verifying) the components. This includes integrating the components together into the whole system and verifying the result. The result is a completed system that is ready to transition to operation. RUP focuses on constructing the system in short iterations.
  4. Transition. This phase involves beta testing the system for final validation that the customer(s) agree that the system does what is needed, and deploying or releasing the final software product.

The UP does not directly address supporting production, system operation, or evolution; however, the expectation is that, for software products, there will be a series of regular releases (1.0, 1.1, 1.2, 2.0, …) that provide bug fixes and new features. Each release can follow the same sequence of phases while building on the artifacts developed for the previous release.

The four phases in UP compare with the simple model presented earlier as follows:

undisplayed image

The Unified Process provides lessons for defining life cycle patterns: keep the patterns simple, make them accessible to the people who will use them, and put the emphasis on what they are for, not on tools and forms. The basic ideas in UP are good—carefully defining a life cycle, and building tools to help with the definition. I believe these good ideas got lost because the effort became too focused on elaborate tools and models, losing sight of the purpose of life cycle patterns: to guide the team that actually does the work.

23.3 System development patterns

Some patterns focus only on the core work of developing a system. These patterns generally begin after the project has been started and the system’s purpose and initial concept are worked out. The patterns go up to the point when a system is evaluated for release and deployment. In between, the team has to work out the system’s design, build it, and verify that the implementation does what it is supposed to.

These examples all share the common basic sequence of specifying, designing, implementing, and verifying the system or its parts. Some of the examples include similar sequences of activity to evolve the system after release.

23.3.1 Systems V model

This pattern is used widely in systems engineering work. It is organized around a diagram in the shape of a large V. It appears in many texts on systems engineering; it has also been used to organize standards, such as the ISO 26262 functional safety standard [ISO26262, Part 1, Figure 1].

In general, the left arm of the V is about defining what should be built. The right arm is about integrating and verifying the pieces of the system. Implementation happens in between the two. One follows a path from the upper left, down the left arm, and back up the right side to a completed system.

There is no one V model. There are many variants of the diagram, depending on the message that the author is trying to convey. Here are two variants that one often encounters.

The first variant focuses on the sequence of work for the system as a whole:

undisplayed image

The second variant focuses on the hierarchical decomposition of the system into finer and finer components:

undisplayed image

The key idea is that specifications, of the system or of a component, are matched by verification steps after that thing has been implemented.

In general this model conflates three ideas that should be kept separate.

  1. Development follows a general flow of specification, then design, then implementation, then verification.
  2. System development proceeds from the top down: start with the whole system, and recursively break that into components until one reaches something that can be implemented on its own.
  3. Development follows a linear sequence from specification and design, through implementation of components, followed by bottom up integration of the components into a system (with verification along the way).

The first two ideas are reasonable. Having a purpose for something before designing and building it is a good idea. There are exceptions, such as when prototyping is needed in order to understand how to tackle design, but even that exception is merely an extension to the general flow. The second idea, of working top down, is necessary because at the beginning of a project one only knows what the system as a whole is supposed to do; working out the details comes next. Again there are exceptions, such as when it becomes clear early on that some components that are available off the shelf are likely useful—but again, that can be treated as an extension of the top down approach.

The third idea works poorly in practice. It is, in fact, an encoding of the waterfall development methodology into the life cycle pattern, and so the V model inherits all the problems that the waterfall methodology has.

In particular, the linear sequence pushes the riskiest development work as late as possible, when problems are most expensive to find and fix. By integrating components bottom up, minor integration problems are discovered first, shortly after the low-level components have been implemented, when fixing those components is cheapest. Higher-level integration problems are left until later, when complex assemblies of low-level components have been integrated together. These integration problems tend to be harder to find because the assemblies of components have complex behavior, and more expensive to fix because small changes in some of the components trigger further changes within the assemblies that have already been integrated.

Development methodologies other than waterfall address these issues better, as I discussed in Chapter 22.

23.3.2 Systems or software development life cycle (SDLC)

There are several life cycle definitions for system development, primarily of software systems, that go by the SDLC name. They generally have similar content, with variations that do not change the overall approach.

I have not found a definitive source for any of the SDLC variants; the pattern appears to be community lore, referenced in many web pages and articles.

The core of the SDLC consists of between six and ten phases, depending on the source, that give a sequence for how work should proceed in a project. The phases are:

Phases marked (*) are not included in all sources.

Most discussions of SDLC stress that the pattern is meant to help organize a project’s work, not to dictate the sequence of activities. Some sources then discuss how the SDLC relates to development methodologies. A project using the waterfall methodology would perform the phases in sequence. Iterative and spiral development would lead to a project repeating parts of the SDLC sequence multiple times, once for each increment of functionality that the project adds to a growing system. A project using an agile methodology would perform tasks at multiple points in the SDLC sequence in any given iteration, as long as the work for any one part or function of the system follows that sequence. I discussed how life cycles fit with development methodologies in Chapter 22.
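
As a rough illustration of that relationship, the same phase functions can be run once straight through (as in waterfall) or looped over per increment (as in iterative development). The phase names below follow the generic specify-design-implement-verify sequence used in this chapter rather than any particular SDLC source, and the functions are placeholders.

    # Rough illustration of how the same life cycle phases relate to different
    # development methodologies. The phase functions are placeholders; the names
    # follow the generic specify-design-implement-verify sequence used in this
    # chapter, not any particular SDLC source.

    def specify(scope):   print(f"specify   {scope}")
    def design(scope):    print(f"design    {scope}")
    def implement(scope): print(f"implement {scope}")
    def verify(scope):    print(f"verify    {scope}")

    PHASES = [specify, design, implement, verify]

    def waterfall(system="whole system"):
        # Each phase is performed once, in sequence, for the whole system.
        for phase in PHASES:
            phase(system)

    def iterative(increments):
        # The same sequence is repeated for each increment of functionality.
        for increment in increments:
            for phase in PHASES:
                phase(increment)

    waterfall()
    iterative(["skeleton", "feature A", "feature B"])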

23.4 Post-development patterns

23.4.1 EVT/DVT/PVT

Many electronics development organizations use a set of development and testing phases:

undisplayed image

This set of phases is intended for developing an electronic hardware component, such as an electronics board. Developing this kind of hardware differs from developing a software component: while software source code can be compiled and tested immediately, a board design must be built into a physical instance before much of its testing can happen. Simulating the board can be done earlier, of course, but much testing is only done on the physical instance. This is especially true for integrating multiple boards together.

This pattern also addresses not just the design and testing of the component itself, but also the ability to manufacture it—especially when the component is to be manufactured in large numbers. The NASA, V, and SDLC patterns do not address manufacturing specifically; this pattern can be combined with those if a project involves manufacturing.

EVT. The EVT phase is preceded by developing requirements for the hardware product. It is often also preceded by development of a proof of concept or prototype for the board.[1]

During EVT, the team designs and builds an initial working version, often going through a few revisions as testing reveals problems. The EVT phase ends when the team has a version whose design passes basic verification.

DVT. The DVT phase involves more rigorous testing of a small batch of the designed board. The design should be final enough that sample boards can be submitted for certification testing. The DVT phase ends when the sample boards pass verification and certification tests.

PVT. The PVT phase involves developing the mass manufacturing process for the board. This includes testing a production line, assembly techniques, and acceptance testing.

23.5 Detail patterns

The last two patterns have to do with managing changes to the system: when errors are found, and when customer needs change.

Both these patterns apply to specific, short parts of a project. They apply as needed—when an error report or a change request arrives. Both also potentially involve repeating parts of the overall development life cycle pattern. Both may be used many times in the course of a project.

23.5.1 Defect or error management

This life cycle applies when someone reports a defect or error in the system. It includes fixing the problem and learning from it.

Common practice is to use an issue or defect tracking tool to keep track of these reports and the status of fixing them. Many of those tools have an internal workflow, and parts of this life cycle pattern end up embedded in that internal workflow.

There are two different times when people handle error reports: when errors are found during testing, before an implementation is considered verified, and later, when a verified design or implementation must be re-opened. In the first case, the people doing verification are expected to be working closely with the people implementing that part of the system; the pattern for that activity amounts to reporting an error, fixing it, and verifying the fix.

undisplayed image

The general pattern for addressing later errors is:

  1. Reporting. Someone finds an error and reports it into the tracking tool.
  2. Triage and ranking. Determine what to do about the report. Someone investigates the report to determine whether it is understandable and actionable. This may involve communicating with whoever reported the problem to get more information, either about their expectation or about what they found. The result is either accepting or rejecting the report. If accepted, the report is given a priority level (typically one of four or five levels) and a likely part of the system affected. If rejected, the result is an explanation of why the report will not be acted on.
  3. Assignment. Who will be responsible for resolving this error? Most projects have a standard procedure: whoever is responsible for the component identified as the likely source, or people can pick up reports when they have time, or a manager can make assignments.
  4. Analysis. What is the actual problem that caused the report, and how can a fix be verified? This investigation might involve working to reproduce the problem. The analysis may find that the source of the erroneous behavior is a defect in a different part of the system than originally determined during triage, which may lead to assigning the responsibility to someone else. The analysis may show that the problem is broadly systemic or that it arises from multiple defects in multiple parts, in which case several people will be involved with fixing the problems and overall responsibility for the report may be moved to someone who can oversee all the affected parts.
  5. Fix. Making changes to the system amounts to recapitulating the overall life cycle pattern for building a part of the system. This can be seen as an instance of rewinding progress, as discussed in Section 20.3. The fix might be simple—there is one part of one component that is reimplemented, a test is developed to check the change, and it can be reviewed and approved. On the other extreme, the problem might come from a high-level design decision; the fix may involve changing that design, which in turn changes the specifications for multiple components, each of which must have their designs updated, their implementation and tests updated to match.
  6. Verify. Once a fix has been implemented, the changes are verified. The fix is verified to ensure that it actually addresses the reported problem, and that it hasn’t created new problems. The changes may have affected how components integrate together, in which case the verification status of integration and interactions among them may be invalidated by the change, and the integration must be revalidated.
  7. Review and accept. Once the fixes are complete and have been verified, the fix can be reviewed and accepted. At this point the updated designs and implementations are baselined (that is, made as the current mainline working version). The record of the error can be marked as completed.
  8. Learning. Is there something that can be learned from the error and its fix that can be used to avoid similar errors in the future? This may be informal learning by the people involved, or it may be important enough to document and use to educate others on the team.
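
Issue trackers typically encode a workflow like the one above as a set of states and allowed transitions. The sketch below is a simplified, hypothetical encoding of these steps; real tools differ in their state names and rules.

    # Simplified, hypothetical encoding of the defect-handling pattern above as a
    # state machine of the kind an issue tracker might enforce. State names and
    # allowed transitions are illustrative; real tools differ.

    ALLOWED = {
        "reported":        {"triaged", "rejected"},
        "triaged":         {"assigned"},
        "assigned":        {"in_analysis"},
        "in_analysis":     {"in_fix", "assigned"},      # analysis may reassign the report
        "in_fix":          {"in_verification"},
        "in_verification": {"in_fix", "accepted"},      # failed verification reopens the fix
        "accepted":        {"closed"},                  # closed once any lessons are recorded
        "rejected":        set(),
        "closed":          set(),
    }

    class DefectReport:
        def __init__(self, title: str) -> None:
            self.title = title
            self.state = "reported"

        def move_to(self, new_state: str) -> None:
            if new_state not in ALLOWED[self.state]:
                raise ValueError(f"cannot move from {self.state} to {new_state}")
            self.state = new_state

    report = DefectReport("Telemetry drops out every few minutes")
    for step in ["triaged", "assigned", "in_analysis", "in_fix",
                 "in_verification", "accepted", "closed"]:
        report.move_to(step)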

23.5.2 Change requests

From time to time, someone will request changes to the system. The request may come from a customer, asking for a change in behavior or capability. The request may come from the organization or funder, reflecting a desire to meet a different business objective. The request might even come from a regulator, when the regulations governing a system change or when the regulator finds a problem when reviewing the system.

The pattern for handling a change request has much in common with the one for handling a defect report.

undisplayed image

After receiving a request, someone evaluates it to ensure that it is complete and understood. Then there is a decision whether to proceed with the change and, if so, what priority to give it. Once the decision to proceed is made, there are steps to design, implement, and verify the changes and eventually release the new version of the system.

Change requests differ from defect reports in two ways. First, requests for changes do not reflect an error in the system as it stands. The team can proceed building the system to meet its current purpose and defer making changes until after the current version is complete and released. Second, most requests are expressed as a change in the system’s purpose or high-level concept rather than as a report that a specific behavior in a specific part of the system does not meet its specification or purpose. A high-level request will have to be translated into, first, changes in the top-level system specification, and then propagated downward through component specifications and designs to work out how to realize the changes. This sequence of activities to work from the change of objective to specifications to designs to implementations is essentially the same as the activities to specify, design, and implement the system in the first place. In the pattern shown above, the “develop update” step amounts to recapitulating the overall system development pattern.

The decision to proceed with a change or to reject it depends on whether the change is technically feasible and whether it can be done with the time and resources available. This depends on having an analysis of the complexity involved in making the change. Ideally, the team will be able to estimate the complexity with reasonable accuracy and little effort. Analyzing a change request will go faster and more reliably if the team has maintained specification and design artifacts that allow someone to trace from a system purpose, down through the system concept, into specifications and designs, to find all the parts of the system that might be affected by a change. If the team has not maintained this information, someone will have to work out these relationships from the information that is available—which is difficult and error-prone.
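
Where those traceability links have been maintained, impact analysis can be as simple as walking the links from an affected need or specification down to the components that realize it. The sketch below uses a toy graph with made-up artifact names.

    # Toy sketch of impact analysis over maintained traceability links: walk from a
    # changed need or specification down to every artifact derived from it. The
    # artifact names and links are made up for illustration.

    from collections import deque

    TRACE = {  # artifact -> artifacts derived from it
        "need: operators see alerts within 5 seconds": ["spec: alert latency"],
        "spec: alert latency": ["design: alert pipeline"],
        "design: alert pipeline": ["component: collector", "component: notifier"],
        "component: collector": [],
        "component: notifier": [],
    }

    def impacted(changed: str) -> list[str]:
        """Return every artifact reachable from the changed one."""
        seen, queue = [], deque([changed])
        while queue:
            current = queue.popleft()
            for child in TRACE.get(current, []):
                if child not in seen:
                    seen.append(child)
                    queue.append(child)
        return seen

    print(impacted("need: operators see alerts within 5 seconds"))
    # ['spec: alert latency', 'design: alert pipeline',
    #  'component: collector', 'component: notifier']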

23.6 Comparisons and lessons learned

The life cycle patterns in this chapter have all been developed in order to guide teams through their work. To meet this objective, they have to be accessible and understandable by the teams using them; they can’t be explained in legalistic documents that include many layers of qualification and exceptions. Some of these have passed this test and have been used successfully. Others, such as the Unified Process, have not caught on.

Some of the patterns cover the whole project, while others address specific phases or activities. One pattern often references other patterns: for example, a high-level pattern like the NASA project life cycle uses lower-level patterns for developing components or handling change requests. Some low-level patterns, such as handling change requests or error reports, can end up using or recapitulating higher-level patterns.

The specific patterns that a project uses depend on that project’s needs. A software project that is expected to be continuously reactive to new customer needs works differently from a project that is building an aircraft, where rebuilding the airframe can cost lots of money and time. The NASA approach is influenced by the US Government fiscal appropriation and acquisition mechanisms, which require programs to have multiple points where the government can assess progress and choose to continue or cancel a program.

All of these patterns implicitly start with working out the purpose of some activities before proceeding to do detailed work.

These patterns also implicitly reflect the cost of making and reversing a decision (Section 21.10). The NASA life cycle puts design effort before a decision to spend money and effort building hardware. The change request and defect report patterns place evaluating the work involved ahead of committing to make a change.

Sidebar: Summary

Part VI: Reference life cycle

A reference life cycle pattern for projects. It models what a full life cycle contains, and can be the basis for developing an actual project’s life cycle.

Chapter 24: Introduction

2 October 2024

The previous chapters have introduced the ideas of life cycle patterns and development methodologies, along with the ways that the two affect each other. Chapter 22 introduced a number of characteristics that one can choose to match a project. Chapter 23 presented a number of example life cycle patterns, along with a rough framework for comparing the examples.

In this chapter, I present a reference model of development methodology and life cycle patterns. It is based on approaches I have used myself or have observed others using in successful projects, along with lessons from projects that have gone poorly. These recommendations do not attempt to follow any of the development methodologies dogmatically, instead taking the parts from several of them that work well. In other words, I have tried to distill a pragmatic set of solutions from the many options available.

The reference life cycle covers the entire life of a systems-building project. It has four high-level phases: preparation, development, operation, and ending.

undisplayed image

Project preparation is about setting up the project: how it will work, who is sponsoring it, who is funding it. Development covers working out what the system is for and then designing and building it, until it is ready for use. Operation is about producing the system, deploying it, using it, and evolving it. Ending is about shutting down the project when its work is done.

This reference also includes a project support “phase”, which includes all the activities that go on throughout the project to support operations.

Some projects are only concerned with building a system; once the system has been implemented and tested, it goes into production or operation and is no longer the concern of the development team. Those projects skip the operations phase. Most projects, on the other hand, have some level of involvement after the system is deployed and in operation, such as fixing bugs or enhancing the system. These projects involve all the phases.

The phases in the top-level life cycle in turn expand into more detailed patterns. Development consists of working out a purpose and a concept for the system, then developing a system to match, ending with a review to determine that the system is acceptable for putting into operation. Operation expands into a pattern of several phases, which I will discuss below.

undisplayed image

Some projects will spend most of their time in development, while others spend most of their time in evolution after the system is in operation. Exploratory spacecraft missions usually consist mostly of development, since once the spacecraft is launched there is little opportunity to change the spacecraft beyond the occasional software update. Mass-market consumer software, on the other hand, often spends as little time as possible on initial development and can spend years developing upgraded versions to keep consumers satisfied. This reference life cycle fits both kinds of projects.

The arrows in this diagram show how information and artifacts flow from one phase to another, but they do not necessarily indicate complete temporal orderings. For example, the project preparation phase often lasts quite a while, and overlaps early parts of the development phase. Within operation, different customers might deploy and operate their own instances of the system, and the project may be working on multiple system improvements at once.

Two of these phases—system development and system evolution—involve designing and implementing parts of the system. These are the two phases where a development methodology applies.

I will discuss each of the top-level project phases in turn in the coming chapters.

24.1 Projects with proposals

Some projects require developing a proposal to get funding or approval to proceed.

The life cycle for this kind of project adds a phase between preparation and development to develop a proposal. Developing the proposal typically involves working out the purpose and a preliminary concept for the system, so that the potential customer or funder can understand what they will be getting if they agree to fund developing the system. The initial concept is then documented as part of the proposal itself, which is typically a document (often a large one) explaining what the system will be, how it responds to the customer’s requirements, how long it will take to develop, and how much it will cost.

Much has been written about how to do proposal development well. There is best practice for how to organize a proposal development team and what kinds of reviews are helpful.[1]

undisplayed image

After the customer or funder has agreed to the proposal, system development proceeds as it does for other kinds of projects.

24.2 Project-wide decisions and reviews

Projects have times when there will be a decision whether to continue the project, end it, or continue with significant changes. Some examples: whether to start a project, when additional funding is needed to continue, or at periodic progress reviews.

These are often not driven by progress on making the system. They can be driven by external considerations, such as the need for funding, or by a regular cadence of progress checks.

Such reviews or decision points do not fit neatly into the flow of phases defined in the life cycle pattern. When multiple steps are in progress concurrently, as happens during most of the development phase, the decision often happens in the middle of several of them. Preliminary specification or design reviews are also common; they happen part way through specifying or designing part of the system. Design reviews often mean that design should have reached a given level of completeness for the top X layers of components in the system.

I will note some representative decision points in this reference life cycle, but the actual milestones are project-specific.

Sidebar: Summary

Chapter 25: Project preparation

2 October 2024

Project preparation is about getting together the things that the team will need to operate.

The case for the project. Preparation includes getting funding or approval to begin pursuing the project. This usually includes developing an initial pitch for what the project might be about, who will benefit, and roughly what level of resources will be needed. This initial case for the project will evolve from a vague notion at the start to whatever is needed to get approval and funding.

I have found two guides useful for making this initial case. The so-called Heilmeier Catechism [Heilmeier24] is a set of questions originally developed to guide people pitching project ideas to the US Defense Advanced Research Projects Agency (DARPA). (Appendix B lists the questions.) It consists of eight questions that prompt one to articulate the what and why of the project, along with what it will take to do the work. The second is the CSP project startup document template [Wilkes90], which was developed at the Concurrent Systems Project at HP Labs in the 1990s to guide people to think through what they mean to do in a new project. It is organized around the scientific method, and is phrased in terms of a research investigation; however, it is just as useful for other kinds of projects. There are variations on these guides that add questions, such as: How might the result of the work be misused?

In practice the people starting the project will not begin with answers to these questions. They will have some general ideas for a system project, and their job during the preparation phase is to investigate those ideas to work out answers to the questions. As anyone who has tried to form a startup knows, the system that eventually gets built usually is different from the first ideas—and it is the process of investigating answers to these questions that will find the final answer.

These efforts to work out the project’s case naturally include identifying stakeholders (Section 16.2). They also include some of the work to define the system’s purpose (Section 27.3).

Project operations. Project preparation also works out how the project will operate. This includes:

Decision point. At some point during preparation, a decision must be made whether to pursue the project or to stop, based on the case for the project and a general understanding of its costs. The decision should be included as an explicit milestone for the preparation phase, or immediately after, so that people on the team are reminded to take the time to think through whether the project makes sense before more resources are committed.

It may seem that the decision can be left implicit when the project needs no external resources—but in practice the resources used always represent an investment and there is an opportunity cost if the team could be working on something more useful.

Outputs. The preparation phase results in many document artifacts, which the team uses later as they execute the project. The documents record the many decisions that people make during preparation.

People will use these artifacts in a number of situations:

Timing. Bear in mind that the project and the team are themselves systems that deserve careful design and implementation; working out how the project will run is a process that takes time. Most projects start small, with just a few people and a general approach to how the project will operate, and develop additional details over time. Project preparation thus usually overlaps the beginning of development.

Progress on developing the project’s operations plans is balanced against the project’s progress on getting started and working out the system concept. Bear in mind Section 8.1.5—Principle: Team habits: the team will develop habits based on the procedures and organization they are working with, and changing those habits is hard. If the project leadership takes too long to develop team organization or life cycle patterns and procedures, it can become expensive and error-prone for the team to change behavior. On the other hand, if the project leadership rushes to develop these procedures and organization and gets them wrong, the team can end up in a similar situation.

The resolution to this dilemma depends on judgment by the project leadership; I know of no recipe for getting things exactly right. A few principles can help:

Completion. Project preparation is complete, for the most part, when the project is set up to execute. This includes having funding or approval to do the project, as well as having the team structure, life cycle and procedures, artifact management, basic tools, and resources worked out.

Preparation is never truly complete, however. Many of the things worked out in preparation will need to be revised as the project goes on. For example, a team’s organization usually needs to change as the team grows from a few people who can collaborate informally to a large team who need more formal organization (Section 19.3.2). The project may also need to change the focus of the system based on funder or customer needs; changing the system may mean changing how the project runs.

Milestones. There are no milestones intrinsic to project preparation in general. The principle of working out how some part of the project will work before the team needs that information applies, but that is not a milestone in itself.

Other stakeholders may impose milestones on project preparation. For example, getting funding from a funder or approval for the project from the organization may be required.

Sidebar: Summary

Chapter 26: Project support

Project support covers all the various things done continuously in the project to keep the team working. Project support starts with the beginning of the project and ends when the project ends.

This phase includes work to monitor and manage parts of the project. Teams are one example (Section 19.3.1); maintaining plans and tasking (Section 20.5) is another. Tracking project risk (Chapter 63) and technical uncertainty (Chapter 62) supports planning.

Other elements of project support include:

These efforts, similar to project preparation, will usually start small and develop over time. Similar principles apply to the timing of work on project support.

Sidebar: Summary

Chapter 27: Development

2 October 2024

The development phase sees the project work out what the system is supposed to do, and then build the system to meet that objective.

27.1 What is development?

Before going into the sub-phases that make up the development phase, it’s worth thinking about how a system actually gets developed. A great many systems have been built over the centuries without the benefit of methodologies; with some experience, good systems engineers usually have intuition that guides them through development.

Development starts with a rough idea of what the system is for: what problem the system will solve, or what it can do for people. An aqueduct begins with the idea that something should transport water from a source to a town. A pump driven by a steam engine starts with an idea that a machine could pump water out of a mine better than a human- or animal-driven pump, and thus allow mines to go deeper than they had before.

The people thinking about the problem to be solved also often have some approaches in mind that might be applied. Someone responsible for moving water into a town might know about aqueducts that have already been built. Steam pumps were developed incrementally by many people over a period of over two hundred years.

For developing modern complex systems, the development process still begins with a general idea of what the system might do and what problems it might solve, perhaps with some key technical approach in mind.

The team needs to get from this general idea to a clear and precise definition of what they need to design and implement. This does not occur in one step; the detailed design of the system does not spring fully-formed from the chief engineer’s head. Instead, the team starts with a vague understanding and refines it bit by bit until it is clear enough for design and implementation to start.

The team does need to understand the system’s purpose before working out how the system should work. However, in practice these are often parallel efforts, where some people work with customers and other stakeholders to clarify the system purpose while some people begin to brainstorm ideas of what kind of system might meet that purpose. As the understanding of purpose becomes clearer, those who are investigating what the system might look like—the concept of the system—will refine their ideas. Those who are working on the system concept track updates to the purpose, often feeding questions back to stakeholders when they find something potentially ambiguous or when they suspect that some part of the purpose might not yet be worked out.

The system concept represents the bridge between understanding the customer’s needs and building the details of the system. The concept sets the general approach that the team will use. Working out the concept is a time for creativity, when the team can entertain many possible ways to build the system, eliminating those that aren’t likely and refining those that are promising. The team evaluates these possible approaches along the way to see if they are likely feasible to build and to meet the system purpose.

The team may be tempted to turn the concept-building exercise into a full system design exercise. This is unwise. First, the techniques used to develop a system concept are meant to be fast and fluid; they do not work to the degree of rigor that design and implementation require. Second, this can lead to a concept and design period that drags on and on, when the team instead needs to make a decision about the high-level structure and then move on to investigating a design based on that decision. Third, stopping to review the basic concept before committing to it makes for a stronger concept that will better guide the team later.

This means that a system concept will be (and should be) incomplete. It should show some of the big ideas of the system’s structure, and it should show that these ideas are likely to meet the stakeholders’ needs and are likely to be technically feasible. It should be accurate, in that anything named in the concept should in fact be a necessary part of the system, but it should not be precise, having all of the details worked out.

Once the team has a concept, it is a good time to step back. Is this system still worth building? Is it likely to be feasible? Is it going to be a good answer to customer needs? And is it plausible that the resources needed will be available?

As the development work moves forward, the team will refine the concept. They will find things missing in the concept and have to find designs that fill those gaps. They will find inconsistencies or mistakes, and they will have to correct them. At the same time, customer needs may change—so the initial concept will always be different from the final system.

The level of detail and analysis needed in the concept depends on the project. A project that is building a revolutionary system for potential future customers probably only needs a rough sketch of the system, since investigations will continue for months or years into what those customers really need. On the other hand, a project that is answering a request for proposal typically needs a much more developed concept in order to explain to a funder what they will get and why their funding will be a reasonable risk.

Once the purpose and concept are completed, the team can turn to actually developing the system itself. In practice this is rarely a sharp transition; instead, some part of the team may begin moving forward in working out a system specification even before the concept is finalized, or they may begin prototyping parts of the system that seem especially uncertain.

27.2 The development phase

Development consists of many sub-phases. Purpose development comes first, in which the team determines what customer needs the system will address. After determining the system’s purpose, the team develops a high-level concept for the system, then builds the system itself. The development phase ends when there is agreement that the built system is ready to be produced, deployed, and put into operation.

undisplayed image

The first two steps set the direction for the system development work. Purpose development establishes a record of who the stakeholders are for a project, and what each of them needs the system to do. This record of the system’s purpose will be incomplete, initially, but it must be accurate at the time it is documented. The concept phase then provides the time to explore different ways that a system might be built to meet those needs. The concept records a high-level picture of how the system will behave, the environment in which it operates, and some of the main top-level components that will make up the system. The concept phase is also the time when constraints related to security and safety are refined, turning general objectives coming from the customer or other stakeholders into more precise statements of what those objectives mean. Part way through or at the end of concept development is a good time for a review and decision about whether to continue the project.

The system development step in turn consists of many tasks. In this reference approach, the development phase is organized first into a number of system feature development phases, using the development methodology to determine what those phases are. Each system feature development phase, in turn, is organized as a sequence of specify-design-implement-verify patterns.

undisplayed image
Figure 27.1: The development phase recurses through the component hierarchy
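
The recursion in Figure 27.1 can be sketched as a function that applies the specify-design-implement-verify pattern to a component and then to each of its subcomponents. The component tree below is hypothetical, and the print statements only stand in for the real work of each step.

    # Sketch of the recursion in Figure 27.1: the specify-design-implement-verify
    # pattern is applied to a component and then to each of its subcomponents.
    # The component tree is hypothetical; print() stands in for the real work.

    from dataclasses import dataclass, field

    @dataclass
    class Component:
        name: str
        children: list["Component"] = field(default_factory=list)

    def develop(component: Component, depth: int = 0) -> None:
        pad = "  " * depth
        print(f"{pad}specify {component.name}")
        print(f"{pad}design {component.name}")
        for child in component.children:   # the design drives each child's specification
            develop(child, depth + 1)
        print(f"{pad}implement and integrate {component.name}")
        print(f"{pad}verify {component.name}")

    system = Component("system", [
        Component("flight software", [Component("guidance"), Component("telemetry")]),
        Component("ground segment"),
    ])
    develop(system)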

In this section, I will first discuss the development phase as a whole, then go into more detail about each of the subphases and development methodology.

Beginning. Development begins as soon as the project has a general idea of which customer needs it will meet, has gotten funding and approval to start working on the system, and the project leadership has completed enough preparation that people know the basics of how to do development work.

As I noted earlier, project preparation work is rarely complete by the time development begins. Enough of the preparation should be done that people can begin working out and documenting the system concept, and later parts of development should be gated on other preparation steps.

Completion. Development ends with a system that is ready to be released for production and deployment. Being ready means that the system purpose identified in concept development has been met in the system’s implementation, that this fact has been verified, and that the customer and other stakeholders agree.

The acceptance phase addresses checking that stakeholders (Section 16.2) agree the system is ready for production. The customer—or a proxy for the customer—provides a final validation check that their needs will be met. The organization and funder, as other stakeholders, may weigh in to validate that their objectives have been met, such as that the system will be sufficiently profitable, before investing in production. Some systems will require regulator approval or certification before the system can proceed to production; for example, civil aviation authorities require type certification for commercial aircraft before mass-producing and deploying new aircraft models.

Outputs. There are five kinds of artifacts that are created in the development phase:

Milestones. The primary milestone comes at the end, in the acceptance phase. This milestone can go by different names; the NASA life cycle calls it the operational readiness review, for example. Passing this milestone implies that the system is ready for production (manufacturing) and deployment. As I noted above, this involves checking that the system meets stakeholder needs, and obtaining their agreement that it does. It can also include regulatory approval.

There are other possible project-wide decision points or milestones for checking whether the project is on track and can continue or not. These do not necessarily fall at the beginning or end of phases; sometimes they happen in the middle, in order to correct the project’s trajectory or as dictated by external needs.

Other subphases in development define their own milestones.

27.3 Purpose development

The purpose development phase is for working out in detail what the system is to be, in terms of what it will do for its users and who those users are (Chapter 9).

undisplayed image

The people responsible for working out the purpose work with the customer (or a proxy for customers, when the customers are hypothetical; see Section 16.2.1). This requires the team to work directly with the customers, in order to understand not just what the customers are saying they need but also to identify implicit needs and to find constraints on the system that the customers may not be able to articulate.

The team does similar work with other stakeholders. They identify the objectives that their organization has: is it to make a certain level of profit? Are there time constraints on demonstrating capability? Who might be the funder, and what are they looking for? And finally, who might have regulatory authority over the system, and what regulations or standards apply? All this information creates constraints on how the system can be built and what it can do, and will be considered when determining whether these other stakeholders will agree that the project should continue.

The needs found in this phase define objectives that the system should try to address. The constraints, on the other hand, define things that must be true about the system.

I discuss working out system purpose further in Chapter 31.

Inputs. The project should already have a vague idea of who the system will benefit and what their needs are. This is usually worked out when making the initial case for the project, as part of project preparation (Chapter 25).

Completion. The purpose development phase is complete when the list of stakeholders is complete, when the needs of each of those stakeholders are understood and have been documented, and the stakeholders agree that their needs have been documented correctly.

Outputs. The purpose phase produces two artifacts:

These artifacts together define the system’s purpose and constraints on its design.

Milestones. The purpose phase can end when each of the stakeholders, or a reasonable proxy for them, has reviewed the list of their needs and agrees that the list is complete and accurate.

27.4 Concept development

The concept development phase is the transition between working out system purpose and beginning to design the system in detail. It is a time to work out an initial, rough idea of how a system might be built to meet the purpose and constraints worked out in the purpose development phase. It is a time to brainstorm many different possible approaches and to be creative. These different approaches can be evaluated and narrowed down to one concept. That concept is the start, not the end, of design; it will guide the work in the subsequent system development phase.

The system concept is a sketch of the system on paper or similar media. It should cover all the major behaviors of the system, but it should not go into great detail about how those will be achieved.

undisplayed image

The concept has two general parts: an external view and an internal view. The external view takes a black-box perspective of the system, and includes:

The internal view is an initial sketch of the insides of the system’s black box. This view includes:

The concept does not usually go more than one or two levels deep in the component breakdown.

undisplayed image

This information can be recorded in different forms, and it usually takes more than one form to capture it adequately.

Documents recording analyses complement these records. The whole collection of concept documents also records the rationale for decisions taken, and perhaps includes records of alternative designs that were considered and not chosen.

The concept is used for three purposes. First, it reveals whether there is likely to be a feasible approach for implementing a system that meets the customer needs. Second, it provides an illustration to customers and other stakeholders that they can use to validate whether the concept meets what they expect their needs to be. Third, it provides a guide as the team begins to specify and design parts of the system for real.

A likely-feasible concept is one where there is likely to be some way to design and implement each of the high-level parts of the system, and where combining those parts will likely satisfy stakeholder needs. The concept can only be judged likely feasible, not certain, because it is supposed to be developed quickly; the uncertainties about whether the concept will actually work are not completely resolved until the whole system has been built and verified. The process of developing the concept can generate a list of the technical uncertainties people have found or suspect. (These uncertainties guide work planning as the project moves into the system development phase.)

The concept gets reviewed by stakeholders, including customers or customer proxies. While a stakeholder might look at the list of their needs as generated in the previous phase and think it complete, I have found that when they step through how a system concept will operate they get a different perspective and come to realize things they missed in the list of needs. When they find that a system concept appears to meet all their needs, the act of validating the concept with them gives them confidence in the project.

Finally, the road that the team follows from initial idea to a complete design (and implementation) has to start somewhere. The concept provides that starting point. The high-level components identified in the system concept become the starting point to specify, design, and build all of the rest of the components in the system.

Chapter 32 discusses the work involved in making and documenting the concept. To summarize that chapter, developing a system concept involves brainstorming many possible approaches to meeting customer needs and sketching them out. These different approaches get evaluated and compared to find out how well they meet the system purpose and how feasible they are; this often involves doing simple analyses. The evaluations show where there are gaps in meeting customer needs or in the technical solution. The best possibilities get refined or combined and improved until the approaches have been narrowed to one best option.

I have said that the concept should be likely feasible, and that the technical uncertainty and project risk uncovered in the investigation should be acceptable. The obvious next question is, how likely or how much uncertainty? In fact these uncertainties and risks are not generally quantifiable, as they deal in unknowns and the point of the concept development exercise is to expose unknowns and not to work them out. Qualitatively, some projects can accept more risk than others: a startup that is developing a speculative new technology can accept far more risk than a project proposing a system for a fixed-price contract. The decision will require a judgment call on the part of project leaders.

Inputs. The concept phase starts with the list of stakeholders and their objectives and constraints, which was developed in the purpose development phase. It can also use whatever informal investigation has been done in advance about system function or possible implementation approaches.

Completion. The concept phase is complete when either the team has found what they believe to be the best approach to designing the system, or they have determined that they cannot come up with a feasible approach.

A feasible system concept provides an understanding of how the system will function when viewed from the outside as a black box, and that external function has been shown to meet stakeholder needs.

A feasible system concept also defines some amount of internal structure and behavior, enough to support an argument that the team can plausibly build a system that works that way. This means that there are likely ways to build each of the components, and that the amount of time, money, and people required to build and verify the system is within what is available to support the project.

The system concept phase must end while the concept is still a concept. In many projects I have seen the temptation to keep improving the concept—make things a little more certain, make things a little better—before declaring the concept done. When this is left unchecked, concept development slides into system design and development, and leaves out the check of reviewing the imperfect and incomplete concept. Skipping that check means that easy and inexpensive course corrections don’t happen and the problems that will always be there aren’t detected and corrected until they are more expensive to fix.

Outputs. The concept development phase produces a number of artifacts that record the system concept, along with the rationales for why that concept was chosen. I noted earlier what the documentation of the concept should include. These artifacts are placed under configuration management, as they are likely to be revised as the project continues.

Milestones. The concept development phase ends with a conceptual design review (CoDR). This review checks the system concept to ensure that the concept meets stakeholder needs, is internally consistent, and is likely feasible to build. Customers and other stakeholders participate in this review when possible. Team members also participate, as a way to both check each other’s work and to share a common understanding of the concept. Some independent reviewers should also participate in order to check for gaps or biases that the team may have missed.

The conceptual design review is often used as a project go/no-go decision point. If the team has not found a likely feasible concept, or one that meets organization and funder needs, this is a time for the organization to decide not to continue with the project. In this way the fewest resources are spent before the decision to stop the project.

27.5 Development methodology

The system development phase is about creating the system based on the concept worked out in the previous phase. At the end, the project has the artifacts for a working system ready to hand off to production and deployment. Along the way, the project may need to meet other milestones—preliminary and critical design reviews for government customers, or feature demonstrations for funders.

The reference development methodology structures how the team does the work to design, implement, and verify that system. It is based on the spiral or incremental methodology. Project leadership works out a set of intermediate milestones where the team builds and demonstrates some set of system features working—usually integrating different parts of the system along the way. There is a life cycle phase leading up to each of these milestones, in which the team does the tasks needed to add features to the system. These are called feature development phases. Each feature development phase has an expected duration. If it appears not to be on track to meet that deadline, the team takes this as a signal that corrective action is needed. Unlike in the spiral methodology, this methodology leads to multiple overlapping feature development phases, running in parallel on different timelines and working toward different milestones.

undisplayed image
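
To make the structure concrete, the following is a minimal sketch, in Python, of how a project might record its feature development phases as data: each phase works toward a milestone, has an expected duration, and carries a rough progress estimate, and a phase that appears to be falling behind its timeline is flagged for corrective action. The phase names, dates, and slack threshold are hypothetical; real projects keep this information in their planning tools rather than in code.

    from dataclasses import dataclass
    from datetime import date

    @dataclass
    class FeaturePhase:
        name: str
        milestone: str             # the demonstration this phase works toward
        start: date
        expected_end: date
        fraction_done: float       # rough progress estimate, 0.0 to 1.0

        def at_risk(self, today: date) -> bool:
            """Flag a phase that appears not to be on track for its milestone."""
            if today >= self.expected_end:
                return self.fraction_done < 1.0
            total_days = max((self.expected_end - self.start).days, 1)
            elapsed = (today - self.start).days / total_days
            return self.fraction_done + 0.1 < elapsed   # allow modest slack

    # Two overlapping phases on different timelines (dates are invented).
    phases = [
        FeaturePhase("attitude control demo", "Demo 3", date(2025, 1, 6), date(2025, 3, 28), 0.4),
        FeaturePhase("ground link integration", "Demo 4", date(2025, 2, 3), date(2025, 5, 16), 0.5),
    ]
    today = date(2025, 3, 1)
    for phase in phases:
        print(phase.name, "AT RISK" if phase.at_risk(today) else "on track")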

This approach was motivated by several goals.

  1. Provide mesoscale guidance to the development process, so that there is continuity and planning while providing a way to adapt when needed. This level of guidance operates on a time scale longer than a few days or a couple weeks, but less than the whole project; it is intermediate between the time scales used in agile and waterfall development.
  2. Allow threads of activity (feature development phases) to operate on longer or shorter timelines as is appropriate to the work.
  3. Allow different feature development phases to follow the microscale methodology appropriate to the work in that phase. For example, software development often benefits from short, agile-like sprints within one feature development phase, while fabricating large mechanical structures is better done using techniques derived from construction industries.
  4. Promote interaction and collaboration across parts of the system during design, while avoiding involving people working on unrelated parts of the system. Those who are actively working together to build some interconnected capabilities in a few components need to communicate frequently, but people working on some other part of the system do not need to sit through meetings discussing things they aren’t working on.
  5. Support integration-first and uncertainty-first planning practices (Chapters 47 and 62).
  6. Support a partial plan—one that is concrete in the short term and less so in the medium to longer term, with growing specificity as the project progresses (Chapter 64).

Compare this approach to waterfall and agile development methodologies.

Waterfall development, practiced strictly, does not handle uncertainty or adaptation well: the system is designed up front, and implementation follows thereafter. In practice, projects nominally using the waterfall methodology often develop intermediate milestones to organize the work.

Agile development, on the other hand, can lead teams to constantly change direction—unless they develop a plan with some longer-term objectives. When they do so, agile development ends up looking a lot like this reference methodology. Short sprint periods can also work poorly for parts of a project doing work that does not complete within one sprint, like building an airframe or developing detailed analyses.

Example. Consider the following example, taken and simplified from a spacecraft project I worked on. The mission involved multiple spacecraft working together to perform a science mission.

The mission’s concept development defined the overall design of the system: multiple spacecraft, communication links between them, communication with ground stations, and so on. The concept also defined an initial breakdown of the system, where the spacecraft had a set of major subsystems like structure, power, avionics, sensors, flight software, and so on. The concept identified some existing software and hardware designs that could be re-used for this mission.

The development phase, then, was about building hardware, software, and operational procedures that would implement that concept.

The team worked out the major steps that had to happen to build the system, such as designing the avionics, designing the structure, testing and integrating them, and putting sample spacecraft units through environmental testing (heat, vacuum, vibration). The project also would build software to run each spacecraft, which involved tasks like prototyping algorithms for attitude control and then verifying that they would work in testbed equipment. These major steps were partly worked out based on experience on previous missions, and partly from working backwards from the high-level system design to determine major functions to be implemented.

The following shows the first part of the sequence of feature development phases for the main flight software (simplified and abstracted from the original). The flight software had a series of milestones that started with the basic software infrastructure and a simulation environment for testing it. Later milestones then added capabilities one after another. Each milestone integrated new functions across several different components. In most milestones, the work involved behind the scenes was as important as what was overtly demonstrated; for example, the first demo was as much about establishing a software configuration management and build system as it was about demonstrating simple software running.

undisplayed image

This project made extensive use of software skeletons or scaffolds, mockups, and emulations. This is typical of a project that prioritizes integration over feature depth. In this case, the main spacecraft control software for the first couple of demos was a simple skeleton of what it would become. The software modules involved could start up and interact with some others in simple ways, but there was no real logic in the control. Building this part first reduced the integration risk that the control software modules would not interact properly with the middleware and operating system on which they ran—and indeed it exposed middleware bugs that caused the system to crash. By the third demo, the team added basic attitude control logic to the control software. This attitude control still had only limited function; its purpose was as much to show that the control software could interact with (emulated) sensors and actuators as to provide real control capability.

Sidebar: Kinds of development output artifacts

Feature development phases can produce four different kinds of artifacts, and it is important to differentiate between them.

  1. Real artifacts are the ones that will be part of the final system. They may be incomplete at some points in development, but they will evolve into the final artifact. These include operational procedures, software source code, and hardware drawings or designs.
  2. Skeleton, scaffold, or emulation artifacts stand in for real artifacts until they are developed. They may be a seed from which a real artifact is developed, or they may be replaced by a real artifact later.
  3. Prototype artifacts are developed rapidly, and to less than production quality, solely for the purpose of learning about a potential design. These artifacts will not and must not be turned into real parts of the system (see Section 8.3.5 and Chapter 41).
  4. Verification artifacts are used in testing that an implementation meets its specifications.

27.6 System feature development

A system feature development phase is a stream of work that adds a defined set of features (the purpose of the phase) to the system, ending in a milestone with those features implemented, integrated, and demonstrable. It starts from the design work that has already been done and the purpose of the phase, and ends with the system artifacts updated to meet that purpose.

This approach to organizing development is focused on the features rather than on the components or component breakdown. One feature development phase usually involves several components (and their subcomponents). It promotes the integration of work across parts of the system.

undisplayed image

Inputs. A feature development phase takes as input the system concept, design, and implementation artifacts that have already been developed, plus a definition of the features that are to be implemented in this particular development phase.

Completion. The feature development phase is complete when the system has been built or modified to implement all the features named for this phase. The completeness and correctness of the implementation is documented in verification records and by demonstrating selected features working in the new system version.

Outputs. Feature development produces several different outputs.

Along the way, the design phase work may also produce:

Milestones. A feature development phase has one milestone, at its end. At this milestone, the completion conditions listed above should hold. The verification records are checked to ensure that the implementation passed verification, and the team who worked on the changes demonstrate key features to the rest of the project.

As will be seen next, the feature development phase is made up of several subphases, and each of these has its own milestones.

Reference pattern for feature development. A feature development phase recapitulates the overall system development life cycle. It starts with purpose, works out a concept, then proceeds into the specification, design, implementation, and verification of parts of the system to build in that purpose.

undisplayed image

The concept for a feature development phase includes working out the general design approach for adding the phase's features. As with the system concept, the feature concept involves brainstorming different ways to implement the features, along with evaluations of the alternatives until the team selects one concept. The concept for the features should give a general idea of what components will be modified or created in this phase, along with the internal structure among those components and a narrative of how they will interact (the concept of operations for the features).

Identifying the components that will be affected is key to being able to scope how much effort will be required to implement the features, and who will need to be involved in the work.

The next step is to develop or modify specifications for the components involved (Chapter 33). These detail how the components are to behave and the non-functional attributes they are to provide. This may involve adding to or modifying the top level system specifications, or flowing those specifications down to components. Security, safety, and reliability specifications are particularly important.

Design follows specifications, working out how each component can be built to provide the behaviors and properties it is specified to have (Chapter 37). Design may require evaluating alternatives, perhaps by modeling or prototyping (Section 8.3.5; Chapter 41).

Two separate and independent implementation steps follow. One step implements components and changes to components, following the design. The other step works out how to verify the features in the feature development phase, including verifying both the individual components by themselves (using unit tests, for example) and the features that are provided by the components integrated into the system. If the verification implementation runs ahead of the component implementation, the component implementers can verify as they go (using test-driven development).
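
As an illustration of verification running ahead of implementation, here is a minimal sketch in Python's unittest style. The component (a hypothetical thruster-command limiter) and its specification are invented for the example; the point is only that the test cases are written from the specification and can exist before the implementation they check.

    import unittest

    def clamp_thruster_command(level: float) -> float:
        """Component implementation under development: limit commands to the range 0..1."""
        return min(max(level, 0.0), 1.0)

    class ThrusterCommandSpec(unittest.TestCase):
        """Verification written from the specification, before or alongside
        the implementation above."""

        def test_commands_within_range_pass_through(self):
            self.assertEqual(clamp_thruster_command(0.25), 0.25)

        def test_commands_are_clamped_to_limits(self):
            self.assertEqual(clamp_thruster_command(1.7), 1.0)
            self.assertEqual(clamp_thruster_command(-0.3), 0.0)

    if __name__ == "__main__":
        unittest.main()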

As parts of the feature set are implemented, they are verified. By the end of the feature development phase, the components created or changed in the phase and the features the phase is adding are all verified.

The feature development phase ends when the team successfully demonstrates that the system now has the features they have worked to implement. This demonstration might amount to showing that the new system version has passed its verification checks, but doing an actual demonstration gives the people who did the work an opportunity to show the rest of the project what they have done and for the project as a whole to celebrate their work.

Once again, note that this work is organized around the features, not the components. This methodology does not necessarily mean implementing each component's changes in isolation, verifying those, and then verifying their integration. Rather, the team can order the work in whatever way works best for the particular task at hand. For example, an integration-first approach might lead the team to build simple skeletons or mockups of component changes and focus on checking out how the components will interact before implementing detailed changes to the components—which means verifying integration before verifying the unit components. (Of course, the finished changes still need to be verified as a whole before the verification work is done.)

The reference pattern for the feature development phase, in the diagram above, includes review milestones for each of the steps (concept, specification, design, implementation, verification) involved. These reviews serve two purposes. First, they are an opportunity for someone independent to check the work in order to find things the team doing the work might miss. Second, they provide an opportunity for the team working on the features to pause long enough to ensure that they all understand the work in the same way.

Finally, the team responsible for a feature development phase may decide that the phase is large enough that it should be split up into subphases. Each of the subphases might have its own milestone goals; those subphase goals build on each other to reach the features of the main feature development phase. These subphases might focus on individual components or smaller groups of components, or they might split the work into sequential steps, or some combination of the two. These subphases follow the same pattern as the higher-level feature development phase of which they are part.

Interaction between parallel feature development phases. The feature-oriented focus of this methodology can cause problems. If the team is working on two sets of features in parallel, these features could affect some of the same components. Someone working on feature set A might change component C to support A’s features. At the same time, someone working on feature set B might also change component C. In the worst case, the changes might be in conflict and the changes for A might preclude the changes for B working, or vice versa.

The underlying problem is known as serializability in database and parallel computing systems, where it has been studied extensively. In these systems, different approaches to handling concurrent changes are measured by whether they produce the same result as if the work had been done serially, one task at a time rather than concurrently. That is, the work is serializable if it ends up with component C looking as if the work for feature set A were done entirely and then the work for feature set B were done, or vice versa. This has led to many algorithms for coordinating concurrent work.
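
The following minimal Python sketch illustrates the serializability idea with a hypothetical shared component record: the two serial orderings of changes from feature sets A and B agree, while uncoordinated parallel work in which one group's finished version simply replaces the other's loses an update and matches neither serial outcome. The parameter names and values are invented for the example.

    # A shared component record, represented here as a dict of design parameters.
    base = {"timeout_s": 10, "retries": 1}

    def change_for_feature_a(record):
        """Feature set A raises the timeout on component C."""
        updated = dict(record)
        updated["timeout_s"] = 30
        return updated

    def change_for_feature_b(record):
        """Feature set B adds retries on component C."""
        updated = dict(record)
        updated["retries"] = 3
        return updated

    # Serial execution: either order yields the same final record.
    serial_ab = change_for_feature_b(change_for_feature_a(base))
    serial_ba = change_for_feature_a(change_for_feature_b(base))
    assert serial_ab == serial_ba == {"timeout_s": 30, "retries": 3}

    # Uncoordinated parallel work: both groups start from the same base, and
    # group B's result simply replaces the record, discarding A's change.
    parallel_lost_update = change_for_feature_b(base)
    assert parallel_lost_update == {"timeout_s": 10, "retries": 3}
    assert parallel_lost_update != serial_ab    # matches no serial order: not serializable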

The simplest approach is to make changes serially: the people working on feature set A change C first, and when they are done, people working on feature set B get a turn. This is useful when the component cannot be physically shared, like a paper drawing or a mechanical device. There are two costs to this approach. First, one group must wait for the other to be done. Second, when group A changes C in ignorance of what group B will need, group B may have a lot of rework to do when its turn comes (and it is likely to need to consult with group A to keep their changes working).

Another approach is to let the two groups independently change C in parallel, keeping two separate versions of C and merging the changes when both groups are done. This is the approach taken by distributed version control systems like git [Git], which were developed for geographically separated software development teams that cannot continuously coordinate their changes. These tools rely on being able to reliably compare the different versions and to guide people through reconciling conflicting changes. The cost comes when the two groups make incompatible changes that cannot simply be merged together.

The third way, and the one I have found most successful in complex systems projects, is to have one person or a small team be responsible for the shared component C. That person (or team) becomes part of both groups A and B working on parallel feature set changes. This responsible person can choose to handle the changes serially, or may choose to use a version control tool to manage their work. The advantage of this approach is that the person responsible for C understands the rationale for why the component is designed as it is, and will make changes that fit with the designs already completed. That person can also understand the needs of both sets of features, and design changes to support both rather than having to undo and redo incompatible design work.

27.7 Recursion to component development

A system feature, in the end, is made up of behaviors and properties of a number of components. That is, system features are emergent from the individual components involved.

The work to implement a system feature is thus made up of the work on each of the components, along with the effort to integrate those components and their changes. The team working out the concept for the feature determines how parts of the high-level feature are allocated to components. That is, they work out what behaviors or properties are needed from each component so that together they produce the high-level feature. Along the way during concept development, the team works out what components are affected by the feature development work.

undisplayed image

The feature development life cycle pattern for the high-level feature applies for developing the changes to each of the affected components. Just as the feature as a whole has concept, specification, design, and implementation steps, so do each of the components. Developing the concept for the feature includes developing a concept for each affected component. Developing the specification for the feature leads to developing specifications for each component, and so on. The implementation of the feature is the implementation step for each component.

The people who are working on all these component pieces coordinate their work so that it all integrates properly and produces the desired features.

That coordination means that the work on each component moves at a pace at least partly constrained by the work on other components: for example, the specification step for any one component cannot be completely finished until the specifications for all the affected components are finished. Otherwise, the specification work on some other component could reveal a surprise that affects a specification that was thought to be finished.

At the same time, teams rarely just stop and sit idle when the work on some component lags. They proceed from specification to design to starting implementation, accepting the risk that some surprise may happen that will require them to re-do some amount of work. The choice of how much work to do at risk has to be made based on the usual estimates of likelihood and consequence. If the work on some other component is almost done and is in the final stages of cleaning up details, the likelihood of finding something that will require a change to other components is low. On the other hand, if the work on some other component is just getting started, then the chances of a surprise are high. If part of the component in question appears to be fairly immune to changes in other components, then there is little risk of having to redo that work. For example, if the component will definitely need to communicate over a network with other components, then getting network communication designed is low risk.

undisplayed image

The figure above illustrates how the work for a feature is coordinated across all the components. The top row shows the steps or phases for the feature as a whole. That work is broken down into the work for two components, shown in the middle two rows. The components each follow the feature development pattern of concept, specification, design, implementation, and verification. The last row covers the thread of work done to address integrating the changes to individual components, and it follows a reduced form of that pattern. The feature integration thread of work is primarily about checking that the work on the components properly combines to produce the high-level system features, and so it focuses on verification methods for this integration.

The figure also shows that the concept development work for the high-level feature and the affected components may often be done as a single task. If the feature and components are simple enough, a small group can work out the concept together and produce one set of concept artifacts that cover both the feature as a whole and its effects on specific components. In this case, the artifacts for each component will reference the shared concept artifacts; after a while, the records for a component may reference several concepts for different features.

If the feature or the components are more complex, the work may need to be divided up so people can work on different parts in parallel, combining and reconciling the pieces before the concept is completed. The artifacts for the components will then reference their own concept for that feature as well as the high-level feature concept documents.

27.8 Feature development variations

The feature development pattern in the last section covers the simplest case: when the team is designing and building a straightforward feature. There are three variants to consider: when the component carries enough uncertainty that prototyping is warranted; when the component will be acquired from outside the project rather than built in house; and when the component is hardware, which has its own implementation needs.

Prototyping. Prototyping is used when there are candidate technical approaches for designing some part of the system, but the technical uncertainty about them is too high. In these cases, taking steps to reduce the uncertainty before committing to one particular design can lead to better outcomes.

The uncertainty can take different forms. In one case, the team might have an idea, but they don't know if it will work correctly. In another case, they may not have an idea for a solution, and they need to explore and learn in order to find possible solutions. Or the team might have a solution, but lack skills essential to completing the design or implementing it. Finally, the team may have a solution that is not technically mature enough, and they need to validate its suitability. In each case, developing a prototype of some kind can help.

undisplayed image

The prototyping effort is added to the design step. The prototype might take the form of a simple implementation, or of a model of a possible solution. Any prototyping effort should have a clear purpose: to see if an idea works (and working out what it means “to work”) or the like. The focus must be on learning what is needed as quickly as possible. The work should prioritize speed of learning over quality of the prototype implementation.

Prototyping can be a necessary part of learning about a design and managing its uncertainty, but its contribution to the system is indirect—by leading to a good design. The amount of effort or time spent on the prototype should be bounded so that the prototyping effort does not take over the development effort.

The principles about prototyping (Section 8.3.5) apply. The prototype artifacts should be built as quickly as possible to maximize efficient learning, without putting in effort to make them high quality. The artifacts that come out of the prototyping work must not end up in the real implementation.

Acquired component. Sometimes components are best acquired from somewhere else rather than being designed and built by the team. This might involve reusing a component from another project, or using an open source design, or purchasing a component from a supplier. Acquiring a design or component can take advantage of work that others have already done, reducing development costs. It can take advantage of expertise that the team does not have itself, such as a supplier that can manufacture an electronics board or a software vendor that has developed a component with a particular algorithm.

The pattern for an acquired component proceeds with developing a concept for what is needed and a specification for the component. The specification is the basis for a request for proposal (RFP), which is sent out to suppliers that are expected to offer potential solutions. The suppliers in their turn use the specification to develop a design, which might simply be an off-the-shelf product or might involve development work on their part. Once the suppliers have a design, they respond to the team. The team evaluates whether the design in fact meets the specification and determines which option is best, if they have more than one potential choice. In many cases the team will build a simple prototype using a supplier's prototype implementation, if they have one, as part of the evaluation. After that, the supplier implements, builds, and delivers the component. In other words, this pattern moves the design and implementation work away from the project team and onto the supplier.

undisplayed image

The team, however, still does some amount of verification once they have received the implementation. This acceptance testing may be more limited than it would be for a bespoke design, if the supplier provides information about the verification steps they have taken. Nonetheless, the team should spot check any verification work that the supplier has done and must check that the supplied component integrates as expected into the rest of the system.

Acquiring components like open-source designs or software does not have to go through the process of developing a formal RFP. However, these components do still require evaluation before deciding whether to use the design or not. The team must ensure that the license terms are compatible with the system. The team must also ensure that the potential component meets the specification of what is needed of it. Finally, the team must evaluate the quality of the component—which, for open-source components, includes not just the quality of the artifact itself but also its governance and supply chain security [Goodin24][CVE24].

This pattern involves support roles that I have not detailed elsewhere. For example, the acquisition might involve someone who manages contracting or payment. The acquisition will likely involve checking that the license terms and intellectual property rights associated with the component are appropriate for the system the team is building, which may require legal expertise.

Hardware components. Hardware development has different constraints than some other kinds of component development, and so a different development pattern applies. The primary cause of the differences is that a hardware component involves physically building one or more artifacts, which can take time and resources. This makes iterating on a design to work out bugs or to change features much more expensive than it is for software or higher-level designs. In addition, some verification testing is destructive, putting a component in increasingly harsh environments or under harder loads to determine when it fails.

Hardware development also differs from other kinds of component and feature development in the way terms like “design” are used. A design for an electronics board is a full description of how it is to be implemented; in some cases, it can be sent to an automated production system to create a complete physical board. Similarly, many mechanical designs are complete enough to send to a CNC machine or additive printer to create the physical artifact. By comparison, a software design is more abstract; it cannot be directly translated into a working program. Software source code is closer to mechanical or electronic designs, as source code can be sent to compilation tools that produce the executable artifact.

These constraints have led to disciplines about how to organize hardware development. I discussed the EVT/DVT/PVT pattern earlier (Section 23.4.1), which defines a sequence of phases for developing and verifying a hardware component. The NASA approach uses different language [NASA16, p. 124] to describe the sequence of hardware artifacts to be developed and verified. The two approaches are similar, with one naming the phases and one naming the artifacts.

This approach splits up the design, implementation, and verification phases into multiple iterations. There are typically four iterations.

  1. Preliminary development: produces a breadboard or brassboard, which are low- to medium-fidelity versions of the component. This version is focused on function but often has a form unlike that of the final version. It may use off-the-shelf components that will not be used in the final version. These may be subscale or digital models. This step may be repeated multiple times, adding features at each iteration.
  2. Engineering unit development (or EDU, engineering demonstration unit): produces a version that closely resembles the final version in both function and form. It is put through EVT (engineering validation and testing), which verifies that the version mostly meets its specifications. The engineering unit may have a small number of defects, but at the end of verification the team should have confidence that these can all be corrected to produce the final version.
  3. Qualification unit development: this produces one or more units that are built to the final design. These units are typically built manually in small numbers. They are put through DVT (design validation and testing), which involves thorough testing of the function and form of the component. For some systems, some of these units will be tested to failure in order to show that the design can function correctly across the full range of environments in which it will operate. Aircraft wings, for example, are tested by flexing them until they fail. Spacecraft electronics are subjected to heat, vacuum, radiation, and vibration beyond what they will experience in use, and those tests are likely to damage the components. These units are also used for certification with regulators or similar industry organizations.
  4. Production or flight unit development: this step produces a small number of components that can be deployed into operational systems. These are used to verify final manufacturing processes, using the PVT (production validation and testing). This includes matters like supply chain operation, manufacturing logistics, and delivery and storage of the final units. These units are put through acceptance testing, which verifies that the manufacturing process builds components that are identical to those built for qualification.
undisplayed image
Figure 27.2: Hardware development phases

The fourth step, producing production or flight units that can be deployed, can occur as part of development or later, in a production phase after the system has been accepted (Section 28.1). If a component is going to be mass produced, verifying the manufacturing methods is worth doing before declaring that the component is complete. After acceptance, the manufacturer will build more units. On the other hand, if only a handful of units will be built and they are expensive to build, such as with individual spacecraft, delaying the production of those units until after acceptance can manage risk.

Finally, the development of a hardware component is part of the development of the larger system. This leads to two ways that the hardware development steps can be organized, depending on how the hardware development will be synchronized and integrated with other parts of the system.

The first way is to plan out the hardware component development as its own thread of work. This way has the advantage of keeping the team focused on designing and building the component.

The second way is to break up the hardware development thread into smaller steps, and put some or all of those steps in feature development threads. For example, when building a circuit board that will run a control system, it will be hard to verify that the board works without some version of the software that runs on it or the interfaces to sensors and actuators of what it controls. In other words, verifying the integration of the hardware component with other parts of the system is an essential part of checking that the component actually works. This is the way virtually every project I have worked on has actually planned out its hardware development work.

undisplayed image

As an example, this sequence of feature development steps is loosely based on two different control system implementations in projects I have worked on. The sequence shows how different hardware and software components come together to implement increasingly complex features. This approach integrates the hardware and software parts in incremental steps.

27.9 Acceptance

The acceptance phase is the time for final checks that the developed system is indeed ready for production and deployment. It is the last step in the overall system development life cycle.

undisplayed image

There are three kinds of checks involved: that the system can be put into production and deployed; that the customer (or their surrogate) validates that the system is what they need; and that regulators approve the system, if needed.

The check for production and deployment involves verifying that the manufacturing and distribution process is ready for operation, and that all the procedures and tools are in place to install a manufactured product for customer use. For a software-only product, the manufacturing and distribution procedure involves packaging the software release and putting it on distribution servers (or manufacturing distribution media if it is not distributed over networks). The deployment readiness involves verifying that the packaged software has prominent and understandable instructions on how to install it and start using it. On the other hand, for a mass-produced hardware product, verifying manufacturing and distribution involves checking that the manufacturing line can correctly build the system, that it has the proper supply chains in place to support the manufacturing, and that the products can be shipped and warehoused before delivery to customers.

Validating that the system meets customer needs involves customers trying out an instance of the system—not just looking at documentation about the system. This often involves getting one or more customers to use a test installation of the system to do the tasks that the customers need. For some systems, this kind of validation can be done by beta testers, who are given an almost-ready version of the system and try it out in their environment while providing feedback on what works or doesn't. Systems that require more installation and setup may instead have dedicated test installations that customers come to use.

Regulatory approval involves different procedures in different industries. An aircraft, for example, must be reviewed and certified by the appropriate civil aviation authority. A spacecraft mission typically requires licenses for launch, communication, and certain kinds of earth observation. Other systems may need approval by an industry safety organization. Most of the work to get these approvals or licenses is part of the development phase, and the acceptance phase is the final check that the necessary approvals are in place.

Once these checks are completed, the final milestone is for the organization and the project to decide whether to proceed to production and deployment or not. Many systems are designed and built, but in the end the organization behind the project decides that the result does not justify the investment in production. Some commercial aircraft, for example, have been designed and built, but without sufficient sales interest to start production the aircraft model is quietly retired.

Sidebar: Summary

Chapter 28: Operation

Once the system has been developed and verified, it is ready to be manufactured, deployed, and put into use. The initial work of building is done, but there is much more to go. There are several ways the operation phase can proceed, depending on the kind of system, kind of customer, and the role that the organization that developed the system plays.

undisplayed image

The general flow is first to manufacture or produce the system using the artifacts that have been developed, then deploy instances of the system. After that, the system instance is in operation. Further development of the system, to evolve it or to fix problems, continues in parallel with customer operation. Finally, at some point, the customer will decide to retire and dispose of the system instance. The steps of deploying, operating, and retiring system instances can occur multiple times in parallel for different customers.

28.1 System production

Production is not the application of tools to materials. It is the application of logic to work.

—Peter Drucker [Drucker93, Chapter 17]

The production phase covers manufacturing the artifacts to be deployed.

Bear in mind that this is a brief overview of manufacturing, intended to explain the main points that people like systems engineers or project managers will need to know in order to understand the general scope of the work, and to understand how the manufacturing steps relate to other parts of the system-building work. Manufacturing has been studied and refined for a couple of centuries, and there is an extensive literature with far more information.

There are several kinds of production that different projects might use. These include:

Production of a new system for a new installation can also differ from production of parts for maintaining or upgrading an existing installation. A new system might consist of a complete collection of hardware components that will be assembled from scratch for the installation. Producing replacement or upgraded parts, on the other hand, consists only of manufacturing a few parts and making them available for deployment into existing installations.

undisplayed image

A review and approval milestone for beginning production checks that the project has everything ready before committing to production, as discussed below. The review checks that system development has completed all its milestones and that a system will be ready to deploy when manufactured. It also checks that everything needed for production itself is ready: the manufacturing tools and people, suppliers, and testing. Finally, it checks that the organization is prepared: that it can pay for supply and manufacture, and that people are ready to deploy systems once their parts have been manufactured, so that capital does not remain tied up in unneeded inventory.

Production relies critically on security of the supply chain, management of the developed artifacts, the manufacturing process, and the delivery mechanisms. All these elements of the production process have been attacked in recent years. For example, the SolarWinds attack [Zetter23] compromised the production process for their software, which was then distributed to and installed by many other organizations and led to attacks on those other systems. There are other reports of fake hardware components (e.g. pressure sensors [Control19]) being injected into a supply chain. These attacks can result in loss of system components, delays in deployment to a customer, exposure of intellectual property, deployment of a faulty or dangerous system, or creation of security problems for the system's customer.

The overall production process has the following steps:

undisplayed image

This flow depends on the supply chain of parts used in manufacturing or production. Any physical parts or stock used must be on hand to perform manufacturing; this implies that the stock is in inventory, and that it has been supplied from some qualified source. Sourcing implies selecting the suppliers and setting up contracts for them to provide the stock. The contracts with the suppliers should include clear specifications of exactly what stock or components are to be supplied, along with evidence that the delivered parts meet the specification.

Procedures for receiving materials from suppliers and maintaining inventory are part of the definition of manufacturing procedures. The procedure will typically need some amount of space for maintaining this input stock, along with managing information about what stock is on hand and what should be used next. The storage space maintains the input components or stock in an environment that will keep the material in its designed storage conditions. The procedures include determining when to order more stock. The receiving and storage facilities should have security that ensures that material is not stolen or replaced.

The production process relies on accurate configuration or version management. The artifacts used to manufacture the production components should have consistent versions, and those should match the versions used for final verification during development. If inconsistent implementations were manufactured, the components might not work together—and the resulting problems are often subtle.
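
As a small illustration of this kind of consistency check, the sketch below compares a hypothetical production manifest against the versions that were signed off at final verification, and refuses to proceed if they differ. The artifact names and version strings are invented; a real project would drive this check from its configuration management system.

    # Versions signed off at final verification (from configuration management).
    verified_baseline = {
        "avionics_fw": "2.3.1",
        "control_sw": "5.0.0",
        "board_layout": "revC",
    }

    # Versions of the artifacts pulled for this manufacturing run.
    production_manifest = {
        "avionics_fw": "2.3.1",
        "control_sw": "5.0.1",    # drifted after final verification
        "board_layout": "revC",
    }

    mismatches = {
        name: (verified_baseline.get(name), version)
        for name, version in production_manifest.items()
        if verified_baseline.get(name) != version
    }
    if mismatches:
        raise RuntimeError(f"production artifacts differ from verified baseline: {mismatches}")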

The manufacturing procedures specify who does what steps, in what order, using what tools. These procedures are designed during system development and verified during production verification testing (see the section on hardware development above).

After system components have been manufactured, they are checked to ensure that there are no manufacturing defects. This is typically called acceptance testing. For many hardware components, this involves putting the component through a set of tests that are defined during system development. These tests do not stress the component to a level that will induce faults, like testing at high temperatures or voltages; the tests only look for potential manufacturing problems. Some mechanical or electrical components go through a “burn in” period, which operates the component long enough to catch early component (“infant mortality”) failures. For some other kinds of components, only a sample of each batch of components gets tested, under the assumption that manufacturing defects will tend to cluster in one production batch (for example, one day’s production shift).

The production process involves a significant amount of record keeping. Each produced component has its own set of records. These records start with the component’s identity, typically represented as a serial number. The record identifies what version of the input development artifacts were used, often by associating a release version number or code with the serial number. The records include when, by whom, and using what equipment the component was built, so that if parts start failing an analysis can identify other components that may be at higher than expected risk of failure. The records track what parts or stock were used to manufacture the component: the serial number of components used, if appropriate, or the supplier, model, and batch number of stock.

In addition, each manufactured component must be identifiable. That typically means that it should be clearly labeled with its model or version information and serial numbers, at minimum. The labeling is typically in both human- and machine-readable forms.
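
A per-unit record of this kind can be sketched as a simple data structure. The field names below are hypothetical and much simplified; the point is that the unit's identity (serial number), the release version it was built from, who and what built it, and the traceability of the parts used all live in one record.

    from dataclasses import dataclass, field
    from datetime import date

    @dataclass
    class UnitRecord:
        serial_number: str                    # identity of this manufactured unit
        release_version: str                  # development artifact release used to build it
        built_on: date
        built_by: str                         # operator, line, or shift identifier
        equipment: list[str] = field(default_factory=list)   # tools and fixtures used
        parts_used: list[str] = field(default_factory=list)  # part serials or supplier/batch ids
        acceptance_passed: bool = False

    record = UnitRecord(
        serial_number="SN-00042",
        release_version="2.3.1",
        built_on=date(2025, 3, 14),
        built_by="line-2/shift-A",
        equipment=["reflow-oven-7"],
        parts_used=["supplier-X/batch-113"],
    )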

Once a component has been manufactured and checked, it is placed in inventory and later delivered for deployment. The components in inventory are stored in secure spaces that maintain the components in their designed storage environment—often dust-free, within a particular temperature and humidity range, and so on. The inventory is managed to know what components are in stock and ready to send for deployment.

The production process needs to be resilient to disruptions. One company I worked for was building hardware systems outside the US, and investors asked the company how they would handle a political or military disruption in that country. (The answer was that the company would go out of business because it had no alternative manufacturing option.) Many production or manufacturing processes are also in places that can be vulnerable to natural disasters, including earthquakes and storms.

Finally, the manufacturing process is generally a human process, and processes involving humans have a tendency to drift over time away from their originally-intended procedures (see e.g. Leveson [Leveson11, Chapter 12]). This drift can come from changes in how people are trained, people finding potential simplifications in the procedures, changes in the environment in which the people are working, and many other causes. The designs of robust, safe manufacturing procedures include periodic audits to check that people are performing the procedures as originally designed, and to re-design the procedures if they are found to have problems in use.

Inputs. The production step uses many inputs:

I use two terms loosely: input component and stock. By input component, I mean something that is used as it is in manufacture, such as a chip or a valve. By stock I mean material that has to be worked during manufacture, such as a metal or wood block that is machined to make a component, or plastic that is melted and formed in a 3D printer to make something else. Others may use other terms for these two kinds of inputs, but the distinction remains.

Outputs. Production has two major outputs: deployable artifacts that are in inventory storage or on their way to a customer, and records of each artifact.

Milestones. Production does not begin until there has been a review that ensures that the organization is ready to perform production activities. The approval milestone checks that all of the manufacturing, testing, inventory, and logistics procedures are complete and performable. These checks typically depend on results from production verification testing. The review also checks that all the necessary suppliers are qualified and under contract to deliver manufacturing inputs. Finally, approval to begin production depends on having the capital or cash flow needed to support production and on the organization being ready to deploy the manufactured system once it has been produced.

Each component has an acceptance testing milestone, as discussed above.

28.2 System production examples

I mentioned earlier that different projects follow different kinds of production patterns. Here are a few examples that show some of these different approaches.

Software only. This example covers a software-only system that is delivered electronically to customers for installation.

When building a software-only system, many people don’t put much thought into what happens between the moment a version of the source code is marked as ready for release and its delivery to a consumer. In practice there are several steps between the two, and those steps require careful design.

The input to production is a version of the software—either as source code or as binaries—that has been verified to meet its specifications, and validated against the original customer needs. This code is under version control and has been labeled as being ready for release.

The output is one or more installation packages on servers that customers (or deployment teams) can access over networks. Some software packages are not, or cannot be, delivered over networks, in which case the output is a physical artifact, such as a CD or USB drive, containing a copy of the installation package.

The production process involves the steps to generate these installation packages and then stage them on distribution servers. The procedure typically involves building binary versions of the software from the appropriate source code artifacts, then performing acceptance tests on the binaries. The binaries are then bundled with other material, such as manuals and configuration files, into an installable package. The package also includes metadata recording what the package is, its version, and the environment in which it is intended to be used. The procedure also adds security information, such as signatures or encryption, to ensure the integrity of the package. The installation package is then copied to distribution servers and tested to ensure that the package can be downloaded and verified correctly. Once the package is available for distribution, the final step is to let customers know that it is available.

If the software is intended to run in multiple environments, such as on different operating systems or CPU architectures, the procedure will need to be repeated for each target environment.
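To make these steps concrete, here is a minimal sketch of how such a production pipeline could be organized, assuming hypothetical stub helpers in place of a real build, test, signing, and distribution toolchain. The target names, paths, and URL are invented for the example.

    # Minimal sketch of a software production pipeline. The helpers are stubs
    # standing in for whatever build, test, signing, and distribution tools a
    # project actually uses; the targets, paths, and URL are invented.
    TARGETS = ["linux-x86_64", "windows-x86_64", "macos-arm64"]    # example target environments

    def build_binaries(source_tag: str, target: str) -> str:
        return f"build/{source_tag}/{target}/app.bin"              # stub: run the real build here

    def run_acceptance_tests(binaries: str, target: str) -> None:
        pass                                                       # stub: raise an error if tests fail

    def bundle(binaries: str, extras: list[str], metadata: dict) -> str:
        # stub: combine binaries, manuals, and configuration into one package
        return f"dist/{metadata['name']}-{metadata['version']}-{metadata['target']}.pkg"

    def sign_and_stage(package: str) -> str:
        # stub: add signatures, copy to a distribution server, verify the download
        return f"https://downloads.example.com/{package}"

    def produce_release(source_tag: str) -> list[str]:
        urls = []
        for target in TARGETS:                                     # repeat the procedure per target
            binaries = build_binaries(source_tag, target)
            run_acceptance_tests(binaries, target)
            package = bundle(binaries,
                             extras=["manuals/", "default-config/"],
                             metadata={"name": "example-app",
                                       "version": source_tag,
                                       "target": target})
            urls.append(sign_and_stage(package))
        return urls                                                # announce these to customers

The point of the sketch is the ordering: build, test, bundle with metadata, sign, stage, and verify, repeated once per target environment.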

In recent years, the integrity of the software production and distribution process has received increasing attention [CISA21]. This has led to standards for protecting the production and distribution processes.

Single spacecraft mission. Building a spacecraft is different from producing software: it involves physical artifacts, and it produces only one or a few instances of the spacecraft.

A project will typically build at least one spacecraft that will fly the mission, but may build a backup or an extra that is used on the ground to verify behavior during the mission.

The objective is to deliver a flight-ready spacecraft that is ready to ship to the launch site, be placed on a launch vehicle, and fly the mission (the deployment), or to deliver a test unit that is otherwise identical to the flight unit to testing teams.

Before assembling the flight instance, many projects separately manufacture all or parts of additional spacecraft that are treated as qualification units for testing, especially for environmental testing that pushes the test unit beyond normal operating limits and might damage it. These units may be built and tested as part of the development phase or during production, as appropriate to a specific project’s rules.

The production process starts with acquiring and building all the components, then assembling them according to procedures worked out during development. The assembly is typically done in a “clean room” that keeps out contaminants that could affect the spacecraft’s ability to function, such as dust entering into cable connectors or hinge bearings. The team typically performs incremental acceptance testing along the way to ensure that subassemblies have been built correctly while they are accessible.

The team assembling the spacecraft documents what components are used in each unit as they are assembled. The accumulated records are maintained for the entire life of the spacecraft, as they can be essential to establishing the causes of problems encountered in flight.

Once the entire spacecraft has been assembled, the team performs final acceptance testing, ensuring that testing remains within limits that will not inflict damage. They then package up the built spacecraft for delivery, typically in sealed containers that will protect it from contamination and shock during shipping. The packaged spacecraft is then delivered to the launch site, where it is mounted to the launch vehicle in preparation for launch.

Some spacecraft require final preparation shortly before launch. This can include charging batteries, entering final configuration data, or loading gases and fluids (such as fuel). These steps follow carefully-defined procedures, as they often involve hazardous materials (such as hydrazine fuel) and because there is risk of damaging the launch vehicle in ways that could cause in-flight failure.

The overall production step typically has strong requirements for safety and security. A malfunctioning spacecraft can lead to the failure of a mission, at the cost of significant invested capital. In some cases a malfunction can risk life and property on the ground, such as when a spacecraft causes failure of a launch vehicle, enters the atmosphere and damages or injures something on the ground, or creates debris that damages other spacecraft or injures people on orbit. To this end spacecraft are regulated and must obtain safety approvals before being allowed to launch (see, for example, the US regulations [14CFR450]).

Mass consumer product. This kind of production is for a device that is produced in large numbers for use by the public. These are often produced regularly, in multiple shifts or over multiple days, though not necessarily continuously. The production rate is often ramped up and down to reflect demand. Mass production for consumer products is often done by a contract manufacturer rather than in house, but not always.

Mass production requires a supply chain that can deliver the right parts on a steady schedule, with warehousing to maintain enough parts to keep the production line going and absorb any expected interruptions in delivery.

While mass production for consumers often does not use security standards as high as those for high-assurance systems, security still applies. In particular, using component parts different from those specified can cause unexpected failures in use. Consumer products also need security to keep the features of a new product secret until it is released, and security to avoid theft during and after production.

The manufacturing process uses assembly instructions for workers. These instructions are developed during the system development phase, and are verified during PVT. The instructions must be understandable by the people who will actually do the assembly, who often have different backgrounds from the people who develop the system. The instructions must also account for the fact that people may switch from working on one product to another and back over time.

Manufacturing may involve molds or jigs used to create mechanical parts. These are designed and produced during development and verified during PVT.

Products need acceptance testing and possibly burn-in after being assembled. The acceptance tests are also designed and verified during the system development phase. The tests often use test equipment that is also designed and verified during development.

Manufacturing results in many assembled and packaged products ready for delivery. These are then delivered to customers or to warehouses using a logistics provider.

The production process should be checked regularly. Because production goes on for a long time, the people or procedures may drift from the procedures originally developed. People find shortcuts, or worker training may change, or the environment in which assembly is done may change. The production activities may also reveal mistaken assumptions embedded in the assembly and testing procedures. Regular checks or audits will find where these discrepancies exist, and allow people to either bring the assembly and testing procedures back on track or create change requests to update the procedures.

28.3 System deployment

The objective of deployment is to set up a system instance for a customer and get them successfully using that system.

There are several kinds of deployments. The first variation is: who is doing the deployment? Consumer products are set up and installed by the customer. More complex systems are delivered and set up by a team that is part of the project. I will refer to this as “assisted deployment”. Other systems are deployed and used internally by the organization that created them. The second variation is whether one is deploying a complete new system installation, or installing an upgrade into an existing system.

The overall flow of events is the same for all these variants:

  1. Deliver and install the system (or its new components) at a customer’s location;
  2. Verify that the installed system works as expected;
  3. Train users on how to use the system properly;
  4. Migrate any data or material if needed; and
  5. Put the system into operation.

A system is deployed into an environment. That environment might be a customer site for a physical system; it might be spread over multiple sites; it might be an attachment to a launch vehicle; it might be resources on a compute server somewhere. In all those cases, the customer finds the places where the system can be installed. The deployment team and the customer usually interact before deployment starts to let the customer know what is required for the system, and for the customer to let the deployment team know what is available.

The environment for a software system might include the number and kind of compute servers used, the amount of memory or storage on each, and the reliability and security of each server.

The environment for physical systems might include physical space, along with the temperature and atmosphere in that space. It might include the mechanical mounting needed, along with electrical, water, networking, and other supply lines.

Some customers will be migrating from an existing system to the new system being deployed. The migration might include moving information from the old system to the new system, or it might involve moving physical artifacts or supply from one to the other. Developing the migration procedures is a development activity on its own; in effect, the procedures are a second mini-system to design and implement.

Complex systems will have users who need to be trained in order to operate the system safely and correctly. The initial group of users are trained during deployment, so that they can verify that the system works correctly and can take over its use once the installation has been accepted. Other users will learn to work with the system later, perhaps years later.

The installed system includes education and training materials for these users. These materials are assembled during the development phase of the project.

Different kinds of users may interact with the system. At the simplest, there are users who directly command and use the system’s primary behavior. A system may also have administrators who are responsible for specialized tasks, such as managing the set of users or the system’s security. It may have people who are responsible for maintenance and repair. It likely has other people who set policy for how the system should be used. All these people use the education and training material, and that material must address each of their needs.

Deployment presents a number of ways that someone could attack and compromise the system. The system components will be in transit from the production facility or warehouse and could be tampered with; they will be received at the customer site and might be accessed before being installed. The system components may be partially installed but not fully configured to be secure during the deployment process. The deployment procedures themselves could be altered or hijacked. All these potential exposures mean that the deployment procedure must be designed with security in mind, and that security must be evaluated as part of the system requirements.
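One small but concrete piece of such a security design is verifying the integrity of delivered components before installing them. As a minimal sketch, assuming the expected digest is obtained through a separate trusted channel, a deployment script might check a package like this (real deployments typically use cryptographic signatures, such as code-signing certificates, rather than a bare checksum):

    # Minimal sketch: check a delivered installation package against a digest
    # obtained through a separate, trusted channel before installing it.
    import hashlib

    def verify_package(path: str, expected_sha256: str) -> bool:
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):   # read in 1 MiB chunks
                digest.update(chunk)
        return digest.hexdigest() == expected_sha256

    # Example use (the file name and digest are placeholders):
    # if not verify_package("installer.pkg", "3a7bd3e2..."):
    #     raise RuntimeError("package failed integrity check; do not install")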

Deployment includes setting up the customer on a customer service system. Once the system has been installed and the initial users have been trained and given access, the customer will begin to take over system operations. As they do this, they are likely to find they do not actually understand some parts of the system and have questions. They will use the customer service system to communicate with the project team for questions and to report problems.

Accidents and incidents may happen during system operation. When these happen, the customer works with the team that developed and maintains the system to investigate what happened. If the accident is serious enough, regulatory agencies may be involved. During the deployment process, the team establishes the necessary working relationships with the customer that will help the customer to detect when accidents have happened and to bring in the team for investigation. The investigation may determine that there is a flaw in the system, in which case a problem report and change requests are sent to the team to guide fixing the flaws. Section 28.7 below addresses how the team handles such changes.

Setting up the customer for ongoing success using the system is the last part of deployment. Once the system is in operation, the customer’s users are responsible for the ongoing safe and secure use of the system. Users of complex systems tend, over time, to find shortcuts and workarounds for how they use the system. They may forget part of their training, and new users may not be trained fully correctly. The environment in which the system operates may also change—parts might be moved, air conditioners changed out, or electrical feeds changed, for example. All of these can slowly change how the system is working and lead to accidents. Regular monitoring or auditing of system and user behavior is necessary to detect and correct these drifts and avoid accidents, and this auditing must be backed by management policy and actions. (See Leveson [Leveson11, Chapter 12] for background.) The deployment activities, therefore, must include working with the customer to establish the necessary monitoring activities and to establish necessary management policies.

Customer deployment. The components for these systems are delivered to the customer, who is responsible for installing or upgrading the system. The process includes:

undisplayed image

Assisted deployment (internal or external). When someone from the project team does the deployment, the process is similar to customer deployment.

The process includes:

undisplayed image

Inputs. The deployment step takes as input:

Deployment can also involve migrating materials or information from a previous system. If so, the procedures for doing the migration are also an input.

Outputs. The deployment step results in:

Milestones. When the customer handles deployment, the milestones involved are their concern.

When the team handles deployment, there are three potential milestones:

  1. Deployment readiness is for determining whether the customer and the team are prepared for the deployment. The customer is prepared by having the environment ready for the installation and having people ready for training. The deployment team has the parts and time available to set up the system and perform training.
  2. Migration readiness applies when the customer is moving from an earlier system. Readiness means that procedures have been developed and tested for moving information or materials from the old system to the new, that the old system has been prepared to extract the information or materials, and that the newly installed system is ready to take in what is being migrated.
  3. Deployment acceptance is the final check that the system works as expected and that the customer’s users have been trained. At this point the customer is ready to put the system into regular operation.

28.4 System deployment examples

Deployment follows many different patterns, depending on the kind of system and customer. The following four examples illustrate some of the range of ways that the general deployment step can happen.

Digital product. Start with a digital product, such as a software application. These are often deployed by the customer, and involve deploying no physical artifacts. The customer downloads the application over a network and runs an installer package to perform the deployment.

The deployment process begins with the customer ensuring they have the resources needed to support the application. This includes operating system and CPU architecture compatibility, and the amount of available memory and storage needed. The customer gets and checks this information, presumably online, before deciding to download and install the application.

Next, the customer downloads an installation package and runs the installation. The package performs checks to ensure that the application is supported in the local environment, and copies in the application contents. The download or the installation package may interact with the customer for payment or licensing.
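A minimal sketch of the kind of preflight check an installer might run is shown below; the specific requirements are invented for the example, not taken from any particular product.

    # Illustrative preflight check before installation; the requirements shown
    # here (supported platforms, free disk space) are invented for the example.
    import platform
    import shutil

    REQUIREMENTS = {
        "os": {"Linux", "Darwin", "Windows"},
        "arch": {"x86_64", "AMD64", "arm64"},
        "min_free_disk_bytes": 500 * 1024 * 1024,   # 500 MB, illustrative
    }

    def environment_problems(install_path: str = ".") -> list[str]:
        problems = []
        if platform.system() not in REQUIREMENTS["os"]:
            problems.append(f"unsupported operating system: {platform.system()}")
        if platform.machine() not in REQUIREMENTS["arch"]:
            problems.append(f"unsupported CPU architecture: {platform.machine()}")
        free = shutil.disk_usage(install_path).free
        if free < REQUIREMENTS["min_free_disk_bytes"]:
            problems.append(f"insufficient free disk space: {free} bytes available")
        return problems    # an empty list means the environment passes the checks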

This process is more or less the same whether the customer is installing a new application, or installing an upgrade to an application they already have.

At this point, the customer has an application they can use. However, they may not yet know how to use it. The customer can learn about the application using training media provided with the application. If the customer is updating an application, they usually look for information on whatever changes the update might include.

The customer is responsible for copying in any information that they may already have that they want to use with the new application.

Most consumer applications provide some kind of customer service, which the customer can use to report problems they find or to ask for help. These services are often provided online as web sites.

Consumer product. Now consider a simple consumer hardware product: a home light fixture.

In this example, the customer is responsible for all of the deployment steps. Unlike the previous example, the deployment involves hardware artifacts and includes steps required to maintain safety.

The customer starts the process of deploying a new light by determining what kind of light they need—in the ceiling, on the wall, stand alone, and so on, as well as the needed brightness and the electrical supply voltage. They then research what fixtures are available from their preferred suppliers, implying that an organization that is building light fixtures sends out specifications and advertising materials to those suppliers well before the customer goes looking.

Once the customer selects, purchases, and receives the fixture, they review the installation instructions that the team has developed and included with the fixture.

The customer then installs the fixture using those instructions. The instructions should include basic safety steps, like turning off power to the affected circuit before working with the wiring. The customer tests that the light works after it has been installed.

Complex system, shared deployment responsibility. The previous examples have been simple, performed entirely by the customer. The next example covers a more complex deployment.

Consider an information system that supports a repair and maintenance workshop. This example is based loosely on a system I worked on for local government public works agencies, which maintained a wide range of equipment from buses to lawn mowers to backhoes.

The repair and maintenance organization had multiple shop sites. Some shops were specialized for working on particular kinds of equipment.

The system provided record-keeping support for managing work orders (repair orders), scheduling resources like work bays or large equipment, and managing parts inventory. It also interfaced with the customer’s other IT systems: security and user authentication systems, and systems to place orders to buy parts and to pay for them.

For a particular installation, the customer asked for a set of features to be added to an existing software package. The development phase of the project for this customer involved work to determine their specific objectives and changes, implement changes to the base system, and then validate the customized system with the customer. Once the customer accepted the changes at the end of the development phase, a production phase generated the software installation packages and other materials for the deployment.

Physically, the system consisted of a small set of servers in a server room, plus workstations of different kinds at the workshops. Communication links connected the shop sites to the server room. The server room provided power, cooling, communications, and support services like backup and security for the servers.

The customer wanted to perform a phased installation and roll-out, where initially only a few people would use the system and over time its use would be extended to more and more sites. The goal was to minimize risk by avoiding disruption to the shop’s existing work, and to contain any problems that might come up as the shop users learned to work with the system. A phased installation would also allow the customer and deployment team to monitor the performance of the servers and communication systems in order to identify unexpected behaviors before they caused problems. The customer decided to continue using their existing (paper-based) system for all existing work, so no data would be migrated into the new system.

The project’s deployment team was responsible for installing and configuring the initial system, and for training the initial users. The customer installed servers and communications, along with workstations at the shop sites. The customer was also responsible for adding users to the system and would take over training and configuration after the system had been rolled out to half the shop sites.

The deployment process proceeded as follows:

This system thus had a phased transition between deployment and operation, rather than a hard split between one phase and another.

undisplayed image

Spacecraft. Deploying a spacecraft covers the activities from when it is delivered to the integration site to be integrated into a carrier or onto the launch vehicle to when it is on orbit and ready to perform its mission.

The general sequence for spacecraft deployment is:

  1. Operational training and rehearsals. The team that will be operating the spacecraft learns how to use the ground systems that will interact with the spacecraft and how the spacecraft works. The training includes rehearsals of events that might happen during the mission.
  2. An operational readiness review, to verify that the operations team is ready.
  3. Spacecraft delivery to integration site.
  4. Final preparation and checkout. The spacecraft gets final checks to verify that it was not damaged in transit. Final data or software are loaded. Fuels and gases are loaded and batteries charged.
  5. Integration with the launch vehicle or carrier. The spacecraft is mounted to its attachment points on the launch vehicle or in a carrier spacecraft. Shortly after this, the spacecraft becomes inaccessible.
  6. Flight or mission readiness review. The final check that the team has completed all the work needed for the spacecraft to be ready to launch, including performing all planned checks.
  7. Launch and deployment. The spacecraft is placed on orbit and released from the launch vehicle or carrier.
  8. Startup and stabilization. The spacecraft turns itself on and stabilizes its state. This typically means stabilizing its attitude and spin, beginning to generate electrical power, and beginning to communicate.
  9. Checkout. The operations team communicates with the spacecraft to ensure that it is in good working order. The operations team can address any problems that have come up during and after launch, and can take steps to calibrate on-board sensors.
  10. Commissioning. The spacecraft is deemed operational.

When deployment is done, the spacecraft is ready to perform its planned mission, it is in communication with other systems, and the operations team is managing the spacecraft.

Deploying a spacecraft is different from the other deployment examples above in two key ways. First, a spacecraft poses far higher safety risks than the other examples. The deployment process reflects this by using procedures that have been designed and checked to meet safety constraints, and the deployment team are trained accordingly. Second, significant parts of the deployment occur beyond human access: while the spacecraft is on orbit, people cannot stop by to observe or fix a potential problem. The spacecraft’s design thus must provide sufficient information to the operations team on the ground to be able to detect and analyze problems without visiting the spacecraft. The operations team also uses detailed records of the spacecraft’s configuration, and so the production process must record all the details of what components were used, their provenance, and inspections of the work.

28.5 System operation

In this phase, the system is placed into operation. The customer uses the system, performing administration and maintenance as needed. Most of the system operation is the customer’s responsibility; in this section, I focus only on what the project does to support the customer’s operation.

The system operation phase affects the project team in two ways. First, the team will sometimes support the customer during operation. Second, the point of the team’s work is to build a system that can go into operation, which means that the system’s design supports all the activities that the users will do. This includes the rare and exceptional activities, not just everyday usage, so these activities are included in the concept and specification to which the system is built.

The customer is responsible for maintaining the system. That may mean only following procedures for periodic checking, but for many systems maintenance can be far more intrusive, and involve regular replacement of some components. The customers rely on maintenance procedures that are designed as part of the system to keep the system operating safely; these maintenance procedures are designed to take safety, security, and reliability constraints into account. The customer also periodically orders replacement parts to install into their system.

The customer also takes care of their users. This includes adding and removing user access to the system and training those users. The project team supports these tasks by including features to manage users and their roles as part of the system. The team also develops training material that the customer uses when bringing on new users.

The system may have problems from time to time. These may reflect flaws in the system, improper usage, wear and tear, or combinations of all three. The customer, as the system owner, is responsible for handling the problems. However, the project team sets the customer up to be able to address problems by developing instructions for detecting and diagnosing problems, and training some of the customer’s staff on how these work. The project may also provide services to help diagnose and repair problems. The project also provides some form of customer support that the customer can use to report problems back to the project.

Most complex systems have human elements—users who operate the system and in doing so act as a control system that manages system behavior. As I noted in the previous section, these users can change how they interact with the system over time, finding shortcuts or using the system in ways they are not expected to. The customer establishes usage policies and performs monitoring and auditing tasks that check that people continue to interact with the system in safe and secure ways. The project team sets the customer up to perform this work by documenting what constitutes safe and secure system usage, including the rationale for why some interactions are acceptable and others are not.

Accidents happen. When some loss or injury occurs due to the use of the system, both the customer and the project team have a responsibility and interest to determine why the accident occurred in order to avoid future accidents. The accident investigation may also be mandated by regulation, in which case regulators are involved. The customer may be able to pursue the investigation on their own, if they have sufficient information about how the system is supposed to be used safely. The project assists in the customer’s investigation by providing that information, which includes the documentation of how to use the system safely, and why. However, for serious accidents, the investigation often requires a more in-depth understanding of the system’s design and implementation. The project prepares for supporting these investigations by maintaining complete records about the system’s concept, specification, design, and implementation, including explanations of the rationale for why choices were made and safety or security analyses that the team did about the system’s design.

Finally, the customer may find that their needs change over time, or that there is some aspect of the system that does not work as well as they had planned. These changes can be externally driven; for example, regulatory changes that affect the customer’s industry can affect what the customer needs from the system. The project team can receive change requests (along with problem reports) through a customer service mechanism.

Inputs and outputs. The operation phase is ongoing, unlike some other phases. It continues as long as the customer continues using the system. It is also primarily the customer’s responsibility.

The working system, as accepted by the customer at the end of deployment, is the primary input. That working system includes parts that support the customer’s tasks:

From the point of view of the project, the customer’s operation produces a few outputs:

Milestones. Most organizations require some kind of authorization to operate in order to place a system in operation. This is typically a review that all of the system deployment steps, including acceptance, have been completed successfully and that the system meets the customer’s policies. All these steps should have occurred earlier, and the authorization to operate is usually just confirmation that none of the steps were skipped.

The system remains in operation as long as the customer chooses and as long as they maintain the system in good repair. The customer thus periodically performs maintenance tasks and audits to check that usage remains safe and secure. The customer periodically determines—perhaps implicitly—whether to continue the system in operation.

28.6 System operation examples

Operations vary widely depending on the kind of system. Here are some examples illustrating the range.

Consumer product. A consumer product is generally the responsibility of its users. The development team is responsible primarily for designing a system that the users can understand and providing enough documentation or training material so that the users learn how the system works. The development team also provides documentation on any cleaning or maintenance tasks the users should perform.

Some consumer products can require occasional more complex maintenance, and a product team might offer a maintenance service in addition to the system itself.

Aircraft. Operating a commercial aircraft is a joint endeavor between the air carrier and its staff, the manufacturer, and the civil aviation authority (CAA). While the carrier’s pilots are responsible for an aircraft in flight, the carrier has overall responsibility for safe operation. The carrier is responsible for setting policy and training its staff in order to meet CAA regulation. The manufacturer supports the carrier by, first, getting type certification for the aircraft design, and then providing the carrier with documentation on the general limitations of the aircraft’s design.

The air carrier is generally responsible for ensuring all its employees and contractors have training and know which rules to follow—pilots, flight attendants, ground handlers, maintenance personnel, dispatchers and so on. Individual people are responsible for complying with the rules and limitations of their certificates—pilots, dispatchers, and mechanics, for example.

The manufacturer works in concert with the air carrier and repair facilities to develop training materials and is responsible for promulgating maintenance documentation, including service bulletins generated from operational reports back to the manufacturer about problems discovered through use of the aircraft. This means that the project team develops this material during the development phase.

If there is an incident or an accident with the aircraft, the carrier typically works together with the CAA and other government organizations as well as with the manufacturer to investigate what occurred. The records of the aircraft’s design and manufacture, along with safety analyses, implementation, and verification, are one of the inputs to these investigations.

Summarizing, the project team has the following responsibilities that affect operations:

Uncrewed spacecraft. Unlike the other examples, an uncrewed spacecraft is operated completely remotely. The only way to interact with it is through command and telemetry communication channels. Without the ability to interact physically with the spacecraft, its operators rely on design records and hardware instances on the ground to interpret the information they receive.

A spacecraft is typically managed by an operations team. This team uses ground systems—which are designed and implemented as an integral part of the overall mission system—to watch the telemetry sent by the spacecraft and send up commands. The operations team plans upcoming activities for the spacecraft, such as observations to take or maneuvers to make, based on the mission plan. The team uses design information about the spacecraft’s capabilities to determine what activities to plan, and the order in which different steps must occur. The team turns these plans into commands that are sent up to the spacecraft, which then follows the commands. The spacecraft sends telemetry messages down to the ground systems. The operations team processes and interprets this data, using information about the sensors generating it, such as records of how each sensor has been calibrated, its position and attitude on the spacecraft, and the format of data it sends.

The operations team also monitors the telemetry data for off-nominal conditions. It detects that the spacecraft has had a problem by comparing the data received against what is expected from the plan, such as expected attitude information, and looking for data values that are out of normal range, such as a high temperature or low battery voltage. After identifying that a problem has occurred, the operations team looks for the causes of the problem and then works out how to return the spacecraft to normal operation. The investigation relies on the spacecraft’s design records. The team often uses simulation models or duplicate spacecraft systems on the ground to see if they can replicate the problem and to verify that any recovery plans will work as intended. Once they have a plan, they formulate the corresponding commands and send them up to the spacecraft.
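The routine part of this monitoring is often simple limit checking. A minimal sketch follows; the telemetry channel names and limits are invented for the example.

    # Minimal sketch of limit checking on incoming telemetry; channel names and
    # limits are invented, and a real system would track trends, not just ranges.
    LIMITS = {
        "battery_voltage": (24.0, 33.0),     # volts, illustrative
        "propellant_temp": (-10.0, 45.0),    # degrees Celsius, illustrative
    }

    def check_telemetry(sample: dict[str, float]) -> list[str]:
        """Return a list of off-nominal findings for one telemetry sample."""
        findings = []
        for channel, (low, high) in LIMITS.items():
            value = sample.get(channel)
            if value is None:
                findings.append(f"{channel}: no data received")
            elif not (low <= value <= high):
                findings.append(f"{channel}: {value} outside [{low}, {high}]")
        return findings

Findings from checks like this feed the investigation process described above: identifying causes, replicating the problem on the ground, and planning recovery commands.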

For example, consider the first crewed Starliner CST-100 flight [Foust24]. During the early part of the flight, several thrusters began showing poor performance that led the flight systems to shut them down. Even though the spacecraft was carrying crew and eventually docked to the International Space Station, no one could physically access the thrusters to determine what had happened. In the end, teams on the ground replicated the performance problems using duplicate thruster units. Having learned the likely cause of the failures, NASA changed the flight procedures for departing the ISS and returning to ground. (The agency also determined that the failures posed sufficient safety risks that the vehicle did not carry crew on the return to Earth.)

Factory system. Consider a generic plant that produces chemicals. Its operation involves multiple chemicals that can cause serious injury and death to both workers and the surrounding population in an accident. While parts of the plant’s operations are automated, there are many manual operations—cleaning, responding to a failure, maintaining machinery, and so on. The plant, therefore, relies on its operators following safe procedures. This generic example is inspired by several real-world examples; see Leveson [Leveson11, Section 2.2.4] for one relevant case study.

Chemical plants are subject to incentives that work against safety. The desire for profitability leads to streamlining operations or shutting down safety-specific systems, which then break safety requirements. Individual staff are likewise incentivized to work quickly, and often look for workarounds that make their jobs easier or faster. These can also break safety requirements. Finally, staff turnover leads to knowledge gaps at all levels, so that workers and management don’t know what is needed to maintain safe operation.

Plants like this are operated by a company. The company’s upper management is the ultimate authority responsible for safe and profitable plant operation; it sets policy for how the plant’s workers will balance profitability against safety. The plant management acts on this policy to run the plant, making specific operational decisions to set procedures. The plant workers then follow the procedures to operate the plant (or shut it down when needed).

The hierarchy within the company forms a control hierarchy, involving decisions, feedback, and commands. Upper management sets policy, gives instructions to plant management, and observes feedback metrics. Plant management gives instructions to staff, adjusting those instructions to meet the company’s policies. The staff in turn control portions of the plant.

Two steps are needed for this control hierarchy to keep the plant operating safely. The first is that everyone working on the plant or overseeing it must have an accurate understanding of how the plant has been designed for safety. The project staff who design and build the plant make this information available to people in the company, both as reference documentation and as training material. The second is that the behavior of each level of the control hierarchy must be regularly monitored to ensure that the people are operating their part of the system consistent with safety designs. If there is a deviation from safe practice that violates safety constraints, the company takes corrective action to stop the unsafe behavior. This is true at all levels of the company, and especially for upper management: cost-cutting measures meant to improve profitability are a common cause of accidents, and upper management must be answerable to checks that will prevent such decisions.

These control systems are part of the system to be designed and implemented during the system’s development phase. Accurate controls do not arise spontaneously; they come from intentional design. A safe system’s implementation defines roles for upper management, plant management, and plant staff, and includes the procedures that each is to follow. These procedures are verified both analytically and (where possible) by testing, in order to ensure that each level will behave in ways that keep plant operation safe. The analyses account for human factors—what kind of information each role can receive, how likely that is to convey the correct understanding of what is happening in the system, the incentives driving people in each role, and how accurately they can implement instructions.

In some cases, auditing operations will find that people are not following the designed procedures but that these changes do not pose a safety risk. These changes first must be checked thoroughly for evidence that they do not violate safety constraints in the system. If they are found to be acceptable, they should lead to a formal change to the documented procedures (in the form of a change request; see Section 28.7 below). The documented procedures must always remain consistent with what people are actually doing so that all staff clearly understand what is acceptable operation and what is not.

28.7 System evolution

The system evolution phase is about making changes to the system after it has been released and potentially deployed to customers. These changes can happen for many different reasons—such as a planned roadmap for adding to the system over time, requests for changes from customers, fixing problems, or changes in regulation. System evolution can occur in parallel with system deployment and operation.

Overall, system evolution is a recapitulation of system development (Chapter 27). It involves working out a purpose for the change, a concept for how the system will work when changed, leading to specification, implementation, and verification. These steps use information about what has already been specified and implemented in the system, along with the reasons why it is that way, to work out how to make changes that achieve the desired results without disturbing the system’s existing behaviors.

undisplayed image

Making changes starts with a change request. In whatever form the request takes, it identifies who is asking for a change, what their purpose is in the change, and why it is worth doing. In practice change requests are usually maintained in a database. Requests can come from many sources. They may be part of the project’s long-term plan to continue developing the system. They may come from customers, who ask for new or changed capabilities. They may stem from the investigations into reported problems or accidents, in order to avoid problems in the future.

A project does not act on all change requests. Some of them will be technically impossible; some will be infeasible because of time or resources. Others might be reasonable requests that have to wait until higher-priority requests have been addressed. The team looks at each request received to determine its importance, its feasibility, and its cost, and makes decisions about whether to accept or reject the request based on the analysis. If a request is accepted, the team determines a relative priority compared to other work or a potential deadline. These are used in planning the team’s upcoming work (Section 20.5).

Determining whether a request is feasible involves determining how much of the system will be affected by a potential change. While working out the concept for the change, a team member determines what parts of the system will be affected, using documentation about the system’s structure and design (Chapter 12). The result is a preliminary analysis listing the set of components that will be changed and the general nature of those changes. This information is then used to estimate the effort that will be needed to design and implement changes for the request.
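As a minimal sketch, a change request record and its triage outcome might look like the following; the fields are assumptions about what such a database could hold, not a prescribed schema.

    # Illustrative sketch of a change request record; the fields are assumptions,
    # not a prescribed schema.
    from dataclasses import dataclass, field
    from enum import Enum

    class Decision(Enum):
        PENDING = "pending"
        ACCEPTED = "accepted"
        REJECTED = "rejected"

    @dataclass
    class ChangeRequest:
        request_id: str
        requester: str                                  # who is asking for the change
        purpose: str                                    # what the change is for and why it is worth doing
        decision: Decision = Decision.PENDING
        priority: int | None = None                     # set when the request is accepted
        affected_components: list[str] = field(default_factory=list)   # from the impact analysis
        estimated_effort_days: float | None = None      # from the preliminary analysis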

Changes happen iteratively. There may be multiple iterations in progress concurrently if multiple changes have been accepted. Handling multiple concurrent iterations requires careful configuration management discipline (Section 17.4).

Making the changes involves changing the specifications and designs for affected components. These changes can be difficult to make accurately because they are done to an existing, complex set of relationships between components. Making a change without causing flaws depends, then, on being able to accurately understand the structure of the system and how parts of that structure contribute to emergent properties like safety constraints. This relies on having rationales, analyses, and earlier designs available, so that people can work from an accurate information base.

Once a change has been specified, designed, and implemented, it is verified. Verifying the work for a change has two parts: ensuring that the modified system meets the new specifications and the purpose of the change request, and ensuring that the rest of the system continues to work correctly.

Once the changes have been verified, they can be deployed to customers as an upgrade or incorporated in new deployments, using the production (Section 28.1) and deployment (Section 28.3) patterns already discussed.

The team continues to evolve the system until it is relieved of responsibility for fixing problems or the system is taken out of operation.

The overall process for system evolution includes:

Inputs. The system evolution phase starts with change requests. A change request is a record of the desired new behavior or properties, or the problem that should be fixed. It also records who is making the request, their reasons for doing so, and information about priority or deadlines if appropriate. The change request may reference incident analysis reports or other background information needed for context.

The evolution phase will take in the current development plan and the current system.

Outputs. The primary outputs are updates to the system artifacts, including updated concept, specifications, design, rationale, and verification artifacts. These artifacts feed to production and deployment phases, which result in other outputs.

The development plan is updated as a side effect of deciding whether to act on a change request and, if accepted, what its priority or deadlines are.

Milestones. There is one milestone unique to the system evolution phase: the decision whether to proceed or reject a change request.

In addition, this phase incorporates all the milestones associated with the development pattern while developing a new version of the system.

28.8 System evolution examples

Consumer software. Many consumer applications are released initially as a simple version, with a roadmap to add features in future releases. This approach lets the developer test the market and build awareness of their application as early as possible, with the least investment, before adapting to customer needs.

These upgrades are often planned to be released on a regular schedule, with a plan or roadmap of what new capabilities will be released each time. Additional bug-fix releases are released as needed between the planned upgrade versions. These are driven by a balance between problem fixes and the roadmap, which is updated by a marketing team listening to customer requests.

Spacecraft. In most missions, the spacecraft hardware cannot be changed once the spacecraft is launched. The opportunities for evolving the system are to update on-board flight software and ground systems.

Flight software is updated for several reasons: correcting bugs found after launch, adding fixes to work around hardware problems discovered in flight, and adding new capabilities. New capabilities might include new kinds of data analysis and science operations, such as the autonomous dust devil detection uploaded to the Mars rovers [Castano06]. The project team develops and tests these software updates using simulations and replicas of the spacecraft on the ground before risking sending changes to the spacecraft. This test equipment is an important output of the original development phase. The ability for the spacecraft systems to continue functioning even after a buggy software update is also an important system property, often addressed using internal fault detection, software rollback, and “safe modes” where the spacecraft operates with only a minimum amount of well-tested software running [Wertz11, Chapter 14, p. 410].

Flight software updates are driven first by problem fixes that are needed and second by mission opportunities to use new capabilities. It is uncommon to plan to regularly produce new flight software versions during a mission.

Ground systems are easier to update, since people can access them directly. For example, a mission can add new ground communication stations or upgrade the workstations and servers in mission control. New mission planning or data analysis tools are regularly tried out during a mission. Some ground system updates are planned on a regular schedule over the course of the mission, though more happen when problems or opportunities are identified.

Some spacecraft mission systems in recent years have tackled in-flight upgrades. The GPS constellation is regularly updated with new spacecraft [Albon24]. Low Earth orbit constellations, such as the Starlink communication constellation, use spacecraft in low orbits that have intentionally limited life spans, and they are regularly replaced with newer-generation spacecraft. The System F6 project, on which I worked, looked at flying in new capabilities over time [LoBosco08].

Factory system. Consider the chemical plant example from the previous section. Over the plant’s life, there can be many reasons why the plant will change from what was originally implemented. New technology can become available that will improve the factory’s operation. Parts can wear out and need replacement, but duplicate parts might not be available any longer and a substitute must be found. The factory’s chemical process may be changing to meet new demands, leading to changes in the plant’s equipment. And finally, there will be changes to operational procedures as noted in the previous section.

All of these involve changing the design of the plant. Following the pattern for system evolution ensures that the necessary design and implementation steps are done so that the plant continues operating safely.

For example, when substituting a different model part for one that is no longer available, there are a number of questions to answer. Does the replacement part meet the functional and safety assumptions of the original? Will the replacement fit into the physical space available, and connect to other parts properly? Does it fit into the control mechanisms, both automated systems and manual control? Is the replacement manufactured with equivalent reliability, and does the supplier provide the same assurances about provenance? How do maintenance and operation procedures need to change to reflect the substitute part?

28.9 System retirement

No system lives forever, and most are deliberately taken out of service when their usefulness has ended.

Most systems continue in operation until there is a decision to retire them. For some systems, this comes when the purpose for the system has been completed—for a spacecraft mission, for example. For others, it comes when the system has worn out enough that ongoing maintenance and repair costs outweigh the cost of replacement, such as for vehicles that wear out. Yet others are replaced because newer systems become available that can meet the customer’s need better.

A system being retired and disposed of typically goes through three periods. In the first period, the system is in normal operation, but the decision has been made to retire it. During this period people plan how to shut the system down and transition its functions or information. They should conduct dress rehearsals to verify that the procedures will work as expected. The system then enters the second period, where it is no longer in normal use but may remain at least partly operational to support transition and archival. Once those are verified complete, the system is shut down for the last time, is dismantled, and its resources are disposed of.

undisplayed image

There are two primary aspects of retiring a system to consider: what to do with information or materials that should be migrated to a new system, or archived, and how to dispose of the artifacts that make up the system.

I discussed migrating into a system in Section 28.3 above. The task of migrating out of a system is part of the same process, involving developing a plan for migrating information or materials from the old system and into the new.

There will be other information that people will want long after the system has been retired, in many cases. This can include logs of system activity or user access that may be needed for later accident investigations or legal inquiries. It can also include information or materials that the system processed that are not being migrated to another system, but that may be valuable in the future. This information is moved from the working system to some kind of archive. Developing the procedures to archive the information, how the information will be organized, and the system to hold the archive requires development on its own, just as migrating information from one system to another does. This development phase involves determining what information needs to be archived and how it will be used once it has been stored, which in turn leads to a concept, then specifications, then a design.

Archived information is usually retained for a long term. If a system has been used for business or manufacturing, retention is mostly governed by regulation—anywhere from one to 30 years in the US, depending on the kind of information. Scientific and medical data is often of value indefinitely, though legal retention requirements may be shorter. Scientific data is often re-evaluated decades after it was first gathered; for example, data collected from the Viking landers on Mars in the mid-1970s was re-interpreted thirty years later after other missions gathered more information about Martian soil composition [Navarro-Gonzalez10]. This particular example also illustrates a problem with many data archives: the mission data were recorded on microfilm and had to be scanned to get digital data to process.

Long-term archival media often have two problems. First, the media wear out and decay over time, which has led to information believed to be safely archived turning out to be unreadable [Purdy24]. Second, even if the media are readable, there may no longer be machines that can read them. I have a number of backup tapes for which I have not been able to find a drive to try reading them.

Sometimes physical artifacts are retained from a retired system. It is common to keep parts of aircraft and spacecraft in museums after they are retired, for example.

undisplayed image

Disposing of system artifacts can range from trivial to complex. Erasing a software application and its data, for example, is easy; once the storage media have been erased, there is no further meaningful trace of the system remaining. Disposing of a system that processed hazardous biological or chemical materials, on the other hand, can be difficult.

The retirement and disposal procedures must be secure. An unauthorized attempt to shut down a working system can cause major losses, and can lead to safety hazards. Information and materials are being moved around during migration and archival, and are potentially accessible to being copied or corrupted. Physical artifacts that are being decommissioned can carry confidential information about both the way the system works and about the customer that has been using the system.

Inputs. Retirement begins with a system in operation, along with records of its specification and design.

Some projects develop the data archival, shutdown, and system disposal procedures during the development phase. If so, these procedures are inputs to system retirement. If not, they are developed during the retirement phase.

If the system’s function is being migrated to a new system, the specification and design of the new system are inputs, used to develop a migration plan during the retirement phase. An unpopulated but functional installation of the new system is also needed.

Outputs. There are three kinds of outputs from system retirement:

  1. A new system that includes information and materials migrated from the system being retired, if appropriate.
  2. Archived information and artifacts.
  3. Physical resources and debris. Resources that have residual value or are reusable are separated from debris that cannot feasibly be reused.

Milestones. The overall retirement phase starts with a milestone decision that the system should be retired.

After that, the three threads of activity—migration, archival, and disposal—each have readiness milestones for reviewing and approving a plan for each, and a verification milestone to confirm that each was completed correctly. The disposal readiness milestone also checks that migration and archival have completed.

There is also a decision milestone to determine when the running system should be taken out of service in order to start migration to a new system and archival.

28.10 System retirement examples

There are many different ways systems are retired. Here are three examples that illustrate different approaches.

Simple software system. When retiring software such as a workstation or phone application, the objective is to remove the software from the system on which it runs, so that none of the software or its related files remain. This is typically done by running an uninstall program that is set up to remove any files that were added on installation, plus any internal files that might have been created (configuration, logs, caches). This uninstaller is typically developed as a part of the application and packaged with it. In some cases, the software can be disposed of by erasing the storage devices that held the software and its related files.
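To make this concrete, here is a minimal sketch of what such an uninstaller might do, written in Python. The application name, install location, manifest file, and per-user directories are hypothetical, not taken from any particular installer framework; a real uninstaller would follow whatever conventions its packaging system defines.

    #!/usr/bin/env python3
    """Minimal uninstall sketch: remove the files recorded at install time,
    then remove configuration, log, and cache files created while running."""
    import json
    import shutil
    from pathlib import Path

    APP_NAME = "exampleapp"                            # hypothetical application name
    INSTALL_DIR = Path("/opt") / APP_NAME              # assumed install location
    MANIFEST = INSTALL_DIR / "install-manifest.json"   # files written by the installer
    USER_STATE = [                                     # files created during operation
        Path.home() / ".config" / APP_NAME,
        Path.home() / ".cache" / APP_NAME,
        Path.home() / ".local" / "state" / APP_NAME / "logs",
    ]

    def uninstall() -> None:
        # Remove every file the installer recorded in its manifest.
        if MANIFEST.exists():
            for name in json.loads(MANIFEST.read_text()):
                Path(name).unlink(missing_ok=True)
        # Remove configuration, cache, and log directories created at run time.
        for directory in USER_STATE:
            shutil.rmtree(directory, ignore_errors=True)
        # Finally remove the installation directory itself.
        shutil.rmtree(INSTALL_DIR, ignore_errors=True)

    if __name__ == "__main__":
        uninstall()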

Sometimes retiring an application means that the server on which the software was running is no longer needed, and so the server can be retired. Disposing of the server is similar to disposing of a vehicle, as discussed next.

Vehicle. Retiring a vehicle, such as a car or aircraft, involves getting rid of the vehicle’s physical parts while recovering as much value from them as possible. At the same time, records about the vehicle are retained for longer, both to meet financial record-keeping needs and to support analysis of maintenance or reliability for other similar vehicles.

The overall process is:

Spacecraft disposal. The objective when retiring a spacecraft is to ensure that it will pose no future hazard to the Earth, other spacecraft, or other bodies. Some of the most important hazards are impacting the Earth and causing damage or injury; colliding with other spacecraft; or contaminating other planets or moons that potentially carry life. Collision can occur either with the whole spacecraft, or with fragments of it if the spacecraft breaks up on orbit. Interfering with radio spectrum is another, though lesser, hazard.

There are four approaches usually used to retire and dispose of a spacecraft.

  1. Cause it to enter an atmosphere and disintegrate, so that no parts remain in orbit and no parts will pose a risk of falling on people or property. This is used for spacecraft in low Earth orbit, such as small spacecraft that burn up when entering the atmosphere and large spacecraft that are directed to impact in the deep ocean. It has also been used for deep space probes around Jupiter and Saturn to avoid the spacecraft impacting and contaminating moons there.
  2. Cause it to enter the Earth’s atmosphere, land, and be recovered. This is used for crewed missions and returning cargo from low Earth orbit. The returned spacecraft equipment is often re-used for later flights.
  3. Cause it to impact another body, such as the moon. This has been used for Apollo hardware and science missions such as LADEE [LADEE13].
  4. Place it in a parking orbit that is stable and reserved for retired spacecraft. This is used for spacecraft in Earth geosynchronous orbit, which require significant energy to deorbit.

If a spacecraft is going to remain in orbit after its useful mission is complete, for example because it is being placed in a parking orbit or left in a low decaying orbit to enter the atmosphere passively, then regulations require passivating the spacecraft. Passivation removes any stored energy that could cause the spacecraft to explode, change its orbit, or activate radios, eliminating ways that it could cause collisions or interfere with communications. It typically involves venting any fuel and other gases or fluids and permanently shutting down the electrical systems.

All of these disposal approaches can experience problems. A spacecraft may lose its communication capacity before passivation commands have been sent to it. Thrusters may fail, interfering with the ability to put the spacecraft into an orbit that will enter the atmosphere or impact as planned. The design of the disposal methods must account for these potential problems, and safety analyses must show that the spacecraft and its procedures will avoid the identified hazards with acceptable likelihood.

The NASA life cycle standards require that a mission develop the plan for retiring and disposing of a spacecraft during the development phase [NPR7123]. This includes the plan for how the spacecraft will be disposed of, including meeting safety requirements. The plan must also include the procedures for archiving all mission and project data.

Sidebar: Summary

Chapter 29: Project ending

When a project ends, there are three objectives: completing obligations and support for stakeholders (Section 16.2); saving information and artifacts that might be needed in the future; and releasing resources that the project used.

Note that ending the project is separate from retiring any particular instance of a system. Ending the project is about stopping development and support for a system product, independent of whether there are instances of that system in operation or not. Some projects will combine these, such as for exploration space missions that build and fly one spacecraft.

A project might end for one of many reasons. It might have a fixed term or have completed a defined system deliverable. It might run out of money or time. It might no longer fit the organization’s or funder’s strategy, perhaps because a better replacement system is planned. Competitors might have won over customers, leaving no remaining demand for the system. The team might be unable to deliver, with the project behind schedule, over budget, or lacking key features.

The first step is a decision to wind down the project. This is typically a decision made by the organization that hosts the project, or its funders; the project staff generally do not make the decision on their own.

A decision to end the project is followed by a plan for how to do so, which defines the steps the team will take to meet the final objectives. The plan typically gets review and approval before proceeding. (In some environments, at least part of the plan must be worked out early in the project, long before any decisions are made.)

undisplayed image

The following sections list some of the steps involved in ending a project. The specific steps will depend on the project; for example, not all projects have contracts with funders that must be closed out. The team can use this list to help build the plan, bearing in mind that some steps must be done in a particular order: customers should be notified of the project’s impending end before contracts are ended, and production should be shut down only after all upgraded components have been manufactured.

Obligations to customers. If there are any system instances still in operation, the first step is to let those customers know that the project is ending. If system instances are owned and operated by customers separate from the project, then they will want to work out how to keep their system in operation after project support ends or they will decide to retire the system. The terms on which the system is licensed may affect what the customer can do after the project ends.

The project develops any final updates or fixes for the system, and releases them for deployment along with appropriate documentation and training. The project builds or acquires such spare parts inventory as is needed for remaining customers before shutting down production.

If there are contractual relations with the customer, the contract is closed out. This might include final billing or payments, or other deliverables.

Finally, the customer service mechanisms are shut down.

Obligations to team. Ending a project means loss of work for everyone working on it. It can also mean the loss of social relationships.

The first obligation to the team is to keep them informed once the decision has been made to end the project. The people should understand why the project is ending, the plans or timeline for winding down, and their roles during that time.

People will be needed on the project for different lengths of time. Some roles, such as developing new system changes, will end shortly after the wind-down begins. Other roles, such as closing out finances and contracts, will last to the end. Each person needs an expectation of how long they will be needed so that they can make plans for what to do next. (In some jurisdictions, notices of layoffs are required well in advance.)

At the same time, many people will have incentives to move on to something else before their project role is complete. The plans for ending the project must take this into account, and often include incentives for people to stay on as long as they are needed.

Finally, the team’s experience represents an asset. These people can be a resource to other projects in their organizations. Helping people transition benefits those projects and, done well, generates goodwill that encourages people not to leave early.

Obligations to funders. Some projects will have contracts or other agreements with funders. These projects provide final reports and other deliverables to the funder. They can then finalize financial accounting with the funder and close out the contractual relationship.

Obligations to regulators. Some projects for systems in highly regulated industries may need to work with their regulators when the project is shutting down. This might include filing notices that the project is ending. The project is responsible for determining what other requirements its regulators may have.

Obligations to organization. The project takes two final steps: saving information and releasing resources.

There are several reasons that information about the project may be needed in the future. There may be a need to restart the project, in which case the new team must be able to learn about the system’s design and implementation, as well as the reasons behind its design. The intellectual property in the system may be valuable for licensing or sale. There may also be investigations related to the system or the project that need information about how the project was conducted.

The project may archive the artifacts needed to restart the project. It may also archive records of project execution, known issues, and any plans that will not be completed. Some projects will archive physical artifacts: molds and forms that support production, for example; some artifacts may be kept for museums.

The end of the project is a time to hold a retrospective on how the project went. A bit of introspection about what went well and what didn’t will help people on the team do better on future projects, and it helps build institutional knowledge.

Archiving project information raises security concerns. The process of moving information to an archive must maintain the information’s integrity and confidentiality: the information must not be modified, lost, or disclosed during the move. After that, the archive itself must continue to maintain the information’s integrity and confidentiality.
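As one minimal illustration of the integrity part of this, the sketch below computes a cryptographic digest of each file before it is copied to the archive and verifies the copy afterward, keeping the digests for later audits. The directory names are hypothetical, and a real archival process would also need to address confidentiality (for example, encryption and access control), which this sketch does not.

    #!/usr/bin/env python3
    """Sketch: verify that files copied into an archive arrive unmodified,
    by comparing SHA-256 digests computed before and after the copy."""
    import hashlib
    import shutil
    from pathlib import Path

    SOURCE_DIR = Path("project-records")            # hypothetical working location
    ARCHIVE_DIR = Path("archive/project-records")   # hypothetical archive location

    def sha256_of(path: Path) -> str:
        digest = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def archive_with_verification() -> dict:
        manifest = {}
        for source in sorted(SOURCE_DIR.rglob("*")):
            if not source.is_file():
                continue
            destination = ARCHIVE_DIR / source.relative_to(SOURCE_DIR)
            destination.parent.mkdir(parents=True, exist_ok=True)
            expected = sha256_of(source)             # digest before the move
            shutil.copy2(source, destination)        # copy contents and metadata
            if sha256_of(destination) != expected:   # digest after the move
                raise RuntimeError(f"archive copy of {source} is corrupted")
            manifest[str(destination)] = expected    # keep digests for later audits
        return manifest

    if __name__ == "__main__":
        print(f"archived {len(archive_with_verification())} files")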

The project also releases the resources it has held. This includes:

Lastly, the people on the team will move on as discussed above.

29.1 Project cancellation

Some projects end because they are canceled, even before they have completed their development phase. Anecdotally, it seems that more projects are canceled than go to completion—this is a consequence of using competitive approaches to programs, and the net effects of competition are generally regarded as valuable. The information in this chapter applies to canceled projects just as to other projects.

Consider two examples, based on projects I have worked on.

In the first project, the team was writing a proposal for a US DoD spacecraft system. In the proposal-writing phase, the team has to establish the basic architectural and management approaches for the project, show that they meet the department’s needs, and establish the price at which the team proposes to build the system. We progressed through establishing the initial concept and architecture for the system, and began evaluating the solution to see how well it would serve the customer and how much it would cost to build.

We had a checkpoint milestone where we reviewed what we had found. At that review, it became clear that while our team had a decent solution for the needs, we did not have a great solution, and that other companies we expected to propose designs would likely have better solutions (because they had more experience in a couple of key technical areas). We made the decision not to pursue the proposal.

This was a good decision. Assembling a proposal is not a small task; we had a team of about 15 people working long hours. For US government projects, the proposer generally pays for the proposal development. Choosing to spend our team’s time and money on this project meant that the team couldn’t work on some other project. We judged that the opportunity cost was not matched by the probability of successfully winning the contract, so we freed up the team to work on a different system that did prove successful. If we had continued to work on the original proposal, we would have spent the budget available to develop proposals and could not have spent it on the proposal that succeeded.

In the second example, a different US DoD spacecraft program, the team was about two years into a multi-year contract. The team had performed excellently in a competitive first prototyping phase, and was the only team selected to move on to a second phase for building an initial working version. A key subcontractor on the team had staffing and management problems and was not delivering results. Within the team we struggled to fix the execution problems or find another way to build the necessary components, all the while keeping a large staff on payroll and running through budget. While the technological solutions for many system capabilities were probably sound, the team could not deliver. The customer observed the problems and, after working with the team to try to resolve them, went through the process to cancel the project.

This was also a good decision. In hindsight, the team lacked necessary capability in the subcontractor and in the project management team. If the project had been allowed to continue, it is unlikely that the team would have solved the problem and more money would have been spent without benefit in the end.

The takeaway from these examples is that there are many sound reasons for canceling a project. Sometimes the cancellation is designed in (as with competitive acquisition); other times it is because continuing to invest money, time, and the care of the team building the system has become unlikely to pay off.

For a more general discussion of US DoD project failures, see the report by Bogan et al. [Bogan17].

Sidebar: Summary

Chapter 30: Using the reference life cycle

25 September 2024

The last several chapters have presented a reference life cycle pattern. This pattern is intended to inspire thoughts about how a project can organize its own work; it is not ready to use off the shelf. Each project will have its own needs, its people will have preferred ways to work, and some projects will have to follow life cycles mandated by regulation or industry standards.

undisplayed image
Figure 30.1: The entire reference life cycle.

The purpose of the life cycle is to get a system built, deployed, and sustained that meets customer and other stakeholder needs. In doing so, the project develops the whole system, including all its development artifacts, not just the end product. Using a life cycle well calls for flexibility combined with discipline, keeping in mind the edifice of artifacts the project builds up along the way.

The reference pattern does not discuss the roles involved. A full definition of each phase will include definitions of who performs different tasks in each phase, and in particular who is responsible for milestones. I argue elsewhere (e.g. Section 8.2.6) that, to be meaningful, reviews must be done by people with an independent perspective on the material being reviewed, and that approvals must reflect a check on the work fitting into the project’s big picture.

The objective for any project is to develop and adopt a life cycle that meets its needs. In the next section I discuss several principles that a good life cycle will follow; these can help people evaluate a life cycle they are considering. Some other considerations include:

The project works out what life cycle patterns it uses, and documents the patterns. This effort starts during project preparation (Chapter 25). The team does not necessarily need to define the entire life cycle all at once; the definition can be built iteratively, as long as it stays ahead of the work the team needs to do. In practice I have found that enough should be completed in the project preparation phase that the team understands the general complexity of the work ahead, has chosen a development methodology, and can name the major milestones they will need to meet. The remainder of the life cycle can be worked out during purpose and concept development, and will likely be refined or adjusted as the project moves along. (I worked on one project that had a limited budget and spent much of that budget writing elaborate management and engineering plan documents before even beginning to work out the high-level system concept. The result was a pile of documents that were never looked at again, which was a waste of effort.)

In no case should the team get ahead of the defined life cycle. See Section 8.1.5—Principle: Team habits for a discussion of this principle.

The life cycle patterns have value only if the team actually uses them. This means that the team must know that the patterns exist, understand them, and agree that they are useful. The people in the team must also understand that they have a responsibility to follow the patterns, or to raise an issue when they find a problem with the life cycle’s definition. Achieving these means educating people as they join the team about what the life cycle is and how to learn about it, as well as monitoring that everyone actually follows the patterns. The team can also learn about and accept a life cycle a bit more easily if they are involved in developing the patterns; at minimum, they should be able to give feedback before the patterns are adopted.

The life cycle patterns are documented in a way that the team can find them and learn about them when they are joining the team and when they need to refresh their understanding of how some step works. The documentation is an artifact that should be managed using the principles in Section 17.4: it should be versioned and under change management; it should be stored in a way that the team can find it when needed; and it should be secure enough that it will not be tampered with.

There is no one right way to document them, as long as the documentation is well-organized and accessible. Some organizations prefer to define the life cycle in a prose plan document, which can be printed in its entirety if needed. I have had some success maintaining the documentation in a wiki or in a collection of web documents; the advantage of these is that they allow linking between parts of the document. The patterns should be explained and listed explicitly; they should not be hidden in a workflow system that doesn’t let team members see and understand the whole context for their work (see Section 4.6 for an example).

The documentation for each phase or step in the life cycle should include the information listed in Sections 21.5, 21.6, and 21.7.
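As an illustration of one way to keep that documentation uniform, a project could capture each phase definition in a small structured record, so that the inputs, outputs, and milestones are always present and easy to review. The following Python sketch is hypothetical: the field names and the example phase shown are mine, drawn from the retirement phase description earlier in this part, not a prescribed format.

    from dataclasses import dataclass, field

    @dataclass
    class PhasePattern:
        """One life cycle phase or step, with the kinds of information a
        phase description should carry (field names are illustrative)."""
        name: str
        purpose: str
        inputs: list = field(default_factory=list)
        outputs: list = field(default_factory=list)
        milestones: list = field(default_factory=list)

        def missing_parts(self) -> list:
            # A simple completeness check the team could run before adopting the pattern.
            return [part for part in ("inputs", "outputs", "milestones")
                    if not getattr(self, part)]

    # Hypothetical example, summarizing the system retirement phase (Section 28.9).
    retirement = PhasePattern(
        name="System retirement",
        purpose="Take the system out of service: migrate, archive, and dispose of its parts.",
        inputs=["Operational system, with its specification and design records",
                "Archival, shutdown, and disposal procedures, if developed earlier"],
        outputs=["New system populated by migration, if applicable",
                 "Archived information and artifacts",
                 "Physical resources and debris"],
        milestones=["Retirement decision",
                    "Migration, archival, and disposal readiness reviews",
                    "Verification that each thread completed correctly"],
    )
    assert retirement.missing_parts() == []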

30.1 Meeting life cycle principles

In Section 21.10, I listed principles that a life cycle pattern should meet. The reference life cycle pattern in this part reflects these principles, though it cannot address all of them. Here are ways that a life cycle built using this reference as a base can address them.

Know the purpose for something before developing it. The development phases in the reference life cycle all start with a purpose development step, in which the purpose for the system, component, or feature gets worked out before proceeding on to concept and design. The system evolution phase reiterates these patterns.

The project preparation phase is a time to think about the purpose of the project as a whole, and to work out the purposes for the different aspects of project operations: for example, what the team organization should achieve, or what is expected of the life cycle and procedures.

Documenting these purposes means that when the questions are revisited—and they will be—people can understand the reasons why decisions were made, instead of forgetting why and making up new and probably different reasons.

A good life cycle definition will ensure that these phases have review and approval milestones that check that the purpose has been worked out and documented.

Build in time for and incentivize deliberative thinking. The concept steps in development and evolution support this kind of deliberation, as long as the team culture actually incentivizes taking the time to work through a concept deliberately.

The procedures and instructions for reviews complement the life cycle patterns by prompting reviewers to ask questions about the deliberations taken, and by encouraging them to reject work that has not been thought through. Again, projects that are in a rush will tend to disincentivize this, usually storing up trouble for later. The project leadership can set an example and incentivize taking enough time to think.

Assign decision-making authority to an appropriate level based on the nature of the decision. The reference life cycle does not address this as written. The structure of the team, and how roles are organized within it, complement the life cycle and determine how authority is distributed. The specific choices about which roles can make which decisions are encoded in the details of the life cycle phases and in the procedures that apply during those phases.

For more details, see ! unknown reference XXX.

Build in ways to check work, and design them so they are a team norm and not prone to triggering defensive reactions. The reference life cycle includes reviews at regular points in the work in order to support this principle. The definitions of procedures for reviews augment the life cycle by making it clear what is to be reviewed and how people are to go about the reviews.

Build for the longer term. The reference life cycle supports this somewhat by providing development steps in which the team can think about how to design for the long term and can produce documentation to support future revision. A project can define more specifically what kinds of documentation are expected from development phases, and review procedures can make it clear that such documentation must be provided before a piece of the work advances in its development steps.

Project-wide decision points. I have pointed out some times when the life cycle might have review and decision points, such as after purpose development, during concept development, and in the acceptance phase at the end of development.

Think about exceptions that might happen, how to handle them, and when to change course. I have not tried to address this principle in the reference life cycle. Working out how to handle exceptions is a process like designing for safety or reliability (Chapter 43): working out the kinds of hazards (exceptions) that might be foreseen, then deciding what should be done about each one. The particular kinds of exceptions depend on the project: a delay in getting a new funding round affects a project in a startup but not a small project in a well-funded organization, for example.

Some kinds of exceptional conditions are not really a matter for the life cycle, but rather for the development methodology, procedures, and planning approach that the life cycle patterns organize. Risk management (! unknown reference XXX) and the way that planning accounts for uncertainty (! unknown reference XXX) are ways to anticipate specific exceptional conditions and, in many cases, avoid them.

The choice of development methodology affects how easily the project can adjust when it needs to change direction (Section 27.5).

Define the work so that everyone on the team can agree when a step has been completed. This is achieved by clearly documenting each step or phase in the life cycle.

Give a clear definition for each step of the quality considerations by which the work can be judged. Similar to the previous principle, this is met by the documentation for each phase or step.

Make the pattern as light-weight as possible without compromising quality. The reference life cycle in these chapters is only a skeleton of a complete life cycle definition. I believe that everything in it is necessary for most projects, though some projects will likely be able to trim out some parts. As long as a project’s life cycle does not add too much to this reference, the life cycle itself will likely be acceptably lightweight.

There are three ways that I have seen a project end up with a too-heavyweight process. One is adding too many new phases or steps to the life cycle, to the point that people on the team have trouble figuring out where the project is and what steps they should be doing. Another is making the work inside one phase too complex: adding more reviews than the minimum necessary, for example. The third is letting the procedures that say how to do parts of the steps become too complex. In Section 4.6, I discussed how one overly complex procedure (in that case, for qualifying component vendors) caused problems for a large launch vehicle project.

30.2 Relation to NASA life cycle

As many people are familiar with the NASA life cycle, or may be obliged to use it (or a variant of it), in this section I discuss how the canonical NASA life cycle compares to this reference life cycle. I will use the general NASA life cycle defined in NPR 7120.5 [NPR7120, Figure 2-5]. I presented an overview of this life cycle in Section 23.2.1.

The NASA life cycle is divided into seven major phases:

  1. Pre-Phase A: concept studies that define a potential mission.
  2. Phase A: concept development that results in the definition of a feasible and useful mission.
  3. Phase B: high-level design of the system for a mission and evaluation of the technology available for the mission.
  4. Phase C: the bulk of development, in which the system components are developed, verified, and manufactured.
  5. Phase D: assembly, integration, and test of the flight and ground systems; launch and initial on-orbit checkout.
  6. Phase E: operation in flight.
  7. Phase F: close out, including flight system disposal.

This life cycle was developed over several decades as NASA learned how to develop and operate complex missions. Elements of this approach have been adopted by many other organizations—terms like “System Requirements Review” and “Preliminary Design Review” have become nearly ubiquitous in the aerospace industry.

The overall flow of the NASA life cycle is organized around two constraints: fitting in with the US Federal funding cycle, and managing risk for a few highly expensive steps. The funding constraints come at the transition from Pre-phase A to Phase A, when the mission is approved and funded enough to develop its concept, and between Phases B and C, when the agency commits to funding the full mission [NASA16, Section 3.5, p. 25]. The distinction between Phases C and D is that Phase C covers developing designs and fabricating components, while actual assembly of the spacecraft does not start until Phase D, by which point there should be little residual risk that the system design will not work out.

undisplayed image
Figure 30.2: NASA life cycle mapped to reference life cycle

The NASA approach was, however, developed for hardware-heavy systems, and people who today develop spacecraft or aircraft with a larger proportion of software components sometimes find it difficult to map software project best practices onto it. There are usually two issues: software development best practice puts integration earlier than the way many people interpret the NASA model, and many software developers combine design and implementation, especially for novel software functions. I show one way to reconcile these approaches in the mapping in this section.

The reference life cycle I have presented is organized around types of work—conception, specification, design, and so on. The NASA life cycle is organized at the highest level around milestones that check progress early, allowing corrections before committing agency resources. This means that the NASA life cycle splits several of the early phases in the reference life cycle in two, with a major review or checkpoint of the project’s progress before continuing. These two approaches are compatible: almost every project will have some kind of project-wide milestones alongside the milestones specific to the work phases.

In the following, I present how each of the NASA phases maps to the reference life cycle.

30.2.1 Outside the NASA life cycle

The reference life cycle defines the project preparation phase and the project support “phase”. The preparation phase involves roughly defining the project and establishing basic operational capabilities. Project support covers ongoing support functions, like managing teams, finances, or artifacts.

In the NASA environment, the initial support is provided by one or more agency centers and external collaborators, using budget, tools, space, and people for general concept exploration. Each center has its own procedures for starting up a concept exploration project.

Similarly, the NASA agency provides essential support services to its projects.

In one project I worked on, the NASA Ames Research Center had a Mission Design Center that was charged with exploring potential mission concepts. A small group developed the mission idea and explored ways it could be realized. Ames and the agency provided all the key support infrastructure: staffing, finance, office and lab space, and IT services, for example.

30.2.2 Pre-phase A—Concept studies

The Pre-phase A work develops a concept for a mission, presumably in response to NASA agency priorities. It is expected to limit its work to the concept of a mission: what it might achieve, who would benefit from the mission, and high-level technical approaches that might support such a mission.

There is one major review in this phase: the Mission Concept Review (MCR). This checks that the potential mission is well formulated and that there is sufficient interest to justify funding “project formulation”—working out a detailed concept and high-level design.

At the end of Pre-phase A, after the MCR, the agency makes a decision whether to continue the project and fund it for “formulation”: the phases where the concept and high-level designs are worked out. This involves greater financial commitment than the early studies, and is the start of the “real” mission.

Pre-phase A maps to the purpose development phase (Section 27.3) and part of the concept development phase (Section 27.4). The purpose development phase covers identifying what the mission might do, and who the mission stakeholders might be. The concept development produces an initial sketch of a mission concept, without breaking the concept down into great detail.

30.2.3 Phase A—Concept and technical development

This phase is the first of two that are about developing a feasible high level design for a mission and ensuring that necessary technologies are available. Phase A includes developing a complete mission concept and high-level system designs. The team identifies any new technology that the mission will require and works out what will be needed for it to be ready to use in flight.

The depth of design and requirements is not clearly specified in the NASA procedural documents. However, my experience is that it is generally taken to include the spacecraft and its major subsystems, ground systems and their major subsystems, potential launch vehicles, and testing and other ground support equipment to a similar level. The exercise is intended in part to develop the general structure of the system and its likely cost, and in part to find those parts of the system that will require new technology.

Phase A includes developing a list of new technology that will be used for the mission, an evaluation of its maturity, and plans to develop that technology so that it will be mature enough for flight.

This is the first phase where a NASA project is funded for itself, as opposed to using resources allocated for general mission concept development. The various management and development plans required by NASA procedures get developed in this phase.

Phase A includes two key reviews:

The NASA Phase A maps to the second part of the concept development phase in the reference life cycle, along with the concept, specification, and preliminary design steps for the highest-level components in the system.

30.2.4 Phase B—Preliminary design and technology completion

Phase B continues the work from Phase A, completing a preliminary design and refining any new technology to the point where it is sufficiently mature to use in flight. This often involves building models and prototypes of parts of the system.

Phase B also involves addressing the safety and security of the mission. The high-level design should incorporate designs for safety, security, and other critical mission success factors, and the design should be backed up by analysis showing why it is sufficient. (See Chapter 43 for more on safety design.)

At the end of Phase B, the project should have a high-level design for the entire mission. That design should meet all the mission objectives, be technically feasible, and fit within the available cost and schedule.

After Phase B, the agency allocates money to actually implement the system. The process can be complex and time-consuming, potentially involving legislative approval. The estimates for cost and schedule should be accurate enough that the project is unlikely to exceed them, which would require repeating the process to find more funding or time. This imposes limits on how much risk the project can carry going from Phase B to Phase C.

There is one key review in Phase B:

The end of Phase B maps to a slice across the development phase in the reference life cycle. It includes the concept, specification, and preliminary design of the first two or three levels of components in the breakdown hierarchy (Section 11.3; Chapter 38). In general this might include the major spacecraft subsystems: payload, structure, propulsion, attitude control, and so on. The design work at this point includes prototyping or modeling components that pose technical risk, and the design may go to deeper levels of the breakdown hierarchy if needed to understand and address that risk.

The mission-level PDR follows reviews of the component-level preliminary designs.

30.2.5 Phase C—Final design and fabrication

This phase is when most of the development and production work is done. It involves designing, building, and verifying all the components in the system, to the point where they are ready to be assembled into the working spacecraft and ground systems.

Phase C is designed around the spacecraft being difficult and expensive to assemble, involving building large structures, using complex manufacturing tools, threading complex wiring harnesses through the structure, and putting large amounts of money at risk during the assembly. This leads to organizing the final assembly work to avoid as much risk as possible by ensuring that all the components are ready to assemble before committing to the final assembly steps.

During this phase, the team completes all of the designs and implementations of the system components, and verifies all of them. This usually includes producing engineering and qualification units of hardware components (Section 27.8) for testing, including destructive testing for some parts. It also usually includes integrating all of the engineering or qualification units and the corresponding software into a testing version of the entire spacecraft in order to verify the entire integrated system.

Verification in Phase C typically includes verifying the human interfaces in the system: can an operations team use the ground systems to accurately control the spacecraft, using simulated telemetry that shows the spacecraft in different conditions (including off-nominal conditions)?

Phase C is typically divided into two parts: the first for completing all the designs, and the second for implementing and producing the components. The Critical Design Review (CDR), at which all the designs are checked, separates the two parts.

I have seen the Critical Design Review milestone cause confusion: how far should work progress before the CDR? What is the boundary between “design” and “implementation”? For hardware components, such as an electronics board, engineers work on the board design: the layout of the components and traces that will be fabricated. The NASA CDR definition ([NPR7123, Table G-7, pp. 113-4]) indicates that the CDR should include “integrated schematics” and “fabrication, assembly, integration, and test plans”, which would indicate that the board design is complete. That the document also indicates that the CDR and Production Readiness Review are often coupled lends credence to the interpretation that the CDR reviews the board designs.

If this same interpretation were applied to software, it would imply that the software would be essentially complete by CDR. Software source code is the equivalent of electronics board design: while it is thought of as implementation, it must be processed through a build system to produce the actual executable software, just as a board’s design file is used to manufacture the boards.

However, the NASA Systems Engineering Handbook states that the CDR for a software component should occur “prior to the start of coding of deliverable software products” [NASA16, Section 3.6, p. 29]. In other words, the documents appear to disagree, though NPR 7123.1 is presumed to have precedence.

Further, software is often developed iteratively, implementing one version after another, each version adding some amount of functionality over time. Some lower-level functionality is left as a mockup, perhaps not even fully designed, until some of the higher-level integrated functionality has been implemented and verified (the idea of integration-first development (! unknown reference XXX), done to reduce risk as quickly as possible). Software development best practice also has verification proceeding continuously throughout implementation, with feedback to the implementer as early as possible. This often implies having some of the hardware components built and available for testing the software before the software is completed.

An official answer to how a team should resolve the discrepancies and interpret the CDR for a NASA project will have to come from the relevant NASA authorities.

However, in practice, I have found that focusing on the review before implementation is more useful for components in the upper and middle levels of the breakdown hierarchy. For example, this might include the major components within a subsystem, such as power distribution or generation within the electrical power subsystem, or attitude control algorithms in the guidance, navigation, and control subsystem. Components at these levels realize the important relationships between components in the system structure (Section 12.2) and the way components work together to produce emergent properties (Section 12.4). Analyzing these designs allows one to check whether key system behaviors will be met, and whether properties like safety or reliability are handled correctly. These are the properties that are difficult to change if the implementation is found during verification not to meet them. The design and implementation of low-level components should be reviewed, but as long as there is an obvious, low-risk approach for them, their review need not block the design reviews of the system as a whole. This interpretation, of performing the critical design review before implementation, means that the team is then free to implement software components incrementally if that is the best approach for that part of the system.

There are three reviews in Phase C:

The NASA Phase C maps to completing the development phase, the acceptance phase (Section 27.9), and the system production phase (Section 28.1) in the reference life cycle.

The CDR milestone maps to a slice through the system and component development phases, at the end of the design step for most or all of the components. The PRR for a component is equivalent to a review at the end of the production unit development step (Section 27.8). Note that the reference life cycle has a manufacture and deployment check milestone in the acceptance phase; this applies when the entire system is manufactured together, rather than following the model implied in the NASA life cycle, where different hardware components go to production individually. Finally, the SIR is equivalent to the deployment readiness review at the beginning of the deployment phase (Section 28.3) in the reference life cycle.

30.2.6 Phase D—System assembly, integration, and test, launch and checkout

Phase D covers the work between the end of designing and building all the parts and having a spacecraft on orbit ready to begin its mission proper. This includes assembling the spacecraft and ground systems, and verifying that they work (and work together). The verification typically involves testing the assembled spacecraft in vacuum, under strong vibrations, and in thermal environments equivalent to what it is expected to handle in flight—but not testing beyond those levels, in ways that might damage the vehicle. After testing, the team proceeds onward to integrating the spacecraft with its launch vehicle, final preparations, launch, and starting operations on orbit. The team on the ground finally checks the spacecraft out before declaring it ready to begin its mission.

Some missions build a second copy of the spacecraft to be used on the ground for debugging issues with the one in flight and to test possible commands before sending them to the operational spacecraft. The duplicate is typically assembled in Phase D. It might use qualification units for hardware that were used for testing in Phase C, rather than flight-ready units.

There are several reviews in Phase D. All of them are final checks that some part of the mission is ready for taking an irrevocable step. These include:

The NASA Phase D maps directly to the deployment phase in the reference life cycle. It takes in manufactured components and procedures, assembles them into a working system, tests that it has been assembled properly, and starts it in operation. The milestones in the NASA Phase D are different from the deployment phase milestones mainly because they are specific to launching a spacecraft.

30.2.7 Phase E—Operations and sustainment

In this phase, the team operates the mission through to its end.

There are two kinds of reviews that occur in Phase E:

Phase E is equivalent to the system operation (Section 28.5) and evolution (Section 28.7) phases in the reference life cycle. The Decommissioning Review is equivalent to the decision to retire the system at the beginning of the system retirement phase (Section 28.9).

30.2.8 Phase F—Close out

The final phase in the NASA life cycle involves retiring and disposing of the flight systems, retiring or releasing ground systems, archiving mission data, and closing out the project.

There is one review identified in the NASA life cycle:

This phase corresponds to the system retirement (Section 28.9) and project ending (Chapter 29) phases in the reference life cycle.
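To recap the mapping walked through above, the sketch below records the correspondence between NASA phases and reference life cycle phases as a simple data structure that a project could keep next to its own life cycle documentation. The groupings are my summary of the preceding subsections, not an official NASA mapping.

    # A compact summary of the mapping in Sections 30.2.2 through 30.2.8.
    # Keys are NASA phases; values are the reference life cycle phases or steps they cover.
    NASA_TO_REFERENCE = {
        "Pre-Phase A": ["purpose development (27.3)",
                        "initial part of concept development (27.4)"],
        "Phase A": ["remainder of concept development",
                    "concept, specification, and preliminary design of highest-level components"],
        "Phase B": ["concept, specification, and preliminary design of the next component levels",
                    "prototyping or modeling of high-risk components"],
        "Phase C": ["remainder of development", "acceptance (27.9)", "system production (28.1)"],
        "Phase D": ["deployment (28.3)"],
        "Phase E": ["system operation (28.5)", "system evolution (28.7)"],
        "Phase F": ["system retirement (28.9)", "project ending (Chapter 29)"],
    }

    if __name__ == "__main__":
        for nasa_phase, reference_phases in NASA_TO_REFERENCE.items():
            print(f"{nasa_phase}: {', '.join(reference_phases)}")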

Sidebar: Summary

Bibliography

[14CFR450] “Part 450—Launch and reentry license requirements”, in Title 14, Code of Federal Regulations, United States Government, August 2024, https://www.ecfr.gov/current/title-14/chapter-III/subchapter-C/part-450, accessed 2 September 2024.
[Albon24] Courtney Albon, “Space Force may launch GPS demonstration satellites to test new tech”, C4ISRNET, February 2024, https://www.c4isrnet.com/battlefield-tech/space/2024/02/09/space-force-may-launch-gps-demonstration-satellites-to-test-new-tech/, accessed 11 September 2024.
[Ambler23] Scott Ambler, “What happened to the Rational Unified Process (RUP)?”, https://scottambler.com/what-happened-to-rup/, accessed 29 February 2024.
[Bezos16] Jeffrey P. Bezos, “2015 Letter to Shareholders”, Amazon.com, Inc., 2016, https://s2.q4cdn.com/299287126/files/doc_financials/annual/2015-Letter-to-Shareholders.PDF, accessed 22 February 2024.
[Bogan17] Matthew R. Bogan, Thomas W. Kellermann, and Anthony S. Percy, “Failure is not an option: a root cause analysis of failed acquisition programs”, Naval Postgraduate School, Technical report NPS-AM-18-011, December 2017, https://nps.edu/documents/105938399/110483737/NPS-AM-18-011.pdf.
[CISA21] “Defending against software supply chain attacks”, Cybersecurity and Infrastructure Security Agency, U.S. National Institute of Standards and Technology, April 2021, https://www.cisa.gov/sites/default/files/publications/defending_against_software_supply_chain_attacks_508.pdf.
[CMMI] ISACA, “What is CMMI?”, https://cmmiinstitute.com/cmmi/intro, accessed 24 March 2024.
[CVE24] Information Technology Laboratory, National Institute of Standards and Technology, “CVE-2024-3094 detail”, in National Vulnerability Database, https://nvd.nist.gov/vuln/detail/CVE-2024-3094, accessed 4 August 2024.
[Castano06] Andres Castano, Alex Fukunaga, Jeffrey Biesiadecki, Lynn Neakrase, Patrick Whelley, Ronald Greeley, Mark Lemmon, Rebecca Castano, and Steve Chien, “Autonomous detection of dust devils and clouds on Mars”, Proceedings of the International Conference on Image Processing, October 2006.
[Control19] “Yokogawa announcement warns of counterfeit transmitters”, Control, 29 May 2019, https://www.controlglobal.com/measure/pressure/news/11301415/yokogawa-announcement-warns-of-counterfeit-transmitters.
[DFARS] “Defense Federal Acquisition Regulation Supplement”, General Services Administration, United States Government, January 2024, https://www.acquisition.gov/dfars, accessed 16 February 2024.
[Drucker93] Peter F. Drucker, Management: Tasks, Responsibilities, Practices, New York, NY: Harper Business, 1993.
[EPF] Eclipse Process Framework Project (archived), Eclipse Foundation, 2018?, https://projects.eclipse.org/projects/technology.epf, accessed 29 February 2024.
[FAR] “Federal Acquisition Regulation”, General Services Administration, United States Government, January 2024, https://www.acquisition.gov/browse/index/far, accessed 16 February 2024.
[Foust24] Jeff Foust, “Slow Burn: How Starliner’s crewed test flight went awry”, Space News, 4 September 2024, https://spacenews.com/slow-burn-how-starliners-crewed-test-flight-went-awry/, accessed 9 September 2024.
[Git] Git contributors, “Git documentation”, https://git-scm.com/doc, accessed 31 July 2024.
[Goodin24] Dan Goodin, “What we know about the xz Utils backdoor that almost infected the world”, Ars Technica, 31 March 2024, https://arstechnica.com/security/2024/04/what-we-know-about-the-xz-utils-backdoor-that-almost-infected-the-world/, accessed 4 August 2024.
[Heilmeier24] George H. Heilmeier, “The Heilmeier Catechism”, in DARPA, https://www.darpa.mil/work-with-us/heilmeier-catechism, accessed 13 July 2024.
[IBM23] Engineering Lifecycle Optimization—Method Composer, IBM, version 7.6.2, 2023, https://www.ibm.com/docs/en/engineering-lifecycle-management-suite/lifecycle-optimization-method-composer/7.6.2, accessed 29 February 2024.
[ISO26262] “Road vehicles — Functional safety”, Geneva, Switzerland: International Organization for Standardization, Standard ISO 26262:2018, 2018.
[LADEE13] “LADEE—Lunar atmosphere and dust environment explorer”, NASA Ames Research Center, Fact sheet FA-ARC-2013-01-29, 2013, https://smd-cms.nasa.gov/wp-content/uploads/2023/05/ladee-fact-sheet-20130129.pdf, accessed 16 September 2024.
[Leveson11] Nancy G. Leveson, Engineering a safer world: systems thinking applied to safety, Engineering Systems, Cambridge, Massachusetts: MIT Press, 2011.
[LoBosco08] David M. LoBosco, Glen E. Cameron, Richard A. Golding, and Theodore M. Wong, “The Pleiades fractionated space system architecture and the future of national security space”, AIAA Space 2008 Conference, September 2008, https://chrysaetos.org/papers/Pleiades%20fractionated%20space%20system.pdf.
[NASA16] “NASA Systems Engineering Handbook”, National Aeronautics and Space Administration (NASA), Report NASA SP-2016-6105 Rev2, 2016.
[NPR7120] “NASA Space Flight Program and Project Management Requirements”, National Aeronautics and Space Administration (NASA), NASA Procedural Requirement NPR 7120.5F, 2021.
[NPR7123] “NASA Systems Engineering Processes and Requirements”, National Aeronautics and Space Administration (NASA), NASA Procedural Requirement NPR 7123.1D, 2023.
[Navarro-Gonzalez10] Rafael Navarro-Gonzalez, Edgar Vargas, José de la Rosa, A. C. Raga, and Christopher P. McKay, “Reanalysis of the Viking results suggests perchlorate and organics at midlatitudes on Mars”, Journal of Geophysical Research, vol. 115, December 2010.
[Purdy24] Kevin Purdy, “Music industry’s 1990s hard drives, like all HDDs, are dying”, Ars Technica, 12 September 2024, https://arstechnica.com/gadgets/2024/09/music-industrys-1990s-hard-drives-like-all-hdds-are-dying/, accessed 13 September 2024.
[Spiral] Wikipedia contributors, “Spiral model”, in Wikipedia, the Free Encyclopedia, https://en.wikipedia.org/w/index.php?title=Spiral_model&oldid=1068244887, accessed 14 February 2024.
[Wertz11] Space Mission Engineering: The New SMAD, James R. Wertz, David F. Everett, and Jeffery J. Puschell, editors, Torrance, CA: Microcosm Press, 2011.
[Wilkes90] John Wilkes, “CSP project startup documents”, Concurrent Computing Department, Hewlett-Packard Laboratories, Report HPL-CSP-90-42, 11 October 1990, https://john.e-wilkes.com/papers/HPL-CSP-90-42.pdf.
[Zetter23] Kim Zetter, “The untold story of the boldest supply-chain hack ever”, Wired, 2 May 2023, https://www.wired.com/story/the-untold-story-of-solarwinds-the-boldest-supply-chain-hack-ever/.