The Evolution of Incident Management

Have you ever thought about the history of incident management?

If you’re an SRE, you might be so caught up in the day-to-day work of managing reliability and responding to incidents that you never take time to step back and reflect on the evolution of your role and your responsibilities. And that’s a shame because SREs didn’t invent incident management concepts and strategies on their own.

On the contrary, the way SREs think about incident response, structure incident management teams and rank the priority of incidents owes much to incident management strategies developed in the offline world decades ago. To fully understand what it means to be an SRE today, you have to appreciate this deep history of incident management and response.

So, let’s take a look at that history and examine where modern incident response concepts originated.

Historical Problems in Incident Management

Societies have always had incidents, of course. Fires, floods, infrastructure breakdowns and similar crises have been happening for millennia.

For most of history, however, humans lacked an efficient, purposeful way to manage these sorts of incidents. Response efforts were ad hoc, and their success owed more than a little to sheer luck.

Particular challenges included:

  • Lack of effective, consistent communication between stakeholders.
  • Varying organizational structures, which made it hard to identify leaders, coordinate response efforts and delegate tasks.
  • Inconsistent response strategies.
  • Differing approaches to assessing the priority of incidents.

Historically, organizations may have been able to handle incidents well enough when the incidents required a response from only one small group. But the more stakeholders were involved, the harder it was to respond quickly and effectively.

Putting out Fires: The Birth of the ICS

Things began to change for the better when stakeholders started thinking about better ways to put out fires—literally.

By the 1960s, fire chiefs in California realized that they were struggling to respond effectively to the wildfires that broke out every summer. Each year brought worse blazes than the last, with more land burned and more buildings lost. The Laguna fire of 1970 brought matters to a head and was the catalyst for a new approach to incident response for fire agencies.

After assessing what was going wrong, the fire chiefs determined that it wasn’t a lack of equipment or personnel. It was poor coordination among the various firefighting agencies that responded to blazes. Lacking a clear chain of command and a systematic approach to firefighting, the agencies struggled to deploy their resources rapidly and, thus, effectively.

To fix the problem, California fire chiefs developed what became known as the Incident Command System, or ICS. The ICS defined a hierarchy for incident response with an incident commander at the top. It also defined several categories of incident response processes, including operations, planning, logistics and finance. And it established a consistent set of terms that stakeholders could use to describe their actions during incident response, which made it easier to communicate clearly.

Although the ICS was initially conceived to fight fires, it became the de facto standard for incident response strategies of all types.

From ICS to NIMS

The history of incident response doesn’t end with the ICS. A new chapter began in the early 2000s when the US federal government developed an even more comprehensive approach to incident management called the National Incident Management System, or NIMS.

NIMS was born in the wake of the September 11, 2001, terrorist attacks, which underlined the importance of efficient communication not just between different agencies of the same type (like fire departments), but also between entirely separate organizations. To achieve this, NIMS expanded upon the principles of the ICS.

In addition to adopting most of the incident command principles and practices included in the ICS, NIMS included standards for coordinating the distribution of resources. It also embraced the concept of the emergency operations center, which is in some ways similar to a network operations center in the digital world.

In some respects, NIMS resembles a compliance framework (although to be clear, that’s not what it is). It includes fourteen management principles, similar to compliance controls, that organizations must implement to manage incidents using a NIMS approach.

Incident Management Today

Obviously, putting out forest fires and responding to terrorist attacks are pretty different from dealing with data center failures or a buggy application deployment. ICS and NIMS weren’t designed for site reliability engineering or IT teams specifically.

Still, the influence of ICS and NIMS on the way SREs think is clear enough. Terminology like “incident commander” comes from these frameworks. So do concepts like shared accountability for incident response processes and the importance of involving all stakeholders, not just technical ones, in incident response.

ICS and NIMS may not be acronyms familiar to most SREs. But they should be, because they are the historical sources of the incident management philosophies that form the foundation of SRE work, and they still offer valuable lessons for any SRE on the job today.
