Guidance on structuring software incident response with the IMAG model, defining roles, severity, response phases, communications, documentation, and post-incident learning to reduce recovery time.
Welcome back. This lesson explains how to structure incident response using the IMAG model (Incident Management at Google). The IMAG approach applies proven incident-command principles from emergency response to software incidents, helping teams act quickly and consistently through clear roles, a single source of truth, and repeatable processes.

When incidents start without structure, teams duplicate effort, documentation is missed, and leadership receives fragmented updates — all of which increase mean time to recovery (MTTR). A structured response assigns coordination, communications, and technical work to specific roles so engineers can focus on restoration.
Why structure matters
Prevents redundant work and conflicting actions.
Establishes a single source of truth for status and decisions.
Keeps stakeholders informed while technical responders focus on fixes.
Shortens time to recovery through coordinated mitigation and investigation.
Core IMAG principles
IMAG imports emergency incident-command best practices and tailors them for software operations. Its goals are clarity of roles, controlled span of control, a unified command, shared terminology, and modular activation of roles depending on incident scale.
Clear chain of command: every role has defined responsibilities.
Span of control: limit direct reports to roughly 3–7 people to avoid overload.
Unified command: one authoritative view for status and decisions.
Common terminology: consistent language reduces cross-team confusion.
Modular organization: activate only the roles you need for the incident.
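Span of control and modular activation can be checked mechanically. The sketch below is illustrative Python (the names are assumptions, not a real IMAG tool): it builds a small command tree, activates only the roles needed, and rejects more than seven direct reports per coordinator.

```python
from dataclasses import dataclass, field

MAX_SPAN_OF_CONTROL = 7  # upper bound of the ~3-7 direct-report guideline

@dataclass
class CommandNode:
    """One activated role in the incident command structure."""
    role: str
    reports: list["CommandNode"] = field(default_factory=list)

    def add_report(self, node: "CommandNode") -> None:
        # Enforce span of control: overloading one coordinator defeats
        # the purpose of the structure.
        if len(self.reports) >= MAX_SPAN_OF_CONTROL:
            raise ValueError(
                f"{self.role} already has {MAX_SPAN_OF_CONTROL} direct reports; "
                "activate an intermediate lead instead"
            )
        self.reports.append(node)

# Modular organization: activate only the roles this incident needs.
ic = CommandNode("Incident Commander")
ic.add_report(CommandNode("Communications Lead"))
ic.add_report(CommandNode("Operations Lead"))
```

For a small incident the tree stops there; a larger one would add leads under the Operations Lead rather than widening the IC's span.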
Common practices shared across incident frameworks
Most mature incident programs use the same building blocks: defined roles, severity classification, formal response phases, blameless postmortems, and regular practice via tabletop drills or game days.
| Resource | Purpose | Example |
| --- | --- | --- |
| Roles | Define responsibilities and single points of coordination | Incident Commander, Communications Lead, Operations Lead |
Key IMAG roles and responsibilities
Below are the core IMAG roles and their primary responsibilities. These roles separate coordination, communications, and technical operations so responders can work without interruption.
Incident Commander (IC): coordinates the response, sets priorities and update cadence, makes key decisions, and protects responders from interruptions. The IC focuses on coordination and decision-making rather than hands-on technical fixes.
Communications Lead (CL): manages stakeholder communications, prepares summaries for technical and non-technical audiences, and ensures updates are timely and clear.
Operations Lead (OL): coordinates technical troubleshooting, organizes responders and runbooks, and provides technical status updates to the IC.
Traits that make each role effective
IC: calm under pressure, decisive, and coordination-focused.
CL: strong communicator, translates technical detail for broader audiences, organized and reassuring.
OL: strong technical skills, pragmatic problem-solver, and able to coordinate across teams.
Match people to roles based on these traits so the response runs smoothly.
Detection, declaration, and initial triage
Incident response starts with detection: an alert, monitoring signal, or user report. The first triage question is whether the event requires a formal incident response. If not, handle it via normal support channels. If yes:
Assign an Incident Commander.
Declare the incident and set a severity.
Assemble the response team and open the single source-of-truth (timeline/comm doc).
Severity classification
Severity indicates customer impact and the required response intensity. Organizations label severities differently (SEV-1/2/3, S1/S2/S3, or custom schemes); use a clear table to align on expectations for escalation, staffing, and runbook activation.
| Severity | Impact | Typical Response |
| --- | --- | --- |
| SEV-1 (Critical) | Complete outage, major data loss, revenue-impacting failure | Full incident command active, 24/7 response, rapid escalations |
| SEV-2 (Major) | Significant degradation or partial outage affecting many users | Incident command activated, dedicated responders, frequent stakeholder updates |
| SEV-3 (Minor) | Limited impact; few users or low-risk functionality affected | Business-as-usual handling, scheduled follow-up if needed |
Severity describes impact (how bad the incident is). Priority (P0, P1, etc.) describes urgency (how quickly it should be addressed). Keep them separate — for example, a low-severity bug can become high-priority if it affects a key customer.
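One way to keep the two axes separate is to model them as distinct types, so severity can never be silently used where priority is meant. A minimal Python sketch (the labels and descriptions are assumptions):

```python
from enum import Enum

class Severity(Enum):
    """Impact: how bad the incident is."""
    SEV1 = "complete outage or major data loss"
    SEV2 = "significant degradation for many users"
    SEV3 = "minor issue with limited impact"

class Priority(Enum):
    """Urgency: how quickly it should be addressed."""
    P0 = "drop everything"
    P1 = "address within hours"
    P2 = "schedule into normal work"

# Independent axes: a low-severity bug that affects a key customer
# can still be assigned the highest priority.
key_customer_bug = (Severity.SEV3, Priority.P0)
```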
IMAG response phases
IMAG divides response into distinct, repeatable phases to keep work organized and traceable:
Detection & Declaration — Triage and decide whether to declare an incident and its severity.
Mitigation — Reduce user impact quickly (workarounds, rollbacks, traffic shifts).
Investigation — Collect logs, traces, and metrics to identify root cause.
Resolution — Implement and verify a permanent fix; confirm service health.
Post-Incident — Run a blameless postmortem, document findings, and update runbooks.
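The phase sequence above can be enforced with a tiny state machine so work stays organized and traceable. This is a hypothetical Python sketch, not an IMAG-provided tool:

```python
# The five phases in order, as listed above.
PHASES = ("detection", "mitigation", "investigation", "resolution", "post-incident")

class IncidentPhases:
    """Walks the phases strictly in order; no skipping or going back."""
    def __init__(self) -> None:
        self._index = 0

    @property
    def current(self) -> str:
        return PHASES[self._index]

    def advance(self) -> str:
        # Refuse to advance past the final phase.
        if self._index == len(PHASES) - 1:
            raise RuntimeError("already in the post-incident phase")
        self._index += 1
        return self.current
```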
Tools and channels commonly used
Chat platforms for real-time coordination: Slack, Microsoft Teams.
External status pages to inform customers: statuspage.io.
Docs & wikis for the incident timeline and decisions: Google Docs, Confluence.
Video/voice bridges for live coordination and recordings: Zoom.
Timeline/tracking tools to maintain a single source of truth.
External communication and transparency
Public status pages and timely customer updates help preserve trust during incidents. Coordinate internal and external messages via the Communications Lead so messages are consistent and appropriately technical for each audience.
Capture the incident in a single source of truth
Document decisions, timestamps, actions, and responsible parties in real time. This reduces duplicated effort, speeds handoffs, and simplifies the post-incident review.

Typical documentation flow:
Slack channels for coordination and quick context.
Google Docs (or similar) for the incident timeline and IC notes.
Status page entries for customer-facing updates.
Voice/video recordings for later review.
Example timeline — payment-processing outage (condensed)
14:15 — Multiple alerts trigger; potential incident identified and declared.
14:20 — Incident Commander assigned and severity set.
14:25 — Team assembles; IC sets update cadence; Communications Lead begins status updates; Operations Lead gathers technical responders.
14:45 — Mitigation in progress: rollback initiated, traffic routed to backups, partial service restored.
15:30 — Rollback completes, payments verified, full service restored.
16:00 — Incident closed, status page updated, stakeholders notified, and a postmortem scheduled.
This timeline demonstrates how defined roles and a single source of truth accelerate recovery and reduce confusion.

Wrap-up
A structured IMAG-based incident response reduces chaos, clarifies responsibilities, and shortens MTTR. Follow-up activities — blameless postmortems, updates to runbooks, and regular practice — are essential to improving resilience and preventing recurrence. For more on runbooks and tabletop exercises, see the related resources and vendor docs like Kubernetes documentation or statuspage examples.