Guidance on structuring software incident response with the IMAG model, defining roles, severity, response phases, communications, documentation, and post-incident learning to reduce recovery time.
Welcome back. This lesson explains how to structure incident response using the IMAG model (Incident Management at Google). The IMAG approach applies proven incident-command principles from emergency response to software incidents, helping teams act quickly and consistently through clear roles, a single source of truth, and repeatable processes.

When incidents start without structure, teams duplicate effort, documentation is missed, and leadership receives fragmented updates — all of which increase mean time to recovery (MTTR). A structured response assigns coordination, communications, and technical work to specific roles so engineers can focus on restoration.
Why structure matters
Prevents redundant work and conflicting actions.
Establishes a single source of truth for status and decisions.
Keeps stakeholders informed while technical responders focus on fixes.
Shortens time to recovery through coordinated mitigation and investigation.
Core IMAG principles
IMAG imports emergency incident-command best practices and tailors them for software operations. Its goals are clarity of roles, controlled span of control, a unified command, shared terminology, and modular activation of roles depending on incident scale.
Clear chain of command: every role has defined responsibilities.
Span of control: limit direct reports to roughly 3–7 people to avoid overload.
Unified command: one authoritative view for status and decisions.
Common terminology: consistent language reduces cross-team confusion.
Modular organization: activate only the roles you need for the incident.
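Span of control and modular activation can be checked mechanically. The sketch below is illustrative Python (the names are assumptions, not a real IMAG tool): it builds a small command tree, activates only the roles needed, and rejects more than seven direct reports per coordinator.

```python
from dataclasses import dataclass, field

MAX_SPAN_OF_CONTROL = 7  # upper bound of the ~3-7 direct-report guideline

@dataclass
class CommandNode:
    """One activated role in the incident command structure."""
    role: str
    reports: list["CommandNode"] = field(default_factory=list)

    def add_report(self, node: "CommandNode") -> None:
        # Enforce span of control: overloading one coordinator defeats
        # the purpose of the structure.
        if len(self.reports) >= MAX_SPAN_OF_CONTROL:
            raise ValueError(
                f"{self.role} already has {MAX_SPAN_OF_CONTROL} direct reports; "
                "activate an intermediate lead instead"
            )
        self.reports.append(node)

# Modular organization: activate only the roles this incident needs.
ic = CommandNode("Incident Commander")
ic.add_report(CommandNode("Communications Lead"))
ic.add_report(CommandNode("Operations Lead"))
```

For a small incident the tree stops there; a larger one would add leads under the Operations Lead rather than widening the IC's span.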
Common practices shared across incident frameworks
Most mature incident programs use the same building blocks: defined roles, severity classification, formal response phases, blameless postmortems, and regular practice via tabletop drills or game days.
| Resource | Purpose | Example |
| --- | --- | --- |
| Roles | Define responsibilities and single points of coordination | Incident Commander, Communications Lead, Operations Lead |
Key IMAG roles and responsibilities
Below are the core IMAG roles and their primary responsibilities. These roles separate coordination, communications, and technical operations so responders can work without interruption.
Incident Commander (IC): coordinates the response, sets priorities and update cadence, makes key decisions, and protects responders from interruptions. The IC focuses on coordination and decision-making rather than hands-on technical fixes.
Communications Lead (CL): manages stakeholder communications, prepares summaries for technical and non-technical audiences, and ensures updates are timely and clear.
Operations Lead (OL): coordinates technical troubleshooting, organizes responders and runbooks, and provides technical status updates to the IC.
Traits that make each role effective
IC: calm under pressure, decisive, and coordination-focused.
CL: strong communicator, translates technical detail for broader audiences, organized and reassuring.
OL: strong technical skills, pragmatic problem-solver, and able to coordinate across teams.
Match people to roles based on these traits so the response runs smoothly.
Detection, declaration, and initial triage
Incident response starts with detection: an alert, monitoring signal, or user report. The first triage question is whether the event requires a formal incident response. If not, handle it via normal support channels. If yes:
Assign an Incident Commander.
Declare the incident and set a severity.
Assemble the response team and open the single source-of-truth (timeline/comm doc).
Severity classification
Severity indicates customer impact and the required response intensity. Organizations label severities differently (SEV-1/2/3, S1/S2/S3, or custom schemes); use a clear table to align on expectations for escalation, staffing, and runbook activation.
| Severity | Impact | Typical Response |
| --- | --- | --- |
| SEV-1 (Critical) | Complete outage, major data loss, revenue-impacting failure | Full incident command active, 24/7 response, rapid escalations |
| SEV-2 (Major) | Significant degradation or partial outage affecting many users | Incident command activated, dedicated responders, frequent stakeholder updates |
| SEV-3 (Minor) | Limited impact; few users or low-risk functionality affected | Business-as-usual handling, scheduled follow-up if needed |
Severity describes impact (how bad the incident is). Priority (P0, P1, etc.) describes urgency (how quickly it should be addressed). Keep them separate — for example, a low-severity bug can become high-priority if it affects a key customer.
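One way to keep the two axes separate is to model them as distinct types, so severity can never be silently used where priority is meant. A minimal Python sketch (the labels and descriptions are assumptions):

```python
from enum import Enum

class Severity(Enum):
    """Impact: how bad the incident is."""
    SEV1 = "complete outage or major data loss"
    SEV2 = "significant degradation for many users"
    SEV3 = "minor issue with limited impact"

class Priority(Enum):
    """Urgency: how quickly it should be addressed."""
    P0 = "drop everything"
    P1 = "address within hours"
    P2 = "schedule into normal work"

# Independent axes: a low-severity bug that affects a key customer
# can still be assigned the highest priority.
key_customer_bug = (Severity.SEV3, Priority.P0)
```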
IMAG response phases
IMAG divides response into distinct, repeatable phases to keep work organized and traceable:
Detection & Declaration — Triage and decide whether to declare an incident and its severity.
Mitigation — Reduce user impact quickly (workarounds, rollbacks, traffic shifts).
Investigation — Collect logs, traces, and metrics to identify root cause.
Resolution — Implement and verify a permanent fix; confirm service health.
Post-Incident — Run a blameless postmortem, document findings, and update runbooks.
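The phase sequence above can be enforced with a tiny state machine so work stays organized and traceable. This is a hypothetical Python sketch, not an IMAG-provided tool:

```python
# The five phases in order, as listed above.
PHASES = ("detection", "mitigation", "investigation", "resolution", "post-incident")

class IncidentPhases:
    """Walks the phases strictly in order; no skipping or going back."""
    def __init__(self) -> None:
        self._index = 0

    @property
    def current(self) -> str:
        return PHASES[self._index]

    def advance(self) -> str:
        # Refuse to advance past the final phase.
        if self._index == len(PHASES) - 1:
            raise RuntimeError("already in the post-incident phase")
        self._index += 1
        return self.current
```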
Tools and channels commonly used
Chat platforms for real-time coordination: Slack, Microsoft Teams.
External status pages to inform customers: statuspage.io.
Docs & wikis for the incident timeline and decisions: Google Docs, Confluence.
Video/voice bridges for live coordination and recordings: Zoom.
Timeline/tracking tools to maintain a single source of truth.
External communication and transparency
Public status pages and timely customer updates help preserve trust during incidents. Coordinate internal and external messages via the Communications Lead so messages are consistent and appropriately technical for each audience.
Capture the incident in a single source of truth
Document decisions, timestamps, actions, and responsible parties in real time. This reduces duplicated effort, speeds handoffs, and simplifies the post-incident review.

Typical documentation flow:
Slack channels for coordination and quick context.
Google Docs (or similar) for the incident timeline and IC notes.
Status page entries for customer-facing updates.
Voice/video recordings for later review.
Example timeline — payment-processing outage (condensed)
14:15 — Multiple alerts trigger; potential incident identified and declared.
14:20 — Incident Commander assigned and severity set.
14:25 — Team assembles; IC sets update cadence; Communications Lead begins status updates; Operations Lead gathers technical responders.
14:45 — Mitigation in progress: rollback initiated, traffic routed to backups, partial service restored.
15:30 — Rollback completes, payments verified, full service restored.
16:00 — Incident closed, status page updated, stakeholders notified, and a postmortem scheduled.
This timeline demonstrates how defined roles and a single source of truth accelerate recovery and reduce confusion.

Wrap-up
A structured IMAG-based incident response reduces chaos, clarifies responsibilities, and shortens MTTR. Follow-up activities — blameless postmortems, updates to runbooks, and regular practice — are essential to improving resilience and preventing recurrence. For more on runbooks and tabletop exercises, see the related resources and vendor docs like Kubernetes documentation or statuspage examples.