Managing Operational Toil

In this lesson we cover operational toil: the manual, repetitive work that keeps systems running but creates no lasting value. Every Site Reliability Engineer (SRE) should be able to identify, measure, and remove toil so teams can focus on engineering improvements rather than constant firefighting. Toil typically appears as frequent restarts, repeated manual fixes, or routine steps that require human attention. Left unchecked, toil grows with your service and becomes a blocker to reliability, team velocity, and long-term sustainability.

A presentation slide titled "Toil in SRE" that defines toil as repetitive manual work that grows with scale and adds no lasting value. Below the text is an illustration of two people arranging cards on a Kanban-style board with gears above it.

What is toil?

In SRE, toil is manual, repetitive operational work that provides no enduring value and tends to increase linearly as a system scales.
Toil consumes engineer time and energy without improving the system. Without deliberate automation or design changes, toil grows with the service.

Common signs that work is toil

Look for these characteristics to determine whether a task is toil:

Requires direct human involvement to execute.
Repeats in the same form and could be automated.
Produces no cumulative value: doing it again tomorrow yields the same result.
Is triggered repeatedly by the same conditions or occurs on a schedule.
Is tactical: addresses symptoms rather than root causes.
Grows in proportion to the service.

If several of these apply, prioritize addressing the task as toil.
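The checklist above can be sketched as a small scoring helper. This is only an illustration: the field names, the `ToilChecklist` class, and the threshold of three matches standing in for "several" are assumptions, not part of any standard SRE tooling.

```python
from dataclasses import dataclass, fields


@dataclass
class ToilChecklist:
    """One flag per characteristic from the checklist (names are hypothetical)."""
    manual: bool               # requires direct human involvement
    repetitive: bool           # repeats and could be automated
    no_lasting_value: bool     # doing it again tomorrow yields the same result
    recurring_trigger: bool    # same conditions or a schedule trigger it
    tactical: bool             # treats symptoms, not root causes
    scales_with_service: bool  # workload grows with the service


def toil_score(task: ToilChecklist) -> int:
    """Count how many toil characteristics apply to the task."""
    return sum(bool(getattr(task, f.name)) for f in fields(task))


def is_probably_toil(task: ToilChecklist, threshold: int = 3) -> bool:
    """Assumed rule: read 'several' as at least `threshold` matches."""
    return toil_score(task) >= threshold


# Example: a service that someone restarts by hand every time it hangs.
restart = ToilChecklist(True, True, True, True, True, False)
print(toil_score(restart), is_probably_toil(restart))  # 5 True
```

The threshold is a team decision; the point is to make the classification explicit and repeatable rather than a matter of opinion.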

Concrete examples

Manual deployment processes with many manual steps or frequent interventions.
Repetitive alert responses where incidents require the same manual actions.
Routine configuration changes performed by hand.
Regular data cleanup tasks conducted manually.
User access management without self-service tooling.
Certificate renewals that are not automated.

Ask: Could this be a script, scheduled job, or CI/CD action? Often the answer is yes.
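As one illustration of turning a manual check into a script, the sketch below flags certificates that are close to expiry so that a scheduled job could renew them. The host names, the 30-day window, and the `certs_due_for_renewal` helper are hypothetical; in practice the expiry dates would come from your own certificate inventory.

```python
from datetime import datetime, timedelta, timezone


def certs_due_for_renewal(expiry_by_host, now, window_days=30):
    """Return hosts whose certificate expires within `window_days` of `now`.

    Already-expired certificates are included. Results are ordered by
    expiry date, soonest first.
    """
    cutoff = now + timedelta(days=window_days)
    due = [(expires, host) for host, expires in expiry_by_host.items()
           if expires <= cutoff]
    return [host for _, host in sorted(due)]


# Hypothetical inventory with fixed dates so the result is reproducible.
now = datetime(2024, 1, 1, tzinfo=timezone.utc)
inventory = {
    "api.example.test": datetime(2024, 1, 10, tzinfo=timezone.utc),
    "www.example.test": datetime(2024, 6, 1, tzinfo=timezone.utc),
    "old.example.test": datetime(2023, 12, 25, tzinfo=timezone.utc),
}
print(certs_due_for_renewal(inventory, now))
# ['old.example.test', 'api.example.test']
```

Run on a schedule, a check like this replaces a recurring calendar reminder and a manual lookup; the renewal step itself would be a separate automation.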

A presentation slide titled "Toil in SRE." It lists examples of operational toil such as manual deployment processes, repetitive alert responses, routine configuration changes, regular data cleanup tasks, user access management, and certificate renewals.

Impact of toil

Toil affects engineering teams, business outcomes, and long-term competitiveness.

Engineering impacts

Burnout: Repetitive, unrewarding work leads to fatigue and lower morale.
Opportunity cost: Time spent on toil is time not spent building improvements.
Technical debt: Short-term manual fixes accumulate as long-term cruft.
Skills atrophy: Teams focused on firefighting lose practice in development and design.
Career stagnation: Engineers trapped in operational routines miss growth opportunities.

Toil steals the capacity to build better, more reliable systems.

A presentation slide titled "The Impact of Toil" showing five engineering impacts—Burnout, Opportunity Cost, Technical Debt Accumulation, Skills Atrophy, and Career Stagnation—each in a rounded box with a colorful icon and brief explanatory text.

Business impacts

Slower time to market: Manual processes create bottlenecks.
Higher operational costs: More headcount required as systems grow.
Reduced reliability: Human steps are error-prone.
Scaling limitations: Manual operations do not scale effectively.
Competitive disadvantage: Teams burdened by toil innovate more slowly.

A presentation slide titled "The Impact of Toil" showing five cards: Slower Time-to-Market, Higher Operational Costs, Reduced Reliability, Scaling Limitations, and Competitive Disadvantage. Each card has an icon and a short note explaining how manual processes create bottlenecks, raise costs, and limit growth.

Measuring and identifying toil

Use a mix of quantitative and qualitative signals to locate and size toil. Combining both approaches helps prioritize which processes to eliminate or automate first.

Measurement type	Examples	Purpose
Quantitative	Time tracking; toil ratio; number of operational tickets; automation gap analysis; on-call burden	Estimate scale and cost of toil
Qualitative	Toil surveys; job satisfaction tracking; toil amnesty; value stream mapping; shadow programs	Surface hidden, contextual, and low-frequency toil

Quantitative details:

Time tracking: Log hours by category (deployments, incident response, manual maintenance).
Toil ratio: Percentage of time spent on purely operational tasks vs. engineering.
Toil tickets: Count tickets classified as pure operational work.
Automation gap analysis: Document manual steps in workflows.
On-call burden: Measure manual alert response hours.
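A toil ratio can be derived directly from time-tracking entries. The sketch below assumes each entry is a `(category, hours)` pair and that the three categories named above count as toil; both the data shape and the category set are assumptions your team would adjust.

```python
from collections import defaultdict

# Assumed classification: which categories count as toil is a team decision.
TOIL_CATEGORIES = {"deployments", "incident response", "manual maintenance"}


def hours_by_category(entries):
    """Sum logged hours per category from (category, hours) pairs."""
    totals = defaultdict(float)
    for category, hours in entries:
        totals[category] += hours
    return dict(totals)


def toil_ratio(entries, toil_categories=TOIL_CATEGORIES):
    """Fraction of logged time spent in toil categories (0.0 if nothing logged)."""
    totals = hours_by_category(entries)
    total = sum(totals.values())
    if total == 0:
        return 0.0
    toil = sum(h for category, h in totals.items() if category in toil_categories)
    return toil / total


# Hypothetical week for one engineer.
week = [
    ("deployments", 6.0),
    ("incident response", 5.0),
    ("manual maintenance", 4.0),
    ("feature work", 20.0),
    ("design review", 5.0),
]
print(f"{toil_ratio(week):.1%}")  # 37.5%
```

Tracking this number per team over time shows whether reduction efforts are working.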

Qualitative practices:

Toil surveys: Ask engineers for their primary pain points.
Job satisfaction tracking: Correlate morale with toil metrics.
Toil amnesty: Provide a safe way to report embarrassing or overlooked toil.
Value stream mapping: Visualize handoffs and manual steps.
Shadow programs: Observe and document undocumented operational work.

Combine these signals to prioritize reduction efforts.

A presentation slide titled "Measuring and Identifying Toil" that contrasts two approaches. The left box lists Quantitative Measurement items (time tracking, toil ratio, toil tickets, automation gap analysis, on-call burden) and the right box lists Qualitative Assessment items (toil surveys, job satisfaction tracking, toil amnesty, value stream mapping, shadow program).

Hierarchy of approaches to reduce toil

Prioritize changes from most to least effective:

Priority	Strategy	Example
1	Eliminate	Replace a flaky service rather than endlessly restarting it
2	Automate	Automatic certificate renewal or CI-driven deployments
3	Simplify	Consolidate dashboards and reduce steps
4	Delegate	Provide self-service portals or move work to the appropriate team
5	Batch	Reduce frequency by batching tasks (weekly vs daily)

Prefer elimination or automation wherever possible. Treat delegation and batching as last resorts for cases where neither is immediately feasible.

Calculating the true cost of toil

To get budget and team buy-in, quantify both direct and indirect costs.

Direct costs:

Labor hours: engineer time × hourly rate.
Incident costs: downtime and remediation resulting from manual errors.

A presentation slide titled "Calculating the True Cost of Toil" showing a "Direct Costs" table. It lists Cost Factors and Descriptions with entries for "Labor Hours — Engineer time × hourly cost" and "Incident Costs — Downtime due to manual errors."

Indirect costs:

Opportunity cost: engineering improvements deferred because of toil.
Attrition cost: turnover, recruitment, and lost tribal knowledge.
Velocity impact: slower feature delivery and reduced competitiveness.

Example math:

Team size: 8 engineers
Toil per engineer: 15 hours/week
Hourly rate: $75/hour

Annual cost = 15 hours/week × 8 engineers × $75/hour × 52 weeks = $468,000 per year.
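The same labor-cost arithmetic can be wrapped in a small function so the inputs are easy to vary. The function name and parameters are illustrative, and it covers only the direct labor-hours component, not incident costs or the indirect costs listed above.

```python
def annual_toil_cost(engineers, toil_hours_per_week, hourly_rate, weeks_per_year=52):
    """Direct labor cost of toil per year: hours x people x rate x weeks."""
    return engineers * toil_hours_per_week * hourly_rate * weeks_per_year


# Inputs from the example above.
cost = annual_toil_cost(engineers=8, toil_hours_per_week=15, hourly_rate=75)
print(f"${cost:,} per year")  # $468,000 per year
```

Re-running the function with a lower `toil_hours_per_week` gives a quick estimate of what a given reduction effort would save.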

A presentation slide titled "Calculating the True Cost of Toil" showing a centered grey box with input values. It calculates annual toil cost as 15 hours × 8 engineers × $75/hr × 52 weeks = $468,000.

Culture and process for sustainable reduction

Making toil reduction enduring requires process, incentives, and psychological safety:

Value engineering over heroics: Reward automation, refactoring, and systems thinking rather than heroic firefighting.
Dedicated time budget: Allocate explicit time (e.g., an “engineering improvement” sprint) for removing toil.
Psychological safety: Encourage raising and solving toil without blame.
Knowledge sharing: Make runbooks and automation common knowledge, not tribal information.
Continuous improvement: Treat toil reduction as an ongoing investment in reliability and velocity.

Do not treat toil as a rite of passage or a badge of honor. Normalizing manual firefighting hides systemic problems and increases long-term cost and risk.

Final thoughts

Toil is a symptom to be fixed, not a source of pride. Use measurement to prioritize elimination and automation, and build cultural practices that make toil reduction sustainable. This frees engineering capacity for durable improvements and better reliability. That concludes the lesson on managing complexity, risk, and toil. Next up is incident management: change introduces instability, and effective incident practices determine how well a team recovers and learns.
