Toil & Automation

Toil is operational work that keeps a service running but does not improve it. It is manual, repetitive, and automatable, and it scales linearly with service growth. SRE practice aims to keep toil below 50% of each engineer's time so that the remainder goes to engineering work that permanently reduces future toil.

What Toil Is

Google's SRE book defines toil as work that has these characteristics:

Manual:            A human must perform the task
Repetitive:        It happens again and again
Automatable:       A machine could do it
Tactical:          Interrupt-driven and reactive, not planned proactively
No enduring value: Doing it once does not prevent doing it again
Scales linearly:   More services or traffic means more toil

Examples of Toil

High-frequency toil:
  - Manually restarting services after crashes
  - Rotating TLS certificates by hand
  - Responding to disk-full alerts by deleting logs
  - Running database migrations manually
  - Provisioning accounts for new engineers
  - Manually scaling services during traffic spikes

Medium-frequency toil:
  - Quarterly security patching across all services
  - Updating dependencies in every repository
  - Manually generating compliance reports
  - Copying configuration between environments
  - Reviewing and approving access requests one by one

Low-frequency but painful toil:
  - Disaster recovery testing (done manually)
  - Large data migrations between systems
  - Manual capacity provisioning for launches

What Is Not Toil

Not all operational work is toil. Some work requires human judgment and cannot be automated.

Not toil:
  - Incident response requiring investigation and decision-making
  - Architecture reviews
  - Writing post-mortems and extracting lessons
  - Mentoring team members
  - Evaluating new tools and technologies
  - Planning capacity for a unique event

The distinction matters because the goal is not to eliminate all operational work. The goal is to eliminate the mindless, repetitive parts so engineers can focus on work that requires their expertise.

The 50% Rule

Google's SRE practice sets a target: no more than 50% of an SRE's time should be spent on toil. At least 50% should go to engineering work that reduces future toil or adds service features.

Week breakdown for a healthy SRE team:
  Monday:    On-call (toil): respond to alerts, handle requests
  Tuesday:   Engineering: build automation for most common alert
  Wednesday: Engineering: improve deployment pipeline
  Thursday:  On-call (toil): handle incidents, manual tasks
  Friday:    Engineering: write runbook automation

  Toil: ~40%  Engineering: ~60%

When toil exceeds 50%, it is a signal that the team is understaffed or under-automated. If toil reaches 80%, engineers burn out, quality drops, and the best people leave.

Toil at 30%: Healthy. Engineers have time for projects.
Toil at 50%: At the limit. Start prioritizing automation.
Toil at 70%: Unhealthy. Engineers are firefighting, not building.
Toil at 90%: Crisis. Retention risk. Quality declining.

Identifying Toil

Before you can reduce toil, you need to find it. There are several approaches.

Track Time

Ask your team to log what they spend time on for two weeks. Categorize each activity as toil or engineering work.

Example time log:
  Mon 09:00-10:00  Respond to disk-full alert (toil)
  Mon 10:00-11:30  Deploy hotfix to production (toil)
  Mon 11:30-12:00  Update dependency in 3 repos (toil)
  Mon 13:00-15:00  Build auto-scaling for staging (engineering)
  Mon 15:00-16:00  Review pull requests (engineering)
  Mon 16:00-17:00  Provision new engineer accounts (toil)

  Toil: 4 hours  Engineering: 3 hours  Toil ratio: 57%
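
The totals for a log like this can be computed rather than added up by hand. The following is a minimal sketch, assuming each entry is recorded as a (start, end, description, category) tuple; that layout and the function names are illustrative, not part of any particular time-tracking tool.

```python
from datetime import datetime

# Entries mirror the example log above. The tuple layout is an assumption
# made for illustration.
TIME_LOG = [
    ("09:00", "10:00", "Respond to disk-full alert", "toil"),
    ("10:00", "11:30", "Deploy hotfix to production", "toil"),
    ("11:30", "12:00", "Update dependency in 3 repos", "toil"),
    ("13:00", "15:00", "Build auto-scaling for staging", "engineering"),
    ("15:00", "16:00", "Review pull requests", "engineering"),
    ("16:00", "17:00", "Provision new engineer accounts", "toil"),
]


def hours_between(start, end):
    """Return the duration in hours between two same-day HH:MM times."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 3600


def summarize(log):
    """Sum logged hours per category, e.g. {'toil': ..., 'engineering': ...}."""
    totals = {}
    for start, end, _description, category in log:
        totals[category] = totals.get(category, 0.0) + hours_between(start, end)
    return totals


def toil_ratio(log):
    """Toil hours divided by all tracked hours (0.0 for an empty log)."""
    totals = summarize(log)
    tracked = sum(totals.values())
    return totals.get("toil", 0.0) / tracked if tracked else 0.0


if __name__ == "__main__":
    totals = summarize(TIME_LOG)
    print(f"Toil: {totals['toil']:.1f} h  Engineering: {totals['engineering']:.1f} h")
    print(f"Toil ratio: {toil_ratio(TIME_LOG):.0%}")
```

The ratio is taken over tracked hours only, so untracked time such as lunch does not dilute it.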

Analyze On-Call

On-call data is the richest source of toil information. Look at your alert history and incident tickets.

On-call analysis for last month:
  Alert: "Disk usage > 90%"        — Fired 47 times — Toil
  Alert: "Certificate expiring"     — Fired 12 times — Toil
  Alert: "Service restart needed"   — Fired 31 times — Toil
  Alert: "Unusual traffic pattern"  — Fired 8 times  — Not toil (requires investigation)
  Alert: "Database connection pool" — Fired 22 times — Toil

The alerts that fire most frequently and have a known, mechanical resolution are your highest-impact automation targets.
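
One way to turn that rule into a ranked list is sketched below. It assumes each alert type has been tagged by hand as mechanical or not; the `AlertStat` type and function names are hypothetical, and the counts are the ones from the analysis above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertStat:
    name: str
    fired: int        # how many times the alert fired in the period
    mechanical: bool  # True when the fix is a known, mechanical step


ALERTS = [
    AlertStat("Disk usage > 90%", 47, True),
    AlertStat("Certificate expiring", 12, True),
    AlertStat("Service restart needed", 31, True),
    AlertStat("Unusual traffic pattern", 8, False),  # needs investigation
    AlertStat("Database connection pool", 22, True),
]


def automation_targets(alerts):
    """Mechanical alerts only, most frequently fired first."""
    candidates = [a for a in alerts if a.mechanical]
    return sorted(candidates, key=lambda a: a.fired, reverse=True)


def toil_share(alerts):
    """Fraction of all alert firings that had a mechanical resolution."""
    total = sum(a.fired for a in alerts)
    toil = sum(a.fired for a in alerts if a.mechanical)
    return toil / total if total else 0.0


if __name__ == "__main__":
    for alert in automation_targets(ALERTS):
        print(f"{alert.fired:>3}  {alert.name}")
    print(f"Share of firings that were toil: {toil_share(ALERTS):.0%}")
```

Ranking by firing count alone ignores how long each fix takes, so treat the output as a shortlist rather than a final priority order.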

Categorize Support Requests

If your team handles requests from other teams, categorize them.

Support request analysis:
  "Please provision a new database"    — 15/month — Automate
  "Please rotate this API key"         — 8/month  — Automate
  "Please increase the pod limit"      — 12/month — Automate
  "Help me debug this timeout issue"   — 5/month  — Cannot automate
  "Please update DNS for new domain"   — 6/month  — Automate

Automate the Highest-Frequency Toil First

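Prioritize automation by total time cost: frequency multiplied by time per occurrence. A less frequent but slow task can outrank a frequent but quick one, as database provisioning does against service restarts below.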

Toil item                  Frequency   Time each   Total/month
Disk cleanup               47/month    15 min      11.75 hours
Service restarts           31/month    10 min      5.17 hours
Database provisioning      15/month    45 min      11.25 hours
Certificate rotation       12/month    30 min      6.00 hours
Pod limit increases        12/month    20 min      4.00 hours
API key rotation           8/month     20 min      2.67 hours

Priority order: Disk cleanup, Database provisioning, Certificate rotation, Service restarts
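
The ranking can be reproduced with a few lines of code. This is a minimal sketch using the figures from the table above; the tuple layout and function names are assumptions made for illustration.

```python
# (toil item, occurrences per month, minutes per occurrence), from the table above.
TOIL_ITEMS = [
    ("Disk cleanup", 47, 15),
    ("Service restarts", 31, 10),
    ("Database provisioning", 15, 45),
    ("Certificate rotation", 12, 30),
    ("Pod limit increases", 12, 20),
    ("API key rotation", 8, 20),
]


def monthly_hours(frequency, minutes_each):
    """Total hours per month spent on one toil item."""
    return frequency * minutes_each / 60


def prioritize(items):
    """Return (name, hours_per_month) pairs, most expensive first."""
    costed = [(name, monthly_hours(freq, mins)) for name, freq, mins in items]
    return sorted(costed, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for name, hours in prioritize(TOIL_ITEMS):
        print(f"{name:<24}{hours:6.2f} h/month")
```

Sorting by total hours rather than raw frequency is what puts database provisioning ahead of service restarts.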

Automation Examples

Certificate Rotation

# Before: manual cert rotation every 90 days per service
# After: cert-manager auto-rotates certificates
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: payment-service-tls
spec:
  secretName: payment-service-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - payments.example.com
  renewBefore: 720h  # Renew 30 days before expiry

Disk Cleanup

#!/bin/bash
# Before: engineer responds to alert, SSHs in, cleans up
# After: automated cleanup runs on threshold
# Cron job that runs hourly. Requires GNU df and find.
set -euo pipefail

THRESHOLD=85
USAGE=$(df --output=pcent /var/log | tail -1 | tr -d '% ')

if [ "$USAGE" -gt "$THRESHOLD" ]; then
  # Delete logs older than 7 days, including ones compressed by earlier runs
  find /var/log/app -type f \( -name "*.log" -o -name "*.log.gz" \) -mtime +7 -delete

  # Compress the remaining uncompressed logs older than 1 day
  find /var/log/app -type f -name "*.log" -mtime +1 -exec gzip {} +

  echo "Disk cleanup completed. Usage was ${USAGE}%"
fi

Self-Service Database Provisioning

# Before: file a ticket, wait 2 days for DBA to provision
# After: developer runs a command

# platform create-database --name orders-db --size medium --team orders
# Behind the scenes, this Terraform module runs:
resource "aws_db_instance" "database" {
  identifier     = var.database_name
  engine         = "postgres"
  engine_version = "15.4"
  instance_class = var.size_to_instance[var.size]
  
  allocated_storage     = var.size_to_storage[var.size]
  max_allocated_storage = var.size_to_storage[var.size] * 2
  
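  # Master credentials. var.master_username is an illustrative variable;
  # the password is generated and stored in AWS Secrets Manager.
  username                    = var.master_username
  manage_master_user_password = true

  backup_retention_period = 7
  multi_az                = var.environment == "production"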
  
  tags = {
    Team        = var.team
    Environment = var.environment
    ManagedBy   = "platform-self-service"
  }
}

The Toil Budget

Just as error budgets manage reliability risk, a toil budget manages operational overhead.

Toil budget per engineer: 50% of time (20 hours/week)

Current toil: 15 hours/week (37.5%)
  → Healthy. Engineers have time for projects.

After onboarding 5 new services without automation:
Current toil: 22 hours/week (55%)
  → Over budget. Pause new service onboarding.
  → Invest in automation until toil drops below 50%.

The toil budget forces conversations about sustainability. When a product team wants to launch a new service, the SRE team can say: "Our toil budget is full. We need to automate X before we can take on more services." This is data, not opinion.
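
A budget check like this is simple enough to script and run against weekly toil numbers. The sketch below assumes a 40-hour week and a 50% budget, matching the figures above; `BudgetStatus` and `check_budget` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetStatus:
    ratio: float             # toil hours as a fraction of the work week
    over_budget: bool        # True when toil exceeds the budget
    hours_to_recover: float  # weekly toil hours to remove to get back to budget


def check_budget(toil_hours, work_week_hours=40.0, budget_fraction=0.50):
    """Compare weekly toil hours per engineer against the toil budget."""
    budget_hours = work_week_hours * budget_fraction
    return BudgetStatus(
        ratio=toil_hours / work_week_hours,
        over_budget=toil_hours > budget_hours,
        hours_to_recover=max(0.0, toil_hours - budget_hours),
    )


if __name__ == "__main__":
    for hours in (15, 22):
        status = check_budget(hours)
        verdict = "over budget" if status.over_budget else "within budget"
        print(f"{hours} h/week ({status.ratio:.1%}): {verdict}, "
              f"{status.hours_to_recover:.1f} h to recover")
```

Reporting the hours to recover alongside the ratio gives the team a concrete automation target to bring to the conversation.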

Real-World Example

An infrastructure team at a mid-size company supported 40 microservices. Their on-call rotation was brutal: the on-call engineer averaged 15 alerts per shift, most of which had known resolutions documented in a wiki. Engineers dreaded on-call weeks. Two senior engineers had quit, citing burnout.

The team conducted a toil audit:

  - Logged all on-call activities for one month
  - Found that 70% of alerts had a mechanical resolution (restart service, clear disk, refresh cache)
  - Calculated total toil at 68% of team capacity

They prioritized automation:

  - Built auto-remediation for the top 5 alert types (disk cleanup, service restart, cache refresh, connection pool reset, certificate rotation)
  - Moved database provisioning to self-service Terraform modules
  - Automated dependency updates with Renovate Bot

Three months later, on-call alerts dropped from 15 per shift to 4. Toil ratio dropped from 68% to 35%. The team had time for engineering projects again. The two open positions filled quickly because on-call was no longer a horror story.

Common Pitfalls

  - Accepting toil as normal: "That's just how operations works" is a signal that the team has given up on improving. Toil is a problem to solve, not a fact of life.
  - Automating everything at once: Trying to automate all toil in one sprint leads to half-finished automation that is worse than the manual process. Prioritize by impact.
  - Automation that nobody maintains: A script written by someone who has left the company and that nobody understands is a liability, not an asset. Automation needs tests, documentation, and ownership.
  - Not tracking toil: If you do not measure toil, you cannot manage it. Track toil hours weekly and review the trend monthly.
  - Automating a broken process: If your deployment process is broken, automating it produces broken deployments faster. Fix the process first, then automate.
  - Ignoring the human cost: Toil is not just an efficiency problem. It causes burnout, attrition, and declining morale, so treat it as a people problem as well as a technical one.

Key Takeaways

  - Toil is manual, repetitive operational work that scales linearly and can be automated
  - The SRE target is keeping toil below 50% of an engineer's time; the rest goes to engineering work
  - Identify toil by tracking time, analyzing on-call data, and categorizing support requests
  - Automate the toil with the highest total time cost (frequency times time per occurrence) first
  - A toil budget forces conversations about sustainability and prevents toil from silently growing
  - Toil is a people problem as much as a technical one: unchecked toil causes burnout and attrition