Toil & Automation
Toil is the operational work that keeps a service running but does not improve it. It is manual, repetitive, and automatable, and it scales linearly with service growth. The SRE discipline aims to keep toil below 50% of an engineer's time, leaving the rest for engineering work that permanently reduces future toil.
What Toil Is
Google's SRE book describes toil as work that tends to have these characteristics; the more of them a task matches, the more likely it is toil:
Manual: A human must perform the task
Repetitive: It happens again and again
Automatable: A machine could do it
Tactical: Interrupt-driven and reactive; triggered by an event, not planned proactively
No lasting value: Doing it once does not prevent doing it again
Scales linearly: More services or traffic means more toil
Examples of Toil
High-frequency toil:
- Manually restarting services after crashes
- Rotating TLS (SSL) certificates by hand
- Responding to disk-full alerts by deleting logs
- Running database migrations manually
- Provisioning accounts for new engineers
- Manually scaling services during traffic spikes
Medium-frequency toil:
- Quarterly security patching across all services
- Updating dependencies in every repository
- Manually generating compliance reports
- Copying configuration between environments
- Reviewing and approving access requests one by one
Low-frequency but painful toil:
- Disaster recovery testing (done manually)
- Large data migrations between systems
- Manual capacity provisioning for launches
What Is Not Toil
Not all operational work is toil. Some work requires human judgment and cannot be automated.
Not toil:
- Incident response requiring investigation and decision-making
- Architecture reviews
- Writing post-mortems and extracting lessons
- Mentoring team members
- Evaluating new tools and technologies
- Planning capacity for a unique event
The distinction matters because the goal is not to eliminate all operational work. The goal is to eliminate the mindless, repetitive parts so engineers can focus on work that requires their expertise.
The 50% Rule
Google's SRE practice sets a target: no more than 50% of an SRE's time should be spent on toil. The remaining time, at least 50%, goes to engineering work that reduces future toil.
Week breakdown for a healthy SRE team:
Monday: On-call (toil): respond to alerts, handle requests
Tuesday: Engineering: build automation for most common alert
Wednesday: Engineering: improve deployment pipeline
Thursday: On-call (toil): handle incidents, manual tasks
Friday: Engineering: write runbook automation
Toil: ~40% Engineering: ~60%
When toil exceeds 50%, it is a signal that the team is understaffed or under-automated. If toil reaches 80%, engineers burn out, quality drops, and the best people leave.
Toil at 30%: Healthy. Engineers have time for projects.
Toil at 50%: At the limit. Start prioritizing automation.
Toil at 70%: Unhealthy. Engineers are firefighting, not building.
Toil at 90%: Crisis. Retention risk. Quality declining.
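The levels above can be turned into a simple check for a dashboard or a weekly review. The following is a minimal sketch, not a standard: the cut-offs between bands (40%, 60%, 80%) are assumptions chosen as midpoints between the listed levels, and the name `classify_toil` is hypothetical.

```python
from bisect import bisect_right

# Assumed cut-offs: midpoints between the 30/50/70/90% levels listed above.
CUTOFFS = [0.40, 0.60, 0.80]
BANDS = ["Healthy", "At the limit", "Unhealthy", "Crisis"]


def classify_toil(ratio: float) -> str:
    """Map a toil ratio (0.0 to 1.0) to one of the four bands."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    # bisect_right returns how many cut-offs the ratio has reached,
    # which is also the index of the matching band.
    return BANDS[bisect_right(CUTOFFS, ratio)]


if __name__ == "__main__":
    for ratio in (0.30, 0.50, 0.70, 0.90):
        print(f"Toil at {ratio:.0%}: {classify_toil(ratio)}")
```

A team would likely tune the cut-offs to its own tolerance; the point is to make the band an explicit, reviewable value rather than a feeling.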
Identifying Toil
Before you can reduce toil, you need to find it. There are several approaches.
Track Time
Ask your team to log what they spend time on for two weeks. Categorize each activity as toil or engineering work.
Example time log:
Mon 09:00-10:00 Respond to disk-full alert (toil)
Mon 10:00-11:30 Deploy hotfix to production (toil)
Mon 11:30-12:00 Update dependency in 3 repos (toil)
Mon 13:00-15:00 Build auto-scaling for staging (engineering)
Mon 15:00-16:00 Review pull requests (engineering)
Mon 16:00-17:00 Provision new engineer accounts (toil)
Toil: 4 hours Engineering: 3 hours Toil ratio: 57%
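Once the log is in a structured form, the totals can be computed rather than added by hand. A minimal sketch, assuming each entry is recorded as start time, end time, activity, and category; the tuple layout and function names are illustrative and not taken from any particular time-tracking tool.

```python
from datetime import datetime

# Entries mirror the example log: (start, end, activity, category).
TIME_LOG = [
    ("09:00", "10:00", "Respond to disk-full alert", "toil"),
    ("10:00", "11:30", "Deploy hotfix to production", "toil"),
    ("11:30", "12:00", "Update dependency in 3 repos", "toil"),
    ("13:00", "15:00", "Build auto-scaling for staging", "engineering"),
    ("15:00", "16:00", "Review pull requests", "engineering"),
    ("16:00", "17:00", "Provision new engineer accounts", "toil"),
]


def hours_between(start: str, end: str) -> float:
    """Duration in hours between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 3600


def summarize(log):
    """Total hours per category."""
    totals = {}
    for start, end, _activity, category in log:
        totals[category] = totals.get(category, 0.0) + hours_between(start, end)
    return totals


def toil_ratio(totals) -> float:
    """Share of logged time spent on toil (0.0 when nothing is logged)."""
    total = sum(totals.values())
    return totals.get("toil", 0.0) / total if total else 0.0


if __name__ == "__main__":
    totals = summarize(TIME_LOG)
    for category, hours in totals.items():
        print(f"{category}: {hours:.1f} hours")
    print(f"Toil ratio: {toil_ratio(totals):.0%}")
```

The ratio is taken over logged time only, so lunch breaks and unlogged gaps do not dilute it.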
Analyze On-Call
On-call data is the richest source of toil information. Look at your alert history and incident tickets.
On-call analysis for last month:
Alert: "Disk usage > 90%" — Fired 47 times — Toil
Alert: "Certificate expiring" — Fired 12 times — Toil
Alert: "Service restart needed" — Fired 31 times — Toil
Alert: "Unusual traffic pattern" — Fired 8 times — Not toil (requires investigation)
Alert: "Database connection pool" — Fired 22 times — Toil
The alerts that fire most frequently and have a known, mechanical resolution are your highest-impact automation targets.
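Ranking the alert history is easy to script once each alert has been labeled. A minimal sketch using the counts above; the `mechanical` flag is an assumption that someone on the team assigns by hand after reading the runbook, and the type and function names are hypothetical.

```python
from typing import NamedTuple


class AlertStat(NamedTuple):
    name: str
    fired: int
    mechanical: bool  # True when the fix is a known, repeatable procedure


ALERTS = [
    AlertStat("Disk usage > 90%", 47, True),
    AlertStat("Certificate expiring", 12, True),
    AlertStat("Service restart needed", 31, True),
    AlertStat("Unusual traffic pattern", 8, False),  # requires investigation
    AlertStat("Database connection pool", 22, True),
]


def automation_targets(alerts):
    """Mechanical alerts, most frequently fired first."""
    candidates = [a for a in alerts if a.mechanical]
    return sorted(candidates, key=lambda a: a.fired, reverse=True)


def mechanical_share(alerts) -> float:
    """Fraction of all alert firings that have a mechanical resolution."""
    total = sum(a.fired for a in alerts)
    mechanical = sum(a.fired for a in alerts if a.mechanical)
    return mechanical / total if total else 0.0


if __name__ == "__main__":
    for alert in automation_targets(ALERTS):
        print(f"{alert.fired:>3}  {alert.name}")
    print(f"Mechanical share of firings: {mechanical_share(ALERTS):.0%}")
```

Firing count is only a first cut; weighting by time per occurrence gives a better ordering when resolution times differ widely.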
Categorize Support Requests
If your team handles requests from other teams, categorize them.
Support request analysis:
"Please provision a new database" — 15/month — Automate
"Please rotate this API key" — 8/month — Automate
"Please increase the pod limit" — 12/month — Automate
"Help me debug this timeout issue" — 5/month — Cannot automate
"Please update DNS for new domain" — 6/month — Automate
Automate the Highest-Frequency Toil First
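Prioritize automation by total time cost: frequency multiplied by time per occurrence. Frequency alone can mislead. In the table below, database provisioning happens about half as often as service restarts but costs more than twice the hours.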
Toil item               Frequency   Time each   Total/month
Disk cleanup            47/month    15 min      11.75 hours
Service restarts        31/month    10 min       5.17 hours
Database provisioning   15/month    45 min      11.25 hours
Certificate rotation    12/month    30 min       6.00 hours
Pod limit increases     12/month    20 min       4.00 hours
API key rotation        8/month     20 min       2.67 hours
Priority order: Disk cleanup, Database provisioning, Certificate rotation, Service restarts
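The ordering follows directly from the arithmetic and can be recomputed whenever the numbers change. A minimal sketch, assuming the frequencies and durations from the table; the function names are illustrative.

```python
# (toil item, occurrences per month, minutes per occurrence)
TOIL_ITEMS = [
    ("Disk cleanup", 47, 15),
    ("Service restarts", 31, 10),
    ("Database provisioning", 15, 45),
    ("Certificate rotation", 12, 30),
    ("Pod limit increases", 12, 20),
    ("API key rotation", 8, 20),
]


def monthly_hours(per_month: int, minutes_each: int) -> float:
    """Total hours per month spent on one toil item."""
    return per_month * minutes_each / 60


def prioritize(items):
    """Items sorted by total monthly hours, largest first."""
    return sorted(items, key=lambda item: monthly_hours(item[1], item[2]), reverse=True)


if __name__ == "__main__":
    for name, per_month, minutes_each in prioritize(TOIL_ITEMS):
        hours = monthly_hours(per_month, minutes_each)
        print(f"{name:<24}{hours:>6.2f} hours/month")
```

This ranks by time saved only. In practice the cost of building each automation also matters, so a cheap fix lower in the list may still be worth doing first.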
Automation Examples
Certificate Rotation
# Before: manual cert rotation every 90 days per service
# After: cert-manager auto-rotates certificates
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: payment-service-tls
spec:
  secretName: payment-service-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - payments.example.com
  renewBefore: 720h # Renew 30 days before expiry
Disk Cleanup
#!/bin/bash
# Before: engineer responds to alert, SSHs in, cleans up
# After: automated cleanup runs on threshold
# Cron job that runs hourly
set -euo pipefail

THRESHOLD=85
# GNU df: print only the usage percentage of the filesystem holding /var/log
USAGE=$(df --output=pcent /var/log | tail -1 | tr -d '% ')

if [ "$USAGE" -gt "$THRESHOLD" ]; then
    # Delete logs older than 7 days, including ones compressed by earlier runs
    find /var/log/app \( -name "*.log" -o -name "*.log.gz" \) -mtime +7 -delete
    # Compress the remaining logs older than 1 day
    find /var/log/app -name "*.log" -mtime +1 -exec gzip {} +
    echo "Disk cleanup completed. Usage was ${USAGE}%"
fi
Self-Service Database Provisioning
# Before: file a ticket, wait 2 days for DBA to provision
# After: developer runs a command
# platform create-database --name orders-db --size medium --team orders
# Behind the scenes, this Terraform module runs:
resource "aws_db_instance" "database" {
  identifier              = var.database_name
  engine                  = "postgres"
  engine_version          = "15.4"
  instance_class          = var.size_to_instance[var.size]
  allocated_storage       = var.size_to_storage[var.size]
  max_allocated_storage   = var.size_to_storage[var.size] * 2
  backup_retention_period = 7
  multi_az                = var.environment == "production"
  # Credentials and networking arguments are omitted here for brevity

  tags = {
    Team        = var.team
    Environment = var.environment
    ManagedBy   = "platform-self-service"
  }
}
The Toil Budget
Just as error budgets manage reliability risk, a toil budget manages operational overhead.
Toil budget per engineer: 50% of time (20 hours/week)
Current toil: 15 hours/week (37.5%)
→ Healthy. Engineers have time for projects.
After onboarding 5 new services without automation:
Current toil: 22 hours/week (55%)
→ Over budget. Pause new service onboarding.
→ Invest in automation until toil drops below 50%.
The toil budget forces conversations about sustainability. When a product team wants to launch a new service, the SRE team can say: "Our toil budget is full. We need to automate X before we can take on more services." This is data, not opinion.
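The budget check itself is a few lines of arithmetic, which makes it easy to run against each week's tracked hours. A minimal sketch, assuming a 40-hour week and the 50% budget from the example; the class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ToilBudget:
    week_hours: float = 40.0    # assumed working week per engineer
    budget_ratio: float = 0.5   # 50% of time may go to toil

    @property
    def budget_hours(self) -> float:
        """Weekly toil hours allowed per engineer."""
        return self.week_hours * self.budget_ratio

    def ratio(self, toil_hours: float) -> float:
        """Toil as a fraction of the working week."""
        return toil_hours / self.week_hours

    def over_budget_by(self, toil_hours: float) -> float:
        """Weekly toil hours above the budget (0.0 when within budget)."""
        return max(0.0, toil_hours - self.budget_hours)


if __name__ == "__main__":
    budget = ToilBudget()
    for label, hours in (("Before onboarding", 15), ("After onboarding", 22)):
        excess = budget.over_budget_by(hours)
        status = f"over budget by {excess:.0f} h/week" if excess else "within budget"
        print(f"{label}: {budget.ratio(hours):.1%} toil, {status}")
```

The excess figure gives the conversation a concrete target: it is the amount of weekly toil that automation has to remove before the team can take on more services.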
Real-World Example
An infrastructure team at a mid-size company supported 40 microservices. Their on-call rotation was brutal: the on-call engineer averaged 15 alerts per shift, most of which had known resolutions documented in a wiki. Engineers dreaded on-call weeks. Two senior engineers had quit, citing burnout.
The team conducted a toil audit:
- Logged all on-call activities for one month
- Found that 70% of alerts had a mechanical resolution (restart service, clear disk, refresh cache)
- Calculated total toil at 68% of team capacity
They prioritized automation:
- Built auto-remediation for the top 5 alert types (disk cleanup, service restart, cache refresh, connection pool reset, certificate rotation)
- Moved database provisioning to self-service Terraform modules
- Automated dependency updates with Renovate Bot
Three months later, on-call alerts dropped from 15 per shift to 4. Toil ratio dropped from 68% to 35%. The team had time for engineering projects again. The two open positions filled quickly because on-call was no longer a horror story.
Common Pitfalls
- Accepting toil as normal -- "That's just how operations works" is a signal that the team has given up on improving; toil is a problem to solve, not a fact of life
- Automating everything at once -- Trying to automate all toil in one sprint leads to half-finished automation that is worse than manual processes; prioritize by impact
- Automation that nobody maintains -- A script written by someone who has left the company, and that nobody else understands, is a liability, not an asset; automation needs tests, documentation, and ownership
- Not tracking toil -- If you do not measure toil, you cannot manage it; track toil hours weekly and review the trend monthly
- Automating a broken process -- If your deployment process is broken, automating it produces broken deployments faster; fix the process first, then automate
- Ignoring the human cost -- Toil is not just an efficiency problem; it causes burnout, attrition, and declining morale; treat it as a people problem, not just a technical one
Key Takeaways
- Toil is manual, repetitive operational work that scales linearly and can be automated
- The SRE target is keeping toil below 50% of an engineer's time; the rest goes to engineering work
- Identify toil by tracking time, analyzing on-call data, and categorizing support requests
- Automate the highest-frequency, most time-consuming toil first
- A toil budget forces conversations about sustainability and prevents toil from silently growing
- Toil is a people problem as much as a technical one: unchecked toil causes burnout and attrition