Docs That Save Time

Most documentation is written and never read. This is not because engineers are lazy readers. It is because most documentation is not written to be useful. It is written to satisfy a process ("every project must have documentation") or to check a box ("we documented the API"). The documentation that actually saves time is written for a specific audience with a specific need at a specific moment. If you focus on that, you will write less documentation and it will be read more.

The Test for Useful Documentation

Before writing any documentation, ask: "Who will read this, when, and what will they be trying to do?"

Good answers:
  "A new engineer joining the team, during their first week,
   trying to run the project locally."
  
  "An on-call engineer at 3 AM trying to figure out why the
   payment service is returning 500 errors."
  
  "Future me, in 6 months, trying to remember why we chose
   Kafka over RabbitMQ."

Bad answers:
  "Anyone who might be interested."
  "Whoever needs it."
  "Because we should have docs."

If you cannot name a specific reader, specific timing, and specific task, do not write the doc. It will not be read.

The Four Docs That Matter

Out of all possible documentation types, four consistently save the most time across engineering teams.

1. The README

The README is the entry point. Its job is to answer: "I just found this repository. What is it, and how do I run it?"

A README that saves time:

  # Order Service

  Handles order creation, payment processing, and fulfillment
  for the e-commerce platform.

  ## Quick Start

  Prerequisites: Docker, Node 20+

  1. Clone the repo
  2. Copy .env.example to .env
  3. Run: docker compose up
  4. The API is available at http://localhost:3000

  ## Running Tests

  npm test                # unit tests
  npm run test:integration  # requires docker compose up

  ## Architecture

  - src/routes/     HTTP endpoints
  - src/services/   Business logic
  - src/models/     Database access
  - src/workers/    Background job processors

  ## Key Decisions

  - We use Kafka for order events (see ADR-003)
  - Payment processing is async (see ADR-007)
  - All prices are stored in cents to avoid floating point issues

  ## Contact

  Team: #orders-team on Slack
  On-call: PagerDuty rotation "order-service"

A README like this takes about 10 minutes to write and can save each new team member 2-4 hours of fumbling. The "Key Decisions" section is particularly valuable: it answers the "why" questions that come up repeatedly.
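The "prices in cents" line in the example README is a good illustration of a decision worth recording. A minimal Python sketch of the problem it avoids follows; the helper name and the sample prices are made up for illustration and are not taken from any real order service.

```python
# Illustrative only: the helper name and sample prices are assumptions.

def total_cents(prices_in_cents: list) -> int:
    """Sum prices stored as integer cents; integer addition is exact."""
    return sum(prices_in_cents)

# Prices held as float dollars pick up binary rounding error.
float_total = 0.10 + 0.20
print(float_total == 0.30)   # False

# The same prices held as integer cents add up exactly.
cents_total = total_cents([10, 20])
print(cents_total == 30)     # True

# Convert to a display string only at the edge of the system.
print(f"${cents_total // 100}.{cents_total % 100:02d}")  # $0.30
```

One line in the README ("stored in cents") spares the next engineer from rediscovering this the hard way.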

2. The Onboarding Guide

The onboarding guide answers: "I just joined this team. What do I need to know, install, and access?"

Structure of a useful onboarding guide:

  Day 1: Access and setup
    - Get added to: GitHub org, Slack channels, AWS account,
      PagerDuty, monitoring dashboards
    - Install: Docker, Node 20, IDE plugins (list them)
    - Clone repos: order-service, shared-libs, infrastructure
    - Run: order-service locally (link to README)

  Day 2-3: Context
    - Read: architecture overview (link)
    - Read: the last 3 ADRs (links)
    - Do: complete the "add a test endpoint" exercise (link)
    - Watch: 15-min recorded walkthrough of the deploy process (link)

  Week 1: First contribution
    - Pick a task labeled "good first issue"
    - Pair with [team member] on your first PR
    - Join the on-call shadow rotation

  Who to ask about what:
    - Infrastructure: @alice
    - Payment integration: @bob
    - Frontend: @carol
    - "How does X work?": check #orders-team Slack first

A good onboarding guide is a checklist, not an essay. New engineers do not want to read 20 pages. They want to know what to do next.

3. The Runbook

The runbook answers: "Something is broken in production at 3 AM. What do I do?"

Structure of a useful runbook:

  # Payment Service Down

  ## Symptoms
  - PagerDuty alert: "payment-service health check failed"
  - Users see "Payment processing unavailable" error
  - Monitoring: payment-service pod count = 0

  ## Immediate Actions
  1. Check pod status:
     kubectl get pods -n payments
  
  2. If pods are crashing (CrashLoopBackOff):
     kubectl logs -n payments deployment/payment-service --tail=100
     Common cause: database connection string changed (check secrets)
  
  3. If pods are healthy but unresponsive:
     Check the payment provider status page: https://status.stripe.com
     If Stripe is down, there is nothing to fix. Update the status page
     and notify #incidents channel.
  
  4. If pods are missing entirely:
     Check recent deploys: kubectl rollout history -n payments deployment/payment-service
     If a bad deploy happened: kubectl rollout undo -n payments deployment/payment-service

  ## Escalation
  - If not resolved in 15 minutes: page the payments team lead
  - If data integrity is at risk: page the VP of Engineering

  ## Post-Incident
  - File an incident report (link to template)
  - Update this runbook if you found a new failure mode

Runbooks save the most time per word of any documentation. A good runbook can turn a 90-minute incident into a 15-minute incident.
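The "Immediate Actions" in the example runbook are really a small decision table: observe a symptom, then go to the matching step. The sketch below restates that table in Python to show how little branching a good runbook needs. The function name and its three inputs are hypothetical, and the code does not talk to a cluster; it only maps an observed state to a step in the runbook.

```python
# Hypothetical triage helper mirroring the example runbook's steps.
# It maps an observed state to a runbook step; it does not call kubectl.

def next_step(pod_count: int, crash_looping: bool, responding: bool) -> str:
    if pod_count == 0:
        # Runbook step 4: pods are missing entirely.
        return "Step 4: check rollout history; undo the last deploy if it was bad"
    if crash_looping:
        # Runbook step 2: pods are in CrashLoopBackOff.
        return "Step 2: read the last 100 log lines; check the database secret"
    if not responding:
        # Runbook step 3: pods look healthy but do not answer.
        return "Step 3: check the payment provider status page"
    return "No matching symptom: follow the escalation section"

print(next_step(pod_count=0, crash_looping=False, responding=False))
```

If a symptom does not fit any branch, that is a signal to escalate and, afterwards, to add the new failure mode to the runbook.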

4. The Troubleshooting Guide

The troubleshooting guide answers: "I'm getting error X. How do I fix it?"

Structure of a useful troubleshooting guide:

  # Common Issues

  ## "Cannot connect to database"
  Cause: Usually a local Docker issue.
  Fix:
    1. Check Docker is running: docker ps
    2. Check the database container: docker compose ps db
    3. If the container is stopped: docker compose up db
    4. If the container is running but connection fails:
       check DATABASE_URL in .env matches docker-compose.yml

  ## "Test suite hangs on CI"
  Cause: Usually a connection pool leak in integration tests.
  Fix:
    1. Check which test is hanging: look at the CI log for the
       last test that started
    2. Ensure the test calls cleanup/teardown after database operations
    3. Increase the CI timeout as a temporary fix if needed

  ## "Build fails with 'heap out of memory'"
  Cause: Node's default heap size is too small for our build.
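  Fix: Set NODE_OPTIONS=--max-old-space-size=4096 in the environment that
       runs the build (your shell or the CI config). Putting it in .env helps
       only if the build command loads .env before Node starts.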

Troubleshooting guides accumulate over time. Every time someone spends 30 minutes solving a common problem, they should add the solution to this guide. The next person finds it in 2 minutes.
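The "find it in 2 minutes" part depends on entries being keyed by the error text people actually see. A minimal sketch of that idea, assuming a hypothetical in-memory copy of two entries from the guide above (the dictionary and function names are illustrative):

```python
from typing import Optional

# Hypothetical in-memory copy of two entries from the example guide,
# keyed by the text that appears in the error message.
KNOWN_ISSUES = {
    "cannot connect to database": (
        "Check Docker is running, then the db container, then DATABASE_URL in .env"
    ),
    "heap out of memory": (
        "Raise Node's heap limit with NODE_OPTIONS=--max-old-space-size=4096"
    ),
}

def find_fix(error_message: str) -> Optional[str]:
    """Return the first known fix whose symptom text appears in the message."""
    lowered = error_message.lower()
    for symptom, fix in KNOWN_ISSUES.items():
        if symptom in lowered:
            return fix
    return None  # Unknown problem: solve it, then add an entry.

print(find_fix("FATAL ERROR: JavaScript heap out of memory"))
```

In practice a plain text search over the guide does the same job; the point is that headings should quote the error message verbatim so that search finds them.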

Writing for Future-You

The most reliable reader of your documentation is future-you. You will come back to this code, this system, or this decision in 6 months and not remember any of it.

What future-you needs:
  - WHY you made this choice (not just WHAT the choice was)
  - WHAT the alternatives were and why they were rejected
  - HOW to run, test, and deploy this thing
  - WHERE to find related code, services, or documentation
  - WHO to ask if this doc does not answer the question

Write for the version of yourself that has forgotten everything. Because in 6 months, you will have.

Where to Put Documentation

Documentation that nobody can find is documentation that does not exist.

Put it where people already look:
  - README.md in the repository root (first thing anyone sees)
  - Comments in code (where the code lives)
  - Runbooks in the on-call wiki (where on-call engineers look)
  - ADRs in the repository (close to the code they describe)
  - Onboarding guide in the team wiki (where new hires are directed)

Do NOT put it:
  - In a random Google Doc that requires a specific link
  - In a Confluence page 4 levels deep in a hierarchy
  - In someone's personal notes
  - In a Slack message (Slack is ephemeral, not archival)

The ideal location is as close to the code as possible. Repository-level docs are found by anyone who has the repo. Wiki docs require knowing which wiki and which page.

Keeping Docs Current

Stale documentation is worse than no documentation because it actively misleads. The key to keeping docs current is making updates easy and making staleness visible.

Strategies for currency:
  - Review docs during code review ("does this change affect the README?")
  - Add a "last updated" date to every doc
  - Link docs to the code they describe (so changes in code prompt doc updates)
  - Use the troubleshooting guide as a living document (add entries when
    problems are solved)
  - Delete docs that are no longer true (deletion is maintenance)
  - Prefer short docs (easier to update than long ones)

A 10-line README that is accurate is more useful than a 50-page design doc that was accurate 18 months ago.
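One way to make staleness visible is to check the "last updated" date automatically. The sketch below assumes a convention that is not part of the original text: each doc contains a line such as "Last updated: 2024-01-31", and anything older than 180 days is flagged. Both the format and the threshold are placeholders to adapt.

```python
import re
from datetime import date, timedelta
from typing import Optional

# Assumed convention: each doc contains a line like "Last updated: 2024-01-31".
LAST_UPDATED = re.compile(r"last updated:\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE)

def is_stale(doc_text: str, today: date, max_age_days: int = 180) -> Optional[bool]:
    """Return True if the doc's date is older than max_age_days,
    False if it is recent, and None if no date line is found."""
    match = LAST_UPDATED.search(doc_text)
    if match is None:
        return None
    updated = date.fromisoformat(match.group(1))
    return today - updated > timedelta(days=max_age_days)

readme = "# Order Service\nLast updated: 2024-01-01\n"
print(is_stale(readme, today=date(2024, 12, 1)))  # True
```

A check like this could run in CI and list docs with old or missing dates, which turns "review the docs" from a good intention into a visible to-do.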

Real-World Example: The Runbook That Saved the Quarter

A payments team had a recurring production issue: once a month, the payment service would stop processing orders. Each time, it took 60-90 minutes to diagnose and fix. The fix was always the same — a connection pool was exhausted and needed to be recycled.

An engineer wrote a six-line runbook entry:

  ## Payment Processing Stalled
  Symptoms: orders stuck in "processing" state for 5+ minutes
  Cause: connection pool exhaustion (usually after a traffic spike)
  Fix: kubectl rollout restart deployment/payment-service -n payments
  Verify: watch for new orders moving to "completed" within 2 minutes
  Follow-up: file a ticket to investigate pool sizing (link to template)

The next time it happened, the on-call engineer found the runbook, ran the command, and resolved the incident in 4 minutes. Six lines of documentation saved roughly an hour of production downtime. Over a quarter of monthly incidents, that came to about 3 hours of downtime avoided and an estimated $50,000 in lost orders saved.
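The time saving can be checked with simple arithmetic. The sketch below uses only the figures from the example (one incident a month, 60-90 minutes before the runbook, 4 minutes after) and assumes every later incident is also resolved in about 4 minutes. The dollar estimate is not derived here because the example gives no per-minute cost.

```python
from typing import Tuple

# Figures from the example; the 4-minute resolution is assumed to repeat.
INCIDENTS_PER_QUARTER = 3    # one incident per month
MINUTES_BEFORE = (60, 90)    # low and high estimates without the runbook
MINUTES_AFTER = 4            # with the runbook

def minutes_saved_per_quarter() -> Tuple[int, int]:
    """Return the (low, high) estimate of downtime minutes avoided per quarter."""
    low, high = MINUTES_BEFORE
    return (
        (low - MINUTES_AFTER) * INCIDENTS_PER_QUARTER,
        (high - MINUTES_AFTER) * INCIDENTS_PER_QUARTER,
    )

print(minutes_saved_per_quarter())  # (168, 258)
```

That range is roughly 2.8 to 4.3 hours per quarter, so "about 3 hours" sits at the conservative end.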

Common Pitfalls

Writing documentation nobody asked for: if you cannot name the reader and their specific need, do not write the doc. It will not be read.
Writing too much: long documentation is not read. Short, specific documentation is. A 10-line runbook beats a 10-page operations manual.
Never updating docs: stale docs mislead. Add "last updated" dates, review docs during code review, and delete docs that are no longer true.
Putting docs in hard-to-find places: docs buried in a random wiki page might as well not exist. Put them where people already look, such as the repo, the README, and the on-call wiki.
Documenting WHAT instead of WHY: "this function processes payments" is obvious from the code. "We process payments asynchronously because the provider has a 3-second latency SLA" is not obvious and is worth documenting.

Key Takeaways

Most documentation is written and never read. Focus on four types that consistently save time: READMEs, onboarding guides, runbooks, and troubleshooting guides.
Before writing any doc, answer: who reads this, when, and what are they trying to do? If you cannot answer, do not write it.
Write for future-you. Document WHY, not just WHAT. In 6 months, you will not remember the context behind any decision.
Put docs where people already look. Repository-level docs beat wiki pages. Inline comments beat external documents.
Keep docs short and current. A 10-line accurate runbook beats a 50-page stale design document. Delete docs that are no longer true.