Troubleshooting Guides

A troubleshooting guide starts with what the reader can see, a symptom, and leads them to a resolution. It is not an essay about how the system works. It is not an architecture document. It is a decision tree: "I see X. What do I do?" Wherever possible, write each guide after an incident, based on real failure modes, with real commands and real outputs. If you write troubleshooting guides from imagination, you will document problems that never happen and miss problems that happen every week.

The reader does not know what is wrong. They know what they can observe. Structure your guides around symptoms, not causes.

Bad structure (cause-based):
  ## Database Connection Pool Exhaustion
  ## DNS Resolution Failure
  ## Certificate Expiration
  ## Memory Leak in Worker Process

Good structure (symptom-based):
  ## API Returns 500 Errors
  ## API Returns 502/503 Errors
  ## API Latency Is High (> 2s p99)
  ## API Not Reachable (Connection Refused)
  ## Requests Hang Indefinitely

The reader knows their API is returning 500 errors. They do not know it is caused by connection pool exhaustion. The symptom-based structure meets them where they are.

Decision Trees, Not Essays

Each symptom section is a decision tree. The reader checks something, sees a result, and follows the branch.

## API Returns 500 Errors

  1. Check which endpoints are returning 500s:
     kubectl logs -l app=api-server --tail=500 | grep "HTTP 500" | \
       awk '{print $6}' | sort | uniq -c | sort -rn

  2. If all endpoints are returning 500s:
     -> Likely a shared dependency issue. Go to step 3.
     If only specific endpoints are returning 500s:
     -> Likely an endpoint-specific bug. Go to step 8.

  3. Check database connectivity:
     kubectl exec -it deploy/api-server -- pg_isready -h db.internal

  4. If output shows "accepting connections":
     -> Database is reachable. Go to step 5.
     If output shows "rejecting connections" or "no response", or the command times out:
     -> Database is down, still starting up, or unreachable.
        Follow: Database Down Runbook (link)

  5. Check database connection pool:
     kubectl exec -it deploy/api-server -- curl localhost:8080/debug/pool

     Expected output:
     {"active": 20, "idle": 80, "max": 100, "waiting": 0}

  6. If "waiting" is greater than 0 or "active" equals "max":
     -> Connection pool exhausted. Go to step 7.
     If pool looks healthy:
     -> Go to step 8.

  7. Connection pool exhaustion:
     a. Check for long-running queries:
        kubectl exec -it deploy/api-server -- psql -h db.internal -c \
          "SELECT pid, now() - query_start AS duration, query
           FROM pg_stat_activity
           WHERE state = 'active'
           ORDER BY duration DESC LIMIT 10;"

     b. Kill queries running longer than 5 minutes (this terminates the
        backend connection, not only the query):
        kubectl exec -it deploy/api-server -- psql -h db.internal -c \
          "SELECT pg_terminate_backend(pid)
           FROM pg_stat_activity
           WHERE state = 'active'
           AND pid <> pg_backend_pid()
           AND now() - query_start > interval '5 minutes';"

     c. Verify the pool recovers (re-run step 5).
     d. Investigate root cause after the incident is resolved.

  8. Check application logs for the specific error:
     kubectl logs -l app=api-server --tail=200 | grep "ERROR"

Every check has a command and an expected output, and every result points to a clear branch to follow.
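
One way to keep a guide honest about its branches is to write the tree down as data before writing the prose. The sketch below is a minimal illustration, not a prescribed format: the node names, the `TREE` layout, and the `walk` helper are all hypothetical, and the tree encodes only the branch points from the example above.

```python
from typing import Dict, List, Tuple

# Hypothetical encoding of the "API Returns 500 Errors" tree.
# Each node asks one question; each answer names another node or a
# terminal outcome (any string that is not itself a node name).
TREE: Dict[str, dict] = {
    "all_endpoints": {
        "ask": "Are all endpoints returning 500s?",
        "branches": {
            "yes": "db_reachable",
            "no": "STEP 8: check application logs",
        },
    },
    "db_reachable": {
        "ask": "Does pg_isready report 'accepting connections'?",
        "branches": {
            "yes": "pool_exhausted",
            "no": "RUNBOOK: Database Down",
        },
    },
    "pool_exhausted": {
        "ask": "Is 'waiting' > 0 or 'active' equal to 'max'?",
        "branches": {
            "yes": "FIX: terminate long-running queries (step 7)",
            "no": "STEP 8: check application logs",
        },
    },
}


def walk(tree: Dict[str, dict], start: str,
         answers: Dict[str, str]) -> Tuple[List[str], str]:
    """Follow one path through the tree.

    Returns the list of nodes visited and the terminal outcome reached.
    """
    path: List[str] = []
    node = start
    while node in tree:
        path.append(node)
        node = tree[node]["branches"][answers[node]]
    return path, node


if __name__ == "__main__":
    visited, outcome = walk(
        TREE,
        "all_endpoints",
        {"all_endpoints": "yes", "db_reachable": "yes", "pool_exhausted": "yes"},
    )
    print(" -> ".join(visited), "=>", outcome)
```

Writing the tree this way makes it easy to see whether a reader who answers "no" at the first question really skips the database checks, which a linear list cannot express.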

Writing Effective Checks

Each check in a troubleshooting guide has four parts: the command, what normal looks like, what abnormal looks like, and what to do about the abnormal case.

Structure of a good check:

  Step N. Check [thing]:
     [exact command to run]

     Normal output:
     [what healthy looks like]

     Abnormal output:
     [what broken looks like]

     If abnormal:
     [what to do next — either a fix or the next step]
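
The four-part structure above can also be treated as a small record, which makes incomplete checks easy to spot during review. The following is a sketch under assumptions: the `Check` class, its field names, and both helper functions are invented for illustration and are not part of any existing tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Check:
    """One check in a troubleshooting guide (hypothetical record type)."""
    number: int
    thing: str
    command: str
    normal: str
    abnormal: str
    if_abnormal: str


def missing_parts(check: Check) -> List[str]:
    """Return the names of the four required parts that are empty."""
    required = ("command", "normal", "abnormal", "if_abnormal")
    return [name for name in required if not getattr(check, name).strip()]


def render(check: Check) -> str:
    """Render a check in the plain-text layout shown above."""
    lines = [
        f"Step {check.number}. Check {check.thing}:",
        f"   {check.command}",
        "",
        "   Normal output:",
        f"   {check.normal}",
        "",
        "   Abnormal output:",
        f"   {check.abnormal}",
        "",
        "   If abnormal:",
        f"   {check.if_abnormal}",
    ]
    return "\n".join(lines)
```

A review step could call `missing_parts` on every check and reject the guide if any list comes back non-empty.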

Example of a Complete Check

  4. Check disk usage on the database server:
     ssh db-primary "df -h /var/lib/postgresql"

     Normal output:
     /dev/sda1  500G  210G  290G  42% /var/lib/postgresql

     Abnormal output:
     /dev/sda1  500G  485G   15G  97% /var/lib/postgresql

     If disk usage is above 90%:
     a. Identify large files:
        ssh db-primary "du -sh /var/lib/postgresql/* | sort -rh | head"
     b. Check if WAL files are accumulating:
        ssh db-primary "du -sh /var/lib/postgresql/14/pg_wal"
     c. If WAL files are larger than 50GB:
        Follow: WAL Accumulation Runbook (link)

The reader never has to guess whether the output they see is normal or abnormal.
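
The normal and abnormal cases in the disk check differ only in the percentage column, so the comparison can be made mechanical. This is a minimal sketch that assumes a single data line in the six-column layout shown above (filesystem, size, used, available, use%, mount point); the function names and the 90% default are taken from the example or invented for illustration.

```python
def parse_df_line(line: str) -> dict:
    """Split one data line of `df -h` output into named fields.

    Assumes six whitespace-separated columns and no spaces in the
    filesystem or mount point names.
    """
    filesystem, size, used, available, use_percent, mount = line.split()
    return {
        "filesystem": filesystem,
        "size": size,
        "used": used,
        "available": available,
        "use_percent": int(use_percent.rstrip("%")),
        "mount": mount,
    }


def classify_disk(line: str, threshold_percent: int = 90) -> str:
    """Return "abnormal" when usage is above the threshold, else "normal"."""
    usage = parse_df_line(line)["use_percent"]
    return "abnormal" if usage > threshold_percent else "normal"


if __name__ == "__main__":
    print(classify_disk("/dev/sda1  500G  210G  290G  42% /var/lib/postgresql"))
    print(classify_disk("/dev/sda1  500G  485G   15G  97% /var/lib/postgresql"))
```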

Common Troubleshooting Patterns

Certain patterns appear in nearly every system. Document them as reusable sections.

The "Is It Actually Down?" Check

Before debugging, confirm the problem exists and is not a monitoring false positive.

  1. Confirm the symptom from a different vantage point:
     a. Check the alert source dashboard:
        [link to dashboard]
     b. Make a direct request:
        curl -v https://api.example.com/health
     c. Check from a different network:
        ssh bastion-us-west "curl -v https://api.example.com/health"

  2. If the direct requests succeed but monitoring shows failure:
     -> Likely a monitoring issue, not a service issue.
        Check monitoring infrastructure before proceeding.
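
The branch in step 2 is a comparison between what monitoring says and what independent requests show. A minimal sketch of that comparison follows; the function name, the verdict strings, and the vantage point labels are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def confirm_symptom(monitoring_shows_failure: bool,
                    direct_checks: Dict[str, bool]) -> Tuple[str, List[str]]:
    """Compare the monitoring signal with direct requests.

    direct_checks maps a vantage point name to True when the request
    from that vantage point succeeded.
    """
    if not direct_checks:
        raise ValueError("at least one direct check is required")

    failed = sorted(name for name, ok in direct_checks.items() if not ok)
    if failed:
        # At least one independent vantage point sees the failure.
        return "service problem confirmed", failed
    if monitoring_shows_failure:
        # Every direct request succeeded but monitoring disagrees.
        return "suspect monitoring", failed
    return "no problem observed", failed
```

For example, `confirm_symptom(True, {"laptop": True, "bastion-us-west": True})` yields the "suspect monitoring" verdict, which corresponds to checking the monitoring infrastructure before debugging the service.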

The Recent Change Check

Many outages are triggered by a recent change, such as a deployment or a configuration update. Check this early.

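  3. Check for recent deployments:
     kubectl rollout history deployment/api-server | tail -5
     kubectl get rs -l app=api-server --sort-by=.metadata.creationTimestamp

     rollout history lists revisions without timestamps; the AGE column
     of the ReplicaSet list shows when each revision was created.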

  4. If a deployment happened in the last 2 hours:
     a. Check if the new version is healthy:
        kubectl get pods -l app=api-server -o wide
     b. If pods are in CrashLoopBackOff:
        Roll back:
        kubectl rollout undo deployment/api-server
     c. If pods are running but errors started at deploy time:
        Roll back:
        kubectl rollout undo deployment/api-server
     d. After rollback, verify the symptom is resolved.
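
The rollback rule in step 4 can be stated as a small function, which helps when agreeing on it in advance rather than during an incident. This is a sketch under assumptions: the function name and return strings are invented, and the 10-minute window used to decide that errors "started at deploy time" is an assumption, since the text above does not define one.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional

RECENT_WINDOW = timedelta(hours=2)          # from step 4 above
CORRELATION_WINDOW = timedelta(minutes=10)  # assumption, not from the text


def rollback_decision(deployed_at: datetime,
                      now: datetime,
                      pod_statuses: List[str],
                      errors_started_at: Optional[datetime] = None) -> str:
    """Apply the recent-change rule and return the suggested action."""
    if now - deployed_at > RECENT_WINDOW:
        return "continue: no recent deployment"
    if "CrashLoopBackOff" in pod_statuses:
        return "roll back: pods are crash-looping"
    if errors_started_at is not None:
        delay = errors_started_at - deployed_at
        if timedelta(0) <= delay <= CORRELATION_WINDOW:
            return "roll back: errors started at deploy time"
    return "continue: deployment does not explain the symptom"


if __name__ == "__main__":
    deployed = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    print(rollback_decision(deployed, deployed + timedelta(minutes=30),
                            ["Running", "CrashLoopBackOff"]))
```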

The Resource Exhaustion Check

CPU, memory, disk, connections, and file descriptors are finite resources, and any of them can run out.

  5. Check resource usage:
     a. CPU and memory:
        kubectl top pods -l app=api-server
     b. Disk (if applicable):
        kubectl exec -it deploy/api-server -- df -h
     c. Open file descriptors (no -it here, so ls prints one entry per line):
        kubectl exec deploy/api-server -- \
          ls -1 /proc/1/fd | wc -l
     d. Network connections:
        kubectl exec -it deploy/api-server -- \
          ss -s

  Thresholds:
    CPU > 90%: likely CPU-bound. Check for hot loops or missing indexes.
    Memory > 85%: likely memory pressure. Check for leaks.
    Disk > 90%: clean up or expand. Do not wait for 100%.
    File descriptors > 80% of limit: likely a connection or file handle leak.
      The limit is the "Max open files" row in /proc/1/limits.
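
The thresholds above form a lookup table, and a table is easier to keep in sync with the guide than prose. The sketch below is illustrative only: the `THRESHOLDS` name, the resource keys, and the `evaluate` helper are assumptions, while the percentages and advice are copied from the list above.

```python
from typing import Dict, List, Tuple

# resource -> (threshold in percent, advice from the list above)
THRESHOLDS: Dict[str, Tuple[int, str]] = {
    "cpu": (90, "likely CPU-bound; check for hot loops or missing indexes"),
    "memory": (85, "likely memory pressure; check for leaks"),
    "disk": (90, "clean up or expand; do not wait for 100%"),
    "file_descriptors": (80, "connection leak"),
}


def evaluate(usage_percent: Dict[str, int]) -> List[str]:
    """Return one finding per resource whose usage exceeds its threshold.

    Resources missing from usage_percent are skipped rather than guessed.
    """
    findings = []
    for resource, (limit, advice) in THRESHOLDS.items():
        value = usage_percent.get(resource)
        if value is not None and value > limit:
            findings.append(f"{resource} at {value}%: {advice}")
    return findings


if __name__ == "__main__":
    for finding in evaluate({"cpu": 95, "memory": 50, "disk": 93}):
        print(finding)
```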

Organizing Troubleshooting Guides

One Guide Per Service

Each service gets its own troubleshooting guide. Cross-service issues get a separate guide that references the per-service ones.

Troubleshooting guide organization:

  troubleshooting/
    api-server.md
    worker-service.md
    database.md
    cache.md
    message-queue.md
    cross-service.md        (for issues spanning services)
    external-dependencies.md (for third-party service issues)

Index by Symptom

Create an index page that maps symptoms to the relevant guide and section.

Symptom index:

  API returning 500         -> api-server.md#500-errors
  API returning 502/503     -> api-server.md#502-503-errors
  API latency high          -> api-server.md#high-latency
  Jobs not processing       -> worker-service.md#stuck-jobs
  Database connections full  -> database.md#connection-exhaustion
  Cache miss rate high      -> cache.md#high-miss-rate
  Messages not delivered    -> message-queue.md#delivery-failure
  Login failing             -> cross-service.md#auth-failure

This index is the entry point. The on-call engineer looks up their symptom, clicks the link, and lands directly on the relevant decision tree.
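
An index is only useful while its links resolve. The following sketch parses the index format shown above and reports entries whose anchor does not exist in the target guide. It assumes the `symptom -> file#anchor` layout; the function names and the `anchors_by_guide` input, which a real check would have to build from the guide files, are hypothetical.

```python
from typing import Dict, List, Set, Tuple

INDEX_TEXT = """\
API returning 500         -> api-server.md#500-errors
API returning 502/503     -> api-server.md#502-503-errors
Jobs not processing       -> worker-service.md#stuck-jobs
"""


def parse_index(text: str) -> Dict[str, Tuple[str, str]]:
    """Map each symptom to a (guide file, anchor) pair."""
    index = {}
    for line in text.splitlines():
        if "->" not in line:
            continue  # skip blank lines and headings
        symptom, target = line.split("->", 1)
        guide, _, anchor = target.strip().partition("#")
        index[symptom.strip()] = (guide, anchor)
    return index


def broken_links(index: Dict[str, Tuple[str, str]],
                 anchors_by_guide: Dict[str, Set[str]]) -> List[str]:
    """Return symptoms whose guide or anchor is missing."""
    return sorted(
        symptom
        for symptom, (guide, anchor) in index.items()
        if anchor not in anchors_by_guide.get(guide, set())
    )
```

Running a check like this after every guide edit would catch a renamed section before the on-call engineer does.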

Writing Guides from Incidents

The best troubleshooting guides are written from real incidents. After every incident, ask: "Could we write a troubleshooting section that would have helped us find this faster?"

Post-incident troubleshooting update process:

  1. During the postmortem, identify:
     - What symptom did we observe first?
     - What checks did we run?
     - What was the actual cause?
     - What command revealed the cause?

  2. Add or update the troubleshooting guide:
     - Add the symptom if it is not already listed
     - Add the diagnostic steps that worked
     - Add the resolution steps
     - Include the actual commands and outputs from the incident

  3. Review the new section:
     - Have someone who was not in the incident read it
     - Verify the commands work in staging
     - Merge within one week of the incident

What Real Incidents Teach You

What you learn from real incidents:
  - Which symptoms actually occur (not which ones you imagine)
  - Which commands actually reveal the problem
  - What the output actually looks like when things are broken
  - How long each diagnostic step takes
  - Which steps are dead ends and should be skipped

What you cannot learn from imagination:
  - All of the above

Troubleshooting guides written before any incidents are better than nothing, but they will be substantially rewritten after the first real failure. Accept this and iterate.

Common Pitfalls

Cause-based organization — organizing by root cause assumes the reader knows the root cause before they start troubleshooting. They do not. Organize by symptom.
Missing commands — "check the database connectivity" without the exact command to run. At 3AM, "check" is not actionable. The command is actionable.
No expected output — providing the command but not showing what normal and abnormal output looks like. The reader cannot interpret results without a baseline.
Linear guides for branching problems — troubleshooting is a tree, not a list. If every reader follows every step in order, the guide is not structured correctly. Use explicit branch points.
Guides written from theory — documenting failure modes that seem likely rather than failure modes that have actually occurred. Start with real incidents and expand from there.
No index — the reader has a symptom and no idea which document to open. A symptom-to-guide index eliminates this search.
Stale commands — infrastructure changes but the troubleshooting guide still references the old tool, the old hostname, or the old metric name. Review guides after infrastructure changes.
Missing escalation points — the decision tree leads to a dead end with no resolution. Every branch must end with either a fix or a clear escalation: "If none of the above resolved the issue, page [team] at [contact]."
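
The dead-end pitfall in particular can be checked mechanically if the decision tree is kept as data. The sketch below assumes a hypothetical convention in which each branch target is either another node name or a terminal string that starts with "FIX:", "RUNBOOK:", or "ESCALATE:"; none of these names come from an existing tool, and the check does not detect cycles.

```python
from typing import Dict, List, Optional, Tuple

TERMINAL_PREFIXES = ("FIX:", "RUNBOOK:", "ESCALATE:")


def dead_ends(tree: Dict[str, Dict[str, str]]
              ) -> List[Tuple[str, Optional[str], Optional[str]]]:
    """Return (node, answer, target) triples that lead nowhere.

    tree maps a node name to its branches (answer -> target). A branch
    is acceptable when its target is another node or a terminal string.
    A node without any branches is reported as (node, None, None).
    """
    problems: List[Tuple[str, Optional[str], Optional[str]]] = []
    for name, branches in tree.items():
        if not branches:
            problems.append((name, None, None))
            continue
        for answer, target in branches.items():
            if target in tree or target.startswith(TERMINAL_PREFIXES):
                continue
            problems.append((name, answer, target))
    return problems


if __name__ == "__main__":
    example = {
        "db_reachable": {"yes": "pool", "no": "RUNBOOK: Database Down"},
        "pool": {"yes": "FIX: terminate long queries", "no": "check logs"},
    }
    print(dead_ends(example))
```

In the example, the "no" branch of `pool` is flagged because "check logs" is neither a node nor a terminal, which is the kind of vague ending the pitfall describes.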

Key Takeaways

Organize by symptom, not by cause. The reader knows what they see, not what is wrong.
Structure as decision trees with explicit branches, not essays or linear lists.
Every check has four parts: the command, normal output, abnormal output, and what to do next.
Write troubleshooting guides from real incidents, not from imagination. Update them after every new incident.
Create a symptom index that maps observable problems to the right guide and section.
Every branch of the decision tree must end with a resolution or an escalation path. No dead ends.