Design Doc Structure

A good design doc structure does two things: it guides the writer through the decisions they need to make, and it helps the reader find the information they need. Every section exists for a reason. Understanding those reasons lets you adapt the structure to your context rather than filling in a template mechanically.

The sections below are not a rigid format. They are a proven sequence that works for most technical design documents. Use them as a starting point and adjust based on what your particular design needs to communicate.

The Core Sections

A design doc typically contains these sections, in this order:

1. Context & Problem Statement
2. Goals & Non-Goals
3. Proposed Solution
4. Alternatives Considered
5. Risks & Mitigations
6. Timeline & Milestones

Some teams add sections for metrics, security considerations, or operational readiness. These are useful additions when relevant. But the six sections above form the backbone. If you write nothing else, write these.

Context & Problem Statement

This section answers: why are we doing this? What is the current situation, and what is wrong with it?

The context section grounds the reader. Not everyone reading your design doc has the same background knowledge. A new team member, an engineer from another team, a future reader six months from now — they all need enough context to understand the problem before evaluating your solution.

Weak context:
"The payment service is slow. We need to make it faster."

Strong context:
"The payment service processes 50,000 transactions per day.
Over the past quarter, p99 latency has increased from 200ms
to 1.8 seconds. This correlates with 15% month-over-month
growth in transaction volume.

The latency increase is causing two business problems:
1. Cart abandonment has increased 8% in the past month,
   correlated with payment page load times exceeding 2 seconds.
2. Our payment provider's webhook timeout is 3 seconds. At
   current p99 latency, 5% of webhooks are timing out, causing
   payment status updates to fail silently.

If we take no action, we project that p99 latency will exceed
3 seconds by Q3, at which point webhook failures become
systematic rather than intermittent."

The strong version gives specific numbers, connects technical metrics to business impact, and explains what happens if nothing changes. A reader understands not just that there is a problem but how urgent it is and what is at stake.
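Numbers such as a p99 latency usually come from raw measurements. As a minimal sketch, assuming you have a list of request latencies in milliseconds, a nearest-rank percentile can be computed as follows. The `percentile` helper and the sample values are hypothetical and exist only for illustration; most monitoring systems compute percentiles for you, sometimes with a different interpolation method.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples <= it.

    `pct` is expected to be an integer from 1 to 100.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(samples)
    # 1-based rank; multiply before dividing to keep the arithmetic exact
    # for integer percentages.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds, for illustration only.
latencies_ms = [120, 135, 150, 180, 210, 240, 400, 950, 1800, 2600]
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
```

With only ten samples the p99 is simply the slowest request, which is one reason to state the sample size or time window next to any percentile you quote.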

Keep the context section focused. Provide enough background for an informed reader to evaluate your solution. Do not write a history of the entire system. If there is extensive background that some readers will need, link to it rather than inlining it.

Goals & Non-Goals

Goals define what success looks like. Non-goals define what you are explicitly not trying to achieve. Both are essential.

Goals should be specific and measurable when possible:

Weak goals:
- Make the payment service faster
- Improve reliability

Strong goals:
- Reduce p99 payment processing latency to under 500ms
- Eliminate webhook timeout failures (currently 5% of transactions)
- Support 3x current transaction volume without latency regression
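
A measurable goal can be sanity-checked against the numbers in the context section. As a rough sketch, assuming the 15% monthly growth from the example context continues at a constant compound rate (an assumption; real growth is rarely that steady), the following estimates how long a 3x volume target would last. The function names are hypothetical.

```python
import math


def months_until_multiple(monthly_growth_rate, target_multiple):
    """Months of compound growth until volume reaches target_multiple x today's volume."""
    if monthly_growth_rate <= 0:
        raise ValueError("monthly_growth_rate must be positive")
    if target_multiple <= 1:
        return 0.0
    return math.log(target_multiple) / math.log(1 + monthly_growth_rate)


def projected_volume(current_per_day, monthly_growth_rate, months):
    """Daily volume after `months` of compound growth."""
    return current_per_day * (1 + monthly_growth_rate) ** months


# Figures taken from the example context: 50,000 transactions/day, 15% per month.
headroom_months = months_until_multiple(0.15, 3)
volume_in_a_year = projected_volume(50_000, 0.15, 12)
```

Under that assumption, 3x headroom lasts a little under eight months. Stating the estimate next to the goal lets reviewers judge whether the target is ambitious enough.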

Non-goals are at least as important as goals. They prevent scope creep and set expectations. They tell reviewers "I know this related thing exists, and I am deliberately not addressing it here."

Non-goals:
- Migrating off the current payment provider (planned for Q1
  next year as a separate initiative)
- Supporting multi-currency transactions (will be addressed
  after the latency work)
- Redesigning the payment UI (frontend team owns this separately)

Non-goals are not things you think are unimportant. They are things that are important but out of scope for this particular effort. Stating them explicitly prevents the inevitable "but what about..." feedback that derails design reviews.

Bad non-goals (obvious, adds no value):
- Non-goal: We will not rewrite the entire company codebase.
- Non-goal: We will not solve world hunger.

Good non-goals (reasonable things someone might expect):
- Non-goal: This design does not address the separate latency
  issue in the shipping calculation service, even though it
  contributes to overall checkout latency.
- Non-goal: We are not changing the database schema. The current
  schema supports the proposed changes without modification.

A good non-goal addresses something a reasonable reviewer might ask about: "are you also doing X?" The non-goal preemptively answers "no, and here's why."

Proposed Solution

This is the core of the document. It describes what you want to build and how it works. The level of detail should be sufficient for a reviewer to evaluate the approach and for an implementer to begin work without major ambiguity.

Structure the proposed solution to be scannable:

### Overview
A high-level summary of the approach in 2-3 paragraphs.
Someone reading only this subsection should understand
the general shape of the solution.

### Detailed Design
The specifics: data flow, API contracts, storage schema,
processing logic. This is where the technical depth lives.

### Data Model
If the solution involves new or changed data structures,
describe them explicitly. Schema changes, new tables, new
fields, data formats.

### API Changes
If the solution affects APIs (internal or external), describe
the changes. New endpoints, modified request/response formats,
deprecations.

A common mistake is writing the proposed solution as a narrative: "First we'll do X, then we'll do Y, then we'll do Z." This works for simple designs but falls apart for complex ones. Use subsections to organize by component or concern rather than by implementation sequence.

Narrative style (hard to review for complex designs):
"We'll start by adding a Redis cache in front of the database.
Then we'll modify the payment processing pipeline to check
the cache first. Then we'll add cache invalidation logic..."

Component style (easier to review):
### Caching Layer
How the Redis cache works, what it stores, TTL policy,
cache invalidation strategy.

### Payment Processing Pipeline Changes
How the pipeline is modified to use the cache, fallback
behavior on cache misses, error handling.

### Monitoring & Observability
Cache hit rates, latency dashboards, alerting thresholds.

The component style lets a reviewer focus on one piece at a time. It also makes it easier for multiple reviewers with different expertise to review the sections most relevant to them.

Alternatives Considered

This is the most important section of a design doc. Many engineers treat it as an afterthought — a box to check before publishing. This is a mistake.

The alternatives section is where you demonstrate rigor. It shows that your proposed solution was not simply the first idea you had, adopted uncritically. It shows that you evaluated the option space and made a deliberate choice. It builds trust with reviewers because they can see that you considered the approaches they would have suggested.

Weak alternatives section:
"We considered using DynamoDB but decided against it."

Strong alternatives section:
"### Alternative A: Vertical Scaling (Increase Database Resources)
We could upgrade the database instance from db.r5.2xlarge to
db.r5.8xlarge.

Pros:
- No code changes required
- Immediate improvement (provisioning takes ~15 minutes)

Cons:
- Cost increase: $3,200/month to $12,800/month
- Only buys ~6 months before we hit the same problem at the
  next instance size
- Does not address the architectural issue (synchronous writes
  in the hot path)

Rejected because: this is a temporary fix that costs 4x as much
per month without addressing the root cause.

### Alternative B: Application-Level Sharding
We could shard the payment database by merchant ID.

Pros:
- Linear horizontal scaling
- Each shard handles a fraction of the load

Cons:
- Cross-shard queries become complex (reporting, reconciliation)
- Shard rebalancing is operationally expensive
- Requires significant code changes to the data access layer
- Team has no experience operating sharded databases

Rejected because: the operational complexity is disproportionate
to the problem. Our projected growth for the next 2 years does
not require horizontal scaling — caching handles it."

For each alternative, include: what it is, why it would work (pros), why it would not (cons), and why you rejected it. The rejection reason is the most important part. It shows your decision criteria.

A strong alternatives section also preempts the most common design review feedback: "Have you considered X?" If X is in your alternatives section with a clear rejection reason, the reviewer's question is already answered.

Three to four alternatives is typical. Fewer than two suggests you did not explore the space. More than five suggests some of the alternatives are not meaningfully distinct.

Risks & Mitigations

Every design has risks. Pretending otherwise destroys credibility. The risks section demonstrates that you have thought about what could go wrong and have a plan for it.

### Risk: Cache Inconsistency
If the cache and database diverge, users may see stale payment
status.

Likelihood: Medium. Cache invalidation bugs are common.

Impact: High. Stale payment status can cause duplicate charges
or missed payments.

Mitigation: Implement cache-aside pattern with a short TTL
(60 seconds). Add a reconciliation job that runs every 5 minutes
to detect and correct inconsistencies. Add monitoring to alert
if the inconsistency rate exceeds 0.1%.

### Risk: Redis Cluster Failure
If the Redis cluster goes down, the service loses its caching
layer.

Likelihood: Low. Redis Cluster has automatic failover.

Impact: Medium. Service falls back to direct database queries.
Latency increases but service remains functional.

Mitigation: Implement circuit breaker pattern. On cache failure,
bypass cache and query database directly. Alert ops team for
manual investigation. Test this failover path quarterly.

Structure each risk with: what could go wrong, how likely it is, how bad it would be, and what you will do about it. This format lets reviewers quickly assess whether your risk analysis is reasonable and whether your mitigations are sufficient.
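
Mitigations are easier to review when the intended behavior is concrete. The two example mitigations above (cache-aside reads with a 60-second TTL, and bypassing the cache when it is unavailable) can be sketched as follows. This is a minimal illustration, not a production design: `InMemoryCache`, `CacheUnavailable`, and `get_payment_status` are hypothetical names, the in-memory class stands in for a real cache client such as Redis, and the fallback shown is a simple bypass rather than a full circuit breaker with failure counting and recovery.

```python
import time


class CacheUnavailable(Exception):
    """Raised by the cache client when the cache cannot be reached."""


class InMemoryCache:
    """Hypothetical stand-in for a real cache client; entries expire after a TTL."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock
        self.available = True  # flip to False to simulate an outage

    def get(self, key):
        if not self.available:
            raise CacheUnavailable(key)
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key, value, ttl_seconds):
        if not self.available:
            raise CacheUnavailable(key)
        self._data[key] = (value, self._clock() + ttl_seconds)


def get_payment_status(payment_id, cache, load_from_db, ttl_seconds=60):
    """Cache-aside read: try the cache, fall back to the database on a miss or outage."""
    try:
        cached = cache.get(payment_id)
        if cached is not None:
            return cached
    except CacheUnavailable:
        # Cache is down: bypass it and serve directly from the database.
        return load_from_db(payment_id)

    value = load_from_db(payment_id)
    try:
        cache.set(payment_id, value, ttl_seconds)
    except CacheUnavailable:
        pass  # failing to populate the cache must not fail the request
    return value
```

A sketch like this belongs in the design doc only if the behavior is contested. Otherwise the prose description of the mitigation is enough.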

Avoid two extremes: listing every possible risk (which dilutes the important ones) and listing no risks (which signals either overconfidence or lack of analysis). Three to five risks is typical for most designs.

Timeline & Milestones

The timeline section answers: how long will this take, and how will we know we are on track?

### Phase 1: Foundation (Weeks 1-2)
- Set up Redis Cluster in staging environment
- Implement cache-aside pattern in payment service
- Write integration tests for cache hit/miss/invalidation
Milestone: Cache operational in staging with all tests passing.

### Phase 2: Rollout (Weeks 3-4)
- Deploy to production with cache disabled (feature flag)
- Enable cache for 5% of traffic, monitor for 48 hours
- Gradually increase to 100% over one week
Milestone: Cache serving 100% of production traffic with
p99 latency under 500ms.

### Phase 3: Hardening (Week 5)
- Implement reconciliation job
- Add monitoring dashboards and alerts
- Conduct failover testing
- Update runbooks
Milestone: All monitoring in place, failover tested, runbooks
reviewed by ops team.

Milestones are checkpoints that let the team and stakeholders verify progress. Each milestone should be a verifiable state — something you can demonstrate or measure, not a vague activity like "work on caching."
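
The gradual rollout in the Phase 2 example depends on sending a fixed share of traffic through the new path. One common way to do that is to hash a stable identifier into one of 100 buckets. The sketch below assumes that approach; `in_rollout` and the `salt` value are hypothetical, and a real feature-flag system would typically provide this for you.

```python
import hashlib


def in_rollout(entity_id, percent, salt="payment-cache"):
    """Return True if entity_id falls in the first `percent` of 100 stable buckets.

    The same entity always lands in the same bucket, so raising `percent`
    only adds entities to the rollout and never removes them.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(f"{salt}:{entity_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


# Example: roughly 5% of these hypothetical merchant IDs use the cache.
enabled = [m for m in (f"merchant-{i}" for i in range(1000)) if in_rollout(m, 5)]
```

Because bucketing is deterministic, a milestone such as "cache enabled for 5% of traffic" describes a reproducible state rather than a random sample that changes on every request.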

Include staffing requirements if relevant: how many engineers are needed, any specialized skills, and dependencies on other teams.

What to Leave Out

Knowing what to exclude is as important as knowing what to include:

- Implementation details that do not affect the design. Which variable names to use, how to structure the code internally, which testing framework to pick: these are implementation decisions, not design decisions. Including them makes the doc too long and shifts the review from architecture to code.
- Exhaustive error handling for every edge case. Cover the major failure modes. The implementation will handle the rest. If you try to enumerate every possible error, the doc becomes a specification, not a design.
- Performance benchmarks you have not run. If you have not measured it, do not claim it. Say "we expect approximately X based on [reasoning]" and note that you will validate during implementation.
- Background that most readers already know. If everyone on the team knows the payment service architecture, do not explain it. Link to existing documentation for readers who need it.
- Future work that is not committed. "In the future, we could also..." sections are rarely useful. If the future work is important, it should be in a separate design doc. If it is speculative, it does not belong in this document.

Common Pitfalls

Skipping the alternatives section. This is the most commonly skipped section and the most damaging omission. Without alternatives, reviewers cannot evaluate whether your approach is the best option — only whether it is a viable option. These are different questions.

Writing the doc after the decision is made. If the team has already decided on the approach and the doc is retroactive justification, the review is theater. Write the doc before committing to an approach. If circumstances force a quick decision, at least write the doc before implementation begins.

Making every section the same length. Some sections need more depth than others. A simple problem statement should be short. A complex proposed solution should be long. A risks section that covers well-understood risks can be brief. Do not pad short sections or compress important ones to achieve uniform length.

Omitting the problem statement. Jumping straight to the solution without establishing the problem leaves reviewers guessing about your constraints and success criteria. The solution only makes sense in the context of the problem it solves.

Writing for approval rather than feedback. If your design doc reads like a sales pitch — all benefits, no trade-offs, no risks — reviewers will not trust it. Honest assessment of downsides builds more credibility than a polished presentation of only the upside.

Not specifying who will review the doc. Identify reviewers explicitly. Include people from teams that will be affected by the change, people with relevant domain expertise, and at least one person who tends to ask hard questions. When no one is named as a reviewer, no one reviews the doc.

Key Takeaways

- The six core sections (Context, Goals/Non-Goals, Proposed Solution, Alternatives Considered, Risks, and Timeline) form a proven structure that works for most design documents.
- Each section serves a specific purpose. Context grounds the reader, goals define success, the solution describes the approach, alternatives demonstrate rigor, risks show awareness, and the timeline sets expectations.
- The Alternatives Considered section is the most important. It demonstrates that you evaluated the option space deliberately and builds reviewer trust.
- Non-goals are as important as goals. They prevent scope creep and preempt "but what about..." feedback.
- Write for your reviewer: enough detail to evaluate the design, not so much that the doc becomes an implementation spec.
- Leave out implementation details, speculative future work, and background that most readers already know.