Bus Factor & Knowledge Sharing

If one person gets sick and the company stops, your bus factor is 1. This is the most common and most dangerous risk at early-stage startups. It is not about buses. It is about vacations, burnout, job changes, and the flu. At a company with fewer than five people, every person is critical. The question is whether you have mitigated the risk or simply ignored it.

The bus factor is the smallest number of people whose sudden absence would stall the project. At most early-stage startups, the answer is one, and at some of them even a single week without the founder would be fatal.
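One way to make this measurable is to list, for each critical system, who can operate it today. The sketch below assumes the project stalls as soon as any one critical system has nobody left, so the bus factor is the size of the smallest group. The system names and people are hypothetical placeholders.

```python
def bus_factor(operators_by_system: dict[str, set[str]]) -> int:
    """Smallest number of people whose absence leaves some system unstaffed.

    Assumption: the project stalls as soon as any single critical system
    has nobody left who can operate it.
    """
    if not operators_by_system:
        raise ValueError("list at least one critical system")
    return min(len(people) for people in operators_by_system.values())


def weakest_systems(operators_by_system: dict[str, set[str]]) -> list[str]:
    """Systems whose operator count equals the bus factor, sorted by name."""
    factor = bus_factor(operators_by_system)
    return sorted(
        name
        for name, people in operators_by_system.items()
        if len(people) == factor
    )


# Hypothetical team: names and systems are placeholders.
team = {
    "deploys": {"ana"},
    "billing": {"ana", "ben"},
    "database": {"ben"},
}
print(bus_factor(team))       # 1
print(weakest_systems(team))  # ['database', 'deploys']
```

The second function is the useful one in practice: it tells you which systems to share first.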

Why Bus Factor 1 Is the Default

Common knowledge silos at small startups:
  - Only one person knows how to deploy
  - Only one person has the cloud provider credentials
  - Only one person understands the billing system
  - Only one person has the database access
  - Only one person knows why the cron job runs at 3am
  - Only one person has the relationship with the key customer

This happens naturally. It is not negligence — it is the consequence of moving fast with a tiny team. When one person builds a system, one person understands that system. Knowledge sharing requires deliberate effort, and deliberate effort requires time that feels better spent shipping features.

The result: the engineer who built the payment integration goes on vacation, a customer has a billing issue, and nobody can fix it for a week.

When Bus Factor 1 Is Acceptable

Honesty first: at the very earliest stage, bus factor 1 is sometimes the only realistic option.

Bus factor 1 is acceptable when:
  - You are a solo founder pre-revenue
  - You are in the first 2-3 months of building with a co-founder
  - The entire codebase is small enough to understand in a day
  - You are explicitly trading risk for speed, with eyes open
  - The "bus" scenario would mean shutting down, and you're OK with
    that risk at this stage

Bus factor 1 is NOT acceptable when:
  - You have paying customers who depend on your product
  - You have raised funding and have obligations to investors
  - The codebase is too large for someone new to understand quickly
  - You are past the first 6 months of operation
  - Anyone on the team is showing signs of burnout

The transition from "acceptable risk" to "unacceptable risk" happens gradually, and most teams miss it. The moment you have paying customers, bus factor 1 is a business continuity problem, not just a technical one.

The "Deploy From a New Laptop" Test

The simplest test of knowledge sharing: can a new team member (or your co-founder, or a contractor) go from a blank laptop to a running development environment and deploy to production?

The test:
  1. Clone the repo
  2. Follow the README
  3. Get the app running locally
  4. Make a small change
  5. Deploy to production

If any step requires calling someone for help, your documentation
has a gap. If any step requires credentials that only one person has,
your access management has a gap.

Run this test every three months. Seriously. Have someone who did not set up the infrastructure follow the README from scratch. Time it. The results are usually humbling.
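Timing the test is easier with a small harness. The sketch below runs a list of named steps in order, records how long each takes, and stops at the first failure, which is where the documentation or access gap is. The commands shown are harmless stand-ins, not real setup commands; replace them with whatever your README actually says.

```python
import subprocess
import sys
import time


def run_steps(steps: list[tuple[str, list[str]]]) -> list[dict]:
    """Run each step in order, timing it; stop at the first failure."""
    results = []
    for name, command in steps:
        started = time.monotonic()
        try:
            returncode = subprocess.run(command, capture_output=True).returncode
        except OSError:
            returncode = 127  # command could not be started: a prerequisite gap
        results.append({
            "step": name,
            "ok": returncode == 0,
            "seconds": round(time.monotonic() - started, 1),
        })
        if returncode != 0:
            break  # later steps depend on this one, so stop here
    return results


# Stand-in commands that only exit successfully. Swap in the real
# commands from your README (clone, setup, run, deploy).
noop = [sys.executable, "-c", "pass"]
steps = [
    ("clone the repo", noop),
    ("get the app running locally", noop),
    ("deploy to production", noop),
]
for result in run_steps(steps):
    print(result["step"], "ok" if result["ok"] else "FAILED", result["seconds"])
```

Keeping the results from each quarterly run makes it obvious whether onboarding is getting faster or slower.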

Stripe runs "developer experience" reviews regularly. New engineers track every friction point they encounter during onboarding. This feedback loop keeps documentation current and identifies knowledge silos before they become dangerous.

The README That Actually Works

Most READMEs are either empty or aspirational. A README that prevents bus factor problems is brutally practical:

README structure that works:
  # Project Name
  One sentence: what this does and who it's for.

  ## Prerequisites
  Exact versions. Not "Node.js" but "Node.js 20.x (use nvm)".
  Not "Postgres" but "Postgres 16 (brew install postgresql@16)".

  ## Setup
  Numbered steps. Copy-pasteable commands. Every step.
  No "you'll need to configure your environment" without
  saying exactly how.

  ## Environment Variables
  List every env var. Say where to get the value.
  Include a .env.example file with dummy values for local dev.

  ## Running Locally
  One command. If it takes more than one command, script it.
  "make dev" or "docker compose up" or "./scripts/dev.sh"

  ## Deploying
  Exact steps. Including "who has the credentials" if they
  are not in the shared password manager (fix this).

  ## Architecture
  One paragraph. Where the code lives, what talks to what.
  Not a diagram — a paragraph that someone can read in 30 seconds.

The bar: someone who has never seen your codebase can get it running in under 30 minutes by following the README alone.
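A cheap guard against an empty or aspirational README is a check that the sections above exist at all. This is a minimal sketch that assumes the sections are written as "## " headings with the names used in the structure above; it checks presence only, not whether the content is correct. The function name and sample text are hypothetical.

```python
import re

REQUIRED_SECTIONS = [
    "Prerequisites",
    "Setup",
    "Environment Variables",
    "Running Locally",
    "Deploying",
    "Architecture",
]


def missing_sections(readme_text: str) -> list[str]:
    """Return the required sections that have no '## <name>' heading."""
    headings = {
        match.group(1).lower()
        for match in re.finditer(
            r"^##[ \t]+(.+?)[ \t]*$", readme_text, re.MULTILINE
        )
    }
    return [name for name in REQUIRED_SECTIONS if name.lower() not in headings]


# Hypothetical README with three sections still missing.
sample = "# Demo\n\n## Prerequisites\n\n## Setup\n\n## Running Locally\n"
print(missing_sections(sample))
# ['Environment Variables', 'Deploying', 'Architecture']
```

Run against the real file in CI, a non-empty result can fail the build, which keeps the README from silently losing sections.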

Shared Access

Credentials and access are the most common bus factor problems, and the easiest to fix.

Shared access checklist:
  Cloud provider (AWS, GCP, etc.):
    □ At least 2 people have admin access
    □ Root account credentials are in a shared password manager
    □ MFA recovery codes are stored securely and shared

  Domain registrar:
    □ At least 2 people can manage DNS
    □ Domain is not registered under a personal account

  Code repository:
    □ Organization owns the repo, not a personal account
    □ At least 2 people have admin access

  Deployment:
    □ CI/CD credentials are stored in the pipeline's own secrets or environment variables, not in one person's local config
    □ At least 2 people can trigger a deploy manually

  Database:
    □ Connection strings are stored in the shared secret manager (not just in one person's .env)
    □ Backup procedures are automated and documented
    □ At least 2 people can access production data

  Third-party services (Stripe, SendGrid, etc.):
    □ Accounts are under a shared email (team@company.com)
    □ At least 2 people have login access
    □ API keys are in a shared secret manager

  Password manager:
    □ Using one (1Password, Bitwarden)
    □ Shared vault for team credentials
    □ Personal vaults for individual access

This checklist takes an afternoon to complete. Not completing it means a single person leaving, losing their laptop, or forgetting a password can lock the entire company out of its own infrastructure.
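Two of the checks repeat for almost every service: at least 2 admins, and no personal account as the owner. If you keep a simple inventory of services, those two can be checked mechanically. The sketch below is illustrative only: the Service fields, the list of personal email domains, and the sample data are all assumptions, and the remaining checklist items still need a human.

```python
from dataclasses import dataclass, field

# Illustrative, not exhaustive.
PERSONAL_DOMAINS = {"gmail.com", "outlook.com", "yahoo.com"}


@dataclass
class Service:
    name: str
    owner_email: str
    admins: list[str] = field(default_factory=list)


def access_gaps(services: list[Service], min_admins: int = 2) -> list[str]:
    """List every service that fails the admin-count or ownership check."""
    gaps = []
    for service in services:
        admin_count = len(set(service.admins))
        if admin_count < min_admins:
            gaps.append(f"{service.name}: only {admin_count} admin(s)")
        domain = service.owner_email.rsplit("@", 1)[-1].lower()
        if domain in PERSONAL_DOMAINS:
            gaps.append(f"{service.name}: registered under a personal email")
    return gaps


# Hypothetical inventory.
services = [
    Service("cloud provider", "team@company.com", ["ana", "ben"]),
    Service("domain registrar", "ana@gmail.com", ["ana"]),
]
for gap in access_gaps(services):
    print(gap)
# domain registrar: only 1 admin(s)
# domain registrar: registered under a personal email
```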

GitLab made their entire infrastructure configuration public and documented. After a famous database deletion incident in 2017, they rebuilt their processes around the principle that no single person should be a bottleneck for any critical operation. Their transparency about the incident, including live-streaming the recovery, became a model for how to handle these situations.

Pair Sessions for Knowledge Transfer

Documentation captures the "what." Pair sessions capture the "why."

Knowledge sharing pair sessions:
  Frequency: Once a week, 1 hour
  Format: One person walks the other through a system they built

  Session structure:
  1. "Here's what this system does" (5 min)
  2. "Here's how it works" — walk through the code (20 min)
  3. "Here's why I made these choices" (10 min)
  4. "Here are the gotchas and known issues" (10 min)
  5. The other person makes a small change and deploys (15 min)

  Outcome: The second person can now maintain the system at a basic level.
  Not expert-level. Enough to keep things running if the builder is
  unavailable.

This is not a full knowledge transfer. It is "break glass in case of emergency" knowledge. Enough to debug a problem, restart a service, or apply a hotfix. Deep expertise still lives with the builder, but basic operational capability is shared.

Basecamp (37signals) rotates team members across projects regularly. Not because they need to — because it prevents knowledge silos. The short-term cost (ramp-up time) is much lower than the long-term cost of a bus factor 1 system that nobody else can touch.

What to Document vs What to Pair On

Document (written, in the repo):
  - How to set up and run the project
  - How to deploy
  - Environment variables and where to get them
  - Architecture overview (one paragraph, not a novel)
  - Known issues and workarounds
  - Runbooks for common operational tasks

Pair on (verbal, scheduled sessions):
  - Why the architecture is shaped this way
  - What the tricky parts of the codebase are
  - What you tried that did not work
  - What you would change if you started over
  - How to debug specific categories of problems

The written documentation is for a cold start — someone new with no context. The pair sessions are for warm handoff — building intuition that documentation cannot capture.

Runbooks for Common Operations

A runbook is a step-by-step guide for operational tasks that happen regularly or during incidents.

Example runbook: Database migration
  When: Any time a schema change is needed
  Who can do it: Any engineer with production database access
  Steps:
    1. Run migrations on staging: make db-migrate-staging
    2. Verify staging works: run smoke tests
    3. Schedule production migration during low-traffic window (before 9am EST)
    4. Run migrations on production: make db-migrate-production
    5. Monitor error rates for 30 minutes
    6. If errors spike: rollback with make db-rollback-production
  Last updated: 2026-04-15 by [name]
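Once a runbook like this is stable, the ordering can be encoded so that nobody skips staging or forgets the rollback. The sketch below is one possible shape, not a definitive implementation: it reuses the make targets from the runbook, while the two check functions (smoke tests, error-rate monitoring) are hypothetical callables you would supply. Scheduling and the 30-minute watch are left to the caller, and the runbook does not say what to do if the production migration itself fails, so the sketch simply stops.

```python
import subprocess
from typing import Callable


def run_make(target: str) -> bool:
    """Run a make target; True if it exited with status 0."""
    return subprocess.run(["make", target]).returncode == 0


def migrate(
    run: Callable[[str], bool],
    smoke_tests_pass: Callable[[], bool],
    errors_normal: Callable[[], bool],
) -> str:
    """Follow the runbook order and roll back if errors spike."""
    if not run("db-migrate-staging"):
        return "stopped: staging migration failed"
    if not smoke_tests_pass():
        return "stopped: staging smoke tests failed"
    if not run("db-migrate-production"):
        return "stopped: production migration failed"
    if not errors_normal():
        run("db-rollback-production")
        return "rolled back: errors spiked after production migration"
    return "done"


# Dry run with stand-ins, so nothing touches a real database.
calls = []


def fake_run(target: str) -> bool:
    calls.append(target)
    return True


print(migrate(fake_run, smoke_tests_pass=lambda: True, errors_normal=lambda: False))
print(calls)
```

For a real run you would pass run_make as the first argument. The written runbook stays the source of truth; the script only enforces its order.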

Example runbook: Recovering from a failed deploy
  When: Deploy fails or causes errors in production
  Steps:
    1. Check deploy status: make deploy-status
    2. Rollback to previous version: make deploy-rollback
    3. Verify rollback succeeded: check health endpoint
    4. Investigate the failure in the deploy logs
    5. Fix the issue locally, test, and re-deploy

Runbooks sound bureaucratic for a two-person team. They are not. They are the difference between a 5-minute fix and a 2-hour panic when the one person who knows the procedure is asleep.
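A runbook only helps if it is current, and the "Last updated" line makes that checkable. The sketch below flags runbooks with no date or a date older than a threshold. The 90-day default is an assumption chosen to match a quarterly review, and the function name is hypothetical.

```python
import re
from datetime import date

LAST_UPDATED = re.compile(
    r"^\s*Last updated:\s*(\d{4}-\d{2}-\d{2})", re.MULTILINE
)


def is_stale(runbook_text: str, today: date, max_age_days: int = 90) -> bool:
    """True if the runbook has no valid 'Last updated' date or it is too old."""
    match = LAST_UPDATED.search(runbook_text)
    if match is None:
        return True  # an undated runbook cannot be trusted to be current
    try:
        updated = date.fromisoformat(match.group(1))
    except ValueError:
        return True  # an impossible date such as 2026-13-40
    return (today - updated).days > max_age_days


runbook = (
    "Example runbook: Database migration\n"
    "  Last updated: 2026-04-15 by [name]\n"
)
print(is_stale(runbook, today=date(2026, 5, 1)))  # False
print(is_stale(runbook, today=date(2026, 9, 1)))  # True
```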

Measuring Bus Factor

Simple bus factor audit:
  For each critical system, ask: "If the person who built this
  disappeared tomorrow, how long until someone else could:
    a) Understand what it does: ___
    b) Fix a simple bug: ___
    c) Deploy a change: ___
    d) Handle a production incident: ___"

  If any answer is "days" or "I don't know," that system is
  bus factor 1.

  Critical systems to audit:
    - Authentication & authorization
    - Payment processing
    - Core product functionality
    - Deployment pipeline
    - Database management
    - Monitoring & alerting
    - Customer data access

Run this audit quarterly. It takes 30 minutes and surfaces exactly where your risks are.
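The audit answers fit in a small table, which makes quarter-over-quarter comparison easy. The sketch below records each answer in hours, uses None for "I don't know," and treats anything of 24 hours or more as "days." That threshold, the question keys, and the sample numbers are assumptions made for illustration.

```python
from typing import Optional

QUESTIONS = ("understand", "fix_bug", "deploy", "handle_incident")


def is_bus_factor_one(hours: dict[str, Optional[float]]) -> bool:
    """True if any answer is unknown (None or missing) or a day or longer."""
    return any(
        hours.get(question) is None or hours[question] >= 24
        for question in QUESTIONS
    )


def at_risk(audit: dict[str, dict[str, Optional[float]]]) -> list[str]:
    """Names of the systems that fail the audit, sorted."""
    return sorted(
        name for name, hours in audit.items() if is_bus_factor_one(hours)
    )


# Hypothetical answers, in hours.
audit = {
    "payment processing": {
        "understand": 2, "fix_bug": 8, "deploy": None, "handle_incident": 48,
    },
    "deployment pipeline": {
        "understand": 1, "fix_bug": 2, "deploy": 1, "handle_incident": 4,
    },
}
print(at_risk(audit))  # ['payment processing']
```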

Common Pitfalls

"We'll document it later." You will not. Document as you build, or it never gets documented. Even a few sentences in a code comment are better than nothing.
Documentation that rots. A README from six months ago that describes a setup process that no longer works is worse than no README, because it wastes time and erodes trust in documentation. Keep docs in the repo, close to the code they describe, and update them when the code changes.
Confusing documentation with knowledge sharing. Documentation is necessary but not sufficient. Pair sessions build intuition and context that written documents cannot capture. Do both.
Over-documenting. A 50-page architecture document that nobody reads is not knowledge sharing. Keep documents short, practical, and focused on what someone needs to do, not what they need to understand.
Personal accounts for shared infrastructure. The domain registered under the founder's personal GoDaddy account. The AWS account under an engineer's personal email. The Stripe account that only one person can log into. Fix these on day one.

Key Takeaways

Bus factor 1 is acceptable only at the very earliest stage, before you have paying customers or significant codebase complexity. After that, it is a business continuity risk.
Run the "deploy from a new laptop" test every three months. If a new person cannot go from zero to deployed by following the README, the README is broken.
Use a shared password manager for all team credentials. Ensure at least two people have access to every critical service — cloud provider, domain registrar, database, payment processor.
Schedule weekly pair sessions where one person walks another through a system they built. The goal is not expertise transfer — it is "break glass" capability.
Keep runbooks for common operations in the repo. Short, step-by-step, copy-pasteable commands. Update them when the process changes.