Feature Flags & Experiments

Feature flags let you deploy code to production without exposing it to users. This single capability transforms how you ship software. You can test in production with real data, run experiments with actual users, kill features that do not work, and undo changes without redeploying. Feature flags are the most underused tool in the startup engineer's toolkit.

What Feature Flags Are

A feature flag is a conditional that controls whether a piece of functionality is active:

Simplest possible feature flag:
  const FLAGS = {
    newPricing: false,
    betaDashboard: true,
    darkMode: false,
  }

  if (FLAGS.newPricing) {
    showNewPricingPage()
  } else {
    showOldPricingPage()
  }

That is it. A boolean and an if statement. Everything else — targeting rules, percentage rollouts, A/B testing — is built on top of this foundation.
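As a minimal sketch of that foundation, the lookup can be wrapped in one function so that later additions such as targeting and rollouts have a single place to live. The isEnabled helper is a hypothetical name, not something this chapter defines; the assumption here is that unknown or misspelled flag names should evaluate to OFF.

```javascript
// Flag table mirroring the FLAGS object above.
const FLAGS = {
  newPricing: false,
  betaDashboard: true,
  darkMode: false,
};

// Central lookup (hypothetical helper): unknown flag names evaluate to OFF.
function isEnabled(name, flags = FLAGS) {
  return Object.prototype.hasOwnProperty.call(flags, name) && flags[name] === true;
}

console.log(isEnabled("betaDashboard")); // true
console.log(isEnabled("newPricng"));     // false (the typo falls back to OFF)
```

Routing every check through one function also makes it easy to find all call sites when the flag is removed.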

Types of Feature Flags

Release flags:
  Purpose: Ship incomplete code safely
  Lifetime: Days to weeks
  Example: New checkout flow that's 80% done. Deploy it, hide it,
           finish it, enable it.

Experiment flags:
  Purpose: A/B test with real users
  Lifetime: Weeks
  Example: Does the green button or the blue button convert better?
           Show each to 50% of users, measure, decide.

Ops flags:
  Purpose: Control operational behavior
  Lifetime: Permanent or semi-permanent
  Example: Disable email sending during a migration. Rate-limit a
           heavy endpoint during a traffic spike.

Permission flags:
  Purpose: Gate features to specific users
  Lifetime: Permanent
  Example: Beta features for early adopters. Premium features for
           paying customers. Admin tools for internal users.

Ship Code Without Exposing It

The primary use case: you are building a feature that takes a week. On day three, your teammate needs to merge a fix that touches code near yours. Without flags, your work sits on a long-lived branch, and one of you ends up blocked or resolving merge conflicts. With flags, you merge your incomplete work to main behind a flag. It is in production but invisible.

Development workflow with flags:
  Day 1: Start new reporting feature. Wrap it in a flag.
         Merge partial work to main. Flag is OFF.
  Day 2: More progress. Merge to main. Flag still OFF.
         Teammate deploys a fix. No conflicts.
  Day 3: Feature complete. Merge to main. Enable flag for your
         own account only. Test with real production data.
  Day 4: Enable flag for the team. Everyone tests.
  Day 5: Enable flag for 10% of users. Monitor metrics.
  Day 6: Enable for everyone. Remove the flag and old code path.
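The day-by-day rollout above can be sketched as a flag with a stage field. The stage names, the user shape ({ id, isInternal }), and the hashing scheme are assumptions made for illustration. The percentage stage hashes the flag name and user ID so that a given user stays in or out of the rollout between requests.

```javascript
// FNV-1a style 32-bit hash over UTF-16 code units: small and dependency-free.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force an unsigned result
}

// Stage names and the user shape are hypothetical.
function isEnabledFor(flag, user) {
  switch (flag.stage) {
    case "self": // Day 3: only the listed accounts
      return flag.userIds.includes(user.id);
    case "team": // Day 4: everyone marked internal
      return user.isInternal === true;
    case "percent": { // Day 5: internal users plus a stable slice of everyone else
      const bucket = fnv1a(`${flag.name}:${user.id}`) % 100; // 0..99
      return user.isInternal === true || bucket < flag.percent;
    }
    case "everyone": // Day 6
      return true;
    default: // Days 1-2 ("off") and any unrecognized stage
      return false;
  }
}

const flag = { name: "new_reporting", stage: "self", userIds: ["u_me"], percent: 10 };
console.log(isEnabledFor(flag, { id: "u_me", isInternal: true }));     // true
console.log(isEnabledFor(flag, { id: "u_other", isInternal: false })); // false
```

Moving from one day to the next is then a data change (the stage value), not a code change.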

Flickr popularized this workflow in 2009. They called it "flipping" and used it to deploy more than ten times per day. Their flags were simple config values that controlled which features were visible. The technique has not fundamentally changed since then; only the tooling around it has.

Testing in Production

Staging environments lie. They have different data, different traffic patterns, different integrations, and different timing. The only environment that accurately represents production is production.

Feature flags make production testing safe:

Production testing workflow:
  1. Deploy the feature behind a flag (flag OFF)
  2. Enable the flag for internal users only
  3. Use the feature with real production data
  4. Check error logs, performance metrics, database queries
  5. If everything looks good, expand to beta users
  6. If something is wrong, turn the flag OFF (instant rollback)

What this catches that staging misses:
  - Performance issues with real data volumes
  - Edge cases from real user behavior
  - Integration issues with live third-party services
  - Race conditions under real concurrency
  - Data migration issues with actual production data

Netflix is known for testing heavily in production. Their automated canary analysis deploys a change to a small percentage of servers, compares its metrics against a baseline, and rolls the change back if the metrics degrade. You do not need Netflix's scale to use this principle; you just need a flag and a dashboard.
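A small-scale version of the same idea is to compare the error rate of the flagged path against the baseline and switch the flag off when it is clearly worse. The sketch below is not Netflix's system; the function name, the counter shape ({ requests, errors }), and all thresholds are illustrative assumptions.

```javascript
// Decides whether the flagged path looks bad enough to switch the flag off.
// Thresholds are illustrative assumptions, not recommended values.
function shouldDisableFlag(baseline, flagged, options = {}) {
  const { minRequests = 200, maxRatio = 2, minErrorRate = 0.01 } = options;
  if (flagged.requests < minRequests) return false; // too little data to judge
  const baselineRate = baseline.requests > 0 ? baseline.errors / baseline.requests : 0;
  const flaggedRate = flagged.errors / flagged.requests;
  // Trip only when the flagged path is above an absolute floor AND more than
  // maxRatio times the baseline error rate.
  return flaggedRate > minErrorRate && flaggedRate > baselineRate * maxRatio;
}

const baseline = { requests: 10000, errors: 50 }; // 0.5% errors
const flagged = { requests: 1000, errors: 30 };   // 3% errors
console.log(shouldDisableFlag(baseline, flagged)); // true
```

The minimum request count keeps a single early error from tripping the switch before there is enough traffic to judge.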

A/B Testing With Real Users

Feature flags become an experimentation platform when you add percentage-based rollouts and measurement:

Simple A/B test setup:
  Flag: new_onboarding_flow
  Control (50%): Existing onboarding
  Variant (50%): New onboarding

  Measurement:
    - Activation rate (did the user complete onboarding?)
    - 7-day retention (did they come back?)
    - Time to first value (how long until they did the key action?)

  Decision criteria (set BEFORE the test):
    - If activation improves by >10%, ship it
    - If activation is within +/- 5%, keep the simpler version
    - If activation drops by >5%, kill the variant
    - Run for 2 weeks minimum or until 500 users per variant
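The 50/50 split above needs an assignment rule that gives the same user the same experience on every visit. One common approach, sketched here under assumptions, is to hash the flag name together with the user ID and take the result modulo 100. The function names and the in-memory exposures array are hypothetical stand-ins for your own analytics pipeline.

```javascript
// FNV-1a style 32-bit hash over UTF-16 code units.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0;
}

// Returns "control" or "variant". Including the flag name in the hash input
// keeps a user's bucket in one experiment unrelated to their bucket in another.
function assignVariant(flagName, userId, variantPercent = 50) {
  const bucket = fnv1a(`${flagName}:${userId}`) % 100; // 0..99
  return bucket < variantPercent ? "variant" : "control";
}

// Record each exposure so metrics can later be split by group.
// The array stands in for whatever analytics store you use.
const exposures = [];
function showOnboarding(userId) {
  const group = assignVariant("new_onboarding_flow", userId);
  exposures.push({ userId, flag: "new_onboarding_flow", group });
  return group === "variant" ? "new onboarding" : "existing onboarding";
}

console.log(showOnboarding("user_42") === showOnboarding("user_42")); // true
```

Because the assignment is a pure function of the flag name and user ID, no per-user state has to be stored to keep the experience consistent.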

The critical discipline: set your success criteria before you start the experiment. If you decide what "winning" means after seeing the data, you will rationalize keeping whatever you built.
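One way to hold yourself to this is to write the criteria down as data before the test starts and let a function apply them. The sketch below encodes the thresholds from the setup above, with three assumptions worth flagging: the percentages are read as relative changes in activation rate, the stopping rule is read literally (14 days or 500 users per variant), and a lift between 5% and 10%, which the criteria do not cover, is reported as inconclusive.

```javascript
// Pre-registered criteria, committed before the experiment starts.
const CRITERIA = Object.freeze({
  shipAbove: 0.10,         // relative lift in activation needed to ship
  neutralBand: 0.05,       // within +/- 5%: keep the simpler version
  minDays: 14,
  minUsersPerVariant: 500,
});

// control and variant are { users, activated } counts (hypothetical shape).
function decide(control, variant, daysRunning, criteria = CRITERIA) {
  const enoughUsers =
    Math.min(control.users, variant.users) >= criteria.minUsersPerVariant;
  if (daysRunning < criteria.minDays && !enoughUsers) return "keep running";
  if (control.users === 0 || variant.users === 0 || control.activated === 0) {
    return "keep running"; // no baseline rate to compare against yet
  }
  const controlRate = control.activated / control.users;
  const variantRate = variant.activated / variant.users;
  const lift = (variantRate - controlRate) / controlRate;
  if (lift > criteria.shipAbove) return "ship variant";
  if (lift < -criteria.neutralBand) return "kill variant";
  if (Math.abs(lift) <= criteria.neutralBand) return "keep simpler version";
  return "inconclusive"; // lift between 5% and 10% (assumption, see above)
}

// 30% vs 35% activation is a relative lift of about 17%.
console.log(decide({ users: 600, activated: 180 }, { users: 600, activated: 210 }, 14)); // "ship variant"
```

Freezing the criteria object and keeping it in version control leaves a record of what "winning" meant before any data came in.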

Booking.com has reported running on the order of a thousand A/B tests at any given time. They attribute much of their growth to a culture of testing rather than debating. An engineer with an opinion loses to an engineer with data. You do not need their scale, but adopting the principle (test instead of argue) changes how your team makes product decisions.

What to Test

High-value experiments:
  - Onboarding flows (biggest impact on activation)
  - Pricing pages (biggest impact on revenue)
  - Core product interactions (biggest impact on retention)
  - Email subject lines (biggest impact on engagement)

Low-value experiments (usually not worth the effort):
  - Button colors (the effect is almost always negligible)
  - Minor copy changes in non-critical flows
  - Layout tweaks on low-traffic pages
  - Anything where the sample size would take months to reach
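To judge whether a test would take months, a rough sample-size estimate is enough. The sketch below uses Lehr's rule of thumb (about 16 * p(1 - p) / delta^2 users per variant for roughly 80% power at a 5% two-sided significance level). The baseline rate, lift, and traffic figures are made-up inputs for illustration, and the result is an approximation, not a substitute for a proper power analysis.

```javascript
// Lehr's rule of thumb: n per variant ~= 16 * p(1 - p) / delta^2,
// for roughly 80% power at a 5% two-sided significance level.
function usersPerVariant(baselineRate, relativeLift) {
  const variantRate = baselineRate * (1 + relativeLift);
  const delta = variantRate - baselineRate;       // absolute difference to detect
  const pooled = (baselineRate + variantRate) / 2; // average of the two rates
  return Math.ceil((16 * pooled * (1 - pooled)) / (delta * delta));
}

function daysToReach(usersNeeded, dailyUsersPerVariant) {
  return Math.ceil(usersNeeded / dailyUsersPerVariant);
}

// Illustrative inputs: 30% baseline activation, detecting a 10% relative lift.
const needed = usersPerVariant(0.30, 0.10);
console.log(needed);                  // roughly 3,800 users per variant
console.log(daysToReach(needed, 40)); // about three months at 40 users/day per variant
```

Smaller expected effects push the number up quickly, because the required sample grows with the inverse square of the difference you want to detect.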

Feature Flags as an Undo Button

When a feature causes problems in production, feature flags give you an instant rollback that does not require a deploy:

Without flags:
  1. Feature causes problems
  2. Identify the bad commit
  3. Write a revert
  4. Run tests
  5. Deploy the revert
  Total time: 15-60 minutes

With flags:
  1. Feature causes problems
  2. Turn off the flag
  Total time: 30 seconds

This changes the risk calculus of shipping. When the cost of reverting is near zero, the cost of trying something is near zero. Teams with feature flags ship more experiments because the downside of a failed experiment is a flag toggle, not an incident.

Keeping It Simple

The feature flag industry wants you to buy a complex platform with user segmentation, multivariate testing, audit logs, and governance workflows. For a startup with fewer than ten engineers, this is overkill.

What you actually need:
  - A config file or database table with flag names and boolean values
  - A way to change flag values without redeploying
  - Basic user targeting (at minimum: internal users vs everyone)

What you do NOT need yet:
  - A feature flag management platform ($$$)
  - Complex targeting rules
  - Percentage-based rollouts (until you have enough traffic)
  - Audit logs for flag changes
  - Flag lifecycle management
  - Mutual exclusion groups for experiments

Implementation Options

Simplest: Environment variables
  FEATURE_NEW_PRICING=true
  FEATURE_BETA_DASHBOARD=false
  Pro: Zero setup
  Con: Requires redeploy to change
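Reading the variables above could look like the following sketch. The FEATURE_ prefix comes from the example; the flagsFromEnv name, the camelCase conversion, and the choice to accept "true" or "1" as ON are assumptions.

```javascript
// Collects FEATURE_* variables into a flags object.
// Takes the environment as a parameter (defaulting to process.env in Node)
// so it can be exercised without touching the real environment.
function flagsFromEnv(env = process.env) {
  const prefix = "FEATURE_";
  const flags = {};
  for (const [key, value] of Object.entries(env)) {
    if (!key.startsWith(prefix)) continue;
    // FEATURE_NEW_PRICING -> newPricing
    const name = key
      .slice(prefix.length)
      .toLowerCase()
      .replace(/_([a-z0-9])/g, (match, char) => char.toUpperCase());
    // Anything other than "true" or "1" counts as OFF.
    const normalized = String(value).trim().toLowerCase();
    flags[name] = normalized === "true" || normalized === "1";
  }
  return flags;
}

const flags = flagsFromEnv({
  FEATURE_NEW_PRICING: "true",
  FEATURE_BETA_DASHBOARD: "false",
  PATH: "/usr/bin",
});
console.log(flags.newPricing);    // true
console.log(flags.betaDashboard); // false
```

Parsing once at startup matches the limitation noted above: the values only change when the process restarts.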

Better: Database table
  CREATE TABLE feature_flags (
    name TEXT PRIMARY KEY,
    enabled BOOLEAN NOT NULL DEFAULT false,
    updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
  );
  Pro: Change flags without deploying
  Con: Database lookup per request (cache it)
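The "cache it" advice can be sketched as a small time-based cache around whatever function loads the table. The loader is passed in, so no particular database client is assumed; createFlagCache, the 30-second default, and the injectable clock are all hypothetical choices for this sketch.

```javascript
// Wraps an async loader with a time-based cache.
// loadFlags: async function returning an object of flag name -> boolean.
// now: injectable clock, which makes the cache easy to test.
function createFlagCache(loadFlags, ttlMs = 30000, now = Date.now) {
  let cached = null;
  let loadedAt = 0;
  return async function getFlags() {
    const expired = cached === null || now() - loadedAt >= ttlMs;
    if (expired) {
      try {
        cached = await loadFlags();
        loadedAt = now();
      } catch (err) {
        // If the database is unavailable, keep serving the last known values.
        if (cached === null) throw err;
      }
    }
    return cached;
  };
}

// Stand-in loader; in practice this would query the feature_flags table.
const getFlags = createFlagCache(async () => ({ newPricing: false, betaDashboard: true }));
getFlags().then((flags) => console.log(flags.betaDashboard)); // true
```

The TTL bounds how long a toggle takes to reach every process, so the "instant" rollback is really "within one TTL". Concurrent requests during a refresh may each trigger a load; that is usually acceptable for a table this small.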

Simple but effective: JSON config file served from CDN
  {
    "newPricing": false,
    "betaDashboard": { "enabled": true, "users": ["internal"] }
  }
  Pro: Fast, no database dependency
  Con: CDN cache delay
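The config above mixes two shapes: a bare boolean and an object with targeting. A sketch of an evaluator that handles both follows. It assumes the entries in "users" are group labels (such as "internal") and that each user carries a groups array; neither detail is specified by the example, so treat both as hypothetical.

```javascript
const config = JSON.parse(`{
  "newPricing": false,
  "betaDashboard": { "enabled": true, "users": ["internal"] }
}`);

// Evaluates a flag for a user. Assumes user.groups is a list of group labels.
function isEnabled(flagConfig, name, user) {
  const entry = flagConfig[name];
  if (typeof entry === "boolean") return entry;                  // simple on/off flag
  if (entry === null || typeof entry !== "object") return false; // unknown flag: OFF
  if (entry.enabled !== true) return false;
  if (!Array.isArray(entry.users)) return true;                  // no targeting: everyone
  const groups = Array.isArray(user.groups) ? user.groups : [];
  return entry.users.some((group) => groups.includes(group));
}

console.log(isEnabled(config, "betaDashboard", { groups: ["internal"] })); // true
console.log(isEnabled(config, "betaDashboard", { groups: [] }));           // false
console.log(isEnabled(config, "newPricing", { groups: ["internal"] }));    // false
```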

When you outgrow these: LaunchDarkly, Unleash (self-hosted), Flagsmith
  Threshold: When you have >20 flags or need percentage rollouts
  with statistical analysis

For most startups in the first two years, a database table with a simple admin page to toggle flags is more than sufficient. You can build it in an afternoon.

Flag Hygiene

Feature flags have a cost: every flag is a branch in your code. If you never clean them up, you end up with a codebase full of dead branches, unclear behavior, and "what does this flag do?" questions.

Flag lifecycle:
  1. Create the flag (name it clearly: new_checkout_flow, not flag_123)
  2. Ship the code behind the flag
  3. Test and roll out
  4. Confirm the feature is stable and the old path is no longer needed
  5. Remove the flag AND the old code path
  6. Deploy the cleanup

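Target: release and experiment flags should live for 2-4 weeks, max.
If one has been around for 3 months, either ship the feature or kill it.
Ops and permission flags are long-lived by design: review them regularly, but do not put them on this clock.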

Flag Naming Conventions

Good flag names:
  new_onboarding_flow
  enable_dark_mode
  beta_analytics_dashboard
  experiment_pricing_v2

Bad flag names:
  flag1
  test
  johns_feature
  temp_fix_delete_later (it will not be deleted later)

The Flag Cleanup Ritual

Once a week (or once a sprint), review your flag list:

Flag review checklist:
  For each flag:
    - Is it still needed?
    - If it's fully rolled out, remove the flag and the old code path
    - If it's been OFF for more than 2 weeks, delete the feature code
    - If nobody knows what it does, investigate and document or remove
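The checklist above can be turned into a small report that runs before the review meeting. This sketch assumes each flag record carries fields beyond the simple table shown earlier: rolloutPercent, description, and longLived are hypothetical additions, and updatedAt mirrors the updated_at column.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Suggests an action for one flag record:
// { name, enabled, rolloutPercent, description, longLived, updatedAt }
function reviewFlag(flag, now = new Date()) {
  const daysSinceChange =
    (now.getTime() - new Date(flag.updatedAt).getTime()) / DAY_MS;
  if (!flag.description) return "investigate, then document or remove";
  if (flag.longLived) return "keep"; // ops and permission flags
  if (flag.enabled && flag.rolloutPercent === 100) return "remove flag and old code path";
  if (!flag.enabled && daysSinceChange > 14) return "delete feature code";
  return "keep";
}

const now = new Date("2024-03-01T00:00:00Z");
const flags = [
  { name: "new_checkout_flow", enabled: true, rolloutPercent: 100, description: "New checkout", updatedAt: "2024-02-20T00:00:00Z" },
  { name: "beta_analytics_dashboard", enabled: false, rolloutPercent: 0, description: "Analytics beta", updatedAt: "2024-01-10T00:00:00Z" },
  { name: "flag1", enabled: true, rolloutPercent: 50, description: "", updatedAt: "2024-02-25T00:00:00Z" },
];
for (const flag of flags) {
  console.log(`${flag.name}: ${reviewFlag(flag, now)}`);
}
// new_checkout_flow: remove flag and old code path
// beta_analytics_dashboard: delete feature code
// flag1: investigate, then document or remove
```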

Real-World Flag Usage

GitHub:
  Uses feature flags extensively for Copilot, Actions, and new UI features.
  Internal users see features weeks before public release. Gradual rollout
  to percentages of users. Instant kill switch if metrics degrade.

Etsy:
  Pioneered "config flags" as part of their continuous deployment culture.
  Every feature was behind a flag. Engineers could enable features for
  themselves, for the office, or for a percentage of traffic. This let
  them deploy 50+ times per day with confidence.

Reddit:
  Uses feature flags to manage experiments on the platform. Subreddit-level
  targeting, user-level targeting, and percentage rollouts. They test
  feed algorithms, UI changes, and new features continuously.

Common Pitfalls

Too many flags. If you have 50 active flags, nobody knows which combination of flags represents the "real" product. Keep the active flag count under 15. Clean up aggressively.
Flags that never get removed. The old code path behind a flag that has been ON for six months is dead code that looks alive. It clutters the codebase and confuses new engineers. Set a calendar reminder to remove flags after rollout.
Testing only the flag-ON path. When you have a flag, you have two versions of your code. Test both. Users on the old path still need a working product.
Using flags to avoid hard decisions. "Let's put it behind a flag" can become a way to ship something without committing to it. Release and experiment flags should be temporary. If you cannot decide whether to ship a feature, a flag does not solve the decision problem.
Complex flag dependencies. If enabling flag A requires flag B to be on, and flag B requires flag C, you have created a configuration management nightmare. Keep flags independent.
Buying a platform too early. A feature flag SaaS can cost $100-500/month and adds a third-party dependency to your flag evaluation path. A database table with five rows costs nothing and does everything you need for the first year.
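For the "testing only the flag-ON path" pitfall, one lightweight habit is to run each check once per flag state. The sketch below uses plain functions instead of a specific test framework; pricingPage and checkBothPaths are hypothetical names, and the flag-dependent code is assumed to take its flags as an argument so both states can be exercised.

```javascript
// Flag-dependent code under test. Taking flags as a parameter makes both
// paths reachable from a test without touching global state.
function pricingPage(flags) {
  return flags.newPricing ? "new pricing page" : "old pricing page";
}

// Runs the same check once with the flag ON and once with it OFF.
function checkBothPaths(flagName, baseFlags, check) {
  for (const value of [true, false]) {
    check({ ...baseFlags, [flagName]: value }, value);
  }
}

checkBothPaths("newPricing", { betaDashboard: true }, (flags, isOn) => {
  const page = pricingPage(flags);
  const expected = isOn ? "new pricing page" : "old pricing page";
  if (page !== expected) {
    throw new Error(`newPricing=${isOn}: expected "${expected}", got "${page}"`);
  }
});
console.log("both paths pass");
```

The same helper works inside any test runner: call it from a test case and let the check throw on failure.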

Key Takeaways

Feature flags decouple deployment from release. Deploy code continuously; release features when they are ready by toggling flags.
Start with the simplest implementation: a database table with boolean values and a basic admin UI. You do not need a platform until you have 20+ flags or need statistical experimentation.
Use flags to test in production with real data and real traffic. Staging environments do not replicate production conditions — production does.
Set success criteria before running experiments. If you decide what "winning" means after seeing data, you will rationalize keeping whatever you built.
Clean up flags aggressively. A release or experiment flag should live for two to four weeks. If it has been around for three months, either ship the feature permanently or kill it.