Feature Flags & Experiments
Feature flags let you deploy code to production without exposing it to users. This single capability transforms how you ship software. You can test in production with real data, run experiments with actual users, kill features that do not work, and undo changes without redeploying. Feature flags are the most underused tool in the startup engineer's toolkit.
What Feature Flags Are
A feature flag is a conditional that controls whether a piece of functionality is active:
Simplest possible feature flag:
const FLAGS = {
  newPricing: false,
  betaDashboard: true,
  darkMode: false,
}

if (FLAGS.newPricing) {
  showNewPricingPage()
} else {
  showOldPricingPage()
}
That is it. A boolean and an if statement. Everything else — targeting rules, percentage rollouts, A/B testing — is built on top of this foundation.
Types of Feature Flags
Release flags:
Purpose: Ship incomplete code safely
Lifetime: Days to weeks
Example: New checkout flow that's 80% done. Deploy it, hide it,
finish it, enable it.
Experiment flags:
Purpose: A/B test with real users
Lifetime: Weeks
Example: Does the green button or the blue button convert better?
Show each to 50% of users, measure, decide.
Ops flags:
Purpose: Control operational behavior
Lifetime: Permanent or semi-permanent
Example: Disable email sending during a migration. Rate-limit a
heavy endpoint during a traffic spike.
Permission flags:
Purpose: Gate features to specific users
Lifetime: Permanent
Example: Beta features for early adopters. Premium features for
paying customers. Admin tools for internal users.
Ship Code Without Exposing It
The primary use case: you are building a feature that takes a week. On day three, your teammate needs to merge a fix that depends on code near yours. Without flags, you are either blocked or dealing with merge conflicts. With flags, you merge your incomplete work to main behind a flag. It is in production but invisible.
Development workflow with flags:
Day 1: Start new reporting feature. Wrap it in a flag.
Merge partial work to main. Flag is OFF.
Day 2: More progress. Merge to main. Flag still OFF.
Teammate deploys a fix. No conflicts.
Day 3: Feature complete. Merge to main. Enable flag for your
own account only. Test with real production data.
Day 4: Enable flag for the team. Everyone tests.
Day 5: Enable flag for 10% of users. Monitor metrics.
Day 6: Enable for everyone. Remove the flag and old code path.
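One way to express that day-by-day progression is to give the flag a stage and evaluate it per user. The sketch below is a minimal illustration, not a prescribed design: the stage names, the `owners` and `isInternal` fields, and the hash used for the percentage stage are all assumptions made for this example.

```javascript
// FNV-1a style 32-bit hash over UTF-16 code units (deterministic, no dependencies).
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Map a user to a stable bucket in [0, 100) for a given flag.
function bucketFor(flagName, userId) {
  return fnv1a(`${flagName}:${userId}`) % 100;
}

// Hypothetical flag shape: { name, stage, owners, percentage }.
// Hypothetical user shape: { id, isInternal }.
function isEnabled(flag, user) {
  const isOwner = flag.owners.includes(user.id);
  switch (flag.stage) {
    case "owner": // Day 3: the author's own account only
      return isOwner;
    case "team": // Day 4: the whole team
      return isOwner || user.isInternal === true;
    case "percentage": // Day 5: the team plus a slice of all users
      return (
        isOwner ||
        user.isInternal === true ||
        bucketFor(flag.name, user.id) < flag.percentage
      );
    case "on": // Day 6: everyone
      return true;
    default: // "off" (Days 1-2) or any unknown stage: fail closed
      return false;
  }
}
```

Moving from Day 3 to Day 5 is then a data change (for example, setting the stage to "percentage" with a percentage of 10), not a code change. Because the bucket is derived from the flag name and the user ID, a given user stays in or out of the rollout from one request to the next.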
Flickr pioneered this workflow in 2009. They called it "flipping" and used it to deploy ten times per day. Their flags were simple config values that controlled which features were visible. The technique has not fundamentally changed since then — just the tooling around it.
Testing in Production
Staging environments lie. They have different data, different traffic patterns, different integrations, and different timing. The only environment that accurately represents production is production.
Feature flags make production testing safe:
Production testing workflow:
1. Deploy the feature behind a flag (flag OFF)
2. Enable the flag for internal users only
3. Use the feature with real production data
4. Check error logs, performance metrics, database queries
5. If everything looks good, expand to beta users
6. If something is wrong, turn the flag OFF (instant rollback)
What this catches that staging misses:
- Performance issues with real data volumes
- Edge cases from real user behavior
- Integration issues with live third-party services
- Race conditions under real concurrency
- Data migration issues with actual production data
Netflix relies heavily on testing in production. Its automated canary analysis deploys a change to a small percentage of servers, compares their metrics against the baseline, and rolls back automatically if anything degrades. You do not need Netflix's scale to use this principle; you just need a flag and a dashboard.
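The compare-against-baseline idea fits in a few lines. This is a deliberately crude sketch and not a description of Netflix's system: it compares the error rate of requests served with the flag ON against requests served with it OFF, using a fixed tolerance rather than a statistical test. The field names (`requests`, `errors`) and both thresholds are assumptions for the example.

```javascript
// Decide whether a flag should be switched off, given request and error counts
// for the flag-OFF baseline and the flag-ON cohort.
// `maxErrorRateIncrease` and `minRequests` are illustrative defaults.
function shouldDisableFlag(baseline, canary, options = {}) {
  const { maxErrorRateIncrease = 0.01, minRequests = 200 } = options;

  // Too little traffic to judge either side; keep watching.
  if (baseline.requests < minRequests || canary.requests < minRequests) {
    return { disable: false, reason: "insufficient data" };
  }

  const baselineRate = baseline.errors / baseline.requests;
  const canaryRate = canary.errors / canary.requests;

  if (canaryRate - baselineRate > maxErrorRateIncrease) {
    return { disable: true, reason: "error rate degraded" };
  }
  return { disable: false, reason: "within tolerance" };
}

const verdict = shouldDisableFlag(
  { requests: 1000, errors: 10 }, // flag OFF: 1% errors
  { requests: 1000, errors: 50 }  // flag ON: 5% errors
);
console.log(verdict.reason);
```

You could run a check like this from a scheduled job and have it flip the flag off, or simply show the two rates side by side on a dashboard and make the call yourself.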
A/B Testing With Real Users
Feature flags become an experimentation platform when you add percentage-based rollouts and measurement:
Simple A/B test setup:
Flag: new_onboarding_flow
Control (50%): Existing onboarding
Variant (50%): New onboarding
Measurement:
- Activation rate (did the user complete onboarding?)
- 7-day retention (did they come back?)
- Time to first value (how long until they did the key action?)
Decision criteria (set BEFORE the test):
- If activation improves by >10%, ship it
- If activation is within +/- 5%, keep the simpler version
- If activation drops by >5%, kill the variant
- Run for at least 2 weeks, or longer if needed to reach 500 users per variant
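For a split like the one above to be measurable, each user has to land in the same group every time they show up. A common approach is to hash the experiment name together with the user ID and compare the result to the split point. The sketch below assumes string user IDs; the hash choice and the function names are illustrative, not a required implementation.

```javascript
// FNV-1a style 32-bit hash over UTF-16 code units (deterministic, no dependencies).
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Returns "control" or "variant". `variantShare` is the fraction of users
// who should see the variant (0.5 for a 50/50 split).
function assignVariant(experimentName, userId, variantShare = 0.5) {
  // Scale the 32-bit hash to a position in [0, 1).
  const position = fnv1a(`${experimentName}:${userId}`) / 0x100000000;
  return position < variantShare ? "variant" : "control";
}

const assigned = assignVariant("new_onboarding_flow", "user-42");
console.log(`user-42 sees: ${assigned}`);
```

Including the experiment name in the hash input means the same user can fall into different groups in different experiments, so one experiment's split does not leak into the next.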
The critical discipline: set your success criteria before you start the experiment. If you decide what "winning" means after seeing the data, you will rationalize keeping whatever you built.
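Writing the criteria down as code before the test starts is one way to hold yourself to them. The sketch below encodes the example thresholds from the setup above. It makes three assumptions you should check against your own intent: "improves by 10%" is read as relative lift over control, both the two-week minimum and the 500-user minimum must be met, and a lift between +5% and +10% (which the criteria do not cover) is reported as inconclusive.

```javascript
// Thresholds from the example criteria; field names are illustrative.
const CRITERIA = Object.freeze({
  shipAbove: 0.10,        // relative lift above which the variant ships
  killBelow: -0.05,       // relative lift below which the variant is killed
  neutralBand: 0.05,      // within +/- this, keep the simpler version
  minUsersPerVariant: 500,
  minDays: 14,
});

// `control` and `variant` are { users, activated } counts.
function decide(control, variant, daysRun, criteria = CRITERIA) {
  const enoughData =
    daysRun >= criteria.minDays &&
    control.users >= criteria.minUsersPerVariant &&
    variant.users >= criteria.minUsersPerVariant;
  if (!enoughData) return "keep running";

  const controlRate = control.activated / control.users;
  const variantRate = variant.activated / variant.users;
  if (controlRate === 0) return "inconclusive"; // relative lift is undefined

  const relativeLift = (variantRate - controlRate) / controlRate;
  if (relativeLift > criteria.shipAbove) return "ship variant";
  if (relativeLift < criteria.killBelow) return "kill variant";
  if (Math.abs(relativeLift) <= criteria.neutralBand) return "keep simpler version";
  return "inconclusive"; // assumption: the gap between +5% and +10%
}

// 40% vs 46% activation after two weeks is a 15% relative lift.
console.log(decide({ users: 600, activated: 240 }, { users: 600, activated: 276 }, 14)); // "ship variant"
```

Committing this file before the experiment launches gives you a timestamped record of what "winning" meant.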
Booking.com runs thousands of A/B tests simultaneously. They attribute much of their growth to a culture of testing rather than debating. An engineer with an opinion loses to an engineer with data. You do not need their scale, but adopting the principle — test instead of argue — changes how your team makes product decisions.
What to Test
High-value experiments:
- Onboarding flows (biggest impact on activation)
- Pricing pages (biggest impact on revenue)
- Core product interactions (biggest impact on retention)
- Email subject lines (biggest impact on engagement)
Low-value experiments (usually not worth the effort):
- Button colors (the effect is almost always negligible)
- Minor copy changes in non-critical flows
- Layout tweaks on low-traffic pages
- Anything where the sample size would take months to reach
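To judge whether an experiment would take months, a rough estimate is enough. The sketch below uses a widely quoted rule of thumb for comparing two conversion rates (about 80% power at a 5% significance level): users per variant is roughly 16 × p × (1 − p) / d², where p is the baseline rate and d is the absolute difference you want to detect. It is an approximation for planning, not a substitute for a proper power calculation, and the function names are made up for this example.

```javascript
// Approximate users needed per variant to detect a given relative lift.
// baselineRate: current conversion rate, e.g. 0.40
// relativeLift: smallest lift worth detecting, e.g. 0.10 for +10%
function usersPerVariant(baselineRate, relativeLift) {
  const absoluteDifference = baselineRate * relativeLift;
  const variance = baselineRate * (1 - baselineRate);
  return Math.ceil((16 * variance) / absoluteDifference ** 2);
}

// Days needed if `dailyUsers` new users enter the experiment each day,
// split evenly across `variants` groups.
function daysToReach(baselineRate, relativeLift, dailyUsers, variants = 2) {
  const totalUsers = usersPerVariant(baselineRate, relativeLift) * variants;
  return Math.ceil(totalUsers / dailyUsers);
}

console.log(usersPerVariant(0.4, 0.1), "users per variant");
console.log(daysToReach(0.4, 0.1, 50), "days at 50 new users per day");
```

With a 40% baseline and a 10% relative lift, this comes out to about 2,400 users per variant, or roughly three months at 50 new users per day. Smaller effects need far more: halving the lift you want to detect roughly quadruples the sample.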
Feature Flags as an Undo Button
When a feature causes problems in production, feature flags give you an instant rollback that does not require a deploy:
Without flags:
1. Feature causes problems
2. Identify the bad commit
3. Write a revert
4. Run tests
5. Deploy the revert
Total time: 15-60 minutes
With flags:
1. Feature causes problems
2. Turn off the flag
Total time: 30 seconds
This changes the risk calculus of shipping. When the cost of reverting is near zero, the cost of trying something is near zero. Teams with feature flags ship more experiments because the downside of a failed experiment is a flag toggle, not an incident.
Keeping It Simple
The feature flag industry wants you to buy a complex platform with user segmentation, multivariate testing, audit logs, and governance workflows. For a startup with fewer than ten engineers, this is overkill.
What you actually need:
- A config file or database table with flag names and boolean values
- A way to change flag values without redeploying
- Basic user targeting (at minimum: internal users vs everyone)
What you do NOT need yet:
- A feature flag management platform ($$$)
- Complex targeting rules
- Percentage-based rollouts (until you have enough traffic)
- Audit logs for flag changes
- Flag lifecycle management
- Mutual exclusion groups for experiments
Implementation Options
Simplest: Environment variables
FEATURE_NEW_PRICING=true
FEATURE_BETA_DASHBOARD=false
Pro: Zero setup
Con: Requires redeploy to change
Better: Database table
CREATE TABLE feature_flags (
  name TEXT PRIMARY KEY,
  enabled BOOLEAN NOT NULL DEFAULT false,
  updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);
Pro: Change flags without deploying
Con: Database lookup per request (cache it)
Simple but effective: JSON config file served from CDN
{
  "newPricing": false,
  "betaDashboard": { "enabled": true, "users": ["internal"] }
}
Pro: Fast, no database dependency
Con: CDN cache delay
When you outgrow these: LaunchDarkly, Unleash (self-hosted), Flagsmith
Threshold: When you have >20 flags or need percentage rollouts
with statistical analysis
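The JSON config option above mixes two shapes: a bare boolean and an object with `enabled` plus a `users` list. A small evaluator can handle both. This is a sketch under assumptions: it treats the entries in `users` as group names, and it expects each user object to carry a `groups` array, which is a hypothetical field introduced for the example.

```javascript
// Same shape as the JSON config example: boolean or { enabled, users }.
const config = {
  newPricing: false,
  betaDashboard: { enabled: true, users: ["internal"] },
};

// Returns true if the named flag is on for this user.
// `user.groups` is a hypothetical list of group names the user belongs to.
function isFlagOn(flags, name, user = { groups: [] }) {
  const entry = flags[name];
  if (entry == null) return false;               // unknown flag: fail closed
  if (typeof entry === "boolean") return entry;  // simple on/off flag
  if (entry.enabled !== true) return false;      // targeted flag, switched off
  if (!Array.isArray(entry.users)) return true;  // no targeting: on for everyone
  return entry.users.some((group) => user.groups.includes(group));
}

console.log(isFlagOn(config, "betaDashboard", { groups: ["internal"] })); // true
console.log(isFlagOn(config, "betaDashboard", { groups: ["customer"] })); // false
```

Returning false for unknown flags means a typo in a flag name hides the feature instead of exposing it, which is usually the safer failure.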
For most startups in the first two years, a database table with a simple admin page to toggle flags is more than sufficient. You can build it in an afternoon.
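If flags live in a database, you probably do not want a query on every request. A short in-memory cache with a time-to-live is usually enough. In the sketch below, `loadFlags` is a hypothetical async function you would write yourself to read the flag table and return an object of name/boolean pairs; the 30-second default TTL is an arbitrary choice.

```javascript
// Wraps a flag source with a TTL cache.
// loadFlags: hypothetical async function returning { flagName: boolean, ... }
// now: injectable clock, so the cache can be tested without waiting.
function createFlagStore(loadFlags, { ttlMs = 30000, now = Date.now } = {}) {
  let cache = {};
  let loadedAt = -Infinity; // forces a load on first use

  async function refreshIfStale() {
    if (now() - loadedAt < ttlMs) return;
    try {
      cache = await loadFlags();
      loadedAt = now();
    } catch (error) {
      // Keep serving the last known values if the source is unavailable.
      console.error("flag refresh failed:", error.message);
    }
  }

  return {
    async isEnabled(name) {
      await refreshIfStale();
      return cache[name] === true; // unknown or non-boolean values count as off
    },
  };
}
```

The TTL is also your worst-case rollback delay: with a 30-second cache, turning a flag off takes up to 30 seconds to reach every server. Pick a value you are comfortable waiting through during an incident.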
Flag Hygiene
Feature flags have a cost: every flag is a branch in your code. If you never clean them up, you end up with a codebase full of dead branches, unclear behavior, and "what does this flag do?" questions.
Flag lifecycle:
1. Create the flag (name it clearly: new_checkout_flow, not flag_123)
2. Ship the code behind the flag
3. Test and roll out
4. Confirm the feature is stable and the old path is no longer needed
5. Remove the flag AND the old code path
6. Deploy the cleanup
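Target: release and experiment flags should live for 2-4 weeks, max. Ops and permission flags are the exception; they are long-lived by design.
If a release or experiment flag has been around for 3 months, either ship the feature or kill it.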
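A lifetime target is easier to keep when something checks it for you. The sketch below assumes each flag records a `type` (using the four flag types described earlier) and a `createdAt` date; both field names are illustrative. It treats 4 weeks as 28 days and 3 months as 90 days, and it exempts ops and permission flags because those are meant to be long-lived.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Only these flag types are expected to be removed after rollout.
const TEMPORARY_TYPES = new Set(["release", "experiment"]);

// Returns "permanent", "ok", "overdue", or "ship or kill".
function lifetimeStatus(flag, now = new Date()) {
  if (!TEMPORARY_TYPES.has(flag.type)) return "permanent";
  const ageDays = Math.floor((now - flag.createdAt) / DAY_MS);
  if (ageDays > 90) return "ship or kill"; // past roughly 3 months
  if (ageDays > 28) return "overdue";      // past the 4-week target
  return "ok";
}

const example = {
  name: "new_checkout_flow",
  type: "release",
  createdAt: new Date("2024-01-01T00:00:00Z"),
};
console.log(lifetimeStatus(example, new Date("2024-02-15T00:00:00Z"))); // "overdue"
```

Running a check like this in CI or a weekly job turns "we should clean that up" into a visible list.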
Flag Naming Conventions
Good flag names:
new_onboarding_flow
enable_dark_mode
beta_analytics_dashboard
experiment_pricing_v2
Bad flag names:
flag1
test
johns_feature
temp_fix_delete_later (it will not be deleted later)
The Flag Cleanup Ritual
Once a week (or once a sprint), review your flag list:
Flag review checklist:
For each flag:
- Is it still needed?
- If it's fully rolled out, remove the flag and the old code path
- If it's been OFF for more than 2 weeks, delete the feature code
- If nobody knows what it does, investigate and document or remove
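The checklist translates directly into a function you can run over your flag list before the review meeting. This is a sketch: the flag fields (`enabled`, `rolloutPercent`, `updatedAt`, `description`) are assumptions about what your flag records contain, and "nobody knows what it does" is approximated as "has no description".

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns the suggested action for one flag, following the checklist order.
function reviewFlag(flag, now = new Date()) {
  const daysSinceChange = (now - flag.updatedAt) / MS_PER_DAY;

  if (flag.enabled && flag.rolloutPercent === 100) {
    return "remove the flag and the old code path";
  }
  if (!flag.enabled && daysSinceChange > 14) {
    return "delete the feature code";
  }
  if (!flag.description) {
    return "investigate, then document or remove";
  }
  return "keep";
}

// Print one line per flag for the weekly review.
function printReview(flags, now = new Date()) {
  for (const flag of flags) {
    console.log(`${flag.name}: ${reviewFlag(flag, now)}`);
  }
}
```

The output is a starting point for the conversation, not a verdict. A flag that is intentionally OFF, such as an emergency switch, will show up here and can simply be marked as expected.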
Real-World Flag Usage
GitHub:
Uses feature flags extensively for Copilot, Actions, and new UI features.
Internal users see features weeks before public release. Gradual rollout
to percentages of users. Instant kill switch if metrics degrade.
Etsy:
Made "config flags" a core part of their continuous deployment culture.
Every feature was behind a flag. Engineers could enable features for
themselves, for the office, or for a percentage of traffic. This let
them deploy 50+ times per day with confidence.
Reddit:
Uses feature flags to manage experiments on the platform. Subreddit-level
targeting, user-level targeting, and percentage rollouts. They test
feed algorithms, UI changes, and new features continuously.
Common Pitfalls
- Too many flags. If you have 50 active flags, nobody knows which combination of flags represents the "real" product. Keep the active flag count under 15. Clean up aggressively.
- Flags that never get removed. The old code path behind a flag that has been ON for six months is dead code that looks alive. It clutters the codebase and confuses new engineers. Set a calendar reminder to remove flags after rollout.
- Testing only the flag-ON path. When you have a flag, you have two versions of your code. Test both. Users on the old path still need a working product.
- Using flags to avoid hard decisions. "Let's put it behind a flag" can become a way to ship something without committing to it. Flags should be temporary. If you cannot decide whether to ship a feature, a flag does not solve the decision problem.
- Complex flag dependencies. If enabling flag A requires flag B to be on, and flag B requires flag C, you have created a configuration management nightmare. Keep flags independent.
- Buying a platform too early. A feature flag SaaS can cost $100-500/month and puts a third-party dependency in your request path. A database table with five rows costs nothing and does everything you need for the first year.
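The "test both paths" pitfall is the easiest one to guard against mechanically: run the same checks once with the flag OFF and once with it ON. The sketch below uses a made-up `priceInCents` function and a tiny helper instead of a test framework, so every name in it is hypothetical. The same loop works inside whatever test runner you already use.

```javascript
// Hypothetical code under test: prices differ depending on the flag.
const OLD_PRICES = { starter: 1000, team: 5000 };
const NEW_PRICES = { starter: 1200, team: 6000 };

function priceInCents(plan, flags) {
  const table = flags.newPricing ? NEW_PRICES : OLD_PRICES;
  if (!(plan in table)) throw new Error(`unknown plan: ${plan}`);
  return table[plan];
}

// Runs `check` twice: once with the flag off, once with it on.
function checkBothPaths(flagName, check) {
  for (const value of [false, true]) {
    check({ [flagName]: value }, value);
  }
}

checkBothPaths("newPricing", (flags, isOn) => {
  const expected = isOn ? 1200 : 1000;
  const actual = priceInCents("starter", flags);
  if (actual !== expected) {
    throw new Error(`newPricing=${isOn}: expected ${expected}, got ${actual}`);
  }
});
console.log("both flag paths pass");
```

When the flag is removed, the loop and the old expectations go with it, which is a useful reminder that cleanup includes the tests.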
Key Takeaways
- Feature flags decouple deployment from release. Deploy code continuously; release features when they are ready by toggling flags.
- Start with the simplest implementation: a database table with boolean values and a basic admin UI. You do not need a platform until you have 20+ flags or need statistical experimentation.
- Use flags to test in production with real data and real traffic. Staging environments do not replicate production conditions — production does.
- Set success criteria before running experiments. If you decide what "winning" means after seeing data, you will rationalize keeping whatever you built.
- Clean up flags aggressively. A release or experiment flag should live for two to four weeks. If it has been around for three months, either ship the feature permanently or kill it.