Feature Flags

What is a rollout?

A rollout is the process of releasing a feature to users in controlled stages using feature flags. Instead of flipping a switch and exposing every user at once, you start with a small percentage, monitor metrics, and increase exposure gradually. The goal is safe release, not measurement. You're answering "can we ship this without breaking anything?" rather than "did this change make the product better?"

At Spotify, every production change that touches user experience goes through a rollout. The platform monitors guardrail metrics at each stage. If a metric regresses, the team rolls back before the damage reaches most users. Across 10,000+ experiments per year, 42% are rolled back after guardrails detect regressions. That catch rate depends on the rollout pattern giving the platform time to detect problems before they're everywhere.

How does a rollout work?

A rollout in Confidence is technically a flag-controlled experiment with adjustable reach. The flag starts at a low percentage, say 1% or 5%. Users in that percentage see the new feature. Everyone else sees the existing experience.
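
To make that concrete, here is a minimal sketch of flag-gated code paths. The `FlagClient` below is an illustrative stand-in, not the Confidence SDK's actual API; the flag name and percentage are assumptions.

```python
import hashlib

# Minimal sketch of flag-gated code paths. `FlagClient` is an illustrative
# stand-in, not the Confidence SDK. Assignment is deterministic (see the
# bucketing sketch further down).
class FlagClient:
    def __init__(self, flag_name: str, rollout_percentage: int):
        self.flag_name = flag_name
        self.rollout_percentage = rollout_percentage  # e.g. 1 or 5 to start

    def is_enabled(self, user_id: str) -> bool:
        digest = hashlib.sha256(f"{self.flag_name}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % 100 < self.rollout_percentage

flag = FlagClient("new-home-feed", rollout_percentage=5)

def render_home(user_id: str) -> str:
    if flag.is_enabled(user_id):
        return "new home feed"       # users inside the rollout percentage
    return "existing home feed"      # everyone else keeps the current experience
```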

At each stage, the platform compares guardrail metrics between the exposed and unexposed groups. If crash rates, latency, error rates, or other guardrails stay within bounds, the team increases the percentage. The stages might be 1% to 10% to 50% to 100%, or they might be more granular depending on the risk profile of the change.
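
A staged ramp with guardrail checks between stages might look like the sketch below. The stage values, metric names, and tolerances are illustrative assumptions, not Confidence defaults.

```python
# Sketch of a staged ramp with guardrail checks between stages. Stage values,
# metric names, and tolerances are illustrative assumptions.
STAGES = [1, 10, 50, 100]  # percent of users exposed

def guardrails_healthy(exposed: dict, control: dict) -> bool:
    # The exposed group must stay within a small tolerance of the unexposed group.
    return (exposed["crash_rate"] <= control["crash_rate"] * 1.05
            and exposed["p95_latency_ms"] <= control["p95_latency_ms"] * 1.10)

def next_stage(current_pct: int, exposed: dict, control: dict) -> int:
    if not guardrails_healthy(exposed, control):
        return 0  # regression detected: roll the flag back to 0%
    remaining = [s for s in STAGES if s > current_pct]
    return remaining[0] if remaining else 100  # hold once fully rolled out
```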

The assignment is deterministic: a hash of the user ID ensures the same user consistently sees the same experience, without storing per-user state. When you increase from 10% to 50%, the original 10% stays in the treatment group. New users are added, but no one gets shuffled between groups mid-rollout.
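
Here is a self-contained sketch of that bucketing scheme, assuming a SHA-256 hash and 100 buckets (the real implementation may differ):

```python
import hashlib

# Deterministic bucketing sketch: the same user_id always lands in the same
# bucket, so raising the percentage only adds users, never reshuffles them.
# The salt keeps bucket assignments independent across different flags.
def bucket(user_id: str, flag_salt: str = "new-home-feed") -> int:
    digest = hashlib.sha256(f"{flag_salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100  # stable value in [0, 100)

def in_rollout(user_id: str, percentage: int) -> bool:
    return bucket(user_id) < percentage

# Any user inside the first 10% is still in treatment at 50%: buckets below
# 10 are necessarily below 50, so no one is shuffled mid-rollout.
users = [f"user-{i}" for i in range(1000)]
assert all(in_rollout(u, 50) for u in users if in_rollout(u, 10))
```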

When should you use a rollout vs. an A/B test?

Rollouts and A/B tests are complementary tools that answer different questions.

An A/B test validates whether a change improves a success metric. Traffic splits at a fixed ratio (usually 50/50), the experiment runs until it has collected enough data to reach its planned statistical power, and the result tells you whether users are better off. A/B tests are designed for learning.
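
For a sense of scale, the standard two-proportion approximation gives the sample size a 50/50 test needs per arm. The baseline rate and effect size below are illustrative assumptions:

```python
from math import ceil

# Rough sample-size sketch for a 50/50 A/B test on a conversion rate, using
# the standard two-proportion approximation (alpha = 0.05, power = 0.80).
def sample_size_per_arm(baseline: float, mde: float,
                        z_alpha: float = 1.96, z_beta: float = 0.84) -> int:
    p1, p2 = baseline, baseline + mde
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil(((z_alpha + z_beta) ** 2 * variance) / mde ** 2)

# Detecting a 0.5-point lift on a 10% conversion rate takes roughly 58,000
# users per arm; the test runs until it has that much data.
print(sample_size_per_arm(baseline=0.10, mde=0.005))
```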

A rollout releases a validated change safely. Traffic starts small and grows. The platform watches for regressions. Rollouts are designed for shipping.

The typical workflow at mature experimentation organizations: A/B test first to validate the idea, then roll out to release it. Some low-risk changes (copy updates, configuration adjustments) skip straight to rollout. High-risk changes sometimes get both an A/B test and a slow rollout on top.

Confidence treats both as first-class concepts. An A/B test and a rollout use the same flag infrastructure, the same metric monitoring, and the same guardrail checks. The difference is in intent and traffic allocation, not in the underlying machinery.
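
As an illustration of that difference, the two configs below run on the same flag. The field names are assumptions for the sketch, not Confidence's actual schema:

```python
# Illustrative configs only; field names are assumptions, not Confidence's
# actual schema. Both run on the same flag: only intent and allocation differ.
ab_test = {
    "flag": "new-home-feed",
    "allocation": {"treatment": 50, "control": 50},  # fixed split for learning
    "stop_rule": "run until planned statistical power is reached",
}

rollout = {
    "flag": "new-home-feed",
    "allocation": {"treatment": 1, "control": 99},   # starting stage
    "stages": [1, 10, 50, 100],                      # grows if guardrails hold
    "stop_rule": "roll back to 0% on guardrail regression",
}
```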

What makes a rollout go wrong?

Rolling out too fast. If you jump from 1% to 100% because the first stage looked clean, you skip the stages where subtle regressions would have become visible. A metric that looks stable at 1% might show a clear regression at 50% because the sample size is now large enough to detect it. Patience at each stage is the cheapest insurance.
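
The arithmetic behind that is the standard error of a rate shrinking with the square root of sample size. The population and crash-rate figures below are illustrative:

```python
from math import sqrt

# Sketch of why a regression invisible at 1% shows up at 50%: the standard
# error of a rate shrinks with sqrt(n). User counts are illustrative.
def standard_error(rate: float, n: int) -> float:
    return sqrt(rate * (1 - rate) / n)

population = 1_000_000
baseline_crash_rate = 0.01
for pct in (1, 50):
    n = population * pct // 100
    # A regression smaller than ~2 standard errors is indistinguishable
    # from noise at this stage.
    print(pct, round(2 * standard_error(baseline_crash_rate, n), 5))

# At 1% (10,000 users) the noise floor is ~0.002; at 50% (500,000 users) it
# drops to ~0.0003, so a 0.1-point regression goes from invisible to obvious.
```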

Not monitoring the right metrics. A rollout that only tracks crash rates will miss a degraded user experience that doesn't cause crashes. Guardrail metrics should cover reliability (crashes, errors, latency), user experience (engagement, retention proxies), and business outcomes (conversion, revenue per user). Confidence lets teams define required guardrail metrics per Surface, so the monitoring is consistent across rollouts.
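
A per-Surface guardrail definition might look like the following sketch. The metric names and thresholds are assumptions meant to show coverage across all three categories, not Confidence defaults:

```python
# Hypothetical guardrail definition for a Surface; names and thresholds are
# illustrative. The point is covering reliability, experience, and business.
HOME_SURFACE_GUARDRAILS = [
    {"metric": "crash_rate",      "kind": "reliability", "max_regression": "5% relative"},
    {"metric": "p95_latency_ms",  "kind": "reliability", "max_regression": "10% relative"},
    {"metric": "day1_retention",  "kind": "experience",  "max_regression": "0.2 pp"},
    {"metric": "conversion_rate", "kind": "business",    "max_regression": "0.1 pp"},
]
```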

Skipping the rollback plan. Every rollout should have a clear answer to "what happens if this goes wrong?" In Confidence, rollback means setting the flag back to 0%. It takes seconds and requires no deploy. Teams that plan for rollback before they start the rollout make faster decisions when things go sideways.
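
A rollback is a configuration change, not a deploy, as in this sketch with an in-memory stand-in for the flag store:

```python
# Rollback sketch. The in-memory store is a hypothetical stand-in for the
# real control plane; the point is that no code ships to undo the rollout.
FLAG_EXPOSURE = {"new-home-feed": 50}  # current stage

def roll_back(flag_name: str) -> None:
    FLAG_EXPOSURE[flag_name] = 0  # every resolve now returns the existing experience

roll_back("new-home-feed")
assert FLAG_EXPOSURE["new-home-feed"] == 0
```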