Core Experimentation

What is a Holdout Group?

A holdout group is a segment of users permanently excluded from receiving a feature or set of features, maintained over time to measure cumulative long-term impact. Unlike a control group in a single A/B test (which lasts for the duration of one experiment), a holdout persists across multiple feature releases. It answers a question no individual experiment can: what's the total effect of everything we've shipped?

Holdouts are one of the most valuable and most underused tools in experimentation. At Spotify, holdout groups have revealed cases where individually positive experiments produced negative cumulative effects when stacked together over months. A feature that lifts engagement by 0.5% in isolation can interact with three other recently shipped features in ways that erode the user experience. Without a holdout, those interaction effects are invisible.

How does a holdout group work?

The setup is straightforward. You designate a percentage of users (typically 1-5%) who will not receive any new features in a particular product area. As your team ships feature after feature, the holdout group stays on the old experience. You then compare the holdout group's metrics against the rest of the population at regular intervals.
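A minimal sketch of how this assignment might be implemented, using a salted hash of the user ID so membership stays stable across releases (the salt, bucket counts, and function names below are illustrative, not any specific platform's API):

```python
import hashlib

# Deterministic holdout assignment (illustrative sketch). Hashing the user ID
# with a stable salt keeps the same users excluded across every release in
# this product area.
HOLDOUT_SALT = "player-page-holdout-2024"  # hypothetical product area + cohort
HOLDOUT_BUCKETS = 10_000                   # 0.01% granularity
HOLDOUT_SIZE = 200                         # first 200 buckets = 2% of users

def in_holdout(user_id: str) -> bool:
    """Return True if this user belongs to the holdout for this product area."""
    digest = hashlib.sha256(f"{HOLDOUT_SALT}:{user_id}".encode()).hexdigest()
    return int(digest[:8], 16) % HOLDOUT_BUCKETS < HOLDOUT_SIZE

def should_receive_new_feature(user_id: str) -> bool:
    """Gate every new feature in this area through the same holdout check."""
    return not in_holdout(user_id)
```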

The comparison measures cumulative impact: the combined effect of every change shipped since the holdout was created. If you shipped ten features over six months and the holdout group's engagement is 3% lower than the treated population's, your shipped features collectively produced roughly a +3% lift. If the holdout group's engagement is higher, you've shipped yourself backward.
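The periodic comparison itself is an ordinary two-sample estimate. A minimal sketch, assuming you can query each population's mean engagement, variance, and size (the function name and all figures below are illustrative):

```python
import math

# Relative lift of the treated population over the holdout, with a
# normal-approximation 95% confidence interval on the absolute difference.
def cumulative_lift(treated_mean, treated_var, treated_n,
                    holdout_mean, holdout_var, holdout_n):
    diff = treated_mean - holdout_mean
    se = math.sqrt(treated_var / treated_n + holdout_var / holdout_n)
    ci = (diff - 1.96 * se, diff + 1.96 * se)
    return diff / holdout_mean, ci

# e.g. treated users average 40.2 minutes/day vs 39.0 for the holdout:
lift, ci = cumulative_lift(40.2, 250.0, 9_800_000, 39.0, 250.0, 200_000)
print(f"cumulative lift: {lift:+.1%}, absolute diff CI: ({ci[0]:.2f}, {ci[1]:.2f})")
```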

This is distinct from a per-experiment control group. An A/B test control group exists only for the duration of that experiment. Once the experiment concludes and the feature ships, all users (including former control users) receive the change. A holdout group remains excluded even after individual experiments end.

Why are holdouts necessary if you already run A/B tests?

Individual A/B tests measure the marginal effect of one change in the context of everything else that's already live. They don't measure interaction effects between features shipped at different times, and they don't capture effects that compound slowly.

Consider three changes shipped over a quarter: a redesigned onboarding flow, a new recommendation algorithm, and a reorganized settings page. Each showed a positive result in its own A/B test. But the new onboarding flow set user expectations that the recommendation algorithm now contradicts, and the reorganized settings page made it harder to adjust preferences that the algorithm depends on. The cumulative effect is worse than the sum of the parts.
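To make the failure mode concrete, here is a toy calculation (all numbers invented) in which three changes each test positive in isolation, yet pairwise interactions push the combined effect negative:

```python
# Each change tests positive against the product as it existed at its own
# launch time, but the isolated tests never see the interaction terms.
individual_lifts = {"onboarding": 0.006, "recommendations": 0.008, "settings": 0.004}

# Hypothetical interaction effects a holdout would capture but A/B tests miss.
interactions = {("onboarding", "recommendations"): -0.012,
                ("recommendations", "settings"): -0.009}

naive_sum = sum(individual_lifts.values())
cumulative = naive_sum + sum(interactions.values())
print(f"sum of individual A/B results: {naive_sum:+.1%}")    # +1.8%
print(f"effect a holdout would measure: {cumulative:+.1%}")  # -0.3%
```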

Spotify's experimentation program uses holdouts specifically for this reason. The Search team's experimentation maturity journey, documented publicly, included adopting holdout evaluation as a later-stage practice: first improve individual experiment quality, then coordinate across experiments, then measure total business impact with holdouts.

What are the limitations of holdouts?

Sample size constraints. A 2% holdout across 10 million monthly users gives you 200,000 holdout users. That's large enough to detect substantial cumulative effects, but not sensitive enough to detect small ones. Because you're measuring the combined effect of many changes, the signal is usually large enough, but early in a holdout's life (after only one or two feature releases), the cumulative effect may be too small to measure.
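To see why sensitivity is limited, a back-of-the-envelope minimum detectable effect for the 2% holdout above, using the standard two-sample normal approximation (two-sided alpha = 0.05, 80% power; baseline rate and sample sizes are illustrative):

```python
import math

def minimum_detectable_effect(n_holdout, n_treated, baseline_rate):
    """Smallest absolute difference in a conversion-style metric detectable
    at two-sided alpha=0.05 with 80% power (normal approximation)."""
    z_alpha, z_power = 1.96, 0.84
    variance = baseline_rate * (1 - baseline_rate)
    se = math.sqrt(variance / n_holdout + variance / n_treated)
    return (z_alpha + z_power) * se

# 200,000 holdout users vs 9.8M treated users, metric with a 30% baseline:
mde = minimum_detectable_effect(200_000, 9_800_000, 0.30)
print(f"MDE ~ {mde:.4f} absolute ({mde / 0.30:.1%} relative)")
```

With these inputs the holdout can detect roughly a 1% relative change, which is fine for the combined effect of a quarter's worth of releases but not for the first small change or two.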

User experience risk. Holdout users don't get improvements. If your team ships a bug fix or a critical safety feature, excluding holdout users may not be acceptable. Most organizations exempt certain categories of changes (bug fixes, compliance requirements, infrastructure changes) from the holdout.
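A sketch of how such an exemption policy might be encoded (the category names and function are hypothetical, not any platform's configuration):

```python
# Changes in exempt categories bypass the holdout so excluded users still
# receive critical fixes; only genuine new features respect it.
EXEMPT_CATEGORIES = {"bug_fix", "security", "compliance", "infrastructure"}

def change_respects_holdout(change_category: str) -> bool:
    """Return True if a change of this category should be withheld from holdout users."""
    return change_category not in EXEMPT_CATEGORIES
```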

Staleness over time. The longer a holdout runs, the more the holdout experience diverges from the current product. At some point, the holdout group is using a version of the product so different from what everyone else sees that the comparison becomes less useful. Refreshing the holdout (releasing all users and starting a new holdout) resets the measurement window.

Confidence supports holdout evaluation through its experiment coordination surfaces, letting teams configure which user segments are held out and for which feature areas, and tracking cumulative metrics over time.