A holdback is a subset of users intentionally kept on the old experience after a feature has shipped to everyone else. The purpose is measurement: by maintaining a small control group that never gets the new feature, you can measure the ongoing incremental impact of the change weeks or months after launch.
Most teams stop measuring the moment a feature reaches full rollout. The A/B test is done, the rollout is complete, the feature is "shipped." But product impact isn't static. A feature that showed a 3% lift in a two-week experiment might show a 1% lift after three months as the novelty wears off. Or the lift might grow as users discover the feature's full value. Without a holdback, you're guessing which of those stories is true.
When should you use a holdback?
Holdbacks make sense for features where the long-term impact matters and isn't obvious from the initial experiment.
Major product changes. When a team redesigns a core surface like navigation, search, or the home screen, the short-term experiment captures the immediate reaction. A holdback captures whether that reaction persists, reverses, or compounds.
Revenue-impacting features. For changes that affect conversion, subscription, or monetization, leadership often wants to know the ongoing dollar impact, not just the initial experimental result. A holdback provides a continuously updated estimate.
Features with suspected novelty effects. If the experiment result might be driven by users exploring something new rather than deriving lasting value, a holdback is the honest way to check. The Spotify Search team used holdbacks when evaluating search experience changes where initial engagement spikes might not represent sustained improvement.
Holdbacks don't make sense for every feature. Small UI tweaks, bug fixes, and infrastructure changes rarely justify the cost of maintaining a control group. The decision comes down to whether the long-term causal estimate is worth the engineering and user-experience cost of denying the feature to a small group.
How large should a holdback be?
The holdback group needs to be large enough to detect a meaningful effect but small enough that you're not withholding a valuable feature from too many users.
Common holdback sizes range from 1% to 5% of users. A 1% holdback gives you a control group for directional monitoring but limits statistical power. A 5% holdback provides more precise estimates but means one in twenty users doesn't get the feature. The right size depends on your total user base and the minimum effect size you care about. For a product with millions of daily active users, even 1% can support useful measurement.
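The trade-off between holdback size and sensitivity can be made concrete with a standard power calculation. This is a minimal sketch using the normal approximation for a two-proportion comparison; the user counts and the 20% baseline conversion rate are illustrative assumptions, not figures from the text.

```python
import math

def holdback_mde(dau: int, holdback_frac: float, baseline_rate: float,
                 alpha_z: float = 1.96, power_z: float = 0.84) -> float:
    """Minimum detectable absolute effect for comparing the holdback
    to the treated population (two-proportion z-test, normal approx.,
    alpha = 0.05 two-sided, 80% power)."""
    n_holdback = dau * holdback_frac
    n_treated = dau * (1 - holdback_frac)
    # Standard error of the difference in proportions under the baseline rate
    se = math.sqrt(baseline_rate * (1 - baseline_rate)
                   * (1 / n_holdback + 1 / n_treated))
    return (alpha_z + power_z) * se

# Hypothetical product: 5M DAU, 20% baseline conversion.
for frac in (0.01, 0.05):
    print(f"{frac:.0%} holdback: MDE = {holdback_mde(5_000_000, frac, 0.20):.4f}")
```

With these assumed numbers, a 1% holdback can detect roughly a half-percentage-point absolute change, and a 5% holdback roughly half that, which is why scale matters: the smaller your user base, the larger the holdback needs to be for the same sensitivity.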
In Confidence, a holdback is implemented as a flag that stays at less than 100% allocation after the rollout is "complete." The 95% or 99% in treatment get the feature; the remaining group is the holdback. Guardrail metrics and success metrics continue to be computed by comparing the two groups for as long as the holdback persists.
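The mechanics of "a flag at less than 100% allocation" come down to deterministic bucketing. This sketch is not Confidence's actual implementation; it is a generic hash-based assignment, with the flag name and the 99% threshold as illustrative assumptions.

```python
import hashlib

def assign(user_id: str, flag_name: str, treatment_pct: float = 0.99) -> str:
    """Deterministically bucket a user into [0, 1) by hashing the
    (flag, user) pair; users above the allocation threshold never
    receive the feature and form the holdback."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF
    return "treatment" if bucket < treatment_pct else "holdback"

# The same user always lands in the same group, so the holdback
# stays stable across sessions without storing assignments.
print(assign("user-42", "new-home-screen"))
```

Because assignment is a pure function of the user and flag identifiers, the holdback membership is stable for as long as the flag exists, which is what makes long-running comparison of the two groups possible.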
What are the risks of holdbacks?
User experience cost. Some fraction of users doesn't get a feature that the experiment already validated as better. If the feature is a clear improvement, every day the holdback runs is a day those users have a worse experience. Teams should set a clear end date for the holdback and stick to it.
Holdback drift. Over time, the holdback group can become less representative. Users churn, new users join and get assigned to the holdback, and the composition of both groups shifts. For holdbacks lasting more than a few months, check that the groups still look comparable on key covariates.
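One simple way to run that comparability check is a standardized mean difference (SMD) on each key covariate; an absolute SMD above about 0.1 is a common rule of thumb for meaningful imbalance. A minimal sketch, with the covariate data generated purely for illustration:

```python
import math
import random

def standardized_mean_diff(a: list[float], b: list[float]) -> float:
    """Difference in means divided by the pooled standard deviation.
    |SMD| > 0.1 is a common flag for covariate imbalance."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    var_a = sum((x - mean_a) ** 2 for x in a) / (len(a) - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (len(b) - 1)
    pooled_sd = math.sqrt((var_a + var_b) / 2)
    return (mean_a - mean_b) / pooled_sd

# Illustrative: compare a pre-launch covariate (e.g. account tenure in
# months) between the treated group and the holdback.
random.seed(0)
treated = [random.gauss(10, 2) for _ in range(5_000)]
holdback = [random.gauss(10, 2) for _ in range(5_000)]
print(f"SMD = {standardized_mean_diff(treated, holdback):.3f}")
```

Running this periodically on covariates measured before assignment (tenure, platform, country) gives an early warning that churn or new-user inflow has made the two groups incomparable, before you try to interpret a long-run metric difference.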
Organizational friction. When a team ships a feature and moves on, maintaining a holdback requires someone to monitor it. If no one owns the holdback, it either runs forever (wasting the control group's experience) or gets forgotten and cleaned up without extracting the measurement. Assign a clear owner and a review date before the holdback starts.