Lesson 4: Cumulative holdback evaluations
A cumulative holdback test measures the combined impact of all changes shipped during a period, such as a quarter. Setting one up requires planning and coordination to ensure all experiments during the period exclude holdback users.
What is a holdback group?
A holdback group is a set of users who are excluded from receiving any product changes during a defined period. This lets you run a test at the end of the period where the holdback group receives all the changes at once, measuring their combined impact.
When you create a holdback, you assign a percentage of users to it.
What is a cumulative holdback test?
A cumulative holdback test measures the total impact of multiple changes shipped during a holdback period. After the period ends, you run an A/B test where the holdback group receives all the changes at once. The result reveals the combined effect of everything that shipped.
This is useful when individual A/B tests cannot capture interaction effects between features, or when you want a single measurement of quarterly progress.
How to set up a holdback
Setting up a holdback requires two decisions before experiments start: how large to make it, and how to create and configure it.
Determine the size of the holdback group
Choosing the holdback group size involves several trade-offs. A larger group gives more statistical power for the cumulative evaluation, but it reduces the population available for experiments during the holdback period. There is also an opportunity cost: if a shipped feature turns out to be very successful, holdback users miss out on it for the entire holdback period. A large holdback group with a successful quarter of shipping can mean meaningful lost value for those users.
If you're unsure how large to make the holdback, work backwards from the evaluation you plan to run at the end of the period. Set up a placeholder A/B test for that evaluation and use the sample size calculator to find how many users you need to detect the effects you care about. That number tells you the minimum holdback size.
In Confidence, do this by creating a placeholder A/B test for the holdback evaluation and using the sample size calculator to determine how many users you need. Set the holdback group size from that result before any experiments start.
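To make the "work backwards" step concrete, here is a minimal sketch of the kind of calculation a sample size calculator performs. It is not Confidence's calculator. It assumes a binary success metric, a two-sided two-proportion z-test, and an evaluation that splits the holdback into two equal arms. The function names and the default values for `alpha` and `power` are illustrative assumptions.

```python
import math
from statistics import NormalDist


def users_per_arm(baseline_rate, min_relative_lift, alpha=0.05, power=0.8):
    """Approximate users needed per arm for a two-sided two-proportion z-test.

    Assumption: the metric is a conversion-style rate. Other metric types
    need a different variance term.
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + min_relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


def min_holdback_share(total_users, baseline_rate, min_relative_lift):
    """Smallest share of all users the holdback needs.

    The end-of-period evaluation gives the changes to half of the holdback
    and keeps the other half as control, so the holdback must hold two arms.
    """
    needed = 2 * users_per_arm(baseline_rate, min_relative_lift)
    return needed / total_users


if __name__ == "__main__":
    # Hypothetical inputs: 2,000,000 users, 10% baseline rate, 5% relative lift.
    share = min_holdback_share(2_000_000, 0.10, 0.05)
    print(f"Minimum holdback share: {share:.1%}")
```

A smaller effect of interest increases the required share quickly, which is where the trade-off against experiment capacity and opportunity cost comes in.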
Create the holdback group
Create a holdback group in your experimentation tool, giving it a name and a percentage of users.
Some platforms let you make the holdback required, so that all experiments automatically exclude holdback users and experimenters cannot opt out. This removes the most common source of holdback corruption: experiments that accidentally or intentionally skip the exclusion.
In Confidence, you can make a holdback required on a surface (global or local). When required, every experiment on that surface automatically excludes holdback users. There is no need to configure each experiment individually, and experimenters cannot accidentally or intentionally skip it.
Run A/B tests and rollouts during the holdback period
All A/B tests and rollouts during the holdback period must be configured to not overlap with the holdback. Configure this on each experiment when you set it up. If the holdback is required, this happens automatically when a surface is selected.
For rollouts, the reach percentage is calculated relative to the total user population. Since holdback users are excluded, the maximum reachable population is smaller.
For example, suppose a 10% holdback group exists. A monitored rollout is configured to not overlap with the holdback, so it can only reach the 90% of users outside it. The rollout uses a 90% treatment / 10% control split internally. With 90% of users available, the maximum treatment reach is 81% of total users, since 90% × 90% = 81%. The rollout control takes the remaining 9%, and the 10% holdback group is untouched. The only way to roll out to the holdback's users is to remove the holdback.
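The arithmetic in this example can be sketched as a small helper. The function name and the returned dictionary keys are illustrative assumptions, not part of any platform API.

```python
def rollout_reach(holdback_share, treatment_split):
    """Split the total population into holdback, rollout treatment, and rollout control.

    holdback_share: fraction of all users in the holdback (e.g. 0.10).
    treatment_split: fraction of the non-holdback users that get treatment (e.g. 0.90).
    All returned values are fractions of the total user population.
    """
    available = 1.0 - holdback_share
    treatment = available * treatment_split
    control = available - treatment
    return {"holdback": holdback_share, "treatment": treatment, "control": control}


if __name__ == "__main__":
    reach = rollout_reach(holdback_share=0.10, treatment_split=0.90)
    for group, share in reach.items():
        print(f"{group}: {share:.0%}")
```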
What changes to include
For a cumulative holdback to be meaningful, all product changes during the holdback period should respect the holdback. This includes:
- A/B tests followed by rollouts
- Direct rollouts to ship a change without a prior A/B test
Minor bug fixes and backend migrations generally don't need to be included, since they aren't made to improve the product. However, any backend change expected to affect the user experience should be shipped through your experimentation platform so holdback users aren't affected.
Don't select which changes to include based on their expected impact. Excluding changes because their impact seems small compromises the purpose of the holdback.
Measure the impact at the end of the period
After the holdback period is over, run an A/B test targeting only the holdback group. Half of the holdback group receives all the changes that shipped during the period, while the other half acts as control.
In Confidence, select your holdback from the Holdback users sidebar in the Target audience section of the experiment setup.
Run holdbacks continuously
The main challenge with back-to-back quarterly holdbacks is creating the new holdback group in time and managing the transition between periods. Create the next quarter's holdback group before the current quarter ends, so any A/B test that might run into the next quarter can be configured against both holdbacks from the start. The two holdback groups should be non-overlapping—each user belongs to at most one of them—so that the groups remain clean and the evaluations don't interfere with each other.
Before Q1 starts: Create holdback HQ1. Configure all experiments to not overlap with it.
Day 50: A new A/B test is planned that may run into Q2. Create holdback HQ2. Configure this test to overlap with neither HQ1 nor HQ2.
Day 91: Q1 ends. Launch the cumulative evaluation A/B test, targeting only users in HQ1.
Day 121: The HQ1 evaluation ends. Remove HQ1, and roll out the Q1 changes to 100%.
Note that the total holdback period includes both the holdback itself and the evaluation test that runs at the end. For example, a quarterly holdback followed by 3 to 4 weeks of evaluation means users in the holdback group are withheld from product changes for a quarter plus those additional weeks.
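The requirement that consecutive holdback groups don't overlap can be illustrated with a minimal sketch. It assumes users are mapped to 100 buckets by hashing a user ID, and that each holdback owns a disjoint range of buckets. The salt, the bucket layout, and all function names are hypothetical. An experimentation platform handles this assignment for you, and its actual mechanism may differ.

```python
import hashlib

BUCKETS = 100  # each bucket represents roughly 1% of users

# Hypothetical layout: two disjoint 10% ranges, so no user is in both holdbacks.
HOLDBACK_RANGES = {"HQ1": range(0, 10), "HQ2": range(10, 20)}


def user_bucket(user_id, salt="holdback-v1"):
    """Deterministically map a user ID to a bucket in [0, BUCKETS)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % BUCKETS


def holdback_for(user_id):
    """Return the name of the holdback a user belongs to, or None."""
    bucket = user_bucket(user_id)
    for name, buckets in HOLDBACK_RANGES.items():
        if bucket in buckets:
            return name
    return None


def eligible_for_experiment(user_id, excluded_holdbacks):
    """A user is eligible unless they belong to one of the excluded holdbacks."""
    return holdback_for(user_id) not in excluded_holdbacks


if __name__ == "__main__":
    # A test that may run into Q2 excludes both holdbacks from the start.
    users = [f"user-{i}" for i in range(10_000)]
    eligible = [u for u in users if eligible_for_experiment(u, {"HQ1", "HQ2"})]
    print(f"Eligible share: {len(eligible) / len(users):.1%}")
```

Because the ranges are disjoint, removing HQ1 after its evaluation frees those users without affecting HQ2.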
This video (3 minutes, 57 seconds) gives an overview of exclusivity groups and holdbacks.
Cross-surface holdbacks
In Confidence, holdbacks can span multiple surfaces. A holdback is created on one surface, but experiments on other surfaces can also respect it by selecting both their own surface and the surface where the holdback lives. Read more about using surfaces to coordinate experiments. Cross-surface holdbacks require coordination across all teams involved—set up regular syncs to ensure everyone knows which holdbacks are active and when evaluation periods begin and end.
This video (4 minutes, 47 seconds) gives an overview of advanced experiment coordination using exclusivity groups and holdbacks across multiple surfaces.
Risks and common questions
Holdback corruption. If an experiment runs without respecting the holdback group, the holdback is contaminated and the cumulative test loses accuracy. Reduce this risk through careful communication, off-platform coordination, and reviewing experiment setups before launch.
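Reviewing experiment setups before launch can be partly automated. The following is a minimal sketch that assumes you can export experiment configurations as dictionaries with a `name` and a list of `excluded_holdbacks`. Both field names and the function are hypothetical and do not correspond to a real platform API.

```python
def find_holdback_violations(experiments, active_holdbacks):
    """Return experiments that do not exclude every active holdback.

    experiments: iterable of dicts with hypothetical keys "name" and
        "excluded_holdbacks".
    active_holdbacks: names of the holdbacks every experiment must exclude.
    Result maps each offending experiment name to the holdbacks it is missing.
    """
    required = set(active_holdbacks)
    violations = {}
    for experiment in experiments:
        missing = required - set(experiment.get("excluded_holdbacks", ()))
        if missing:
            violations[experiment["name"]] = sorted(missing)
    return violations


if __name__ == "__main__":
    planned = [
        {"name": "new-onboarding", "excluded_holdbacks": ["HQ1", "HQ2"]},
        {"name": "search-ranking", "excluded_holdbacks": ["HQ1"]},
        {"name": "checkout-copy"},
    ]
    for name, missing in find_holdback_violations(planned, ["HQ1", "HQ2"]).items():
        print(f"{name} is missing exclusions: {', '.join(missing)}")
```

A check like this complements, but does not replace, making the holdback required where the platform supports it.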
Some changes can't be held back. Some backend and infrastructure changes can't be withheld from users. When this happens, the cumulative measurement doesn't capture the full impact of all changes. Document which changes couldn't be held back and communicate the caveat when reporting results.
Which version of a feature to use. When a feature evolves across multiple holdback periods, use the version shipped at the end of the previous holdback as the baseline for holdback users in the next period. Using the previous quarter's final version as the baseline evaluates the change between quarters rather than each incremental update within a quarter.
An A/B test runs into the next quarter. If a test is designed to not overlap with the current quarter's holdback and succeeds, it will be rolled out in the next quarter. Create the next quarter's holdback group early and configure the test to also exclude those users. This ensures that holdback users in both periods are not exposed to the change before the evaluation.