Lesson 4: Cumulative holdback evaluations

What is a holdback group?

A holdback group is a set of users who are excluded from receiving any product changes during a defined period. This lets you run a test at the end of the period where the holdback group receives all the changes at once, measuring their combined impact.

When you create a holdback, you assign a percentage of users to it.

What is a cumulative holdback test?

A cumulative holdback test measures the total impact of multiple changes shipped during a holdback period. After the period ends, you run an A/B test where the holdback group receives all the changes at once. The result reveals the combined effect of everything that shipped.

This is useful when individual A/B tests cannot capture interaction effects between features, or when you want a single measurement of quarterly progress.

How to set up a holdback

Setting up a holdback requires two decisions before experiments start: how large to make it, and how to create and configure it.

Determine the size of the holdback group

Choosing the holdback group size involves several trade-offs. A larger group gives more statistical power for the cumulative evaluation, but it reduces the population available for experiments during the holdback period. There is also an opportunity cost: if a shipped feature turns out to be very successful, holdback users miss out on it for the entire holdback period. A large holdback group with a successful quarter of shipping can mean meaningful lost value for those users.

If you're unsure how large to make the holdback, work backwards from the evaluation you plan to run at the end of the period. Set up a placeholder A/B test for that evaluation and use the sample size calculator to find how many users you need to detect the effects you care about. That number tells you the minimum holdback size.
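
As a rough stand-in for the sample size calculator, the sizing logic can be sketched in Python. This sketch assumes a conversion-rate metric, a two-sided two-proportion z-test, and an evaluation that splits the holdback into two equal arms. The function names and the default values for alpha and power are illustrative and are not taken from any specific tool.

```python
import math
from statistics import NormalDist


def users_per_arm(baseline_rate: float, relative_lift: float,
                  alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per arm for a two-sided two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_power) ** 2 * variance_sum / (p2 - p1) ** 2
    return math.ceil(n)


def minimum_holdback_share(per_arm: int, total_users: int) -> float:
    """The evaluation uses two arms inside the holdback, so it needs 2 * per_arm users."""
    return 2 * per_arm / total_users


if __name__ == "__main__":
    # Hypothetical inputs: 10% baseline conversion, 5% relative lift, 2M users.
    n = users_per_arm(baseline_rate=0.10, relative_lift=0.05)
    print(n, minimum_holdback_share(n, total_users=2_000_000))
```

Smaller effects require sharply larger holdbacks because the required sample grows with the inverse square of the difference to detect, which is the trade-off against opportunity cost described above.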

Create the holdback group

Create a holdback group in your experimentation tool, giving it a name and a percentage of users.

Some platforms let you make the holdback required, so that all experiments automatically exclude holdback users and experimenters cannot opt out. This removes the most common source of holdback corruption: experiments that accidentally or intentionally skip the exclusion.
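
To illustrate what assigning a percentage of users to a holdback can look like, here is a minimal sketch that uses deterministic hashing. It is an assumption about one common approach, not a description of how any particular platform implements holdbacks. The holdback name, the bucket resolution, and the function names are hypothetical.

```python
import hashlib

BUCKETS = 10_000  # hypothetical resolution: 0.01 percentage points per bucket


def bucket(user_id: str, salt: str) -> int:
    """Map a user to a stable bucket in [0, BUCKETS) using a salted hash."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % BUCKETS


def in_holdback(user_id: str, holdback_name: str, percent: float) -> bool:
    """A user is held back if their bucket falls below the holdback threshold."""
    return bucket(user_id, holdback_name) < round(percent / 100 * BUCKETS)


def eligible_for_experiment(user_id: str, holdback_name: str, percent: float) -> bool:
    """With a required holdback, every experiment applies this check."""
    return not in_holdback(user_id, holdback_name, percent)


if __name__ == "__main__":
    held = sum(in_holdback(f"user-{i}", "q3-holdback", 5) for i in range(20_000))
    print(f"{held / 20_000:.2%} of sample users are in the holdback")
```

Because the assignment depends only on the user ID and the holdback name, the same user gets the same answer on every request for the whole period.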

Run A/B tests and rollouts during the holdback period

All A/B tests and rollouts during the holdback period must be configured so that they don't overlap with the holdback. Configure this on each experiment when you set it up. If the holdback is required, this happens automatically when a surface is selected.

For rollouts, the reach percentage is calculated relative to the total user population. Because holdback users are excluded, the maximum reach is below 100%. For example, with a 5% holdback, a rollout can reach at most 95% of users.
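
The reach arithmetic can be sketched as follows. The sketch assumes a single holdback and no other exclusions, and the function names are hypothetical.

```python
def max_rollout_reach(holdback_percent: float) -> float:
    """Largest reach (in % of all users) a rollout can have when it excludes the holdback."""
    return 100.0 - holdback_percent


def share_of_eligible(reach_percent: float, holdback_percent: float) -> float:
    """Fraction of non-holdback users covered by a rollout at the given reach."""
    max_reach = max_rollout_reach(holdback_percent)
    if reach_percent > max_reach:
        raise ValueError(
            f"reach {reach_percent}% exceeds the maximum of {max_reach}%"
        )
    return reach_percent / max_reach


if __name__ == "__main__":
    print(max_rollout_reach(5))        # 95.0
    print(share_of_eligible(47.5, 5))  # 0.5
```

Under these assumptions, a rollout at 95% reach with a 5% holdback is already fully shipped to every user who is allowed to receive it.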

What changes to include

For a cumulative holdback to be meaningful, all product changes during the holdback period should respect the holdback. This includes:

  • A/B tests followed by rollouts
  • Direct rollouts that ship a change without a prior A/B test

Minor bug fixes and backend migrations generally don't need to be included, since they aren't made to improve the product. However, any backend change expected to affect the user experience should be shipped through your experimentation platform so holdback users aren't affected.

Don't select which changes to include based on their expected impact. Excluding changes because their impact seems small compromises the purpose of the holdback.

Measure the impact at the end of the period

After the holdback period is over, run an A/B test targeting only the holdback group. Half of the holdback group receives all the changes that shipped during the period, while the other half acts as control.
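
One simple way to read the result of such an evaluation is a difference in means with a confidence interval. The sketch below assumes a per-user numeric metric, large samples, and a normal approximation. An experimentation platform's own analysis may use different methods, and the function name is hypothetical.

```python
import math
from statistics import NormalDist, mean, variance
from typing import Sequence, Tuple


def cumulative_effect(treated: Sequence[float], control: Sequence[float],
                      alpha: float = 0.05) -> Tuple[float, float, float]:
    """Return (difference in means, lower bound, upper bound) for the two halves."""
    diff = mean(treated) - mean(control)
    # Standard error with unequal variances (sample variances of each half).
    se = math.sqrt(variance(treated) / len(treated) + variance(control) / len(control))
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return diff, diff - z * se, diff + z * se


if __name__ == "__main__":
    # Toy data only: the half that received all changes vs. the control half.
    treated_half = [1.0, 2.0, 3.0, 4.0]
    control_half = [0.0, 1.0, 2.0, 3.0]
    print(cumulative_effect(treated_half, control_half))
```

The point estimate is the combined effect of everything that shipped during the period. An interval that includes zero means the evaluation could not distinguish that combined effect from no effect at the chosen significance level.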

Run holdbacks continuously

The main challenge with back-to-back quarterly holdbacks is creating the new holdback group in time and managing the transition between periods. Create the next quarter's holdback group before the current quarter ends, so any A/B test that might run into the next quarter can be configured against both holdbacks from the start. The two holdback groups should be non-overlapping, meaning each user belongs to at most one of them, so that the groups remain clean and the evaluations don't interfere with each other.
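
One way to keep consecutive holdbacks non-overlapping is to give each one a disjoint slice of a shared bucket space. The sketch below assumes hash-based bucketing. The holdback names, the 5% sizes, and the salt are hypothetical.

```python
import hashlib
from typing import Iterable, Optional

BUCKETS = 10_000

# Hypothetical allocation: each holdback owns a disjoint range of buckets.
HOLDBACK_RANGES = {
    "holdback-q1": range(0, 500),     # 5% of users
    "holdback-q2": range(500, 1000),  # a different 5% of users
}


def bucket(user_id: str, salt: str = "holdback-space") -> int:
    """Stable bucket in [0, BUCKETS), shared by all holdbacks."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % BUCKETS


def holdback_of(user_id: str) -> Optional[str]:
    """Return the single holdback a user belongs to, or None."""
    b = bucket(user_id)
    for name, buckets in HOLDBACK_RANGES.items():
        if b in buckets:
            return name
    return None


def eligible(user_id: str, excluded_holdbacks: Iterable[str]) -> bool:
    """An experiment spanning two quarters excludes both holdbacks."""
    return holdback_of(user_id) not in set(excluded_holdbacks)


if __name__ == "__main__":
    both = ["holdback-q1", "holdback-q2"]
    allowed = sum(eligible(f"user-{i}", both) for i in range(20_000))
    print(f"{allowed / 20_000:.1%} of sample users can enter the experiment")
```

Because the ranges never intersect, no user can land in both holdbacks, and a test configured against both groups simply loses the union of the two slices from its eligible population.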

Note that the total holdback period includes both the holdback itself and the evaluation test that runs at the end. For example, a quarterly holdback followed by 3 to 4 weeks of evaluation means users in the holdback group are withheld from product changes for a quarter plus those additional weeks.

This video (3 minutes 57 seconds) gives an overview of exclusivity groups and holdbacks.

Cross-surface holdbacks

This video (4 minutes 47 seconds) gives an overview of advanced experiment coordination using exclusivity groups and holdbacks across multiple surfaces.

Risks and common questions

Holdback corruption. If an experiment runs without respecting the holdback group, the holdback is contaminated and the cumulative test loses accuracy. Reduce this risk through careful communication, off-platform coordination, and review of experiment setups before launch. If your platform lets you make the holdback required, doing so removes the dependence on each experimenter remembering the exclusion.
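
A review for corruption can also be done after the fact by checking exposure data against holdback membership. The sketch below assumes you can export exposure records as (user ID, experiment ID) pairs and a set of holdback user IDs. The record format and the function name are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple


def find_contaminating_experiments(
    exposures: Iterable[Tuple[str, str]],
    holdback_users: Set[str],
) -> Dict[str, int]:
    """Count, per experiment, how many exposure records hit holdback users."""
    counts = Counter(
        experiment_id
        for user_id, experiment_id in exposures
        if user_id in holdback_users
    )
    return dict(counts)


if __name__ == "__main__":
    # Toy data only.
    exposure_log = [
        ("u1", "exp-a"),
        ("u2", "exp-a"),
        ("u3", "exp-b"),
        ("u1", "exp-b"),
    ]
    print(find_contaminating_experiments(exposure_log, holdback_users={"u1"}))
```

Any experiment that appears in the result exposed at least one holdback user and is worth listing as a caveat when reporting the cumulative result.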

Some changes can't be held back. Some backend and infrastructure changes can't be withheld from users. When this happens, the cumulative measurement doesn't capture the full impact of all changes. Document which changes couldn't be held back and communicate the caveat when reporting results.

Which version of a feature to use. When a feature evolves across multiple holdback periods, use the version shipped at the end of the previous holdback as the baseline for holdback users in the next period. Using the previous quarter's final version as the baseline evaluates the change between quarters rather than each incremental update within a quarter.

An A/B test runs into the next quarter. If a test that excludes the current quarter's holdback succeeds late in the quarter, its rollout will happen in the next quarter. Create the next quarter's holdback group early and configure the test to exclude those users as well. This ensures that holdback users in both periods are not exposed to the change before their evaluation.