Lesson 4: Cumulative holdback evaluations

Summary

A cumulative holdback test measures the combined impact of all changes shipped during a period, such as a quarter. Setting one up requires planning and coordination to ensure all experiments during the period exclude holdback users.

What is a holdback group?

A holdback group is a set of users who are excluded from receiving any product changes during a defined period. This lets you run a test at the end of the period where the holdback group receives all the changes at once, measuring their combined impact.

When you create a holdback, you assign a percentage of users to it.
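One common way to assign a fixed percentage of users is to hash each user ID into a stable bucket. The sketch below illustrates that idea only. It is not how Confidence or any particular platform implements assignment, and the names bucket, in_holdback, and the holdback name used as a salt are hypothetical.

```python
import hashlib

BUCKETS = 10_000  # assumed granularity: one bucket is 0.01% of users


def bucket(user_id: str, salt: str) -> int:
    """Map a user to a stable bucket in [0, BUCKETS) by hashing salt and ID."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % BUCKETS


def in_holdback(user_id: str, holdback_name: str, percentage: float) -> bool:
    """Return True if the user falls in the first `percentage` percent of buckets."""
    return bucket(user_id, holdback_name) < round(percentage / 100 * BUCKETS)
```

Because the assignment depends only on the user ID and the holdback name, the same user gets the same answer every time, which is what keeps the group stable for the whole period.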

In Confidence

Create a holdback group on the settings page of any surface: Surfaces tab > your surface > Settings.

What is a cumulative holdback test?

A cumulative holdback test measures the total impact of multiple changes shipped during a holdback period. After the period ends, you run an A/B test where the holdback group receives all the changes at once. The result reveals the combined effect of everything that shipped.

This is useful when individual A/B tests cannot capture interaction effects between features, or when you want a single measurement of quarterly progress.

How to set up a holdback

Setting up a holdback requires two decisions before experiments start: how large to make it, and how to create and configure it.

Determine the size of the holdback group

Choosing the holdback group size involves several trade-offs. A larger group gives more statistical power for the cumulative evaluation, but it reduces the population available for experiments during the holdback period. There is also an opportunity cost: if a shipped feature turns out to be very successful, holdback users miss out on it for the entire holdback period. A large holdback group with a successful quarter of shipping can mean meaningful lost value for those users.

If you're unsure how large to make the holdback, work backwards from the evaluation you plan to run at the end of the period. Set up a placeholder A/B test for that evaluation and use the sample size calculator to find how many users you need to detect the effects you care about. That number tells you the minimum holdback size.
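As a rough illustration of working backwards, the sketch below uses the textbook normal-approximation formula for a two-sided test on a difference in means with equal group sizes. It is an assumption for illustration and may differ from what your platform's sample size calculator computes. The function names and default values for alpha and power are hypothetical. The factor of two in minimum_holdback_share reflects that the end-of-period evaluation splits the holdback into two halves.

```python
import math
from statistics import NormalDist


def users_per_group(mde: float, std_dev: float,
                    alpha: float = 0.05, power: float = 0.8) -> int:
    """Users needed in each arm to detect an absolute difference `mde`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) * std_dev / mde) ** 2)


def minimum_holdback_share(total_users: int, mde: float, std_dev: float,
                           alpha: float = 0.05, power: float = 0.8) -> float:
    """Smallest fraction of all users the holdback must contain."""
    # The evaluation needs two arms (treatment and control) inside the holdback.
    needed = 2 * users_per_group(mde, std_dev, alpha, power)
    return needed / total_users
```

A smaller effect you want to detect (mde) or a noisier metric (std_dev) increases the required holdback share, which makes the trade-off against experiment capacity and opportunity cost concrete.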

In Confidence

In Confidence, create a placeholder A/B test for the holdback evaluation and use the sample size calculator to determine how many users you need. Use that to set the holdback group size before experiments start.

Create the holdback group

Create a holdback group in your experimentation tool, giving it a name and a percentage of users.

Some platforms let you make the holdback required, so that all experiments automatically exclude holdback users and experimenters cannot opt out. This removes the most common source of holdback corruption: experiments that accidentally or intentionally skip the exclusion.

In Confidence

In Confidence, you can make a holdback required on a surface (global or local). When required, every experiment on that surface automatically excludes holdback users. There is no need to configure each experiment individually, and experimenters cannot accidentally or intentionally skip it.

Run A/B tests and rollouts during the holdback period

All A/B tests and rollouts during the holdback period must be configured to not overlap with the holdback. Configure this on each experiment when you set it up. If the holdback is required, this happens automatically when a surface is selected.

For rollouts, the reach percentage is calculated relative to the total user population. Since holdback users are excluded, the maximum reachable population is smaller.

Example: Rollout reach with a 10% holdback

A 10% holdback group exists. A monitored rollout is configured to not overlap with the holdback, so it can only reach the 90% of users outside it. The rollout uses a 90% treatment / 10% control split internally. With 90% of users available, the maximum treatment reach is 81% of total users, since 90% × 90% = 81%. The rollout control takes the remaining 9%, and the 10% holdback group is untouched. The only way to roll out to the holdback's users is to remove the holdback.
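The arithmetic in this example can be written as a small helper. This is a minimal sketch with hypothetical names, where both arguments are fractions of the relevant population.

```python
def rollout_reach(holdback_share: float, treatment_share: float) -> dict[str, float]:
    """Split the total user population into rollout treatment, rollout control, and holdback.

    holdback_share: fraction of all users in the holdback (for example 0.10).
    treatment_share: fraction of the non-holdback users given the treatment (for example 0.90).
    """
    available = 1.0 - holdback_share  # users the rollout is allowed to reach
    return {
        "treatment": available * treatment_share,
        "control": available * (1.0 - treatment_share),
        "holdback": holdback_share,
    }
```

Calling rollout_reach(0.10, 0.90) reproduces the 81% / 9% / 10% split above, up to floating-point rounding.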

What changes to include

For a cumulative holdback to be meaningful, all product changes during the holdback period should respect the holdback. This includes:

A/B tests followed by rollouts
Direct rollouts to ship a change without a prior A/B test

Minor bug fixes and backend migrations generally don't need to be included, since they aren't made to improve the product. However, any backend change expected to affect the user experience should be shipped through your experimentation platform so holdback users aren't affected.

Don't select which changes to include based on their expected impact. Excluding changes because their impact seems small compromises the purpose of the holdback.

Measure the impact at the end of the period

After the holdback period is over, run an A/B test targeting only the holdback group. Half of the holdback group receives all the changes that shipped during the period, while the other half acts as control.
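To make the comparison concrete, the sketch below estimates the cumulative effect as the difference in mean metric value between the two halves, with a large-sample normal confidence interval. This is a simplified assumption for illustration. Your experimentation platform may use a different analysis method, and the function name and inputs are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance


def cumulative_effect(treated: list[float], control: list[float],
                      confidence: float = 0.95) -> tuple[float, float, float]:
    """Return (estimate, lower bound, upper bound) for the difference in means.

    treated: per-user metric values for the half that received all changes.
    control: per-user metric values for the half that stayed on the old experience.
    """
    diff = mean(treated) - mean(control)
    # Standard error of a difference between two independent sample means.
    se = math.sqrt(variance(treated) / len(treated) + variance(control) / len(control))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff, diff - z * se, diff + z * se
```

If the interval excludes zero, the data is consistent with a nonzero combined effect of everything shipped during the period, under the assumptions of this simple model.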

In Confidence

In Confidence, select your holdback from the Holdback users sidebar in the Target audience section of the experiment setup.

Run holdbacks continuously

The main challenge with back-to-back quarterly holdbacks is creating the new holdback group in time and managing the transition between periods. Create the next quarter's holdback group before the current quarter ends, so any A/B test that might run into the next quarter can be configured against both holdbacks from the start. The two holdback groups should be non-overlapping, meaning each user belongs to at most one of them, so that the groups remain clean and the evaluations don't interfere with each other.
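One way to guarantee that two holdback groups never share a user is to give each group its own disjoint range of buckets from a single shared hash. The sketch below illustrates the idea under stated assumptions: the 5% sizes, the salt, and the names HOLDBACK_RANGES, holdback_for, and eligible_for_experiment are all hypothetical, and this is not a description of how any specific platform allocates users.

```python
import hashlib
from typing import Optional

BUCKETS = 10_000

# Hypothetical layout: each holdback owns a half-open bucket range [start, end).
HOLDBACK_RANGES = {
    "HQ1": (0, 500),     # assumed 5% of users
    "HQ2": (500, 1000),  # a different 5% of users
}


def bucket(user_id: str, salt: str = "holdback-allocation") -> int:
    """Map a user to a stable bucket in [0, BUCKETS)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % BUCKETS


def holdback_for(user_id: str) -> Optional[str]:
    """Return the holdback the user belongs to, or None if they are in neither."""
    b = bucket(user_id)
    for name, (start, end) in HOLDBACK_RANGES.items():
        if start <= b < end:
            return name
    return None


def eligible_for_experiment(user_id: str, excluded_holdbacks: set[str]) -> bool:
    """A user is eligible unless they belong to a holdback the experiment excludes."""
    return holdback_for(user_id) not in excluded_holdbacks
```

Because the ranges do not overlap, a user can match at most one holdback, and a test that spans both quarters simply passes both names in excluded_holdbacks.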

Example: Transitioning from Q1 to Q2 holdbacks

Before Q1 starts: Create holdback HQ1. Configure all experiments to not overlap with it.

Day 50: A new A/B test is planned that may run into Q2. Create holdback HQ2. Configure this test so that it overlaps with neither HQ1 nor HQ2.

Day 91: Q1 ends. Launch the cumulative evaluation A/B test, targeting only users in HQ1.

Day 121: The HQ1 evaluation ends. Remove HQ1, and roll out the Q1 changes to 100%.

Note that the total holdback period includes both the holdback itself and the evaluation test that runs at the end. For example, a quarterly holdback followed by 3 to 4 weeks of evaluation means users in the holdback group are withheld from product changes for a quarter plus those additional weeks.
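The timeline above can be checked with simple date arithmetic. This is a minimal sketch with hypothetical names. It treats the start date as day 1, so a 90-day holdback followed by a 30-day evaluation matches the day 91 and day 121 milestones in the example.

```python
from datetime import date, timedelta


def holdback_timeline(start: date, holdback_days: int, evaluation_days: int) -> dict:
    """Compute key dates and the total time holdback users are withheld from changes."""
    evaluation_start = start + timedelta(days=holdback_days)
    evaluation_end = evaluation_start + timedelta(days=evaluation_days)
    return {
        "evaluation_start": evaluation_start,  # day holdback_days + 1
        "evaluation_end": evaluation_end,      # holdback can be removed from here on
        "total_days_withheld": holdback_days + evaluation_days,
    }
```

For a 90-day holdback with a 30-day evaluation, holdback users are withheld for 120 days in total, which is the figure to use when weighing the opportunity cost discussed earlier.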

This video gives a 3-minute, 57-second overview of exclusivity groups and holdbacks.

Cross-surface holdbacks

In Confidence

In Confidence, holdbacks can span multiple surfaces. A holdback is created on one surface, but experiments on other surfaces can also respect it by selecting both their own surface and the surface where the holdback lives. Read more about using surfaces to coordinate experiments. Cross-surface holdbacks require coordination across all teams involved. Set up regular syncs to ensure everyone knows which holdbacks are active and when evaluation periods begin and end.

This video gives a 4-minute, 47-second overview of advanced experiment coordination using exclusivity groups and holdbacks across multiple surfaces.

Risks and common questions

Holdback corruption. If an experiment runs without respecting the holdback group, the holdback is contaminated and the cumulative test loses accuracy. Reduce this risk through careful communication, off-platform coordination, and reviewing experiment setups before launch.

Some changes can't be held back. Some backend and infrastructure changes can't be withheld from users. When this happens, the cumulative measurement doesn't capture the full impact of all changes. Document which changes couldn't be held back and communicate the caveat when reporting results.

Which version of a feature to use. When a feature evolves across multiple holdback periods, use the version shipped at the end of the previous holdback as the baseline for holdback users in the next period. Using the previous quarter's final version as the baseline evaluates the change between quarters rather than each incremental update within a quarter.

An A/B test runs into the next quarter. If a test that excludes the current quarter's holdback runs past the end of the quarter and succeeds, its change is rolled out in the next quarter. Create the next quarter's holdback group early and configure the test to exclude those users as well. This ensures that holdback users in both periods are not exposed to the change before their evaluation.
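Part of the pre-launch review mentioned under holdback corruption can be automated when holdbacks are not enforced by the platform. The sketch below assumes you can export each experiment's configuration into a simple record. The Experiment class and find_violations function are hypothetical and do not correspond to any real platform API.

```python
from dataclasses import dataclass, field


@dataclass
class Experiment:
    """Hypothetical export of one experiment's configuration."""
    name: str
    excluded_holdbacks: set[str] = field(default_factory=set)


def find_violations(experiments: list[Experiment],
                    active_holdbacks: set[str]) -> dict[str, list[str]]:
    """Map each misconfigured experiment to the active holdbacks it fails to exclude."""
    violations = {}
    for exp in experiments:
        missing = active_holdbacks - exp.excluded_holdbacks
        if missing:
            violations[exp.name] = sorted(missing)
    return violations
```

Running a check like this before each launch, with active_holdbacks containing both the current and the next quarter's holdback during a transition, flags experiments that would contaminate either group.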

Reader exercise

Which product changes during a holdback period should be configured to respect the holdback?

Only changes that are directly related to the metrics you plan to evaluate.

All intended product improvements, regardless of their expected impact.

Only changes that are expected to have a large impact on users.

Reader exercise

How should you decide on the size of a holdback group?

Always use 10% as the standard holdback size.

Balance the statistical power needed for the evaluation against the opportunity cost of withholding successful features from holdback users and the population available for experiments.

Use the same size as the smallest A/B test planned during the holdback period.