Core Experimentation

What are Mutually Exclusive Experiments?

Mutually exclusive experiments are experiments designed so that no user participates in more than one at the same time. If a user is assigned to Experiment A, they're excluded from Experiment B. This prevents interaction effects: the risk that two concurrent changes influence each other's metrics in ways that make both results unreliable.

Mutual exclusion is the cleanest solution to a problem that grows with experimentation maturity. When one team runs one experiment, interactions aren't a concern. When 300+ teams run 10,000+ experiments per year, as Spotify does, two experiments touching the same product surface at the same time is the default, not the exception. Confidence's coordination layer manages mutual exclusion at the platform level so teams don't have to negotiate traffic allocation in spreadsheets.

When do you need mutually exclusive experiments?

Not all concurrent experiments interact. Two experiments running on different parts of the product (one on search, another on checkout) affect different user behaviors and can safely overlap. The problem arises when two experiments affect the same user experience or the same metrics.

Consider two concurrent experiments on a product's home feed: one changes the ranking algorithm, another changes the card layout. A user in both experiments simultaneously sees a different ranking and a different layout. If engagement goes up, which change caused it? The ranking? The layout? The combination? Without mutual exclusion, you can't separate the effects.

The general rule: experiments that modify the same product surface or measure the same primary metrics should be mutually exclusive. Experiments on unrelated surfaces can overlap safely.

Spotify's experimentation coordination strategy, published in 2021, describes the bucket-reuse hashing system that makes this work at scale. A salt-machine algorithm maps users into buckets. Each experiment claims a set of buckets, and the system ensures no two mutually exclusive experiments claim overlapping buckets. The hashing is deterministic: a given user always maps to the same bucket for a given salt, so assignment is consistent without storing per-user state.
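To make the determinism concrete, here is a minimal sketch of salted hashing into buckets. The function name, the choice of SHA-256, and the bucket count are illustrative assumptions, not Spotify's actual implementation.

```python
import hashlib

def assign_bucket(user_id: str, salt: str, num_buckets: int = 1000) -> int:
    """Deterministically map a user to a bucket for a given salt.

    The same (user_id, salt) pair always yields the same bucket,
    so assignment stays consistent without storing per-user state.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % num_buckets

# Same input, same output, on any machine, at any time.
assert assign_bucket("user-42", "home-feed-salt") == assign_bucket("user-42", "home-feed-salt")
```

Because assignment is a pure function of the user ID and the salt, any service can compute it locally; mutual exclusion then reduces to ensuring that conflicting experiments claim disjoint bucket ranges.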

How does mutual exclusion affect experiment bandwidth?

Mutual exclusion has a direct cost: it reduces the traffic available to each experiment. If your product has 1 million daily active users and you split traffic 50/50 between two mutually exclusive experiments, each experiment gets 500,000 users. Run four mutually exclusive experiments and each gets 250,000.

This is the core tension. Mutual exclusion improves result quality by eliminating interactions, but it reduces bandwidth by splitting traffic. Organizations that don't manage this tension either run too few concurrent experiments (over-applying exclusion and wasting traffic) or too many overlapping ones (producing unreliable results).

Confidence manages this through the Surface concept. A Surface groups experiments that operate on the same product area. Within a Surface, experiments that could interact are made mutually exclusive. Across different Surfaces, experiments can overlap freely. This maximizes the traffic available to each experiment while protecting against the interactions that actually threaten validity.
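A small sketch can illustrate the idea, under the assumption that a surface simply tracks which buckets each experiment has claimed; the class and method names here are hypothetical, not Confidence's API.

```python
class Surface:
    """Groups experiments on one product area and enforces exclusion.

    Within a surface, experiments must claim disjoint bucket sets;
    different surfaces have independent bucket spaces, so their
    experiments can overlap freely.
    """

    def __init__(self, name: str):
        self.name = name
        self.claimed: dict[str, set[int]] = {}  # experiment -> buckets

    def claim(self, experiment: str, buckets: set[int]) -> None:
        for other, taken in self.claimed.items():
            if buckets & taken:
                raise ValueError(
                    f"'{experiment}' would overlap '{other}' on surface '{self.name}'"
                )
        self.claimed[experiment] = buckets

home_feed = Surface("home-feed")
home_feed.claim("ranking-v2", set(range(0, 500)))
home_feed.claim("card-layout", set(range(500, 1000)))  # fine: disjoint buckets
# home_feed.claim("autoplay", set(range(400, 600)))    # raises: overlaps both
```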

Variance reduction techniques like CUPED also help. By reducing the sample size each experiment needs to reach adequate power, CUPED effectively increases the number of mutually exclusive experiments you can run concurrently on the same traffic.
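The CUPED adjustment itself is short. This sketch uses the standard formulation, where the adjusted metric is y minus theta times the centered pre-experiment covariate and theta is chosen to minimize the adjusted variance; the simulated data is purely illustrative.

```python
import numpy as np

def cuped_adjust(y: np.ndarray, x_pre: np.ndarray) -> np.ndarray:
    """Return CUPED-adjusted values of metric y using pre-experiment data x_pre."""
    # theta = cov(x_pre, y) / var(x_pre) minimizes the adjusted variance
    theta = np.cov(x_pre, y, ddof=0)[0, 1] / np.var(x_pre)
    return y - theta * (x_pre - x_pre.mean())

rng = np.random.default_rng(0)
x_pre = rng.normal(10, 3, size=10_000)           # pre-period behavior
y = 0.8 * x_pre + rng.normal(0, 1, size=10_000)  # correlated in-experiment metric
print(np.var(y), np.var(cuped_adjust(y, x_pre))) # adjusted variance is far lower
```

The stronger the correlation between the pre-experiment covariate and the in-experiment metric, the larger the variance reduction, and the more concurrent experiments the same traffic can support.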

What's the alternative to mutual exclusion?

Some organizations allow overlapping experiments and handle interactions statistically. Interaction testing checks whether the effect of Experiment A differs depending on whether a user is also in Experiment B. If no significant interaction is detected, the experiments can be analyzed independently.
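One common form of interaction test is a regression with an interaction term: the coefficient on the A-and-B term estimates how much Experiment A's effect changes when a user is also in Experiment B. This sketch assumes simulated data with no true interaction; the effect sizes are arbitrary.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 20_000
df = pd.DataFrame({
    "a": rng.integers(0, 2, n),  # assignment to Experiment A
    "b": rng.integers(0, 2, n),  # independent assignment to Experiment B
})
# Simulated outcome: main effects for A and B, no true interaction
df["y"] = 1.0 + 0.10 * df["a"] + 0.05 * df["b"] + rng.normal(0, 1, n)

# The a:b coefficient estimates the interaction effect
fit = smf.ols("y ~ a + b + a:b", data=df).fit()
print(fit.params["a:b"], fit.pvalues["a:b"])  # near zero / non-significant here
```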

The problem: interaction tests require large sample sizes and are notoriously underpowered. In practice, they're better at detecting large interactions than subtle ones. A subtle interaction that biases your treatment effect estimate by 10% might go undetected.

Another approach is post-hoc adjustment, where you model the joint effects of overlapping experiments. This works in theory, but it adds analytical complexity and relies on modeling assumptions that can themselves bias the results.

For most product experimentation programs, mutual exclusion within a product surface remains the safest and simplest approach. It trades traffic for certainty, and that trade-off is usually worth it.