Metric interaction effects occur when the effect of one experiment on a metric depends on whether another experiment is also running. If experiment A lifts conversion by 3% when run alone but only by 1% when experiment B is also active, the two experiments interact. The 2-percentage-point gap is the interaction effect.
Interaction effects are a coordination problem. When dozens of teams experiment concurrently on the same product, the assumption that each experiment's results are independent becomes questionable. If interactions are large and frequent, experiment results become unreliable and teams lose the ability to attribute metric movements to specific changes.
How common are metric interaction effects?
In practice, most concurrent experiments don't interact meaningfully. Research on large-scale experimentation programs, including Spotify's, shows that statistically detectable interaction effects are rare when experiments modify different parts of the product. Two experiments are unlikely to interact if they change different screens, different algorithms, or different user flows.
Interactions become more likely when experiments operate on the same feature surface. Two experiments that both modify the home feed ranking algorithm, or two experiments that both change the checkout flow, have overlapping causal pathways. The effect of changing the ranking weights depends on what content is being ranked, and if another experiment is changing the content pool, the two interact.
At Spotify, where 300+ teams run 10,000+ experiments per year, the coordination challenge is structural. The platform uses a concept called Surfaces to group experiments that share a product area. Surfaces standardize required metrics and provide visibility into what's running concurrently, so teams can identify potential interactions before they become a problem.
How do you detect interaction effects?
The gold standard is a factorial design: run all combinations of the experiments and measure the interaction directly. If experiments A and B each have two variants (control and treatment), a full factorial has four cells: neither treatment, A only, B only, and both. The interaction effect is the difference between the combined effect and the sum of the individual effects.
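As a concrete sketch, here is that arithmetic in Python. The cell means are made-up conversion rates, used only to show how the interaction falls out of the four cells:

```python
# Illustrative 2x2 factorial: hypothetical conversion rates per cell, not real data.
cell_means = {
    ("control", "control"): 0.100,  # neither treatment
    ("treat", "control"):   0.103,  # A only
    ("control", "treat"):   0.102,  # B only
    ("treat", "treat"):     0.104,  # both treatments
}

baseline = cell_means[("control", "control")]
effect_a = cell_means[("treat", "control")] - baseline
effect_b = cell_means[("control", "treat")] - baseline
effect_both = cell_means[("treat", "treat")] - baseline

# Interaction = combined effect minus the sum of the individual effects.
interaction = effect_both - (effect_a + effect_b)
print(f"A alone: {effect_a:+.3f}, B alone: {effect_b:+.3f}, "
      f"both: {effect_both:+.3f}, interaction: {interaction:+.3f}")
```

In this made-up example the combined effect (+0.004) is smaller than the sum of the individual effects (+0.005), so the interaction is -0.001: the treatments partially cancel each other out.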
Factorial designs work but are expensive. Each additional two-variant experiment doubles the number of cells: ten concurrent experiments would require 2^10 = 1,024 cells. The power to detect interaction effects is also lower than the power to detect main effects of the same size. For most product experimentation, running factorial designs across all pairs of concurrent experiments isn't feasible.
The more practical approach is to detect interactions post hoc. If experiment A shows different results when experiment B is running vs. not running, that's evidence of an interaction. Confidence enables this by tracking which experiments overlap in time and population, so teams can condition their analysis on the presence or absence of concurrent experiments when results look unexpected.
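A rough sketch of that conditioning, assuming you can label each user in experiment A by whether they were concurrently exposed to experiment B. The column names and the normal approximation are assumptions for illustration, not a description of Confidence's internals:

```python
import numpy as np

def lift_and_var(metric, treated):
    """Treatment-minus-control difference in means and its estimated variance."""
    t, c = metric[treated], metric[~treated]
    lift = t.mean() - c.mean()
    var = t.var(ddof=1) / len(t) + c.var(ddof=1) / len(c)
    return lift, var

def interaction_check(metric, treated_a, exposed_b):
    """z-statistic for whether A's lift differs with vs. without exposure to B."""
    lift_with_b, var_with_b = lift_and_var(metric[exposed_b], treated_a[exposed_b])
    lift_no_b, var_no_b = lift_and_var(metric[~exposed_b], treated_a[~exposed_b])
    diff = lift_with_b - lift_no_b
    z = diff / np.sqrt(var_with_b + var_no_b)
    return diff, z
```

A large |z| suggests A's measured effect depends on whether B was active; a small one is consistent with no meaningful interaction, keeping in mind that this comparison has less power than the main-effect test.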
What should teams do about interaction effects?
Three strategies, in order of practicality.
Isolate experiments that are likely to interact. Use exclusive experiment assignment (mutual exclusion) for experiments on the same surface. Spotify's bucket-reuse hashing system assigns users to non-overlapping experiment populations for experiments that modify the same product area, eliminating the interaction by ensuring no user sees both treatments simultaneously.
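A minimal sketch of bucket-based mutual exclusion, assuming a fixed number of hash buckets per surface and disjoint bucket ranges per experiment. This is an illustration of the general technique, not Confidence's actual implementation:

```python
import hashlib

NUM_BUCKETS = 10_000  # assumed bucket count per surface

def bucket(user_id: str, salt: str = "surface-home-feed") -> int:
    """Deterministically hash a user into a bucket for a given surface."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % NUM_BUCKETS

def assigned_experiment(user_id: str) -> str:
    # Hypothetical layout: experiment A owns buckets [0, 5000),
    # experiment B owns [5000, 10000), so no user sees both treatments.
    return "experiment_A" if bucket(user_id) < 5_000 else "experiment_B"
```

The trade-off is traffic: exclusive experiments split the population, so each one runs on fewer users and takes longer to reach adequate power.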
Monitor for unexpected interactions. When experiment results are surprising, check whether a concurrent experiment on the same surface could explain the discrepancy. A treatment that lifts a metric during weeks one and two but flatlines during week three may be interacting with a new experiment that launched in week three.
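One simple way to spot that pattern is to track the lift week by week. The sketch below assumes rows of (week, is_treated, metric_value) for experiment A; the shape of the data is an assumption for illustration:

```python
from collections import defaultdict
from statistics import mean

def weekly_lifts(rows):
    """Per-week treatment-minus-control lift for experiment A."""
    by_week = defaultdict(lambda: {"t": [], "c": []})
    for week, is_treated, value in rows:
        by_week[week]["t" if is_treated else "c"].append(value)
    return {w: mean(g["t"]) - mean(g["c"]) for w, g in sorted(by_week.items())}

# A lift that holds in weeks 1-2 and flatlines from week 3 onward is a cue to
# check what else launched on the same surface in week 3.
```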
Accept that small interactions exist and build them into your uncertainty estimates. If the interaction effect is small relative to the main effects, it's noise, not signal. The variance introduced by concurrent experiments is already captured in the confidence intervals of a well-designed test. Only interactions that are large relative to the treatment effect threaten the validity of the result.
Why does experiment coordination matter more at scale?
A team running 5 experiments per quarter rarely faces interaction effects. A team running 50 experiments concurrently on the same product almost certainly does. The probability of at least one meaningful interaction grows with the number of concurrent experiments on the same surface.
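A back-of-the-envelope way to see this: if each pair of concurrent experiments on a surface independently has some small probability of interacting meaningfully, the chance of at least one interaction climbs quickly with the number of experiments. The per-pair probability below is an assumed value, not a measured rate:

```python
from math import comb

def p_any_interaction(n_experiments: int, p_pair: float = 0.01) -> float:
    """Probability of at least one interaction among all pairs of experiments."""
    pairs = comb(n_experiments, 2)
    return 1 - (1 - p_pair) ** pairs

for n in (5, 10, 20, 50):
    print(n, round(p_any_interaction(n), 3))
# 5 -> ~0.10, 10 -> ~0.36, 20 -> ~0.85, 50 -> ~1.00 (with the assumed p = 0.01)
```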
Confidence addresses this through Surfaces, which provide shared metric definitions, visibility into concurrent experiments, and the option for mutual exclusion. This doesn't eliminate interaction effects, but it makes them visible and manageable rather than a hidden source of unreliable results.