Causal inference is the set of statistical methods used to determine whether a change actually caused an observed effect, rather than merely being correlated with it. In product experimentation, causal inference is what separates "users who saw the new feature had higher engagement" (correlation) from "the new feature caused engagement to increase" (causation).
A/B testing is the gold standard of causal inference because randomization eliminates confounding. When users are randomly assigned to control or treatment, the only systematic difference between groups is the change being tested. Any difference in outcomes can be attributed to the treatment with known error rates. This is why experimentation programs exist: they're the most reliable way to establish that a product change makes things better, not just that it arrived at the same time as an improvement. At Spotify, this principle underlies every one of the 10,000+ experiments run annually on Confidence.
Why is causation harder to establish than correlation?
Correlation is easy to observe. You ship a feature, engagement goes up, and the chart looks great. The problem is that dozens of other things also changed: seasonality, a marketing campaign, a competitor's outage, a viral moment, organic user growth. Any of these could explain the engagement increase.
Causal inference requires ruling out these alternative explanations. Randomization does this mechanically: because users are assigned to treatment and control at random, all confounding factors (observed and unobserved) are balanced between groups in expectation. The formal foundation is the potential outcomes framework developed by Donald Rubin, which defines a user's causal effect as the difference between that user's outcome under treatment and their outcome under control. Since you can never observe both outcomes for the same user, you instead estimate the average causal effect by comparing averages across the randomized groups.
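In potential outcomes notation, the argument looks like this (a standard textbook formulation, not notation specific to any one tool):

```latex
% Y_i(1), Y_i(0): user i's outcome under treatment and under control.
% The individual-level causal effect is unobservable, because only one
% of the two potential outcomes is ever realized for a given user:
\tau_i = Y_i(1) - Y_i(0)

% The average treatment effect (ATE) over the user population:
\tau = \mathbb{E}\left[ Y(1) - Y(0) \right]

% Under random assignment, treatment T is independent of the potential
% outcomes, so the difference in observed group means is unbiased for the ATE:
\hat{\tau} = \bar{Y}_{T=1} - \bar{Y}_{T=0}, \qquad \mathbb{E}[\hat{\tau}] = \tau
```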
When randomization isn't possible, causal inference gets harder. Observational methods like difference-in-differences, instrumental variables, and regression discontinuity designs can establish causation under specific assumptions, but those assumptions are strong and often unverifiable. This is why A/B tests remain the default in product development: the assumptions are minimal and the conclusions are straightforward.
What role does causal inference play in A/B testing?
Every A/B test is a causal inference exercise. The difference in group means estimates the average treatment effect (the causal impact of the change on the metric of interest), and a z-test or t-test quantifies whether that estimate can be distinguished from zero.
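As a concrete illustration, here is a minimal sketch of that estimate and test in Python; the function and variable names are illustrative, not from any particular experimentation library:

```python
import numpy as np
from scipy import stats

def difference_in_means_test(control: np.ndarray, treatment: np.ndarray):
    """Estimate the average treatment effect and test it with a two-sample z-test."""
    effect = treatment.mean() - control.mean()
    # Unpooled standard error of the difference in means.
    se = np.sqrt(treatment.var(ddof=1) / treatment.size
                 + control.var(ddof=1) / control.size)
    z = effect / se
    p_value = 2 * stats.norm.sf(abs(z))  # two-sided p-value
    return effect, z, p_value

# Simulated example: the treatment lifts the metric by about 1%.
rng = np.random.default_rng(7)
control = rng.normal(10.0, 2.0, size=20_000)
treatment = rng.normal(10.1, 2.0, size=20_000)
print(difference_in_means_test(control, treatment))
```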
Several design choices affect the quality of that causal estimate.
Random assignment must be truly random. If the assignment mechanism correlates with user characteristics (for example, assigning power users to treatment), the causal interpretation breaks down. Confidence uses deterministic hashing of user IDs and experiment salts to ensure random, reproducible assignment without storing state.
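To make the mechanism concrete, here is a minimal sketch of salted, hash-based assignment; it illustrates the general technique, not Confidence's actual implementation:

```python
import hashlib

def assign(user_id: str, experiment_salt: str, treatment_share: float = 0.5) -> str:
    """Deterministically assign a user to a group by hashing (salt, user_id)."""
    digest = hashlib.sha256(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    # Map the hash to a uniform value in [0, 1); the same inputs always
    # produce the same bucket, so no assignment state needs to be stored.
    bucket = int(digest[:15], 16) / 16**15
    return "treatment" if bucket < treatment_share else "control"

# The same user always lands in the same group for a given experiment,
# and a different salt re-randomizes assignment for a new experiment.
assert assign("user-123", "exp-checkout-v2") == assign("user-123", "exp-checkout-v2")
```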
Sample ratio mismatch (SRM) checks verify that the ratio of users in each group matches the intended split. If you designed a 50/50 split but observe 52/48, something in the assignment or logging pipeline is broken, and the causal inference is compromised. Confidence runs SRM checks automatically.
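Under the hood, an SRM check is typically a chi-squared goodness-of-fit test of the observed group sizes against the designed split. A minimal sketch, with an illustrative alpha threshold:

```python
from scipy import stats

def srm_check(observed_counts, designed_ratios, alpha=0.001):
    """Chi-squared goodness-of-fit test: observed group sizes vs. the designed split."""
    total = sum(observed_counts)
    expected = [ratio * total for ratio in designed_ratios]
    _, p_value = stats.chisquare(observed_counts, f_exp=expected)
    # A tiny p-value means the assignment or logging pipeline is broken.
    return p_value, p_value < alpha

# Designed 50/50, observed 52/48 out of 100,000 users.
p_value, srm_detected = srm_check([52_000, 48_000], [0.5, 0.5])
print(f"p = {p_value:.1e}, SRM detected: {srm_detected}")
```

At samples this large, a 52/48 split is astronomically unlikely under a true 50/50 design, which is why even small observed mismatches are treated as pipeline bugs rather than noise.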
Trigger analysis improves the causal estimate by restricting analysis to users who actually encountered the change. If you're testing a new checkout flow, the relevant causal question is "what's the effect on users who reached checkout?", not "what's the effect on all users including those who never saw it?" The overall average is a valid causal estimate of the intent-to-treat effect, but the triggered estimate is what's relevant for the product decision.
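In code, trigger analysis is just a filter applied before the comparison. A minimal sketch, assuming a hypothetical per-user exposure flag:

```python
import pandas as pd

# Hypothetical per-user records: assigned group, whether the user reached
# checkout (the trigger point), and whether they converted.
users = pd.DataFrame({
    "group":            ["treatment", "treatment", "treatment",
                         "control", "control", "control"],
    "reached_checkout": [True, True, False, True, True, False],
    "converted":        [1, 0, 0, 0, 1, 0],
})

# Intent-to-treat estimate: averages over everyone assigned, exposed or not.
itt = users.groupby("group")["converted"].mean()

# Triggered estimate: restrict to users who actually reached checkout.
triggered = users[users["reached_checkout"]].groupby("group")["converted"].mean()
print(itt, triggered, sep="\n\n")
```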
When can't you run an A/B test?
Some product changes can't be randomized at the user level. Network effects (changes to social features where one user's experience depends on what their friends see), platform-wide changes (new pricing, infrastructure migrations), and regulatory constraints all make standard A/B testing difficult.
Spotify's engineering blog covers one alternative: encouragement designs with instrumental variables. Instead of directly assigning users to treatment, you randomly assign "encouragement" (like a notification prompting users to try the feature) and use the encouragement as an instrument to estimate the causal effect among users who respond. This provides valid causal estimates for a subpopulation (compliers) even when you can't force treatment assignment.
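The simplest version of this estimator is the Wald ratio: the encouragement's effect on the outcome divided by its effect on uptake. A minimal sketch of that textbook form, under the standard IV assumptions (names are illustrative, and this is not the blog post's code):

```python
import numpy as np

def wald_iv_estimate(encouraged, took_feature, outcome):
    """Wald/IV estimate: ITT effect on the outcome divided by the uptake difference."""
    encouraged = np.asarray(encouraged, dtype=bool)
    took_feature = np.asarray(took_feature, dtype=float)
    outcome = np.asarray(outcome, dtype=float)

    # Intent-to-treat effect of the randomized encouragement on the outcome.
    itt_effect = outcome[encouraged].mean() - outcome[~encouraged].mean()
    # First stage: how much the encouragement moves actual feature uptake.
    uptake_diff = took_feature[encouraged].mean() - took_feature[~encouraged].mean()
    # Local average treatment effect among compliers (users whose uptake
    # is moved by the encouragement). Requires uptake_diff != 0.
    return itt_effect / uptake_diff
```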
Other quasi-experimental methods include difference-in-differences (comparing trends before and after a change between affected and unaffected groups) and synthetic control methods (constructing a counterfactual from weighted combinations of unaffected units). These methods require stronger assumptions than A/B tests, but they're often the best available option when randomization is infeasible.
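For intuition, the basic two-period difference-in-differences estimate is just four averages. A minimal sketch, assuming the parallel-trends condition holds:

```python
import numpy as np

def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Two-period DiD: the treated group's change minus the control group's change."""
    # The control group's change estimates the shared time trend; subtracting it
    # isolates the treatment effect, assuming both groups would have moved in parallel.
    return ((np.mean(treated_post) - np.mean(treated_pre))
            - (np.mean(control_post) - np.mean(control_pre)))

# Example: engagement in a market that got the change vs. a comparison market.
print(diff_in_diff(treated_pre=[5.0, 5.2], treated_post=[6.0, 6.2],
                   control_pre=[5.1, 4.9], control_post=[5.5, 5.5]))
# -> 0.5: the treated market rose 0.5 more than the shared trend explains.
```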