Experiment Analysis

What is Simpson's paradox?

Simpson's paradox is a statistical phenomenon where a trend that appears in several subgroups reverses or disappears when the subgroups are combined. A treatment can improve a metric in every user segment individually, yet appear to hurt it in the aggregate, or vice versa. The paradox arises when the compared groups end up with different segment compositions, so the mix of segments, rather than the treatment itself, drives the aggregate comparison.

This isn't a theoretical curiosity. It shows up in real experiment analysis whenever the treatment changes the composition of who ends up in different subgroups.

How does Simpson's paradox appear in experiments?

Consider a product change that affects both mobile and desktop users. Among mobile users, conversion increases from 3% to 4%. Among desktop users, conversion increases from 8% to 9%. Both segments improve. But if the treatment also causes more users to visit on mobile (where baseline conversion is lower), the aggregate conversion rate can decrease because the overall mix shifted toward the lower-converting segment.

The math: with a 50/50 mix, the control aggregate is 0.5 × 3% + 0.5 × 8% = 5.5%. If treatment shifts the mobile share to 70%, the treatment aggregate is 0.7 × 4% + 0.3 × 9% = 5.5%: flat, despite a full percentage-point lift in both segments, and a larger shift toward mobile would push it below baseline. The per-segment improvement is real. The flat or declining aggregate is also real. They're answering different questions.
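
A minimal sketch of that arithmetic. The rates and shares are hard-coded from the illustrative example above, and the aggregate_rate helper is ours, not from any experimentation library:

```python
def aggregate_rate(segment_rates, segment_shares):
    """Mix-weighted aggregate conversion rate across segments."""
    return sum(rate * share for rate, share in zip(segment_rates, segment_shares))

# Per-segment conversion rates: [mobile, desktop]
control_rates = [0.03, 0.08]
treatment_rates = [0.04, 0.09]  # both segments improve by one point

# Segment mix: treatment shifts traffic toward lower-converting mobile
control_shares = [0.5, 0.5]
treatment_shares = [0.7, 0.3]

print(aggregate_rate(control_rates, control_shares))      # 0.055
print(aggregate_rate(treatment_rates, treatment_shares))  # 0.055 -> no aggregate lift
```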

This is why segment analysis matters. The aggregate result answers "what happened overall?" The segment results answer "did the change help or hurt within each group?" When those answers conflict, you're seeing Simpson's paradox, and the right interpretation depends on what you're trying to decide.

When should you worry about Simpson's paradox?

Three conditions make Simpson's paradox likely in experiment analysis:

The treatment affects user behavior in ways that change segment composition. If the change makes the product more appealing to one group, more of that group shows up, shifting the mix (a quick check for this shift is sketched after this list).

The segments have meaningfully different baseline metric values. If mobile and desktop users convert at similar rates, composition shifts don't create a paradox.

The analysis doesn't account for the segments. An aggregate-only analysis hides the reversal entirely.
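
One way to check the first condition directly is to test whether the segment mix differs between arms. A hedged sketch, assuming scipy is available and that you have per-arm visitor counts by segment; the counts below are made up for illustration:

```python
from scipy.stats import chi2_contingency

# Hypothetical visitors per segment in each arm: [mobile, desktop]
control_counts = [5000, 5000]
treatment_counts = [7000, 3000]

# Chi-square test of independence: does segment mix depend on arm?
chi2, p_value, dof, expected = chi2_contingency([control_counts, treatment_counts])

if p_value < 0.01:
    print(f"Segment mix differs between arms (p = {p_value:.2g}); "
          "aggregate and segment results may disagree.")
```

A significant mix shift doesn't prove a paradox on its own; it's the trigger to compare segment-level and aggregate results before trusting either one alone.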

Confidence supports segment breakdowns in its automated analysis, which helps teams spot these reversals. When the overall result and segment results disagree, that's a signal to investigate composition changes before making a shipping decision.

How do you resolve the paradox?

Simpson's paradox isn't something you "fix" statistically. It's something you interpret correctly.

If the treatment genuinely improves the metric within every relevant segment, that's a real improvement. The aggregate decline is a composition effect, not a treatment failure. Whether you should ship depends on whether the composition shift is itself desirable or harmful.

If the composition shift is a problem (for example, if driving users from a high-value segment to a low-value one destroys revenue even though per-segment metrics improve), the aggregate metric is telling you something important that the segment metrics miss.
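
One way to quantify how much of the aggregate move is composition is direct standardization: recompute the treatment aggregate while holding the segment mix fixed at the control mix. A sketch using the illustrative numbers from earlier; this is plain arithmetic, not a Confidence API:

```python
# Illustrative per-segment rates and mix shares from the earlier example
control_rates, treatment_rates = [0.03, 0.08], [0.04, 0.09]
control_shares, treatment_shares = [0.5, 0.5], [0.7, 0.3]

def aggregate(rates, shares):
    return sum(r * s for r, s in zip(rates, shares))

observed_effect = (aggregate(treatment_rates, treatment_shares)
                   - aggregate(control_rates, control_shares))

# Hold the mix at the control shares: isolates the within-segment effect
standardized_effect = (aggregate(treatment_rates, control_shares)
                       - aggregate(control_rates, control_shares))

# Whatever remains is the composition (mix-shift) effect
mix_effect = observed_effect - standardized_effect

print(f"observed: {observed_effect:+.3%}, "
      f"within-segment: {standardized_effect:+.3%}, mix: {mix_effect:+.3%}")
# observed: +0.000%, within-segment: +1.000%, mix: -1.000%
```

Splitting the effect this way makes the trade-off explicit: here a +1 point within-segment gain is exactly cancelled by a -1 point mix shift, which is the decision the aggregate metric alone would have hidden.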

The resolution is always the same: look at both levels and decide which question matters for your decision. Don't default to the aggregate just because it's simpler.