What is Simpson's paradox?

Simpson's paradox is a statistical phenomenon where a trend that appears in several subgroups reverses or disappears when the subgroups are combined. A treatment can improve a metric in every user segment individually, yet appear to hurt it in the aggregate, or vice versa. The paradox arises when the segment mix differs between the groups being compared and the segments have different baseline rates, so that the mix confounds the overall comparison.

This isn't a theoretical curiosity. It can show up in real experiment analysis when the treatment changes which users end up in which subgroup.

How does Simpson's paradox appear in experiments?

Consider a product change that affects both mobile and desktop users. Among mobile users, conversion increases from 3% to 4%. Among desktop users, conversion increases from 8% to 9%. Both segments improve. But if the treatment also causes more users to visit on mobile (where baseline conversion is lower), the aggregate conversion rate can decrease because the overall mix shifted toward the lower-converting segment.

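The math: in control, with a 50/50 mix, the weighted average conversion is 0.5 × 3% + 0.5 × 8% = 5.5%. If treatment shifts the mobile share from 50% to 70%, it becomes 0.7 × 4% + 0.3 × 9% = 5.5%, so the per-segment gains are exactly cancelled. Any larger shift toward mobile pushes the aggregate below control. The per-segment improvement is real. The aggregate decline, when it occurs, is also real. They're answering different questions.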
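The arithmetic can be checked with a few lines of Python. This is a minimal sketch using the rates and mobile shares from the example above. The 80% mobile share in the last case is an assumed value, added only to show the aggregate dropping below control, and the names `aggregate_rate`, `control`, and `treatment` are illustrative.

```python
# (traffic share, conversion rate) per segment, taken from the example above.
control = {"mobile": (0.50, 0.03), "desktop": (0.50, 0.08)}
treatment = {"mobile": (0.70, 0.04), "desktop": (0.30, 0.09)}

# Assumed for illustration: an even larger shift toward mobile.
treatment_80 = {"mobile": (0.80, 0.04), "desktop": (0.20, 0.09)}


def aggregate_rate(segments):
    """Weighted average of segment rates, weighted by traffic share."""
    return sum(share * rate for share, rate in segments.values())


control_rate = aggregate_rate(control)
treatment_rate = aggregate_rate(treatment)
treatment_80_rate = aggregate_rate(treatment_80)

print(f"control (50% mobile):   {control_rate:.2%}")
print(f"treatment (70% mobile): {treatment_rate:.2%}")
print(f"treatment (80% mobile): {treatment_80_rate:.2%}")
```

Both segments convert one percentage point better under treatment in all three cases. Only the mix changes, and that alone moves the aggregate from flat to negative.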

This is why segment analysis matters. The aggregate result answers "what happened overall?" The segment results answer "did the change help or hurt within each group?" When those answers conflict, you're seeing Simpson's paradox, and the right interpretation depends on what you're trying to decide.

When should you worry about Simpson's paradox?

Three conditions make Simpson's paradox likely in experiment analysis:

  • The treatment affects user behavior in ways that change segment composition. If the change makes the product more appealing to one group, more of that group shows up, shifting the mix.
  • The segments have meaningfully different baseline metric values. If mobile and desktop users convert at similar rates, composition shifts don't create a paradox.
  • The analysis doesn't account for the segments. An aggregate-only analysis hides the reversal entirely.
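The three conditions can be checked directly from per-segment counts. The sketch below is illustrative only: `simpson_check`, its input layout, and the example counts are hypothetical and are not part of Confidence or any other tool. It compares point estimates without any significance testing, so noisy segments can trigger false alarms.

```python
def simpson_check(counts):
    """Compare segment-level and aggregate lifts.

    counts maps segment name -> {"control": (users, conversions),
                                 "treatment": (users, conversions)}.
    """
    arms = ("control", "treatment")
    users = {arm: sum(seg[arm][0] for seg in counts.values()) for arm in arms}
    convs = {arm: sum(seg[arm][1] for seg in counts.values()) for arm in arms}

    aggregate_lift = (convs["treatment"] / users["treatment"]
                      - convs["control"] / users["control"])

    segment_lift, mix_shift, baseline = {}, {}, {}
    for name, seg in counts.items():
        c_users, c_conv = seg["control"]
        t_users, t_conv = seg["treatment"]
        baseline[name] = c_conv / c_users                    # control rate
        segment_lift[name] = t_conv / t_users - baseline[name]
        # Change in this segment's share of traffic between arms.
        mix_shift[name] = t_users / users["treatment"] - c_users / users["control"]

    all_up = all(lift > 0 for lift in segment_lift.values())
    all_down = all(lift < 0 for lift in segment_lift.values())
    reversal = (all_up and aggregate_lift < 0) or (all_down and aggregate_lift > 0)

    return {
        "aggregate_lift": aggregate_lift,
        "segment_lift": segment_lift,
        "mix_shift": mix_shift,
        "baseline": baseline,
        "reversal": reversal,
    }


# Hypothetical counts: both segments improve by one point, mobile share grows.
example_counts = {
    "mobile": {"control": (5000, 150), "treatment": (8000, 320)},
    "desktop": {"control": (5000, 400), "treatment": (2000, 180)},
}
result = simpson_check(example_counts)
print(result["reversal"], result["aggregate_lift"], result["mix_shift"])
```

A large `mix_shift`, a wide gap between `baseline` values, and `reversal` being true correspond to the three conditions listed above.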

Confidence supports segment breakdowns in its automated analysis, which helps teams spot these reversals. When the overall result and segment results disagree, that's a signal to investigate composition changes before making a shipping decision.

How do you resolve the paradox?

Simpson's paradox isn't something you "fix" statistically. It's something you interpret correctly.

If the treatment genuinely improves the metric within every relevant segment, that's a real improvement. The aggregate decline is a composition effect, not a treatment failure. Whether you should ship depends on whether the composition shift is itself desirable or harmful.
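One way to make "composition effect" concrete is to split the aggregate difference into two terms. The notation is an assumption introduced here: w_s is segment s's share of users and r_s its conversion rate, with superscripts C and T for control and treatment. This is one of several valid splits; weighting the first term by control shares instead gives a slightly different but equally legitimate decomposition.

```latex
\begin{align*}
\Delta &= \sum_s w_s^{T} r_s^{T} - \sum_s w_s^{C} r_s^{C} \\
       &= \underbrace{\sum_s w_s^{T}\,\bigl(r_s^{T} - r_s^{C}\bigr)}_{\text{within-segment effect}}
        + \underbrace{\sum_s \bigl(w_s^{T} - w_s^{C}\bigr)\, r_s^{C}}_{\text{composition effect}}
\end{align*}
```

With the mobile and desktop numbers from the earlier example and a 70% mobile share in treatment, the within-segment effect is 0.7 × 1 + 0.3 × 1 = +1.0 percentage points and the composition effect is 0.2 × 3 − 0.2 × 8 = −1.0 percentage points, which sum to the flat aggregate result.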

If the composition shift is a problem (for example, if driving users from a high-value segment to a low-value one destroys revenue even though per-segment metrics improve), the aggregate metric is telling you something important that the segment metrics miss.

The resolution is always the same: look at both levels and decide which question matters for your decision. Don't default to the aggregate just because it's simpler.


© 2026 Spotify

The Confidence name and logo are registered trademarks of Spotify.