Segment analysis breaks down experiment results by user subgroup to detect heterogeneous treatment effects: cases where a change helps some groups while hurting or leaving others unchanged, even though the overall result looks positive, negative, or flat.
A single average treatment effect can hide important variation. A feature might improve engagement for mobile users and degrade it for desktop users, producing a net-positive overall result that masks a real regression for one platform. Segment analysis surfaces these patterns.
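To make that concrete, here is a minimal sketch of the computation on synthetic data. The column names, segment split, and effect sizes are invented for illustration; this is not Confidence's schema or internals:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical experiment: mobile is the larger segment (80%), and the
# feature lifts mobile engagement by +3pp while cutting desktop by -4pp.
platform = rng.choice(["mobile", "desktop"], size=n, p=[0.8, 0.2])
group = rng.choice(["treatment", "control"], size=n)
base = np.where(platform == "mobile", 0.30, 0.40)
lift = np.where(platform == "mobile", 0.03, -0.04)
engaged = rng.random(n) < base + np.where(group == "treatment", lift, 0.0)
df = pd.DataFrame({"platform": platform, "group": group, "engaged": engaged})

# The overall average treatment effect looks positive...
overall = (df.loc[df.group == "treatment", "engaged"].mean()
           - df.loc[df.group == "control", "engaged"].mean())
print(f"overall effect: {overall:+.3f}")

# ...but the per-segment breakdown reveals the desktop regression.
by_segment = (df.groupby(["platform", "group"])["engaged"].mean()
                .unstack("group"))
by_segment["effect"] = by_segment["treatment"] - by_segment["control"]
print(by_segment)
```

The expected overall effect here is 0.8 × 0.03 + 0.2 × (−0.04) ≈ +1.6pp: net positive, while one in five users is worse off.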
Why do treatment effects vary across segments?
Users interact with products in different contexts: different devices, different usage frequencies, different stages of their lifecycle, different markets. A change that simplifies navigation might help new users who are still learning the product while doing nothing for power users who already have muscle memory. A pricing experiment might lift conversion in one country and suppress it in another where the price point hits a cultural threshold.
These aren't edge cases. At Spotify, where experiments reach users across 186 markets on every major platform, heterogeneous effects are common. A change that looks uniformly positive in the aggregate might be carried entirely by one large segment while being neutral or negative elsewhere.
How do you run segment analysis without inflating false positives?
The risk with segment analysis is the garden of forking paths. If you slice results by 20 segments, you'll find "significant" differences by chance alone. At a 5% significance level, you'd expect one false positive for every 20 segments you check.
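A quick simulation makes the inflation concrete. The sketch below assumes 20 independent segments under a true null effect; with α = 0.05, the chance of at least one spurious "significant" segment is 1 − 0.95²⁰ ≈ 64%:

```python
import numpy as np

rng = np.random.default_rng(0)
n_experiments, n_segments, alpha = 10_000, 20, 0.05

# Simulate A/A experiments: the treatment does nothing, so every
# segment-level p-value is uniform on [0, 1] under the null.
p_values = rng.random((n_experiments, n_segments))

# Fraction of experiments with at least one "significant" segment.
any_hit = (p_values < alpha).any(axis=1).mean()
print(f"P(>=1 false positive across {n_segments} segments): {any_hit:.2f}")
# Matches the analytic value 1 - (1 - alpha)**n_segments ≈ 0.64
```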
A few practices keep segment analysis honest:
Pre-register your segments. Decide which segments to examine before looking at results. "Mobile vs. desktop" and "new vs. returning users" are reasonable choices for most product experiments. Adding segments after seeing results is a recipe for false discoveries.
Adjust for multiple comparisons. If you're examining five segments, apply a multiple testing correction such as Bonferroni or Benjamini-Hochberg to the segment-level results (see the sketch after this list). Confidence applies multiple testing corrections automatically when running analysis across multiple metrics, and the same principle applies to segments.
Treat segment results as hypotheses, not conclusions. A strong segment effect in one experiment is a signal worth investigating. It becomes evidence when it replicates in a follow-up experiment designed to test that specific segment.
Use segment analysis to explain, not to discover. The most productive use is checking whether a known overall effect is consistent across important segments, not mining for any segment where something looks interesting.
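As a sketch of what such a correction looks like in practice, here is the statsmodels implementation applied to invented p-values for five hypothetical segments; this is not Confidence's internal mechanism:

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical raw p-values from five pre-registered segments.
segment_p = [0.012, 0.048, 0.003, 0.210, 0.530]

# Bonferroni: conservative, controls the familywise error rate.
reject_bonf, p_bonf, _, _ = multipletests(segment_p, alpha=0.05,
                                          method="bonferroni")

# Benjamini-Hochberg: less conservative, controls the false discovery rate.
reject_bh, p_bh, _, _ = multipletests(segment_p, alpha=0.05,
                                      method="fdr_bh")

for p, b, h in zip(segment_p, reject_bonf, reject_bh):
    print(f"raw p={p:.3f}  bonferroni={'sig' if b else 'ns'}  "
          f"BH={'sig' if h else 'ns'}")
```

The two corrections embody different trade-offs: Bonferroni guards against any false positive at the cost of power, while Benjamini-Hochberg tolerates a controlled fraction of false discoveries, which is often the better fit when segment results are treated as hypotheses to replicate rather than conclusions.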
What segments are most useful to examine?
The segments worth checking depend on the product, but a few are almost always informative:
Platform or device type (mobile, desktop, tablet) captures differences in UI, screen size, and usage patterns. New vs. established users separates learning effects from habit effects. Geographic market matters for pricing, content, and regulatory differences. Subscription tier or user value helps assess whether the change affects your most important users differently.
Confidence supports segment breakdowns as part of its automated analysis, letting teams examine pre-defined segments without manual data pulls.