Lesson 10: Segment-level analysis
When you discover that a metric behaves differently across user groups, it's tempting to solve this by creating filtered metrics for each cohort. Don't. Filtered metrics are statistically inefficient and obscure the aggregate story. Segment analysis gives you the diagnostic power you actually need—without the cost.
The temptation: a metric for every cohort
Imagine you run an experiment and find that new users respond very differently from power users. The aggregate result looks muddled. The obvious instinct is to clean this up at the metric level: define "streams for new users" and "streams for power users" as separate primary metrics so each can be evaluated cleanly.
This feels rigorous. It is actually a mistake.
When you filter a metric to a subpopulation, you reduce your sample size. A smaller sample means higher variance in your estimate. Higher variance means lower statistical power—you need a larger true effect to reliably detect it, and your confidence intervals get wider. You are not gaining precision; you are trading away sensitivity.
Worse, you lose the aggregate story entirely. A change that helps new users and hurts power users will look like a success if you are only watching the new user metric. The trade-off is invisible.
A team notices that a new onboarding flow lifts 7-day streams for new users. They're tempted to make "new user streams" the primary metric for future onboarding experiments. But new users are 15% of the experiment population. Filtering to that 15% widens the standard error by a factor of about 2.6 (√(1/0.15)), so their experiments are now underpowered for the same effect sizes, and they can no longer see whether changes that help new users are hurting everyone else.
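To make that cost concrete, here is a minimal sketch of the power calculation. It assumes a two-sided z-test on a difference in means, and it assumes the segment has the same per-user effect and standard deviation as the full population. The sample size, effect, and standard deviation are hypothetical; only the 15% share comes from the example above.

```python
from math import sqrt
from statistics import NormalDist


def power_two_sample(effect, sigma, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided z-test for a difference in means."""
    z = NormalDist()
    se = sigma * sqrt(2.0 / n_per_arm)      # standard error of the difference
    z_crit = z.inv_cdf(1 - alpha / 2)       # critical value, about 1.96 at alpha=0.05
    shift = effect / se                     # true effect in standard-error units
    # Probability of landing beyond either critical value.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Hypothetical experiment: all values below are illustrative assumptions.
FULL_N_PER_ARM = 100_000
SEGMENT_SHARE = 0.15     # new users as a share of the population
EFFECT = 0.1             # true lift in streams per user
SIGMA = 5.0              # per-user standard deviation of streams

power_full = power_two_sample(EFFECT, SIGMA, FULL_N_PER_ARM)
power_segment = power_two_sample(EFFECT, SIGMA, round(FULL_N_PER_ARM * SEGMENT_SHARE))

print(f"power on all users:      {power_full:.2f}")
print(f"power on the 15% filter: {power_segment:.2f}")
```

With these made-up inputs, power drops from roughly 99% to roughly 41% for the same true effect. The exact numbers depend entirely on the assumed effect and variance, but the direction does not.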
What to do instead: aggregate metric, exploratory segments
The right structure separates two distinct jobs:
The primary metric captures the aggregate effect across all users. It has the full sample size behind it, which maximizes statistical power and captures trade-offs that would cancel or hide in a filtered view.
Segment analysis is how you understand that effect. After you have the aggregate result, you break it down by pre-specified user groups to diagnose what is driving it, where the effect is concentrated, and whether any group is being meaningfully harmed.
This is not a demotion for segment-level thinking—it is a clarification of its role. Segments are diagnostic tools, not primary outcomes. They answer why and for whom, after the primary metric answers whether.
Pre-specify your segments
The risk in exploratory segment analysis is obvious: if you scroll through 20 cuts after seeing the data and report the interesting ones, you will always find something. At a 5% significance threshold across 20 segments, you should expect roughly one spurious significant result even when the treatment has no real effect on anyone.
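The "roughly one spurious result" figure can be checked with a small simulation. This is a sketch under stated assumptions: 20 disjoint, independent segments, normally distributed per-user values, no true effect anywhere (an A/A test), and a simple z-test per segment. All constants are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist

ALPHA = 0.05
N_SEGMENTS = 20
USERS_PER_ARM = 100     # hypothetical users per arm in each segment
N_EXPERIMENTS = 300     # number of simulated A/A experiments


def mean_and_variance(values):
    """Sample mean and unbiased sample variance."""
    n = len(values)
    m = sum(values) / n
    return m, sum((v - m) ** 2 for v in values) / (n - 1)


def two_sided_p_value(treatment, control):
    """p-value of a z-test for a difference in means."""
    mt, vt = mean_and_variance(treatment)
    mc, vc = mean_and_variance(control)
    se = sqrt(vt / len(treatment) + vc / len(control))
    z = (mt - mc) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positives_in_one_experiment(rng):
    """Count segments that look significant when nothing is real."""
    hits = 0
    for _ in range(N_SEGMENTS):
        # Both arms are drawn from the same distribution: no treatment effect.
        treatment = [rng.gauss(10.0, 3.0) for _ in range(USERS_PER_ARM)]
        control = [rng.gauss(10.0, 3.0) for _ in range(USERS_PER_ARM)]
        if two_sided_p_value(treatment, control) < ALPHA:
            hits += 1
    return hits


rng = random.Random(0)
counts = [false_positives_in_one_experiment(rng) for _ in range(N_EXPERIMENTS)]
avg_false_positives = sum(counts) / N_EXPERIMENTS
share_with_any = sum(c > 0 for c in counts) / N_EXPERIMENTS

print(f"average spurious 'wins' per experiment: {avg_false_positives:.2f}")
print(f"experiments with at least one:          {share_with_any:.0%}")
```

Real segments overlap and are correlated, so the numbers in practice will differ, but the simulation shows why an interesting cut found by scrolling is weak evidence on its own.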
The fix is to decide which segments you will examine before you look at results. Pre-specified comparisons carry full statistical weight because the set of tests was fixed before anyone saw the data. Post-hoc ones are leads to follow up, not conclusions to act on.
The most practical way to do this is a standard segment set—a fixed list your team runs after every experiment. Good candidates are segments where, if you found a strong effect, it would change what you do:
- New versus returning users—onboarding dynamics often differ sharply
- Free versus premium—what works for one tier may not work for the other
- Mobile versus desktop—platform differences can reflect implementation quality as much as concept quality
- Engagement tier—power users and casual users can respond very differently
A streaming platform runs every experiment with a standard breakdown: new users (under 30 days), returning users, free tier, premium tier. A new recommendation algorithm shows a neutral aggregate result. The standard breakdown reveals +8% streams for new users and −3% for premium users—effects that cancel in aggregate. Because these segments were pre-specified, the finding carries full statistical weight and the team knows they have a real trade-off to resolve before shipping.
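A standard breakdown like the one in this example can be sketched as a fixed list of segment keys applied to every experiment. The row format, field names, and toy data below are hypothetical, chosen so that opposite segment effects cancel in aggregate. The sketch reports point estimates only; a real analysis would add confidence intervals for each cut.

```python
from collections import defaultdict

# Hypothetical pre-specified segment set, fixed before any results are seen.
STANDARD_SEGMENTS = ("tenure", "tier")

# Toy data: one row per user. Values are illustrative, not real results.
ROWS = [
    {"arm": "control",   "tenure": "new",       "tier": "free",    "streams": 10},
    {"arm": "control",   "tenure": "new",       "tier": "free",    "streams": 10},
    {"arm": "control",   "tenure": "returning", "tier": "premium", "streams": 40},
    {"arm": "control",   "tenure": "returning", "tier": "premium", "streams": 40},
    {"arm": "treatment", "tenure": "new",       "tier": "free",    "streams": 12},
    {"arm": "treatment", "tenure": "new",       "tier": "free",    "streams": 12},
    {"arm": "treatment", "tenure": "returning", "tier": "premium", "streams": 38},
    {"arm": "treatment", "tenure": "returning", "tier": "premium", "streams": 38},
]


def relative_lift(rows):
    """Relative difference in mean streams, treatment versus control."""
    by_arm = defaultdict(list)
    for row in rows:
        by_arm[row["arm"]].append(row["streams"])
    control = sum(by_arm["control"]) / len(by_arm["control"])
    treatment = sum(by_arm["treatment"]) / len(by_arm["treatment"])
    return (treatment - control) / control


def standard_breakdown(rows):
    """Aggregate lift first, then the same fixed cuts every time."""
    report = {"all users": relative_lift(rows)}
    for key in STANDARD_SEGMENTS:
        for value in sorted({row[key] for row in rows}):
            subset = [row for row in rows if row[key] == value]
            report[f"{key}={value}"] = relative_lift(subset)
    return report


report = standard_breakdown(ROWS)
for name, lift in report.items():
    print(f"{name:<20} {lift:+.1%}")
```

Keeping the segment list in one constant, rather than choosing cuts per experiment, is what makes the breakdown pre-specified in practice.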
The exception: guardrail metrics
There is one legitimate use of cohort-specific metrics: guardrails. A guardrail metric is not a success criterion—it is a constraint. "Do not hurt new users" is a reasonable guardrail even if new user streams are not your primary metric.
The distinction matters:
- Primary metric (filtered): bad. Underpowered, loses aggregate signal.
- Guardrail metric (filtered): fine. You are not trying to optimize it, just checking that you haven't broken something for a specific group.
Post-hoc segment findings—ones that emerged from exploring the data rather than pre-specified analysis—should be treated as hypotheses to confirm in a follow-up experiment. The most useful thing a surprising post-hoc finding can do is point you toward the right question for your next experiment, not justify a ship decision on its own.
Why is filtering your primary metric to a subpopulation (e.g., 'streams for new users only') a problem?
You run an experiment and notice post-hoc that users in one city show a surprisingly large positive effect. What should you do?
What is the appropriate role for cohort-specific metrics (e.g., 'new user streams') in an experiment?
Notes for nerds
The statistical cost of filtering is a direct consequence of the variance of a sample mean. If your full experiment population has N users, the variance of your effect estimate scales as 1/N. If you filter to a subpopulation of size n < N with similar per-user variance, the variance scales as 1/n, larger by a factor of N/n, so the standard error is larger by a factor of √(N/n). To recover the same power you had with the full population, you need a true effect size that is √(N/n) times as large.
For a segment that is 20% of your population, you need an effect roughly 2.2× larger to detect it with the same reliability as your aggregate metric. Most real product effects don't come that much stronger in subpopulations than in aggregate—so you are often just running an underpowered test while thinking you've done something more targeted.
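The √(N/n) multiplier is easy to tabulate. A minimal sketch, assuming per-user variance is the same inside the segment as in the full population; the function name and the list of shares are illustrative.

```python
from math import sqrt


def required_effect_multiplier(segment_share):
    """How much larger the true effect must be to keep the same power.

    segment_share is n / N, the segment's fraction of the population.
    """
    if not 0 < segment_share <= 1:
        raise ValueError("segment_share must be in (0, 1]")
    return sqrt(1 / segment_share)


for share in (1.0, 0.5, 0.2, 0.15, 0.05):
    multiplier = required_effect_multiplier(share)
    print(f"{share:>4.0%} of users -> effect must be {multiplier:.2f}x as large")
```

The multiplier grows slowly at first and then quickly: halving the population costs about 1.4x, while a 5% segment needs an effect about 4.5x as large.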
The multiple comparisons problem compounds this: the familywise error rate for k independent tests at significance level α is 1 − (1 − α)^k. At α = 0.05 and k = 20 segments, that is roughly 0.64, a 64% chance of at least one false positive even when nothing is real.
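The familywise error rate formula can be evaluated directly. A minimal sketch, assuming independent tests as the formula does; the values of k shown are illustrative, with k = 4 matching a four-segment standard set.

```python
def familywise_error_rate(k, alpha=0.05):
    """Chance of at least one false positive across k independent tests."""
    return 1 - (1 - alpha) ** k


for k in (1, 4, 10, 20):
    print(f"k = {k:>2} segments -> FWER = {familywise_error_rate(k):.2f}")
```

Real segments usually overlap, so the tests are not independent and the true rate will differ from this formula, but it is a useful first approximation of how quickly the risk grows with the number of cuts.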