Lesson 10: Segment-level analysis

The temptation: a metric for every cohort

Imagine you run an experiment and find that new users respond very differently from power users. The aggregate result looks muddled. The obvious instinct is to clean this up at the metric level: define "streams for new users" and "streams for power users" as separate primary metrics so each can be evaluated cleanly.

This feels rigorous. It is actually a mistake.

When you filter a metric to a subpopulation, you reduce your sample size. A smaller sample means higher variance in your estimate. Higher variance means lower statistical power—you need a larger true effect to reliably detect it, and your confidence intervals get wider. You are not gaining precision; you are trading away sensitivity.
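To make the cost concrete, here is a minimal sketch of the power of a two-sided z-test on a difference in means, first with the full population and then with a segment one-fifth the size. The effect size, standard deviation, and sample sizes are made-up illustration values, and the function name power_two_sample is hypothetical. It assumes equal arm sizes and a known standard deviation shared by both arms.

```python
from math import sqrt
from statistics import NormalDist


def power_two_sample(effect, sd, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided z-test on a difference in means.

    Assumes equal arm sizes and a known, shared standard deviation.
    """
    std_normal = NormalDist()
    se = sd * sqrt(2 / n_per_arm)           # standard error of the difference
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = effect / se                     # true effect in standard-error units
    # Probability the test statistic lands in either rejection region.
    return (1 - std_normal.cdf(z_crit - shift)) + std_normal.cdf(-z_crit - shift)


# Illustrative numbers only: same true effect, full population vs. a 20% segment.
full_power = power_two_sample(effect=0.1, sd=2.0, n_per_arm=10_000)
segment_power = power_two_sample(effect=0.1, sd=2.0, n_per_arm=2_000)
```

With these illustrative inputs, power drops from roughly 0.94 to roughly 0.35: the same true effect goes from almost always detected to missed about two times in three.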

Worse, you lose the aggregate story entirely. A change that helps new users and hurts power users will look like a success if you are only watching the new user metric. The trade-off is invisible.

What to do instead: aggregate metric, exploratory segments

The right structure separates two distinct jobs:

The primary metric captures the aggregate effect across all users. It has the full sample size behind it, which maximizes statistical power, and it reflects the net result of any trade-offs between groups, which a filtered view would hide.

Segment analysis is how you understand that effect. After you have the aggregate result, you break it down by pre-specified user groups to diagnose what is driving it, where the effect is concentrated, and whether any group is being meaningfully harmed.

This is not a demotion for segment-level thinking—it is a clarification of its role. Segments are diagnostic tools, not primary outcomes. They answer why and for whom, after the primary metric answers whether.

Pre-specify your segments

The risk in exploratory segment analysis is obvious: if you scroll through 20 cuts after seeing the data and report the interesting ones, you will usually find something. At a 5% significance threshold across 20 segments, you should expect roughly one spurious significant result on average, even when the treatment has no real effect on anyone.
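A quick simulation makes this concrete. The sketch below assumes each segment's test statistic is an independent standard normal draw when the treatment does nothing, which is an idealization, since real segments often overlap. The function name and the number of simulated experiments are arbitrary choices for illustration.

```python
import random
from statistics import NormalDist


def simulate_null_segments(n_segments=20, alpha=0.05, n_experiments=10_000, seed=0):
    """Simulate experiments with no true effect in any segment.

    Returns (average false positives per experiment,
             share of experiments with at least one false positive).
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    total_hits = 0
    experiments_with_hit = 0
    for _ in range(n_experiments):
        # Under the null, each segment's z-statistic is a standard normal draw.
        hits = sum(abs(rng.gauss(0.0, 1.0)) > z_crit for _ in range(n_segments))
        total_hits += hits
        experiments_with_hit += hits > 0
    return total_hits / n_experiments, experiments_with_hit / n_experiments


mean_false_positives, share_with_any = simulate_null_segments()
```

The average should land close to one false positive per experiment, and well over half of the simulated experiments show at least one "significant" segment despite there being nothing to find.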

The fix is to decide which segments you will examine before you look at results. Pre-specified comparisons can be treated as real tests, though the more of them you run, the stricter each one's threshold should be. Post-hoc ones are leads to follow up, not conclusions to act on.

The most practical way to do this is a standard segment set—a fixed list your team runs after every experiment. Good candidates are segments where, if you found a strong effect, it would change what you do:

  • New versus returning users—onboarding dynamics often differ sharply
  • Free versus premium—what works for one tier may not work for the other
  • Mobile versus desktop—platform differences can reflect implementation quality as much as concept quality
  • Engagement tier—power users and casual users can respond very differently
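A standard segment set can be as simple as a fixed list of dimensions and one function that is run over each of them. The sketch below is one way to do that, not a prescribed implementation: the row format, the column names ("platform", "arm", "streams"), and the function name segment_breakdown are all hypothetical, and the interval is a plain normal approximation with no multiple-comparisons adjustment.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

# Fixed before looking at results. In practice this might also include
# tenure, plan, and engagement tier; the toy rows below only carry platform.
STANDARD_DIMENSIONS = ("platform",)


def segment_breakdown(rows, dimension, metric="streams", alpha=0.05):
    """Per-segment lift (treatment mean minus control mean) with a normal-approximation CI.

    Returns {segment_value: (lift, ci_low, ci_high)}.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    by_segment = {}
    for row in rows:
        arms = by_segment.setdefault(row[dimension], {"control": [], "treatment": []})
        arms[row["arm"]].append(row[metric])

    results = {}
    for value, arms in by_segment.items():
        control, treatment = arms["control"], arms["treatment"]
        lift = mean(treatment) - mean(control)
        # Unpooled standard error of a difference in means.
        se = sqrt(variance(treatment) / len(treatment) + variance(control) / len(control))
        results[value] = (lift, lift - z_crit * se, lift + z_crit * se)
    return results


# Toy data, invented purely to show the call shape.
toy = {
    ("mobile", "control"): [1.0, 2.0, 3.0],
    ("mobile", "treatment"): [2.0, 3.0, 4.0],
    ("desktop", "control"): [5.0, 5.0, 6.0],
    ("desktop", "treatment"): [5.0, 5.0, 6.0],
}
rows = [
    {"platform": platform, "arm": arm, "streams": value}
    for (platform, arm), values in toy.items()
    for value in values
]
report = {dim: segment_breakdown(rows, dim) for dim in STANDARD_DIMENSIONS}
```

Keeping the dimension list in one constant means every experiment gets the same cuts, which is the point of pre-specification.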

The exception: guardrail metrics

There is one legitimate use of cohort-specific metrics: guardrails. A guardrail metric is not a success criterion—it is a constraint. "Do not hurt new users" is a reasonable guardrail even if new user streams are not your primary metric.

The distinction matters:

  • Primary metric (filtered): bad. Underpowered, loses aggregate signal.
  • Guardrail metric (filtered): fine. You are not trying to optimize it, just checking that you haven't broken something for a specific group.
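One possible way to phrase a guardrail as a check, offered as a sketch rather than as the lesson's method: pass only if a one-sided lower confidence bound on the segment's lift stays above a tolerated loss. The function name guardrail_ok and the tolerance value are hypothetical, and the lift and standard error are assumed to come from whatever analysis you already run.

```python
from statistics import NormalDist


def guardrail_ok(lift, se, tolerance, alpha=0.05):
    """True if the one-sided (1 - alpha) lower bound on the lift is above -tolerance.

    lift: estimated treatment-minus-control difference for the guarded group
    se: standard error of that estimate
    tolerance: largest loss you are willing to accept (a positive number)
    """
    lower_bound = lift - NormalDist().inv_cdf(1 - alpha) * se
    return lower_bound > -tolerance


# Illustrative values only.
flat_and_precise = guardrail_ok(lift=0.0, se=0.01, tolerance=0.05)
likely_harmful = guardrail_ok(lift=-0.04, se=0.02, tolerance=0.05)
```

This framing is strict: a small, noisy segment can fail simply because its estimate is imprecise. A looser alternative is to alert only when the harm itself is statistically significant; which one fits depends on how costly a missed regression is for that group.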

Notes for nerds

The statistical cost of filtering is a direct consequence of the variance of a sample mean. If your full experiment population has N users, the variance of your effect estimate scales as 1/N. If you filter to a subpopulation of size n < N, the variance scales as 1/n, larger by a factor of N/n (assuming per-user variance in the segment is similar to the population's). The standard error therefore grows by √(N/n), so to recover the same power you had with the full population, you need a true effect size that is √(N/n) times as large.

For a segment that is 20% of your population, you need an effect roughly 2.2× larger (√5 ≈ 2.24) to detect it with the same reliability as your aggregate metric. Most real product effects are not that much stronger in a subpopulation than in aggregate, so you are often just running an underpowered test while thinking you've done something more targeted.
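The √(N/n) factor is easy to tabulate for other segment sizes. A minimal sketch, assuming per-user variance is the same in the segment as in the full population; the function name effect_inflation is hypothetical.

```python
from math import sqrt


def effect_inflation(segment_share):
    """Factor by which the true effect must grow to keep the same power.

    segment_share is n / N, the segment's fraction of the full population.
    """
    if not 0 < segment_share <= 1:
        raise ValueError("segment_share must be in (0, 1]")
    return sqrt(1 / segment_share)


factors = {share: effect_inflation(share) for share in (0.5, 0.2, 0.1, 0.05)}
```

A 20% segment gives about 2.24, matching the figure above; a 5% segment needs an effect roughly 4.5 times larger.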

The multiple comparisons problem compounds this: the familywise error rate for k independent tests at significance level α is 1 − (1 − α)^k. At α = 0.05 and k = 20 segments, that is roughly 0.64—a 64% chance of at least one false positive even when nothing is real.
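The formula can be checked directly, and inverted to find a per-test threshold that holds the familywise rate at α. The two corrections below, Bonferroni and Šidák, are standard techniques but are not something this lesson prescribes; they are included as a sketch of what "stricter per-segment thresholds" could look like. Both assume the tests are independent or nearly so, and the function names are illustrative.

```python
def familywise_error_rate(alpha, k):
    """Chance of at least one false positive across k independent tests."""
    return 1 - (1 - alpha) ** k


def bonferroni_alpha(alpha, k):
    """Per-test threshold under the Bonferroni correction."""
    return alpha / k


def sidak_alpha(alpha, k):
    """Per-test threshold under the Sidak correction (exact for independent tests)."""
    return 1 - (1 - alpha) ** (1 / k)


uncorrected = familywise_error_rate(0.05, 20)
per_test_bonferroni = bonferroni_alpha(0.05, 20)
per_test_sidak = sidak_alpha(0.05, 20)
corrected = familywise_error_rate(per_test_sidak, 20)
```

For 20 segments, both corrections put the per-segment threshold near 0.0025 instead of 0.05, which is another way of seeing why a long list of segments is expensive: each extra cut raises the bar for all the others.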