Lesson 11: The winner's curse

You already know about two ways an experiment can go wrong: a false positive (Type I error) and a false negative (Type II error). There are two more failure modes that only appear inside statistically significant results. They are subtler, and they tend to go unnoticed precisely because the result looks like a success.

This lesson builds on statistical power and significance. If you need a refresher on those concepts, see the Hypothesis testing course and the Sample size calculation course before continuing. For the formal statistical definitions, see Gelman & Carlin (2014).

The winner's curse

Imagine running many experiments on a small true effect with insufficient sample size. Most experiments will not reach statistical significance. The signal is too weak relative to the noise. The ones that do reach significance got there by chance: they happened to draw an unusually large estimated effect. Only extreme estimates cross the significance threshold.

This is the winner's curse. Your significance filter does not just select for real effects; it selects for large-looking estimates. When an experiment is underpowered, the results you actually see and act on are not a representative sample of all possible results. They are the lucky outliers.
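The filtering effect described above can be seen in a small Monte Carlo sketch. It assumes each experiment's estimate is drawn from a normal distribution centred on the true effect; the true effect of 0.2, the standard error of 0.35, and the function name are illustrative choices, not values taken from a real experiment.

```python
import random
import statistics

def simulate_winners_curse(true_effect=0.2, se=0.35, n_experiments=100_000, seed=0):
    """Simulate many underpowered experiments and compare all estimates
    with the subset that reaches significance in the positive direction."""
    rng = random.Random(seed)
    threshold = 1.96 * se  # two-sided 5% cutoff, expressed on the estimate scale
    estimates = [rng.gauss(true_effect, se) for _ in range(n_experiments)]
    significant = [e for e in estimates if abs(e) > threshold]
    winners = [e for e in estimates if e > threshold]
    return (
        len(significant) / n_experiments,  # share of experiments that are significant
        statistics.fmean(estimates),       # unfiltered average: close to the truth
        statistics.fmean(winners),         # filtered average: inflated
    )

share_sig, mean_all, mean_winners = simulate_winners_curse()
print(f"significant experiments:        {share_sig:.1%}")
print(f"mean of all estimates:          {mean_all:.2f}")
print(f"mean of significant 'winners':  {mean_winners:.2f}")
```

The unfiltered average should sit near the true effect, while the average of the significant positive results should be several times larger, which is the selection effect this section describes.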

Two new types of errors

If you took the Hypothesis testing course, you might remember the two classic error types: the Type I error (a false positive, controlled by α) and the Type II error (a false negative, controlled by power). The winner's curse produces two more, named by statisticians Andrew Gelman and John Carlin, that only appear inside statistically significant results:

Type M error (Magnitude) is the tendency for significant results to overestimate the true effect size. The exaggeration ratio tells you by how much: it is the average significant estimate divided by the true effect. On average, a significant result from an underpowered experiment reports an effect much larger than what is actually there.

Type S error (Sign) is the risk that a significant result points in the entirely wrong direction. It is rare except in severely underpowered experiments. In practice, this means either aborting an experiment even though the treatment had a positive effect, or shipping a treatment even though it had a harmful effect.

Both errors are conditional on significance. They describe what happens inside the pool of experiments that find significant results, not the full distribution of experiments.
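Under a normal approximation, these quantities can be written explicitly. The block below is a sketch under stated assumptions, not a formulation taken from the lesson: the estimate (written \hat\delta) is assumed to be normally distributed around a positive true effect δ with standard error SE, the test is two-sided at level α, and the symbols c and a are introduced here for convenience.

```latex
\begin{aligned}
c &= z_{1-\alpha/2}\,\mathrm{SE}, \qquad a = \frac{c-\delta}{\mathrm{SE}} \\
\text{Power (correct direction)} &= P(\hat\delta > c) = 1-\Phi(a) \\
\text{Significant, wrong direction} &= P(\hat\delta < -c) = \Phi\!\left(\frac{-c-\delta}{\mathrm{SE}}\right) \\
\text{Exaggeration ratio} &= \frac{E[\hat\delta \mid \hat\delta > c]}{\delta}
  = 1 + \frac{\mathrm{SE}}{\delta}\cdot\frac{\varphi(a)}{1-\Phi(a)}
\end{aligned}
```

Here Φ and φ are the standard normal distribution function and density. Gelman & Carlin (2014) condition on significance in either direction and use the absolute value of the estimate, so their figures can differ slightly from these one-sided versions.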

Explore both risks

The simulator below shows the sampling distribution of effect estimates, for a given true effect and sample size. The green area represents the fraction of all experiments that would find a significant result in the correct direction (your statistical power). The red area is the fraction that would find a significant result pointing the wrong way (Type S error). The stat cards update as you move the sliders.

The simulator assumes a metric that improves when it increases, and a true effect in the direction of improvement. The Type S risk shown is therefore the risk of finding a significant deterioration and aborting an experiment that is actually positive.

[Interactive simulator. Stat cards: Power: 8.3% · Type S: 0.56% · Type M factor: 4.19×. Legend: Significant, correct direction (power) · Significant, wrong direction (Type S) · True effect (δ)]

With a true effect of 0.20 and n = 100, there is an 8.3% chance of a significant result in the correct direction. Conditional on a significant positive result, the average observed effect is 0.84, a 4.19× overestimate of the true effect. The probability of a significant result in the wrong direction (Type S) is 0.56%.
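The figures quoted above can be approximately reproduced with a few lines of standard-library Python. This is a sketch under a normal approximation, not the simulator's actual code. The standard error is an assumption: it is back-solved from the significance threshold of 0.68 that this lesson quotes for the same settings, so results agree only up to rounding.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def significance_risks(true_effect, se, alpha=0.05):
    """Return (power, wrong_direction, type_m) for a positive true effect,
    assuming a normally distributed estimate and a two-sided test."""
    threshold = Z.inv_cdf(1 - alpha / 2) * se  # smallest significant estimate
    a = (threshold - true_effect) / se         # cutoff in standardized units
    power = 1 - Z.cdf(a)                       # significant, correct direction
    wrong_direction = Z.cdf((-threshold - true_effect) / se)
    # Mean of a normal distribution truncated to values above the cutoff.
    mean_significant = true_effect + se * Z.pdf(a) / power
    return power, wrong_direction, mean_significant / true_effect

SE = 0.68 / 1.96  # assumption: implied by the rounded 0.68 threshold
power, wrong_direction, type_m = significance_risks(true_effect=0.20, se=SE)
print(f"power: {power:.1%}")
print(f"significant, wrong direction: {wrong_direction:.2%}")
print(f"Type M factor: {type_m:.2f}x")
```

With these inputs the three values should land close to the 8.3%, 0.56%, and 4.19× shown in the stat cards.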

Notice the key patterns:

  • Lower true effect (move the left slider down): Type M overestimation grows sharply. At small true effects, a significant result might report an effect that is 5× or 10× the truth.
  • Smaller sample size (move the right slider toward Small): the distribution widens, the green area shrinks (lower power), the red area grows (higher Type S risk), and the overestimation gets worse.
  • Type S error stays small unless the experiment is extremely underpowered. It only becomes a meaningful concern at very low power with tiny true effects.
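The patterns in the list above can also be checked numerically without the sliders. The sketch below sweeps a few sample sizes and true effects. It assumes a two-arm comparison of means with equal arm sizes and an outcome standard deviation of 1.0, which is a hypothetical design and not necessarily the one behind the simulator.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()
Z_CRIT = Z.inv_cdf(0.975)  # two-sided 5% test

def risks(true_effect, se):
    """(power, P(significant with wrong sign), exaggeration ratio), true_effect > 0."""
    a = Z_CRIT - true_effect / se
    power = 1 - Z.cdf(a)
    wrong_sign = Z.cdf(-Z_CRIT - true_effect / se)
    type_m = 1 + (se / true_effect) * Z.pdf(a) / power
    return power, wrong_sign, type_m

def standard_error(n_per_arm, sigma=1.0):
    # Assumed design: difference in means between two equal arms, outcome SD = sigma.
    return sigma * sqrt(2 / n_per_arm)

rows = []
for n in (50, 200, 800, 3200):
    for effect in (0.05, 0.10, 0.20):
        power, wrong_sign, type_m = risks(effect, standard_error(n))
        rows.append((n, effect, power, wrong_sign, type_m))
        print(f"n={n:>5}  effect={effect:.2f}  power={power:6.1%}  "
              f"wrong sign={wrong_sign:7.3%}  type M={type_m:5.2f}x")
```

Reading down the output, the exaggeration ratio should grow as the effect or the sample size shrinks, while the wrong-sign probability should stay small except in the smallest, weakest configurations.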

Planned power versus actual power

The simulator above takes the true effect as given. In practice, you size an experiment around a minimum detectable effect (MDE), the smallest effect worth caring about, and that choice, together with the sample size, determines your planned power. If the true effect is smaller than the MDE, actual power is lower than planned and Type M inflation is worse than you expected. If the true effect is larger than the MDE, actual power exceeds your target and the overestimation shrinks.

The simulator below separates the two. Set the MDE and the true effect independently to see how actual power diverges from planned, and how the Type M factor changes as the gap grows.

[Interactive simulator. Stat cards: Planned power: 30.3% · Actual power: 8.3% · Type M factor: 4.19×. Legend: Significant · True effect (δ) · MDE]

The MDE is set to 0.50, giving a planned power of 30.3%. The true effect is 0.20, so the actual power is 8.3%. Observed effects need to exceed 0.68 to reach significance, so the average significant result (in the green area) is 0.84, a 4.19× overestimate of the true effect.

When the true effect equals the MDE, actual and planned power match. Slide the true effect below the MDE and actual power drops: the significant results you do get come from the lucky tail of the distribution, so they overestimate the truth by more.
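The gap between planned and actual power can be sketched in code as well. The example below assumes a normal approximation and a standard error back-solved from the 0.68 significance threshold quoted above. The constant names are illustrative, and the results match the stat cards only up to rounding.

```python
from statistics import NormalDist

Z = NormalDist()
Z_CRIT = Z.inv_cdf(0.975)  # two-sided 5% test

def power(effect, se):
    """Probability of a significant result in the direction of a positive effect."""
    return 1 - Z.cdf(Z_CRIT - effect / se)

def type_m(true_effect, se):
    """Average significant positive estimate divided by the true effect."""
    a = Z_CRIT - true_effect / se
    return 1 + (se / true_effect) * Z.pdf(a) / power(true_effect, se)

SE = 0.68 / 1.96  # assumption: implied by the rounded 0.68 threshold
MDE = 0.50        # the effect the experiment was planned around

planned = power(MDE, SE)
print(f"planned power (effect = MDE): {planned:.1%}")
for true_effect in (0.10, 0.20, 0.50, 0.80):
    print(f"true effect {true_effect:.2f}: actual power {power(true_effect, SE):6.1%}, "
          f"Type M {type_m(true_effect, SE):.2f}x")
actual = power(0.20, SE)
```

Planned power depends only on the MDE, while actual power and the Type M factor follow whatever the true effect turns out to be.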

The fundamental problem: the true effect is never known

Both simulators above ask you to specify the true effect. In real life, you never know this. You run the experiment precisely because you do not know whether, or by how much, the treatment works.

This single fact changes how to use the concepts above. Type M and Type S are design-time sensitivity tools, not post-hoc diagnostics:

  • Before you run an experiment, you can ask: if the true effect were around X%, what would the exaggeration ratio be? That question is useful for planning sample size and setting expectations.
  • After you see a significant result, you cannot work backwards to say whether your specific estimate is inflated. A reported effect of +3% could be a Type M-inflated estimate of a true +0.6% effect, or a slight underestimate of a true +4% effect. You cannot tell from the data alone.
  • Even at the conventional 80% power, if the true effect exactly equals your MDE, the expected overestimation is around 13%.
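As a design-time check, the expected exaggeration can be tabulated as a function of power alone. The sketch below assumes a normally distributed estimate, a two-sided 5% test, and a true effect exactly equal to the effect the power figure refers to. The function name is hypothetical.

```python
from statistics import NormalDist

Z = NormalDist()

def exaggeration_at_power(power, alpha=0.05):
    """Expected ratio of the average significant positive estimate to the
    true effect, when the true effect yields exactly this power.
    Requires power > alpha / 2 so the implied effect is positive."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    snr = z_crit + Z.inv_cdf(power)  # true effect / SE implied by this power
    a = z_crit - snr                 # standardized distance to the cutoff
    return 1 + Z.pdf(a) / (power * snr)

for p in (0.10, 0.30, 0.50, 0.80, 0.90):
    print(f"power {p:.0%}: exaggeration ratio {exaggeration_at_power(p):.2f}x")

ratio_80 = exaggeration_at_power(0.80)
```

Under these assumptions the ratio at 80% power comes out at roughly 12 to 13% above the true effect, in line with the figure quoted in the list above, and it climbs quickly as power falls.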

The picture is not all bad. Type S errors are rare outside of extremely underpowered settings, so directional decisions from experiments are generally reliable: if your experiment says the treatment helps, it very likely does. The more practical concern is Type M, which affects effect size estimates even in reasonably powered experiments. If you need an accurate magnitude, the best remedies are running a larger experiment, or following up a significant A/B test with a gradual rollout, which both replicates the finding and yields a larger-sample estimate of the true effect size under real traffic conditions.

Notes for nerds

The winner's curse is a risk, not a certainty

A significant result does not only come from underpowered experiments detecting small effects. It can just as well come from a well-powered experiment detecting a large effect, and in that case the estimate is reliable. The winner's curse applies conditional on being underpowered relative to the true effect. If the true effect is large and the experiment was well-powered to detect it, the significant result is at most mildly inflated.

The limits of "just be powered"

A widely quoted piece of advice is: don't trust results from underpowered experiments. This is sound, but it conceals an important subtlety.

Power is always defined relative to an assumed effect size. An experiment is "80% powered" to detect a 2% lift, or "30% powered" to detect a 1% lift. The same experiment can be well-powered for a large effect and poorly powered for a small one.
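The dependence of power on the assumed effect can be illustrated with a short calculation. The sketch below assumes a normal approximation and a two-sided 5% test. It fixes the standard error so that the experiment has 80% power for a 2% lift, then evaluates the same experiment against other lifts, all of which are illustrative values.

```python
from statistics import NormalDist

Z = NormalDist()
Z_CRIT = Z.inv_cdf(0.975)  # two-sided 5% test

def power_for(lift, se):
    """Power to detect a positive lift of the given size with standard error se."""
    return 1 - Z.cdf(Z_CRIT - lift / se)

# Standard error at which a 2% lift is detected with exactly 80% power.
se = 0.02 / (Z_CRIT + Z.inv_cdf(0.80))

for lift in (0.005, 0.01, 0.02, 0.05):
    print(f"lift {lift:.1%}: power {power_for(lift, se):.0%}")

power_1pct = power_for(0.01, se)
```

The same experiment should show roughly 30% power for a 1% lift and near-certain detection for a 5% lift, which is why a power figure means little without the effect size it refers to.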

The question is not only "is this experiment powered?" but "powered for what effect, and how plausible is that effect?" An experiment designed to detect a 5% effect when the realistic effect is 0.5% is not trustworthy even if it reaches significance: it would require massive overestimation to do so. Conversely, a nominally underpowered experiment that finds a large effect might be reporting something accurate if the true effect really is large.

Type S flips with the direction of the true effect

The simulator fixes the true effect as positive (an improvement). When that is the case, a Type S error means detecting a significant deterioration on a treatment that actually helps, and aborting prematurely.

If the true effect were negative (a harmful treatment), the roles reverse. Now a Type S error means detecting a significant improvement and shipping something that is actually harmful. The same rare-but-severe warning applies, just in the other direction.

Empirical shrinkage

Some tech companies with strong experimentation programs use historical data to work around the unknown-true-effect problem. If you have run thousands of experiments and know the distribution of typical effect sizes for your product area, you can use that historical distribution as a prior and shrink new estimates toward it, pulling inflated estimates back toward more plausible values. It is statistically elegant, but it requires a large archive of past experiments with consistent metrics, and confidence that the historical distribution still describes the current product and user base.
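One simple form of this idea is normal-normal shrinkage, sketched below. It is an illustration under strong assumptions, not a description of any particular company's method: the historical effects are summarised by a normal prior, the new estimate has a known standard error, and all numbers and names are hypothetical.

```python
from math import sqrt

def shrink_estimate(estimate, se, prior_mean, prior_sd):
    """Posterior mean and standard deviation for a normal prior combined
    with a normally distributed estimate (precision-weighted average)."""
    weight = prior_sd**2 / (prior_sd**2 + se**2)  # weight given to the new data
    posterior_mean = prior_mean + weight * (estimate - prior_mean)
    posterior_sd = sqrt(weight) * se
    return posterior_mean, posterior_sd

# Hypothetical inputs: past experiments in this area centre on 0.0 with SD 0.2;
# the new experiment reports 0.84 with a standard error of 0.35.
shrunk, shrunk_sd = shrink_estimate(estimate=0.84, se=0.35, prior_mean=0.0, prior_sd=0.2)
print(f"raw estimate: 0.84, shrunk estimate: {shrunk:.2f} (posterior SD {shrunk_sd:.2f})")
```

The noisier the new estimate is relative to the spread of historical effects, the more strongly it is pulled toward the historical mean. A tight prior that no longer describes the product would pull estimates toward the wrong place, which is the caveat raised above.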