Metrics

What is a win rate?

Win rate is the proportion of experiments that produce a statistically significant positive result on their primary success metric. At Spotify, the win rate is approximately 12%. That number surprises people who assume experimentation is about shipping winners. It shouldn't. The real measure of an experimentation program's value is how much the organization learns, not how often it ships.

Spotify's Experiments with Learning (EwL) framework reframes this: while only ~12% of experiments produce a statistically significant improvement, roughly 64% produce a clear learning. The learning rate captures experiments where the team gained an actionable understanding of user behavior, even when the treatment didn't win. A null result with a clear interpretation is a success. An ambiguous result that teaches nothing is a failure, regardless of whether the metric moved.
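
To make the two rates concrete, here is a minimal sketch in Python, assuming hypothetical experiment records with significant_positive and clear_learning flags (the field names are illustrative, not Confidence's actual schema):

```python
from dataclasses import dataclass

@dataclass
class ExperimentResult:
    # Hypothetical record; the fields are illustrative, not a real schema.
    significant_positive: bool  # primary metric moved, stat-sig, right direction
    clear_learning: bool        # team extracted an actionable finding

def win_rate(results: list[ExperimentResult]) -> float:
    """Share of experiments with a stat-sig positive primary result."""
    return sum(r.significant_positive for r in results) / len(results)

def learning_rate(results: list[ExperimentResult]) -> float:
    """Share of experiments that produced a clear, usable finding."""
    return sum(r.clear_learning for r in results) / len(results)

# If every win also counts as a learning, learning_rate >= win_rate;
# the gap between the two is the value a win-rate-only view misses.
```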

Why is a low win rate normal?

A 12% win rate is consistent with what mature experimentation programs report across the industry. The reason is structural: if most of your experiments are producing positive results, you're probably testing ideas that were already obvious. The point of experimentation is to test ideas where the outcome is genuinely uncertain.

Teams that report win rates above 30-40% are often running confirmation tests on changes they've already decided to ship. Those experiments consume bandwidth without generating new information. A healthy experimentation culture tests bold hypotheses, and bold hypotheses fail more often than safe ones. That's the tradeoff.

At Spotify, where 300+ teams run 10,000+ experiments per year, a 12% win rate still means roughly 1,200 validated improvements annually. The volume of learning compounds: each null result narrows the hypothesis space for the next experiment.

How does the learning rate differ from the win rate?

The win rate counts statistical wins. The learning rate counts experiments where the team extracted a usable finding, whether or not the treatment outperformed control. The gap between 12% and 64% represents the majority of experimentation value at Spotify.

A learning happens when the team can answer questions like: which user segment responded differently? Did the metric move in the expected direction but not enough to reach significance? Did the guardrail metrics reveal an unexpected side effect? Was the hypothesis directionally correct but the implementation too timid?
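
As an illustration of how a null can still yield a learning, here is a sketch of a per-segment breakdown, assuming one row per user with hypothetical metric, group, and segment columns (not a real Confidence export):

```python
import numpy as np
import pandas as pd

def segment_lifts(df: pd.DataFrame, metric: str = "metric",
                  group: str = "group", segment: str = "segment") -> pd.DataFrame:
    """Per-segment treatment-vs-control lift with a rough 95% CI.

    Assumes one row per user with a metric value, a group label
    ('treatment'/'control'), and a segment label. Column names are
    illustrative. A CI that excludes zero in one segment but not
    overall is exactly the kind of finding that counts as a learning.
    """
    rows = []
    for seg, sub in df.groupby(segment):
        t = sub.loc[sub[group] == "treatment", metric]
        c = sub.loc[sub[group] == "control", metric]
        lift = t.mean() - c.mean()
        se = np.sqrt(t.var(ddof=1) / len(t) + c.var(ddof=1) / len(c))
        rows.append({"segment": seg, "lift": lift,
                     "ci_low": lift - 1.96 * se, "ci_high": lift + 1.96 * se})
    return pd.DataFrame(rows)
```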

Confidence supports this framing by surfacing metric breakdowns, guardrail results, and segment-level analysis alongside the primary result. The platform treats a well-interpreted null result as a legitimate outcome, not a failure state.

What drives win rate up or down?

Several factors influence win rate, and not all improvements are desirable.

Testing bolder implementations increases win rate. When a team tests the maximum viable version of an idea, the signal is larger and easier to detect. Timid implementations produce ambiguous nulls that don't count as wins or learnings.

Better metric selection matters. If your success metric isn't sensitive to the change you're making, real improvements go undetected. Variance reduction techniques like CUPED can cut the noise in a metric by ~50%, making real effects visible that would otherwise be buried.
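
A minimal CUPED sketch, using each user's pre-experiment value of the metric as the covariate:

```python
import numpy as np

def cuped_adjust(y: np.ndarray, x_pre: np.ndarray) -> np.ndarray:
    """CUPED adjustment: remove the part of the in-experiment metric y
    that is predicted by the pre-experiment covariate x_pre.

    theta = cov(x, y) / var(x) is the variance-minimizing coefficient.
    The adjusted metric has the same expected treatment effect but
    lower variance, so real effects reach significance sooner.
    """
    theta = np.cov(x_pre, y, ddof=1)[0, 1] / np.var(x_pre, ddof=1)
    return y - theta * (x_pre - x_pre.mean())

# Variance drops by a factor of (1 - rho^2), where rho is the correlation
# between x_pre and y; rho of ~0.7 gives the ~50% cut mentioned above.
```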

Higher statistical power helps mechanically: a test with 80% power detects a real effect of the assumed size 80% of the time, while a test with 30% power misses it more often than not. Underpowered experiments are the single largest source of wasted experiment bandwidth.
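
For intuition, here is a sketch of the power arithmetic for a two-sided two-sample z-test under a normal approximation (a textbook formula, not any particular platform's implementation):

```python
from math import sqrt
from scipy.stats import norm

def power_two_sample(effect: float, sd: float, n_per_arm: int,
                     alpha: float = 0.05) -> float:
    """Approximate power of a two-sided two-sample z-test.

    effect: true difference in means you hope to detect
    sd: per-user standard deviation of the metric
    n_per_arm: users in each of treatment and control
    """
    se = sd * sqrt(2.0 / n_per_arm)               # std. error of the difference
    z_crit = norm.ppf(1 - alpha / 2)              # 1.96 for alpha = 0.05
    return float(norm.sf(z_crit - effect / se))   # P(detect | effect is real)

# e.g. power_two_sample(effect=0.5, sd=10, n_per_arm=6300) is ~0.80:
# roughly 6,300 users per arm for 80% power at that effect size.
```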

But artificially inflating win rate is counterproductive. Running only safe, obvious experiments raises the win rate while lowering the learning rate and the overall value of the program.

How should teams track win rate?

Track win rate as a diagnostic, not a target. A win rate that's trending upward could mean the team is getting better at forming hypotheses. It could also mean the team is getting more conservative about what it tests. The learning rate, tracked alongside win rate, disambiguates the two.
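
A sketch of that diagnostic view, assuming a hypothetical table of completed experiments keyed by quarter (column names are illustrative, not a real Confidence export):

```python
import pandas as pd

def rates_by_quarter(experiments: pd.DataFrame) -> pd.DataFrame:
    """Win rate and learning rate per quarter.

    Assumes columns: 'quarter', 'significant_positive', 'clear_learning'.
    A rising win rate with a flat or falling learning rate suggests
    growing conservatism; both rising together suggests better hypotheses.
    """
    grouped = experiments.groupby("quarter")
    return pd.DataFrame({
        "win_rate": grouped["significant_positive"].mean(),
        "learning_rate": grouped["clear_learning"].mean(),
        "n_experiments": grouped.size(),
    })
```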

Confidence surfaces both metrics at the organization level, giving experimentation program owners visibility into whether their teams are learning or just confirming.