The Real ROI of Experimentation

At Spotify, roughly 12% of experiments win. The learning rate is 64%.

That gap surprises people. If you're used to evaluating experimentation programs by counting winners, an 88% "failure rate" looks damning. But the Experiments with Learning (EwL) framework that produced those numbers tells a different story: in a recent six-month analysis of Spotify's R&D experiments, roughly two-thirds generated valid, decision-ready results that changed how teams built their products. The EwL framework defines a "learning" as any experiment that produced results clear enough for the team to make a confident ship, hold, or iterate decision. By that standard, most of Spotify's experimentation value came from experiments that didn't win.

That is the accounting error at the heart of how most companies measure experimentation ROI: they tally the winners and ignore everything else.

Why does the winners tally get the value wrong?

The case for counting winners is intuitive. A Microsoft engineer once ran an A/B test on a change to Bing's ad headline display, an idea that had been shelved for six months as low priority. Within hours, the variation was generating unexpectedly high revenue. The final result: a 12% revenue increase worth over $100M per year in the US alone. Wins like that are real, and they justify investment on their own terms.

The problem starts when that becomes the only story you tell. At Microsoft, roughly one-third of well-designed experiments are positive and statistically significant. One-third are flat. One-third are negative. At companies with more mature, already-optimized products, win rates drop to 10-20%. Spotify's 12% is typical for an organization at that level of product maturity.

If your experimentation report card is "we ran 100 experiments and 12 won," you're presenting a program with an 88% failure rate. Leadership will reasonably ask why you're spending engineering time, analyst time, and opportunity cost on a process that mostly doesn't produce wins. The answer is that the frame is wrong. Experiments produce three kinds of value, and wins are the smallest category.

Where does the value actually come from?

The first source of value is the obvious one: shipped wins. Revenue gains, engagement improvements, conversion lifts: these are real and they matter. The Bing headline test is a good example. But even here, the win happened because someone ran a test on a change that human judgment had dismissed as low priority. The experimentation program surfaced the value. Intuition missed it.

The second source is prevented harm. At Spotify, 42% of experiments result in the team deciding not to ship after guardrail metrics (metrics that monitor for unintended regressions) detect problems. That is more than two in five. Each of those decisions prevents a product regression from reaching 750 million users. That value never shows up in a winners tally, but it's often worth more than the wins themselves. A single prevented regression on a core metric can be worth months of incremental optimization.

Organizations that don't run experiments ship changes at the same rate, and a similar proportion of those changes will degrade the product. The difference is they have no mechanism to identify which ones. The regressions accumulate silently, showing up as gradual product quality decline rather than a single dramatic failure.

The third source of value is organizational learning velocity. Every experiment that produces a valid result — win or not — teaches the organization something it didn't know before. A well-designed experiment that shows a bold implementation had no effect on the target metric tells the team that the lever they pulled doesn't influence the outcome they care about. That knowledge redirects future effort.

Spotify's 64% learning rate captures exactly this: the fraction of experiments where teams made a confident ship, hold, or iterate decision based on the results. Learning rates across Spotify teams range from 16% to 76%. The teams at the higher end distinguish themselves by investing in adequate statistical power and bold enough implementations to produce unambiguous answers, not by having more wins.

Why does the value compound over time?

Consider a single product surface: Spotify's mobile home screen. In 2023, dozens of teams ran more than 250 experiments on that surface. By 2025, 58 teams were running 520 experiments a year, an average of 10 new experiments every week. Part of that growth came from more teams onboarding to the platform. But the growth also meant more learnings feeding into the next round of product decisions: each round of experiments surfaced new understanding of user behavior on that surface, which shaped better hypotheses, which produced more informative experiments the next time around. We can't prove this cycle causally, but the pattern is consistent: teams with longer experimentation histories on a surface generate better hypotheses and waste fewer experiments on dead ends.

Booking.com describes the same dynamic across 25,000 experiments per year. Their own research documents how this velocity creates "cross-pollination between teams and products, by iterating on or revisiting past failures and by disseminating successes."

The teams working on Spotify Home today have a collective understanding of user behavior on that surface that no amount of user research or market analysis could have produced. That understanding came from hundreds of experiments whose results fed back into the next cycle.

Experiment bandwidth is the binding constraint on innovation for exactly this reason. The ceiling on how fast a product gets better is set by how fast the organization can generate trustworthy evidence about what works. Every improvement to that cycle, whether faster analysis, better metric design, or fewer experiments wasted on proxy metrics that don't track real outcomes, compounds.

What should you measure instead?

If you run an experimentation program today and need to articulate its value, here's a starting point that's more honest than counting winners.

Start by classifying each completed experiment into one of three outcomes: shipped, rolled back, or informed a decision. That third category is the one most organizations don't track. The bar for "informed a decision" is simple: did the team make a different decision because of this experiment? A null result on a well-powered, boldly implemented test counts if it redirected the team's next move. A muddled result on an underpowered test doesn't, even if someone glanced at it.

Your rollback rate tells you how much harm the program prevented. If you're rolling back 30-40% of experiments, that means the platform is catching real problems before users feel them.

Your learning rate, even a rough one, tells you what fraction of experiments are producing actionable results. If it's low, look at whether too many tests are underpowered, too timidly implemented, or answering questions the team already knew the answer to.
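
One way to spot underpowered tests before they launch is a rough sample size estimate. The sketch below uses the textbook normal-approximation formula for comparing two conversion rates. The function name, the 5% significance level, the 80% power target, and the example rates are assumptions chosen for illustration; they are not figures from Spotify's analysis, and your platform may use a different method.

```python
import math
from statistics import NormalDist


def required_sample_size_per_arm(baseline_rate, expected_rate, alpha=0.05, power=0.80):
    """Approximate users needed in each arm of a two-sided, two-proportion z-test.

    baseline_rate and expected_rate are conversion rates between 0 and 1.
    alpha and power default to common conventions (assumed, not prescribed).
    """
    if not (0 < baseline_rate < 1 and 0 < expected_rate < 1):
        raise ValueError("rates must be strictly between 0 and 1")
    if baseline_rate == expected_rate:
        raise ValueError("rates must differ; a zero effect cannot be detected")

    # Critical values of the standard normal for the chosen alpha and power.
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)

    # Sum of the per-user Bernoulli variances in the two arms.
    variance = baseline_rate * (1 - baseline_rate) + expected_rate * (1 - expected_rate)
    effect = expected_rate - baseline_rate

    return math.ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)
```

With these defaults, detecting a move from a 10% to an 11% conversion rate takes roughly 15,000 users per arm. If a test cannot reach that kind of traffic, a null result from it is unlikely to count as a learning.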

Your informative experiments per team per quarter (learning rate applied at the team level) tell you whether the organization is getting faster at learning. Raw experiment count can grow by running more trivial tests. What matters is how many experiments per team produce results the team actually acted on. If that number is flat, something structural is constraining it: tooling that makes experiments slow to set up, or a culture that requires management approval before testing.

Tracking these metrics takes discipline, but it doesn't require sophisticated tooling. A shared spreadsheet where teams classify outcomes after each experiment is enough to shift the conversation from "how many winners did we get?" to "how fast are we learning?"
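
If you'd rather script it than maintain spreadsheet formulas, the same bookkeeping fits in a few lines. This is a minimal sketch under stated assumptions: the outcome labels, the Experiment record, and the summarize function are hypothetical names, and "inconclusive" is an assumed fourth bucket for experiments that fit none of the three outcomes above (for example, a muddled result on an underpowered test).

```python
from collections import Counter, defaultdict
from dataclasses import dataclass

# Hypothetical labels. "inconclusive" is an assumed bucket for experiments
# that neither shipped, rolled back, nor informed a decision.
OUTCOMES = {"shipped", "rolled_back", "informed_decision", "inconclusive"}


@dataclass(frozen=True)
class Experiment:
    team: str
    quarter: str  # e.g. "2025-Q1"
    outcome: str  # one of OUTCOMES


def summarize(experiments):
    """Return ship rate, rollback rate, learning rate, and informative counts."""
    experiments = list(experiments)
    if not experiments:
        raise ValueError("need at least one completed experiment")

    counts = Counter(e.outcome for e in experiments)
    unknown = set(counts) - OUTCOMES
    if unknown:
        raise ValueError(f"unknown outcome labels: {sorted(unknown)}")

    total = len(experiments)
    informative = total - counts["inconclusive"]

    # Informative experiments per (team, quarter): the learning-velocity view.
    per_team_quarter = defaultdict(int)
    for e in experiments:
        if e.outcome != "inconclusive":
            per_team_quarter[(e.team, e.quarter)] += 1

    return {
        "ship_rate": counts["shipped"] / total,
        "rollback_rate": counts["rolled_back"] / total,
        "learning_rate": informative / total,
        "informative_per_team_quarter": dict(per_team_quarter),
    }
```

Feed it one Experiment record per completed test and compare the per-team counts quarter over quarter. A flat number there is the signal to go looking for structural constraints.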

The ROI question, reframed

Not every organization is ready to walk into a leadership review and say "our ROI is learning velocity." If you're still building the case for experimentation investment, the concrete accounting helps: count the shipped wins and their projected revenue impact, then add the regressions your guardrails caught and an estimate of what shipping them would have cost. That's already a better story than win rate alone.
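
That accounting is simple addition, and writing it down makes the prevented-harm line item hard to leave out. The sketch below is illustrative only: the record types, field names, and the program_value function are hypothetical, and the cost of a regression that never shipped is always an estimate you have to supply yourself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShippedWin:
    name: str
    projected_annual_revenue: float  # your own projection, in one currency


@dataclass(frozen=True)
class CaughtRegression:
    name: str
    estimated_annual_cost_if_shipped: float  # an estimate, not a measurement


def program_value(wins, regressions):
    """Add projected revenue from shipped wins to the estimated harm avoided."""
    shipped_wins = sum(w.projected_annual_revenue for w in wins)
    prevented_harm = sum(r.estimated_annual_cost_if_shipped for r in regressions)
    return {
        "shipped_wins": shipped_wins,
        "prevented_harm": prevented_harm,
        "total": shipped_wins + prevented_harm,
    }
```

Reporting the two components separately, rather than only the total, keeps the estimate honest: leadership can see which part is projected revenue and which part is avoided cost.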

But if you've run your experimentation program long enough to have a track record, the real argument is simpler. Counting shipped wins, prevented harm, and learning velocity gives you an honest picture of what your experimentation program is actually worth.

The metric that matters most is the rate at which your organization learns from its experiments.