An experimentation maturity model is a framework for assessing how advanced an organization's experimentation practice is, from ad-hoc tests run by a few teams to a fully integrated system where experimentation informs every product decision. The model gives organizations a way to understand where they are, what's limiting them, and what to invest in next.
Most maturity models describe three to five levels. The specific labels vary, but the underlying progression is consistent: organizations move from running isolated experiments to coordinating across teams to measuring cumulative impact at the portfolio level.
What are the typical stages?
The Spotify Search team's public account of their maturity arc provides a concrete, three-stage example.
Stage 1: Individual experiment quality. Teams start running A/B tests, but each experiment is treated as an isolated event. The focus is on getting the basics right: writing clear hypotheses, choosing appropriate success and guardrail metrics, reaching adequate statistical power, and interpreting results correctly. Most organizations stall here because the infrastructure makes each experiment expensive to set up and analyze.
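To make the Stage 1 basics concrete, here is a minimal sketch of a power calculation, assuming a two-sided two-proportion z-test on a conversion metric; the function name and defaults are illustrative, not any particular tool's API.

```python
from scipy.stats import norm

def sample_size_per_arm(baseline_rate, min_detectable_lift,
                        alpha=0.05, power=0.8):
    """Users needed per arm to detect an absolute lift over the
    baseline conversion rate with a two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate + min_detectable_lift
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided critical value
    z_beta = norm.ppf(power)           # quantile for target power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return int(n) + 1  # round up to stay conservative

# Detecting a 1-point lift on a 10% baseline takes ~14,750 users per arm
print(sample_size_per_arm(0.10, 0.01))
```

Numbers like this are part of why Stage 1 organizations underpower experiments: the required samples are usually larger than intuition suggests.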
Stage 2: Cross-experiment coordination. As experiment volume grows, teams start interfering with each other. Two experiments changing the same surface produce results that are hard to interpret. The organization needs coordination mechanisms: shared surfaces, standardized metrics, and rules for how concurrent experiments interact. At Spotify, this is handled through the Surface concept in Confidence, which groups teams experimenting on the same product area, standardizes required metrics, and prevents overlap.
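Confidence's Surface implementation isn't described here beyond that summary, but the core overlap-prevention idea can be sketched generically: hash each user into a shared bucket space per surface and give every experiment a disjoint bucket range, so no user sees two concurrent treatments. All names and ranges below are illustrative assumptions.

```python
import hashlib

def bucket(user_id: str, salt: str, n_buckets: int = 1000) -> int:
    """Deterministically map a user into one of n_buckets."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % n_buckets

# One salt per surface: all experiments on the surface share a bucket
# space, so disjoint ranges guarantee mutual exclusion.
SURFACE_SALT = "search-results-page"  # hypothetical surface name
EXPERIMENTS = {
    "ranking-v2":  range(0, 300),    # buckets 0-299
    "new-filters": range(300, 600),  # buckets 300-599
}                                    # 600-999 held in reserve

def assign(user_id: str):
    b = bucket(user_id, SURFACE_SALT)
    for name, bucket_range in EXPERIMENTS.items():
        if b in bucket_range:
            return name
    return None  # not enrolled in any experiment on this surface
```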
Stage 3: Portfolio-level measurement. The most mature organizations stop asking "did this experiment win?" and start asking "what is the cumulative impact of everything we've shipped?" This requires holdback groups, long-running control populations, and the analytical discipline to measure aggregate effect across dozens or hundreds of shipped changes.
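A minimal sketch of the holdback arithmetic, with invented numbers: hold a small population out of every shipped change for a quarter, then compare it against the fully treated population. The gap estimates portfolio-level impact rather than any single experiment's effect.

```python
import numpy as np
from scipy import stats

def cumulative_impact(holdback, treated):
    """Aggregate lift of everything shipped, estimated as the gap
    between a long-running holdback group and treated users."""
    lift = treated.mean() - holdback.mean()
    _, p_value = stats.ttest_ind(treated, holdback)
    return lift, p_value

# Illustrative data: 1% of users held back across ~30 shipped changes
rng = np.random.default_rng(7)
holdback = rng.normal(10.0, 3.0, size=5_000)   # e.g. minutes listened
treated = rng.normal(10.4, 3.0, size=495_000)

lift, p = cumulative_impact(holdback, treated)
print(f"aggregate lift: {lift:.2f} minutes, p = {p:.3g}")
```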
Why do most organizations get stuck at Stage 1?
Two bottlenecks dominate.
The first is tooling friction. If setting up an experiment requires a data engineer to build a pipeline, an analyst to calculate sample size, and a week of configuration, teams will test only the high-stakes features. Everything else ships without evidence. Confidence addresses this by encoding statistical best practices into defaults: sample size calculation, sequential testing, CUPED variance reduction, and guardrail monitoring are built in, not bolted on.
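Of those defaults, CUPED is the most self-contained to illustrate. The sketch below shows the textbook construction rather than Confidence's internal implementation: regress the experiment metric on its pre-experiment value and subtract the explained component, which shrinks variance without biasing the treatment effect.

```python
import numpy as np

def cuped_adjust(metric, pre_metric):
    """CUPED: remove the part of the metric explained by its
    pre-experiment value; theta is the OLS coefficient."""
    cov = np.cov(metric, pre_metric)
    theta = cov[0, 1] / cov[1, 1]
    return metric - theta * (pre_metric - pre_metric.mean())

rng = np.random.default_rng(0)
pre = rng.normal(10, 3, size=10_000)                # pre-period usage
post = 0.8 * pre + rng.normal(2, 1.5, size=10_000)  # correlated outcome

adjusted = cuped_adjust(post, pre)
print(np.var(post), np.var(adjusted))  # variance drops from ~8.0 to ~2.3
```

The same experiment reaches significance with far fewer users after adjustment, which is exactly the kind of saving that only helps if it happens by default.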
The second is cultural. Stage 1 maturity produces a lot of null results, because most product ideas have no measurable positive effect. Spotify's win rate is around 12%. Organizations that interpret this as "experimentation doesn't work" stop investing. Organizations that measure learning rate instead (Spotify's is 64%, measured through the EwL framework) understand that null results are the primary source of value: they prevent bad ideas from shipping and sharpen product intuition over time.
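The two metrics differ only in what counts as a success. In the sketch below, the changed_decision flag is a hypothetical stand-in for however the EwL framework classifies a learning, which isn't specified here; the point is the arithmetic, not the classification rule.

```python
# Hypothetical experiment log; outcomes and flags are invented.
experiments = [
    {"name": "exp-1", "outcome": "positive", "changed_decision": True},
    {"name": "exp-2", "outcome": "null",     "changed_decision": True},
    {"name": "exp-3", "outcome": "null",     "changed_decision": False},
    {"name": "exp-4", "outcome": "negative", "changed_decision": True},
]

win_rate = sum(e["outcome"] == "positive" for e in experiments) / len(experiments)
learning_rate = sum(e["changed_decision"] for e in experiments) / len(experiments)

print(f"win rate: {win_rate:.0%}")            # 25%: few ideas win...
print(f"learning rate: {learning_rate:.0%}")  # 75%: ...most still teach something
```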
How do you use a maturity model without it becoming a vanity exercise?
The risk with any maturity model is that it becomes a self-assessment scorecard where the goal is to reach "Level 5" rather than to solve real problems. Three things keep it grounded.
First, tie each level to a measurable constraint. Stage 1 maturity is constrained by experiment setup cost. Stage 2 is constrained by coordination overhead. Stage 3 is constrained by the ability to measure cumulative impact. If you can't name the constraint, the level is decorative.
Second, accept that different parts of the organization will be at different stages. A company with 50 product teams might have 5 at Stage 3, 20 at Stage 2, and 25 at Stage 1. The maturity model helps you allocate investment where it produces the most learning, not where it produces the highest score.
Third, recognize that advancing stages requires infrastructure investment, not just process changes. Moving from Stage 1 to Stage 2 at Spotify required building the Surface concept and the coordination strategy that prevents experiment interference. Process documents alone don't solve interference problems.