Ship rate is the proportion of experiments whose results lead to shipping the treatment to all users. It differs from win rate because a statistically significant positive result isn't always sufficient to justify shipping. Guardrail regressions, strategic reprioritization, or a cost-benefit analysis that doesn't pencil out can all turn a "winning" experiment into a no-ship decision.
At Spotify, 42% of experiments are rolled back after guardrail metrics detect regressions. In other words, more than two in five tested changes cause measurable harm to something the team cares about protecting. Ship rate reflects the discipline to act on that information, not just the ability to detect positive results.
How does ship rate relate to win rate?
Win rate measures how often an experiment produces a statistically significant positive result on its success metric. Ship rate measures how often the team actually ships the treatment. The gap between the two reveals how much work guardrail metrics and organizational judgment are doing.
Consider a treatment that lifts engagement by 2% but degrades app start time by 150 milliseconds. The experiment has a positive result on the success metric; by the win-rate definition, it's a win. But the guardrail regression makes it a no-ship. A team with a healthy experimentation practice will have a ship rate meaningfully lower than its win rate, because it's catching tradeoffs that a success-metric-only view would miss.
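To make that tradeoff concrete, here's a deliberately simplified ship-decision rule in Python. The function, its thresholds, and the string-based guardrail report are all illustrative, not any platform's actual policy.

```python
# A simplified sketch of the decision logic described above.
def ship_decision(success_lift_pct, success_significant, guardrail_regressions):
    """Return 'ship' only when the success metric wins and no guardrail
    regressed beyond its acceptable margin."""
    if not success_significant or success_lift_pct <= 0:
        return "no-ship: no win on the success metric"
    if guardrail_regressions:
        return f"no-ship: guardrail regressions {guardrail_regressions}"
    return "ship"

# Engagement up 2% and significant, but app start time regressed past its margin:
print(ship_decision(2.0, True, ["app_start_time +150ms"]))
# no-ship: guardrail regressions ['app_start_time +150ms']
```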
The reverse also happens: teams sometimes ship treatments that didn't reach statistical significance on the success metric but showed a strong directional signal and no guardrail harm. These cases are rarer, but they're legitimate when the team has a clear strategic rationale and the risk profile is acceptable.
Why does ship rate matter for experimentation programs?
Ship rate is a signal of program maturity. A very high ship rate (approaching 100% of experiments) suggests one of two problems: either the team isn't using guardrail metrics, or the team is only testing changes that are almost certain to succeed. Both patterns indicate the experimentation program is confirming decisions rather than informing them.
A very low ship rate, on the other hand, might mean the team is testing bold ideas (good) or that the bar for shipping is so high that learning never translates into product improvement (bad). Context matters.
The most useful diagnostic is the relationship between three numbers: win rate, ship rate, and learning rate. At Spotify, the EwL (Experiments with Learning) framework tracks all three. A program where the learning rate is high but the ship rate is low is generating knowledge efficiently. A program where all three are low is burning experiment bandwidth without return.
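As a sketch of how the three numbers relate, the toy Python below computes them from a list of experiment records. The won/shipped/learned fields are hypothetical stand-ins for whatever a platform actually stores, not the EwL framework's schema.

```python
from dataclasses import dataclass

@dataclass
class Experiment:
    won: bool      # statistically significant positive success metric
    shipped: bool  # treatment rolled out to all users
    learned: bool  # result produced a documented learning

def program_rates(experiments):
    n = len(experiments)
    return {
        "win_rate": sum(e.won for e in experiments) / n,
        "ship_rate": sum(e.shipped for e in experiments) / n,
        "learning_rate": sum(e.learned for e in experiments) / n,
    }

history = [
    Experiment(won=True,  shipped=True,  learned=True),
    Experiment(won=True,  shipped=False, learned=True),   # guardrail no-ship
    Experiment(won=False, shipped=False, learned=True),
    Experiment(won=False, shipped=False, learned=False),
]
print(program_rates(history))
# {'win_rate': 0.5, 'ship_rate': 0.25, 'learning_rate': 0.75}
```

In this toy history, the win rate is double the ship rate while the learning rate stays high: the pattern of a program that catches tradeoffs and still generates knowledge.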
How does Confidence support ship rate decisions?
Confidence structures experiment results around a decision framework with distinct metric types. Success metrics measure what you're trying to improve. Guardrail metrics monitor what you're trying not to break. Quality metrics track properties the treatment should maintain. This separation prevents the common failure mode where a single headline metric dominates the ship decision.
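A hypothetical configuration makes the separation concrete. The dict shape and metric names below are invented for illustration and are not Confidence's actual schema.

```python
# Illustrative metric roles for a single experiment (invented schema).
experiment_metrics = {
    "success": [
        {"name": "engagement_rate", "direction": "increase"},
    ],
    "guardrail": [
        # regressions beyond the margin should block a ship decision
        {"name": "app_start_time_ms", "direction": "decrease", "margin_ms": 100},
        {"name": "crash_rate", "direction": "decrease", "margin_pct": 0.1},
    ],
    "quality": [
        # properties the treatment should maintain, not improve
        {"name": "playback_error_rate", "direction": "hold"},
    ],
}
```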
The platform runs non-inferiority tests on guardrail metrics by default, flagging a treatment when the data can't rule out that it is worse than control by more than an acceptable margin. This gives teams a structured basis for no-ship decisions rather than relying on gut feel about whether a regression "looks bad."
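For intuition, here is a minimal sketch of a one-sided non-inferiority check on a "lower is better" guardrail such as app start time, assuming per-group means, standard deviations, and sample sizes. The function, its margin parameter, and the example numbers are illustrative, not Confidence's API.

```python
# A minimal non-inferiority check on a guardrail metric where a
# larger value is worse (e.g. app start time in milliseconds).
import math
from scipy import stats

def non_inferiority_check(mean_t, sd_t, n_t, mean_c, sd_c, n_c,
                          margin, alpha=0.05):
    """Test H0: treatment is worse than control by at least `margin`.
    Rejecting H0 demonstrates non-inferiority; failing to reject
    flags the guardrail."""
    diff = mean_t - mean_c                      # observed regression
    se = math.sqrt(sd_t**2 / n_t + sd_c**2 / n_c)
    z = (diff - margin) / se                    # shift the test by the margin
    p = stats.norm.cdf(z)                       # one-sided: small p => diff < margin
    return {"diff": diff, "z": round(z, 2), "p": p, "non_inferior": p < alpha}

# Example: a 150 ms regression on app start time against a 100 ms margin.
print(non_inferiority_check(mean_t=1250, sd_t=400, n_t=50_000,
                            mean_c=1100, sd_c=400, n_c=50_000,
                            margin=100))
# non_inferior=False: the guardrail is flagged, supporting a no-ship.
```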
When a team decides not to ship, Confidence preserves the full experiment record: the hypothesis, the metric results, and the decision rationale. That record is what converts a no-ship into a learning. Without it, the same bad idea gets retested six months later by a different team.
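A minimal record shape, sketched below with invented field names, shows what would need to be preserved for a no-ship decision to stay discoverable.

```python
# Illustrative record structure; real platforms store richer data.
from dataclasses import dataclass, field

@dataclass
class ExperimentRecord:
    hypothesis: str
    metric_results: dict          # metric name -> estimate and verdict
    decision: str                 # "ship" or "no-ship"
    rationale: str                # why the team decided what it decided
    tags: list = field(default_factory=list)

record = ExperimentRecord(
    hypothesis="Larger artwork on the home feed increases engagement",
    metric_results={"engagement_rate": "+2.0% (significant)",
                    "app_start_time_ms": "+150 ms (guardrail breach)"},
    decision="no-ship",
    rationale="Engagement win did not justify the app start regression.",
)
```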