A guardrail metric is a metric you monitor during an experiment to ensure the change doesn't cause unintended harm, even when the success metric improves. Guardrail metrics define what you're trying not to break. At Spotify, 42% of experiments are rolled back, in many cases because guardrail metrics detected a regression. That number reflects a platform that catches real harm before it ships, not a product organization that builds poorly.
Most teams start experimenting by tracking only a success metric. They ask: did the change make the thing we wanted better? Guardrail metrics ask the second question: did the change make anything else worse? Without guardrails, a team can ship a feature that boosts engagement while degrading app performance, increases conversions by annoying users into clicking, or improves one product surface at the expense of another.
Why do guardrail metrics matter for experiment decisions?
The statistical treatment of guardrail metrics differs from success metrics in a way that has practical consequences.
For a success metric, the primary risk is a false positive: concluding the change helped when it didn't, then shipping something ineffective. For a guardrail metric, the primary risk is a false negative: missing a real regression and shipping something harmful. The costs are asymmetric, and the statistical framework should reflect that.
Confidence's decision framework, formalized in the paper "Risk-Aware Product Decisions in A/B Tests with Multiple Metrics," handles this asymmetry directly. False positive rates are controlled across success metrics (using a Bonferroni correction), while false negative rates are controlled across guardrail metrics. This means adding more guardrails to an experiment requires increasing the sample size to maintain the power to detect regressions on each one. It's a deliberate tradeoff: the more things you protect, the more data you need.
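To see the sample-size consequence concretely, here is a back-of-the-envelope power calculation. This is an illustrative sketch, not the paper's exact procedure or Confidence's implementation: the metric counts and effect size are made up, and the simple even split of the false negative budget across guardrails stands in for the paper's more careful treatment.

```python
from scipy.stats import norm

def required_n_per_group(delta, sigma, alpha, beta):
    """Two-sample z-test: users per group to detect a mean shift of `delta`."""
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided significance threshold
    z_beta = norm.ppf(1 - beta)        # power threshold
    return 2 * ((z_alpha + z_beta) * sigma / delta) ** 2

alpha, beta = 0.05, 0.20         # 5% false positive budget, 80% power
n_success, n_guardrails = 2, 4   # illustrative metric counts

# Bonferroni-style splits: alpha across success metrics, beta across guardrails.
alpha_each = alpha / n_success
beta_each = beta / n_guardrails

base = required_n_per_group(delta=0.1, sigma=1.0, alpha=alpha, beta=beta)
split = required_n_per_group(delta=0.1, sigma=1.0, alpha=alpha_each, beta=beta_each)
print(f"single metric: {base:.0f} users/group")   # ~1570
print(f"with splits:   {split:.0f} users/group")  # ~3021: more guardrails, more data
```

Splitting the false negative budget across four guardrails roughly doubles the required sample size in this example, which is exactly the tradeoff the framework makes explicit.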
What are common guardrail metrics?
The right guardrails depend on your product, but patterns emerge across organizations.
Performance metrics. App startup time, latency, crash rates, error rates. A feature that improves engagement but slows the app by 200ms has a cost that won't show up in your success metric until users start leaving.
Engagement breadth. If you're optimizing one part of the product, track whether users are still engaging with other parts. A change to search that increases search usage but decreases home feed engagement may be shifting behavior rather than creating value.
Revenue and monetization. For non-revenue experiments, revenue is often a guardrail. You want to know if a UX improvement inadvertently reduces ad impressions or subscription conversions.
User satisfaction signals. Customer support contact rates, uninstall rates, or other signals that capture whether users are having a worse experience in ways your success metric doesn't cover.
At Spotify, teams running experiments on the mobile home screen (58 teams, 520 experiments in 2025 alone) use standardized guardrail metrics across a shared experimentation surface. Confidence's Surface concept enforces this: when multiple teams experiment on the same part of the product, required guardrails are defined at the surface level so individual teams can't accidentally skip them.
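As a sketch of what surface-level enforcement means in practice, consider the logic below. The schema, metric names, and merge behavior are hypothetical, invented for illustration; this is not Confidence's actual configuration format.

```python
# Hypothetical surface-level guardrail enforcement (not Confidence's schema):
# required guardrails attach to the surface, and every experiment on that
# surface inherits them regardless of what the individual team lists.
SURFACE_GUARDRAILS = {
    "mobile-home": ["app_crash_rate", "startup_time", "ad_impressions"],
}

def experiment_metrics(surface, success_metrics, extra_guardrails=()):
    required = SURFACE_GUARDRAILS.get(surface, [])
    # Merge required and team-chosen guardrails, deduplicating but keeping order.
    guardrails = list(dict.fromkeys([*required, *extra_guardrails]))
    return {"success": list(success_metrics), "guardrails": guardrails}

print(experiment_metrics("mobile-home", ["shelf_click_rate"]))
# The surface's guardrails appear even though the team didn't list any.
```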
How should teams adopt guardrail metrics?
The Confidence blog recommends starting simple. Teams new to guardrail metrics should begin with inferiority tests: statistical tests that check whether the treatment is significantly worse than control. An inferiority test flags clear regressions without requiring you to define how much degradation you're willing to accept.
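As a concrete sketch, an inferiority check can be as simple as a one-sided two-sample z-test. The data, the metric, and the "higher is better" framing below are assumptions for illustration; this is not Confidence's API.

```python
import numpy as np
from scipy.stats import norm

def inferiority_pvalue(treat, control):
    """p-value for H1: treatment mean is lower than control mean."""
    t, c = np.asarray(treat, float), np.asarray(control, float)
    se = np.sqrt(t.var(ddof=1) / len(t) + c.var(ddof=1) / len(c))
    z = (t.mean() - c.mean()) / se
    return norm.cdf(z)  # small p-value => evidence the treatment is worse

rng = np.random.default_rng(7)
control = rng.normal(1.00, 0.5, 100_000)  # e.g., sessions per user
treat = rng.normal(0.98, 0.5, 100_000)    # a small real regression

print(f"p = {inferiority_pvalue(treat, control):.4f}")  # flags the regression
```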
The next step is non-inferiority testing, which introduces a non-inferiority margin (NIM): the threshold of acceptable deterioration. A NIM of -1% on app crash rate means you'll accept the treatment as long as it doesn't worsen the crash rate by more than 1% relative to control. Setting NIMs requires judgment about tradeoffs, which is why the incremental approach matters. Teams that jump straight to non-inferiority testing without experience interpreting guardrail results tend to set margins that are either too tight (blocking good features) or too loose (missing real harm).
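A non-inferiority test shifts the same z-statistic by the margin: instead of asking "is the treatment worse than control?", it asks "is the treatment worse than control by more than the NIM?". The sketch below assumes a higher-is-better metric and a relative margin convention; for a lower-is-better metric like crash rate, the direction flips. The data are illustrative, not Confidence's implementation.

```python
import numpy as np
from scipy.stats import norm

def noninferiority_pvalue(treat, control, nim=0.01):
    """H0: treatment is worse than control by more than `nim` (relative).
    A small p-value supports non-inferiority."""
    t, c = np.asarray(treat, float), np.asarray(control, float)
    margin = nim * c.mean()  # convert relative NIM to absolute units
    se = np.sqrt(t.var(ddof=1) / len(t) + c.var(ddof=1) / len(c))
    z = (t.mean() - c.mean() + margin) / se
    return 1 - norm.cdf(z)

rng = np.random.default_rng(7)
control = rng.normal(1.000, 0.5, 100_000)
treat = rng.normal(0.999, 0.5, 100_000)  # well inside a 1% margin

# Prints a small p-value: the treatment is non-inferior at the 1% margin.
print(f"p = {noninferiority_pvalue(treat, control, nim=0.01):.4f}")
```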
What happens when a guardrail metric regresses?
A guardrail regression doesn't automatically mean the experiment failed. It means the team has a decision to make.
Sometimes the regression is real and the right call is to roll back. Sometimes the regression is expected and acceptable given the improvement in the success metric. The guardrail's job is to surface the tradeoff so the team can make an informed decision rather than shipping blind. Confidence surfaces guardrail results alongside success metric results in the experiment analysis, making the tradeoff visible to everyone involved in the ship decision.
The 42% rollback rate at Spotify includes experiments where the success metric showed no improvement and experiments where guardrail regressions outweighed the gains. Both are cases where the platform prevented a bad change from reaching users.