Multi-metric decision making is the practice of evaluating experiment results across multiple metrics simultaneously rather than basing ship decisions on a single success metric. In real product decisions, a change that improves one metric often degrades another. The decision to ship, iterate, or roll back depends on the full picture, not just one number.
At Spotify, experiments are evaluated against three metric types: success metrics (what you're trying to improve), guardrail metrics (what you're trying not to break), and quality metrics (secondary indicators that provide context). 42% of experiments are rolled back after guardrail regressions, meaning a large share of changes would have shipped real harm undetected if the team had looked at only the success metric.
Why is a single metric not enough?
Product decisions have tradeoffs. A recommendation algorithm change that increases streams per session might decrease the diversity of artists played, reducing long-term user satisfaction. A checkout flow redesign that increases conversion rate might increase support tickets because users feel pushed through the process. A page speed improvement that reduces load time might reduce ad revenue.
None of these tradeoffs are visible if you only look at the success metric. The whole point of guardrail metrics is to make the cost of the change visible alongside the benefit.
The statistical challenge is that evaluating multiple metrics simultaneously increases the risk of false discoveries. If you test 10 metrics at a significance level of 0.05, you have roughly a 40% chance of at least one false positive (1 - 0.95^10 ≈ 0.40, assuming independent tests). Multiple testing corrections (Bonferroni being the most common in practice) adjust for this. Confidence applies multiple testing corrections to success metrics while treating guardrail metrics differently: as Spotify's research shows, the risk that needs controlling for guardrails is false negatives (missing a real regression), not false positives.
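To make that arithmetic concrete, here is a minimal Python sketch of the family-wise error rate and the Bonferroni adjustment. The numbers are the ones from the paragraph above; nothing here is Confidence's internal code.

```python
# Probability of at least one false positive when testing m independent
# metrics, each at significance level alpha (the family-wise error rate).
def family_wise_error_rate(m: int, alpha: float) -> float:
    return 1 - (1 - alpha) ** m

m, alpha = 10, 0.05
print(f"FWER with {m} metrics at alpha={alpha}: {family_wise_error_rate(m, alpha):.1%}")
# -> FWER with 10 metrics at alpha=0.05: 40.1%

# Bonferroni correction: test each metric at alpha / m instead, which
# bounds the family-wise error rate at alpha (at the cost of power).
print(f"Bonferroni per-metric threshold: {alpha / m}")  # 0.005
```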
How should different metric types be treated?
Spotify's published decision framework formalizes the answer.
Success metrics answer "did the change achieve what we intended?" False positives are the primary risk: you don't want to ship a change that appears to have helped but actually didn't. Multiple testing corrections control this risk.
Guardrail metrics answer "did the change cause harm we didn't intend?" False negatives are the primary risk: you don't want to miss a real regression. The framework recommends controlling the false negative rate across guardrails, which is a different statistical problem from controlling false positives across success metrics.
Quality metrics provide context but don't drive the ship/no-ship decision directly. A quality metric that moves unexpectedly warrants investigation, but it doesn't automatically trigger a rollback.
This distinction matters because applying the same correction procedure to all metrics is wrong in both directions. Over-correcting success metrics means you miss real improvements. Under-correcting guardrail metrics means you miss real harm.
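To illustrate that asymmetry, here is a hedged Python sketch of the decision logic, not Confidence's actual implementation: success metrics get a Bonferroni-corrected two-sided test (false-positive control), while guardrails get an uncorrected one-sided "did it get worse?" check (so real regressions are less likely to be missed). All names and thresholds are illustrative.

```python
from dataclasses import dataclass

@dataclass
class MetricResult:
    name: str
    kind: str            # "success" | "guardrail" | "quality"
    p_two_sided: float   # two-sided p-value for "did it change?"
    p_worse: float       # one-sided p-value for "did it get worse?"

def evaluate(results: list[MetricResult], alpha: float = 0.05) -> dict:
    """Illustrative multi-metric decision logic (not Confidence's API)."""
    successes = [r for r in results if r.kind == "success"]
    guardrails = [r for r in results if r.kind == "guardrail"]

    # Success metrics: Bonferroni bounds the chance of a false "ship" signal.
    per_success_alpha = alpha / max(len(successes), 1)
    improved = [r.name for r in successes if r.p_two_sided < per_success_alpha]

    # Guardrails: no alpha-shrinking correction -- a false alarm costs an
    # investigation, but a missed regression ships real harm.
    regressed = [r.name for r in guardrails if r.p_worse < alpha]

    # Quality metrics: surfaced for context, never a ship/no-ship trigger.
    context = [r.name for r in results if r.kind == "quality"]
    return {"improved": improved, "regressed": regressed, "context": context}
```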
What does multi-metric decision making look like in practice?
A team running an experiment to improve search result relevance might define their metrics as follows:
- Success metric: click-through rate on the first search result
- Guardrail metrics: search latency (p95), user-reported search quality (monthly survey), streams initiated from search
- Quality metrics: number of searches per session, scroll depth on search results page
The experiment shows a 1.2 percentage point increase in click-through rate (statistically significant after multiple testing correction). Search latency p95 is unchanged. But streams initiated from search dropped by 0.8% (significant at the guardrail threshold). The team now faces a real decision: the change makes users more likely to click the first result, but those clicks lead to fewer streams. Users might be clicking out of curiosity rather than genuine interest, or the new ranking might be surfacing popular but less personally relevant results.
This is the kind of decision that a single-metric framework can't support. The multi-metric view makes the tradeoff explicit and forces the team to reason about what's happening, not just whether a number went up.
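Feeding the search-relevance example into the sketch above makes the tension mechanical. The p-values are invented to match the narrative (CTR clearly up, latency unchanged, streams significantly worse):

```python
results = [
    MetricResult("first_result_ctr",     "success",   p_two_sided=0.001, p_worse=0.999),
    MetricResult("search_latency_p95",   "guardrail", p_two_sided=0.60,  p_worse=0.40),
    MetricResult("streams_from_search",  "guardrail", p_two_sided=0.02,  p_worse=0.01),
    MetricResult("searches_per_session", "quality",   p_two_sided=0.30,  p_worse=0.15),
]
print(evaluate(results))
# {'improved': ['first_result_ctr'],
#  'regressed': ['streams_from_search'],
#  'context': ['searches_per_session']}
```

Both signals survive into the output; the framework refuses to collapse them into a single verdict.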
How does Confidence support multi-metric decisions?
Confidence structures experiment analysis around this framework. When you set up an experiment, you designate metrics by type: success, guardrail, or quality. The platform applies the appropriate statistical corrections to each type. Guardrail metrics are monitored with inferiority tests (testing whether the treatment is worse than control) rather than the two-sided tests used for success metrics, because the question is different: you're not asking "did it change?" but "did it get worse?"
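The practical difference between the two test shapes is easy to show with a generic z-test (a scipy sketch, not Confidence's estimator):

```python
from scipy.stats import norm

# Standardized effect estimate (treatment minus control) for one metric.
z = -1.8  # treatment looks somewhat worse than control

# Two-sided test ("did it change?"), as used for success metrics.
p_two_sided = 2 * norm.sf(abs(z))  # ~0.072: not significant at 0.05

# One-sided test ("did it get worse?"), as used for guardrails; "worse"
# here means a negative effect, so we look at the lower tail.
p_worse = norm.cdf(z)              # ~0.036: flags a regression at 0.05

print(f"two-sided p = {p_two_sided:.3f}, one-sided p = {p_worse:.3f}")
```

The same data that a two-sided success-metric test would call inconclusive is flagged by the one-sided guardrail test, which is exactly the false-negative risk the framework is designed to control.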
The analysis view presents all metric types together so the decision-maker sees the complete picture. A green success metric next to a red guardrail metric is a common and important outcome, and the platform is designed to make that tension visible rather than hidden.