Lesson 2: Metric roles
In this lesson, you learn about the different roles metrics play in experiments: success metrics that define what winning looks like, guardrail metrics that protect against unintended harm, diagnostic metrics that validate your experiment ran correctly, and exploratory metrics that help you understand why results occurred. You learn how to construct a balanced metric suite that supports confident decision-making.
Why metric roles matter
Not all metrics serve the same purpose in an experiment. Some tell you whether to ship your change. Others help you understand what happened and why. Still others act as safety checks, ensuring you haven't broken something critical along the way.
The key to running effective experiments is understanding these different roles and selecting the right mix of metrics before you start. A metric that's perfect for understanding user behavior might be terrible for making a shipping decision. A metric that's essential as a guardrail might distract you if treated as a success measure.
When you define clear roles for each metric upfront, you create a framework for making decisions based on evidence rather than intuition or confirmation bias.
Success metrics
Success metrics define what winning looks like. These are the metrics you use to decide whether to ship your change. Everything else in your experiment supports this core question: did we succeed?
A good success metric directly measures the behavior your hypothesis predicts will improve. If you're testing a faster checkout flow because you believe it will increase purchase completion, your success metric should measure purchase completion. If you believe a new feature will increase user engagement, your success metric should capture that specific type of engagement.
Success metrics need to be sensitive enough to detect changes that matter but stable enough to trust. They should be clearly interpretable—when the metric moves, you should know whether that's good or bad without needing complex analysis. And they should be few in number, typically one to three. More than that and you're not being clear about what success actually means.
An e-commerce company tests a simplified checkout flow:
Hypothesis: Reducing checkout from five steps to three will increase purchase completion by making the process less frustrating and faster to complete.
Success metric: Purchase completion rate per session in the first 7 days
This metric directly measures the intended outcome. When it moves, you know whether your simplification worked. It's specific enough to be meaningful but broad enough to capture the full effect on purchasing behavior.
Select all your metrics during experiment planning, before you collect any data. Pre-defining your success criteria and guardrails helps you avoid confirmation bias and ensures you're measuring what you set out to learn, not just what happened to move.
Guardrail metrics
Guardrail metrics protect against unintended harm. These are metrics you don't expect to improve—you just need to ensure they don't get significantly worse.
Think of guardrails as the price of shipping. You might improve checkout completion, but if you also increase refund rates or hurt customer satisfaction, the change isn't worth shipping. You might boost engagement, but if you also increase server costs beyond acceptable limits or degrade app performance, you've created new problems while solving old ones.
Good guardrails measure outcomes that matter strategically but aren't the focus of your current change. Revenue, retention, and key quality metrics are common guardrails because they're critical to the business even when they're not what you're trying to move. Choose guardrails carefully—too many inflate sample size requirements and delay results, but too few leave you vulnerable to shipping changes that help one metric while hurting another.
Because you only care whether a guardrail degrades, not whether it improves, guardrail metrics are tested differently from success metrics: the test looks in one direction only, for degradation. The Advancing Experimentation course covers inferiority and non-inferiority tests for guardrail metrics in detail.
For the checkout simplification experiment, guardrails might include:
- Revenue per completed order: You're trying to increase completion, not reduce order values. Make sure removing steps didn't make high-value purchases harder.
- Refund rate: Ensure faster checkout didn't lead to more impulse purchases that customers later regret.
- Customer support contacts: Check that removing options didn't create confusion requiring support help.
- Site performance metrics: Verify that code changes didn't slow down the page or introduce errors.
You don't expect checkout changes to improve these metrics, but you need confidence they didn't get worse before shipping.
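To make the "did it get worse?" question concrete, here is a minimal sketch of a one-sided non-inferiority check for a rate-based guardrail such as refund rate, where higher is worse. It uses an unpooled Wald z-test on the difference in proportions. The function name, the counts, and the 1 percentage point margin are illustrative assumptions, not the method Confidence uses.

```python
import math


def non_inferiority_p_value(bad_control: int, n_control: int,
                            bad_treatment: int, n_treatment: int,
                            margin: float) -> float:
    """One-sided p-value for a guardrail where a higher rate is worse.

    H0: treatment rate - control rate >= margin (unacceptable degradation)
    H1: treatment rate - control rate <  margin (non-inferior)
    A small p-value supports the claim that any degradation stays within the margin.
    """
    p_c = bad_control / n_control
    p_t = bad_treatment / n_treatment
    # Unpooled standard error of the difference in proportions (Wald).
    se = math.sqrt(p_c * (1 - p_c) / n_control + p_t * (1 - p_t) / n_treatment)
    z = (p_t - p_c - margin) / se
    # Standard normal CDF at z, written with the complementary error function.
    return 0.5 * math.erfc(-z / math.sqrt(2))


# Hypothetical counts: refunds out of completed orders in each variant.
p = non_inferiority_p_value(bad_control=400, n_control=10_000,
                            bad_treatment=410, n_treatment=10_000,
                            margin=0.01)  # tolerate at most +1 percentage point
print(f"p-value: {p:.4f}, guardrail passes: {p < 0.05}")
```

The margin encodes how much degradation you are willing to accept, so it belongs in the experiment plan alongside the guardrail itself.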
Diagnostic metrics
Diagnostic metrics answer a simpler question: did the experiment run correctly? Before you interpret any results, you need to know whether the treatment was actually applied, whether users were randomized properly, and whether the implementation worked as intended.
Most platforms track many diagnostic metrics automatically. Sample ratio checks ensure you got the expected number of users in each variant. Exposure metrics confirm users actually saw your treatment. Implementation-specific metrics validate technical details like whether code deployed correctly or whether UI elements rendered.
You should always check diagnostics before diving into results. An imbalanced sample ratio can indicate a randomization bug or biased exposure logging. The most common cause of exposure bias is logging events conditionally on something affected by the treatment, which skews who appears in each variant. Low exposure means fewer users saw your change, reducing your sample size and statistical power—though the effect estimate for exposed users remains accurate. Technical errors might have affected some users but not others, creating misleading patterns in your data.
Review diagnostic checks before interpreting your success metrics to ensure your experiment ran as intended. If diagnostics show issues, investigate and fix them before drawing conclusions from your results.
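As an illustration of what a sample ratio check does, the sketch below runs a chi-square goodness-of-fit test on a two-variant experiment. The function name, the user counts, and the 0.001 alert threshold are assumptions chosen for the example; your platform may use a different test or threshold.

```python
import math


def srm_p_value(control_n: int, treatment_n: int,
                expected_treatment_share: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for a two-variant split."""
    total = control_n + treatment_n
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = ((control_n - expected_control) ** 2 / expected_control
            + (treatment_n - expected_treatment) ** 2 / expected_treatment)
    # With 1 degree of freedom, the chi-square tail probability is erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical exposure counts for an intended 50/50 split.
p = srm_p_value(control_n=50_000, treatment_n=51_500)
if p < 0.001:  # assumed alert threshold
    print(f"Possible sample ratio mismatch (p = {p:.2e}); investigate before reading results")
else:
    print(f"No evidence of sample ratio mismatch (p = {p:.3f})")
```

A very small p-value says the observed split is unlikely under the intended allocation. It does not say why, so the next step is to look at how and when exposure is logged.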
In Confidence, diagnostic checks run automatically for every experiment. The randomization is always sound, so a sample ratio imbalance points to biased exposure logging rather than a randomization issue.
Exploratory metrics
Exploratory metrics help you understand the story behind your results. You will naturally break results down by segment and look for patterns, and what you find will inform your product decisions. But it's critical to understand the pitfalls: post-hoc exploration inflates false positive rates, so great product organizations use exploratory findings primarily to inform future experiments and product iterations, not as a direct basis for shipping decisions. This is covered in detail in Lesson 10: Segment-level analysis.
Unlike success metrics, exploratory metrics don't need to be defined before the experiment starts. You can add them during analysis as questions arise. Saw an unexpected drop in your success metric? Add exploratory metrics to investigate possible causes. Saw a surprising win? Add metrics to understand which user segments benefited most or what behaviors drove the improvement.
This flexibility is powerful because you can't anticipate every question you'll want to ask. The data itself often reveals patterns you didn't predict. Exploratory metrics let you dig deeper into those patterns without the rigidity of pre-defined hypotheses.
Exploratory metrics do not affect sample size requirements or the ship/no-ship decision, but they are not statistically free. Running many post-hoc analyses inflates the false discovery rate. Treat exploratory findings as hypotheses to confirm in a follow-up experiment, not as the basis for a ship decision.
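A quick calculation shows how fast the risk grows. The sketch below assumes every exploratory comparison is independent, tested at the same significance level, and has no true effect; real exploratory metrics are usually correlated, so treat the numbers as a rough guide rather than an exact rate.

```python
def chance_of_false_positive(num_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive across independent tests with no true effects."""
    return 1 - (1 - alpha) ** num_tests


for k in (1, 5, 10, 20):
    print(f"{k:>2} exploratory comparisons: {chance_of_false_positive(k):.1%}")
```

Under these assumptions, ten exploratory comparisons at a 5% significance level carry roughly a 40% chance of at least one spurious "finding".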
Continuing the checkout flow example, you might add exploratory metrics during analysis to understand the results. If purchase completion increased, you might explore:
- Average time spent in checkout (did speed matter?)
- Cart abandonment at each remaining step (where do users still drop off?)
- Share of purchases using saved payment methods (did removing steps make saved methods more prominent?)
If results were neutral, you might explore:
- Mobile versus desktop completion rates (did the change only help one platform?)
- New versus returning customer completion (did experience level matter?)
- Order values (did you lose high-value shoppers who wanted more options?)
These metrics help you understand what drove the outcome and what to test next.
In Confidence, exploratory metrics can be added at any point during analysis. You can also automate exploration using Actions, which can trigger metric calculations, generate reports, or run analyses based on experiment results.
Complete metric suite
A complete experiment includes all four metric roles, balanced to give you both confidence in your decision and insight into your results.
At Spotify, a typical experiment uses one to three success metrics that define winning, three to five guardrail metrics for critical business outcomes, three to five diagnostic metrics to validate implementation, and five to ten exploratory metrics to understand mechanisms and context. These ranges reflect our experience and serve as a useful ballpark; they are not prescriptive rules. The right numbers depend on the scope of your change, your available sample size, how many things could plausibly go wrong, and how much you need to understand about the mechanism behind your results. The Sample Size Calculation II course covers in detail how the number of success metrics affects sample size requirements.
Complete metric suite for an e-commerce checkout simplification:
Success metric:
- Purchase completion rate per session in the first 7 days
Guardrail metrics:
- Revenue per completed order
- 30-day refund rate
- Customer support contact rate
- Payment processing errors
Diagnostic metrics:
- Treatment/control sample ratio
- Share of treatment users who saw simplified flow
- Checkout page load times
- JavaScript error rates
Exploratory metrics:
- Average checkout time
- Cart abandonment by step
- Mobile versus desktop completion rates
- Share using saved payment methods
- New versus returning customer completion
- Average items per completed order
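If you keep experiment plans in code or config, the suite above could be written down along these lines. This is a hypothetical sketch: the `Role`, `Metric`, and helper names are invented for illustration and are not the Confidence API.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    SUCCESS = "success"
    GUARDRAIL = "guardrail"
    DIAGNOSTIC = "diagnostic"
    EXPLORATORY = "exploratory"


@dataclass(frozen=True)
class Metric:
    name: str
    role: Role


# The checkout simplification suite from the lists above.
CHECKOUT_SUITE = [
    Metric("Purchase completion rate per session in the first 7 days", Role.SUCCESS),
    Metric("Revenue per completed order", Role.GUARDRAIL),
    Metric("30-day refund rate", Role.GUARDRAIL),
    Metric("Customer support contact rate", Role.GUARDRAIL),
    Metric("Payment processing errors", Role.GUARDRAIL),
    Metric("Treatment/control sample ratio", Role.DIAGNOSTIC),
    Metric("Share of treatment users who saw simplified flow", Role.DIAGNOSTIC),
    Metric("Checkout page load times", Role.DIAGNOSTIC),
    Metric("JavaScript error rates", Role.DIAGNOSTIC),
    Metric("Average checkout time", Role.EXPLORATORY),
    Metric("Cart abandonment by step", Role.EXPLORATORY),
    Metric("Mobile versus desktop completion rates", Role.EXPLORATORY),
    Metric("Share using saved payment methods", Role.EXPLORATORY),
    Metric("New versus returning customer completion", Role.EXPLORATORY),
    Metric("Average items per completed order", Role.EXPLORATORY),
]


def metrics_by_role(suite: list[Metric], role: Role) -> list[str]:
    """Return the names of all metrics in the suite that have the given role."""
    return [m.name for m in suite if m.role is role]


def decision_metrics(suite: list[Metric]) -> list[str]:
    """Success and guardrail metrics: the ones that form the ship/no-ship decision."""
    return [m.name for m in suite if m.role in (Role.SUCCESS, Role.GUARDRAIL)]


for role in Role:
    print(f"{role.value}: {len(metrics_by_role(CHECKOUT_SUITE, role))} metrics")
print(f"Metrics in the decision framework: {len(decision_metrics(CHECKOUT_SUITE))}")
```

Writing the roles down explicitly makes it easy to review the plan before launch, for example to confirm that the number of success metrics stays small.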
In Confidence, you define metric roles during experiment setup. Success and guardrail metrics form your decision framework and drive sample size requirements. Exploratory metrics do not affect sample size calculations. Learn more in the Intro to experimentation course.
What distinguishes success metrics from exploratory metrics?
Why can exploratory metrics be added during analysis in Confidence?
What is the main risk of having too many guardrail metrics?
You are testing a new product recommendation algorithm. Average order value is a critical business metric that you do not expect to improve. Which role is most appropriate?
Notes for nerds
Risk mitigation and metric types. One practical way to think about metric roles is through a risk mitigation lens: success metrics confirm you achieved your goal, guardrails protect against harm, and diagnostics confirm your experiment ran correctly. Spotify has written about this framing in the context of running experiments with smaller samples—where the choice of metric roles directly affects how much risk you're taking on at each stage. The Confidence blog post on experimenting with smaller samples digs into this, including how the risk mitigation ladder connects to metric type selection.