Lesson 2: Metric roles

Why metric roles matter

Not all metrics serve the same purpose in an experiment. Some tell you whether to ship your change. Others help you understand what happened and why. Still others act as safety checks, ensuring you haven't broken something critical along the way.

The key to running effective experiments is understanding these different roles and selecting the right mix of metrics before you start. A metric that's perfect for understanding user behavior might be terrible for making a shipping decision. A metric that's essential as a guardrail might distract you if treated as a success measure.

When you define clear roles for each metric upfront, you create a framework for making decisions based on evidence rather than intuition or confirmation bias.

Success metrics

Success metrics define what winning looks like. These are the metrics you use to decide whether to ship your change. Everything else in your experiment supports this core question: did we succeed?

A good success metric directly measures the behavior your hypothesis predicts will improve. If you're testing a faster checkout flow because you believe it will increase purchase completion, your success metric should measure purchase completion. If you believe a new feature will increase user engagement, your success metric should capture that specific type of engagement.

Success metrics need to be sensitive enough to detect changes that matter but stable enough to trust. They should be clearly interpretable—when the metric moves, you should know whether that's good or bad without needing complex analysis. And they should be few in number, typically one to three. More than that and you're not being clear about what success actually means.
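One reason to keep success metrics few is that each extra metric is another chance of a false positive, and correcting for that raises the sample size you need. The sketch below illustrates this under stated assumptions rather than any particular platform's method: a two-sided two-sample z-test, equal group sizes, a known standard deviation, and a Bonferroni split of alpha across success metrics. The function name and the example numbers are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def users_per_group(mde, sd, n_success_metrics, alpha=0.05, power=0.8):
    """Approximate users per group for a two-sided, two-sample z-test.

    Assumes equal group sizes, a known standard deviation, and a
    Bonferroni split of alpha across the success metrics.
    """
    z = NormalDist()
    adjusted_alpha = alpha / n_success_metrics  # Bonferroni correction (assumption)
    z_alpha = z.inv_cdf(1 - adjusted_alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) * sd / mde) ** 2)


# Hypothetical inputs: detect a 0.5 unit change on a metric with sd 10.
for m in (1, 3, 10):
    print(m, users_per_group(mde=0.5, sd=10.0, n_success_metrics=m))
```

Under these assumptions the required sample grows with every success metric you add, which is one concrete cost of an unfocused definition of success.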

Guardrail metrics

Guardrail metrics protect against unintended harm. These are metrics you don't expect to improve—you just need to ensure they don't get significantly worse.

Think of guardrails as the price of shipping. You might improve checkout completion, but if you also increase refund rates or hurt customer satisfaction, the change isn't worth shipping. You might boost engagement, but if you also increase server costs beyond acceptable limits or degrade app performance, you've created new problems while solving old ones.

Good guardrails measure outcomes that matter strategically but aren't the focus of your current change. Revenue, retention, and key quality metrics are common guardrails because they're critical to the business even when they're not what you're trying to move. Choose guardrails carefully—too many inflate sample size requirements and delay results, but too few leave you vulnerable to shipping changes that help one metric while hurting another.

Because you only care about whether a guardrail degrades (not whether it improves), guardrail metrics are tested differently from success metrics: the test looks in one direction only. The Advancing Experimentation course covers inferiority and non-inferiority tests for guardrail metrics in detail.
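To make the one-directional idea concrete, here is a minimal sketch of a non-inferiority test for a guardrail where higher values are better. It assumes a simple one-sided z-test on the difference in means with a pre-chosen margin; the function name, the margin, and the example figures are hypothetical, and the course referenced above may define the test differently.

```python
from math import sqrt
from statistics import NormalDist


def non_inferiority_p_value(mean_c, mean_t, sd_c, sd_t, n_c, n_t, margin):
    """One-sided z-test for a metric where higher is better.

    H0: treatment is worse than control by at least `margin`.
    H1: treatment is worse by less than `margin`, or is better.
    A small p-value supports non-inferiority.
    """
    diff = mean_t - mean_c
    se = sqrt(sd_c**2 / n_c + sd_t**2 / n_t)
    z = (diff + margin) / se
    return 1 - NormalDist().cdf(z)


# Hypothetical retention guardrail with a tolerated drop of 0.01.
p_small_drop = non_inferiority_p_value(0.40, 0.399, 0.49, 0.49, 50_000, 50_000, margin=0.01)
p_large_drop = non_inferiority_p_value(0.40, 0.385, 0.49, 0.49, 50_000, 50_000, margin=0.01)
print(p_small_drop, p_large_drop)
```

In this sketch the guardrail passes when the p-value falls below your chosen threshold. An improvement in the guardrail never counts against the change, which is the asymmetry described above.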

Diagnostic metrics

Diagnostic metrics answer a simpler question: did the experiment run correctly? Before you interpret any results, you need to know whether the treatment was actually applied, whether users were randomized properly, and whether the implementation worked as intended.

Most platforms track many diagnostic metrics automatically. Sample ratio checks verify that each variant received the expected share of users. Exposure metrics confirm users actually saw your treatment. Implementation-specific metrics validate technical details like whether code deployed correctly or whether UI elements rendered.

You should always check diagnostics before diving into results. An imbalanced sample ratio can indicate a randomization bug or biased exposure logging. The most common cause of exposure bias is logging events conditionally on something affected by the treatment, which skews who appears in each variant. Low exposure means fewer users saw your change, reducing your sample size and statistical power—though the effect estimate for exposed users remains accurate. Technical errors might have affected some users but not others, creating misleading patterns in your data.
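A sample ratio check can be sketched as a chi-square goodness-of-fit test on the user counts per variant. The version below assumes two variants and uses only the standard library; the function name and the counts are hypothetical, and your platform's built-in check may use a different test or threshold.

```python
from math import erfc, sqrt


def srm_p_value(n_control, n_treatment, expected_treatment_share=0.5):
    """Chi-square goodness-of-fit test (1 degree of freedom) for two variants."""
    total = n_control + n_treatment
    expected_t = total * expected_treatment_share
    expected_c = total - expected_t
    chi2 = (
        (n_control - expected_c) ** 2 / expected_c
        + (n_treatment - expected_t) ** 2 / expected_t
    )
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return erfc(sqrt(chi2 / 2))


# Hypothetical counts for an experiment designed as a 50/50 split.
print(srm_p_value(50_000, 50_100))  # small imbalance
print(srm_p_value(50_000, 51_000))  # larger imbalance
```

A very small p-value means the observed split is unlikely under the intended allocation, which is the signal to investigate randomization and exposure logging before reading any results. The cutoff you act on is a judgment call; a strict one is a common choice because the check runs on every experiment.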

Exploratory metrics

Exploratory metrics help you understand the story behind your results. Of course you will break results down by segments and look for patterns—and of course that will inform how you make product decisions. But it's critical to understand the pitfalls: post-hoc exploration inflates false positive rates, and great product organizations use exploratory findings primarily to inform future experiments and product iterations, not as a direct basis for shipping decisions. This is covered in detail in Lesson 10: Segment-level analysis.

Unlike success metrics, exploratory metrics don't need to be defined before the experiment starts. You can add them during analysis as questions arise. Saw an unexpected drop in your success metric? Add exploratory metrics to investigate possible causes. Saw a surprising win? Add metrics to understand which user segments benefited most or what behaviors drove the improvement.

This flexibility is powerful because you can't anticipate every question you'll want to ask. The data itself often reveals patterns you didn't predict. Exploratory metrics let you dig deeper into those patterns without the rigidity of pre-defined hypotheses.
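That flexibility comes with the cost noted earlier: every additional segment or metric you inspect is another chance of a false positive. As a rough illustration, assuming independent comparisons and no true effect in any of them, the chance of seeing at least one spurious "significant" result grows quickly. The function name below is hypothetical.

```python
def chance_of_any_false_positive(n_comparisons, alpha=0.05):
    """Probability of at least one false positive across independent tests
    when no true effect exists in any of them."""
    return 1 - (1 - alpha) ** n_comparisons


for n in (1, 5, 20):
    print(n, round(chance_of_any_false_positive(n), 2))
```

With 20 such comparisons at alpha 0.05, the chance of at least one false positive is about 64%. Real exploratory cuts are rarely independent, so treat this as intuition for why exploratory findings feed future experiments rather than shipping decisions.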

Complete metric suite

A complete experiment includes all four metric roles, balanced to give you both confidence in your decision and insight into your results.

At Spotify, a typical experiment uses one to three success metrics that define winning, three to five guardrail metrics for critical business outcomes, three to five diagnostic metrics to validate implementation, and five to ten exploratory metrics to understand mechanisms and context. These ranges reflect our experience and serve as a useful ballpark; they are not prescriptive rules. The right numbers depend on the scope of your change, your available sample size, how many things could plausibly go wrong, and how much you need to understand about the mechanism behind your results. The Sample Size Calculation II course covers in detail how the number of success metrics affects sample size requirements.
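One way to make the roles explicit before launch is to declare each metric with its role and compare the counts against the ballpark ranges above. This is a minimal sketch, not a Confidence API: the class names, metric names, and the helper function are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    SUCCESS = "success"
    GUARDRAIL = "guardrail"
    DIAGNOSTIC = "diagnostic"
    EXPLORATORY = "exploratory"


# Ballpark ranges quoted in the text above; not prescriptive limits.
BALLPARK = {
    Role.SUCCESS: (1, 3),
    Role.GUARDRAIL: (3, 5),
    Role.DIAGNOSTIC: (3, 5),
    Role.EXPLORATORY: (5, 10),
}


@dataclass(frozen=True)
class Metric:
    name: str
    role: Role


def outside_ballpark(metrics):
    """Return {role: count} for roles whose metric count is outside the ballpark."""
    counts = Counter(m.role for m in metrics)
    return {
        role: counts[role]
        for role, (low, high) in BALLPARK.items()
        if not low <= counts[role] <= high
    }


# Hypothetical suite for the faster-checkout example.
suite = [
    Metric("purchase_completion_rate", Role.SUCCESS),
    Metric("refund_rate", Role.GUARDRAIL),
    Metric("customer_satisfaction", Role.GUARDRAIL),
    Metric("app_latency", Role.GUARDRAIL),
    Metric("sample_ratio", Role.DIAGNOSTIC),
    Metric("exposure_rate", Role.DIAGNOSTIC),
]
print(outside_ballpark(suite))
```

For this suite the helper flags the diagnostic and exploratory roles. A flag is a prompt to think, not an error: exploratory metrics can be added during analysis, so a low count before launch is expected.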

Notes for nerds

Risk mitigation and metric types. One practical way to think about metric roles is through a risk mitigation lens: success metrics confirm you achieved your goal, guardrails protect against harm, and diagnostics confirm your experiment ran correctly. Spotify has written about this framing in the context of running experiments with smaller samples—where the choice of metric roles directly affects how much risk you're taking on at each stage. The Confidence blog post on experimenting with smaller samples digs into this, including how the risk mitigation ladder connects to metric type selection.