Lesson 9: Select metrics

You're ready to design metrics

You now understand what makes metrics effective. Metrics measure aggregated user behavior (lesson 1). They play distinct roles in experiments: success, guardrail, exploratory, diagnostic (lesson 2). Time windows shape when you measure and when you see results (lesson 3). Good metrics capture the right behavior without gaming or unintended consequences (lesson 4). The metric hierarchy connects tactical work to strategic outcomes (lesson 5). Clear names and documentation ensure interpretability (lesson 6). Effective variance and influenceability determine whether you can detect changes (lesson 7). Variance reduction techniques reduce effective variance (lesson 8).

The question isn't whether you understand these concepts—you do. The question is how to apply them systematically when designing an experiment.

The framework: From hypothesis to metric suite

Start with your hypothesis

Every metric choice flows from what you expect to change and why. Write your hypothesis explicitly: "[Change] will [impact behavior] because [mechanism]."

For example: "Adding one-click checkout will increase purchase completion because it reduces friction for returning customers" or "Personalized homepage recommendations will increase discovery streams because they surface relevant content users wouldn't otherwise find."

Your hypothesis identifies what behavior should change (purchase completion, discovery streams) and why (reduced friction, better relevance). This points directly to the metrics you need.

Identify what to measure

Translate your hypothesis into specific, measurable behaviors. Ask: "If my hypothesis is correct, what would users do differently?"

If reduced friction increases completion, measure purchase completion rate. If better relevance increases discovery, measure streams from recommendations. These are your candidate success metrics—they directly capture the behavior your hypothesis predicts will change.

Then identify what could go wrong. Faster checkout might increase regretted purchases—measure return rate. Better recommendations might cannibalize search—measure search engagement. These become guardrail candidates—metrics you don't expect to improve but must protect.

Assign metric roles

Build a balanced suite by assigning clear roles to each candidate metric.

Your success metrics (1-3) define what winning looks like and guide your ship/no-ship decision. Choose metrics where changes are directly attributable to your experiment—typically middle-layer metrics (local or proxy) from the metric hierarchy. They should be movable by your change and sensitive enough to detect realistic effects.

Your guardrail metrics (3-5) protect critical outcomes. Include top-layer KPIs like revenue and retention, plus any specific risks your change introduces. These ensure you don't improve one dimension while breaking another.

Exploratory metrics (5-10) enable understanding. These help you interpret why your success metrics moved and discover unexpected patterns. Unlike success metrics, you can add these during analysis as questions arise.

Diagnostic metrics (automated by most platforms) validate implementation—sample balance, exposure rates, technical health.
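
The role counts above can be checked mechanically before launch. The following is a minimal sketch, assuming a suite represented as (metric name, role) pairs. The names ROLE_RANGES and check_suite, and all metric names, are hypothetical and not part of any experimentation platform's API. Diagnostic metrics are left out because most platforms add them automatically.

```python
from collections import Counter

# Recommended number of metrics per role, taken from the guidance above.
# Diagnostic metrics are omitted: most platforms automate them.
ROLE_RANGES = {
    "success": (1, 3),
    "guardrail": (3, 5),
    "exploratory": (5, 10),
}


def check_suite(metrics):
    """Return warnings for roles whose metric count is outside the range.

    `metrics` is a list of (metric_name, role) pairs.
    """
    counts = Counter(role for _, role in metrics)
    warnings = []
    for role, (low, high) in ROLE_RANGES.items():
        n = counts.get(role, 0)
        if not low <= n <= high:
            warnings.append(f"{role}: {n} metrics, expected {low}-{high}")
    return warnings


# Hypothetical suite for the one-click checkout hypothesis.
suite = [
    ("purchase_completion_rate", "success"),
    ("return_rate", "guardrail"),
    ("revenue_per_user", "guardrail"),
    ("retention_28d", "guardrail"),
    ("checkout_page_views", "exploratory"),
]

for warning in check_suite(suite):
    print(warning)
```

The ranges are guidance rather than hard limits, so the function returns warnings instead of raising an error.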

Evaluate metric characteristics

For each success and guardrail metric, validate that it meets your needs across four dimensions.

First, check feasibility—can you compute this reliably with available data? Do you have sufficient history to estimate variance reduction potential? If not, you may need to invest in logging infrastructure before proceeding.

Second, assess effective variance and influenceability, which together determine whether you'll detect changes. You need both low effective variance (after accounting for temporal correlation and regression adjustment, not just raw variance) and high influenceability (your change must actually move the metric). Remember that binary metrics aren't automatically more sensitive: converting a continuous metric to a binary one reduces both its variance and its influenceability. A continuous metric with strong temporal correlation often outperforms a binary metric once variance reduction is applied.

Third, ensure interpretability—does the name clearly communicate what's being measured (what behavior, for whom, over what time period)? Will stakeholders understand what it means when the metric moves?

Finally, verify alignment—does this metric connect to what you actually care about? If you're using a proxy metric, have you validated that it correlates with the ultimate outcome?
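
The second check (effective variance and influenceability) can be made concrete with a rough sample-size calculation. This is a sketch under stated assumptions, not a platform's actual method: it assumes a two-sided z-test with equal group sizes, and a linear adjustment on a single pre-exposure covariate, which scales variance by (1 - rho^2) where rho is the covariate's correlation with the metric. The function names and every input number are illustrative.

```python
from math import ceil
from statistics import NormalDist


def effective_variance(raw_variance, rho):
    """Variance remaining after adjusting for one pre-exposure covariate.

    Assumes a linear adjustment with a single covariate whose correlation
    with the metric is `rho`, which scales variance by (1 - rho**2).
    """
    return raw_variance * (1 - rho ** 2)


def users_per_group(variance, effect, alpha=0.05, power=0.8):
    """Approximate users per group for a two-sided z-test on a mean difference.

    `effect` is the absolute difference you expect your change to cause,
    which is where influenceability enters the calculation.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 * variance / effect ** 2)


# Illustrative inputs only; none of these numbers come from real data.
continuous_raw = users_per_group(variance=400.0, effect=0.5)
continuous_adj = users_per_group(effective_variance(400.0, rho=0.7), effect=0.5)
binary_adj = users_per_group(effective_variance(0.21, rho=0.2), effect=0.005)

print("continuous metric, raw variance:", continuous_raw)
print("continuous metric, adjusted:    ", continuous_adj)
print("binary metric, adjusted:        ", binary_adj)
```

With these made-up inputs the adjusted continuous metric needs fewer users than the binary one, but different variances, correlations, or expected effects can reverse that ordering, so run the comparison with your own historical estimates.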

Configure time windows

Time configuration determines when you measure and when you see results. For each metric, you need to make three choices.

When does measurement start? The exposure offset controls this. Use 0 days for immediate effects, or 7+ days to measure behavior after novelty effects fade. Consider using multiple metrics with different offsets to understand both short-term response and sustained impact.

How long do you measure each user? The aggregation window determines this. Match the window to the natural cycle of your user behavior and how long effects take to manifest.

When do users appear in results? You have two options. Closed windows include a user only after their full aggregation window has elapsed. This gives cleaner interpretation, because all users are measured over exactly the same period, but results arrive more slowly. Use closed windows for success metrics guiding final decisions. Cumulative windows give faster results by including users as soon as they enter the window, even if their window is incomplete. Use these when you need earlier monitoring, such as for guardrails or for watching success metrics before they mature.
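
The three choices above can be sketched as a pair of small functions. This is an illustrative model only: the function names, the "closed" and "cumulative" mode strings, and the convention that a window's end date is exclusive are assumptions, and real platforms may define the boundaries differently.

```python
from datetime import date, timedelta


def measurement_interval(exposed_on, offset_days, window_days):
    """Return (start, end) of a user's measurement window, end exclusive.

    The exposure offset delays the start; the aggregation window sets
    how long the user is measured.
    """
    start = exposed_on + timedelta(days=offset_days)
    return start, start + timedelta(days=window_days)


def included_in_results(exposed_on, offset_days, window_days, analysis_date, mode):
    """Decide whether a user contributes to results on `analysis_date`.

    mode="closed": only once the full window has elapsed.
    mode="cumulative": as soon as the window has started, even if incomplete.
    """
    start, end = measurement_interval(exposed_on, offset_days, window_days)
    if mode == "closed":
        return analysis_date >= end
    if mode == "cumulative":
        return analysis_date >= start
    raise ValueError(f"unknown mode: {mode}")


# A user exposed on 1 March, measured for 14 days starting 7 days after exposure.
exposed = date(2024, 3, 1)
midway = date(2024, 3, 15)

print(measurement_interval(exposed, 7, 14))
print(included_in_results(exposed, 7, 14, midway, "cumulative"))
print(included_in_results(exposed, 7, 14, midway, "closed"))
```

Halfway through the window, the user already counts under the cumulative mode but not under the closed mode, which is the speed versus interpretability trade-off described above.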

Document everything

Write down for each metric: the specific behavior being measured, who it applies to (all users, activated users, exposed subset), the time configuration, expected direction of change, and why you chose this metric. Include calculation logic, filters, and any caveats.

This documentation serves three purposes: it ensures everyone interprets results consistently, it helps you catch issues before launching, and it creates a learning artifact for future teams.

Trade-offs

Real constraints force trade-offs. You can't always have perfect alignment, ideal sensitivity, and fast results simultaneously. Here's how to navigate common tensions.

When sensitivity conflicts with business alignment, remember that business-critical metrics like revenue and retention often have high variance or require slow measurement. Use sensitive proxies as primary success metrics while monitoring business metrics as guardrails. Just validate that improvements in your sensitive metric actually translate to business outcomes.

When feasibility limits ideal measurement, proxy metrics become necessary. Use them when direct metrics aren't available, but validate the correlation with historical data. Document the relationship clearly so others understand what the proxy represents.

When time constraints conflict with clean interpretation, cumulative windows give faster monitoring at the cost of more complex interpretation. Use them for early signals, but plan to validate with closed windows before making irreversible decisions.

The key is making these trade-offs deliberately, documenting them, and understanding their implications.
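
One simple way to validate a proxy against historical data, as suggested above, is to check how strongly it has tracked the business metric across past experiments or cohorts. The sketch below computes a Pearson correlation by hand. The function name and the toy numbers are hypothetical, and a strong correlation is supporting evidence rather than proof that moving the proxy moves the outcome.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Toy values, made up for illustration: one pair per past experiment,
# e.g. lift in the proxy metric and lift in the business metric.
proxy_lift = [0.5, 1.2, -0.3, 2.0, 0.8]
business_lift = [0.1, 0.4, -0.2, 0.9, 0.2]

print(f"correlation: {pearson(proxy_lift, business_lift):.2f}")
```

Record the result alongside the metric's documentation so others can see what evidence supports the proxy.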

A complete example
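
As one illustration, the sketch below documents a small suite for the one-click checkout hypothesis from earlier in this lesson. Every metric name, window length, and rationale is hypothetical, chosen only to show the shape of the documentation. Exploratory and diagnostic metrics are omitted for brevity.

```python
from dataclasses import dataclass

HYPOTHESIS = (
    "Adding one-click checkout will increase purchase completion "
    "because it reduces friction for returning customers."
)


@dataclass(frozen=True)
class MetricSpec:
    """One documented metric: what, for whom, when, and why."""
    name: str
    role: str                # "success" or "guardrail" in this sketch
    behavior: str            # the specific behavior being measured
    population: str          # who the metric applies to
    offset_days: int         # exposure offset
    window_days: int         # aggregation window
    window_mode: str         # "closed" or "cumulative"
    expected_direction: str  # "increase", "decrease", or "no change"
    rationale: str           # why this metric was chosen

    def describe(self):
        return (
            f"{self.name} [{self.role}]\n"
            f"  measures: {self.behavior}\n"
            f"  population: {self.population}\n"
            f"  time: offset {self.offset_days}d, window {self.window_days}d, "
            f"{self.window_mode}\n"
            f"  expected: {self.expected_direction}\n"
            f"  why: {self.rationale}"
        )


# All values below are illustrative assumptions, not recommendations.
SUITE = [
    MetricSpec("purchase_completion_rate", "success",
               "share of started checkouts that end in a purchase",
               "exposed returning customers", 0, 14, "closed", "increase",
               "directly captures the behavior the hypothesis predicts"),
    MetricSpec("return_rate", "guardrail",
               "share of purchases later returned",
               "exposed users with a purchase", 0, 28, "cumulative", "no change",
               "faster checkout might increase regretted purchases"),
    MetricSpec("revenue_per_user", "guardrail",
               "total revenue per exposed user",
               "all exposed users", 0, 14, "cumulative", "no change",
               "top-layer KPI that must not degrade"),
    MetricSpec("retention_28d", "guardrail",
               "share of users active 28 days after exposure",
               "all exposed users", 0, 28, "cumulative", "no change",
               "top-layer KPI that must not degrade"),
]

print(HYPOTHESIS)
for spec in SUITE:
    print(spec.describe())
```

The success metric uses a closed window because it guides the final decision, while the guardrails use cumulative windows so that problems surface early.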

You've got this

You now have a systematic approach for selecting metrics. You understand how to start with your hypothesis, translate it into measurable behaviors, assign appropriate roles, validate characteristics, configure time windows, and make deliberate trade-offs when necessary.

The framework isn't rigid—adapt it to your context. But the core principles remain: be explicit about what you expect to change and why, measure the right behavior even if it's harder, protect critical outcomes, and document your choices so others can learn from them.

Metric design is iterative. Your first attempt won't be perfect. You'll learn from each experiment what worked, what didn't, and how to refine your measurement approach. That's expected and valuable—each iteration builds your intuition and your organization's measurement capabilities.