Lesson 4: Capturing behavior
In this lesson, you learn how to design metrics that accurately capture the user behavior you care about. You explore common pitfalls to avoid—including filters that hide failures—understand the difference between direct and proxy metrics, learn about the danger of optimizing directly for proxies, and discover how to ensure your metrics truly reflect your hypothesis.
Aligning metrics with user behavior
A good metric directly measures the user behavior you want to understand or influence. This sounds simple, but it's surprisingly easy to choose metrics that capture something different from what you intended.
The key question to ask when defining a metric is: "If this metric moves, do I know what user behavior changed?"
Consider a team trying to improve product recommendations on an e-commerce site. They might instinctively reach for "impressions of recommended products" as their metric. It's easy to measure and seems relevant. But impressions only tell you how many people saw the recommendations, not whether they found them useful. A user could scroll past recommendations dozens of times without ever clicking. The metric moves, but you don't know if recommendations improved.
A better choice would measure the actual discovery behavior: did users add recommended products to their cart or wish list? This captures not just exposure or even clicks, but genuine interest—the behavior that matters.
Four common pitfalls
Four patterns repeatedly trip up teams when choosing metrics: confusing activity with value, picking metrics that can be gamed, overlooking unintended consequences, and filtering in ways that hide failures.
Activity that doesn't indicate success
The first pitfall is measuring activity that doesn't indicate success. Take search quality as an example. If you use "number of searches" as your success metric, more searches might seem like success. But the opposite could be true—more searches often means users can't find what they want on the first try. An improved search experience might actually decrease searches because users find what they need faster.
The search quality trap:
You're improving search quality and choose "number of searches" as your success metric. The experiment launches, and searches increase by 15%. Success?
Not necessarily. Analysis reveals users are performing multiple searches to find what they want, whereas before they found it on the first try. The metric improved, but the experience degraded.
Better metric: "Share of searches resulting in a play"—this captures whether searches actually led to the desired outcome.
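A minimal sketch of the difference between the two metrics, using a made-up search log. The field names (`session`, `led_to_play`) and the rows themselves are assumptions for illustration, not a real schema or real data.

```python
# Hypothetical search logs: one row per search, flagged by whether it led to a play.
control = [
    {"session": "a", "led_to_play": True},
    {"session": "b", "led_to_play": True},
    {"session": "c", "led_to_play": False},
    {"session": "d", "led_to_play": True},
]
treatment = [
    {"session": "a", "led_to_play": False},
    {"session": "a", "led_to_play": False},
    {"session": "a", "led_to_play": True},
    {"session": "b", "led_to_play": False},
    {"session": "b", "led_to_play": True},
    {"session": "c", "led_to_play": False},
]


def search_count(log):
    """Activity metric: how many searches happened."""
    return len(log)


def play_share(log):
    """Outcome metric: share of searches that resulted in a play."""
    if not log:
        return 0.0
    return sum(row["led_to_play"] for row in log) / len(log)


for name, log in (("control", control), ("treatment", treatment)):
    print(name, search_count(log), round(play_share(log), 2))
```

In this toy data the treatment arm has more searches but a lower share of searches ending in a play, which is the pattern the search count alone would hide.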
Metrics that can be improved without better experience
The second pitfall is choosing metrics that can be improved without improving the experience. If your metric is "number of feature impressions," you could increase it by showing the feature more frequently or more prominently, even if users find it intrusive. The metric improves, but the experience degrades. Measure "share of users who engaged with the feature when shown" instead—this captures actual interest, not just exposure.
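As a rough illustration, the two metrics can move in opposite directions. The counts below are invented, and the dictionary keys are hypothetical names rather than fields from any real tracking system.

```python
def engagement_rate(users_shown: int, users_engaged: int) -> float:
    """Share of users who engaged with the feature among those who were shown it."""
    if users_shown == 0:
        return 0.0
    return users_engaged / users_shown


# Made-up numbers: the treatment shows the feature far more aggressively.
control = {"impressions": 1_000, "users_shown": 800, "users_engaged": 120}
treatment = {"impressions": 3_000, "users_shown": 950, "users_engaged": 95}

control_rate = engagement_rate(control["users_shown"], control["users_engaged"])
treatment_rate = engagement_rate(treatment["users_shown"], treatment["users_engaged"])

print(f"impressions: {control['impressions']} -> {treatment['impressions']}")
print(f"engagement rate: {control_rate:.2f} -> {treatment_rate:.2f}")
```

Impressions triple here simply because the feature is shown more often, while the share of exposed users who engage goes down.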
Unintended consequences
The third pitfall is overlooking unintended consequences. Optimizing purely for "time on site" might lead you to make navigation more difficult or force users through extra steps, keeping them on the site longer while degrading their experience. Time on site matters, but only when it reflects genuine engagement. Pair it with quality indicators like "task completion rate" or "return visit rate" to ensure you're measuring the right thing.
Filters that hide failures
The fourth pitfall is filtering in ways that erase evidence of harm. If you're testing a backend optimization and filter your latency metric to successful requests only, a 4× increase in timeouts simply disappears from your analysis—the metric improves while the experience degrades. The fix is pairing a filtered success metric (latency among successful requests) with an unfiltered guardrail (overall request success rate).
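The pairing of a filtered metric with an unfiltered guardrail can be sketched as follows. The request data is invented, and representing a timeout as `None` is an assumption made for this example only.

```python
from statistics import mean

# Hypothetical request records: latency in ms, or None when the request timed out.
# In the treatment arm, the slow requests now time out instead of completing.
control = [120, 130, 125, 900, 110, 950, 115, None, 128, 122]
treatment = [100, 105, 98, None, 102, None, 99, None, 104, None]


def latency_among_successes(requests):
    """Filtered metric: mean latency over requests that completed."""
    completed = [ms for ms in requests if ms is not None]
    return mean(completed) if completed else float("nan")


def success_rate(requests):
    """Unfiltered guardrail: share of all requests that completed."""
    return sum(ms is not None for ms in requests) / len(requests)


for name, arm in (("control", control), ("treatment", treatment)):
    print(name, round(latency_among_successes(arm), 1), success_rate(arm))
```

Looking only at the filtered latency, the treatment appears much faster. The guardrail shows that timeouts went from 1 in 10 requests to 4 in 10.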
Confidence lets you configure metrics to pad missing values with zero rather than excluding them, which preserves evidence of failures rather than silently dropping them.
A related trap is post-treatment filtering: when the treatment itself changes who enters the filtered population. If a new notification algorithm sends fewer, higher-confidence notifications, click-through rate among recipients may improve—but you're measuring a cherry-picked group, not a real improvement in engagement. When the treatment determines who gets filtered in, the metric is compromised by design.
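One way to see the problem is to compute the same clicks against two denominators: recipients (a group the treatment changes) and all assigned users (a group fixed by randomization). The counts and key names below are illustrative assumptions.

```python
# Made-up counts per arm. "users" is everyone randomized into the arm;
# "recipients" is the subset that received at least one notification.
control = {"users": 1_000, "recipients": 800, "clickers": 80}
treatment = {"users": 1_000, "recipients": 400, "clickers": 60}


def ctr_among_recipients(arm):
    """Post-treatment filtered metric: the denominator depends on the treatment."""
    return arm["clickers"] / arm["recipients"]


def clickers_per_assigned_user(arm):
    """Unfiltered metric: the denominator is fixed by random assignment."""
    return arm["clickers"] / arm["users"]


for name, arm in (("control", control), ("treatment", treatment)):
    print(name, ctr_among_recipients(arm), clickers_per_assigned_user(arm))
```

In this example click-through rate among recipients rises from 0.10 to 0.15, while the share of assigned users who click falls from 0.08 to 0.06. Only the second comparison is between like-for-like populations.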
Direct metrics versus proxy metrics
Sometimes you can measure exactly what you care about. Subscription conversions measure the actual business outcome. Purchase completion measures the transaction you care about. Customer satisfaction surveys measure actual sentiment. These are direct metrics: they capture the outcome itself, not a substitute.
But direct metrics aren't always practical. You might care about long-term retention, but you might only have a few weeks to run an experiment. Or you care about customer satisfaction, but don't have survey infrastructure in place. This is where proxy metrics come in. These measure something related to your outcome, serving as a stand-in.
When to use a proxy:
You want to measure whether a new SaaS product feature creates more long-term active users. But your experiment needs to conclude in 2 weeks to meet a release deadline.
Direct metric (what you really care about): Monthly active users over 6 months
Proxy metric (what you can measure): Weekly active users and feature adoption in the first 2 weeks
The proxy works if you've validated that users who adopt the feature in the first two weeks tend to remain active long-term. Document this correlation so others understand what the metric represents.
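A simple version of that validation compares long-term activity between early adopters and non-adopters in historical data. The cohort below is invented and tiny; a real check would use far more users and would ideally confirm that the relationship also holds across past experiments, since a correlation alone does not show that adoption causes retention.

```python
# Hypothetical historical cohort, one tuple per user:
# (adopted the feature in the first two weeks, still active six months later)
history = [
    (True, True), (True, True), (True, True), (True, False),
    (False, True), (False, False), (False, False), (False, False),
]


def long_term_active_rate(rows, adopted):
    """Share of users in the given adoption group who were active long-term."""
    group = [active for did_adopt, active in rows if did_adopt == adopted]
    return sum(group) / len(group) if group else float("nan")


adopters = long_term_active_rate(history, adopted=True)
non_adopters = long_term_active_rate(history, adopted=False)
print(f"adopters: {adopters:.2f}, non-adopters: {non_adopters:.2f}")
```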
The danger of proxy optimization
Proxy metrics are useful for measuring outcomes quickly, but they become dangerous when you start optimizing directly for them instead of the real outcome they represent. This is the risk of gaming the metric: when you can improve the proxy without improving what you actually care about, the relationship breaks down.
Consider a streaming platform like Spotify. The real goal is long-term retention—keeping users coming back month after month. But retention is slow to measure, so you look for proxies. You might discover that users who like songs have much higher retention rates. This makes sense: liking a song signals intent to return and listen again. Number of liked songs becomes a useful proxy for retention.
The proxy works well for observational analysis and for experiments that indirectly affect liking behavior. But the moment you start directly optimizing for liked songs—making the like button larger, prompting users to like more often, adding like suggestions—you break the relationship. You can easily increase likes without increasing retention because you've changed what liking means. Users who casually click a prominent like button aren't signaling the same intent as users who seek out the like feature.
The proxy has been gamed. It no longer correlates with retention because you've optimized for the proxy itself rather than the underlying behavior it represented.
A proxy compromised:
Initial observation: Users who like songs have 40% higher 90-day retention than users who don't.
Conclusion: Number of liked songs is a good proxy for long-term retention.
What works: Testing new discovery features and measuring whether they increase likes (along with other engagement metrics). The relationship holds because users are liking songs for the same reasons they always did.
What breaks: Making the like button 3x larger and adding "Did you like this?" prompts after every song. Likes increase 60%, but 90-day retention doesn't budge. The proxy no longer predicts retention because you've changed user behavior—users now like songs casually rather than intentionally. The correlation is broken.
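A retrospective check for this kind of break compares the lift in the proxy with the lift in the outcome once the outcome is finally available. The values below are illustrative numbers chosen to match the scenario, and the 5% tolerance is an arbitrary assumption; a real analysis would compare confidence intervals rather than point estimates.

```python
def relative_lift(control_value: float, treatment_value: float) -> float:
    """Relative change from control to treatment."""
    return (treatment_value - control_value) / control_value


def proxy_diverged(proxy_lift: float, outcome_lift: float, tolerance: float = 0.05) -> bool:
    """Flag when the proxy moved noticeably but the outcome did not follow."""
    return abs(proxy_lift) > tolerance and abs(outcome_lift) <= tolerance


# Illustrative values: likes rise by 60%, 90-day retention is unchanged.
likes_per_user = {"control": 5.0, "treatment": 8.0}
retention_90d = {"control": 0.40, "treatment": 0.40}

proxy_lift = relative_lift(likes_per_user["control"], likes_per_user["treatment"])
outcome_lift = relative_lift(retention_90d["control"], retention_90d["treatment"])

print(f"proxy lift: {proxy_lift:.0%}, outcome lift: {outcome_lift:.0%}")
print("proxy diverged from outcome:", proxy_diverged(proxy_lift, outcome_lift))
```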
Proxy metrics are often genuinely necessary: few teams can afford to run a six-month experiment to observe long-term retention directly. The value of a proxy is that it makes outcomes measurable on an experimentally practical timescale. Use proxies for measuring outcomes, not for direct optimization. When you want to improve the real outcome (like retention), test changes that should affect the outcome directly, and use the proxy to measure results quickly. The critical discipline is validating with historical data that the proxy actually correlates with the outcome you care about, and revisiting that validation as your product and user base evolve.
Hypothesis alignment
Your hypothesis should drive your metric choice, not the other way around. Start with what you expect to happen and why, then choose metrics that would confirm or refute your expectations.
Imagine you're adding a one-click checkout option to your e-commerce site. Your hypothesis might be: "This will increase purchase completion rate because it reduces friction in the checkout process." This hypothesis points you toward specific metrics: purchases using one-click checkout (the direct behavior), overall purchase completion rate (the intended outcome), and some quality check like return rate to ensure the faster checkout doesn't lead to more impulse purchases that customers regret.
If you find yourself changing your hypothesis to match an available metric, stop and reconsider. It's better to invest in measuring the right thing than to optimize for a convenient but misleading metric.
Key questions to ask
When defining a metric, ask yourself these questions:
- Does this metric directly measure the behavior I care about, or is it a proxy?
- If I'm using a proxy, could I improve it without improving the real outcome I care about?
- If this metric improves, am I confident the user experience also improved?
- Does this metric reflect what my hypothesis says should change?
- Can someone artificially inflate this metric without improving the real outcome?
- Could optimizing for this metric have negative side effects?
If the answer to the second, fifth, or sixth question is "yes," you're at risk of measuring the wrong thing. Proxies are useful for measurement but dangerous for direct optimization.
Which metric best captures actual user interest in a feature?
What is the main risk of using number of searches as a success metric for search quality?
When should you use a proxy metric instead of a direct metric?
What should you do if an available metric does not match your hypothesis?
Why is it dangerous to optimize directly for a proxy metric like number of liked songs?
Notes for nerds
Proxy metrics in practice. Spotify has done a lot of research and thinking about how to build proxy metrics that remain valid over time, how to validate them empirically, and how to detect when the proxy-outcome relationship has drifted. This blog post covers practical approaches to proxy metric design and validation.
Post-treatment filtering as a causal inference problem. The lesson frames post-treatment filtering as a metric design pitfall and recommends pairing a filtered metric with an unfiltered guardrail. That fix is practical, but it addresses the symptom rather than the cause. When you filter on a variable that is itself affected by the treatment (for example, filtering to "users who received a notification" when the treatment changes notification volume), you are conditioning on a descendant of the treatment. In causal inference terms, this introduces selection bias: the filtered population is no longer comparable between treatment and control, because the treatment changed who entered the filter. The unfiltered guardrail helps you detect that something went wrong; diagnosing what went wrong requires understanding why the filter correlates with treatment assignment.
Goodhart's Law and Campbell's Law. The phenomenon described in this lesson, where optimizing for a proxy corrupts its relationship with the underlying outcome, has a name: Goodhart's Law. It was originally observed by economist Charles Goodhart (1975) in the context of monetary policy targets and later generalized by Marilyn Strathern as: "When a measure becomes a target, it ceases to be a good measure."
The related Campbell's Law (Donald Campbell, 1976) extends this to social contexts: "The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor."
In machine learning and reinforcement learning, the same failure mode is called reward hacking or specification gaming—an agent finds a way to maximize the reward signal without achieving the intended goal.
These aren't just theoretical concerns. Many "growth hack" failures and metric manipulation scandals in product companies are applied examples of Goodhart's and Campbell's Laws. Designing metrics that are robust to gaming, and separating the metrics you measure from the metrics you optimize, is one of the most important practical skills in experimentation.