Lesson 4: Capturing behavior

Aligning metrics with user behavior

A good metric directly measures the user behavior you want to understand or influence. This sounds simple, but it's surprisingly easy to choose metrics that capture something different from what you intended.

The key question to ask when defining a metric is: "If this metric moves, do I know what user behavior changed?"

Consider a team trying to improve product recommendations on an e-commerce site. They might instinctively reach for "impressions of recommended products" as their metric. It's easy to measure and seems relevant. But impressions only tell you how often recommendations were displayed, not whether users found them useful. A user could scroll past recommendations dozens of times without ever clicking. The metric moves, but you don't know whether the recommendations improved.

A better choice would measure the actual discovery behavior: did users add recommended products to their cart or wish list? This captures not just exposure or even clicks, but genuine interest—the behavior that matters.
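As a rough illustration, the sketch below computes both metrics from a small, made-up event log. The event names (`rec_impression`, `rec_add_to_cart`) and the log shape are assumptions for this example, not the schema of any particular analytics system.

```python
from collections import defaultdict

# Hypothetical event log: (user_id, event_name). All values are illustrative.
events = [
    ("u1", "rec_impression"), ("u1", "rec_impression"), ("u1", "rec_impression"),
    ("u2", "rec_impression"), ("u2", "rec_add_to_cart"),
    ("u3", "rec_impression"), ("u3", "rec_impression"),
    ("u4", "rec_impression"), ("u4", "rec_add_to_cart"),
]

def recommendation_metrics(events):
    """Return (total impressions, share of exposed users who added a
    recommended product to their cart)."""
    events_by_user = defaultdict(set)
    impressions = 0
    for user_id, name in events:
        events_by_user[user_id].add(name)
        if name == "rec_impression":
            impressions += 1
    exposed = [u for u, names in events_by_user.items() if "rec_impression" in names]
    converted = [u for u in exposed if "rec_add_to_cart" in events_by_user[u]]
    rate = len(converted) / len(exposed) if exposed else 0.0
    return impressions, rate

impressions, add_to_cart_rate = recommendation_metrics(events)
print(impressions, add_to_cart_rate)
```

In this toy log, user u1 contributes three impressions and no discovery behavior at all. The impression count rewards that exposure, while the per-user add-to-cart rate does not.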

Four common pitfalls

Four patterns repeatedly trip up teams when choosing metrics: confusing activity with value, picking metrics that can be gamed, overlooking unintended consequences, and filtering in ways that hide failures.

Activity that doesn't indicate success

The first pitfall is measuring activity that doesn't indicate success. Take search quality as an example. If "number of searches" is your success metric, a rise in searches looks like a win. But the opposite could be true: more searches often mean users can't find what they want on the first try. An improved search experience might actually decrease searches because users find what they need faster.
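A minimal sketch of the search example, using invented session data: each session is a list of searches, and each search records whether the user clicked a result. The "first-try success rate" metric here is one possible alternative chosen for illustration, not a prescribed definition.

```python
# Hypothetical per-session search logs. Each inner list is one session;
# each boolean says whether that search ended in a result click.
control = [
    [False, False, True],    # found it on the third try
    [False, True],
    [True],
    [False, False, False],   # gave up
]
treatment = [
    [True],
    [True],
    [False, True],
    [True],
]

def search_metrics(sessions):
    """Return (total searches, share of sessions whose first search got a click)."""
    total_searches = sum(len(session) for session in sessions)
    first_try_hits = sum(1 for session in sessions if session and session[0])
    return total_searches, first_try_hits / len(sessions)

control_searches, control_first_try = search_metrics(control)
treatment_searches, treatment_first_try = search_metrics(treatment)
print(control_searches, control_first_try)
print(treatment_searches, treatment_first_try)
```

With these numbers the treatment has fewer total searches but a much higher first-try success rate, so a team that counted searches alone would read an improvement as a regression.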

Metrics that can be improved without better experience

The second pitfall is choosing metrics that can be improved without improving the experience. If your metric is "number of feature impressions," you could increase it by showing the feature more frequently or more prominently, even if users find it intrusive. The metric improves, but the experience degrades. Measure "share of users who engaged with the feature when shown" instead—this captures actual interest, not just exposure.

Unintended consequences

The third pitfall is overlooking unintended consequences. Optimizing purely for "time on site" might lead you to make navigation more difficult or force users through extra steps, keeping them on the site longer while degrading their experience. Time on site matters, but only when it reflects genuine engagement. Pair it with quality indicators like "task completion rate" or "return visit rate" to ensure you're measuring the right thing.

Filters that hide failures

The fourth pitfall is filtering in ways that erase evidence of harm. If you're testing a backend optimization and filter your latency metric to successful requests only, a 4× increase in timeouts simply disappears from your analysis—the metric improves while the experience degrades. The fix is pairing a filtered success metric (latency among successful requests) with an unfiltered guardrail (overall request success rate).
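The pairing can be sketched as follows. The request logs are fabricated to mirror the scenario above (timeouts go from 5% to 20% of requests), and the function names are hypothetical.

```python
from statistics import mean

# Hypothetical request logs: (latency_ms, succeeded). Timed-out requests are
# recorded as failures. All numbers are made up for illustration.
control = [(120, True)] * 95 + [(5000, False)] * 5
treatment = [(90, True)] * 80 + [(5000, False)] * 20   # 4x the timeouts

def latency_among_successes(requests):
    """Filtered success metric: mean latency over successful requests only."""
    return mean(latency for latency, ok in requests if ok)

def success_rate(requests):
    """Unfiltered guardrail: share of all requests that succeeded."""
    return sum(1 for _, ok in requests if ok) / len(requests)

for name, requests in (("control", control), ("treatment", treatment)):
    print(name, latency_among_successes(requests), success_rate(requests))
```

Read alone, the filtered latency metric says the treatment is faster. The guardrail shows the success rate dropping from 0.95 to 0.80, which is the harm the filter would otherwise hide.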

A related trap is post-treatment filtering: when the treatment itself changes who enters the filtered population. If a new notification algorithm sends fewer, higher-confidence notifications, click-through rate among recipients may improve—but you're measuring a cherry-picked group, not a real improvement in engagement. When the treatment determines who gets filtered in, the metric is compromised by design.
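A toy version of the notification example makes the problem concrete. It assumes each user has a fixed "would click if notified" flag and a model confidence score, and that the treatment changes only who receives a notification, never anyone's behavior. For simplicity both arms are evaluated on the same population, standing in for two comparable randomized groups. All names and numbers are hypothetical.

```python
# Hypothetical users. "would_click" is fixed: the treatment does not change
# behavior, it only changes who gets a notification.
users = [
    {"confidence": 0.9, "would_click": True},
    {"confidence": 0.8, "would_click": True},
    {"confidence": 0.7, "would_click": False},
    {"confidence": 0.4, "would_click": True},
    {"confidence": 0.3, "would_click": False},
    {"confidence": 0.2, "would_click": False},
    {"confidence": 0.1, "would_click": False},
    {"confidence": 0.1, "would_click": False},
]

def run_arm(users, min_confidence):
    """Notify users at or above min_confidence.

    Returns (CTR among recipients, clicks per assigned user). The first is
    filtered on a post-treatment variable; the second is not.
    """
    recipients = [u for u in users if u["confidence"] >= min_confidence]
    clicks = sum(1 for u in recipients if u["would_click"])
    return clicks / len(recipients), clicks / len(users)

control_ctr, control_clicks_per_user = run_arm(users, min_confidence=0.0)
treatment_ctr, treatment_clicks_per_user = run_arm(users, min_confidence=0.6)
print(control_ctr, control_clicks_per_user)
print(treatment_ctr, treatment_clicks_per_user)
```

Click-through rate among recipients rises in the treatment arm even though nobody became more likely to click, and clicks per assigned user actually fall. Using everyone assigned to the arm as the denominator keeps the two groups comparable.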

Direct metrics versus proxy metrics

Sometimes you can measure exactly what you care about. Subscription conversions measure the actual business outcome. Purchase completion measures the transaction you care about. Customer satisfaction surveys measure actual sentiment. These are direct metrics: they capture the outcome itself, not a substitute.

But direct metrics aren't always practical. You might care about long-term retention, but you might only have a few weeks to run an experiment. Or you care about customer satisfaction, but don't have survey infrastructure in place. This is where proxy metrics come in. These measure something related to your outcome, serving as a stand-in.

The danger of proxy optimization

Proxy metrics are useful for measuring outcomes quickly, but they become dangerous when you start optimizing directly for them instead of the real outcome they represent. This is the risk of gaming: when you can improve the proxy without improving what you actually care about, the relationship between the two breaks down.

Consider a streaming platform like Spotify. The real goal is long-term retention—keeping users coming back month after month. But retention is slow to measure, so you look for proxies. You might discover that users who like songs have much higher retention rates. This makes sense: liking a song signals intent to return and listen again. Number of liked songs becomes a useful proxy for retention.

The proxy works well for observational analysis and for experiments that indirectly affect liking behavior. But the moment you start directly optimizing for liked songs—making the like button larger, prompting users to like more often, adding like suggestions—you break the relationship. You can easily increase likes without increasing retention because you've changed what liking means. Users who casually click a prominent like button aren't signaling the same intent as users who seek out the like feature.

The proxy has been gamed. It no longer correlates with retention because you've optimized for the proxy itself rather than the underlying behavior it represented.
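The breakdown can be illustrated with a deliberately extreme toy model. It assumes that only genuinely engaged users retain, and that before the change only engaged users seek out the like button. The 30/70 split and the function name are invented for this sketch.

```python
# Each entry is one hypothetical user; True means the user is genuinely
# engaged. In this toy model, engaged users retain and casual users do not.
users = [True] * 30 + [False] * 70

def like_proxy_report(users, casual_users_like):
    """Return (share of users who liked a song, retention among likers,
    overall retention).

    casual_users_like=True models a prominent like button that casual
    users also click, without becoming any more engaged.
    """
    likers = [engaged for engaged in users if engaged or casual_users_like]
    like_share = len(likers) / len(users)
    retention_among_likers = sum(likers) / len(likers)
    overall_retention = sum(users) / len(users)
    return like_share, retention_among_likers, overall_retention

before = like_proxy_report(users, casual_users_like=False)
after = like_proxy_report(users, casual_users_like=True)
print(before)
print(after)
```

After the change the like metric more than triples while overall retention stays flat, and liking a song no longer distinguishes users who retain from users who don't. Real data would be far noisier, but the mechanism is the same.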

Hypothesis alignment

Your hypothesis should drive your metric choice, not the other way around. Start with what you expect to happen and why, then choose metrics that would confirm or refute your expectations.

Imagine you're adding a one-click checkout option to your e-commerce site. Your hypothesis might be: "This will increase purchase completion rate because it reduces friction in the checkout process." This hypothesis points you toward specific metrics: purchases using one-click checkout (the direct behavior), overall purchase completion rate (the intended outcome), and a quality check such as product return rate to ensure the faster checkout doesn't lead to more impulse purchases that customers regret.
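One way to sketch the three metrics implied by this hypothesis is shown below. The `Session` fields and the sample data are assumptions made for illustration. Using all sessions as the denominator for completion rate is also an assumption, chosen so the metric doesn't depend on a step the new checkout flow might itself change.

```python
from dataclasses import dataclass

@dataclass
class Session:
    purchased: bool
    used_one_click: bool
    returned_item: bool

def checkout_metrics(sessions):
    """Compute the direct behavior, the intended outcome, and a quality check."""
    purchases = [s for s in sessions if s.purchased]
    return {
        # Direct behavior: purchases made with the new option.
        "one_click_purchases": sum(1 for s in purchases if s.used_one_click),
        # Intended outcome: share of all sessions that end in a purchase.
        "purchase_completion_rate": len(purchases) / len(sessions),
        # Quality check: share of purchases that were later returned.
        "return_rate": (
            sum(1 for s in purchases if s.returned_item) / len(purchases)
            if purchases else 0.0
        ),
    }

# Hypothetical sessions from one experiment arm.
sessions = [
    Session(purchased=True, used_one_click=True, returned_item=False),
    Session(purchased=True, used_one_click=True, returned_item=True),
    Session(purchased=True, used_one_click=False, returned_item=False),
    Session(purchased=True, used_one_click=False, returned_item=False),
] + [Session(purchased=False, used_one_click=False, returned_item=False)] * 4

metrics = checkout_metrics(sessions)
print(metrics)
```

Each metric answers a different part of the hypothesis: whether the feature was used, whether the outcome moved, and whether the gain came at the cost of regretted purchases.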

Key questions to ask

When defining a metric, ask yourself these questions:

  1. Does this metric directly measure the behavior I care about, or is it a proxy?
  2. If I'm using a proxy, could I improve it without improving the real outcome I care about?
  3. If this metric improves, am I confident the user experience also improved?
  4. Does this metric reflect what my hypothesis says should change?
  5. Can someone artificially inflate this metric without improving the real outcome?
  6. Could optimizing for this metric have negative side effects?

If the answer to question 2, 5, or 6 is "yes," or the answer to question 3 is "no," you're at risk of measuring the wrong thing. Proxies are useful for measurement but dangerous for direct optimization.

Notes for nerds

Proxy metrics in practice. Spotify has done a lot of research and thinking about how to build proxy metrics that remain valid over time, how to validate them empirically, and how to detect when the proxy-outcome relationship has drifted. This blog post covers practical approaches to proxy metric design and validation.

Post-treatment filtering as a causal inference problem. The lesson frames post-treatment filtering as a metric design pitfall and recommends pairing a filtered metric with an unfiltered guardrail. That fix is practical, but it addresses the symptom rather than the cause. When you filter on a variable that is itself affected by the treatment (for example, filtering to "users who received a notification" when the treatment changes notification volume), you are conditioning on a descendant of the treatment. In causal inference terms, this introduces selection bias: the filtered population is no longer comparable between treatment and control, because the treatment changed who entered the filter. The unfiltered guardrail helps you detect that something went wrong; diagnosing what went wrong requires understanding why membership in the filtered population depends on treatment assignment.

Goodhart's Law and Campbell's Law. The phenomenon described in this lesson, where optimizing for a proxy corrupts its relationship with the underlying outcome, has a name: Goodhart's Law. It was originally observed by economist Charles Goodhart (1975) in the context of monetary policy targets and later generalized by Marilyn Strathern as: "When a measure becomes a target, it ceases to be a good measure."

The related Campbell's Law (Donald Campbell, 1976) extends this to social contexts: "The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor."

In machine learning and reinforcement learning, the same failure mode is called reward hacking or specification gaming—an agent finds a way to maximize the reward signal without achieving the intended goal.

These aren't just theoretical concerns. Many "growth hack" failures and metric manipulation scandals in product companies are applied examples of Goodhart's and Campbell's Laws. Designing metrics that are robust to gaming, and separating the metrics you measure from the metrics you optimize, is one of the most important practical skills in experimentation.