When Proxy Metrics Break: How Optimizing for Proxies Can Backfire

Mårten Schultzberg, Staff Data Scientist


Selecting what to optimize for is one of the recurring challenges of online experimentation. This post discusses how badly things can go wrong when proxy metrics start to influence product development.

The Role of Metrics in Online Experiments

When we run experiments, we're not looking at a single number to make our decisions. Instead, we rely on a collection of metrics that together paint a complete picture of how a change affects our product and users.

At the heart of every experiment are success metrics: the outcomes we're actively trying to improve. These directly measure whether our changes are moving the business forward. Alongside these, we track guardrail metrics to ensure we're not causing unintended harm. A feature might increase engagement, but if it degrades performance or harms a critical part of the user experience, we need to know before we ship it.

The decision to launch a feature is rarely based on a single metric moving in the right direction. Instead, we look at the complete scorecard: Did our success metrics improve? Did we stay within acceptable bounds on our guardrails? Does the overall picture suggest we've created genuine value for users? Read more about Spotify's framework for making decisions from multi-metric experiments in this blog post.

This holistic approach works well, but one big problem remains: the metrics we care about most are often slow to measure. This is where proxy metrics enter the picture, and where things can start to go wrong.

What Is a Proxy Metric?

At Spotify, long-term retention is often the ultimate business goal. It's what we truly care about. But measuring retention is slow by construction. To iterate quickly, we need a feedback loop that works in days or weeks, not months.

Proxy metrics address this problem. We look for short-term behaviors that signal a user's intent to return. For example, when a user likes a song, it's not a big leap to infer that they intend to listen to it again later, which in turn supports improved retention. How to select proxy metrics is a topic of its own, but to limit the scope of this post, let's assume we have found one.

Having a good proxy metric allows us to move fast. We can run a dozen experiments in the time it would take to measure one retention cycle. The proxy metric is an excellent yardstick for success until it becomes the blueprint for the product.

The Proxy-Metric Trap

The proxy-metric trap is subtle. It happens when you stop building features to improve the long-term goal, like retention, and start building specifically to move the proxy, like Liked Songs.

A proxy metric should work like a compass: it tells you whether you're heading in the right direction toward a destination you can't yet see. But when you start optimizing for the proxy itself, you're no longer following the compass. You're manipulating it with a magnet. You add a bigger like button here, a nudge animation there, and suddenly likes are up 20%. The needle points in a clear direction, but that direction no longer leads to your destination.

If you believe "more likes equals more retention," the logical next step is to make liking songs as easy as possible. You might make the button larger, place it in more prominent locations, or add "nudge" animations. These changes will almost certainly increase likes.

But because these changes were designed to directly move the proxy metric rather than the long-term outcome, the relationship breaks and the proxy metric stops being a proxy metric.

I've seen this play out in real experiments. A team ships changes that boost the proxy by 20%. The experiment looks like a clear win. Later, when the long-term data comes in, the long-term north-star metric has barely moved. The team has spent weeks of engineering time on changes that generated clicks but not value. Worse, they've now trained themselves to think about features in terms of "how do we get more likes" rather than "how do we help users discover music they'll love."

Why Proxies Break Under Optimization

Proxy metrics are effective because they capture organic behavior. When a user discovers a song they love and chooses to like it, that action reveals genuine value. This organic behavior is the signal you want to measure.

But when you optimize for the proxy directly, you can weaken the connection between the action and the long-term outcome. You're no longer measuring organic intent. Instead, you're measuring your ability to extract a specific click. The proxy becomes a target, and as Goodhart's Law warns:

When a measure becomes a target, it ceases to be a good measure.

The causality question. There's an important nuance here worth exploring. In principle, if improvements in the proxy directly cause improvements in the long-term outcome (that is, if the effect of your treatment on retention flows through the proxy), then optimizing directly for it should work.

The problem is that this condition is extremely difficult to verify in practice. First, you need to establish that the proxy actually drives (technically mediates) the effect rather than just being another outcome that happens to move alongside retention. Second, even if it does drive the effect, the relationship might not be linear or might break down at certain ranges. Perhaps liking songs improves retention up to a point, but beyond some threshold (say, when users have liked hundreds of songs) additional likes don't matter anymore. Or worse, forced likes might even harm retention if users return to find their library cluttered with songs they don't actually love.

The core issue concerns the functional form of the relationship, which is hard to know with certainty. The research on surrogate endpoints (see Athey et al. for a more formal treatment) shows just how stringent the conditions need to be for this to work reliably. You need the proxy to capture essentially all the treatment effect, and you need that relationship to hold across different contexts and magnitudes.

In most real-world product development, we simply don't have enough certainty about these conditions to optimize directly for proxies without risking breaking the very signal they're meant to capture.
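To make the mediation point concrete, here is a minimal sketch of a mediation-style check in Python, using simulated data and hypothetical column names (`treatment`, `likes`, `retained`): estimate the treatment effect on retention with and without adjusting for the proxy. If the effect barely shrinks once the proxy is included, the proxy is not carrying much of the effect. This is an illustration only, not a substitute for the formal surrogacy conditions discussed in Athey et al.

```python
# Sketch of a mediation-style check on simulated experiment data.
# Column names and effect sizes are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 10_000

# Simulated data: treatment nudges likes, and likes partially drive retention.
treatment = rng.integers(0, 2, n)
likes = rng.poisson(lam=2 + 0.5 * treatment)
p_retained = 1 / (1 + np.exp(-(-1.0 + 0.15 * likes + 0.05 * treatment)))
retained = rng.binomial(1, p_retained)

df = pd.DataFrame({"treatment": treatment, "likes": likes, "retained": retained})

# Total effect of treatment on retention (linear probability model for simplicity).
total = smf.ols("retained ~ treatment", data=df).fit()

# Effect of treatment after adjusting for the proxy.
adjusted = smf.ols("retained ~ treatment + likes", data=df).fit()

print(f"Total effect on retention:   {total.params['treatment']:.4f}")
print(f"Effect adjusting for likes:  {adjusted.params['treatment']:.4f}")
# If the adjusted effect is much smaller than the total effect, most of the
# treatment effect flows through the proxy -- a necessary (but not sufficient)
# condition for optimizing the proxy directly.
```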

The Cost of Proxy-Metric Manipulation

What makes this particularly insidious is that your experiments will show "positive" results. Likes will go up. You'll ship the changes. Your success metrics will look healthy quarter over quarter. Meanwhile, the thing you actually care about, retention, stays flat or even declines. In the "liking songs" example, it's entirely possible that retention goes down if users are coerced into liking songs they aren't actually fond of when they come back to listen to them.

You've spent engineering resources and product surface area on changes that don't move the business forward. Worse, you've trained your organization to optimize for the "click" rather than the "value."

The organizational cost is real too. Teams are often evaluated on quarterly metrics. If your proxy looks good this quarter, you get celebrated. By the time the long-term data reveals the proxy has decoupled from retention, you might be working on something else entirely. The incentive structure makes it tempting to keep optimizing the thing you can measure quickly, even when you suspect it might not translate.

How to Use Proxy Metrics Safely

This doesn't mean you should stop using proxies as success metrics. They remain essential for fast iteration. The key is to resist the temptation to let the proxy influence the design of the solution.

Build for value, measure the signal. If you improve a recommendation algorithm, "Liked Songs" should go up because the music is better. Such an increase represents a healthy signal. If you simply make the "Like" button pulse, the metric goes up, but the signal is empty. Before implementing a change, ask yourself: will this change drive the long-term outcome, or is it aimed only at the short-term proxy? You should always target the long-term outcome with your changes.

Use proxies as diagnostic tools. Use proxies to validate that a change is working as intended. If a feature designed to improve discovery doesn't move "Liked Songs," it's a sign the discovery isn't actually better. The proxy should confirm your hypothesis, not define it.

Use more than one proxy. By using more than one proxy, it's easier to keep yourself honest. If these proxies are sufficiently different from each other while still correlating with the long-term outcome, it's unlikely you can over-optimize for one without harming the others. When multiple independent signals move together, you're probably creating real value.
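As a rough illustration of how this can be checked, the sketch below (with invented proxy names and lift numbers) flags the pattern where exactly one proxy moves sharply while the others stay flat, which is often the signature of proxy manipulation rather than real value.

```python
# Hypothetical scorecard: relative lifts per proxy for one experiment.
# Metric names and numbers are invented for illustration.
proxy_lifts = {
    "liked_songs": 0.20,       # +20%
    "playlist_adds": 0.01,     # +1%
    "minutes_played": -0.002,  # -0.2%
}

def looks_like_proxy_gaming(lifts: dict[str, float],
                            big: float = 0.10,
                            flat: float = 0.02) -> bool:
    """Flag the case where exactly one proxy moves a lot while the rest stay flat."""
    big_movers = [m for m, lift in lifts.items() if abs(lift) >= big]
    flat_ones = [m for m, lift in lifts.items() if abs(lift) < flat]
    return len(big_movers) == 1 and len(flat_ones) == len(lifts) - 1

if looks_like_proxy_gaming(proxy_lifts):
    print("Suspicious: one proxy is up a lot while the others are flat.")
```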

Validate against the true north. For major shifts, run longer-term experiments or holdout groups to ensure that your proxy wins are actually translating into the retention gains you expect. At Spotify, we often achieve this by running a rollout after an A/B test, keeping the reach at 90% (leaving a 10% holdout on the old version) for several weeks to measure, if not long-term retention, then at least retention at, say, six weeks.
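Assuming you already have six-week retention counts for both groups (the numbers below are made up), the final comparison can be as simple as a two-proportion test, sketched here:

```python
# Sketch: compare six-week retention between the 90% rollout (new experience)
# and the 10% holdout (old experience). Counts below are made up.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

retained = np.array([41_200, 4_480])   # retained users: rollout, holdout
exposed = np.array([90_000, 10_000])   # users exposed in each group

stat, p_value = proportions_ztest(count=retained, nobs=exposed)
rates = retained / exposed
print(f"Rollout retention: {rates[0]:.3f}, holdout retention: {rates[1]:.3f}")
print(f"Difference: {rates[0] - rates[1]:+.3f} (z = {stat:.2f}, p = {p_value:.3f})")
# If the proxy win was real, the rollout group should retain at least as well
# as the holdout after six weeks.
```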

How LTV Models Amplify the Proxy Trap

Lifetime value (LTV) metrics deserve special attention because they sometimes magnify the proxy trap. LTV models try to predict long-term user value using short-term signals. Essentially, they're forecasts of user value built from proxies.

The problem compounds when teams don't understand what goes into the LTV calculation. If your LTV model heavily weights "Liked Songs," a team might ship features that boost likes by 15%, see their LTV projections climb, and declare victory. But they've optimized for a part of the model rather than actual lifetime value.

This creates a false sense of achievement. The team believes they've increased user value when they've really just gamed the formula. Because LTV models are often treated as black boxes, these dynamics can persist for months before anyone notices the predicted gains never materialized.

When building LTV models, teams typically choose inputs based on correlation with long-term outcomes. But they rarely ask: "Which of these proxies could product teams inadvertently optimize for?" A highly correlated signal today becomes a poor predictor tomorrow once teams start building features specifically to move it.

If you use LTV metrics, make the components transparent to product teams. Ensure they understand that LTV should move as a side effect of genuinely improving the user experience, not that moving LTV components is the goal itself. Consider even tracking the LTV components separately to make sure they are moving in the right direction collectively. If only one component is moving in the right direction, there is reason for suspicion.
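To make the transparency idea concrete, here is a sketch of a deliberately simple, fully visible LTV decomposition; the component names and weights are invented. Reporting each component's contribution next to the total predicted LTV delta makes it obvious when the whole gain comes from a single input.

```python
# Sketch of a transparent, linear LTV decomposition. Component names and
# weights are invented; a real model would be fit on historical data.
LTV_WEIGHTS = {"liked_songs": 0.8, "playlists_created": 2.5, "minutes_played": 0.01}

def ltv_delta_breakdown(deltas: dict[str, float]) -> dict[str, float]:
    """Contribution of each component's change to the predicted LTV change."""
    return {name: LTV_WEIGHTS[name] * deltas[name] for name in LTV_WEIGHTS}

# Per-user metric changes measured in an experiment (hypothetical numbers).
experiment_deltas = {"liked_songs": 1.5, "playlists_created": 0.0, "minutes_played": 0.0}

breakdown = ltv_delta_breakdown(experiment_deltas)
total = sum(breakdown.values())
for name, contribution in breakdown.items():
    share = contribution / total if total else 0.0
    print(f"{name:>18}: {contribution:+.2f} ({share:.0%} of predicted LTV gain)")
# If ~100% of the predicted gain comes from one component, treat the "LTV win"
# with suspicion.
```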

The Broader Lesson

Experimentation is about learning what works. Choosing metrics, or more specifically what to optimize for, is a problem with far more devils in its details than you might first think. The proxy-metric trap is just one example of how well-intentioned measurement can lead teams astray when the incentives aren't aligned with the outcomes.

Proxy metrics aren't inherently problematic. They're essential tools that enable fast iteration and learning. The danger emerges when we start optimizing directly for them without understanding whether they truly drive the effects we care about. In theory, there are conditions under which this could work (when the proxy captures all pathways to the outcome and the relationship holds across all ranges). In practice, we rarely have enough certainty about these conditions to optimize directly for proxies without breaking the very signal they're meant to capture.

The hardest part isn't setting up the metrics. It's maintaining the discipline to remember what you actually care about when the proxy is telling you everything is fine.

Want to experiment like Spotify? Reach out to book a demo.