Metrics

What is metric drift?

Metric drift is a gradual, non-experiment-related change in a metric's baseline value over time. Seasonality, product updates outside the experiment, shifting user demographics, market changes, or changes in data collection can all push a metric up or down independently of any treatment effect. Drift becomes a problem when it's large enough to confuse the interpretation of experiment results.

Judged against its pre-experiment baseline, a metric drifting upward during your experiment makes a neutral treatment look positive and a positive treatment look even better, while a metric drifting downward masks real improvements. Neither reading produces trustworthy evidence.

How does metric drift affect experiment results?

In a well-randomized A/B test, drift affects both control and treatment equally. The randomization ensures that any external trend applies to both groups, so the difference between groups still reflects the treatment effect. This is one of the core strengths of randomized experiments over before-and-after comparisons.
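To make this concrete, here is a minimal simulation (all numbers are illustrative assumptions, not Confidence behavior) showing that a shared drift cancels out of the randomized comparison but dominates a before-and-after comparison:

```python
import numpy as np

rng = np.random.default_rng(42)
n_days, n_per_day = 28, 1000
true_effect = 0.5        # real treatment lift
drift_per_day = 0.2      # external upward drift, unrelated to the treatment

control, treatment = [], []
for day in range(n_days):
    base = 10.0 + drift_per_day * day  # baseline drifts upward over time
    control.append(base + rng.normal(0, 2, n_per_day))
    treatment.append(base + true_effect + rng.normal(0, 2, n_per_day))
control, treatment = np.concatenate(control), np.concatenate(treatment)

# Randomized comparison: the drift appears in both arms and cancels.
print("A/B estimate:", round(treatment.mean() - control.mean(), 2))  # ~0.5

# Before/after comparison on the control arm alone: pure drift, no treatment.
first_week, last_week = control[:7 * n_per_day], control[-7 * n_per_day:]
print("Before/after:", round(last_week.mean() - first_week.mean(), 2))  # ~4.2
```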

The problem arises in three scenarios.

First, when drift interacts with exposure timing. If users enter the experiment over an extended period and the baseline metric is shifting during that window, early entrants and late entrants have different baselines. This doesn't bias the average treatment effect estimate, but it increases variance, reducing the experiment's power to detect real effects.
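A toy sketch of that variance cost (the enrollment window, drift rate, and noise level are all assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 20_000
entry_day = rng.integers(0, 28, size=n)  # users enroll over a 4-week window

def outcome_std(drift_per_day: float) -> float:
    # Each user's baseline depends on when they entered the experiment.
    baseline = 10.0 + drift_per_day * entry_day
    y = baseline + rng.normal(0, 2, n)
    return y.std(ddof=1)

print("outcome std without drift:", round(outcome_std(0.0), 2))  # ~2.0
print("outcome std with drift:   ", round(outcome_std(0.3), 2))  # ~3.1
# Same treatment signal, more noise: wider intervals and less power.
```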

Second, when teams interpret absolute metric values instead of treatment-vs-control differences. A team that sees their success metric decline during an experiment may panic and stop the test, even though the decline is happening equally in both groups and the treatment effect is genuinely positive.

Third, when drift affects guardrail metrics. If a guardrail metric is trending downward for reasons unrelated to the experiment, the non-inferiority test may flag a regression that the treatment didn't cause. Teams need to distinguish between "the treatment made this metric worse" and "this metric is getting worse everywhere."

How do you detect and account for metric drift?

CUPED (Controlled-experiment Using Pre-Experiment Data) helps directly. By using each user's pre-experiment metric value as a covariate, CUPED adjusts for baseline differences between users, which reduces the impact of drift that correlates with pre-experiment behavior. Confidence applies CUPED by default, which mitigates a substantial portion of drift-related noise.
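As a sketch of the mechanics (a generic CUPED implementation with made-up data, not Confidence's internal code), the adjustment subtracts the component of the metric predicted by each user's pre-experiment value:

```python
import numpy as np

def cuped_adjust(y: np.ndarray, x_pre: np.ndarray) -> np.ndarray:
    """Return Y - theta * (X_pre - mean(X_pre)), theta = cov(Y, X_pre) / var(X_pre)."""
    theta = np.cov(y, x_pre)[0, 1] / np.var(x_pre, ddof=1)
    return y - theta * (x_pre - x_pre.mean())

rng = np.random.default_rng(0)
x_pre = rng.normal(10, 3, 10_000)           # pre-experiment metric values
y = 0.8 * x_pre + rng.normal(0, 1, 10_000)  # in-experiment values, correlated

y_adj = cuped_adjust(y, x_pre)
print("mean unchanged:", round(y.mean(), 2), "->", round(y_adj.mean(), 2))
print("variance:", round(y.var(ddof=1), 2), "->", round(y_adj.var(ddof=1), 2))  # ~6.8 -> ~1.0
```

The adjustment leaves the metric's mean intact, so treatment-vs-control differences are preserved while the variance contributed by stable user-level behavior is removed.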

Monitoring control group trends is the most straightforward detection method. If the control group's metric is moving significantly over the experiment's runtime, drift is present. Confidence surfaces time-series views of metrics for both groups, making drift visible without requiring manual analysis.
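The underlying check is simple. A hypothetical version fits a linear trend to the control group's daily averages and tests whether the slope differs from zero (the data values below are made up):

```python
import numpy as np
from scipy import stats

# One value per day: the control group's daily metric average.
daily_means = np.array([10.1, 10.3, 10.2, 10.6, 10.5, 10.9, 10.8,
                        11.0, 11.2, 11.1, 11.4, 11.3, 11.6, 11.8])
days = np.arange(len(daily_means))

fit = stats.linregress(days, daily_means)
print(f"slope per day: {fit.slope:.3f}, p-value: {fit.pvalue:.2g}")
if fit.pvalue < 0.05:
    print("Control metric is trending: drift is present.")
```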

For guardrail metrics specifically, longitudinal tracking across experiments provides a broader view. A single experiment might trigger a guardrail alert due to drift, but examining the metric across dozens of experiments over months reveals whether the trend is systemic or experiment-specific.
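A hypothetical version of that longitudinal check regresses the guardrail's control-arm baseline on experiment start date across past experiments (all values illustrative):

```python
import numpy as np
from scipy import stats

# Control-arm baseline of the same guardrail metric, one point per experiment,
# indexed by the week each experiment started.
start_week = np.array([1, 3, 4, 7, 9, 12, 15, 18, 21, 24])
baseline = np.array([4.9, 4.8, 4.9, 4.7, 4.6, 4.6, 4.5, 4.4, 4.4, 4.3])

fit = stats.linregress(start_week, baseline)
print(f"baseline slope per week: {fit.slope:.4f} (p={fit.pvalue:.2g})")
# A steady downward slope across unrelated experiments points to systemic
# drift rather than a regression caused by any single treatment.
```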

When is metric drift a sign of a deeper problem?

Persistent drift in a guardrail metric across many experiments is a signal worth investigating. It might indicate that the metric's definition is stale (measuring something that no longer reflects the underlying concept), that the user population is changing in ways the metric set doesn't capture, or that cumulative product changes are having an unintended compounding effect.

At Spotify, where teams run thousands of experiments per year, distinguishing metric drift from cumulative experiment effects requires tracking metrics longitudinally. Individual experiments may each pass their guardrail checks while the aggregate effect of many small regressions accumulates into a meaningful decline. This is the problem longitudinal guardrails are designed to solve.