Metric sensitivity is how responsive a metric is to real product changes. A sensitive metric moves when the product experience genuinely changes. An insensitive metric stays flat even when users are having a meaningfully different experience. Sensitivity determines whether your experiment can detect the effect you're trying to measure.
If your success metric isn't sensitive to the change you're testing, you'll get null results that look like "the change had no effect" when the real conclusion is "our metric couldn't see the effect." This distinction matters enormously. One is a learning about the product. The other is a measurement failure.
How do you assess whether a metric is sensitive enough?
The most direct test is statistical power. Power depends on three things: sample size, the variance of the metric, and the minimum detectable effect (MDE) you care about. A metric with high variance relative to the expected effect size will require a much larger sample to detect the same change.
You can estimate sensitivity before running an experiment. Calculate the MDE your metric can detect at 80% power given your available sample size. If the MDE is larger than the effect you'd consider practically meaningful, the metric is too insensitive for this experiment. You need a different metric, more traffic, or a variance reduction technique.
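As a rough illustration, the MDE for a difference in means at 80% power and 5% significance can be approximated from the metric's standard deviation and the per-group sample size. The numbers below are made up; this is a sketch, not a substitute for a proper power analysis tool.

```python
import math

# Two-sided z-test approximation for the minimum detectable effect (MDE)
# of a difference in means: MDE ≈ (z_{1-α/2} + z_{power}) * sqrt(2σ² / n).
Z_ALPHA = 1.96   # two-sided α = 0.05
Z_POWER = 0.84   # 80% power

def minimum_detectable_effect(std_dev: float, n_per_group: int) -> float:
    """Smallest absolute difference in means detectable at 80% power."""
    return (Z_ALPHA + Z_POWER) * math.sqrt(2 * std_dev ** 2 / n_per_group)

# Hypothetical numbers: a metric with mean 5.0 and standard deviation 12.0,
# and 50,000 users available per group.
mde = minimum_detectable_effect(std_dev=12.0, n_per_group=50_000)
print(f"Absolute MDE: {mde:.3f}")        # ≈ 0.21
print(f"Relative MDE: {mde / 5.0:.1%}")  # ≈ 4.3% of the baseline mean
```

If a roughly 4% relative lift is larger than anything you'd consider practically meaningful, this metric is too insensitive at this traffic level.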
At Spotify, teams use CUPED (Controlled-experiment Using Pre-Experiment Data) to reduce metric variance by roughly 50%. That reduction translates directly into sensitivity: a metric that needed 4 weeks of runtime to detect a 1% effect can detect the same effect in about 2 weeks after CUPED is applied. Confidence runs CUPED by default on metrics where pre-experiment data is available.
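A minimal sketch of the CUPED adjustment itself, assuming each user has a pre-experiment value of the same metric to use as the covariate. The data and variable names are illustrative, and this is not a description of Confidence's implementation.

```python
import numpy as np

def cuped_adjust(y: np.ndarray, x_pre: np.ndarray) -> np.ndarray:
    """Return the CUPED-adjusted metric: y - theta * (x_pre - mean(x_pre)).

    theta = cov(y, x_pre) / var(x_pre) minimizes the variance of the adjusted
    metric; the variance drops by a factor of roughly corr(y, x_pre)^2.
    """
    theta = np.cov(y, x_pre)[0, 1] / np.var(x_pre)
    return y - theta * (x_pre - x_pre.mean())

# Illustrative data where pre-experiment behavior predicts in-experiment behavior.
rng = np.random.default_rng(0)
x_pre = rng.gamma(shape=2.0, scale=5.0, size=100_000)
y = 0.8 * x_pre + rng.normal(0.0, 3.0, size=100_000)

y_adj = cuped_adjust(y, x_pre)
print(f"variance reduction: {1 - y_adj.var() / y.var():.0%}")
```

The adjustment leaves the treatment effect estimate unchanged in expectation; it only removes the part of the variance explained by pre-experiment behavior.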
What makes a metric more or less sensitive?
Several factors determine sensitivity.
Proximity to the change. A metric that measures the direct behavior you're modifying is more sensitive than a downstream metric several causal steps away. If you're changing the search ranking algorithm, search success rate will be more sensitive than overall monthly retention. The signal attenuates as it passes through more intermediate steps.
Metric variance. High-variance metrics require more data to distinguish a real signal from noise. Revenue per user, for example, is often dominated by a small number of high-spending users, making it extremely noisy. Winsorizing (capping extreme values) or using medians instead of means can reduce variance, but each transformation changes what the metric measures.
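As a sketch, winsorizing revenue per user at a fixed percentile caps the heaviest spenders and can shrink the variance substantially, at the cost of redefining what the metric measures. The cap choice and the simulated data below are arbitrary.

```python
import numpy as np

def winsorize(values: np.ndarray, upper_percentile: float = 99.0) -> np.ndarray:
    """Cap values above the given percentile; only the upper tail is capped here."""
    cap = np.percentile(values, upper_percentile)
    return np.minimum(values, cap)

# Illustrative heavy-tailed revenue-per-user distribution.
rng = np.random.default_rng(1)
revenue = rng.lognormal(mean=0.0, sigma=2.0, size=100_000)

capped = winsorize(revenue, upper_percentile=99.0)
print(f"raw variance:        {revenue.var():.1f}")
print(f"winsorized variance: {capped.var():.1f}")
```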
User-level aggregation. Metrics computed per session and then averaged across all sessions give more weight to heavy users and have different variance properties than metrics computed once per user, the unit that is typically randomized. The right level of aggregation depends on what you're trying to measure, but the wrong choice can bury the signal.
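A small sketch of the difference, assuming a table with one row per session and a user_id column (the schema and numbers are hypothetical):

```python
import pandas as pd

# One row per session; column names are illustrative.
sessions = pd.DataFrame({
    "user_id":   [1, 1, 1, 1, 2, 3],
    "converted": [1, 1, 1, 0, 0, 1],
})

# Session-level average: every session counts equally, so heavy users dominate.
session_level = sessions["converted"].mean()  # 4/6 ≈ 0.67

# User-level average: aggregate per user first, then average across users,
# matching the unit that was randomized.
user_level = sessions.groupby("user_id")["converted"].mean().mean()  # ≈ 0.58

print(session_level, user_level)
```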
Exposure dilution. If only 10% of your experiment population actually encounters the changed feature, the effect is diluted by the 90% who don't. Trigger analysis restricts the analysis to exposed users, recovering sensitivity without biasing the estimate.
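A sketch of why dilution matters, with made-up numbers:

```python
# Suppose the feature lifts the metric by 3% among users who actually see it,
# but only 10% of the experiment population is exposed (numbers are made up).
effect_among_exposed = 0.03
exposure_rate = 0.10

# Effect measured over everyone, assuming the unexposed 90% are unaffected.
diluted_effect = exposure_rate * effect_among_exposed
print(f"diluted effect: {diluted_effect:.1%}")  # 0.3%

# Required sample size scales roughly with 1 / effect², so detecting the
# diluted 0.3% takes on the order of 100x more users than detecting 3%.
# Trigger analysis instead restricts both arms to users who reached the
# trigger point (e.g. opened the changed surface) and compares only within
# that group; because the trigger condition is defined identically in control
# and treatment, the comparison stays unbiased for the triggered population.
```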
When should you trade sensitivity for breadth?
Sometimes you deliberately choose a less sensitive metric because it captures something more important. A broad metric like 30-day retention is less sensitive than daily active sessions, but it's closer to what the business actually cares about.
The standard approach is to use sensitive metrics as your primary success metric for the experiment and track broader metrics as secondary or guardrail metrics. This gives you the statistical power to reach a conclusion on the success metric while monitoring whether the change has broader effects.
Confidence supports this by letting teams define metric hierarchies per experiment surface: a required set of guardrail metrics that every experiment on that surface must track, plus team-specific success metrics chosen for sensitivity to the particular hypothesis.