Lesson 7: Feasibility and sensitivity
Learn how to evaluate metrics for experiments by assessing feasibility, understanding that effective variance (after variance reduction) matters more than raw variance, recognizing that binary metrics aren't inherently more sensitive than continuous metrics, and ensuring your product change can actually influence the metric you're measuring.
Feasibility: Can we measure it?
Feasibility means you have the data, infrastructure, and resources to compute the metric reliably. Before committing to a metric, verify: data availability (events are logged at the right granularity with sufficient history), technical capability (you can compute it efficiently within required time frames), and sample size requirements (you can collect enough data at acceptable cost).
For example, "streams per user from playlist recommendations" is feasible with logged events and user-level data. In contrast, "user satisfaction with recommendation quality" requires expensive survey data that may not exist.
Sensitivity: Can we detect changes?
A metric is sensitive if it reliably detects meaningful changes. Sensitivity depends on two independent factors: variance (how much noise exists) and influenceability (whether your change can actually move the metric). Both must be favorable.
Variance: Raw versus effective
Higher variance requires more users or longer runtime to detect a given effect size. However, what matters is not raw variance but effective variance after applying variance reduction techniques. The next lesson covers variance reduction in detail—for now, understand that the right covariate can dramatically change which metric is actually the most sensitive choice.
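To make the link between variance and required sample size concrete, here is a minimal sketch based on the standard two-sample z-test approximation. It assumes equal group sizes, equal variance in both groups, and a two-sided test; the function name `users_per_group` and all numbers are hypothetical and chosen only for illustration.

```python
from math import ceil
from statistics import NormalDist


def users_per_group(variance: float, min_effect: float,
                    alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per group for a two-sided, two-sample z-test.

    Assumes equal group sizes and the same variance in both groups.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power
    return ceil(2 * (z_alpha + z_beta) ** 2 * variance / min_effect ** 2)


# Hypothetical numbers: detect an absolute lift of 1 stream per user.
raw = users_per_group(variance=400.0, min_effect=1.0)
reduced = users_per_group(variance=100.0, min_effect=1.0)  # variance cut by 75%
print(raw, reduced)
```

Under this approximation the required sample size is proportional to the variance, so cutting variance by 75% cuts the required number of users by roughly the same fraction.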
This distinction is critical. Variance reduction techniques (such as CUPED and CUPAC) use regression adjustment to control for pre-experiment user behavior. They leverage temporal correlation: how well a user's pre-experiment behavior predicts their value of the metric during the experiment. A metric with high raw variance but strong temporal correlation can end up with lower effective variance than a metric with moderate raw variance but weak correlation.
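As a rough guide to why correlation matters, consider the textbook single-covariate linear adjustment usually associated with CUPED. The symbols are assumptions introduced here, not notation from this course: Y is the metric during the experiment, X is the same user's pre-experiment value, and ρ is the correlation between them. Real implementations may use more covariates or a different estimator.

```latex
Y_{\text{adj}} = Y - \theta\,(X - \bar{X}),
\qquad
\theta = \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)},
\qquad
\operatorname{Var}(Y_{\text{adj}}) = \operatorname{Var}(Y)\,\bigl(1 - \rho^{2}\bigr)
```

Under this simplified model, the fraction of variance removed is ρ². An 80% reduction corresponds to ρ ≈ 0.89, and a 20% reduction to ρ ≈ 0.45.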
Consider two metrics for measuring engagement:
Metric A - Total streams per user:
- Raw variance: Very high (some users stream 1 song, others stream 1,000+)
- Temporal correlation: Very high (heavy streamers stay heavy streamers)
- Variance reduction factor: 80%
- Effective variance: Low
Metric B - Did user create a playlist?:
- Raw variance: Moderate (binary metric; variance is p(1 − p), which never exceeds 0.25)
- Temporal correlation: Low (playlist creation is sporadic)
- Variance reduction factor: 20%
- Effective variance: Moderate to high
After applying variance reduction, Metric A might actually have lower effective variance and detect changes faster, despite its higher raw variance.
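The comparison above can be sketched numerically. Because the two metrics live on different scales, this sketch compares them by their noise relative to the mean (coefficient of variation), which is the relevant quantity if the change is assumed to produce a similar relative lift in both metrics. The means, variances, and reduction factors are hypothetical and chosen only to mirror the qualitative example.

```python
from math import sqrt

# Hypothetical inputs, chosen only to mirror the qualitative example above.
metrics = {
    "A: total streams per user": {
        "mean": 50.0, "raw_var": 10_000.0, "reduction": 0.80,
    },
    "B: created a playlist (0/1)": {
        "mean": 0.30, "raw_var": 0.30 * 0.70, "reduction": 0.20,
    },
}


def relative_noise(mean: float, variance: float) -> float:
    """Standard deviation as a fraction of the mean (coefficient of variation)."""
    return sqrt(variance) / mean


results = {}
for name, m in metrics.items():
    # Effective variance is what remains after variance reduction.
    effective_var = m["raw_var"] * (1 - m["reduction"])
    results[name] = {
        "raw_cv": relative_noise(m["mean"], m["raw_var"]),
        "effective_cv": relative_noise(m["mean"], effective_var),
    }
    print(name, results[name])
```

With these inputs, Metric A is noisier than Metric B before variance reduction and less noisy afterwards. Different inputs can reverse the ranking, which is why the comparison should be made on effective variance for your own data.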
Binary versus continuous metrics
Converting continuous metrics to binary (changing "total streams" to "did user stream more than 10 times?") has critical trade-offs:
Binary metrics can have lower variance when most users sit far from the threshold, but they lose information about magnitude. Say you convert "streams per user" into a binary metric: "did the user stream more than 10 times?" A user who goes from 15 to 50 streams produces the same binary outcome (1→1) as a user whose behavior does not change at all, because that user was already above the threshold. The binary metric only registers users who cross the threshold, so substantial behavior improvements may not register. Additionally, the variance of a binary metric is highest when the proportion is near 50%, so a poorly chosen threshold can increase variance rather than reduce it.
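The information loss can be illustrated with a small sketch. The stream counts are made up, and the threshold of 10 follows the example in the text.

```python
from statistics import mean

THRESHOLD = 10  # "did the user stream more than 10 times?"


def to_binary(streams):
    """Map each user's stream count to 1 if above the threshold, else 0."""
    return [1 if s > THRESHOLD else 0 for s in streams]


def bernoulli_variance(p: float) -> float:
    """Variance of a binary metric whose proportion of ones is p."""
    return p * (1 - p)


# Hypothetical streams per user. In treatment, users who were already above
# the threshold stream far more; users below the threshold are unchanged.
control = [2, 4, 15, 15, 20, 30]
treatment = [2, 4, 50, 50, 60, 90]

print("continuous:", mean(control), mean(treatment))
print("binary:    ", mean(to_binary(control)), mean(to_binary(treatment)))

# Variance of a binary metric peaks when the proportion is 0.5.
for p in (0.05, 0.5, 0.95):
    print(p, bernoulli_variance(p))
```

The continuous metric moves substantially while the binary metric is identical in both groups, because no user crossed the threshold.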
Binary metrics aren't inherently more sensitive. They can reduce variance, but they also reduce influenceability by ignoring every change that does not cross the threshold. Consider whether the information loss justifies the potential variance reduction, and whether regression adjustment would make the continuous metric more sensitive anyway.
The techniques used to reduce effective variance—regression adjustment (CUPED, CUPAC), capping, and related methods—are covered in depth in Lesson 8: Variance reduction.
Influenceability: Can your change move the metric?
Even with perfect variance properties, a metric is useless if your product change cannot move it. Influenceability measures whether your experiment can actually affect the metric—this is completely independent of variance.
Influenceability depends on three factors: Exposure (what proportion of users encounter the change?), Mechanism (does the metric measure behavior the change can affect?), and Effect size (is the change substantial enough to impact the metric?).
Testing a playlist creation flow redesign:
Low influenceability: Using "monthly active users" as your primary metric. Most users never create playlists, so the change can't influence them. Even a perfect redesign might show no effect.
High influenceability: Using "playlist creation rate among users who start the creation flow." All measured users are exposed, the metric directly captures what you're improving, and UI changes can meaningfully affect completion.
Scoping your experiment population and choosing metrics aligned with your intervention are crucial—perfect variance properties mean nothing if your change can't move the metric.
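One way to see why scoping matters is a simple dilution calculation. This sketch makes a strong simplifying assumption: the metric's variance is the same whether you measure all users or only exposed users, and users who never encounter the change have exactly zero effect. The function names and numbers are hypothetical.

```python
def diluted_effect(effect_among_exposed: float, exposure: float) -> float:
    """Average effect across all measured users when only a share is exposed."""
    return effect_among_exposed * exposure


def sample_size_multiplier(exposure: float) -> float:
    """How many times more users are needed than when measuring exposed users only.

    Required sample size scales with 1 / effect**2, and the average effect
    shrinks in proportion to exposure (variance assumed unchanged).
    """
    return 1 / exposure ** 2


# Hypothetical: 5% of measured users ever start the playlist creation flow,
# and the redesign lifts completion by 4 percentage points among them.
exposure = 0.05
print(diluted_effect(0.04, exposure))
print(sample_size_multiplier(exposure))
```

In practice the variance also differs between the two populations, so treat the multiplier as an order-of-magnitude guide rather than a precise figure.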
Trade-offs in practice
You've now seen that good metrics require feasibility, low effective variance, high influenceability, interpretability, and business alignment. No single metric optimizes all dimensions—you must make deliberate trade-offs:
Sensitivity versus business alignment: Metrics directly tied to business value (revenue, long-term retention) often have high variance or long measurement windows. Use sensitive proxies (engagement, short-term behavior) as primary metrics while monitoring business metrics secondarily. Ask: do improvements in my sensitive metric translate to business outcomes?
Granularity versus variance: Granular metrics (total revenue per user) capture effect magnitude but have high variance. Before simplifying through capping or binary conversion, check whether regression adjustment can give you both granularity and sensitivity.
Confidence automatically applies variance reduction using regression adjustment. You can see how much variance was reduced for each metric on the detailed results page.
Scope versus influenceability: Broad metrics (platform retention) align with business goals but resist movement from single experiments. Narrow metrics (feature engagement) are movable by experiments but miss broader impacts. Solution: use narrow, movable primary metrics for decisions while monitoring broad metrics for unintended effects.
Testing a new recommendation algorithm:
Success metric: "Streams from recommendations per user." It is directly influenced by your change, has moderate effective variance after variance reduction, and measures what you're improving.
Guardrail metrics: Total streams (broader impact), premium conversion (business outcome), recommendations shown (diagnostic).
This lets you decide confidently while understanding broader implications.
Why might a metric with high raw variance be better for experiments than a metric with low raw variance?
What is a key disadvantage of converting a continuous metric to binary?
What makes a metric highly influenceable for a specific experiment?
Notes for nerds
Ratio metrics and variance reduction. Ratio metrics require special care not just for variance estimation, but also for variance reduction. When applying regression adjustment to a ratio metric, you can't treat the ratio as a simple scalar: the numerator and denominator each have their own variance, and the covariance between them matters too. This is covered in depth in Lesson 8: Variance reduction.
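For reference, a common first-order (delta method) approximation for the variance of a ratio of means shows where the covariance term enters. The notation is an assumption introduced here: Y and X are the per-unit numerator and denominator, measured on n independent units, with means μ, variances σ², and covariance σ_XY.

```latex
\operatorname{Var}\!\left(\frac{\bar{Y}}{\bar{X}}\right)
\approx
\frac{1}{n\,\mu_X^{2}}
\left(\sigma_Y^{2} - 2R\,\sigma_{XY} + R^{2}\sigma_X^{2}\right),
\qquad
R = \frac{\mu_Y}{\mu_X}
```

When numerator and denominator are positively correlated, the covariance term reduces the variance of the ratio, which is why ignoring it gives the wrong answer.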
Heterogeneous treatment effects. Influenceability describes the average effect of your change across the measured population, but that average can mask enormous variation. A metric might be highly influenceable among power users who interact frequently with the affected surface, while being essentially unmovable among casual users who rarely encounter it. Conversely, a change can produce a neutral average because it helps some segments while harming others, with the two effects cancelling out. Whether this matters depends on your product strategy. If your goal is aggregate improvement, the average effect is what counts. If you want to understand who benefits and why, average influenceability is not enough: segment-level analysis is the right tool, and it is covered in Lesson 10: Segment-level analysis.