A binary metric is a metric that takes one of two values for each user in an experiment: 1 (the event happened) or 0 (it didn't). Conversion rate, signup rate, whether a user streamed at least one song, and whether a user encountered an error are all binary metrics. They answer yes-or-no questions about user behavior.
Binary metrics are among the most common metrics in A/B testing because they map directly to product questions: did the user convert? Did they complete onboarding? Did they return within seven days? At Spotify, binary metrics like "streamed at least one track in the session" or "completed the signup flow" appear in thousands of experiments analyzed annually in Confidence. Their statistical properties are well-understood, which makes them straightforward to analyze, power, and interpret.
How are binary metrics analyzed in an A/B test?
The analysis follows the standard difference-in-means approach. For each group, you compute the proportion of users for whom the event occurred (the sample proportion, p_hat). The treatment effect estimate is the difference in proportions: p_hat_treatment minus p_hat_control.
The standard error has a particularly clean form for binary metrics. The variance of a binary variable is p(1-p), where p is the true proportion, so the standard error of the difference in proportions is sqrt(p_c(1-p_c)/n_c + p_t(1-p_t)/n_t), estimated by plugging in the sample proportions. It depends only on the proportions and the sample sizes, with no separate variance parameter to estimate.
A z-test then determines whether the observed difference is statistically significant. For the sample sizes common in online experimentation, the normal approximation to the binomial is essentially exact: a standard rule of thumb requires np and n(1-p) to both exceed 5, which is easily satisfied with thousands of users.
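As a concrete sketch, the whole procedure fits in a few lines of Python. The function name and the numbers below are illustrative, and the standard error is the unpooled form described above:

```python
import numpy as np
from scipy import stats

def two_proportion_ztest(x_c, n_c, x_t, n_t):
    """Difference in proportions with an unpooled z-test."""
    p_c, p_t = x_c / n_c, x_t / n_t
    diff = p_t - p_c
    # Standard error built from the binary variance p(1-p) in each group.
    se = np.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z = diff / se
    p_value = 2 * stats.norm.sf(abs(z))  # two-sided
    return diff, se, z, p_value

# 10,000 users per group; 10.0% vs. 10.8% conversion.
diff, se, z, p = two_proportion_ztest(1000, 10_000, 1080, 10_000)
print(f"diff={diff:.4f}, se={se:.4f}, z={z:.2f}, p={p:.3f}")
```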
What determines the power of a binary metric experiment?
Statistical power for binary metrics depends on three things: the base rate (the proportion in the control group), the minimum detectable effect (how large a change you want to detect), and the sample size.
The base rate matters because it determines the variance. A metric with a 50% base rate has maximum variance (0.25). A metric with a 1% base rate has much lower variance (0.0099). Lower variance means tighter confidence intervals and higher power, all else equal.
But there's a catch. Metrics with very low base rates (1-2% conversion) produce small absolute effect sizes even when the relative effect is large. A 10% relative improvement on a 1% conversion rate is a 0.1 percentage point absolute change (from 1.0% to 1.1%). Detecting that 0.1 percentage point shift requires a very large sample because the absolute signal is tiny, despite the relative signal being meaningful.
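A quick sanity check with the textbook normal-approximation sample-size formula makes the gap vivid (the helper name n_per_group is hypothetical):

```python
from scipy import stats

def n_per_group(p_control, p_treatment, alpha=0.05, power=0.8):
    """Approximate sample size per group for a two-sided test
    of a difference in proportions (normal approximation)."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_power = stats.norm.ppf(power)
    var = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    delta = p_treatment - p_control
    return (z_alpha + z_power) ** 2 * var / delta ** 2

# 10% relative lift on a 1% base rate vs. on a 50% base rate.
print(round(n_per_group(0.01, 0.011)))  # ~163,000 users per group
print(round(n_per_group(0.50, 0.55)))   # ~1,560 users per group
```

Both scenarios target the same 10% relative lift, but the low-base-rate test needs roughly a hundred times more users per group.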
The Spotify Search team's experimentation maturity journey illustrates this in practice. Search conversion rates can vary significantly across markets and user segments, and experiments on low-traffic search verticals face exactly this power challenge. Trigger analysis helps by restricting the analysis to users who actually performed searches, which removes the dilution from users who never engaged with the feature under test.
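To make the dilution concrete, here is a toy simulation with made-up rates: only a fifth of users ever search, so a one-percentage-point lift among searchers shrinks to roughly 0.2 percentage points when averaged over everyone, while restricting to triggered users recovers the full signal.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 100_000
group = rng.choice(["control", "treatment"], size=n)
searched = rng.random(n) < 0.20          # only ~20% of users ever search
# Conversion happens only among searchers; treatment lifts it 5% -> 6%.
p = np.where(group == "treatment", 0.06, 0.05)
converted = searched & (rng.random(n) < p)
users = pd.DataFrame({"group": group, "searched": searched,
                      "converted": converted})

# All-users analysis: the 1pp lift among searchers is diluted to ~0.2pp.
print(users.groupby("group")["converted"].mean())

# Trigger analysis: restrict to users who actually searched.
print(users[users["searched"]].groupby("group")["converted"].mean())
```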
How does variance reduction apply to binary metrics?
CUPED works on binary metrics the same way it works on continuous metrics: by adjusting for pre-experiment covariates. If a user converted last week, they're more likely to convert this week, and CUPED subtracts that predictable component.
The variance reduction for binary metrics is typically smaller than for continuous metrics, because the variance of a binary variable is already bounded by 0.25 (its maximum at p = 0.5). There's less room for improvement compared to a heavy-tailed continuous metric like revenue, where a few extreme values can dominate the variance. That said, even a 10-15% variance reduction on a binary metric translates directly into shorter experiment runtimes.
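A minimal simulation sketches the mechanism. The pre-period conversion rate and its correlation with the outcome are made up; with these numbers the adjustment removes roughly 10% of the variance, consistent with the range above:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 50_000
# Pre-experiment covariate: did the user convert last week? (hypothetical)
x = rng.random(n) < 0.10
# In-experiment binary outcome, correlated with the covariate.
y = np.where(x, rng.random(n) < 0.40, rng.random(n) < 0.07)

# CUPED adjustment: subtract the component of y predictable from x.
theta = np.cov(y, x)[0, 1] / np.var(x)
y_cuped = y - theta * (x - x.mean())

reduction = 1 - np.var(y_cuped) / np.var(y)
print(f"variance reduction: {reduction:.1%}")  # ~10%
```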
Metric capping doesn't apply to binary metrics since the values are already bounded at 0 and 1. There are no outliers to clip.
When should you use a binary metric vs. a continuous metric?
Binary metrics are useful when the product question is fundamentally yes-or-no: did the user convert, retain, or encounter the feature? They're easy to interpret, easy to power, and resistant to outliers.
Continuous metrics (revenue per user, streams per session, time spent) capture magnitude, not just occurrence. A continuous metric can distinguish between "users listened to one more song" and "users listened to ten more songs," while a binary "did the user listen?" metric treats both the same.
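The distinction is easy to see on toy data:

```python
import numpy as np

streams = np.array([0, 0, 1, 1, 10])   # streams per user

binary = (streams > 0).astype(float)   # "did the user listen?"
print(binary.mean())    # 0.6 -- treats 1 stream and 10 streams the same
print(streams.mean())   # 2.4 -- captures magnitude
```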
The best practice is to use both. Confidence supports defining success metrics and guardrail metrics of any type for the same experiment. A common pattern is a binary success metric (conversion rate) paired with a continuous guardrail metric (revenue per user), ensuring the change drives the desired behavior without harming business outcomes.