When a user is exposed to an experiment but generates zero events for a metric during the measurement window, the platform has to make a choice: include that user as a zero, or exclude them from the metric entirely. This is not a technical default. It is a choice about what the experiment measures. Including zeros estimates the average treatment effect across all exposed users, regardless of whether they engaged. This is an intent-to-treat analysis. Excluding zeros estimates the effect only among users who engaged. This is a per-engager analysis. They answer different questions, and mixing them across metrics in the same experiment produces internally inconsistent results.
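To make the distinction concrete, here is a minimal sketch in plain Python with invented users and revenue values. The same event log produces two different estimates depending on the zero-handling choice.

```python
import statistics

# Hypothetical data: five users exposed to the treatment, two of whom
# generated revenue events during the measurement window.
exposed_users = ["u1", "u2", "u3", "u4", "u5"]
revenue_events = {"u2": 12.0, "u5": 30.0}  # only engagers appear here

# Intent-to-treat: every exposed user counts; non-engagers enter as zeros.
itt_values = [revenue_events.get(u, 0.0) for u in exposed_users]
itt_mean = statistics.mean(itt_values)  # 8.4, revenue per exposed user

# Per-engager: only users with at least one revenue event count.
engager_mean = statistics.mean(list(revenue_events.values()))  # 21.0, revenue per purchaser

print(itt_mean, engager_mean)  # same events, two estimands
```

Same events, same window, two questions: the first estimate answers "what is revenue per exposed user", the second "what is revenue per purchaser".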
Most platforms do not document which approach they use. Some apply different rules to different metric types without stating it. The result is that the metric you defined may be answering a different question than you intended, and the effect size you report is not comparable to effect sizes on other metrics in the same experiment. A revenue metric that excludes non-purchasers and a conversion metric that includes all exposed users are not measuring the same population. Treating their effect sizes as if they are creates decisions that look data-driven but are built on inconsistent foundations.
Recent work on covariate adjustment with missing outcomes shows that the choice of adjustment method matters both for what the experiment measures and for how precisely it measures it. The results apply directly to the experimentation setting: users with zero events are not "missing data" in the technical sense if they were truly exposed but chose not to engage. Including them as zeros is a valid analysis that answers a different question than excluding them. But if the platform makes this choice for you without documenting it, you cannot know which question your experiment is answering.
Should you add metric zero-handling to your experimentation platform requirements?
Yes, and this is one of the most consequential undocumented choices in the vendor landscape. The zero-handling decision affects what the experiment measures, the variance of the metric, the required sample size, and the interpretation of the result. Getting it wrong does not produce a visibly broken analysis. It produces a subtly wrong one: an effect size that answers a different question than the one you asked, with a confidence interval calibrated for a different population than the one you intended to study. A platform that handles zeros inconsistently across its pipeline puts the experimenter in a position where different metrics in the same scorecard target different populations, and nothing in the interface reveals it. At Spotify, we have found that an undocumented default causes more confusion than a visible limitation ever would. We built Confidence to propagate the zero-handling choice to every downstream calculation because inconsistency here is invisible, cumulative, and corrosive to trust in results.
In practice, that means zero-handling in Confidence is configurable per metric, and the choice carries through to the sample size calculator, variance reduction, and segmented analysis, so every calculation reflects the same population definition.
The RFP question is not "do you include zeros?" It is whether the platform documents its zero-handling behavior, lets you configure it per metric, and carries the choice through consistently to variance reduction, sample size calculation, and segmented analysis.
What your RFP should ask instead of a yes/no
Six questions decide whether a platform's zero-handling implementation is coherent.
First: does the platform document its zero-handling behavior? The most basic requirement. When a user is exposed to the experiment but generates no events for a given metric, what happens? Are they included as a zero, excluded entirely, or is the behavior metric-type-dependent? Most vendors do not address this in their public documentation. The absence of documentation does not mean the behavior is absent. It means you cannot verify what your experiment is measuring without reverse-engineering the analysis. Ask for explicit documentation of what happens to exposed users with no metric events, and whether the answer differs by metric type.
Second: is zero-handling consistent across metric types? Some platforms include all exposed users for additive metrics (counts, sums, revenue) but exclude non-engagers from participation-based metrics (conversion rates, averages conditioned on engagement). This is not inherently wrong, but it means different metrics in the same experiment are measuring different populations. A revenue-per-user metric that includes non-purchasers as zeros and a purchase-conversion metric that counts only purchasers in the denominator are measuring the same population consistently. A revenue metric that excludes non-purchasers while a conversion metric includes all exposed users produces an experiment where different metrics target different populations. When you report results across metrics, the effect sizes are not directly comparable. Ask whether zero-handling is consistent across metric types, and if not, whether the documentation states which types use which approach.
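A worked illustration of the alignment problem, with invented numbers for a single experiment arm:

```python
# Invented one-arm summary: revenue events exist only for purchasers,
# but every exposed user has a well-defined conversion outcome.
exposed = 10_000
purchasers = 400
revenue_total = 12_000.0

conversion_rate = purchasers / exposed              # 0.04, all exposed users
revenue_per_exposed = revenue_total / exposed       # 1.20, same population as conversion
revenue_per_purchaser = revenue_total / purchasers  # 30.00, purchasers only

# A scorecard pairing conversion_rate with revenue_per_purchaser mixes two
# populations; pairing it with revenue_per_exposed keeps them aligned.
```

Neither revenue definition is wrong, but only one of them shares a denominator with the conversion metric on the same scorecard.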
Third: is zero-handling configurable per metric? Different metrics have legitimately different requirements. A latency metric where zeros likely mean timeouts (the request never completed, so no latency event was logged) needs different handling than a revenue metric where zeros mean no purchase. For the latency metric, including timed-out users as zeros would be misleading: it would make a treatment that causes more timeouts look like it improved latency. Excluding them would also be misleading if the exclusion rate differs between treatment and control, creating a within-metric sample composition difference. The right answer depends on the metric's semantics, and the platform should let the experimenter choose. Ask whether zero-handling is configurable per metric, or whether a single platform-wide default applies to all metrics.
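To illustrate what per-metric configurability can look like, here is a hypothetical sketch. The `ZeroHandling` enum and `MetricConfig` class are invented for this post; they do not correspond to any vendor's actual API.

```python
from dataclasses import dataclass
from enum import Enum

class ZeroHandling(Enum):
    INCLUDE_AS_ZERO = "include_as_zero"  # intent-to-treat: all exposed users
    EXCLUDE = "exclude"                  # per-engager: only users with events

@dataclass(frozen=True)
class MetricConfig:
    name: str
    zero_handling: ZeroHandling
    rationale: str  # requiring a written rationale makes the choice auditable

metrics = [
    # Zeros mean "no purchase": include them, since non-purchasers are part
    # of the population the treatment was supposed to affect.
    MetricConfig("revenue_per_user", ZeroHandling.INCLUDE_AS_ZERO,
                 "non-purchasers belong to the treated population"),
    # Zeros likely mean timeouts: excluding them avoids rewarding a treatment
    # that drops requests, but the per-arm exclusion rate must be monitored.
    MetricConfig("request_latency_ms", ZeroHandling.EXCLUDE,
                 "missing events are timeouts, not fast requests"),
]
```

The useful property is not the syntax but the fact that the choice is explicit, per metric, and recorded next to the reason for it.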
Fourth: does the zero-handling choice propagate to the sample size calculation? The variance of a metric depends on whether zeros are included. A revenue metric with zeros included (many users at zero, some at positive values) has a very different distribution than the same metric with zeros excluded (only positive values). The required sample size depends on this variance. A sample size calculator that uses one definition while the analysis uses another will produce the wrong number. If the calculator assumes intent-to-treat variance but the analysis excludes non-engagers, the experiment will be overpowered for the question actually being answered. If the reverse, it will be underpowered. Ask whether the sample size calculator uses the same zero-handling configuration as the analysis, and whether changing the zero-handling setting updates the sample size estimate. For more on how sample size calculators connect to the rest of the analysis pipeline, see our post on sample size calculation.
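To put rough numbers on this, here is a sketch using the standard two-sample approximation, n ≈ 2(z₁₋α/₂ + z₁₋β)²σ²/δ² per arm, with an invented revenue distribution: 10% of exposed users engage, and engagers spend 20 on average with a standard deviation of 10.

```python
import math

p_engage, mu_eng, sd_eng = 0.10, 20.0, 10.0  # invented distribution

# Intent-to-treat moments (zeros included), via the law of total variance.
mu_itt = p_engage * mu_eng                                # 2.0
var_itt = p_engage * (sd_eng**2 + mu_eng**2) - mu_itt**2  # 46.0

# Per-engager moments (zeros excluded).
var_eng = sd_eng**2                                       # 100.0

def n_per_arm(sigma2, delta, z_alpha=1.96, z_beta=0.84):
    # Two-sample z-approximation: alpha = 0.05 two-sided, 80% power.
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sigma2 / delta ** 2)

# The same 2% relative lift is a different absolute delta under each estimand.
print(n_per_arm(var_itt, delta=0.02 * mu_itt))  # 450800 exposed users per arm
print(n_per_arm(var_eng, delta=0.02 * mu_eng))  # 9800 engagers per arm
```

The two numbers are not interchangeable, and the second counts engagers: at a 10% engagement rate, 9,800 engagers still require roughly 98,000 exposed users. A calculator that silently uses one definition while the analysis uses the other plans a different experiment than the one that runs.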
Fifth: does variance reduction account for the zero-handling choice? CUPED reduces metric variance by adjusting for pre-experiment behavior. The pre-experiment covariate must reflect the same population as the analysis. If the analysis includes zeros (all exposed users), the pre-experiment covariate should also include zeros for the same users. If the analysis excludes non-engagers, the covariate should be computed over engagers only. A mismatch between the analysis population and the covariate population introduces bias in the variance reduction adjustment. Research on covariate adjustment with missing outcomes establishes that the choice of adjustment method affects both efficiency and consistency when outcomes are missing. A CUPED implementation that ignores the zero-handling configuration may not deliver the efficiency gains it promises. Ask whether variance reduction uses the same zero-handling definition as the primary analysis.
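Here is a minimal CUPED sketch on simulated data, assuming NumPy, to show where the population match enters: both θ and the covariate mean must come from the same set of users the analysis is computed over.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated pre-period revenue x and in-period revenue y, correlated.
x = rng.gamma(2.0, 5.0, size=10_000)
y = 0.6 * x + rng.normal(0.0, 5.0, size=10_000)
engaged = y > 0  # stand-in for "user generated at least one metric event"

def cuped(y, x):
    # Standard CUPED: y_adj = y - theta * (x - mean(x)), with theta and the
    # covariate mean estimated on the same population as the analysis.
    theta = np.cov(y, x)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())

# Matched: the analysis excludes non-engagers, so the adjustment does too.
y_matched = cuped(y[engaged], x[engaged])

# Mismatched: theta and mean(x) from all exposed users, metric over engagers.
theta_all = np.cov(y, x)[0, 1] / np.var(x, ddof=1)
y_mismatched = y[engaged] - theta_all * (x[engaged] - x.mean())

# The matched adjustment preserves the metric's mean exactly; the mismatched
# one shifts it, because x over engagers has a different mean than x over all
# users. If treatment changes who engages, that shift differs by arm.
print(y[engaged].mean(), y_matched.mean(), y_mismatched.mean())
```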
Sixth: does zero-handling apply consistently in segmentation and exploratory analyses? When you slice experiment results by country, device type, or user segment, the zero-handling configuration should carry through. If the topline analysis includes all exposed users as zeros for a revenue metric but a segment breakdown recomputes the metric over engagers only, the segment result answers a different question than the topline. This makes it impossible to decompose the topline effect into segment contributions. The numbers will not add up, and the interpretation will be inconsistent. Ask whether segmented views inherit the same zero-handling configuration as the primary analysis. For more on the statistical treatment of dimensional breakdowns, see our upcoming post on exploratory analysis. This also connects to the percentile metrics post in this series, where zero-handling determines which users enter the percentile distribution, and to the time-in metrics post, where the observation window determines which users have had a chance to generate events at all.
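As a last sketch, with invented segment summaries: under one consistent intent-to-treat definition, the topline difference in means decomposes exactly into exposure-weighted segment differences, which is the property an inconsistent breakdown destroys.

```python
# Invented per-segment summaries, zeros included everywhere, with the same
# number of users per arm in each segment: (users_per_arm, treat_mean, ctrl_mean).
segments = {
    "mobile":  (8_000, 2.10, 2.00),
    "desktop": (2_000, 3.40, 3.00),
}

total = sum(n for n, _, _ in segments.values())

# Topline effect is exactly the exposure-weighted sum of segment effects.
topline = sum(n / total * (t - c) for n, t, c in segments.values())
print(round(topline, 3))  # 0.16 = 0.8 * 0.10 + 0.2 * 0.40

# Recompute the desktop cell over engagers only and its means refer to a
# different denominator; no weighting of the displayed numbers reproduces
# the topline, and the "numbers do not add up" symptom appears.
```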
What the answers actually look like across vendors
Here is how the major platforms stand as of this writing. Vendor capabilities are based on public documentation. Confidence capabilities are based on the product itself.
"Yes" and "No" are based on explicit evidence in the vendor's public documentation. "Not documented" means we found no explicit evidence either way, not that the capability is confirmed absent. A dash (---) means the platform does not offer the broader feature at all.
| Platform | Zero-handling documented? | Intent-to-treat available? | Per-engager available? | Configurable per metric? | Consistent across metric types? | Reflected in sample size calculation? | Variance reduction accounts for zero-handling? | Segmentation inherits zero-handling? | Other gaps |
|---|---|---|---|---|---|---|---|---|---|
| Confidence | Yes | Yes | Yes | Yes | Yes (configurable) | Yes | Yes | Yes | — |
| GrowthBook | Partial (quantile only) | Yes | Partial (quantile and ratio) | Partial (quantile and ratio) | No (varies by type) | No | Not documented | Not documented | Mean/proportion zero-handling undocumented |
| Eppo | Partial (implicit in SQL) | Yes (via user SQL) | Not documented | Not documented | Not documented | Not documented | Not documented | Not documented | Controlled by user SQL |
| Statsig | Yes | Yes (additive metrics) | Yes (participation metrics) | Not documented | No (varies by type) | Not documented | Not documented | Not documented | No override for per-type defaults |
| LaunchDarkly | Yes | Yes (conversion metrics) | Yes (numeric average) | Partial (numeric only) | No (varies by type) | Not documented | Not documented | Not documented | No cross-type consistency option |
| PostHog | Not documented | Yes | Partial (outlier bounds only) | Partial (outlier bounds) | Not documented | Not documented | Not documented | Not documented | Ignore-zeros for outlier capping only |
| Amplitude | Not documented | Yes | Not documented | Not documented | Not documented | Not documented | Not documented | Not documented | — |
| Optimizely | Not documented | Not documented | Not documented | Not documented | Not documented | Not documented | Not documented | Not documented | — |
| VWO | Not documented | Not documented | Not documented | Not documented | Not documented | Not documented | Not documented | Not documented | — |
Three patterns stand out from this comparison.
The first is that zero-handling is mostly undocumented. Only three vendors (Statsig, LaunchDarkly, and partially GrowthBook) address the topic explicitly in their public documentation. LaunchDarkly is the most transparent: their documentation describes the "units without events" configuration, explains when including zeros is appropriate (conversion metrics, shopping cart totals) and when it is not (latency metrics, satisfaction scores), and lets the experimenter choose for numeric average metrics. Statsig documents the distinction between additive metrics (where zeros are imputed for all exposed users) and participation metrics (where non-participants are excluded), though the configuration options for overriding these defaults are not documented. GrowthBook documents an "ignore zeros" option for quantile metrics and excludes zero-denominator users from ratio metrics, but does not address zero-handling for standard mean or proportion metrics. Eppo delegates zero-handling to the user's metric SQL (for example, wrapping a column in COALESCE to treat missing values as zeros), which gives full control but means the behavior is implicit in SQL rather than surfaced as a platform configuration. The remaining four vendors do not document their zero-handling behavior in a way that lets a buyer verify what the experiment is measuring.
The second pattern is inconsistency across metric types within the same platform. Statsig imputes zeros for additive metrics (counts, sums, revenue) but excludes non-participants from participation metrics (daily active user rates, event participation rates). LaunchDarkly always includes zeros for conversion metrics but makes inclusion optional for numeric average metrics. These are defensible defaults, but the inconsistency means that different metrics in the same experiment target different populations. If the platform does not surface this distinction clearly, the experimenter may not realize that the revenue lift and the engagement rate reported in the same experiment scorecard are not measured over the same set of users. Neither vendor documents how to make the zero-handling consistent across metric types if you want all metrics in an experiment to measure the same population.
The third pattern is the absence of downstream propagation. No external vendor documents that the zero-handling choice propagates to the sample size calculator, the variance reduction implementation, or the segmentation engine. This means that even when the analysis itself handles zeros correctly, the planning and exploration tools may not reflect the same choice. A sample size calculator that estimates variance from the full exposed population while the analysis excludes non-engagers will produce a sample size estimate that does not match the experiment you will run. A variance reduction implementation that computes the pre-experiment covariate over all users while the analysis conditions on engagement introduces a covariate-population mismatch that can reduce or negate the variance reduction benefit. The table's three propagation columns (sample size calculation, variance reduction, segmentation) make the gap visible: every external vendor's row reads "No" or "Not documented" across all three.
The overall picture is that zero-handling is one of the least visible configuration choices in the vendor landscape, and one of the most consequential. A revenue metric that includes non-purchasers as zeros will show a smaller absolute effect than the same metric excluding them, because the denominator is larger and includes users who were never going to buy regardless of the treatment. Neither answer is wrong. But a platform that makes this choice for you, without documenting it, without letting you change it, and without carrying the choice through to sample size calculation, variance reduction, and segmented analysis, leaves you with an experiment that answers a question you may not have intended to ask.