RFP Series

How to Write an Experimentation Platform RFP for Clustered Randomization

Last updated: May 2026

Not every experiment randomizes at the user level. B2B companies randomize by account so that everyone at the same company sees the same experience. Retail chains randomize by store. Streaming services randomize by household. Education platforms randomize by school district. In each case, the randomization unit is a cluster of individuals, but the outcomes are measured at the individual level: revenue per user, test scores per student, engagement per household member.

This mismatch between randomization unit and analysis unit creates a statistical problem that most experimentation platforms either handle incorrectly or do not address at all. When individuals within the same cluster are assigned to the same treatment, their outcomes are correlated. Standard variance estimates treat every individual as an independent observation, which understates the true uncertainty. Confidence intervals come out too narrow. P-values come out too small. False positive rates exceed the stated significance level, silently. The experiment looks more conclusive than it is.
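
To make the inflation concrete, here is a minimal simulation sketch in Python (all parameter values are illustrative, not drawn from any vendor): with no true treatment effect, a naive individual-level t-test rejects far more often than the nominal 5%, while a test on cluster means holds the stated level.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_clusters, cluster_size, icc, alpha = 50, 200, 0.05, 0.05
n_sims = 2000

# Split unit variance into between- and within-cluster parts so the
# intracluster correlation equals `icc`.
sd_between, sd_within = np.sqrt(icc), np.sqrt(1 - icc)

naive_fp = cluster_fp = 0
for _ in range(n_sims):
    cluster_effects = rng.normal(0.0, sd_between, n_clusters)
    y = cluster_effects[:, None] + rng.normal(0.0, sd_within, (n_clusters, cluster_size))
    treated = rng.permutation(n_clusters) < n_clusters // 2  # randomize whole clusters

    # Naive analysis: every individual treated as independent.
    naive_fp += stats.ttest_ind(y[treated].ravel(), y[~treated].ravel()).pvalue < alpha
    # Cluster-level analysis: compare cluster means (valid here: equal cluster sizes).
    cluster_fp += stats.ttest_ind(y[treated].mean(axis=1), y[~treated].mean(axis=1)).pvalue < alpha

print(f"naive false positive rate:   {naive_fp / n_sims:.2f}")   # far above alpha
print(f"cluster false positive rate: {cluster_fp / n_sims:.2f}")  # close to alpha
```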

The effective sample size in a cluster-randomized experiment is driven by the number of clusters, not the number of individuals. An experiment that randomizes 50 stores with 10,000 customers each does not have the statistical power of a 10,000-user experiment. It has the statistical power of a 50-unit experiment, adjusted for within-cluster correlation. A platform that reports confidence intervals based on 10,000 independent users is producing numbers that cannot be trusted.

In Confidence, you can randomize on any key available in the evaluation context, and the analysis aggregates at the same level. If you pass in a store ID or account ID, the platform randomizes and analyzes at that level. The cluster keys must be available in the evaluation context; you cannot upload them separately.

Should you add clustered randomization to your experimentation platform requirements?

If you randomize at anything other than the individual level, yes. Clustered randomization is not an edge case for B2B companies, marketplace platforms, or any organization where the unit of treatment assignment differs from the unit of measurement. And it is one of the most underspecified areas in the vendor landscape.

The core issue is that a cluster-randomized experiment violates the independence assumption that underlies standard variance estimation. Users within the same store, household, or account tend to behave similarly. This within-cluster similarity, measured by the intracluster correlation coefficient (ICC), inflates the true variance relative to what a naive estimator computes. The higher the ICC and the larger the clusters, the worse the problem becomes.

A concrete example: if you randomize 100 stores with an ICC of 0.05 and 200 customers per store, the design effect is approximately 1 + (200 - 1) * 0.05 = 10.95. The effective sample size is not 20,000 customers but roughly 1,826. A platform that reports results as if you had 20,000 independent observations will show confidence intervals that are roughly 3.3 times too narrow.
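
The same arithmetic, as a few lines of Python you can adapt to your own ICC and cluster sizes:

```python
# Design effect and effective sample size for the example above:
# deff = 1 + (m - 1) * ICC, where m is the average cluster size.
n_clusters, m, icc = 100, 200, 0.05
deff = 1 + (m - 1) * icc             # 10.95
n_effective = n_clusters * m / deff  # ~1,826 of the 20,000 customers
ci_shrinkage = deff ** 0.5           # naive CIs are ~3.3x too narrow
print(deff, round(n_effective), round(ci_shrinkage, 1))  # 10.95 1826 3.3
```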

The question for your RFP is not "do you support cluster-level randomization?" Most platforms let you randomize by any identifier: account, device, household, store. The question is what happens in the analysis after randomization. If the variance estimate does not account for the cluster structure, the feature is a display-level setting without analytical substance: clustered assignment with unclustered inference. The results are unreliable, with confidence intervals roughly 3x too narrow in typical configurations. At Spotify, we treat this as a hard rule: if the analysis layer cannot match the randomization layer, the feature does not ship. The cost of a fragmented implementation, where the assignment says "clustered" but the inference says "independent," compounds silently across every experiment that uses it.

What your RFP should ask instead of a yes/no question

Five questions separate a sound clustered randomization implementation from an incomplete one.

First: does the platform adjust variance estimates for the cluster structure? Variance adjustment is the fundamental requirement. When randomization is at the cluster level and outcomes are at the individual level, the variance of the treatment effect estimator must account for the within-cluster correlation. There are two established approaches. The first is cluster-robust standard errors, which directly estimate the variance using cluster-level residuals without imposing a parametric model on the correlation structure. The second is the delta method applied to ratio metrics, which reformulates the individual-level metric as a ratio of cluster-level aggregates (total outcome divided by cluster size) and computes the variance of that ratio. The two approaches are mathematically equivalent in the standard case. A platform that uses either one correctly will produce valid confidence intervals. A platform that treats individual observations as independent will not. Ask what variance estimation method the platform uses for cluster-randomized experiments, and whether it is documented.
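
As a reference point for evaluating vendor answers, here is a minimal sketch of the delta-method computation for one treatment arm, assuming cluster-level aggregates are already available. The function name and structure are ours, not any platform's API:

```python
import numpy as np

def clustered_mean_and_se(totals: np.ndarray, sizes: np.ndarray):
    """Per-user mean and its delta-method standard error for one arm.

    totals[i]: summed outcome in cluster i; sizes[i]: users in cluster i.
    """
    k = len(totals)                      # number of clusters: the real sample size
    ratio = totals.sum() / sizes.sum()   # per-user mean, as a ratio of aggregates
    mu_n = sizes.mean()
    var_y = totals.var(ddof=1)
    var_n = sizes.var(ddof=1)
    cov_yn = np.cov(totals, sizes, ddof=1)[0, 1]
    # Delta method: Var(ybar/nbar) ~ (var_y - 2*R*cov_yn + R^2*var_n) / (k * mu_n^2)
    var_ratio = (var_y - 2 * ratio * cov_yn + ratio**2 * var_n) / (k * mu_n**2)
    return ratio, float(np.sqrt(var_ratio))
```

The treatment effect estimate is the difference of the two arms' ratios, and its variance is the sum of the two arms' delta-method variances. Note that the sample size driving the standard error is k, the number of clusters, not the number of users.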

Second: is the effective sample size calculated correctly for planning? The sample size calculator for a cluster-randomized experiment should express power in terms of the number of clusters, not the number of individuals. The required number of clusters depends on the ICC, the average cluster size, the expected effect size, and the desired power. A calculator that accepts the number of users and ignores cluster structure will overstate power. If your experiment has 50 clusters per arm, you need a calculator that knows the power comes from those 50 clusters and the correlation within them. None of the external vendors reviewed provides a sample size calculator that accounts for cluster-level variance inflation. On every one of them, teams running cluster-randomized experiments have no principled way to plan experiment duration or to decide whether the experiment is adequately powered before it starts.
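
For comparison, a cluster-aware calculation is not complicated. This sketch inflates the standard two-sample formula by the design effect; all inputs are illustrative:

```python
import math
from scipy.stats import norm

def clusters_per_arm(delta, sigma, m, icc, alpha=0.05, power=0.8):
    """Clusters needed per arm to detect a mean difference `delta`."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    n_independent = 2 * (z * sigma / delta) ** 2  # users per arm, if independent
    deff = 1 + (m - 1) * icc                      # variance inflation from clustering
    return math.ceil(n_independent * deff / m)    # translate users into clusters

# Example from earlier (ICC 0.05, 200 users per cluster), detecting a 0.1 SD effect:
print(clusters_per_arm(delta=0.1, sigma=1.0, m=200, icc=0.05))  # 86 clusters per arm
```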

Third: does clustered randomization work with variance reduction? Methods like CUPED reduce metric variance by adjusting for pre-experiment behavior. For standard user-level experiments, this is well understood. For clustered experiments, the interaction is more involved. CUPED needs pre-experiment data at the right level of aggregation, and the variance reduction must be reflected in the cluster-adjusted variance estimate, not the naive individual-level estimate. A platform that applies CUPED to a cluster-randomized experiment but computes the variance reduction as if users were independent will overstate the efficiency gain. Ask whether variance reduction is available for clustered experiments, and whether the reduced variance is computed at the cluster level.
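
One way to check a vendor's answer is against the simplest correct construction: fit the CUPED coefficient on cluster-level aggregates and adjust the cluster means, so the variance reduction lives at the same level as the variance estimate. A simplified sketch for mean metrics with roughly equal cluster sizes; this is our construction, not any vendor's implementation:

```python
import numpy as np

def cuped_adjusted_cluster_means(post: np.ndarray, pre: np.ndarray) -> np.ndarray:
    """Adjust experiment-period cluster means using pre-period cluster means."""
    # theta is fit on cluster-level data, so the variance reduction is
    # measured at the same level as the cluster-adjusted variance estimate.
    theta = np.cov(post, pre, ddof=1)[0, 1] / pre.var(ddof=1)
    return post - theta * (pre - pre.mean())  # analyze these as the new outcome
```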

Fourth: are all metric types supported under clustered randomization? The delta method approach to cluster adjustment works naturally for ratio metrics, because the cluster-randomized metric is itself a ratio (total outcome per cluster divided by cluster size). But experiments typically include multiple metric types: simple means, proportions, ratios with different denominators, and potentially percentile metrics. Each type needs its own cluster-adjusted variance formula. A platform that adjusts variance correctly for mean metrics but not for ratio or percentile metrics forces you to choose between using the right metric type and getting a valid confidence interval. Ask which metric types are supported under clustered randomization and whether the variance adjustment applies to all of them.

Fifth: does clustered randomization work in both Bayesian and frequentist modes? The cluster structure affects the likelihood of the data regardless of the inference framework. A Bayesian analysis that uses an individual-level likelihood without accounting for within-cluster correlation will produce posterior intervals that are too narrow, for the same reason that a frequentist analysis with naive standard errors will. If the platform offers both inference modes, ask whether the cluster adjustment applies in both. A platform where switching from frequentist to Bayesian silently drops the cluster correction produces results that are only valid in one mode.
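
The simplest illustration of a cluster-respecting Bayesian analysis models cluster means directly; with a flat prior on the effect, the posterior interval then coincides with the frequentist cluster-adjusted one. A sketch under those assumptions, not a description of any vendor's model:

```python
import numpy as np
from scipy import stats

def posterior_effect(treat_means: np.ndarray, control_means: np.ndarray):
    """Posterior over the treatment effect, with cluster means as the data."""
    diff = treat_means.mean() - control_means.mean()
    se = np.sqrt(treat_means.var(ddof=1) / len(treat_means)
                 + control_means.var(ddof=1) / len(control_means))
    # Flat prior + normal likelihood on cluster means => normal posterior whose
    # width reflects between-cluster variation, not individual counts.
    return stats.norm(loc=diff, scale=se)

# 95% credible interval: posterior_effect(t_means, c_means).interval(0.95)
```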

What the answers actually look like across vendors

Here is how the major platforms stand as of this writing. Vendor capabilities are based on public documentation. Confidence capabilities are based on the product itself.

"Yes" and "No" are based on explicit evidence in the vendor's public documentation. "—" means the platform does not offer this feature, so the question does not apply. "Not documented" means we found no explicit evidence either way, not that the capability is confirmed absent.

| Platform | Cluster-level randomization? | Upload pre-specified clusters? | Variance adjustment method | Sample size accounts for clustering? | CUPED with clustering? | All metric types supported? | Works in both freq and Bayes? | Other gaps |
|---|---|---|---|---|---|---|---|---|
| Confidence | Yes | No (must be in evaluation context) | Yes (aggregates at the randomization level) | Yes | Yes | Yes | Yes | — |
| GrowthBook | Yes | Not documented | Delta method | No | Not documented | Partial | Yes | Sequential not documented for clusters |
| Eppo | Yes | Not documented | Delta method / cluster-robust | No | Yes | Not documented | Yes | SSC does not account for clustering |
| Statsig | Yes | Not documented | Delta method | No | Not documented | Partial | Not documented | Two implementations (Cloud vs WHN) |
| LaunchDarkly | Partial | Not documented | Not documented | No | Not documented | Not documented | Not documented | No documented variance correction |
| PostHog | Partial | Not documented | Not documented | No | Not documented | Not documented | Not documented | No cluster-level variance adjustment |
| Amplitude | Partial | Not documented | Not documented | No | Not documented | Not documented | Not documented | No documented variance adjustment |
| Optimizely | Not documented | Not documented | Not documented | Not documented | Not documented | Not documented | Not documented | No documented cluster randomization support |
| VWO | Not documented | Not documented | Not documented | Not documented | Not documented | Not documented | Not documented | No documented cluster randomization support |

Three patterns emerge from this comparison.

The first is the gap between randomization and inference. Six of the eight external vendors let you randomize at a non-user level: GrowthBook, Eppo, Statsig, LaunchDarkly, PostHog, and Amplitude (via an add-on). But offering a cluster-level randomization toggle is not the same as offering cluster-level inference. LaunchDarkly's documentation explicitly warns that mismatched randomization and analysis units break the independence assumption and recommends matching them whenever possible, but does not document a variance correction for when you cannot. PostHog supports group-targeted experiments but does not document any statistical adjustment for the clustered structure. Amplitude supports account-level bucketing through its Accounts add-on but does not document variance adjustment for the resulting clustered analysis. The result on these platforms is that you can assign treatments at the cluster level, but the confidence intervals and p-values are computed as if every individual were independently randomized. The numbers look precise, but they understate the true uncertainty.

GrowthBook, Eppo, and Statsig are the three vendors that document a variance adjustment for clustered experiments. All three use the delta method, treating the individual-level metric as a ratio of cluster-level aggregates and computing the variance of that ratio. The approach is equivalent to cluster-robust standard errors and produces valid confidence intervals when the cluster structure is properly specified. GrowthBook implements this through its Fact Tables and statistics engine. Eppo provides a dedicated clustered analysis mode with explicit documentation of the ratio-metric reformulation. Statsig offers two implementations: normalized metrics in Warehouse Native and hierarchical IDs in the Cloud product, both using the delta method.

The second pattern is the complete absence of cluster-aware planning. None of the external vendors offer a sample size calculator that accounts for cluster-level variance inflation. GrowthBook's power analysis documentation covers standard experiments but does not mention cluster adjustments. Eppo's sample size calculator operates on individual-level variance estimates. Statsig's power calculator accepts standard inputs without cluster parameters. The consequence is that even on platforms where the analysis is correct, teams have no way to determine before the experiment starts whether 50 clusters is enough or whether they need 200. They discover the experiment was underpowered after it ends. The problem is the same planning-analysis disconnect described in our sample size post, but the stakes are higher for clustered experiments because the gap between naive and correct power estimates can be an order of magnitude.

The third pattern is incomplete feature integration. Even on the three platforms that adjust variance correctly, the adjustment does not extend to every feature. GrowthBook's quantile metrics explicitly do not support the clustered analysis pipeline. Statsig's documentation does not address whether percentile metrics work with hierarchical IDs. Eppo is the most explicit about feature compatibility, documenting that CUPED++ works with clustered experiments and that all major analysis methods (frequentist, Bayesian, sequential hybrid) are available. But even Eppo does not connect cluster structure to the sample size calculator. The partial pipeline pattern is consistent: the analysis adjustment exists, but the surrounding features (planning, variance reduction, metric type coverage) are not uniformly connected.

A complete implementation means the variance adjustment is applied in the analysis, reflected in the sample size calculator, compatible with variance reduction, available across all metric types you use, and consistent across inference modes. Cluster randomization is also where geo-based experiments connect: geo experiments are a special case of cluster randomization where markets or regions are the clusters, and the same variance inflation problem applies. If your platform handles one correctly, it should handle the other. If neither is addressed, any experiment that randomizes above the individual level is reporting results with uncontrolled error rates.

The RFP question that matters is not "can we randomize by account?" It is "when we randomize by account, do the confidence intervals reflect the fact that we have 50 accounts, not 50,000 users?" If the answer is no, the feature exists in the assignment layer but not in the inference layer, and the experiment results cannot be trusted.