RFP Series

How to Write an Experimentation Platform RFP for Geo-Lift and Synthetic Control


Last updated: May 2026

Geo-lift studies measure the causal effect of an intervention that cannot be randomized at the user level. A new ad campaign, a pricing change, a market launch. The treatment is applied to entire markets, and the question is whether the outcome in those markets changed beyond what would have happened without the intervention. The standard approach constructs a synthetic control: a weighted combination of untreated markets whose pre-intervention behavior closely matches the treated markets, forming a counterfactual against which the treatment effect is estimated.

It is easy to find an experimentation vendor that lists geo-lift or geo-testing as a feature. A few do. The problem is that geo-lift analysis is fundamentally different from a standard A/B test. The number of independent units is small (markets, not users). The counterfactual is constructed, not randomized. The validity of the result depends on whether the synthetic control is a credible stand-in for what would have happened, and that credibility rests on identification assumptions that can and should be evaluated before you trust the estimate. A yes on "do you support geo experiments?" tells you the feature exists. It does not tell you whether the platform evaluates identification, selects donors rigorously, or produces inference you can trust.

Should you add geo-lift to your experimentation platform requirements?

Probably not. And if you do, you should be cautious about what you expect from it.

Geo-lift and synthetic control methods occupy a different position from the other topics in this series. Sample size calculation, sequential testing, multiple testing corrections: those are features that every experimentation program needs, and the question is whether the vendor implements them correctly. Geo-lift is a method most experimentation programs do not need, and platformizing it removes the friction that protects users from acting on estimates that rest on unexamined assumptions.

At Spotify, we have among the best researchers in the world working on synthetic control methods. Our research team formulated synthetic control models in Pearl's structural causal model framework for the first time, proved that the causal effect is identifiable under weaker assumptions than before, and developed a general framework for sensitivity analysis when those assumptions are violated. A more recent paper introduced a method for detecting spillovers in donor units using proximal causal inference, inverting the usual synthetic control logic to validate donors from pre-intervention data rather than relying on post-intervention estimation. The work is active, published research at the frontier of the field.

We still choose not to platformize it.

Each application of synthetic control requires thinking and local context. Which markets are valid donors? Are there spillovers between treated and untreated markets? Is the pre-intervention fit good enough to trust the counterfactual? Would the treated and untreated markets have followed the same trajectory without the intervention, or does the comparison rest on a stronger structural assumption that needs domain knowledge to evaluate? The questions are not configuration choices you set once in a user interface. They are judgment calls that benefit from friction: the kind of conversation between a researcher and a product team where someone asks "are we sure this comparison is valid?" and the answer requires looking at the data, not clicking a button.

Packaging geo-lift as a self-serve feature in an experimentation platform risks removing exactly that friction. A one-click solution that selects donors, constructs a synthetic control, and returns a p-value gives the user the aesthetics of rigor without the substance. If the donor pool is poor, the pre-treatment fit is weak, or spillovers are present, the estimate is not a causal effect. It is a number. A platform that does not surface these problems, or surfaces them in a way that is easy to dismiss, produces causal claims that are not warranted by the data.

If you do need geo-lift because you run interventions that can only be applied at the market level and you cannot randomize users, then put it on your list. But evaluate it with a different standard than you would apply to a standard A/B testing feature. The right question is not "does the platform support geo experiments?" It is whether the platform forces the user to confront the assumptions that make or break the analysis.

What your RFP should ask instead of "yes/no"

Five questions separate a credible geo-lift implementation from a checkbox.

First: does the platform use the correct analysis method? Geo-lift studies compare treated markets against synthetic controls constructed from untreated markets. The synthetic control is a weighted combination of donor markets chosen so that the resulting composite tracks the treated market's behavior during the pre-intervention period. The method is not a t-test on two groups of markets. It is not a simple difference-in-differences with a single control market. The analysis method matters because the number of independent units is small (often fewer than 50 markets), the treated and control groups are not exchangeable by randomization, and the counterfactual is estimated rather than observed. A platform that applies a t-test to geo-level aggregates has used the wrong method. Ask what analysis method the platform uses for geo experiments, whether it constructs a synthetic control, and whether the method accounts for the small number of independent units.
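
To make the construction concrete, here is a minimal sketch of classic synthetic control weight fitting: non-negative donor weights that sum to one, chosen to minimize pre-intervention mismatch. This illustrates the method in general, not any vendor's implementation; the function name and the data are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

def fit_synthetic_control(y_treated, Y_donors):
    """y_treated: (T_pre,) pre-period outcomes for the treated market.
    Y_donors: (T_pre, J) pre-period outcomes for J donor markets.
    Returns donor weights w with w >= 0 and sum(w) == 1."""
    J = Y_donors.shape[1]
    w0 = np.full(J, 1.0 / J)  # start from equal weights

    def pre_period_mse(w):
        return np.mean((y_treated - Y_donors @ w) ** 2)

    res = minimize(
        pre_period_mse,
        w0,
        method="SLSQP",
        bounds=[(0.0, 1.0)] * J,
        constraints=[{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}],
    )
    return res.x

# Hypothetical usage: 52 pre-intervention weeks, 12 candidate donors.
rng = np.random.default_rng(0)
Y_donors = rng.normal(100, 10, size=(52, 12)).cumsum(axis=0) / 10
y_treated = Y_donors[:, :3] @ np.array([0.5, 0.3, 0.2]) + rng.normal(0, 1, 52)
w = fit_synthetic_control(y_treated, Y_donors)
counterfactual = Y_donors @ w  # what the treated market "would have done"
```

The simplex constraint (non-negative weights summing to one) is what keeps the counterfactual an interpolation of real markets rather than an arbitrary extrapolation, which is part of what distinguishes the method from a regression or a t-test on market groups.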

Second: does the platform evaluate identification and warn if the synthetic control is not credible? The validity of a synthetic control estimate depends on how well the synthetic control matches the treated market during the pre-intervention period. A poor fit means the counterfactual is unreliable, and the treatment effect estimate is not trustworthy. The platform should evaluate pre-treatment fit using historical data and surface diagnostics that tell the user whether the synthetic control is good enough. This includes measures of fit quality (how closely the synthetic control tracks the treated market before the intervention), placebo tests (applying the method to periods where no intervention occurred to check whether it falsely detects effects), and leave-one-out analyses (checking whether the result is sensitive to removing individual donor markets). A platform that constructs a synthetic control and returns a result without evaluating fit quality is skipping the step that determines whether the result means anything. Ask whether the platform surfaces pre-treatment fit diagnostics, runs placebo tests, and warns the user when the synthetic control is not credible.
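
As a sketch of what those diagnostics look like in practice, the two checks below build on the hypothetical fit_synthetic_control from the previous sketch. An in-time placebo backdates the intervention to a point where nothing happened; leave-one-out refits the control with each donor removed. What counts as "good enough" is a judgment call this sketch deliberately leaves open.

```python
def in_time_placebo(y, Y, fake_T0):
    """Backdate the intervention to fake_T0 (an index inside the
    pre-period) and estimate the 'effect' where none should exist."""
    w = fit_synthetic_control(y[:fake_T0], Y[:fake_T0])
    gap = y[fake_T0:] - Y[fake_T0:] @ w
    return gap.mean()  # should be near zero if the method is credible

def leave_one_out_gaps(y, Y):
    """Refit the synthetic control with each donor dropped in turn.
    Returns one residual path per dropped donor; in a real study you
    would compare the re-estimated effects. Large swings mean the
    result hinges on a single market."""
    J = Y.shape[1]
    paths = []
    for j in range(J):
        keep = [k for k in range(J) if k != j]
        w = fit_synthetic_control(y, Y[:, keep])
        paths.append(y - Y[:, keep] @ w)
    return np.array(paths)  # shape (J, T)
```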

Third: does the platform support principled donor selection? The donor pool is the set of untreated markets from which the synthetic control is constructed. Not every untreated market is a valid donor. Markets that experienced their own shocks during the study period, markets that are economically linked to the treated market (creating spillovers), or markets with fundamentally different dynamics are poor donors whose inclusion can bias the estimate. The platform should support automatic donor selection that evaluates candidate donors based on historical fit, and it should flag or exclude donors that show signs of contamination. The alternative, a platform that uses all available untreated markets by default, risks constructing a synthetic control from a donor pool that includes invalid units. Ask whether the platform selects donors automatically, what criteria it uses, and whether it evaluates donors for spillovers or structural breaks.
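
A screening pass might look like the sketch below, which continues the earlier examples: it keeps donors whose period-over-period changes correlate with the treated market and flags donors whose behavior shifts sharply mid-period, a crude proxy for a local shock or structural break. The thresholds are illustrative assumptions, and no correlation filter can detect spillovers caused by the intervention itself; that requires domain knowledge or dedicated methods.

```python
def screen_donors(y_treated, Y_donors, min_corr=0.7, max_shift_z=3.0):
    """Return (kept, flagged) donor column indices. Works on
    period-over-period changes so trends don't dominate the screen."""
    dy = np.diff(y_treated)
    kept, flagged = [], []
    for j in range(Y_donors.shape[1]):
        dd = np.diff(Y_donors[:, j])
        corr = np.corrcoef(dy, dd)[0, 1]
        # Crude structural-break proxy: standardized shift in the mean
        # change between the two halves of the pre-period.
        half = len(dd) // 2
        shift_z = abs(dd[half:].mean() - dd[:half].mean()) / (
            dd.std() / np.sqrt(half) + 1e-9
        )
        (flagged if corr < min_corr or shift_z > max_shift_z else kept).append(j)
    return kept, flagged
```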

Fourth: does the platform produce inference without strong parametric assumptions? Inference for synthetic control is harder than for standard A/B tests because the number of independent units is small and standard large-sample approximations do not apply. Two approaches are credible. Conformal inference recasts the problem as a test for a structural break in a prediction model, using permutation-based procedures to construct p-values and confidence intervals that are valid under weak assumptions. Bayesian inference produces posterior credible intervals from a generative model fitted to the pre-intervention data, with the quality of inference depending on the model specification and the prior. Both are legitimate. A platform that computes a p-value from a t-test on the pre-post difference, or from a naive comparison of treated versus untreated market means, is applying inference that does not match the data-generating process. Ask what inference method the platform uses, whether it accounts for the small number of units, and whether it relies on parametric assumptions that may not hold.
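
The permutation idea behind conformal inference can be sketched in a few lines, in the spirit of Chernozhukov, Wüthrich, and Zhu: under the null of no effect, post-period residuals are exchangeable with pre-period residuals, so cyclic shifts of the residual sequence give a reference distribution. This is a simplification of the published procedure, again reusing the hypothetical fit_synthetic_control from above.

```python
def conformal_p_value(y, Y, T0):
    """y: full (pre + post) treated series; Y: (T, J) donor series;
    T0: first post-period index. Permutation p-value for 'no effect'."""
    # Under H0 (zero effect) the post period is also 'untreated', so
    # fit the synthetic control on the full sample.
    w = fit_synthetic_control(y, Y)
    u = y - Y @ w                      # residuals over all T periods
    q = len(u) - T0                    # number of post periods

    def stat(res):                     # mean absolute post-period residual
        return np.abs(res[-q:]).mean()

    s_obs = stat(u)
    shifts = [stat(np.roll(u, k)) for k in range(len(u))]
    return float(np.mean([s >= s_obs for s in shifts]))
```

Because the reference distribution comes from permuting the observed residuals rather than from a parametric model, the p-value does not lean on large-sample normality, which is exactly the property the small-units setting demands.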

Fifth: does the platform support power analysis for the geo design? Before running a geo-lift study, you need to know whether the study has a reasonable chance of detecting the effect you care about, given the number of available markets, their historical variance, and the expected effect size. Power analysis for geo designs is different from power analysis for user-level experiments: the effective sample size is the number of markets, not the number of users, and the variance depends on how well a synthetic control can be constructed from the available donor pool. A platform that offers geo-lift without power analysis leaves you with no principled basis for deciding whether the study is worth running. Ask whether the platform provides power analysis specific to the geo design, whether it uses historical data to estimate achievable power, and whether it accounts for the quality of the synthetic control in the power estimate.
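
Simulation-based power analysis for a geo design can be sketched directly on top of the pieces above: carve out a pseudo-post period in historical data, inject a hypothetical lift, run the same analysis, and count rejections. The effect size, alpha, and noise model here are illustrative assumptions, not a recommendation.

```python
def simulate_power(y_hist, Y_hist, T0, lift=0.05, alpha=0.1,
                   n_sims=200, seed=1):
    """Estimated power: fraction of simulated studies, each with `lift`
    injected into the pseudo-post period of historical data, where the
    conformal test rejects at level alpha."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sims):
        y_sim = y_hist.astype(float).copy()
        # Inject a multiplicative lift plus a little idiosyncratic noise.
        noise = rng.normal(0.0, 0.02 * y_hist.std(), len(y_sim) - T0)
        y_sim[T0:] = y_sim[T0:] * (1.0 + lift) + noise
        if conformal_p_value(y_sim, Y_hist, T0) <= alpha:
            hits += 1
    return hits / n_sims
```

Note that the achievable power in this sketch depends on how well the synthetic control fits the historical data, which is the point the text makes: the donor pool, not the user count, sets the ceiling.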

What the answers actually look like across vendors

Here is how the major platforms stand as of this writing. Vendor capabilities are based on public documentation. Confidence capabilities are based on internal knowledge.

"Yes" and "No" are based on explicit evidence in the vendor's public documentation. "Not documented" means we found no explicit evidence either way, not that the capability is confirmed absent. "—" means the platform does not offer geo-lift as a feature.

| Vendor | Geo-lift support | Analysis method | Pre-treatment fit evaluation | Donor selection | Inference method | Power analysis | Other gaps |
|---|---|---|---|---|---|---|---|
| Confidence | No (deliberate) | — | — | — | — | — | — |
| Eppo | Yes | Bayesian synthetic control | Partial | Yes | Bayesian credible intervals | Yes | Placebo and leave-one-out not documented |
| Statsig | Yes | Synthetic control (GeoLift) | Yes | Yes | Conformal inference | Yes | GeoLift diagnostic surfacing not fully documented |
| GrowthBook | No | — | — | — | — | — | — |
| LaunchDarkly | No | — | — | — | — | — | — |
| PostHog | No | — | — | — | — | — | — |
| Amplitude | No | — | — | — | — | — | — |
| Optimizely | No | — | — | — | — | — | — |
| VWO | No | — | — | — | — | — | — |

The most striking feature of this table is how empty it is. Only two of the nine platforms reviewed offer geo-lift as a product feature. This is appropriate: geo-lift is a specialized method that most experimentation programs do not need. A poor implementation returns causal estimates with the same visual authority as a well-powered A/B test, but without the identification assumptions that would make the estimate credible. Teams act on those estimates because the platform presented them as conclusions.

Eppo and Statsig take different approaches, both in the statistical method and in what they surface to the user. Statsig builds its geo-testing on Meta's open source GeoLift package, which uses augmented synthetic control methods with conformal inference for p-values and confidence intervals. Conformal inference is a strong choice for this setting: it is valid under weak assumptions, does not require large-sample approximations, and has been shown to work with synthetic control, difference-in-differences, and factor models alike. Statsig's experiment designer automates the evaluation of candidate geo splits, ranking them by minimum detectable effect, power, and cost. The platform surfaces model performance details after design generation, giving the user some basis for evaluating whether the synthetic control is credible. How much of GeoLift's diagnostic machinery (placebo tests, leave-one-out checks) is surfaced in the Statsig UI is not fully documented.

Eppo takes a Bayesian approach, using a hierarchical model with explicit time series components, adstock modeling (the delayed and decaying effect of advertising spend), and saturation effects. This is purpose-built for marketing incrementality measurement. The Bayesian framework produces posterior credible intervals rather than frequentist confidence intervals, and the model estimates marketing-specific quantities that a generic synthetic control would not capture. Eppo's documentation describes a structured approach to donor selection: geographic units are grouped based on historical metric behavior and correlations, low-signal regions are filtered out, and regions with unique events (natural disasters, market entries) can be excluded. The platform surfaces data quality diagnostics and supports visual inspection of synthetic control units, but more specific diagnostics like placebo tests and leave-one-out analyses are not documented. Eppo provides simulation-based power analysis that models various treatment effect scenarios, which is appropriate for the Bayesian framework.

Neither platform fully addresses the identification question at the level that would substitute for expert judgment. A platform can automate donor selection, rank candidate designs by power, and return a credible interval. What it cannot easily automate is the domain-specific reasoning about whether the donor pool is valid in the first place: whether treated and untreated markets are truly independent, whether the intervention created spillovers, whether the pre-intervention period is long enough and stable enough to support the synthetic control assumption. These questions decide whether the estimate is causal or coincidental, and they resist being turned into a product feature.

The gap is why we do not build geo-lift into the Confidence platform at Spotify despite having published research that advances the method. The friction of running a geo-lift study as a bespoke analysis, with a researcher involved in the design and interpretation, ensures that assumptions are examined rather than accepted by default. Removing that friction without replacing it with equally rigorous automated diagnostics means users get answers faster, but those answers are less likely to be correct.

If your organization runs geo-lift studies, Eppo and Statsig are the two vendors with documented offerings. If your organization does not currently run geo-lift studies, adding this to your RFP is unlikely to improve your experimentation program. The features that matter for every experiment you run deserve more attention than a specialized method that most teams will never need and few platforms implement well enough to trust without expert oversight.