How to Write an Experimentation Platform RFP for Exploratory Analysis and Dimensions

Every experimentation platform lets you slice results by dimension. Country, device type, user segment, cohort. The feature is universal. What is not universal is whether the platform does anything to control the false positive rate when you do it.

Slicing by ten dimensions and scanning for significant subgroups is a multiple comparisons problem. At the 5% significance level, ten independent tests give you a roughly 40% chance of at least one false positive. Most platforms surface dimensional breakdowns with the same visual weight as the primary analysis and without applying any correction. The experimenter sees a green "significant" indicator for users in Brazil on Android, treats it as a finding, and ships a targeted change. The platform did not lie. It also did not warn the experimenter that scanning ten dimensions without correction gave the experiment a roughly 40% chance of producing at least one false positive.

The deeper issue is that exploratory analysis and confirmatory analysis require different statistical treatment. A pre-specified subgroup analysis with an appropriate correction is valid confirmatory evidence. An unstructured scan through segment results is hypothesis generation, not a decision basis. Platforms that present both with the same confidence indicators and no distinction in the user interface mislead users about the strength of what they are looking at.

At Spotify, dimensional breakdowns in Confidence inherit the full analysis configuration, so every subgroup view uses the same statistical treatment as the topline result.

Should you add exploratory analysis and dimensions to your experimentation platform requirements?

Yes, but only if you ask the right question. "Can you slice results by dimension?" will get a yes from every vendor. The questions that matter are whether the platform controls the false positive rate when you slice, whether it distinguishes exploratory from confirmatory analysis, whether the analysis settings you configured for the primary experiment carry through to dimensional breakdowns, whether the inference framework stays consistent in those breakdowns, and whether all your metric types are available in them.

Dimensional analysis serves two legitimate purposes. The first is confirmatory: you specified before the experiment that you would check whether the treatment effect differs by country or device, and you included that subgroup in your analysis plan with an appropriate correction. The second is exploratory: the experiment ended, the topline result is clear, and you want to browse subgroups for patterns worth investigating in a follow-up experiment. Both are valuable. But they require different statistical standards, and most platforms apply the same treatment to both.

The risk is not that dimensional analysis exists. It is that dimensional analysis without correction, without documentation of the false positive risk, and without consistent analysis settings produces results that look like confirmatory evidence but carry the reliability of a hypothesis generator. Teams act on these results because the interface presents them with the same authority as the primary analysis. A dimensional breakdown that drops variance reduction, ignores the multiple testing correction, or changes the zero-handling policy is not a lesser version of the feature. It is a different analysis presented as if it were the same one. At Spotify, we invested heavily in making dimensional views inherit the full analysis configuration because trust in experiment results depends on consistency. If a subgroup view silently uses different settings than the topline, teams lose confidence in both. After that trust erodes, the platform stops being useful for decisions.

What your RFP should ask instead of "yes/no"

Five questions separate a connected dimensional analysis implementation from a disconnected one.

First: does the platform control the false positive rate when slicing by dimension? False positive control is the most consequential question on the list. Scanning ten dimensions at the 5% significance level, with no correction, gives you a roughly 40% chance of at least one false positive. The math is straightforward: 1 - (1 - 0.05)^10 ≈ 0.40. Every additional dimension you check increases the probability further. A platform that highlights significant subgroups without correcting for the number of dimensions tested is surfacing noise with the visual authority of a real finding.
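That arithmetic is easy to sanity-check. A minimal Python sketch, with illustrative dimension counts:

```python
# Familywise false positive rate when scanning k uncorrected dimensions
# at significance level alpha, assuming independent tests.
alpha = 0.05
for k in (1, 5, 10, 20):
    fwer = 1 - (1 - alpha) ** k
    print(f"{k:2d} dimensions: P(at least one false positive) = {fwer:.0%}")
# Prints roughly 5%, 23%, 40%, and 64%.
```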

The correction can take different forms. Some platforms apply the same multiple testing correction used for the topline metrics (Bonferroni, Holm-Bonferroni, Benjamini-Hochberg) across the dimension values. Some treat the dimensional breakdown as a separate family of tests with its own correction. Some flag dimensional results as exploratory and suppress significance indicators entirely. Each approach is defensible. What is not defensible is surfacing dimensional significance with no correction and no warning. Ask whether the platform applies a correction when results are broken down by dimension, and if so, what method it uses and how the family of tests is defined. For more on how multiple testing corrections work across vendors, see our post on multiple testing.
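To make the difference between correction methods concrete, here is a minimal sketch of Bonferroni and Benjamini-Hochberg applied to one hypothetical family of per-dimension p-values. The numbers are invented, and how the family is defined varies by platform, as noted above:

```python
# Sketch: two common corrections over one family of per-dimension p-values.
def bonferroni(pvals, alpha=0.05):
    # Reject where p <= alpha / m; controls the familywise error rate.
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def benjamini_hochberg(pvals, alpha=0.05):
    # Reject the k smallest p-values where p_(i) <= (i / m) * alpha;
    # controls the false discovery rate, so it is less conservative.
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            k = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        reject[i] = rank <= k
    return reject

pvals = [0.003, 0.012, 0.021, 0.045, 0.30]  # one p-value per dimension value
print(bonferroni(pvals))          # [True, False, False, False, False]
print(benjamini_hochberg(pvals))  # [True, True, True, False, False]
```

The same scan yields one corrected finding or three depending on the method, which is why the RFP should ask both which method the platform applies and over what family of tests.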

Second: does the platform distinguish pre-specified analyses from ad-hoc exploration? A pre-specified subgroup analysis is part of the experiment design. You declared before the experiment started that you would check a specific dimension, and the analysis plan accounts for it. An ad-hoc exploration is something you do after the experiment ends, scanning multiple dimensions for interesting patterns. The statistical standards for these two activities are fundamentally different. A pre-specified comparison with correction is valid evidence for a shipping decision. An ad-hoc discovery is a hypothesis that needs its own dedicated experiment to confirm.

Most platforms make no distinction between these in the interface. The dimensional breakdown page looks the same whether you defined the subgroup in advance or discovered it ten minutes ago. A platform that distinguishes pre-specified from exploratory analysis in the interface, either through separate analysis modes, different visual treatment, or explicit labeling, helps experimenters calibrate how much weight to put on what they see. A platform that presents all dimensional results with the same confidence indicators and no context about whether the comparison was planned encourages treating exploratory findings as confirmed results. Ask whether the platform provides any mechanism to distinguish pre-specified subgroup analyses from post-hoc exploration, and whether that distinction affects the statistical treatment or the visual presentation of results.
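No vendor reviewed offers this mechanism today, but a hypothetical sketch shows how little it would take to represent the distinction in an analysis plan. Every name and structure here is invented for illustration:

```python
# Hypothetical sketch only: declaring subgroup hypotheses up front so that
# pre-specified subgroups join the corrected test family and everything
# else is labeled exploratory. Not modeled on any vendor's API.
from dataclasses import dataclass, field

@dataclass
class SubgroupAnalysis:
    dimension: str       # e.g. "platform"
    value: str           # e.g. "android"
    pre_specified: bool  # declared before the experiment started?

@dataclass
class AnalysisPlan:
    primary_metric: str
    alpha: float = 0.05
    subgroups: list[SubgroupAnalysis] = field(default_factory=list)

    def confirmatory_family(self):
        # Only pre-specified subgroups get significance calls; the rest
        # would be rendered as exploratory, without significance indicators.
        return [s for s in self.subgroups if s.pre_specified]

plan = AnalysisPlan(
    primary_metric="conversion_rate",
    subgroups=[
        SubgroupAnalysis("platform", "android", pre_specified=True),
        SubgroupAnalysis("country", "BR", pre_specified=False),  # found post hoc
    ],
)
# Bonferroni family: the primary metric plus pre-specified subgroups.
family_size = 1 + len(plan.confirmatory_family())
print(f"corrected alpha = {plan.alpha / family_size:.4f}")  # 0.0250
```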

Third: do the analysis settings carry through to dimensional breakdowns? When you configure an experiment, you make choices: variance reduction (CUPED) on or off, an observation window, an aggregation method, a zero-handling policy, a sequential testing method, a multiple testing correction. These choices affect what the metric measures and how it is analyzed. The question is whether those same choices apply when you slice by dimension.

Some platforms recompute dimensional breakdowns from raw events, bypassing the user-level aggregation, windowing, or variance reduction configured for the primary analysis. The result is a subgroup number that does not measure the same thing as the topline metric. If the primary analysis uses a seven-day fixed observation window with variance reduction and zero-inclusive handling (counting non-engagers as zeros), but the dimensional breakdown uses cumulative events with no variance reduction and excludes non-engagers, the subgroup result and the topline result answer different questions. Comparing them is meaningless. For more on how observation windows and zero-handling affect what a metric measures, see our posts on percentile metrics and the upcoming posts on time-in metrics and zero-handling.
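A toy example makes the divergence concrete. The data and the metric are invented; the point is only that the two computation paths answer different questions:

```python
# Invented user-level data for one subgroup. Each value is a list of
# (days_since_exposure, streams) events; an empty list is an exposed
# user who never engaged.
events = {
    "u1": [(1, 3), (5, 2), (12, 4)],
    "u2": [(2, 1)],
    "u3": [],
    "u4": [],
}

# Topline definition: 7-day fixed window, non-engagers counted as zeros.
windowed = [sum(s for d, s in evs if d < 7) for evs in events.values()]
print(sum(windowed) / len(windowed))    # (5 + 1 + 0 + 0) / 4 = 1.5

# A breakdown recomputed from raw events: cumulative, engagers only.
engaged = [sum(s for _, s in evs) for evs in events.values() if evs]
print(sum(engaged) / len(engaged))      # (9 + 1) / 2 = 5.0
```

Same users, same events, and the "same" metric comes out at 1.5 in one view and 5.0 in the other.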

Ask whether variance reduction, observation window, aggregation choices, zero-handling, sequential testing, and multiple testing corrections all apply consistently in dimensional breakdowns, or whether any of those settings are dropped or recalculated when results are filtered to a subgroup.

Fourth: does the platform support both Bayesian and frequentist modes in exploratory analysis? If a platform offers both inference frameworks for the primary analysis, the dimensional analysis should match. A platform that runs Bayesian analysis at the topline but switches to frequentist for dimensional breakdowns, or vice versa, creates an inconsistency that makes results difficult to compare. More practically, if the platform offers Bayesian analysis without multiple testing correction in the primary view (a common pattern documented in our Bayesian post), that same gap carries into dimensional analysis and compounds the false positive problem. Ten dimensions explored with uncorrected Bayesian posteriors and a ship-if-any rule have the same roughly 40% false positive rate as ten uncorrected frequentist tests.
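That claim is easy to check by simulation. The sketch below uses frequentist z-tests as a stand-in for uncorrected posteriors, since it is the decision rule rather than the inference framework that drives the error rate; all parameters are illustrative:

```python
# A/A simulation of the "ship-if-any" pattern: ten dimensions, no true
# effect anywhere, uncorrected two-sided tests at alpha = 0.05.
import numpy as np

rng = np.random.default_rng(0)
n_experiments, n_dims, n_per_cell = 2000, 10, 500

false_ship = 0
for _ in range(n_experiments):
    for _ in range(n_dims):
        a = rng.normal(size=n_per_cell)
        b = rng.normal(size=n_per_cell)  # same distribution: no true effect
        se = np.sqrt(a.var(ddof=1) / n_per_cell + b.var(ddof=1) / n_per_cell)
        if abs(b.mean() - a.mean()) / se > 1.96:  # uncorrected 5% test
            false_ship += 1                        # would ship on this subgroup
            break
print(false_ship / n_experiments)  # approaches 1 - 0.95**10, about 0.40
```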

Ask whether the inference framework configured for the primary analysis carries through to dimensional breakdowns, and whether any corrections applied at the topline also apply in subgroup views. If the platform uses different inference settings in dimensional views, ask what the differences are and why.

Fifth: are all metric types available in dimensional breakdowns? Platforms vary in which metric types they support in the primary analysis and which of those extend to dimensional views. Simple means and proportions are almost always available. Ratio metrics, percentile metrics, and funnel metrics may not be. If your experiment evaluates a P95 latency guardrail and a revenue-per-user ratio metric alongside a conversion rate, and the dimensional breakdown only supports the conversion rate, you cannot check whether the latency regression is concentrated in a specific country or whether the revenue effect is driven by a single device type. Yet exactly this kind of question is what dimensional analysis is supposed to answer.

Ask which metric types are available in dimensional breakdowns, and whether any types that appear in the primary analysis are excluded from dimensional views.

What the answers actually look like across vendors

Here is how the major platforms stand as of this writing. Vendor capabilities are based on public documentation. Confidence capabilities are based on the product itself.

"Yes" and "No" are based on explicit evidence in the vendor's public documentation. "Partial" means the capability exists but with documented limitations, explained in the cell. "Not documented" means we found no explicit evidence either way, not that the capability is confirmed absent.

| Platform | Dimensional slicing available? | Multiple testing correction applied to dimensions? | Pre-specified vs. exploratory distinction? | Variance reduction in dimensional views? | Analysis settings carry through? | Inference framework consistent in dimensions? | All metric types in dimensions? | Other gaps |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Confidence | Yes | Yes | Yes | Yes | Yes | Yes (frequentist only) | Yes | None documented |
| GrowthBook | Yes | Yes (per-dimension family) | No | Not documented | Not documented | Partial (frequentist only) | Partial | Top 20 dimension values shown |
| Eppo | Yes | Not documented | No | No | Partial | Not documented | Not documented | Dimension values capped at 50 |
| Statsig | Yes | Partial (Bonferroni) | Partial | Yes | Yes | Not documented | Not documented | Top 10 dimension levels shown |
| Optimizely | Yes | No | No | Not documented | Not documented | Not documented | Not documented | Warns segments are exploratory |
| LaunchDarkly | Yes | Not documented | No | Not documented | Not documented | Not documented | Not documented | Warehouse-native is Snowflake-only |
| PostHog | Partial | No | No | No | Not documented | Not documented | Not documented | No correction in any mode |
| Amplitude | Yes | No | No | Not documented | Not documented | Not documented | Not documented | |
| VWO | Yes | Not documented | No | Not documented | Not documented | Not documented | Not documented | Bonferroni for variations only |

Last updated: May 2026

Four patterns emerge from this landscape.

The first is the gap between dimensional analysis as a feature and dimensional analysis as a statistically valid procedure. Every platform offers some form of dimensional breakdown. Most present dimensional results with the same confidence indicators as the topline analysis. But the statistical treatment behind those indicators varies dramatically. GrowthBook and Statsig apply corrections in dimensional views, though through different mechanisms: GrowthBook defines a combined family across dimension values, metrics, and variations, while Statsig applies a separate Bonferroni correction using the number of dimension levels as the family size. Optimizely explicitly does not correct for dimensional analysis and warns users in its documentation that segments should be used for exploration, not decisions. Most other vendors do not document whether correction extends to dimensions at all. The result is that "slicing by dimension" on one platform produces statistically corrected results, on another produces uncorrected results with a warning, and on most produces uncorrected results with no warning.

The second pattern is the drop in analysis rigor between the topline and the dimensional view. Even platforms that apply variance reduction, sequential testing, and multiple testing correction at the topline may not carry those settings through to dimensional breakdowns. Eppo's documentation explicitly states that CUPED is not computed for segments or ad-hoc filters due to computational cost. This means that if variance reduction cuts the topline confidence interval width by 30%, the dimensional view operates without that benefit, producing wider intervals and potentially different significance conclusions. A team comparing the topline result to a dimensional breakdown is comparing two analyses computed under different assumptions. When one says "significant" and the other says "not significant," the difference may reflect the dropped settings rather than a real subgroup effect. Statsig is the most explicit about consistency: its documentation states that explore queries use the same statistical procedures as the main results tab, including CUPED and sequential testing. GrowthBook applies CUPED at the topline, but its documentation does not confirm that CUPED carries through to dimensional views.

The third pattern is the universal absence of a pre-specified versus exploratory distinction. No vendor reviewed provides a mechanism in the interface to mark a dimensional analysis as pre-specified and include it in the formal analysis plan with appropriate correction. Statsig comes closest with its Differential Impact Detection, which analyzes pre-configured "Segments of Interest" for heterogeneous treatment effects using a Bonferroni correction. The segments must be configured in advance, which makes this closer to pre-specified analysis than ad-hoc exploration. But it is an automated detection layer, not a mechanism for the experimenter to declare a subgroup hypothesis as part of the formal analysis plan. GrowthBook's documentation explicitly advises treating dimensions as an exploratory tool and not something to directly draw conclusions from. Optimizely's documentation similarly warns that segments should be used for data exploration, not making decisions. These warnings are appropriate but insufficient: they appear in documentation that most experimenters never read, while the interface presents dimensional results with the same authority as confirmatory analysis. A warning in the docs does not prevent a product manager from acting on a significant subgroup result they found by scanning countries.

The fourth pattern is incomplete metric type coverage in dimensional views. Most platforms do not document which metric types are available in dimensional breakdowns. The implicit assumption is that if a metric type works at the topline, it works in dimensions. But this assumption breaks for metric types that require specialized statistical machinery. Percentile metrics, for example, require different variance estimation than means. If the platform supports percentile metrics in the primary analysis but recomputes dimensional breakdowns using a different aggregation path, the subgroup percentile may not match the topline percentile. GrowthBook explicitly excludes quantile metrics from sequential testing, which raises the question of whether they are available in dimensional views with the same analysis settings. Most other vendors do not address metric type availability in dimensional breakdowns at all.
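A short sketch of why the machinery differs: the standard error of a mean has a closed form, while a percentile typically needs a bootstrap or a specialized estimator, so a platform that only implements the former cannot simply reuse it for P95 in a subgroup. The latency data below is invented:

```python
# Invented latency sample for one subgroup (milliseconds, right-skewed).
import numpy as np

rng = np.random.default_rng(1)
latency = rng.lognormal(mean=5.0, sigma=0.6, size=800)

# Closed-form standard error for the mean.
se_mean = latency.std(ddof=1) / np.sqrt(len(latency))

# Bootstrap standard error for P95: resample, recompute, take the spread.
boot = [np.percentile(rng.choice(latency, size=len(latency)), 95)
        for _ in range(2000)]
se_p95 = np.std(boot, ddof=1)

print(f"mean = {latency.mean():.0f} ms, SE = {se_mean:.1f}")
print(f"P95  = {np.percentile(latency, 95):.0f} ms, SE = {se_p95:.1f}")
```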

The overall picture is that dimensional analysis is one of the most widely available and least statistically rigorous features across the vendor landscape. The feature exists everywhere. The statistical safeguards that would make it reliable for decision-making exist on very few platforms. The RFP that asks "can you slice by dimension?" will get a universal yes. The RFP that asks what happens to the false positive rate, the analysis settings, and the metric type coverage when you do will get very different answers.

A complete dimensional analysis implementation means that the multiple testing correction extends to dimensional views, that variance reduction and all other analysis settings carry through consistently, that the platform distinguishes between pre-specified and exploratory analyses (or at minimum documents the false positive risk of exploratory scanning), and that all metric types in the primary analysis are available in dimensional breakdowns. Without those connections, dimensional analysis will surface false positives at an uncontrolled rate while presenting them with the same visual authority as your primary analysis.