How to Write an Experimentation Platform RFP for Experiment Coordination

Last updated: July 2026

Most experiments can safely overlap. Research from Microsoft's experimentation platform, which analyzed hundreds of concurrent tests across four products with millions of users, found that experiment interactions are empirically rare: three of the four products showed no detected interactions at all, and the fourth showed them in only 0.002% of test-pair metric combinations. A change to the search ranking algorithm does not interfere with a checkout flow redesign, and forcing them into separate traffic pools wastes capacity that both could use. But some experiments cannot overlap: two changes to the same recommendation model, two redesigns of the same card layout, two pricing tests on the same surface. For those, precise control is essential.

Experiment coordination is the problem of making overlap the default and non-overlap precise. A good platform solves two things at once. First, the technical problem: flexible mechanisms for deciding which experiments share users and which do not, without forcing everything into isolated buckets. Second, the organizational problem: making it easy to find and reason about which experiments yours should not overlap with. At a company running hundreds of experiments, no one person knows what every team is testing. The platform needs to make the small number of relevant experiments visible without burying you in the full list.

The gaps show up when only one of these problems is solved. A platform that offers mutual exclusion groups but no way to browse what is running on your surface leaves you coordinating by Slack message. A platform that shows all running experiments but makes every exclusion decision manual leaves the coordination to discipline rather than structure.

Should you add experiment coordination to your experimentation platform requirements?

Yes. To enable real throughput and parallelism, a platform needs to make it easy to reason about the few experiments that yours must not overlap with, and to overlap with the rest automatically. Most experiments at a company do not need coordination with most other experiments. They only need coordination with experiments affecting the same or related aspects of the user experience. Within that subset, however, most experiments do need coordination.

A platform that treats coordination as "put experiments in isolated buckets" solves the wrong problem. It prevents overlap at the cost of throughput. A platform that defaults to overlap and gives you precise tools for the exceptions lets you run more experiments without sacrificing validity where it matters.

At Spotify, Confidence solves both sides of this problem. Surfaces make it easy to see all experiments affecting a particular part of the user experience, so you can identify which ones yours must not overlap with. Exclusivity groups make the actual coordination straightforward: add the relevant experiments to a group, and the platform enforces non-overlap at the assignment level. Everything else overlaps by default. Holdback groups reserve clean control traffic at the surface level for measuring cumulative impact across a quarter.

The RFP question is not "do you support mutual exclusion?" Almost every platform reviewed on this page does, in some form. The question is whether the platform makes it easy to identify which experiments need coordination and then provides precise tools to enforce it, without forcing everything into isolated traffic pools.

What your RFP should ask instead of "yes or no?"

Six questions separate a connected coordination implementation from a set of isolated features.

First: does the platform support mutual exclusion groups, and how are they implemented? Mutual exclusion ensures that a user who is in one experiment cannot simultaneously be in another experiment within the same group. The implementation details matter. Some platforms use hash-based assignment with named ranges, where each user gets a deterministic value and each experiment claims a range within that space. Others use layer-based systems where experiments within a layer share a parameter space and users are assigned to at most one experiment per layer. The key questions are whether exclusion is enforced at the SDK level (guaranteeing no overlap) or is only advisory, whether you can create multiple independent exclusion groups for different product areas, whether finishing one experiment automatically frees its traffic for the next, and whether adding or removing experiments from a group mid-flight reshuffles existing assignments. Reshuffling breaks the stability that valid causal inference requires, so ask specifically whether adding a new experiment to an existing group changes the assignments of users already in other experiments within that group.
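To make the hash-based variant concrete, here is a minimal sketch of deterministic bucketing with named ranges. It does not reproduce any vendor's implementation: `ExclusionGroup`, `bucket`, the 10,000-bucket resolution, and the experiment names are all assumptions chosen for illustration. The point is the property the RFP question probes: an experiment that claims a previously unused range leaves every existing assignment untouched.

```python
import hashlib
from typing import Dict, Optional, Tuple

BUCKETS = 10_000  # assumed resolution of the hash space


def bucket(user_id: str, group_salt: str) -> int:
    """Map a user to a stable bucket in [0, BUCKETS) for one exclusion group."""
    digest = hashlib.sha256(f"{group_salt}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


class ExclusionGroup:
    """Hypothetical group: each experiment claims a disjoint bucket range."""

    def __init__(self, salt: str) -> None:
        self.salt = salt
        self.ranges: Dict[str, Tuple[int, int]] = {}  # experiment -> [start, end)

    def claim(self, experiment: str, start: int, end: int) -> None:
        """Reserve [start, end) for an experiment; reject overlapping claims."""
        if not 0 <= start < end <= BUCKETS:
            raise ValueError("range outside the hash space")
        for other, (s, e) in self.ranges.items():
            if start < e and s < end:
                raise ValueError(f"range overlaps {other}")
        self.ranges[experiment] = (start, end)

    def release(self, experiment: str) -> None:
        """Free a finished experiment's range so a later one can claim it."""
        self.ranges.pop(experiment, None)

    def assign(self, user_id: str) -> Optional[str]:
        """Return the single experiment whose range holds the user, if any."""
        b = bucket(user_id, self.salt)
        for experiment, (s, e) in self.ranges.items():
            if s <= b < e:
                return experiment
        return None


group = ExclusionGroup("home-screen")
group.claim("card-layout-a", 0, 3000)
group.claim("card-layout-b", 3000, 6000)

users = [f"user-{i}" for i in range(1000)]
before = {u: group.assign(u) for u in users}

# A third experiment takes unused buckets only, so nobody is reshuffled.
group.claim("card-layout-c", 6000, 9000)
after = {u: group.assign(u) for u in users}
unchanged = all(after[u] == before[u] for u in users if before[u] is not None)
print(unchanged)
```

Because ranges are disjoint, a user can land in at most one experiment per group. Reshuffling would only occur if existing ranges were moved or resized to make room, which is the behavior to ask vendors about.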

Second: does the platform support holdout groups for measuring cumulative impact? Holdouts reserve a percentage of traffic that sees no experiments, creating a clean control for measuring the combined effect of all shipped features over a period. The design of the holdout determines what you can measure. A simple holdout withholds users from all experiments and compares them to the general population at the end of the period. A more sophisticated design splits the held-out traffic into a status-quo group (users who always see the control experience) and a winning-variants group (users who see the winning treatment of each concluded experiment), enabling a richer comparison. Ask how holdout groups are configured, whether they apply globally or can be scoped to specific experiments or surfaces, what percentage of traffic they reserve, and whether the platform provides built-in analysis for comparing the holdout group against the general population.
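The split holdout design can be sketched in a few lines. This is an illustration under stated assumptions, not any vendor's implementation: the function names, the 5% holdout share, the even split between the two held-out groups, and the `"2026-Q3"` period label used as a salt are all hypothetical.

```python
import hashlib


def unit_interval(user_id: str, salt: str) -> float:
    """Deterministically map a user to a value in [0, 1) for a given salt."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def holdout_arm(user_id: str, period: str, holdout_share: float = 0.05) -> str:
    """Classify a user as general population or one of two held-out groups."""
    if unit_interval(user_id, f"holdout:{period}") >= holdout_share:
        return "general"  # eligible for experiments as usual
    # A second, independently salted hash splits held-out users in half.
    if unit_interval(user_id, f"holdout-split:{period}") < 0.5:
        return "status_quo"  # always sees the control experience
    return "winning_variants"  # sees the winner of each concluded experiment


arms = [holdout_arm(f"user-{i}", "2026-Q3") for i in range(20_000)]
held_out = [a for a in arms if a != "general"]
print(len(held_out) / len(arms))  # close to the configured holdout share
```

Salting the hash with the period means a new period draws a fresh holdout population, which is one simple way to rotate holdouts. Whether a given platform rotates this way is a question for the vendor.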

Third: does the platform help you organize and manage experiments at scale? When multiple teams experiment on the same product, the number of concurrent experiments quickly exceeds what anyone can keep track of manually. Organizing experiments by surface or product area (home screen, search, checkout, player) provides a natural structure: experiments within a surface are candidates for mutual exclusion, while experiments on different surfaces can overlap safely. Without this kind of organization, coordination becomes a manual process of checking a shared spreadsheet or asking in a channel whether anyone else is running something on the same page. Ask whether the platform provides a way to group experiments by product area, whether exclusion can be scoped to a surface, and whether there is a timeline or calendar view showing all active and planned experiments per surface.

Fourth: does the platform show traffic availability across the organization? At scale, the most common coordination failure is not technical. It is that teams make allocation decisions without knowing what traffic is already committed. If five experiments in the same exclusion group each take 20%, the group is fully allocated. A sixth experiment has nowhere to go, and nobody finds out until launch. The platform should surface remaining available traffic before the experiment launches, not after. Ask whether the platform shows remaining available traffic within an exclusion group or surface, whether it prevents or warns about over-allocation, and whether the view includes planned experiments that have not yet started so teams can coordinate proactively.
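The over-allocation scenario above is easy to check mechanically. The sketch below is a simplified model, not a platform API: `Allocation`, `remaining_percent`, and `check_request` are hypothetical names, and traffic is tracked in whole percentage points to avoid floating-point rounding.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Allocation:
    experiment: str
    percent: int  # whole percentage points of the exclusion group's traffic
    status: str   # "active" or "planned"


def remaining_percent(allocations: Iterable[Allocation],
                      include_planned: bool = True) -> int:
    """Traffic still unclaimed in the group, optionally counting planned work."""
    used = sum(a.percent for a in allocations
               if include_planned or a.status == "active")
    return 100 - used


def check_request(allocations: List[Allocation], requested_percent: int,
                  include_planned: bool = True) -> str:
    """Return "ok" or a warning before launch rather than failing at launch."""
    remaining = remaining_percent(allocations, include_planned)
    if requested_percent <= remaining:
        return "ok"
    return (f"over-allocated: requested {requested_percent}%, "
            f"only {max(remaining, 0)}% available")


group = [Allocation(f"exp-{i}", 20, "active") for i in range(1, 5)]
group.append(Allocation("exp-5", 20, "planned"))

print(remaining_percent(group))                         # 0
print(remaining_percent(group, include_planned=False))  # 20
print(check_request(group, 20))
```

Counting only active experiments, a sixth experiment appears to fit. Including the planned one shows the group is already full, which is why the question about planned experiments matters.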

Fifth: does the sample size calculator account for coordination constraints? This is the planning-analysis link described in the sample size post in this series. A holdout group that reserves 5% of traffic means your experiment can reach at most 95% of the eligible population. Mutual exclusion with three other experiments in a four-way split means your experiment gets roughly 25%. If the sample size calculator estimates runtime based on full traffic while coordination constraints leave you with a fraction, the runtime estimate will be wrong, sometimes by weeks. Ask whether the calculator surfaces the reachable population after holdouts and exclusion groups are applied, and whether the runtime estimate reflects the traffic the experiment will actually receive.
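The arithmetic behind that runtime gap can be worked through directly. The sketch below assumes a deliberately simple model: a fixed number of new eligible users arrives each day, and holdout and exclusion constraints multiply. The function names and the example figures (400,000 required users, 100,000 eligible users per day) are illustrative assumptions, not numbers from any platform.

```python
import math


def reachable_fraction(holdout_share: float, exclusion_share: float) -> float:
    """Fraction of eligible traffic left after the holdout and exclusion split."""
    return (1.0 - holdout_share) * exclusion_share


def runtime_days(required_sample: int, daily_eligible_users: int,
                 fraction: float = 1.0) -> int:
    """Days needed to reach the required sample at a given traffic fraction."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return math.ceil(required_sample / (daily_eligible_users * fraction))


# 5% holdout, then one quarter of the remaining traffic in a four-way split.
fraction = reachable_fraction(holdout_share=0.05, exclusion_share=0.25)

naive = runtime_days(400_000, 100_000)                 # assumes full traffic
constrained = runtime_days(400_000, 100_000, fraction)

print(naive)        # 4
print(constrained)  # 17
```

In this example the reachable fraction is about 24% of eligible traffic, so an estimate based on full traffic is off by roughly two weeks.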

Sixth: does the platform provide a coordination overview across the organization? A coordination overview is a single view that shows all active and planned experiments, their traffic allocations, their exclusion groups, and their surfaces. Without it, teams make local decisions (this experiment needs 50% of home screen traffic) without global context (the home screen is already 80% allocated). Ask whether the platform provides a view of all running experiments with their traffic allocations, whether experiments can be filtered by surface or exclusion group, and whether the view distinguishes between active and planned experiments.

What the answers actually look like across vendors

Here is how the major platforms stand as of this writing. Vendor capabilities are based on public documentation. Confidence capabilities are based on the product itself.

Cell value legend: "Yes" and "No" are based on explicit evidence in the vendor's public documentation. "—" means the platform does not offer the broader feature at all, so the column does not apply. "Not documented" means we found no explicit evidence either way, not that the capability is confirmed absent.

| Platform | Mutual exclusion? | Holdout groups? | Experiment organization? | Traffic availability overview? | Sample size accounts for coordination? | Coordination overview? | Other gaps |
|---|---|---|---|---|---|---|---|
| Confidence | Yes | Yes | Yes (domains and surfaces) | Yes | Yes | Yes (timeline per domain) | |
| GrowthBook | Yes (namespaces) | Yes | Partial (programs) | Partial | No | Partial | Namespace ranges are manual |
| Eppo | Yes (layers) | Yes (status-quo + winners split) | Not documented | Partial | No | Partial (layer view) | |
| Statsig | Yes (layers) | Yes | Not documented | Yes | No | Yes | |
| Optimizely | Yes (exclusion groups) | Yes | Not documented | Partial | No | Not documented | Cannot add experiments after start |
| LaunchDarkly | Yes (layers) | Yes | Not documented | Partial | No | Partial | |
| Amplitude | Yes (exclusion groups) | Yes | Not documented | Partial | No | Partial | Cannot change allocation after creation |
| VWO | Yes | Partial (enterprise only) | Not documented | Partial | No | Not documented | Holdouts limited to enterprise |
| PostHog | No | Yes | No | Partial (per-experiment) | No | No | No mutual exclusion |

Four patterns emerge from this comparison.

The first pattern is the gap between having coordination features and connecting them to planning. Every vendor except PostHog offers some form of mutual exclusion. Every vendor offers holdout groups in some form. But only Confidence feeds coordination constraints back into the sample size calculator. As described in the sample size post in this series, the planning-analysis disconnect is the most common structural gap across the vendor landscape. Coordination makes this gap wider: the more experiments that share a traffic pool, the more the reachable population diverges from the total population, and the less accurate a runtime estimate based on total traffic becomes.

The second pattern is the maturity of holdout implementations. The simplest holdout reserves a percentage of traffic from all experiments. GrowthBook, Statsig, LaunchDarkly, and Optimizely offer this basic model. Eppo goes further with a three-group design that splits held-out traffic into a status-quo group and a winning-variants group, enabling measurement of both the combined impact and the incremental value of shipping winners. Confidence implements holdbacks at the domain level, scoped to product surfaces, with quarterly rotation built into the workflow. VWO offers holdouts only at the enterprise tier. PostHog supports holdouts but without the mutual exclusion that typically accompanies them in a mature coordination system.

The third pattern is the absence of organizational structure for experiments. Confidence is the only platform reviewed that offers explicit domain and surface concepts for organizing experiments by product area. GrowthBook's experimentation programs provide a partial equivalent, grouping experiments into organizational units with shared configuration. The remaining vendors rely on layers or exclusion groups as the primary organizational unit, which works for preventing overlap but does not help teams manage the broader coordination problem: seeing all experiments on a given surface, understanding how much traffic is available there, and planning the next quarter's work.

The fourth pattern is the rigidity of exclusion group management. Several platforms impose constraints that affect operational flexibility. Optimizely requires that all experiments be added to an exclusion group before any of them starts; adding an experiment after the group is active can reshuffle assignments. Amplitude does not allow traffic allocation percentages to be changed after a mutual exclusion group is created. These constraints exist for valid statistical reasons (maintaining stable assignment) but they make it harder to respond to changing priorities during a quarter. Platforms that handle reallocation gracefully, by assigning only new users to the new experiment while maintaining existing assignments, offer more operational flexibility without compromising validity.

Mutual exclusion and holdouts are now table stakes. What separates the platforms is whether those features connect to planning (sample size and runtime estimation), to visibility (traffic availability across the organization), and to organization (surface-level experiment management). An RFP that asks "do you support mutual exclusion?" will get a yes from almost every vendor. An RFP that asks how coordination connects to the rest of the experimentation pipeline will get very different answers.