
How to Write an Experimentation Platform RFP for Experiment Design Review

Last updated: July 2026

Every experimentation platform lets you set up an experiment. You pick metrics, define variants, choose an audience, and click launch. What happens between "design is ready" and "experiment is live" varies enormously. In some platforms, the person who created the experiment is also the person who launches it, with no structured checkpoint between the two. In others, the design goes through a review workflow that resembles a pull request: teammates inspect the setup, leave comments on specific sections, and approve or request changes. The experiment cannot launch until the required reviewers sign off.

The difference matters most when experimentation scales beyond a single team. When dozens of teams share the same traffic, a misconfigured audience filter or a missing guardrail metric does more than waste one experiment's time. It can pollute the results of every overlapping experiment and erode trust in the platform. A structured design review catches these problems before they reach production, the same way code review catches bugs before they reach users.

The RFP checkbox "does the platform support experiment review?" will get a yes from every vendor. But most of those answers describe flag-level approval workflows, not experiment design reviews. The difference between platforms lies in whether the review process is structured around the experiment design itself, whether it can block launch, and whether it scales governance across product areas without slowing teams down.

Should you add experiment design review to your experimentation platform requirements?

Yes, and it becomes more important as the number of experimenters grows. When a single data scientist runs every experiment, that person is both the designer and the reviewer. The quality of the design depends on their expertise, and the feedback loop is internal. When product managers, engineers, and data scientists all create experiments, the expertise is distributed unevenly. A product manager might choose the right metrics but set the wrong audience. An engineer might set up the flag correctly but forget to add a guardrail. A structured review workflow lets the right person catch the right mistake at the right time, before the experiment starts collecting data.

The failure modes are specific and preventable. An experiment launches without a guardrail metric, and the team ships a change that improves engagement but degrades performance. An experiment targets 100% of users on a high-traffic surface when a 10% ramp would have provided enough statistical power. An experiment launches with the wrong metric definition, and the team discovers the error three weeks in, when the data collected so far is already useless. Each of these is a design problem, not an analysis problem. No amount of statistical rigor in the results can fix a flawed setup.

At Spotify, experiment design review in Confidence follows the same pattern as code review. The experimenter requests reviews from teammates. Surfaces can require specific reviewers, so high-traffic product areas always get a second pair of eyes. Reviewers comment on specific sections of the experiment design and approve or reject the overall setup. Changes to the flag, variants, audience, or allocation after approval reset the approval status so reviewers always see the final design. The workflow keeps the bar high without adding process for experiments where the risk is low, because surfaces that do not require reviewers can still use the workflow optionally.

Confidence also supports AI reviewer bots that can be added to the review flow like any other reviewer. Teams configure a bot with a prompt that defines what it should check, and the bot reviews the experiment design accordingly. This helps in two ways: it standardizes scrutiny so that common mistakes (missing guardrails, wrong audiences, underpowered designs) are caught consistently regardless of who set up the experiment, and it scales review capacity so that human reviewers can focus on judgment calls rather than checklists.

Without a structured design review, governance becomes informal. Teams rely on Slack threads, shared documents, or calendar invites to get feedback on experiment designs. That works until someone forgets to ask, or asks but does not wait for the answer. A platform-native review workflow makes the checkpoint visible, trackable, and enforceable.

What your RFP should ask instead of a yes/no question

Seven questions separate a structured experiment design review from a platform that offers no checkpoint between configuration and launch.

First: does the platform have a structured review workflow for experiment designs before launch? The most basic question is whether the platform offers any formal review step between completing the experiment setup and launching it. A structured workflow means that the experimenter can request feedback on the design, reviewers can inspect the configuration, and the review status is tracked in the platform. The distinction from a flag-level approval matters here. Flag approvals protect against unintended production changes. Experiment design reviews protect against launching a poorly designed experiment. Both are valuable, but they solve different problems. Ask whether the platform has a review workflow that is specific to experiment design, or whether it relies on general-purpose flag approval mechanisms.

Second: can reviews be required, blocking launch until approved? An advisory review that anyone can skip is useful but not enough for governance. When an experiment runs on a surface that serves millions of users, the organization may need a guarantee that at least one qualified reviewer approved the design before launch. This means the platform must support required reviewers whose approval is a prerequisite for starting the experiment. Ask whether reviews can be configured as blocking (the experiment cannot launch without approval) or whether they are purely advisory. If blocking reviews are supported, ask whether the platform enforces the block at the system level or relies on social norms.

Third: can review requirements be configured per surface or product area? Not every experiment carries the same risk. A low-traffic experiment on a secondary feature may need no formal review at all. A high-traffic experiment on a core product surface may need sign-off from a data scientist and a product lead. The platform should let surface owners or administrators decide whether reviews are optional, suggested, or required for experiments on their surface, and who the designated reviewers are. Without per-surface configuration, the organization faces an all-or-nothing choice: require reviews everywhere (which slows low-risk experiments) or require them nowhere (which exposes high-risk experiments). Ask whether review policies can vary by surface, product area, or team, and whether surface owners can manage their own reviewer lists.
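
To make the second and third questions concrete, here is a minimal sketch of a per-surface review policy combined with a system-level launch gate. It is an illustration under stated assumptions, not any vendor's API: the names `ReviewMode`, `SurfacePolicy`, `Experiment`, and `can_launch` are hypothetical, and the rule "at least `min_approvals` approvals from the surface's designated reviewers" is one plausible reading of a blocking review.

```python
from dataclasses import dataclass, field
from enum import Enum


class ReviewMode(Enum):
    OPTIONAL = "optional"
    SUGGESTED = "suggested"
    REQUIRED = "required"


@dataclass(frozen=True)
class SurfacePolicy:
    """Review policy owned by a surface (hypothetical model)."""
    mode: ReviewMode
    designated_reviewers: frozenset[str] = frozenset()
    min_approvals: int = 1


@dataclass
class Experiment:
    name: str
    surface: str
    approvals: set[str] = field(default_factory=set)


def can_launch(experiment: Experiment, policies: dict[str, SurfacePolicy]) -> bool:
    """Return True if the surface's policy allows this experiment to start."""
    # Assumption: surfaces without an explicit policy treat review as optional.
    policy = policies.get(experiment.surface, SurfacePolicy(ReviewMode.OPTIONAL))
    if policy.mode is not ReviewMode.REQUIRED:
        return True  # optional and suggested reviews never block launch
    # Only approvals from the surface's designated reviewers count.
    qualifying = experiment.approvals & policy.designated_reviewers
    return len(qualifying) >= policy.min_approvals
```

In this sketch the block is enforced in code rather than by convention: an approval from someone outside the designated list does not unlock a required surface, while an experiment on a surface with no policy launches freely.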

Fourth: can reviewers comment on specific parts of the experiment setup? A general comment box ("looks good") is less useful than the ability to leave feedback on a specific section of the experiment design. If the reviewer has a concern about the audience definition, that comment should attach to the audience section, not float in a general thread where it might be missed. The same applies to metrics, variants, allocation, and hypothesis. Thread-based discussions with the ability to resolve individual threads (the way pull request reviews work in GitHub) make it clear what has been addressed and what is still open. Ask whether the platform supports section-specific comments, threaded discussions, and thread resolution on the experiment design page.
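
As a rough illustration of what section-specific comments with thread resolution imply for the data model, the sketch below attaches every thread to a named section of the design and reports what is still open. The section list and the `Thread` and `DesignReview` classes are hypothetical and do not describe any platform's actual schema.

```python
from dataclasses import dataclass, field

# Assumption: the design page is divided into these named sections.
SECTIONS = ("hypothesis", "metrics", "variants", "audience", "allocation")


@dataclass
class Thread:
    section: str
    comments: list[str] = field(default_factory=list)
    resolved: bool = False


@dataclass
class DesignReview:
    threads: list[Thread] = field(default_factory=list)

    def comment(self, section: str, text: str) -> Thread:
        """Start a new thread anchored to one section of the design."""
        if section not in SECTIONS:
            raise ValueError(f"unknown section: {section}")
        thread = Thread(section, [text])
        self.threads.append(thread)
        return thread

    def open_by_section(self) -> dict[str, int]:
        """Count unresolved threads per section."""
        counts: dict[str, int] = {}
        for thread in self.threads:
            if not thread.resolved:
                counts[thread.section] = counts.get(thread.section, 0) + 1
        return counts
```

Because each thread carries its section, the experimenter can see at a glance that, for example, the audience still has an open concern while the metrics feedback has been resolved.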

Fifth: does the platform track review status and notify reviewers? A review workflow is only useful if reviewers know they have been asked. The platform should notify reviewers when a review is requested, surface pending reviews in a central location (a to-do list or inbox), and integrate with the team's communication tools. Slack notifications, email alerts, or in-app badges are all reasonable mechanisms. Without notifications, review requests sit unnoticed and experiments stall in the queue. Ask whether the platform notifies reviewers on request, whether it provides a centralized view of pending reviews, and whether it integrates with Slack, email, or other communication tools.

Sixth: do changes to an approved experiment require re-approval? An experiment that was approved on Monday may be modified on Tuesday. If the experimenter changes the audience, adds a variant, or modifies the allocation after approval, the reviewer's original sign-off no longer applies to the current design. The platform should reset approval status when material changes are made, so that required reviewers must re-approve the updated design. Without this, the review workflow has a loophole: get approval first, make changes later. Ask whether the platform tracks which version of the design was approved, which changes trigger a re-approval requirement, and whether re-approval is automatic or manual.
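
One way to picture "which version of the design was approved" is to bind each approval to a fingerprint of the material fields, so that an approval stops counting as soon as any of those fields changes. The sketch below assumes the material fields are flag, variants, audience, and allocation; `design_fingerprint` and `ReviewState` are hypothetical names, and the hashing approach is an illustrative choice rather than a description of how any platform implements re-approval.

```python
import hashlib
import json
from dataclasses import dataclass, field

# Assumption: only changes to these fields invalidate an approval.
MATERIAL_FIELDS = ("flag", "variants", "audience", "allocation")


def design_fingerprint(design: dict) -> str:
    """Hash the material fields of a design in a key-order-independent way."""
    material = {key: design.get(key) for key in MATERIAL_FIELDS}
    payload = json.dumps(material, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


@dataclass
class ReviewState:
    # Maps reviewer name to the fingerprint of the design they approved.
    approvals: dict[str, str] = field(default_factory=dict)

    def approve(self, reviewer: str, design: dict) -> None:
        self.approvals[reviewer] = design_fingerprint(design)

    def valid_approvers(self, design: dict) -> set[str]:
        """Reviewers whose approval matches the current design."""
        current = design_fingerprint(design)
        return {r for r, fp in self.approvals.items() if fp == current}
```

With this shape, editing a non-material field such as a description leaves approvals intact, while changing the allocation from 10% to 50% silently invalidates them until the reviewer approves again.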

Seventh: does the platform support automated or AI-powered reviewers? As experiment volume grows, human reviewers become a bottleneck. A platform that supports automated reviewer bots lets teams define review criteria in a prompt or configuration, and the bot reviews each experiment design against those criteria like any other reviewer. The benefits are consistency and capacity: routine checks for common mistakes (missing guardrails, wrong audiences, underpowered designs) run on every design regardless of who set it up, and human reviewers can spend their time on judgment calls rather than checklists. Ask whether the platform supports adding automated reviewers to the review workflow, whether the review criteria are configurable, and whether the automated review integrates with the same approval flow as human reviews.
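
The sketch below shows the general shape of a configurable automated reviewer. It is a deterministic, rule-based stand-in, not an AI reviewer and not any vendor's implementation: the checks, the design fields (`guardrail_metrics`, `hypothesis`, `allocation`), and the `run_bot` function are all hypothetical. The point is only that criteria are configured per team and that the bot returns an approve-or-reject result in the same form a human reviewer would.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Design = dict
# A check returns a finding message, or None when the design passes.
Check = Callable[[Design], Optional[str]]


def needs_guardrail(design: Design) -> Optional[str]:
    if not design.get("guardrail_metrics"):
        return "no guardrail metric attached"
    return None


def needs_hypothesis(design: Design) -> Optional[str]:
    if not str(design.get("hypothesis", "")).strip():
        return "hypothesis is empty"
    return None


def allocation_within(max_share: float) -> Check:
    """Build a check that flags allocations above a configured share of traffic."""
    def check(design: Design) -> Optional[str]:
        share = design.get("allocation", 0.0)
        if share > max_share:
            return f"allocation {share:.0%} exceeds limit of {max_share:.0%}"
        return None
    return check


@dataclass
class BotReview:
    approved: bool
    findings: list[str]


def run_bot(design: Design, checks: list[Check]) -> BotReview:
    """Run every configured check; approve only if none reports a finding."""
    findings = [msg for check in checks if (msg := check(design)) is not None]
    return BotReview(approved=not findings, findings=findings)
```

A team would configure the bot by choosing its list of checks, for example `[needs_guardrail, needs_hypothesis, allocation_within(0.2)]`, and then treat the resulting `BotReview` like one more reviewer in the approval flow.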

What the answers actually look like across vendors

Here is how the major platforms stand as of this writing. Vendor capabilities are based on public documentation. Confidence capabilities are based on the product itself.

"Yes" and "No" are based on explicit evidence in the vendor's public documentation. "Not documented" means we found no explicit evidence either way, not that the capability is confirmed absent.

| Platform | Design review workflow? | Blocking reviews? | Per-surface configuration? | Section-specific comments? | Review notifications? | Re-approval on changes? | AI/automated reviewers? | Other gaps |
|---|---|---|---|---|---|---|---|---|
| Confidence | Yes | Yes | Yes (per-surface) | Yes (comment zones, threads) | Yes (Slack, to-do list) | Yes (resets on changes) | Yes (configurable bots) | |
| GrowthBook | Partial (flag-level) | Yes | Partial (per-environment) | Partial (comments, diff view) | Not documented | Not documented | No | Flag approval, not experiment design |
| Eppo | Partial (flag-level) | Partial (admins bypass) | No | No | Yes (email) | Not documented | No | Admins bypass approvals |
| Statsig | Yes | Yes | Yes (per-team, per-entity) | Partial | Yes (Slack, email) | Not documented | No | Discussion panel for results, not design |
| LaunchDarkly | Yes (flag-level) | Yes (enterprise only) | Partial (per-environment) | Partial | Yes (Slack, email, Teams) | Not documented | No | Flag-level, not experiment-specific |
| PostHog | Partial (flag-level) | Yes (quorum-based) | Partial (per-project) | No | Yes (email) | Partial | No | Flag actions, not experiment design |
| Amplitude | Partial | Yes | Partial (per-project) | No | Yes (in-app) | Partial | No | No section-specific feedback |
| Optimizely | Yes (flag-level) | Yes | Partial (per-environment) | Partial | Yes (email) | Not documented | No | Flag-level with action scoping |
| VWO | Partial | Yes | No | No | Yes (email) | Not documented | No | Auto-approval timeout can bypass |

Four patterns emerge from this comparison.

The first pattern is the conflation of flag approval with experiment design review. Almost every vendor offers some form of approval workflow, but in most cases the workflow gates flag changes rather than experiment designs. GrowthBook's approval flows trigger when you change a feature flag and want to publish the change. Eppo's approvals activate when a non-admin modifies a production flag. LaunchDarkly's approval system operates at the flag level, with environment-scoped requirements and tag-based filtering. PostHog's approval policies gate specific flag actions like enabling a flag or changing a rollout percentage. These workflows protect production stability, which is valuable. But they do not address the experiment-specific questions that a design review should cover: Are the right metrics attached? Is the audience correctly scoped? Is the hypothesis clearly stated? Is the sample size large enough? Flag approval asks "should this configuration go live?" Experiment design review asks "is this a well-designed experiment?" Only Statsig and Confidence offer review workflows that are framed around the experiment itself rather than the underlying flag.

The second pattern is the gap in section-specific feedback. In a code review, you comment on a specific line. In most experimentation platforms, you get a general comment box attached to an approval request. GrowthBook shows a diff of flag changes with comments alongside. Statsig has a discussion panel where teammates can leave comments inline on experiment results, though this is documented for the results phase rather than the design phase. LaunchDarkly and Optimizely allow comments on approval requests, but these are general comments on the entire change, not targeted feedback on the audience definition or the metric selection. VWO, Eppo, PostHog, and Amplitude do not document section-specific commenting on experiment designs. Confidence's comment zones on specific sections of the design page, with threaded discussions and resolution, are the closest parallel to a pull request review. The practical consequence is that in most platforms, feedback on a specific design choice gets buried in a general comment thread or happens outside the platform entirely.

The third pattern is uneven configuration granularity. Statsig stands out for offering review requirements at multiple levels: project-wide, per-team, per-entity, and per-environment, with designated reviewer teams for each. LaunchDarkly allows per-environment configuration with optional tag-based filtering, though this is an enterprise-only feature. Optimizely lets administrators select specific flags for approval requirements and scope by environment. PostHog offers per-project policies with scope rules for different flag actions. GrowthBook configures approvals per environment. Eppo and VWO configure approvals globally, with no per-surface or per-area granularity. For organizations where different product areas have different risk profiles, global configuration forces a one-size-fits-all policy. Confidence's per-surface configuration, where surface owners decide whether reviews are required or suggested and who the designated reviewers are, maps directly to how product organizations are structured.

The fourth pattern is the re-approval gap. When an experimenter changes the design after receiving approval, the approval should no longer apply to the modified version. PostHog addresses this through version tracking: if anyone modifies the feature flag while a change request is pending, the flag's version increments and the pending request becomes inapplicable. Amplitude requires re-approval for changes to critical fields in live experiments. But most vendors do not document what happens to an existing approval when the underlying configuration changes. Confidence explicitly resets approval status when the flag, variants, audience, or allocation change, requiring re-approval from required reviewers. Without this safeguard, the review workflow has a gap: approval is granted for one version of the design, but a different version is what actually launches.

The RFP question "does the platform support experiment review?" will get a yes from every vendor, because every platform has some form of approval workflow. The question that distinguishes them is whether the review is structured around the experiment design itself: whether reviewers can inspect the full setup and leave targeted feedback, whether approval can block launch on high-risk surfaces without slowing low-risk experiments, whether surface owners control their own review policies, and whether changes after approval require re-approval. Without that structure, experiment design review is an informal process that depends on the diligence of individual experimenters, and at scale, informal processes have gaps.