How to Write an Experimentation Platform RFP for Multi-Metric Decision Making

Last updated: May 2026

Every experimentation platform lets you add multiple metrics to an experiment. Most will display results for each one, neatly arranged in a scorecard. The scorecard tells you which metrics moved up, which moved down, and which were inconclusive. What it rarely tells you is what to do next. Should you ship?

The decision to launch a feature is almost never based on a single metric moving in the right direction. Teams look at the complete picture: Did success metrics improve? Did guardrail metrics stay within acceptable bounds? When the answers point in different directions, a table of individual metric results does not resolve the question. It surfaces the question and leaves it to you.

At Spotify, the median experiment has two user-defined success metrics and four guardrail metrics. With six metrics, each of which can improve, deteriorate, or stay flat, there are hundreds of possible outcome combinations, enough that the informal approach of scanning the scorecard and applying judgment does not scale. The platform needs to do more than display results. It needs to distinguish what each metric means for the shipping decision, enforce that distinction in the statistical analysis, and connect the metric roles to the planning stage so the experiment is designed to answer the question it actually needs to answer.

Should you add multi-metric decision making to your experimentation platform requirements?

Carefully. This is the topic where the gap between what the platform provides and what the practitioner needs is widest. The gap is equally wide in Bayesian and frequentist frameworks. Neither a p-value nor a posterior probability tells you whether a 3% lift on retention is worth a 1% drop in engagement. That is a decision problem, not a statistics problem, and no inference framework solves it automatically.

Put this on your list only if you define what you actually want. "Multi-metric support" on an RFP will get a yes from every vendor. What separates a useful implementation from a dashboard of numbers is whether the platform enforces the distinction between metric roles and carries that distinction through to the correction procedure, the sample size calculation, the monitoring logic, and the shipping recommendation. A guardrail metric that appears in a separate section of the results page but receives the same statistical treatment as a success metric is a display label, not a decision tool.

At Spotify, Confidence distinguishes success metrics from guardrail metrics at the analysis level, not just the display level. The label changes the statistical treatment and the shipping recommendation.

A multi-metric framework that only relabels metrics in the UI without changing the underlying analysis creates the appearance of guardrail protection where none exists. Our experience building Confidence taught us that the apparent benefit of a display-only distinction does not outweigh the confusion it introduces. We would rather show teams a simple scorecard that requires explicit judgment than ship an automated framework that enforces the wrong distinctions behind a reassuring interface.

What your RFP should ask instead of "yes or no"

Six questions separate an implementation where metric roles affect analysis from one where they affect only the display.

First: does the platform distinguish between success metrics and guardrail metrics, and is the distinction enforced? Every vendor lets you label metrics. The question is whether the label changes anything. A guardrail metric that is tested for improvement the same way as a success metric, displayed in a different section of the scorecard, and included in the same correction family has no functional distinction from a success metric with a different name. The distinction matters only if it changes the statistical treatment: the direction of the test (looking for deterioration rather than improvement), the correction family (excluded from the correction applied to success metrics), and the shipping logic (a breached guardrail blocks the ship recommendation regardless of success metric results). Ask whether the guardrail label changes the analysis, or only the display.
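
To make the contrast concrete, here is a minimal sketch of how a metric role can change the test itself, not just the label. The function name, the normal approximation, and the `margin` parameter are illustrative assumptions, not any vendor's API:

```python
from scipy.stats import norm

def metric_pvalue(delta_hat, se, role, margin=0.0):
    """P-value whose test direction depends on the metric role.

    Success metrics get a two-sided test for any movement; guardrail
    metrics get a one-sided test for deterioration beyond an
    acceptable margin (assuming higher is better for the metric).
    """
    if role == "success":
        return 2 * norm.sf(abs(delta_hat / se))
    if role == "guardrail":
        # H1: the metric dropped by more than `margin`. A small
        # p-value here is a breach that blocks the ship recommendation.
        return norm.cdf((delta_hat + margin) / se)
    raise ValueError(f"unknown metric role: {role!r}")
```

A platform that enforces the distinction computes something like this per role; a platform that only relabels runs the success-metric branch for every metric.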

Second: is the metric role reflected in the multiple testing correction? This is where the distinction between metric roles has its largest statistical impact. If you have two success metrics and four guardrail metrics, a correction applied across all six metrics tests each one at a significance threshold six times stricter than the nominal alpha. Restricting the correction to the two success metrics tests each at alpha divided by two. The difference is substantial: at Spotify, restricting the correction family to success metrics alone cuts the power cost roughly in half. The statistical logic is straightforward. Success metrics must be corrected because a false positive on any one of them leads to shipping something that does not work. Guardrail metrics do not need the same correction because a false positive on a guardrail leads to not shipping, which is a conservative error rather than a costly one. A platform that includes guardrails in the success metric correction family pays a power penalty for a protection that does not match the decision logic. Ask whether guardrail metrics are excluded from the correction applied to success metrics, and whether the platform documents the rationale.
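
The arithmetic is easy to check yourself. Here is a hedged sketch comparing the per-test thresholds under the two correction families and the power they imply at a fixed standardized effect; the effect size of 2.8 standard errors is an arbitrary illustration, not a recommendation:

```python
from scipy.stats import norm

alpha, n_success, n_guardrail = 0.05, 2, 4

# Per-test threshold when the Bonferroni family covers all six metrics
# versus only the two success metrics.
alpha_all     = alpha / (n_success + n_guardrail)   # 0.05 / 6 ≈ 0.0083
alpha_success = alpha / n_success                   # 0.05 / 2 =  0.025

# Power of a two-sided z-test at a fixed standardized effect of 2.8
# standard errors, under each per-test threshold.
z_effect = 2.8
for family, a in [("all six metrics", alpha_all), ("success only", alpha_success)]:
    power = norm.sf(norm.ppf(1 - a / 2) - z_effect)
    print(f"{family:16s} per-test alpha = {a:.4f}, power ≈ {power:.2f}")
```

Running this shows the success-only family recovering a large share of the power lost to the all-metrics correction; the exact numbers depend on the effect size.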

Third: is the metric role reflected in the sample size calculation? Adding a second success metric should update the required sample size. With Bonferroni correction, two success metrics means testing each at alpha divided by two, which increases the required sample size by roughly 20%. Adding a guardrail metric evaluated with an inferiority test should not increase the sample size in the same way, because the guardrail is tested in the opposite direction and does not enter the success metric correction family. However, guardrail metrics do affect power through a different mechanism: the requirement that all guardrails pass simultaneously means the probability of incorrectly failing at least one guardrail compounds with each additional metric, so the power level for each individual guardrail needs to be adjusted. A sample size calculator that treats all metrics identically, or one that ignores metric roles entirely, will produce either an overpowered or underpowered experiment depending on the ratio of success to guardrail metrics. Ask whether the calculator distinguishes between metric types when computing the required sample size.
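
To show how the two mechanisms differ, here is a simplified sketch of a role-aware calculator under a normal approximation, assuming independent guardrails; the `delta` and `sigma` values are placeholders, and the function is illustrative rather than any vendor's implementation:

```python
from scipy.stats import norm

def n_per_group(alpha, power, delta, sigma, two_sided=True):
    """Normal-approximation sample size per group for a z-test."""
    z_a = norm.ppf(1 - alpha / 2) if two_sided else norm.ppf(1 - alpha)
    z_b = norm.ppf(power)
    return 2 * (sigma / delta) ** 2 * (z_a + z_b) ** 2

alpha, power = 0.05, 0.80
n_success, n_guardrail = 2, 4

# Success metrics: two-sided tests, Bonferroni within the success family.
n_succ = n_per_group(alpha / n_success, power, delta=0.01, sigma=0.5)

# Guardrails: one-sided tests at unadjusted alpha, but with per-guardrail
# power raised so that all four pass simultaneously with probability 0.80.
power_each = power ** (1 / n_guardrail)   # ≈ 0.946 under independence
n_guard = n_per_group(alpha, power_each, delta=0.01, sigma=0.5, two_sided=False)

# The experiment is sized to the most demanding metric.
print(round(n_succ), round(n_guard), round(max(n_succ, n_guard)))
```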

Fourth: is the shipping decision framework documented? When success metrics show improvement and guardrails show deterioration, what does the platform recommend? Most vendors surface all metrics and leave the decision to the experimenter. That is a reasonable default for simple experiments, but it does not scale to organizations running hundreds of experiments across dozens of teams with varying statistical sophistication. A documented decision framework specifies the rules before the experiment starts: what combination of metric outcomes leads to "ship," "do not ship," or "discuss with stakeholders." The framework does not need to be rigid. It can recommend rather than enforce, and it can escalate ambiguous cases to human judgment. But the rules should be specified before the data arrives, not improvised after. Ask whether the platform provides a configurable decision framework, whether it allows you to define the rules before the experiment starts, and what happens when success and guardrail metrics disagree.
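
In code, a pre-registered framework can be as simple as a function agreed on before the experiment starts. The rules below are illustrative, loosely in the spirit of a require-all-successes, block-on-any-breach preset, not any vendor's exact logic:

```python
def recommendation(success_outcomes, guardrail_outcomes):
    """Map metric outcomes to a pre-registered recommendation.

    Each outcome is "positive", "negative", or "inconclusive", already
    computed with the role-appropriate test and correction.
    """
    if any(o == "negative" for o in guardrail_outcomes):
        return "do not ship"       # a breached guardrail blocks the ship call
    if all(o == "positive" for o in success_outcomes):
        return "ship"
    return "discuss with stakeholders"   # mixed or inconclusive: escalate

print(recommendation(["positive", "positive"], ["inconclusive"] * 4))
# -> ship
print(recommendation(["positive", "positive"], ["negative"] + ["inconclusive"] * 3))
# -> do not ship
```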

Fifth: is multi-metric decision making handled explicitly in both Bayesian and frequentist frameworks? The multiple comparisons problem is not exclusive to frequentist inference. Ten metrics tested with weak Bayesian priors and a ship-if-any rule will produce false positives at roughly the same rate as ten uncorrected frequentist tests. Any platform that claims Bayesian inference inherently avoids the multi-metric problem without specifying a correction procedure is making a claim the math does not support. The question for your RFP is whether the platform applies metric role distinctions, correction procedures, and decision frameworks consistently across both inference modes, or whether switching to Bayesian quietly drops the multi-metric safeguards. As discussed in our multiple testing post, most vendors that offer both modes apply correction only in the frequentist engine.
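
The compounding is easy to verify by simulation. Under a global null with ten independent metrics, a ship-if-any rule fires about 40% of the time, and weak priors that barely shrink the estimates produce essentially the same statistics:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_metrics, alpha, n_sims = 10, 0.05, 100_000

# Global null: no real effect on any metric. Draw z-statistics for ten
# independent metrics and apply a ship-if-any-significant rule.
z = rng.standard_normal((n_sims, n_metrics))
ship_rate = (np.abs(z) > 1.96).any(axis=1).mean()

print(ship_rate)                       # ≈ 0.40 by simulation
print(1 - (1 - alpha) ** n_metrics)    # ≈ 0.40 analytically
```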

Sixth: does the platform connect metric roles to monitoring and alerting? Guardrail metrics exist to catch regressions. If the platform distinguishes guardrails from success metrics but does not monitor them differently during the experiment, the distinction is incomplete. A guardrail breach that surfaces only in the final results, weeks after the regression started, has failed its purpose. The monitoring system should treat guardrail metrics as candidates for early alerting or automatic rollback, using sequential testing methods that control the false alarm rate. Ask whether guardrail metrics trigger alerts during the experiment, whether those alerts use a valid sequential method, and whether the platform can take automated action (notification, pause, or rollback) when a guardrail is breached.
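
As a sketch of the simplest valid approach, the check below splits alpha evenly across the planned looks, a Bonferroni-style spending rule that is conservative but controls the false alarm rate; production systems typically use always-valid sequential tests instead. All names and numbers here are illustrative assumptions:

```python
from scipy.stats import norm

def guardrail_alert(delta_hat, se, margin, n_looks, alpha=0.05):
    """One interim check of a guardrail metric during the experiment.

    Returns True when the one-sided test detects deterioration beyond
    `margin` at this look, at a per-look threshold of alpha / n_looks.
    """
    p = norm.cdf((delta_hat + margin) / se)   # H1: dropped more than margin
    return p < alpha / n_looks

# e.g. checked daily over a 14-day experiment:
if guardrail_alert(delta_hat=-0.04, se=0.008, margin=0.01, n_looks=14):
    print("guardrail breach: notify, pause, or roll back per policy")
```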

What the answers actually look like across vendors

Here is how the major platforms stand as of this writing. Vendor capabilities are based on public documentation. Confidence capabilities are based on the product itself.

"Yes" and "No" are based on explicit evidence in the vendor's public documentation. "Not documented" means we found no explicit evidence either way, not that the capability is confirmed absent.

| Platform | Metric roles distinguished? | Guardrails excluded from success metric correction? | Metric roles reflected in sample size calculation? | Shipping decision framework? | Multi-metric handling in Bayesian mode? | Guardrail monitoring with alerting or auto-action? |
| --- | --- | --- | --- | --- | --- | --- |
| Confidence | Yes (success, guardrail) | Yes | Yes | Yes | Yes | Yes (configurable) |
| GrowthBook | Yes (goal, guardrail, secondary) | Yes | No | Yes | No (frequentist only) | Yes (Safe Rollouts) |
| Eppo | Yes (primary, guardrail) | Partial | Not documented | Yes | No | No (warnings only) |
| Statsig | Yes (primary, secondary) | Partial | Partial (per-variant only) | Yes | Not documented | No documented auto-action |
| LaunchDarkly | Partial (no guardrail role) | Not documented | No | No | No | Yes (Guarded Rollouts) |
| Optimizely | Yes (primary, secondary, monitoring) | Yes | No | Partial | Not documented | Partial (no auto-rollback) |
| VWO | Yes (success, guardrail, diagnostic) | Not documented | Not documented | No documented framework | Partial | Yes (auto-pause) |
| PostHog | Partial (no guardrail role) | N/A | N/A | No | N/A (no correction) | No |
| Amplitude | Yes (primary, secondary, guardrail) | Not documented | No | No documented framework | Not documented | No documented guardrail monitoring |

Four patterns stand out across this landscape.

The first pattern is the gap between metric role labels and metric role enforcement. Every vendor except PostHog lets you assign different labels to metrics. But in most implementations, the label changes the display without changing the statistical treatment. LaunchDarkly's documentation is explicit about this: it advises basing the shipping decision on the primary metric only, treating additional metrics as informational context. PostHog distinguishes primary from secondary metrics but applies no multiple testing correction in either case, leaving the multi-metric false positive rate entirely in the experimenter's hands. The most common pattern is a platform that displays guardrails in a separate section of the scorecard while testing them with the same statistical machinery as success metrics. The separation looks meaningful in the UI, but it does not change the underlying analysis.

The second pattern is the disconnect between metric roles and sample size calculation. The correction family directly affects the power calculation. If success metrics are corrected with Bonferroni across two metrics, each is tested at alpha divided by two, and the required sample size increases accordingly. If guardrail metrics are excluded from that correction, they do not contribute to the increase. But no external vendor reflects this distinction in the sample size calculator. GrowthBook excludes guardrails from the correction family in the analysis but does not connect the correction to the sample size calculator at all. Statsig applies a preferential alpha split between primary and secondary metrics but reflects only per-variant Bonferroni in the calculator, not the per-metric role distinction. The result is the same planning-analysis gap described in our sample size post: the experiment is designed without accounting for how the metric roles will affect the analysis.

The third pattern is the emergence of documented decision frameworks. Three vendors now offer configurable decision frameworks that go beyond displaying results: GrowthBook's Decision Criteria, Eppo's Experiment Protocols, and Statsig's Decision Framework. All three allow you to specify, before the experiment starts, what combination of metric outcomes should lead to shipping, discussion, or rejection. GrowthBook's "Clear Signals" preset requires all goal metrics to be significant and positive and no guardrail metrics to be significant and negative. Eppo's Protocols allow configuring decision matrices that trigger recommendations when the experiment ends or when metrics reach significance. Statsig's framework maps primary and guardrail metric outcomes to one of three actions: roll out, discuss, or do not roll out. All three are advisory rather than enforced. None blocks a ship decision automatically when the framework recommends against it. The frameworks represent genuine progress over the scorecard-and-judgment approach, but the gap between a recommendation and an enforced gate is worth understanding for organizations that need governance at scale.

The fourth pattern is the split between experiments and rollouts for guardrail enforcement. The strongest guardrail enforcement across external vendors exists not in the experiment workflow but in the rollout workflow. GrowthBook's Safe Rollouts and LaunchDarkly's Guarded Rollouts both use one-sided sequential tests to monitor guardrail metrics during feature releases, with automatic rollback when a regression is detected. VWO's guardrails can autopause variations on breach. But these capabilities live in the release management layer, not the experiment analysis layer. In the experiment workflow, the same vendors surface guardrail results without enforcement. The implication is that guardrail protection is treated as a deployment concern rather than an analysis concern. For teams that want guardrail enforcement integrated into the experiment decision, the rollout-level protection is necessary but not sufficient.

Metric roles are widely recognized as a concept but inconsistently carried through the analysis pipeline. Every vendor displays metric roles. A smaller set adjusts the correction family. Fewer still connect the correction to sample size. Almost none enforces the distinction in the shipping decision. The RFP question that separates them is not "do you support guardrail metrics?" It is "where does the guardrail label start and where does it stop?"