Last updated: May 2026
Every experimentation platform lets you look at results. The question is whether the platform looks for you. Monitoring is what happens between the moment you start an experiment and the moment you make a decision: detecting broken randomization, catching guardrail regressions, and surfacing data quality issues before they invalidate weeks of data. Alerting is how you find out.
A results dashboard that waits for someone to check it is not monitoring. Alerts that fire on raw metric movements without sequential correction produce false alarms. And restricting monitoring to sequential experiments, leaving fixed-sample experiments blind until they end, misses the point entirely. Monitoring is not a mode of analysis. It is an operational safety layer that should run regardless of how the experiment is designed.
At Spotify, Confidence monitors every running experiment continuously, with sequential guardrail checks and automatic sample ratio mismatch detection regardless of the primary analysis method. The goal is that no experiment runs for days with broken randomization or a serious guardrail regression without the team knowing about it.
Should you add monitoring and alerting to your experimentation platform requirements?
Yes, and this is a case where the standard RFP framing undersells the requirement. "Does the platform have monitoring?" is too vague. Every platform has a results page you can refresh. The question that matters is what the platform monitors automatically, what statistical method backs the monitoring, and how it tells you when something is wrong.
The need is straightforward. Experiments run for days or weeks. During that time, three things can go wrong that should not wait for a human to discover them. First, the randomization can break: a sample ratio mismatch means the treatment and control groups are no longer comparable, and every metric result is suspect. Second, a guardrail metric can regress: the experiment is hurting something you committed to protecting. Third, the data pipeline can fail or produce anomalies: missing data, delayed ingestion, or instrumentation bugs that silently corrupt the analysis. Each requires a different detection method, a different urgency, and a different response. A platform that lumps them together or handles none of them leaves the experimenter flying blind.
The harder question is whether the monitoring is statistically valid. Checking a guardrail metric repeatedly during an experiment is a multiple testing problem across time. Without sequential correction, a 5% significance threshold checked daily over a two-week experiment inflates the false positive rate well beyond 5%. An alert that fires on every transient dip trains the team to ignore it. An alert that waits until the experiment ends to report a regression defeats the purpose. We have seen this pattern play out repeatedly across Spotify's experimentation program: uncorrected monitoring erodes trust in alerts faster than it catches real problems. Once teams start ignoring alerts, the monitoring system is effectively gone, and restoring that trust is far harder than building it correctly in the first place. The right answer is sequential monitoring: guardrail checks that use valid sequential boundaries so each check maintains error rate control.
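To make the inflation concrete, here is a minimal simulation sketch. It assumes an A/A experiment (no true effect), unit-variance outcomes in each group, and one uncorrected two-sided check per day. The function name, traffic volume, and number of simulations are illustrative assumptions, not taken from any platform.

```python
import random
from statistics import NormalDist


def false_alarm_rate(n_days=14, users_per_day=200, n_sims=2000, alpha=0.05, seed=7):
    """Estimate how often an A/A experiment trips an uncorrected daily check.

    Returns (rate of alarms on any day, rate of alarms on the final day only).
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # fixed-sample two-sided threshold
    alarms_any_day = 0
    alarms_final_day = 0
    for _sim in range(n_sims):
        sum_diff = 0.0  # running sum of (treatment - control) differences
        n = 0           # users per group so far
        tripped = False
        z = 0.0
        for _day in range(n_days):
            # Under no effect, each paired difference has mean 0 and variance 2,
            # so a day's sum of differences is N(0, 2 * users_per_day).
            sum_diff += rng.gauss(0.0, (2 * users_per_day) ** 0.5)
            n += users_per_day
            z = sum_diff / (2 * n) ** 0.5
            if abs(z) > z_crit:
                tripped = True  # an uncorrected alert would have fired today
        alarms_any_day += tripped
        alarms_final_day += abs(z) > z_crit
    return alarms_any_day / n_sims, alarms_final_day / n_sims


any_day, final_day = false_alarm_rate()
print(f"alarm on any of 14 daily checks: {any_day:.1%}")
print(f"alarm on the final check only:   {final_day:.1%}")
```

With these settings, the any-day rate should land several times above the nominal 5%, while the final-day rate stays close to it. That gap is the false alarm budget an uncorrected alert spends on noise.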
This connects directly to the sequential testing post in this series. The statistical validity of monitoring depends on the sequential method backing it. It also connects to the multi-metric decision making post: when monitoring checks multiple guardrail metrics simultaneously, the correction procedure determines the false alarm rate across metrics, not just across time.
What your RFP should ask instead of "yes/no?"
Seven questions separate a connected monitoring and alerting implementation from a dashboard you have to remember to check.
First: does the platform detect sample ratio mismatch automatically? A sample ratio mismatch (SRM) means the observed traffic split between treatment and control does not match the intended allocation. SRM detection is the most basic quality gate in experimentation. If the randomization is broken, every metric result is unreliable. SRM can be caused by bugs in the assignment logic, filtering differences between groups, or bot traffic that affects variants unequally. A platform that surfaces experiment results without checking whether the randomization worked is skipping the foundational integrity check. Ask whether SRM detection runs automatically on every experiment, what statistical test is used (a chi-squared test at a stringent alpha like 0.001 is standard), whether the check runs continuously or only at the end, and whether a detected SRM triggers an alert or only appears as a warning in the results page. Continuous detection is necessary because a randomization bug introduced mid-experiment will not be visible in an end-of-experiment check if enough valid data accumulated before the break.
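As a rough illustration of the check described above, here is a minimal sketch of a one-shot chi-squared SRM test for a two-variant experiment. It relies on the fact that a chi-squared statistic with one degree of freedom is the square of a standard normal, so the p-value can be computed with `math.erfc` from the standard library. The function name and the example counts are hypothetical. This is a single fixed-alpha test: run repeatedly during an experiment, it needs a sequential variant to keep its error rate, and this sketch does not implement one.

```python
import math


def srm_check(n_control, n_treatment, expected_treatment_share=0.5, alpha=0.001):
    """One-shot chi-squared SRM test for a two-variant experiment.

    Returns (chi2 statistic, p-value, flagged).
    """
    total = n_control + n_treatment
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_treatment - expected_treatment) ** 2 / expected_treatment)
    # For one degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value, p_value < alpha


# A 50/50 design with a small imbalance is not flagged at alpha = 0.001.
print(srm_check(50_000, 50_300))
# A 1,500-user gap on roughly 100,000 users is flagged.
print(srm_check(50_000, 51_500))
```

The stringent alpha matters here: with large samples, even small deviations from the intended split produce small p-values, and an SRM flag should mean "stop and investigate", not "probably noise".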
Second: do guardrail alerts respect sequential testing boundaries? Guardrail metrics are the metrics you committed to protecting: latency, error rate, crash rate, revenue. If the experiment degrades one of these, you want to know early. But "early" and "reliably" are in tension. An alert that fires on a raw confidence interval without sequential correction will produce false alarms every time a metric fluctuates during the natural course of the experiment. An alert that applies a sequential boundary, whether group sequential or always-valid, maintains the stated error rate while still allowing early detection. The difference is operational: teams trust alerts that are rarely wrong and stop trusting alerts that cry wolf. Ask whether guardrail monitoring uses a sequential correction, what method it uses, and whether the stated false positive rate accounts for the number of checks performed during the experiment. For more on what distinguishes valid sequential methods from invalid ones, see the sequential testing post in this series.
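The simplest way to see what "accounting for the number of checks" means is a union bound: split the alpha budget evenly across the planned checks. The sketch below does that for a one-sided guardrail test. It is a deliberately conservative stand-in, not the group sequential or always-valid method any vendor uses, and the function name and the convention that a positive z-score means regression are assumptions made for illustration.

```python
from statistics import NormalDist


def first_guardrail_breach(z_scores, alpha=0.05, planned_checks=14):
    """Return the 1-based index of the first check that crosses the boundary.

    z_scores: one z-statistic per check so far, signed so that a positive
    value means the guardrail metric moved in the harmful direction.
    Returns None if no check has crossed.
    """
    if len(z_scores) > planned_checks:
        raise ValueError("more checks than planned; the union bound no longer holds")
    # Each check gets alpha / planned_checks, so the chance of any false
    # alarm across all planned checks is at most alpha.
    boundary = NormalDist().inv_cdf(1 - alpha / planned_checks)
    for check_number, z in enumerate(z_scores, start=1):
        if z > boundary:
            return check_number
    return None


# A z-score of 1.9 would trip an uncorrected one-sided 5% check (threshold
# about 1.645) but stays inside the corrected boundary for 14 checks.
print(first_guardrail_breach([1.0, 1.9, 2.1]))
# A sustained regression still crosses early.
print(first_guardrail_breach([0.5, 2.0, 3.1]))
```

Purpose-built sequential boundaries are less conservative than this even split, which is why the method a vendor uses is worth asking about: it determines how early a real regression can be detected for the same false alarm rate.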
Third: is monitoring available for fixed-sample experiments? Many teams choose a fixed-sample design for their primary analysis because it maximizes statistical power and simplifies interpretation. But choosing a fixed-sample primary analysis does not mean you want to be blind to problems for the duration of the experiment. Guardrail regressions, SRM, and data quality issues are just as urgent in a fixed-sample experiment as in a sequential one. The key insight, developed in Spotify's work on fixed-power designs, is that monitoring certain quantities during the experiment does not compromise the primary inference. You can monitor guardrails sequentially and check for SRM continuously without affecting the fixed-sample analysis of your primary metric. A platform that ties monitoring to the sequential testing toggle forces a false choice: either use sequential testing for everything, or give up early detection of problems. Ask whether monitoring and alerting are available regardless of the primary analysis method.
Fourth: does monitoring work across all metric types? Experiments typically include a mix of metric types: means, proportions, ratios, and sometimes percentiles. If monitoring applies only to simple means, the metrics that often matter most for guardrails (ratio metrics like revenue per session, or latency percentiles) are left unmonitored. The sequential correction must be valid for each metric type, not just applied uniformly. As covered in the percentile metrics post and the sample size post in this series, metric type coverage is a recurring gap across the vendor landscape. If a platform supports a metric type in its analysis but not in its monitoring, that metric type has no early warning system. Ask which metric types are covered by the monitoring system, and whether the sequential correction is validated for each.
Fifth: does monitoring work in both Bayesian and frequentist modes? If the platform offers both inference frameworks, monitoring should work in both. A Bayesian experiment still needs SRM detection, guardrail checks, and data quality monitoring. A platform that provides monitoring only in its frequentist mode creates an asymmetry: switching to Bayesian analysis means losing the safety layer. The sequential correction backing the monitoring may differ between modes, but the operational need is the same. Ask whether the full monitoring suite (SRM, guardrails, data quality) is available in both inference modes, or whether one mode is treated as a second-class citizen.
Sixth: what notification channels are supported? Detection without notification is a dashboard, not an alert. The value of monitoring depends entirely on whether the right person finds out at the right time. At minimum, the platform should support email and Slack (or a comparable team messaging integration). Webhook support is important for organizations that route alerts through a centralized incident management system like PagerDuty or Opsgenie. Dashboard-only monitoring means the alert exists only if someone is looking. Ask which notification channels are supported, whether alerts can be routed to a team channel rather than just the experiment creator, and whether webhook support allows integration with your existing alerting infrastructure.
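One way to picture the routing question is a table that maps each alert type to the channels that should receive it. The sketch below is purely illustrative: the routing keys, channel names, and payload fields are hypothetical and do not reflect any vendor's webhook schema.

```python
import json

# Hypothetical routing table: alert type -> list of "channel:target" strings.
DEFAULT_ROUTES = {
    "srm": ["slack:#experiment-owners", "email:experiment-creator"],
    "guardrail_breach": ["slack:#experiment-owners", "webhook:incident-management"],
    "data_quality": ["slack:#data-pipeline"],
}


def route_alert(alert_type, experiment, detail, routes=DEFAULT_ROUTES):
    """Return (target, JSON payload) pairs for every channel configured for this alert."""
    targets = routes.get(alert_type, [])
    if not targets:
        # An alert with nowhere to go is a dashboard entry, so fail loudly.
        raise ValueError(f"no notification route configured for {alert_type!r}")
    body = json.dumps(
        {"type": alert_type, "experiment": experiment, "detail": detail},
        sort_keys=True,
    )
    return [(target, body) for target in targets]


for target, body in route_alert("guardrail_breach", "checkout-v2", "crash_rate crossed boundary"):
    print(target, body)
```

The design point is that routing is keyed on the experiment alert type, so a guardrail breach reaches the team channel and the incident pipeline without anyone configuring it per experiment.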
Seventh: what happens when a monitoring boundary is crossed? Detection and notification are necessary but not enough. When a guardrail breach is detected, does the platform take any automated action, or does it only inform? Automated responses range from pausing the experiment to rolling back to the control variant. Manual responses require the team to see the alert, assess the situation, and act. Neither approach is universally better, but the platform should offer a choice. A platform that detects a regression but takes no action and sends no alert is monitoring without operational value. A platform that automatically rolls back without notification takes action the team may not understand. The best implementations offer configurable responses: alert only, alert and pause, or alert and autorollback. Ask what automated response options exist, whether they are configurable per experiment or per metric, and whether the response is logged so the team can review what happened.
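The three response options can be sketched as a small policy with a per-experiment default, per-metric overrides, and an audit log. All class, field, and state names below are hypothetical and chosen for illustration; this is not how Confidence or any other platform models it.

```python
from dataclasses import dataclass, field
from enum import Enum


class BreachResponse(Enum):
    ALERT_ONLY = "alert_only"
    ALERT_AND_PAUSE = "alert_and_pause"
    ALERT_AND_ROLLBACK = "alert_and_rollback"


@dataclass
class Experiment:
    name: str
    state: str = "running"
    default_response: BreachResponse = BreachResponse.ALERT_ONLY
    metric_responses: dict = field(default_factory=dict)  # per-metric overrides
    audit_log: list = field(default_factory=list)


def handle_breach(experiment, metric, notify):
    """Apply the configured response for a guardrail breach and log it."""
    response = experiment.metric_responses.get(metric, experiment.default_response)
    # Always notify first, so no automated action happens silently.
    notify(f"{experiment.name}: guardrail '{metric}' crossed its boundary ({response.value})")
    if response is BreachResponse.ALERT_AND_PAUSE:
        experiment.state = "paused"
    elif response is BreachResponse.ALERT_AND_ROLLBACK:
        experiment.state = "rolled_back"
    experiment.audit_log.append((metric, response.value, experiment.state))
    return response


sent = []
exp = Experiment(
    "checkout-v2",
    metric_responses={"crash_rate": BreachResponse.ALERT_AND_ROLLBACK},
)
handle_breach(exp, "latency_p95", sent.append)  # default: alert only
handle_breach(exp, "crash_rate", sent.append)   # override: alert and roll back
print(exp.state, exp.audit_log)
```

Notification happens before any state change, and every response is appended to the log, which covers the two failure modes described above: silent detection and silent rollback.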
What the answers actually look like across vendors
Here is how the major platforms stand as of this writing. Vendor capabilities are based on public documentation. Confidence capabilities are based on the product itself.
"Yes" and "No" are based on explicit evidence in the vendor's public documentation. "Not documented" means we found no explicit evidence either way, not that the capability is confirmed absent.
| Platform | Automatic SRM detection? | SRM detection method | Guardrail monitoring? | Guardrail alerts sequentially corrected? | Monitoring for fixed-sample experiments? | Notification channels | Automated response on breach? |
|---|---|---|---|---|---|---|---|
| Confidence | Yes (continuous) | Chi-squared, sequential | Yes | Yes | Yes | Email, Slack, configurable | Yes (configurable) |
| GrowthBook | Yes | Chi-squared | Yes (Safe Rollouts) | Yes | Partial (rollouts only) | Slack, webhook | Yes (autorollback) |
| Eppo | Yes (continuous) | Pearson chi-squared | Yes | Yes | Not documented | Email, Slack | No (warnings only) |
| Statsig | Yes (continuous) | Sequential SRM | Yes | Partial (after 24 hours) | Not documented | Email, Slack, in-app | No documented response |
| Optimizely | Yes (continuous) | Sequential SRM | Yes | Not documented | Not documented | Email, Slack | No documented autorollback |
| LaunchDarkly | Yes | SRM with autorollback | Yes (Guarded Rollouts) | Yes | No (rollouts only) | Email, Slack, MS Teams, PagerDuty, in-app | Yes (autorollback) |
| VWO | Yes | SRM | Yes (autopause) | Not documented | Not documented | Email, Slack, in-app | Yes (autopause) |
| PostHog | Partial | Not documented | No | No | Not documented | Email, Slack, webhook | No |
| Amplitude | Yes | Sequential chi-squared | Partial | No | Not documented | Email, Slack, MS Teams, webhook | No |
Five patterns emerge from this comparison.
The first pattern is the gap between SRM detection and SRM response. Nearly every vendor now detects sample ratio mismatch automatically, which represents real progress from a few years ago when SRM detection was rare outside of internal platforms at large tech companies. But detection without a connected response still leaves the problem in the experimenter's hands. Optimizely uses a sequential SRM algorithm that checks continuously, but the response is limited to a warning in the experiment health indicator. Amplitude detects SRM and displays a warning banner, but does not alert the team through an external channel by default. The strongest implementations connect detection to notification and, sometimes, to automated action. GrowthBook's Safe Rollouts and LaunchDarkly's Guarded Rollouts both offer automatic rollback when SRM is detected during a feature rollout. Confidence detects SRM continuously and alerts the owning team immediately. The question for your RFP is not just whether SRM is detected, but whether the detection triggers a notification through a channel the team actually monitors.
The second pattern is the conflation of monitoring with sequential testing. Several vendors tie their monitoring capabilities to their sequential testing mode. LaunchDarkly's monitoring is available through Guarded Rollouts, which are inherently sequential. GrowthBook's Safe Rollouts are a feature-flag rollout mechanism, not a general experiment monitoring system. If you are running a standard A/B test with a fixed-sample design, these monitoring features may not apply. Confidence separates the two concerns: you can run a fixed-horizon primary analysis while guardrails are monitored sequentially in the background. Many teams choose fixed-sample designs for power reasons but still need early detection of problems. A platform that forces you to choose between statistical power on your primary metric and safety monitoring on your guardrails is imposing a tradeoff that does not need to exist.
The third pattern is the question of whether guardrail alerts are statistically valid. GrowthBook's Safe Rollouts use one-sided sequential testing for guardrail checks, which keeps the false positive rate controlled across the whole series of checks rather than only at a single look. Eppo applies sequential confidence intervals to guardrail cutoffs, which provides the same property. Statsig applies its sequential testing method to metric alert calculations, but only after the first 24 hours of an experiment. Optimizely's Stats Engine is sequential, but whether that sequential correction extends to guardrail alerts specifically is not documented. VWO's SmartStats is sequential, but the same ambiguity applies. Platforms where guardrail alerts fire on raw, uncorrected results will produce false alarms at a rate that scales with how often they check. If the platform checks daily for 14 days, the false alarm rate is substantially higher than the nominal significance level. The RFP question is not whether guardrail alerts exist, but whether they are backed by a method that controls the false alarm rate across the full duration of the experiment.
The fourth pattern is notification channel breadth. Most vendors support email and Slack, which covers the majority of team workflows. LaunchDarkly stands out with PagerDuty and Microsoft Teams integration in addition to email, Slack, and in-app notifications, making it the most flexible for organizations with established incident management pipelines. PostHog supports email, Slack, and webhooks through its general alerting system, but these are not experiment-specific alerts tied to guardrail breaches or SRM detection. Amplitude supports webhooks through its general monitoring infrastructure. The gap is not whether any notification channel exists, but whether experiment-specific alerts (SRM, guardrail breach, data quality) flow through those channels automatically, or whether the integration is limited to general product analytics alerts that require manual configuration for each experiment.
The fifth pattern is the automated response gap. Three vendors offer automated responses to monitoring signals with a documented scope: Confidence (configurable alert, pause, or autorollback), GrowthBook (autorollback in Safe Rollouts), and LaunchDarkly (autorollback in Guarded Rollouts). VWO also offers autopause on guardrail breach, though the documentation does not specify whether this applies universally or only in specific experiment types. The remaining vendors at most detect and notify, leaving the response entirely to the experimenter. At small scale, this is workable. At scale, with hundreds of concurrent experiments, depending on every team to notice an alert and act on it within a reasonable window is fragile. Automated responses are not always the right choice. A false positive that triggers an autorollback stops a valid experiment. But the option should exist, configurable per experiment, so that teams running high-stakes rollouts can choose automated protection while teams running exploratory experiments can choose notification only.
Nearly every vendor now detects SRM. Most offer some form of guardrail monitoring. Where they diverge is in the statistical rigor of that monitoring, the breadth of notification channels, the availability of monitoring for fixed-sample experiments, and the range of automated responses. The RFP question "does the platform have monitoring?" will get a yes from every vendor. The question that distinguishes them is whether the monitoring is statistically valid, operationally connected, and available regardless of how the experiment is designed.