Lesson 7: Health checks and sample ratio mismatch
In this lesson, you learn what the health checks on the results page are testing, why they matter, and what to do when one of them fails. The sample ratio mismatch check is especially important: if it fails, you cannot trust any of the metric results.
The health checks section sits between the Spotlight and the Metrics section on the results page. Its job is to answer a question that comes before any metric interpretation: did this experiment actually run correctly?
Metric results are only meaningful if the experiment was set up correctly and ran as intended. The health checks verify this, and they flag problems before you invest time reading individual results.
Incoming traffic
This check verifies that your experiment is actually receiving traffic. Specifically, it confirms that the flag rule the experiment controls is being evaluated by clients, that those evaluations are being applied, and that exposure is being calculated for all groups.
If this check fails, the experiment has not been collecting data as expected. There is nothing to interpret yet.
Balanced traffic (the SRM check)
This is the most important health check, and the one that most frequently requires action.
When users are assigned to experiment groups, they should be distributed according to the allocation you configured. If you set up a 50/50 split, then roughly half the users should be in control and half in the treatment group. If you set up a 33/33/33 split across three groups, each group should have approximately a third of the users.
The sample ratio mismatch (SRM) check tests whether the observed distribution of users across groups matches the expected allocation. If there is a meaningful imbalance (noticeably more or fewer users in a group than the allocation predicts), this is a strong signal that something went wrong in the implementation or exposure logic.
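One common way to test for this kind of imbalance is a chi-square goodness-of-fit test on the group counts. The sketch below is illustrative only: this lesson does not say which test Confidence runs, and the function names (`chi2_sf`, `srm_check`) and the 0.001 threshold are assumptions chosen for the example.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square distribution with df degrees of freedom."""
    t = x / 2.0
    # Closed-form starting points: df = 2 (even case) or df = 1 (odd case).
    if df % 2 == 0:
        a, q = 1.0, math.exp(-t)
    else:
        a, q = 0.5, math.erfc(math.sqrt(t))
    # Add two degrees of freedom per step:
    # Q(a + 1, t) = Q(a, t) + t^a * e^(-t) / Gamma(a + 1).
    while a < df / 2.0:
        q += t ** a * math.exp(-t) / math.gamma(a + 1.0)
        a += 1.0
    return q


def srm_check(observed, expected_shares, threshold=0.001):
    """Return (chi-square statistic, p-value, mismatch flag) for observed group counts."""
    total = sum(observed)
    expected = [total * share for share in expected_shares]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = chi2_sf(statistic, df=len(observed) - 1)
    # The threshold is an assumed value for this sketch, not a documented setting.
    return statistic, p_value, p_value < threshold


# A 50/50 experiment that ended up with 5,200 vs 4,800 users.
stat, p, mismatch = srm_check([5200, 4800], [0.5, 0.5])
print(f"chi2={stat:.1f}, p={p:.2e}, mismatch={mismatch}")
```

A 52/48 split on 10,000 users can look harmless, but under a true 50/50 allocation it is very unlikely to occur by chance, so this sketch flags it.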
If the SRM check fails, stop the experiment and investigate before drawing any conclusions from the results. A sample ratio mismatch means the groups are likely not comparable, so none of the metric results can be trusted, even if the numbers look good.
Why does a traffic imbalance invalidate the results? The entire logic of a randomized experiment depends on the groups being statistically equivalent before the treatment variant is applied. If the exposure logic has a bug or was implemented incorrectly (for example, if code sitting between the experiment and the SDK affects users in one group more than another), the groups may differ systematically in ways unrelated to the treatment variant. Any observed difference in metrics could then be due to that pre-existing difference, not the treatment variant.
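A small simulation can make this concrete. The sketch below is a toy model, not a description of any real system: it assumes two kinds of users (heavy and light) with fixed metric values, a treatment with no effect at all, and a hypothetical bug that loses the exposure event for some light users in the treatment group only. All names and numbers are invented for illustration.

```python
import random


def simulate(n_users=20000, drop_rate=0.3, seed=7):
    """Return {group: (exposed user count, mean metric)} under a one-sided exposure bug."""
    rng = random.Random(seed)
    exposed = {"control": [], "treatment": []}
    for _ in range(n_users):
        heavy = rng.random() < 0.3           # assumed: 30% of users are heavy users
        metric = 10.0 if heavy else 2.0      # the treatment has NO effect on the metric
        group = "treatment" if rng.random() < 0.5 else "control"
        # Hypothetical bug: exposure is lost for some light users in treatment only.
        if group == "treatment" and not heavy and rng.random() < drop_rate:
            continue
        exposed[group].append(metric)
    return {g: (len(values), sum(values) / len(values)) for g, values in exposed.items()}


for group, (count, mean) in simulate().items():
    print(f"{group}: exposed users={count}, mean metric={mean:.2f}")
```

In this toy model the treatment group ends up smaller and contains a larger share of heavy users, so its mean metric is higher even though the treatment changes nothing. The imbalance in counts is the visible symptom; the biased comparison is the actual damage.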
Common causes of SRM include bugs in the assignment logic, caching issues that cause some users to miss the exposure event, and SDK integration issues where custom code between the experiment and the SDK causes uneven exposure. The SRM check will not tell you which of these is the cause: it will only tell you that something is wrong. Investigating requires looking at the exposure data and the experiment setup in detail.
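A first investigative step is often to break the exposure counts down by a dimension such as platform or app version and see where the split drifts away from the allocation. The sketch below assumes a hypothetical export of exposure rows as `(segment, group)` pairs and a two-group experiment; the function name and data shape are assumptions, not a Confidence API.

```python
from collections import Counter


def treatment_share_by_segment(exposures, treatment="treatment"):
    """Return {segment: share of exposed users assigned to the treatment group}."""
    totals = Counter()
    treated = Counter()
    for segment, group in exposures:
        totals[segment] += 1
        if group == treatment:
            treated[segment] += 1
    return {segment: treated[segment] / totals[segment] for segment in totals}


# Invented example data: a 50/50 experiment where one platform is off.
rows = (
    [("ios", "control")] * 500 + [("ios", "treatment")] * 500
    + [("android", "control")] * 600 + [("android", "treatment")] * 400
)
for segment, share in treatment_share_by_segment(rows).items():
    print(f"{segment}: treatment share = {share:.2f}")
```

A segment whose share sits far from the configured allocation (android in this invented data) is a reasonable place to start looking for caching or integration problems.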
No metric deterioration
This check verifies that none of the metrics you are tracking have moved significantly in the wrong direction. This is a continuous check: it runs throughout the experiment using sequential tests regardless of which evaluation strategy you have chosen (more on that in Lesson 9).
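A sequential test is needed here because checking a fixed-threshold test over and over inflates the false alarm rate. This lesson does not say which sequential method is used, so the sketch below swaps in the simplest conservative option: a one-sided z-test whose significance level is split evenly across a planned number of looks (a Bonferroni correction). The function name, parameters, and default alpha are assumptions for illustration.

```python
import math
from statistics import NormalDist


def deteriorated_at_look(mean_c, mean_t, var_c, var_t, n_c, n_t,
                         planned_looks, alpha=0.05, higher_is_better=True):
    """Return True if the treatment looks significantly worse at this look."""
    standard_error = math.sqrt(var_c / n_c + var_t / n_t)
    diff = mean_t - mean_c
    if not higher_is_better:
        diff = -diff                      # flip so that negative always means "worse"
    z = diff / standard_error
    # Each look only gets alpha / planned_looks, so the total stays at or below alpha.
    z_critical = NormalDist().inv_cdf(alpha / planned_looks)
    return z < z_critical


# Invented numbers: a metric where higher is better, checked at one of 10 planned looks.
print(deteriorated_at_look(10.0, 9.5, 4.0, 4.0, 1000, 1000, planned_looks=10))
print(deteriorated_at_look(10.0, 9.9, 4.0, 4.0, 1000, 1000, planned_looks=10))
```

With these invented numbers, the first call reports deterioration and the second does not. Splitting alpha this way is stricter than purpose-built sequential methods, which is the price of its simplicity.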
If this check fails, at least one metric has deteriorated. The specific metric showing deterioration is listed in the health check section. This triggers an Abort recommendation in the Spotlight.
The metric deterioration health check and the deterioration status labels on individual guardrail metrics are related but not identical. The health check is a broad, always-on, early-warning scan that applies to all tracked metrics. The guardrail metric status label is the formal statistical result for that metric based on the test configuration you set up. If either signals deterioration, pay attention.
What the Abort recommendation means
When you see Abort in the Spotlight, Confidence is telling you that something has gone wrong that makes continuing the experiment harmful or pointless. This is either because a health check failed (typically SRM) or because one or more metrics have deteriorated significantly.
Aborting an experiment is not a failure. It is the system working as intended. Catching a problem early, before a damaging feature reaches all users, is one of the core reasons to run experiments in the first place. At Spotify, 42% of experiments are aborted because harm was detected. That happens not because the experiments were poorly designed, but because the experimentation system did exactly what it should. That is an enormous amount of value protected. Spotify's post on its Experiments with Learning framework has more on how Spotify thinks about this.