Experiment Analysis

What is a Sample Ratio Mismatch?

A sample ratio mismatch (SRM) occurs when the observed number of users in each experiment group differs from the intended allocation ratio by more than chance alone would explain. If you design a 50/50 split and observe 51.2% in treatment vs. 48.8% in control on a large sample, something in the experiment pipeline is systematically sending users to the wrong group or dropping them from one side. The experiment results cannot be trusted until the cause is found and fixed.

SRM is one of the most reliable diagnostic signals in experimentation. When the split is wrong, the groups are no longer comparable, and any observed treatment effect could be an artifact of the imbalance rather than a real product impact.

What causes sample ratio mismatches?

SRM rarely comes from the randomization itself. Modern hashing algorithms distribute users evenly. The problems almost always occur downstream of assignment.
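To see why assignment itself is rarely the culprit, here is a minimal sketch of hash-based bucketing (the general technique, not Confidence's actual implementation; the salt and bucket count are arbitrary choices for the example):

```python
import hashlib

def assign_variant(user_id: str, experiment_salt: str, ratios: list[float]) -> int:
    """Deterministically map a user to a variant index via hashing.

    SHA-256 output is effectively uniform, so buckets fill in proportion
    to the configured ratios -- the split itself cannot drift.
    """
    digest = hashlib.sha256(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000  # stable bucket in [0, 10000)

    threshold = 0.0
    for variant, ratio in enumerate(ratios):
        threshold += ratio * 10_000
        if bucket < threshold:
            return variant
    return len(ratios) - 1  # guard against floating-point rounding
```

A given user hashes to the same bucket every time, so any imbalance has to come from what happens to users after they are assigned.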

Common causes include:

Bot filtering that behaves differently across variants. If the treatment changes page structure and bot detection relies on page structure, the filter may remove more users from one group.

Redirects or page loads that fail at different rates. If the treatment variant triggers a slower code path, users on weak connections are more likely to drop off before their data is logged. Treatment loses users; control doesn't.

SDK initialization issues where the flag evaluation happens after some users have already exited the flow. If the treatment code path takes longer to initialize, a fraction of fast-exiting users never get logged in treatment.

Interaction with other experiments. If two experiments share the same user pool and one of them affects whether users reach the second experiment's trigger point, the resulting sample can be skewed.

At Spotify, with 10,000+ experiments running per year, SRM checks are essential infrastructure. Even a 0.5% imbalance on a large experiment can indicate a systematic problem.

How does Confidence detect SRM?

Confidence runs an SRM check automatically on every experiment. The check uses a chi-squared test comparing the observed user counts per variant against the expected counts from the configured allocation ratio. If the p-value falls below a threshold (typically 0.001, which is deliberately strict to avoid false alarms), the experiment is flagged.
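A minimal sketch of such a check, using scipy (this mirrors the test as described, not Confidence's actual code):

```python
from scipy.stats import chisquare

def srm_check(observed: list[int], ratios: list[float], alpha: float = 0.001):
    """Chi-squared test of observed variant counts against the configured split."""
    total = sum(observed)
    expected = [total * r for r in ratios]
    stat, p_value = chisquare(f_obs=observed, f_exp=expected)
    return p_value, p_value < alpha

# The 50/50 example from the definition above, assuming 200,000 users total:
p, flagged = srm_check([102_400, 97_600], [0.5, 0.5])
print(f"p = {p:.1e}, flagged: {flagged}")  # p on the order of 1e-26 -- flagged
```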

When Confidence detects an SRM, it surfaces the warning directly in the experiment results. The flag tells the team: don't interpret these results until you've investigated and resolved the imbalance. This matters because SRM doesn't just add noise. It introduces systematic bias. The groups are no longer exchangeable, so the fundamental assumption of the A/B test is violated.

The strict threshold (0.001 rather than the usual 0.05) works because real SRMs rarely produce marginal p-values: a systematic cause drives the p-value toward zero as data accumulates, so tightening the threshold gives up very little power against genuine problems. What it buys is a low false-alarm rate across thousands of experiments, which keeps the warning credible. And when the check does fire, the asymmetry of the decision takes over: investigating an imbalance that turns out to be chance costs time, while acting on genuinely biased results costs far more, so a flagged experiment always gets investigated.
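The scale argument makes the choice concrete: with 10,000 experiments a year, a 0.05 threshold would produce roughly 500 spurious SRM warnings annually even if every pipeline were healthy, while 0.001 produces about 10. The first volume trains teams to ignore the flag; the second keeps every flag worth investigating.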

What should you do when SRM is detected?

First, don't ignore it. SRM doesn't go away if you wait longer or add more users. If the cause is systematic, more data makes the mismatch more statistically significant, not less.
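This is easy to verify: hold the imbalance fixed at 51.2% vs. 48.8% and grow the sample (a sketch using scipy, with sample sizes chosen arbitrarily for illustration):

```python
from scipy.stats import chisquare

# The same 51.2% / 48.8% imbalance at increasing sample sizes:
for total in (10_000, 100_000, 1_000_000):
    observed = [total * 0.512, total * 0.488]
    _, p = chisquare(observed, [total / 2, total / 2])
    print(f"n = {total:>9,}  p = {p:.1e}")

# Approximate output:
# n =    10,000  p = 1.6e-02   -- above 0.001, not yet flagged
# n =   100,000  p = 3.2e-14   -- flagged
# n = 1,000,000  p = 2.8e-127  -- unmistakable
```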

Investigate the pipeline. Check bot filters, check logging completion rates by variant, check whether other experiments are interacting with yours. The cause is almost always in the instrumentation, not the randomization.
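The logging-completion check might look something like this hypothetical sketch (the file names and columns are made up; the idea is to compare per-variant counts at each pipeline stage):

```python
import pandas as pd

# Hypothetical inputs: one row per user at each stage of the pipeline.
assigned = pd.read_parquet("assignments.parquet")  # columns: user_id, variant
exposed  = pd.read_parquet("exposures.parquet")    # columns: user_id, variant

funnel = pd.DataFrame({
    "assigned": assigned["variant"].value_counts(),
    "logged":   exposed["variant"].value_counts(),
})
funnel["completion"] = funnel["logged"] / funnel["assigned"]
print(funnel)  # a completion gap between variants localizes where users are lost
```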

If you can identify and fix the root cause, you may be able to re-run the experiment cleanly. If the cause is unfixable (for example, inherent to the treatment implementation), you need to account for the bias or redesign the experiment.

Don't adjust the results post hoc by reweighting or trimming to restore the ratio. These corrections assume you know the mechanism behind the imbalance, and that assumption is usually wrong.