An A/A test is a randomized experiment where both groups receive the identical experience. There's no treatment. The control sees the current product, and the "treatment" sees the exact same thing. The purpose is to validate that the experimentation system itself is working correctly: that the randomization is unbiased, the metrics pipeline is accurate, and the statistical tests produce false positives at the expected rate.
A/A tests are the calibration check for your entire experimentation stack. If you split users into two groups that see the same thing and your platform reports a statistically significant difference, something is broken. Either the randomization introduced a bias, the metric computation has a bug, or the statistical method is miscalibrated. Spotify runs A/A tests as part of validating its experimentation infrastructure, and Confidence's built-in sample ratio mismatch (SRM) detection flags one of the most common failure modes automatically.
How does an A/A test work?
The setup is identical to a standard A/B test. Users are randomly assigned to two groups using deterministic hashing. Both groups see the same product experience. After the experiment has accumulated enough data, you analyze the metrics the same way you would in any experiment.
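To make the mechanics concrete, here is a minimal sketch of deterministic hash-based assignment in Python. The salt value, the SHA-256 choice, and the 100-bucket split are assumptions for illustration, not Confidence's actual assignment logic.

```python
# A minimal sketch of deterministic hash-based assignment, assuming a
# made-up experiment salt and a 50/50 bucket split. Illustrative only;
# this is not Confidence's actual assignment code.
import hashlib

def assign_group(user_id: str, salt: str = "aa-test-example") -> str:
    """Hash the user id with the experiment salt and split the hash space in half."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100) for this user and salt
    return "A" if bucket < 50 else "B"

# The same user always lands in the same group; both groups see the same product.
print(assign_group("user-123"))
```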
The expected result: no statistically significant difference on any metric. More precisely, if you run 100 A/A tests at a 5% significance level, you should see roughly 5 false positives. That's the definition of a well-calibrated system. If you consistently see far more than 5 out of 100, the system has a problem.
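A quick way to see what "well-calibrated" means is to simulate it. The sketch below runs many simulated A/A tests on a synthetic metric and counts how often a two-sample t-test crosses the 5% threshold; the sample size, metric distribution, and number of simulations are arbitrary choices for the example.

```python
# Simulate many A/A tests on a synthetic metric and count how often a
# two-sample t-test reports p < 0.05. Sample size, metric distribution,
# and the number of simulations are arbitrary choices for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_tests, n_users = 1000, 5000
false_positives = 0

for _ in range(n_tests):
    # Both groups are drawn from the same distribution, i.e. a true A/A test.
    group_a = rng.normal(loc=10.0, scale=3.0, size=n_users)
    group_b = rng.normal(loc=10.0, scale=3.0, size=n_users)
    _, p_value = stats.ttest_ind(group_a, group_b)
    false_positives += p_value < 0.05

# A well-calibrated setup should report a rate close to the 5% significance level.
print(f"False positive rate: {false_positives / n_tests:.3f}")
```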
The three most common problems an A/A test catches:
- Randomization bias. The assignment mechanism isn't truly random. One group ends up with systematically different users, such as heavier users, newer accounts, or users from specific regions. Comparing a pre-experiment covariate between the groups, as sketched after this list, is a quick way to spot this.
- Metric pipeline bugs. The data collection or aggregation introduces asymmetry. One group's events get processed differently, dropped, or double-counted.
- Statistical test miscalibration. The test itself is producing too many false positives, often because of incorrect variance estimation or violations of the method's assumptions.
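As a rough illustration of the first failure mode, a balance check compares a pre-experiment covariate between the two groups; under correct randomization it should not differ systematically. The column name and the toy data below are invented for the sketch.

```python
# Toy balance check for randomization bias: compare a pre-experiment
# covariate across the two groups. The column name and values are invented.
import pandas as pd
from scipy import stats

# One row per user: assigned group plus activity from before the experiment started.
assignments = pd.DataFrame({
    "group": ["A", "B", "A", "B", "A", "B"],
    "prior_sessions": [12, 9, 15, 11, 8, 14],
})

a = assignments.loc[assignments["group"] == "A", "prior_sessions"]
b = assignments.loc[assignments["group"] == "B", "prior_sessions"]

# Under correct randomization, pre-experiment behavior should not differ systematically.
_, p_value = stats.ttest_ind(a, b)
print(f"Balance check p-value: {p_value:.3f}")
```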
When should you run an A/A test?
Run A/A tests in three situations.
When you first set up your experimentation platform. Before you trust any A/B test results, verify that the system produces correct null results. This is especially important when connecting Confidence to a new data warehouse or metric pipeline for the first time.
After major infrastructure changes. If you migrate your event logging, change your assignment mechanism, update your metrics computation, or switch statistical methods, an A/A test confirms nothing broke. At Spotify, infrastructure changes that touch the experimentation pipeline include A/A validation as part of the rollout process.
As periodic health checks. Some teams run a standing A/A test alongside their regular experiments. Any unexpected significance in the A/A test is an early warning signal that something in the pipeline has drifted.
What can go wrong with A/A tests?
Running too few. A single A/A test that shows no significance doesn't prove the system is correct. You need multiple A/A tests (or one A/A test evaluated across many metrics) to assess whether the false positive rate matches the expected level. One test is a single coin flip. You need dozens to know if the coin is fair.
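One way to evaluate a batch of A/A results, assuming you have collected the p-values from each test, is to check that they look uniform on [0, 1], which is what a calibrated system produces under the null. The values below are made up, and the Kolmogorov-Smirnov test is just one reasonable choice.

```python
# Given p-values collected from many A/A tests (or many metrics in one test),
# check that they look uniform on [0, 1]. The values below are made up.
from scipy import stats

aa_pvalues = [0.43, 0.91, 0.12, 0.77, 0.55, 0.03, 0.68, 0.29, 0.84, 0.61]

# Under the null, p-values from a calibrated system are uniformly distributed.
stat, p = stats.kstest(aa_pvalues, "uniform")
print(f"KS test against Uniform(0, 1): statistic={stat:.3f}, p={p:.3f}")

# Also check the raw false positive rate at the 5% level.
fp_rate = sum(pv < 0.05 for pv in aa_pvalues) / len(aa_pvalues)
print(f"Share significant at 5%: {fp_rate:.0%}")
```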
Confusing "no significance" with "correct." An A/A test with low statistical power might not detect a real bias. If the groups are small or the metric is noisy, even a genuine problem could hide behind the noise. Make sure your A/A test has enough traffic and enough metrics to be a meaningful check.
Ignoring sample ratio mismatch. Before looking at metric results, check whether the two groups have the expected number of users. A 50/50 split that comes back 52/48 across tens of thousands of users is a red flag: the randomization or the logging has a problem. Confidence runs SRM checks automatically and flags mismatches before you even look at the metric-level results.
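The underlying check is a goodness-of-fit test on the assignment counts; a chi-square test against the expected split is the common formulation. The counts below are invented, and this is a sketch of the general technique, not Confidence's implementation.

```python
# Sample ratio mismatch as a chi-square goodness-of-fit test against the
# expected 50/50 split. The counts are invented; this sketches the general
# technique rather than Confidence's implementation.
from scipy.stats import chisquare

observed = [52_000, 48_000]             # users actually assigned to each group
expected = [sum(observed) / 2] * 2      # expected counts under a 50/50 split

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
if p_value < 0.001:
    print(f"Likely SRM, investigate before trusting metrics: p = {p_value:.2e}")
else:
    print(f"No SRM detected: p = {p_value:.3f}")
```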