An A/B/n test is a randomized experiment that compares more than two variants simultaneously: one control and two or more treatments. The "n" represents the number of variants beyond the original A/B pair. Instead of asking "is B better than A?", an A/B/n test asks "which of B, C, D is best, and are any of them better than A?"
A/B/n tests let teams evaluate multiple design options, copy alternatives, or feature configurations in a single experiment rather than running them back to back. This matters for velocity. At Spotify, where 300+ teams run over 10,000 experiments per year, running three sequential A/B tests to compare three ideas would consume three times the experiment bandwidth that a single A/B/n test requires. Confidence supports A/B/n experiments natively, including the multiple testing corrections needed to keep the results trustworthy.
How does an A/B/n test differ from a standard A/B test?
The mechanics are the same as in a standard A/B test; only the number of groups changes. A feature flag deterministically hashes each user into one of n+1 groups (one control, n treatments), and each group sees exactly one variant. The difference shows up in the analysis.
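Before getting to that analysis, here is roughly what the assignment step can look like. This is a minimal sketch with a hypothetical assign_variant helper, not the API of any particular flag SDK:

```python
import hashlib

def assign_variant(user_id: str, experiment_salt: str, variants: list[str]) -> str:
    """Deterministically map a user to one of the variants.

    The same user_id and salt always hash to the same bucket, so a user
    sees a consistent variant across sessions and devices.
    """
    digest = hashlib.sha256(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(variants)  # uniform split over n+1 groups
    return variants[bucket]

variants = ["control", "treatment_b", "treatment_c", "treatment_d"]
print(assign_variant("user-42", "checkout-cta-2024", variants))
```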
With two groups, you make one comparison: treatment vs. control. With four groups, you make three comparisons. Each additional comparison increases the chance of a false positive. If you run three tests at a 5% significance level without adjustment, the probability of at least one false positive rises to roughly 14%. Multiple testing corrections, such as Bonferroni or Holm, adjust the significance threshold so the overall false positive rate stays at 5%.
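The 14% figure follows from treating the three comparisons as independent: 1 − 0.95³ ≈ 0.14. A short sketch that reproduces it and shows how a Bonferroni adjustment brings the family-wise rate back to roughly 5%:

```python
alpha = 0.05
comparisons = 3  # three treatments, each compared against control

# Without correction: chance of at least one false positive across the family,
# assuming the comparisons are independent and no variant has a real effect.
fwer_uncorrected = 1 - (1 - alpha) ** comparisons
print(f"Uncorrected family-wise error rate: {fwer_uncorrected:.3f}")  # ~0.143

# Bonferroni: run each individual comparison at alpha / m instead of alpha.
alpha_per_test = alpha / comparisons
fwer_corrected = 1 - (1 - alpha_per_test) ** comparisons
print(f"Corrected family-wise error rate:   {fwer_corrected:.3f}")   # ~0.049
```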
Confidence applies these corrections automatically when an experiment has more than two variants.
When should you use an A/B/n test instead of sequential A/B tests?
Use an A/B/n test when you have several concrete alternatives and enough traffic to power the comparison. The main advantage is speed: you test all variants in parallel under identical conditions, which removes the extra calendar time and the seasonal variation that come with running experiments one after another.
The trade-off is traffic. Each additional variant needs its own share of users. An A/B test with a 50/50 split gives each group half the total traffic; an A/B/n test with four variants gives each group roughly 25%. To reach the same statistical power per comparison you therefore need roughly twice the total traffic (or twice the run time), or you have to accept a larger minimum detectable effect (MDE).
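To put a number on that cost, the standard two-proportion sample-size formula shows how many users each group needs. The 4% baseline conversion rate and 0.5-percentage-point lift below are assumptions chosen only for the example:

```python
from math import ceil
from scipy.stats import norm

def sample_size_per_group(p_control: float, mde_abs: float,
                          alpha: float = 0.05, power: float = 0.8) -> int:
    """Users needed per group to detect an absolute lift of mde_abs on a
    conversion rate of p_control (two-sided z-test on two proportions)."""
    p_treatment = p_control + mde_abs
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)

per_group = sample_size_per_group(p_control=0.04, mde_abs=0.005)
print(f"Users per group: {per_group}")
# An A/B test needs 2 * per_group users in total; a four-variant A/B/n test
# needs 4 * per_group, so the same traffic takes roughly twice as long.
```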
A practical rule: if you have three or more ideas and your traffic can support the split without pushing experiment duration past four weeks, run the A/B/n test. If traffic is tight, prioritize the two most promising variants and run a standard A/B test.
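A quick back-of-the-envelope duration check makes that call concrete. Every number below is a placeholder to swap for your own traffic and power figures:

```python
# Rough duration check for the "four week" rule of thumb (illustrative numbers).
users_per_group = 25_000      # from a power calculation like the one above
daily_eligible_users = 8_000  # assumed users entering the experiment per day
n_variants = 4                # one control + three treatments

total_users_needed = users_per_group * n_variants
days_needed = total_users_needed / daily_eligible_users
print(f"Estimated duration: {days_needed:.0f} days ({days_needed / 7:.1f} weeks)")
# If this lands well past four weeks, cut variants or accept a larger MDE.
```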
What are common mistakes with A/B/n tests?
Skipping multiple testing corrections. The most frequent error. Teams compare each variant to control independently, each at the 5% level, and report whichever one is significant. This inflates the family-wise error rate. Confidence handles this by default, but teams building custom analyses need to apply corrections manually.
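For custom analyses, standard libraries already implement these corrections. One possible sketch uses statsmodels' multipletests with the Holm method; the p-values are invented for illustration:

```python
from statsmodels.stats.multitest import multipletests

# p-values from comparing treatments B, C, D against control (illustrative)
raw_pvalues = [0.012, 0.034, 0.210]

reject, adjusted, _, _ = multipletests(raw_pvalues, alpha=0.05, method="holm")
for variant, p_raw, p_adj, sig in zip(["B", "C", "D"], raw_pvalues, adjusted, reject):
    print(f"Variant {variant}: raw p={p_raw:.3f}, "
          f"Holm-adjusted p={p_adj:.3f}, significant={sig}")
```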
Adding too many variants. Each variant dilutes traffic. A test with ten variants and 100,000 total users gives each variant only 10,000 users. Unless each variant produces a large effect, the test won't have the power to detect meaningful differences. Two to four variants is the practical sweet spot for most product experiments.
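Turning the sample-size formula around shows the dilution directly: holding total traffic fixed, the smallest detectable lift grows as variants are added. The 4% baseline rate is again an assumption for the example:

```python
from math import sqrt
from scipy.stats import norm

def approximate_mde(p_control: float, n_per_group: int,
                    alpha: float = 0.05, power: float = 0.8) -> float:
    """Smallest absolute lift detectable with n_per_group users per group
    (approximation that uses the control-group variance for both groups)."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return z * sqrt(2 * p_control * (1 - p_control) / n_per_group)

total_users = 100_000
for n_variants in (2, 4, 10):
    n_per_group = total_users // n_variants
    mde = approximate_mde(p_control=0.04, n_per_group=n_per_group)
    print(f"{n_variants} variants -> {n_per_group} users/group, "
          f"MDE ~ {mde * 100:.2f} percentage points")
```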
Comparing variants to each other instead of to control. An A/B/n test is designed to answer "which variants beat the control?", not "is variant C better than variant B?" Pairwise comparisons between all treatments multiply the number of tests and weaken the statistical guarantees. If you need to rank treatments against each other, plan for that in the experiment design and adjust the correction family accordingly.
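For a sense of scale: three treatments compared only to control is three tests, while all pairwise comparisons among four variants is six, and the correction gets stricter with every test added to the family. A two-line check:

```python
from math import comb

variants = 4  # one control + three treatments
print(f"Treatments vs. control: {variants - 1} comparisons")
print(f"All pairwise:           {comb(variants, 2)} comparisons")
```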