Core Experimentation

What is an A/B test?

An A/B test is a randomized controlled experiment that splits users into two groups: one sees the current experience (control), the other sees a changed version (treatment). By comparing metrics between the groups, you measure the causal impact of the change, not just correlation. It's the most reliable way to answer the question: did this change actually make the product better?

At Spotify, teams run over 10,000 A/B tests per year across 300+ teams and 750 million users. 42% of those experiments are rolled back after guardrail metrics detect regressions. That number isn't a sign of bad product development. It's evidence that the platform catches real harm before it ships.

How does an A/B test work?

The mechanics are straightforward:

  1. A feature flag randomly assigns each user to either control or treatment. The assignment is deterministic: a hash of the user ID and a salt guarantees the same user always sees the same variant without storing state.
  2. Both groups use the product normally. The only difference between them is the change being tested.
  3. After enough data accumulates, a statistical test compares the metric values between groups. If the difference is large enough to be unlikely under chance alone, the result is statistically significant.

The critical property is randomization. Because users are assigned randomly, any observed difference in metrics can be attributed to the change itself, not to pre-existing differences between the groups. This is what separates an A/B test from looking at before/after data or comparing user segments.
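The deterministic assignment described in step 1 can be sketched as follows. This is a minimal illustration, not a real platform's implementation: the salt format and hash-to-bucket mapping are assumptions, but the core idea (hash the user ID with an experiment salt, map to [0, 1), compare against the split ratio) is the standard technique.

```python
import hashlib

def assign_variant(user_id: str, salt: str, treatment_share: float = 0.5) -> str:
    """Deterministically bucket a user: the same (user_id, salt) pair always
    maps to the same variant, so no assignment state needs to be stored."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    # Map the first 8 hex chars of the hash to a point in [0, 1).
    bucket = int(digest[:8], 16) / 16**8
    return "treatment" if bucket < treatment_share else "control"
```

Because the salt is part of the hash input, changing it (e.g. for a new experiment) reshuffles users into fresh, independent buckets.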

What makes a good A/B test?

Four things need to be right before you start.

A clear hypothesis. "We think changing X will improve metric Y because Z." The hypothesis isn't a formality. It forces you to commit to what you expect and why, which determines what you measure and how you interpret the result.

The right metrics. A success metric measures what you're trying to improve. Guardrail metrics monitor what you're trying not to break. Proxy metrics stand in for outcomes that take too long to observe directly. Getting the metric set wrong is the most common source of misleading results. If you optimize a proxy metric directly, you can destroy the relationship between the proxy and the outcome it was supposed to predict.

Adequate statistical power. Power is the probability of detecting a real effect when one exists. An underpowered test produces ambiguous null results that teach nothing: you can't tell whether the change had no effect or whether your test was simply too small to see it. Power depends on three things: sample size, the variance of your metric, and the minimum detectable effect (MDE) you care about.
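The relationship between sample size, variance, and MDE can be made concrete with the standard normal-approximation formula for comparing two conversion rates. A rough sketch using only the Python standard library:

```python
import math
from statistics import NormalDist

def sample_size_per_group(baseline: float, mde: float,
                          alpha: float = 0.05, power: float = 0.8) -> int:
    """Users needed per group to detect a relative lift of `mde` on a
    conversion rate `baseline`, via the two-proportion normal approximation."""
    p1, p2 = baseline, baseline * (1 + mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_beta = NormalDist().inv_cdf(power)           # power requirement
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)
```

Halving the MDE roughly quadruples the required sample size, which is why detecting a 5% relative lift on a 10% baseline takes tens of thousands of users per group while a 20% lift takes a few thousand.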

A bold enough implementation. The change being tested should be provocative enough to produce a measurable signal if the hypothesis is correct. Mentimeter calls this the "Maximum Viable Change": the loudest possible version of the idea that still functions as a user experience. Test whether the lever exists before you optimize the implementation.

When should you use an A/B test vs. a rollout?

A/B tests and rollouts answer different questions.

An A/B test validates an idea. Traffic is split at a fixed ratio (typically 50/50), the experiment runs until it reaches its planned sample size, and the result tells you whether the change improved your success metric and whether it harmed your guardrails.

A rollout releases a change safely. Traffic starts small (1%, 5%) and increases gradually. The platform monitors guardrail metrics at each stage. If something breaks, you roll back before most users are affected. A rollout doesn't tell you whether the change is good. It tells you whether the change is safe to ship.

Most mature experimentation programs use both: A/B test to validate, then roll out to release.
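The staged rollout described above can be sketched as a simple control loop. The helper interfaces here (`set_exposure`, `guardrails_healthy`, `wait_for_data`) are hypothetical, not any real platform's API; the point is the shape of the process, not the implementation.

```python
# Hypothetical staged-rollout loop: widen exposure in stages, check
# guardrails at each stage, and roll back on the first regression.
STAGES = [0.01, 0.05, 0.25, 0.50, 1.00]

def run_rollout(set_exposure, guardrails_healthy, wait_for_data):
    for exposure in STAGES:
        set_exposure(exposure)      # widen the flag to the next stage
        wait_for_data()             # let guardrail metrics accumulate
        if not guardrails_healthy():
            set_exposure(0.0)       # regression detected: roll back
            return "rolled_back"
    return "shipped"                # every stage passed its guardrail check
```

Note what this loop does not do: it never estimates the treatment effect. It only answers the safety question, which is why it complements rather than replaces an A/B test.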

What are common A/B testing mistakes?

Peeking at results before the test is done. Checking a fixed-horizon test early and stopping when you see significance inflates false positive rates. Sequential testing methods (group sequential tests, always-valid inference) solve this by designing the analysis to handle repeated looks.
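The inflation from peeking is easy to demonstrate by simulation. The sketch below runs A/A tests (no true effect) and compares a single final analysis against stopping at the first "significant" interim look; the specific sample sizes and number of looks are arbitrary choices for illustration.

```python
import random
from statistics import NormalDist, mean

random.seed(7)
ALPHA, N, LOOKS, SIMS = 0.05, 500, 5, 2000
z_crit = NormalDist().inv_cdf(1 - ALPHA / 2)

def significant(a, b):
    # Two-sample z-test; the data is N(0, 1), so unit variance is known.
    z = (mean(a) - mean(b)) / (2 / len(a)) ** 0.5
    return abs(z) > z_crit

peeking_fp = fixed_fp = 0
for _ in range(SIMS):
    a = [random.gauss(0, 1) for _ in range(N)]  # A/A test: both arms
    b = [random.gauss(0, 1) for _ in range(N)]  # draw from the same distribution
    # Peeking: check at 5 interim looks, stop at the first significant one.
    step = N // LOOKS
    if any(significant(a[:k], b[:k]) for k in range(step, N + 1, step)):
        peeking_fp += 1
    # Fixed horizon: a single analysis at the end.
    if significant(a, b):
        fixed_fp += 1

print(f"fixed-horizon false positive rate: {fixed_fp / SIMS:.3f}")
print(f"peeking false positive rate:       {peeking_fp / SIMS:.3f}")
```

The fixed-horizon rate lands near the nominal 5%, while the peeking rate is roughly two to three times higher, even though no variant is ever actually better.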

Ignoring multiple testing. Running an experiment with ten metrics and declaring victory on whichever one is significant gives you a ~40% chance of a false positive. Multiple testing corrections (Bonferroni, Benjamini-Hochberg) adjust for this.
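The ~40% figure follows directly from the probability of at least one false positive across independent tests, and the Bonferroni correction is just a division:

```python
# Family-wise error rate when testing m independent metrics at level alpha,
# and the Bonferroni-corrected per-metric threshold.
alpha, m = 0.05, 10

family_wise_error = 1 - (1 - alpha) ** m   # P(at least one false positive)
bonferroni_alpha = alpha / m               # per-metric threshold

print(f"chance of >=1 false positive across {m} metrics: {family_wise_error:.1%}")
print(f"Bonferroni per-metric threshold: {bonferroni_alpha}")
```

Bonferroni is conservative; Benjamini-Hochberg trades strict family-wise control for more power by controlling the false discovery rate instead.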

Running underpowered tests. If your test has 30% power, it produces a clear answer less than a third of the time. The rest of the time you get ambiguous null results that consume experiment bandwidth without generating learning.

Including users who never saw the change. If you change the checkout flow but include all users in your analysis, the effect gets diluted by everyone who never reached checkout. Trigger analysis restricts the analysis to users who actually encountered the change, improving sensitivity without biasing the result.
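The dilution effect is simple arithmetic. With hypothetical numbers (they are illustrative, not from any real experiment): if only 20% of users reach checkout and the true lift among them is 5%, the all-users estimate shrinks to 1%.

```python
# Why including non-exposed users dilutes the measured effect.
# Hypothetical numbers: 20% of users reach checkout; the true lift there is +5%.
trigger_rate = 0.20
effect_on_triggered = 0.05

# Non-triggered users contribute zero effect, so the all-users
# estimate is a weighted average that shrinks toward zero:
diluted_effect = trigger_rate * effect_on_triggered + (1 - trigger_rate) * 0.0

print(f"effect among triggered users: {effect_on_triggered:.1%}")  # 5.0%
print(f"effect measured over all users: {diluted_effect:.1%}")     # 1.0%
```

A 1% effect needs roughly 25 times the sample size of a 5% effect to detect, which is why trigger analysis improves sensitivity so dramatically.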

Treating every result as a ship/don't-ship decision. The most valuable outcome of an A/B test is often the learning, not the shipping decision. At Spotify, the win rate is around 12%, but the learning rate is 64%. Most experiments don't produce a positive result. They produce an understanding of what doesn't work, which sharpens product intuition over time.

How is an A/B test different from multivariate testing?

An A/B test (or A/B/n test with multiple variants) changes one thing and measures its effect. A multivariate test (MVT) changes multiple elements simultaneously and measures both individual and interaction effects. MVTs require substantially more traffic because the number of variant combinations grows multiplicatively.

For most product experimentation, A/B tests are the right default. Use MVTs when you have high traffic and genuinely need to understand how multiple changes interact with each other.
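The multiplicative traffic cost is easy to see with a hypothetical MVT over three page elements, each with two variants:

```python
# Variant combinations grow multiplicatively with the elements under test.
# Hypothetical MVT: 3 page elements, each with 2 variants.
elements = {"headline": 2, "button_color": 2, "image": 2}

combinations = 1
for variants in elements.values():
    combinations *= variants

users = 80_000
print(f"{combinations} combinations")             # 8 cells
print(f"{users // combinations} users per cell")  # vs 40,000 per arm in a 50/50 A/B test
```

The same 80,000 users that give an A/B test 40,000 per arm give this MVT only 10,000 per cell, which is why MVTs demand much higher traffic to reach equivalent power.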

Related terms

Control Group (Core Experimentation): The control group is the set of users in an experiment who see the unchanged, current experience.

Treatment Effect (Core Experimentation): A treatment effect is the measured difference in a metric between the treatment group (users who see a change) and the control group (users who see the current experience).

Statistical Significance (Statistical Methods): Statistical significance is the determination that an observed difference between experiment groups is unlikely to have occurred by chance alone.

Sample Size (Statistical Methods): Sample size is the number of experimental units (typically users) needed in an A/B test to detect a given effect with a specified level of confidence and power.

Feature Flag (Feature Flags): A feature flag is a runtime switch that controls whether a feature is active for a given user, without deploying new code.

Guardrail Metric (Metrics): A guardrail metric is a metric monitored during an experiment to ensure the change doesn't cause unintended harm, even when the success metric improves.

Sequential Testing (Sequential Testing): Sequential testing is a statistical framework that allows experimenters to make valid decisions at multiple analysis points during an experiment, rather than waiting for a single final evaluation.

Variance Reduction (Statistical Methods): Variance reduction is a set of statistical techniques that tighten the confidence intervals of an A/B test without requiring more traffic.

Rollout (Feature Flags): A rollout is the process of releasing a feature to users in controlled stages using feature flags.