Core Experimentation

What is an A/B test?

An A/B test is a randomized controlled experiment that splits users into two groups: one sees the current experience (control), the other sees a changed version (treatment). By comparing metrics between the groups, you measure the causal impact of the change, not just correlation. It's the most reliable way to answer the question: did this change actually make the product better?

At Spotify, teams run over 10,000 A/B tests per year across 300+ teams and 750 million users. 42% of those experiments are rolled back after guardrail metrics detect regressions. That number isn't a sign of bad product development. It's evidence that the platform catches real harm before it ships.

How does an A/B test work?

The mechanics are straightforward:

  1. A feature flag randomly assigns each user to either control or treatment. The assignment is deterministic: a hash of the user ID and a salt guarantees the same user always sees the same variant without storing state.
  2. Both groups use the product normally. The only difference between them is the change being tested.
  3. After enough data accumulates, a statistical test compares the metric values between groups. If the difference is large enough to be unlikely under chance alone, the result is statistically significant.

The critical property is randomization. Because users are assigned randomly, any observed difference in metrics can be attributed to the change itself, not to pre-existing differences between the groups. This is what separates an A/B test from looking at before/after data or comparing user segments.
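The deterministic assignment described in step 1 can be sketched as follows. This is a minimal illustration, not a real platform's implementation: the salt format and hash-to-bucket mapping are assumptions, but the core idea (hash the user ID with an experiment salt, map to [0, 1), compare against the split ratio) is the standard technique.

```python
import hashlib

def assign_variant(user_id: str, salt: str, treatment_share: float = 0.5) -> str:
    """Deterministically bucket a user: the same (user_id, salt) pair always
    maps to the same variant, so no assignment state needs to be stored."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    # Map the first 8 hex chars of the hash to a point in [0, 1).
    bucket = int(digest[:8], 16) / 16**8
    return "treatment" if bucket < treatment_share else "control"
```

Because the salt is part of the hash input, changing it (e.g. for a new experiment) reshuffles users into fresh, independent buckets.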

What makes a good A/B test?

Four things need to be right before you start.

A clear hypothesis. "We think changing X will improve metric Y because Z." The hypothesis isn't a formality. It forces you to commit to what you expect and why, which determines what you measure and how you interpret the result.

The right metrics. A success metric measures what you're trying to improve. Guardrail metrics monitor what you're trying not to break. Proxy metrics stand in for outcomes that take too long to observe directly. Getting the metric set wrong is the most common source of misleading results. If you optimize a proxy metric directly, you can destroy the relationship between the proxy and the outcome it was supposed to predict.

Adequate statistical power. Power is the probability of detecting a real effect when one exists. An underpowered test produces ambiguous null results that teach nothing: you can't tell whether the change had no effect or whether your test was simply too small to see it. Power depends on three things: sample size, the variance of your metric, and the minimum detectable effect (MDE) you care about.
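The relationship between sample size, variance, and MDE can be made concrete with the standard normal-approximation formula for comparing two conversion rates. A rough sketch using only the Python standard library:

```python
import math
from statistics import NormalDist

def sample_size_per_group(baseline: float, mde: float,
                          alpha: float = 0.05, power: float = 0.8) -> int:
    """Users needed per group to detect a relative lift of `mde` on a
    conversion rate `baseline`, via the two-proportion normal approximation."""
    p1, p2 = baseline, baseline * (1 + mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_beta = NormalDist().inv_cdf(power)           # power requirement
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)
```

Halving the MDE roughly quadruples the required sample size, which is why detecting a 5% relative lift on a 10% baseline takes tens of thousands of users per group while a 20% lift takes a few thousand.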

A bold enough implementation. The change being tested should be provocative enough to produce a measurable signal if the hypothesis is correct. Mentimeter calls this the "Maximum Viable Change": the loudest possible version of the idea that still functions as a user experience. Test whether the lever exists before you optimize the implementation.

When should you use an A/B test vs. a rollout?

A/B tests and rollouts answer different questions.

An A/B test validates an idea. Traffic is split at a fixed ratio (typically 50/50), the experiment runs until it reaches its planned sample size, and the result tells you whether the change improved your success metric and whether it harmed your guardrails.

A rollout releases a change safely. Traffic starts small (1%, 5%) and increases gradually. The platform monitors guardrail metrics at each stage. If something breaks, you roll back before most users are affected. A rollout doesn't tell you whether the change is good. It tells you whether the change is safe to ship.

Most mature experimentation programs use both: A/B test to validate, then roll out to release.
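The staged rollout described above can be sketched as a simple control loop. The helper interfaces here (`set_exposure`, `guardrails_healthy`, `wait_for_data`) are hypothetical, not any real platform's API; the point is the shape of the process, not the implementation.

```python
# Hypothetical staged-rollout loop: widen exposure in stages, check
# guardrails at each stage, and roll back on the first regression.
STAGES = [0.01, 0.05, 0.25, 0.50, 1.00]

def run_rollout(set_exposure, guardrails_healthy, wait_for_data):
    for exposure in STAGES:
        set_exposure(exposure)      # widen the flag to the next stage
        wait_for_data()             # let guardrail metrics accumulate
        if not guardrails_healthy():
            set_exposure(0.0)       # regression detected: roll back
            return "rolled_back"
    return "shipped"                # every stage passed its guardrail check
```

Note what this loop does not do: it never estimates the treatment effect. It only answers the safety question, which is why it complements rather than replaces an A/B test.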

What are common A/B testing mistakes?

Peeking at results before the test is done. Checking a fixed-horizon test early and stopping when you see significance inflates false positive rates. Sequential testing methods (group sequential tests, always-valid inference) solve this by designing the analysis to handle repeated looks.
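The inflation from peeking is easy to demonstrate by simulation. The sketch below runs A/A tests (no true effect) and compares a single final analysis against stopping at the first "significant" interim look; the specific sample sizes and number of looks are arbitrary choices for illustration.

```python
import random
from statistics import NormalDist, mean

random.seed(7)
ALPHA, N, LOOKS, SIMS = 0.05, 500, 5, 2000
z_crit = NormalDist().inv_cdf(1 - ALPHA / 2)

def significant(a, b):
    # Two-sample z-test; the data is N(0, 1), so unit variance is known.
    z = (mean(a) - mean(b)) / (2 / len(a)) ** 0.5
    return abs(z) > z_crit

peeking_fp = fixed_fp = 0
for _ in range(SIMS):
    a = [random.gauss(0, 1) for _ in range(N)]  # A/A test: both arms
    b = [random.gauss(0, 1) for _ in range(N)]  # draw from the same distribution
    # Peeking: check at 5 interim looks, stop at the first significant one.
    step = N // LOOKS
    if any(significant(a[:k], b[:k]) for k in range(step, N + 1, step)):
        peeking_fp += 1
    # Fixed horizon: a single analysis at the end.
    if significant(a, b):
        fixed_fp += 1

print(f"fixed-horizon false positive rate: {fixed_fp / SIMS:.3f}")
print(f"peeking false positive rate:       {peeking_fp / SIMS:.3f}")
```

The fixed-horizon rate lands near the nominal 5%, while the peeking rate is roughly two to three times higher, even though no variant is ever actually better.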

Ignoring multiple testing. Running an experiment with ten metrics and declaring victory on whichever one is significant gives you a ~40% chance of a false positive. Multiple testing corrections (Bonferroni, Benjamini-Hochberg) adjust for this.
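The ~40% figure follows directly from the probability of at least one false positive across independent tests, and the Bonferroni correction is just a division:

```python
# Family-wise error rate when testing m independent metrics at level alpha,
# and the Bonferroni-corrected per-metric threshold.
alpha, m = 0.05, 10

family_wise_error = 1 - (1 - alpha) ** m   # P(at least one false positive)
bonferroni_alpha = alpha / m               # per-metric threshold

print(f"chance of >=1 false positive across {m} metrics: {family_wise_error:.1%}")
print(f"Bonferroni per-metric threshold: {bonferroni_alpha}")
```

Bonferroni is conservative; Benjamini-Hochberg trades strict family-wise control for more power by controlling the false discovery rate instead.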

Running underpowered tests. If your test has 30% power, it produces a clear answer less than a third of the time. The rest of the time you get ambiguous null results that consume experiment bandwidth without generating learning.

Including users who never saw the change. If you change the checkout flow but include all users in your analysis, the effect gets diluted by everyone who never reached checkout. Trigger analysis restricts the analysis to users who actually encountered the change, improving sensitivity without biasing the result.
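The dilution effect is simple arithmetic. With hypothetical numbers (they are illustrative, not from any real experiment): if only 20% of users reach checkout and the true lift among them is 5%, the all-users estimate shrinks to 1%.

```python
# Why including non-exposed users dilutes the measured effect.
# Hypothetical numbers: 20% of users reach checkout; the true lift there is +5%.
trigger_rate = 0.20
effect_on_triggered = 0.05

# Non-triggered users contribute zero effect, so the all-users
# estimate is a weighted average that shrinks toward zero:
diluted_effect = trigger_rate * effect_on_triggered + (1 - trigger_rate) * 0.0

print(f"effect among triggered users: {effect_on_triggered:.1%}")  # 5.0%
print(f"effect measured over all users: {diluted_effect:.1%}")     # 1.0%
```

A 1% effect needs roughly 25 times the sample size of a 5% effect to detect, which is why trigger analysis improves sensitivity so dramatically.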

Treating every result as a ship/don't-ship decision. The most valuable outcome of an A/B test is often the learning, not the shipping decision. At Spotify, the win rate is around 12%, but the learning rate is 64%. Most experiments don't produce a positive result. They produce an understanding of what doesn't work, which sharpens product intuition over time.

How is an A/B test different from multivariate testing?

An A/B test (or A/B/n test with multiple variants) changes one thing and measures its effect. A multivariate test (MVT) changes multiple elements simultaneously and measures both individual and interaction effects. MVTs require substantially more traffic because the number of variant combinations grows multiplicatively.

For most product experimentation, A/B tests are the right default. Use MVTs when you have high traffic and genuinely need to understand how multiple changes interact with each other.
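The multiplicative traffic cost is easy to see with a hypothetical MVT over three page elements, each with two variants:

```python
# Variant combinations grow multiplicatively with the elements under test.
# Hypothetical MVT: 3 page elements, each with 2 variants.
elements = {"headline": 2, "button_color": 2, "image": 2}

combinations = 1
for variants in elements.values():
    combinations *= variants

users = 80_000
print(f"{combinations} combinations")             # 8 cells
print(f"{users // combinations} users per cell")  # vs 40,000 per arm in a 50/50 A/B test
```

The same 80,000 users that give an A/B test 40,000 per arm give this MVT only 10,000 per cell, which is why MVTs demand much higher traffic to reach equivalent power.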

Related terms

Control Group (Core Experimentation): The control group is the set of users in an experiment who see the unchanged, current experience.

Treatment Effect (Core Experimentation): A treatment effect is the measured difference in a metric between the treatment group (users who see a change) and the control group (users who see the current experience).

Statistical Significance (Statistical Methods): Statistical significance is the determination that an observed difference between experiment groups is unlikely to have occurred by chance alone.

Sample Size (Statistical Methods): Sample size is the number of experimental units (typically users) needed in an A/B test to detect a given effect with a specified level of confidence and power.

Feature Flag (Feature Flags): A feature flag is a runtime switch that controls whether a feature is active for a given user, without deploying new code.

Guardrail Metric (Metrics): A guardrail metric is a metric monitored during an experiment to ensure the change doesn't cause unintended harm, even when the success metric improves.

Sequential Testing (Sequential Testing): Sequential testing is a statistical framework that allows experimenters to make valid decisions at multiple analysis points during an experiment, rather than waiting for a single final evaluation.

Variance Reduction (Statistical Methods): Variance reduction is a set of statistical techniques that tighten the confidence intervals of an A/B test without requiring more traffic.

Rollout (Feature Flags): A rollout is the process of releasing a feature to users in controlled stages using feature flags.