Experiment Analysis

What is a Multi-Armed Bandit?

A multi-armed bandit is an adaptive experiment design that shifts traffic allocation toward better-performing variants during the experiment, rather than splitting traffic evenly for the entire duration. Named after the problem of choosing between multiple slot machines (each with unknown payout rates), bandit algorithms balance exploration (learning which variant is better) with exploitation (sending more users to the variant that's currently winning).

Bandits are genuinely useful in specific contexts: optimizing a single metric in real time when the decision is purely about which variant performs best on that metric. Ad selection, headline testing, and landing page optimization are classic cases. The appeal is straightforward: if variant B is outperforming variant A, why keep sending half the traffic to A?

When do bandits work well?

Bandits shine when three conditions hold:

The optimization target is a single metric with fast feedback. Click-through rate on an ad, conversion on a landing page, open rate on an email subject line. The metric is observed quickly, the cost of showing a worse variant is real and measurable, and you want to minimize regret, the total value lost by showing suboptimal variants (formalized after this list).

The decision is reversible and low-stakes per user. Showing someone a slightly worse headline for a few hours doesn't cause lasting harm. The cost is statistical (less total engagement) rather than experiential.

You don't need a precise treatment effect estimate. Bandits converge toward the best variant but don't produce clean confidence intervals the way fixed-allocation experiments do. If your goal is "pick the winner" rather than "measure the effect size," bandits are efficient.
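
Regret has a standard formal definition: the cumulative gap between the expected reward of the best arm and that of the arms actually chosen. Writing mu_a for the expected reward of arm a and a_t for the arm played at step t:

```latex
R(T) = \sum_{t=1}^{T} \left( \mu^{\star} - \mu_{a_t} \right),
\qquad \mu^{\star} = \max_{a} \mu_{a}
```

Thompson sampling and UCB keep R(T) growing only logarithmically in T, while a fixed 50/50 split accrues regret linearly for as long as one variant is worse.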

Thompson sampling, upper confidence bound (UCB), and epsilon-greedy are the most common algorithms. Thompson sampling, which draws from the posterior distribution of each variant's performance and selects the variant with the highest draw, tends to perform best in practice for web optimization.
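
As a concrete illustration, here is a minimal Thompson sampling sketch for binary rewards under a Beta-Bernoulli model. The variant names, priors, and conversion rates are made up for the example:

```python
import random

# Beta(1, 1) priors; alpha counts successes + 1, beta counts failures + 1.
variants = {"A": {"alpha": 1, "beta": 1}, "B": {"alpha": 1, "beta": 1}}

def choose_variant():
    # Draw once from each variant's posterior and play the highest draw.
    draws = {
        name: random.betavariate(p["alpha"], p["beta"])
        for name, p in variants.items()
    }
    return max(draws, key=draws.get)

def record_outcome(name, converted):
    # Conjugate update: a success bumps alpha, a failure bumps beta.
    if converted:
        variants[name]["alpha"] += 1
    else:
        variants[name]["beta"] += 1

# Simulated traffic with made-up conversion rates (A: 5%, B: 8%).
true_rates = {"A": 0.05, "B": 0.08}
for _ in range(10_000):
    v = choose_variant()
    record_outcome(v, random.random() < true_rates[v])

# Allocation drifts toward B as its posterior concentrates above A's.
for name, p in variants.items():
    n = p["alpha"] + p["beta"] - 2
    mean = p["alpha"] / (p["alpha"] + p["beta"])
    print(f"{name}: {n} impressions, posterior mean {mean:.3f}")
```

The exploration/exploitation balance falls out of the posterior: a variant with few observations has a wide posterior, so it still wins the draw often enough to keep being tested, while a clearly worse variant is selected less and less.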

Why doesn't Confidence use bandits for experimentation?

Confidence deliberately chose not to implement multi-armed bandits for product experimentation, and the reasoning goes to the core of what product experiments are for.

Product decisions require balancing multiple metrics. A feature change that improves engagement might harm revenue. A checkout optimization that lifts conversion might degrade user satisfaction. Bandits optimize a single reward signal. When you feed one metric into a bandit algorithm, it maximizes that metric while remaining blind to everything else. In fifteen years of running experiments at Spotify, the pattern is clear: optimizing a single metric in real time finds a local maximum and calls it a win, while something unmeasured quietly declines.

Guardrail metrics exist because product decisions have side effects. Confidence's analysis framework evaluates success metrics, guardrail metrics, and quality metrics together. An experiment that improves the success metric but trips a guardrail gets flagged for review, not auto-promoted. Bandits have no concept of guardrails. Adaptive allocation toward a "winning" variant can route more users to a treatment that's actively harming a metric you haven't wired into the reward function.
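
The gap is easy to see in code. Below is a hypothetical guardrail-aware decision rule, a sketch of the concept rather than Confidence's actual analysis logic. Note that a bandit's reward loop has nowhere to plug in the guardrail check:

```python
from dataclasses import dataclass

@dataclass
class MetricResult:
    name: str
    estimate: float   # estimated treatment effect
    ci_low: float     # lower bound of the confidence interval
    ci_high: float    # upper bound of the confidence interval

def decide(success: MetricResult, guardrails: list[MetricResult]) -> str:
    # Simplified trip condition: the guardrail's whole interval is below zero,
    # i.e. we are confident the treatment regressed that metric.
    tripped = [g.name for g in guardrails if g.ci_high < 0]
    if tripped:
        return f"flag for review: guardrail regression in {', '.join(tripped)}"
    if success.ci_low > 0:
        return "ship: success metric improved, no guardrail tripped"
    return "inconclusive: keep collecting data or stop"

# Hypothetical results: conversion is up, but a satisfaction guardrail is down.
print(decide(
    MetricResult("conversion", 0.04, 0.01, 0.07),
    [MetricResult("satisfaction", -0.03, -0.05, -0.01)],
))
```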

Effect size estimation matters. Product teams don't just need to know which variant is better. They need to know by how much, with what uncertainty, and for which user segments. These estimates inform roadmap decisions, resource allocation, and future experiment design. Bandit algorithms trade estimation precision for allocation efficiency, which is the right trade-off for ad optimization but the wrong one for product development.
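
For contrast, fixed allocation makes effect estimation straightforward. Here is a minimal sketch of a difference-in-proportions estimate with a normal-approximation (Wald) confidence interval, using made-up counts:

```python
import math

def diff_in_proportions(conv_a, n_a, conv_b, n_b, z=1.96):
    """Treatment effect estimate (B minus A) with a 95% Wald interval."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    effect = p_b - p_a
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return effect, (effect - z * se, effect + z * se)

# Hypothetical 50/50 split with 10,000 users per arm.
effect, (lo, hi) = diff_in_proportions(500, 10_000, 560, 10_000)
print(f"estimated lift: {effect:.4f} (95% CI {lo:.4f} to {hi:.4f})")
```

Because the allocation never changes, this estimator is unbiased and the interval has approximately its nominal coverage, which is exactly what breaks down under adaptive allocation.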

The replication problem. Bandit-allocated experiments are harder to analyze after the fact. Because the allocation changes over time, the data from early and late phases come from different allocation ratios, which complicates standard statistical inference. Producing valid confidence intervals from bandit data requires specialized methods that are less mature and less well understood than fixed-allocation analysis.
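
A small simulation makes the problem concrete. Adaptive allocation biases the naive per-arm sample means, because an arm that looks bad early receives fewer observations and its unlucky streak is never averaged out. The sketch below runs Thompson sampling on two identical arms and checks the naive estimates, which come out low on average (parameters are arbitrary):

```python
import random

def run_bandit(true_rate=0.05, steps=1000):
    """Thompson sampling on two arms that share the SAME true rate."""
    arms = [{"alpha": 1, "beta": 1, "conv": 0, "n": 0} for _ in range(2)]
    for _ in range(steps):
        draws = [random.betavariate(a["alpha"], a["beta"]) for a in arms]
        arm = arms[draws.index(max(draws))]
        converted = random.random() < true_rate
        arm["alpha"] += converted       # bool coerces to 0 or 1
        arm["beta"] += 1 - converted
        arm["conv"] += converted
        arm["n"] += 1
    # Naive per-arm conversion rates, as a standard analysis would compute them.
    return [a["conv"] / a["n"] for a in arms if a["n"] > 0]

random.seed(7)
estimates = [m for _ in range(3000) for m in run_bandit()]
bias = sum(estimates) / len(estimates) - 0.05
print(f"average bias of the naive estimate: {bias:+.4f}")  # negative in expectation
```

Under a fixed 50/50 split the same estimator is unbiased; the bias here is introduced purely by the data-dependent allocation.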

Are bandits ever appropriate alongside A/B tests?

Yes. Bandits and A/B tests answer different questions, and mature experimentation programs can use both.

Use bandits for runtime optimization: selecting the best creative, headline, or configuration from a set of candidates when the goal is to maximize a single metric with fast feedback. Use A/B tests for product decisions: evaluating whether a change improves the metrics that matter, doesn't harm the ones that shouldn't move, and produces learning that informs future development.

The distinction maps to the difference between optimization and experimentation. Optimization minimizes regret on a known objective. Experimentation generates evidence for decisions that involve multiple objectives and long-term consequences.