Interleaving is an experimentation technique in which results from treatment and control are mixed within a single user session, rather than each user being assigned entirely to one group. It's most commonly used in search and ranking systems, where a user sees a blended result list drawn from two competing algorithms, and their interactions with individual results reveal which algorithm is better.
Interleaving is dramatically more sensitive than traditional A/B testing for ranking problems. Published research from major tech companies consistently shows that interleaving experiments need 10 to 100 times fewer users to detect the same effect. At Spotify, where search and recommendation rankings affect 750 million users, that sensitivity advantage matters: it means ranking teams can evaluate algorithm changes in hours or days rather than weeks.
How does interleaving work?
In a standard A/B test, User A sees results entirely from Algorithm 1, and User B sees results entirely from Algorithm 2. You compare aggregate metrics (click-through rate, session length, satisfaction) between the two groups.
In an interleaving experiment, each user sees a single result list assembled from both algorithms. The simplest variant, team-draft interleaving, works like picking teams for a schoolyard game: Algorithm 1 and Algorithm 2 take turns picking their next-best result for each position in the list, with a coin flip deciding which one picks first in each round. The user sees one unified list and interacts with it naturally. Behind the scenes, the system tracks which algorithm "contributed" each result. If users consistently click on results contributed by Algorithm 2 more often, Algorithm 2 is performing better.
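As a concrete sketch, here is roughly what team-draft list construction and click crediting could look like in Python. The function names and the simple credit rule are illustrative, not taken from any particular production system.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, k):
    """Build an interleaved list of length k from two rankings (team-draft).

    Returns the combined list plus the 'team' (A or B) credited with each slot.
    """
    interleaved, teams = [], []
    picked = set()
    count_a = count_b = 0

    def next_unpicked(ranking):
        # First result from this ranking not already placed in the interleaved list.
        return next((doc for doc in ranking if doc not in picked), None)

    while len(interleaved) < k:
        # The team with fewer picks goes next; a coin flip breaks ties.
        a_turn = count_a < count_b or (count_a == count_b and random.random() < 0.5)
        ranking, team = (ranking_a, "A") if a_turn else (ranking_b, "B")
        doc = next_unpicked(ranking)
        if doc is None:  # this ranking is exhausted; fall back to the other one
            ranking, team = (ranking_b, "B") if a_turn else (ranking_a, "A")
            doc = next_unpicked(ranking)
            if doc is None:
                break
        interleaved.append(doc)
        teams.append(team)
        picked.add(doc)
        if team == "A":
            count_a += 1
        else:
            count_b += 1
    return interleaved, teams

def credit_clicks(teams, clicked_positions):
    """Tally clicks credited to each algorithm for one session."""
    wins = {"A": 0, "B": 0}
    for pos in clicked_positions:
        wins[teams[pos]] += 1
    return wins
```

In production the clicked positions would come from logged interactions; the point of the sketch is only that every slot carries a team label, so each click can later be attributed to one algorithm.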
More sophisticated methods exist. Probabilistic interleaving assigns each result a probability of being contributed by each algorithm rather than a binary assignment, reducing variance. Optimized interleaving selects the combined list to maximize the statistical information gained from each impression.
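As a rough illustration of the probabilistic variant, the sketch below samples each slot from a rank-based softmax over one of the two rankings. It simplifies the full method, which also marginalizes click credit over the possible assignments; the `tau` parameter and function names are assumptions made for illustration.

```python
import random

def softmax_over_ranks(ranking, picked, tau=3.0):
    """Probability of each remaining document, proportional to 1 / rank^tau."""
    remaining = [doc for doc in ranking if doc not in picked]
    weights = [1.0 / (ranking.index(doc) + 1) ** tau for doc in remaining]
    total = sum(weights)
    return remaining, [w / total for w in weights]

def probabilistic_interleave(ranking_a, ranking_b, k, tau=3.0):
    """Build a length-k list; each slot is drawn from one ranker's softmax."""
    interleaved, sources, picked = [], [], set()
    while len(interleaved) < k:
        # Pick a ranker uniformly at random for this slot, then sample a document
        # from that ranker's rank-based distribution over remaining documents.
        ranking, source = random.choice([(ranking_a, "A"), (ranking_b, "B")])
        docs, probs = softmax_over_ranks(ranking, picked, tau)
        if not docs:  # chosen ranker exhausted; try the other one
            ranking, source = (ranking_b, "B") if source == "A" else (ranking_a, "A")
            docs, probs = softmax_over_ranks(ranking, picked, tau)
            if not docs:
                break
        doc = random.choices(docs, weights=probs, k=1)[0]
        interleaved.append(doc)
        sources.append(source)
        picked.add(doc)
    return interleaved, sources
```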
Why is interleaving so much more sensitive?
The sensitivity advantage comes from within-user comparison. In a traditional A/B test, the unit of analysis is the user. You compare averages across thousands of users, and all the natural variation between users (different tastes, different usage patterns, different moods) adds noise. In interleaving, each user serves as their own control. The comparison happens within the same session, the same query, the same moment of intent. User-level variation cancels out.
This is the same statistical principle behind paired designs and variance reduction techniques like CUPED. Interleaving takes it further by making the pairing happen at the individual interaction level.
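To make the paired comparison concrete, the sketch below scores each session as a win for A, a win for B, or a tie using the per-position team labels from the team-draft example, then runs a sign test (a binomial test over the non-tied sessions). The scipy dependency and the majority-of-clicks credit rule are illustrative choices, not the only way to analyze interleaving data.

```python
from scipy.stats import binomtest

def session_winner(teams, clicked_positions):
    """Score one session: 'A', 'B', or 'tie' by which team got more credited clicks."""
    clicks_a = sum(1 for pos in clicked_positions if teams[pos] == "A")
    clicks_b = sum(1 for pos in clicked_positions if teams[pos] == "B")
    if clicks_a == clicks_b:
        return "tie"
    return "A" if clicks_a > clicks_b else "B"

def interleaving_sign_test(sessions):
    """sessions: iterable of (teams, clicked_positions) pairs from logged traffic.

    Returns the fraction of non-tied sessions won by B and a two-sided p-value
    against the null hypothesis that A and B win equally often.
    """
    wins_a = wins_b = 0
    for teams, clicks in sessions:
        outcome = session_winner(teams, clicks)
        if outcome == "A":
            wins_a += 1
        elif outcome == "B":
            wins_b += 1
    decided = wins_a + wins_b
    if decided == 0:
        return None, 1.0
    result = binomtest(wins_b, decided, p=0.5)
    return wins_b / decided, result.pvalue
```

Because every session contributes its own paired comparison, between-user differences never enter the test statistic, which is where the sensitivity gain comes from.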
The practical consequence: a ranking team can test more algorithm variants in less time. That's a direct increase in experiment bandwidth for ranking-specific changes.
When should you use interleaving vs. a standard A/B test?
Interleaving works well for a specific class of problems: evaluating ranked lists where users interact with individual items. Search results, content recommendations, playlist ordering, ad ranking. In these settings, the question is "which algorithm produces better items in better positions?" and interleaving answers it efficiently.
Interleaving doesn't work for changes where the entire user experience differs between variants. A redesigned homepage, a new onboarding flow, a different pricing page: these require traditional A/B tests because you can't meaningfully blend two fundamentally different experiences within a single session.
Most mature experimentation programs use both. Interleaving evaluates ranking and recommendation changes quickly. A/B tests validate broader product changes. When a ranking change passes interleaving evaluation, teams often run a follow-up A/B test to measure the impact on broader metrics (session length, retention, revenue) that interleaving can't capture.
What are the limitations?
Interleaving measures preference between two algorithms, but it doesn't directly measure the effect on business metrics. Users might prefer Algorithm B's results in an interleaving test, but that preference might not translate into longer sessions or higher retention. The interleaving result tells you which algorithm users engage with more within a session. It doesn't tell you about downstream outcomes.
Position bias is another concern. Users click on higher-ranked items more frequently regardless of quality. Interleaving methods account for this through their assignment procedures (team-draft alternates positions between algorithms), but residual bias can still affect results if not handled carefully.