Counterfactual logging is the practice of recording what a system would have shown a user under an alternative policy or variant, alongside what was actually shown. In a recommendation system, this means logging both the recommendations the user saw and the recommendations a different algorithm would have produced for the same context. The counterfactual data is never shown to the user. It exists purely for analysis.
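As a rough illustration, a single logged record might look like the sketch below. The Python dataclass and its field names are hypothetical, chosen only to show the idea of storing the served output and the counterfactual output side by side with the decision context; they do not describe any particular system's schema.

```python
# Illustrative sketch of one counterfactual log record (field names are
# hypothetical). The production policy's output is what the user saw; the
# alternative policy's output is logged but never served.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class CounterfactualLogRecord:
    request_id: str                     # ties the record to one decision point
    user_context: Dict[str, float]      # features available at decision time
    candidate_ids: List[str]            # the candidate set both policies ranked
    shown_ids: List[str]                # what the production policy actually served
    counterfactual_ids: List[str]       # what the alternative policy would have served
    production_propensities: Dict[str, float] = field(default_factory=dict)
    # probability the production policy assigned to each served item; useful
    # later for importance-weighted (off-policy) estimates
```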
This technique is most valuable when evaluating machine learning models and ranking algorithms, where online experiments are expensive and offline evaluation methods need realistic data to be trustworthy.
Why is counterfactual logging useful for experimentation?
Running an A/B test is the gold standard for causal evaluation, but it's not always practical. Testing a new recommendation model requires deploying it to live users, which means engineering effort, risk to user experience, and the opportunity cost of allocating traffic.
Counterfactual logging enables a form of offline evaluation that's closer to a real A/B test than traditional backtesting. If the production system logs what the alternative model would have recommended at each decision point, you can estimate how the alternative would have performed without ever serving it to users.
The key requirement: the logging must capture enough context to reconstruct the decision. This includes the user state, the candidate set, the features available to the model, and the action the production system took. Without full context, the counterfactual estimates are unreliable.
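To show how such logs can feed an offline estimate, here is a minimal inverse propensity scoring (IPS) sketch in Python. The log entry fields and the target_policy_prob callback are assumptions made for illustration, not the API of any specific logging or experimentation tool.

```python
from typing import Callable, Dict, List

def ips_estimate(
    logs: List[Dict],
    target_policy_prob: Callable[[Dict, str], float],
) -> float:
    """Inverse propensity scoring estimate of the target policy's value.

    Each log entry is assumed to hold the context, the action the production
    policy took, the reward observed for that action (e.g. a click), and the
    probability the production policy assigned to that action.
    """
    total = 0.0
    for entry in logs:
        prod_prob = entry["production_propensity"]  # P(action | context) under production
        targ_prob = target_policy_prob(entry["context"], entry["action"])
        weight = targ_prob / prod_prob              # importance weight
        total += weight * entry["reward"]
    return total / len(logs)
```

The estimate reweights the rewards the production policy actually observed by how much more (or less) likely the alternative policy would have been to take the same action, which is why the logged propensities and full decision context are required.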
At Spotify, where personalization drives a large fraction of the user experience, counterfactual logging enables teams to evaluate model changes before committing to a full A/B test. This reduces the number of experiments that need to be run while maintaining rigor. When a counterfactual evaluation looks promising, the team runs a confirmatory A/B test. When it doesn't, they've saved weeks of experiment time.
What are the limitations of counterfactual evaluation?
Counterfactual logging doesn't replace A/B tests. It supplements them. Several limitations apply:
No observed user response. The counterfactual policy was never experienced by users. You can estimate what would have been shown, but you can't observe how users would have responded to it. User behavior depends on context, mood, and interaction effects that don't appear in logged features.
Distributional shift. If the alternative policy would have shown very different content, the logged data from the production policy may not cover the relevant states well. The further the counterfactual policy diverges from the production policy, the less reliable the estimates become; one way to make this concrete is shown in the sketch after this list.
Missing feedback loops. If the alternative model would have recommended an item the user has never seen, you have no data on whether the user would have engaged with it. This is the fundamental exploration-exploitation challenge.
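A simple diagnostic for the distributional-shift and missing-support caveats above is to inspect the importance weights themselves. The helper below is a hypothetical sketch, not something a specific platform provides: it computes the Kish effective sample size of the weights used in an IPS estimate, which collapses toward a handful of logged decisions when the alternative policy diverges sharply from production.

```python
from typing import List

def effective_sample_size(weights: List[float]) -> float:
    """Kish effective sample size of a set of importance weights.

    A value much smaller than len(weights) means the off-policy estimate is
    dominated by a few logged decisions, so it should be treated as a rough
    screen rather than a substitute for a randomized experiment.
    """
    s = sum(weights)
    s2 = sum(w * w for w in weights)
    return (s * s) / s2 if s2 > 0 else 0.0
```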
These limitations are why Confidence focuses on A/B testing as the primary evaluation method. Counterfactual logging is a valuable pre-screening step, but the causal claim comes from the randomized experiment.