TL;DR
At Spotify, we build personalization systems using our ML stack and evaluate them through our experimentation stack. Each tech stack does what it's good at.
Personalization systems have strict infrastructure requirements: access to diverse model types (neural networks, boosting, bandits), rich feature sets, low-latency inference, and real-time data collection. These requirements don't fit naturally inside an experimentation tool. And even if you use a contextual bandit, you still need to evaluate that bandit as a system by running A/B tests on different bandit versions. When A/B tests and multi-armed bandits live in the same tool, evaluating a bandit means one instance of the tool testing another instance of the same tool, which creates confusing dependencies.
Keeping a clean separation of concerns helps us scale with less friction for product teams.
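To make the separation concrete, here is a minimal sketch in Python of an A/B test that compares two bandit versions. It is illustrative only and not Spotify's implementation: `assign_variant`, `EpsilonGreedyBandit`, `BANDIT_VERSIONS`, `ARMS`, and `serve` are hypothetical names, the hash-based bucketing is an assumed stand-in for the experimentation stack, and a simple (non-contextual) epsilon-greedy bandit stands in for whatever model the ML stack would actually serve.

```python
import hashlib
import random


def assign_variant(user_id: str, experiment: str, variants: list[str]) -> str:
    """Experimentation side (assumed): deterministic hash-based bucketing."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]


class EpsilonGreedyBandit:
    """ML side (assumed): a toy bandit standing in for a real model."""

    def __init__(self, arms, epsilon, seed=0):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {arm: 0 for arm in self.arms}
        self.values = {arm: 0.0 for arm in self.arms}

    def select(self):
        # Explore with probability epsilon, otherwise pick the best-known arm.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)
        return max(self.arms, key=lambda arm: self.values[arm])

    def update(self, arm, reward):
        # Incremental running mean of the observed rewards for this arm.
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


# Hypothetical arms and bandit versions under test.
ARMS = ["playlist_a", "playlist_b", "playlist_c"]
BANDIT_VERSIONS = {
    "control": EpsilonGreedyBandit(ARMS, epsilon=0.1, seed=1),
    "treatment": EpsilonGreedyBandit(ARMS, epsilon=0.3, seed=2),
}


def serve(user_id: str):
    """The A/B test picks the bandit version; the bandit picks the arm."""
    variant = assign_variant(user_id, "bandit-epsilon-test", list(BANDIT_VERSIONS))
    arm = BANDIT_VERSIONS[variant].select()
    return variant, arm
```

In this sketch the experiment only decides which bandit version a user sees, and each bandit is evaluated as a whole system by comparing outcomes between the two variants. Neither component needs to know how the other works internally.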
Read the full post on Spotify Engineering: Why We Use Separate Tech Stacks for Personalization and Experimentation

