Experiment like Spotify: Feature Flags

Johan Rydberg, General Manager
Sebastian Ankargren, Senior Data Scientist

If you want to experiment like Spotify, check out our experimentation platform Confidence. It's currently available in a set of selected markets, and we're gradually adding more markets as we go.

This post is part of a series that showcases how you can use Confidence. Make sure to check out our earlier posts Experiment like Spotify: With Confidence and Experiment like Spotify: A/B Tests and Rollouts.

At Spotify, experimentation is fundamental to the way we develop our products, and feature flagging is an essential part of this.

Our experimentation platform, Confidence, is built for teams that, like Spotify, want to leverage experimentation and data-informed decision making in their product development process. The experimentation workflow we use at Spotify is naturally part of the DNA of Confidence.

With Confidence, you can easily put something behind a flag, turn it on for your closest team, and evaluate the change in a scientific and reliable way in an A/B test. If you find a winning variant, you can convert the test into a rollout and continue shipping the experience to your entire user base. Users in the winning treatment keep receiving the winning experience; there's no flickering back to a default or control.

Use flags to remotely control the experience

Before jumping into more details, let's start with the basics. A feature flag is a way to remotely control what experience a user, visitor, or any other type of identifier receives. Instead of immediately making changes for everyone when your updated code is ready, you release the changes but limit the number of people who can see them. Add randomization to the splitting of traffic and you're on track to run a full-blown A/B test. This approach puts you in full control of the amount of traffic your new change receives. At any time, you can increase or decrease usage.

Traditional feature flags are boolean values that indicate whether to enable the new change. A flag in Confidence is more than this simple on-off switch: it's a structure with named properties. Think of it as a JSON object. This makes it possible to control multiple aspects of the behavior of a client with a single flag. Flags have a schema that describes the structure of the value, including available properties and their data types. Variants give a name to a value of the flag, which defines a possible behavior of the thing the flag is controlling.

Control the Spotify home screen with flags

As an example, imagine a flag that controls the various aspects of the Spotify home screen. The flag has a name, home-screen, and the value has properties that govern:

  • The size of the title (title-font-size).
  • Whether to show shortcuts (show-shortcuts).
  • The number of shortcuts to show (shortcuts-count).

Property         Type     Description
title-font-size  String   Size of the title font.
show-shortcuts   Boolean  Whether to show shortcuts.
shortcuts-count  Integer  How many shortcuts to show.

The flag has two variants called default and large-style, whose values are:

Property         default  large-style
title-font-size  "small"  "large"
show-shortcuts   true     true
shortcuts-count  4        8
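
Since a flag value is essentially a JSON object, the two variants' values can be written out as:

// default
{
  "title-font-size": "small",
  "show-shortcuts": true,
  "shortcuts-count": 4
}

// large-style
{
  "title-font-size": "large",
  "show-shortcuts": true,
  "shortcuts-count": 8
}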

With the flag and its variants in place, a client (such as an app, a website, or a backend service) can resolve the flag to a value. For example, when a user visits the Spotify home screen, the app resolves the flag using the ID of the user and receives the large-style variant. It then uses the values of the flag's properties directly to render the design of the home screen, showing a large title and eight shortcuts.

Test new variants without changing your code

A major advantage of enriching flags with entire configurations is that you don't need to touch the code every time you want to test a new variant. Implement the feature flag so that the values of the properties are simply passed on in the right places; after that, you can create new variants on the flag itself without changing the code at all. For example, imagine that you also want to test a no-shortcuts variant in which you disable the shortcuts altogether. To do that, leave the code unchanged and create another variant with show-shortcuts set to false, and you're immediately good to go.
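
For illustration, the value of this new no-shortcuts variant could look like the following (the shortcut count is irrelevant when shortcuts are hidden; zero is just a placeholder):

{
  "title-font-size": "small",
  "show-shortcuts": false,
  "shortcuts-count": 0
}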

In addition to enabling fast testing of new variations, the flag configurations are particularly helpful for teams that want to use feature flagging but have a limited amount of engineering resources. With on-off feature switches, every new variant and idea requires a code change and the help of engineers. With flag configurations, the engineers can set up the mapping from flag properties to the implementation of the features once, and then marketers, product managers, and others can create whatever variants they want without requiring any further engineering support.

OpenFeature — An open standard for feature flagging

Flags in Confidence are based on OpenFeature, the feature-flagging standardization initiative that the Cloud Native Computing Foundation (CNCF) is incubating. Spotify has a long tradition of working with the CNCF; in 2020 we donated Backstage to the community, and most recently Spotify received the Top End User award for our contributions across the cloud-native ecosystem. We believe that standardizing feature flagging greatly benefits customers, since it reduces vendor lock-in. As part of our commitment to OpenFeature, we donated SDKs for Swift and Kotlin to the foundation in 2023.

Control the Spotify home screen through the OpenFeature client

To use the home-screen flag and dynamically set what the user sees on their home screen, you leverage the OpenFeature libraries together with the Confidence provider. In JavaScript, setting that up looks like this:

import { OpenFeature } from '@openfeature/js-sdk';
import { createConfidenceServerProvider } from '@spotify-confidence/openfeature-server-provider';

// Register the Confidence provider with the OpenFeature SDK.
OpenFeature.setProvider(
  createConfidenceServerProvider({
    clientSecret: '<API_KEY>',
    fetchImplementation: fetch,
    timeout: 1000, // resolve timeout in milliseconds
  }),
);

After instructing OpenFeature to use the Confidence provider, use the OpenFeature library to resolve the flag.

const client = OpenFeature.getClient();

// Flag resolution is asynchronous in the server SDK, so await the result.
const value = await client.getObjectValue(
  'home-screen',
  // default values if no variant is returned
  {
    'title-font-size': 'small',
    'show-shortcuts': true,
    'shortcuts-count': 4,
  },
  { targetingKey: '<USER_ID>' },
);

The returned value includes all the properties and the values to use for the user ID passed in the request. For example, use value['show-shortcuts'] to control the visibility of the shortcuts.
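
As a sketch, rendering can branch directly on the resolved properties; renderTitle and renderShortcuts below are hypothetical helpers, not part of any SDK:

// Sketch only: renderTitle and renderShortcuts are hypothetical helpers.
renderTitle({ fontSize: value['title-font-size'] });

if (value['show-shortcuts']) {
  renderShortcuts({ count: value['shortcuts-count'] });
}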

How Confidence Flags helps you create and release great products

Achieve exceptional performance

System latency greatly impacts the user experience. We know from experience at Spotify that increases in startup and page load times lead to lower user engagement and increased churn.

When using feature flags, the application needs to know the value of a flag to decide what experience to serve a user. In Confidence, we say the application "resolves the flag". You resolve a flag value by passing information about the user and the environment (for example, browser version) to a resolve engine that looks at the configuration and decides what value the flag should have. Where the resolve engine runs affects the latency of the resolve operation, but also what functionality is available when resolving. If the SDKs evaluate rules locally, you get almost zero latency. On the other hand, you have to give up features that require cross-device state, like sticky assignments that keep serving a variant even after the responsible rule changes, or more sophisticated targeting criteria.

In Confidence, you have several different options for achieving the performance you need.

For all applications, you can use our central resolvers, which are deployed across the world to reduce latency. They're a good option for most mobile apps, since apps are less latency sensitive: our mobile SDKs resolve all flags at once and cache the result throughout the session, so the incurred latency for using a flag becomes zero. To cut latency even further, you can use resolvers that run at the edge, shortening the distance between the user's application and the resolver. Our edge resolvers are deployed in over 100 locations around the world.

Using these centrally managed resolvers, you don't have to give up any functionality.

For backend and server-side rendered web workloads, there are a few more options. You can use the central resolvers to leverage all functionality, at the cost of some added latency. You can also deploy our resolvers locally in a container: if you run the resolver as a sidecar in your pod, latency is virtually zero; if you deploy it as a service, you get the latency of any other intra-cloud network call. You can hook up the resolver to a local database to store state. The cost of running the resolvers yourself is naturally more maintenance, since it's no longer a service managed by us.

We know from first-hand experience how essential performance is for any business. It's an issue we take seriously, and we're constantly looking for ways to reduce latency while keeping the functionality needed for more experimentation-heavy organizations.

Get consistent behavior on all platforms

The engine powering Confidence Flags helps you achieve consistent experiences wherever you use your flags. As explained in our paper, the allocation engine hashes incoming IDs into buckets multiple times, and the resulting buckets are randomly assigned to experiments and variants. Because the salts used in the hashing are fixed, however, the bucket a given ID belongs to is deterministic. This means that if the same user ID resolves a flag from within an iPhone app or from a browser on a desktop computer, the user receives the same variant (as long as the flag is enabled for both clients). SDKs are available for all major languages, including JavaScript, Swift, Go, Kotlin, Python, Java, Rust, Ruby, and PHP.
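
To illustrate the idea, here is a simplified sketch of salted, deterministic bucketing. This is not Confidence's actual allocation engine; the salt and bucket count are made up:

import { createHash } from 'node:crypto';

// Hash the ID together with a salt; the same inputs always yield the same bucket.
function bucketFor(targetingKey, salt, numBuckets = 1000) {
  const digest = createHash('sha256').update(`${salt}:${targetingKey}`).digest();
  return digest.readUInt32BE(0) % numBuckets;
}

// The same user ID lands in the same bucket on any platform.
console.log(bucketFor('user-123', 'home-screen-salt'));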

Write all logs to your data warehouse

The data your flags generate belongs to you. That's why Confidence is 100% data warehouse native: we don't store the logs that flag resolutions generate. Everything is written to your data warehouse, so you have full ownership of and insight into your data.

Coordinate experiments that shouldn't overlap

The Spotify experience is heavily personalized and powered by ML models that rank and suggest content to users. For example, ML models rank the recommendations on the home screen, including the content shown in the shortcuts. To run multiple experiments on a single model at the same time, we need the capability to run mutually exclusive experiments, or else a single experiment could potentially receive all incoming traffic. Through coordination, you can make sure that an experiment gets a fixed percentage of the population and that it isn't eclipsed by other experiments that have a higher priority. Imagine that you launch an experiment comparing model A to model B, and that a week later you want to start another test comparing model A to model C. If you coordinate the two tests, you can assign a 50% allocation to both of them, and together they will use 100% of the population. If you don't coordinate them, they will overlap at random; since they use the same flag, a given user can only get their variant from one of the two experiments.

Pragmatic reasons like these, and not a fear of interactions (unless experiences are truly incompatible), should drive your coordination practice. A recent analysis by Microsoft showed that being overly cautious about interaction effects simply isn't warranted. At the scale Spotify operates at, coordination instead serves as a planning tool: with more than 300 experimenting teams, the reality is vastly more complex than the two-experiment example above, and coordination helps teams avoid stepping on each other's toes.

In Confidence, you coordinate experiences through "exclusivity tags" on a flag rule (flag rules are what A/B tests use to control the experience). The tags tell Confidence's allocation engine which traffic shouldn't overlap. For example, set the tag home-screen-shortcuts-ranker on all ranker tests for the home screen shortcuts, and Confidence makes sure they're exclusive to each other. To verify that this practice is statistically sound, we investigated the implications of coordination on the analysis in a research paper. In brief, our findings support the use of this practice.

Exclusive experiments or not, you can quickly run out of users to test on whenever your population is finite. Confidence's allocation engine takes targeting conditions into consideration when allocating users for a test. This frees up users for testing when targeting already creates exclusive groups. For example, say that test A targets users in Sweden and test B targets users in Brazil. Since these tests are already exclusive to each other, Confidence can optimize its allocation and free up space for further testing.

Confidence's exclusivity allocation also supports holdbacks. Holdbacks enable a practice where some users are set aside from the experimentation program and aren't given the latest features. Later on, you can use these users to run an experiment that estimates the lift of all features enabled at once. Many teams at Spotify use holdbacks to get a reading on the impact they had during a quarter.

Use flags all the way from prototype to full release

The virtue of feature flagging is that it's with you all the way from early prototype to full release. Returning to the example with the home-screen flag and its default and large-style variants, here's a breakdown of what the full lifecycle of that flag can look like:

  • Overrides. Override just yourself to the new variant to do quality assurance and test out your new change. The flag rule targets only users whose user IDs match any of a fixed set of IDs.
  • Employee test. Run an A/B test on your employees. The flag rule targets only users for whom the requests include "is_employee": true in the context (see the sketch after this list).
  • A/B test. Run an A/B test on a subset of the population. The flag rule targets only a fixed percentage of users (for example, 20%) and randomly returns default or large-style. The experiment coordinates with other experiments through the home-screen-shortcuts-ranker exclusivity tag.
  • Rollout. After you conclude the experiment and find the large-style variant to be superior, convert the A/B test into a rollout. The flag rule keeps serving the large-style variant to all users who received it in the A/B test, but in the rollout you can ramp up the percentage of users that receives it.
  • Static flag rule. When you reach 100% and the large-style variant is fully rolled out, end the rollout but keep the flag rule. This lets you continue to serve the large-style variant to everyone.
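
For the employee test step, for example, the extra attribute simply rides along in the OpenFeature evaluation context. Here's a sketch reusing the client from the earlier example, with defaults standing in for the default values shown before:

// Extra context attributes let flag rules target, for example, employees.
const value = await client.getObjectValue(
  'home-screen',
  defaults, // same default values as in the earlier example
  { targetingKey: '<USER_ID>', is_employee: true },
);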

At the end of this cycle, what you do depends on the nature of your change and your product. You can choose to keep the flag and start on your next iteration, or remove the dependency on the flag from your code altogether if it's a one-time change you won't revisit. But don't be too quick to remove the flag — as long as you're serving the new variant through the flag, you have the possibility to roll back to default instantly.

Leverage the Confidence APIs to integrate with your other tools

Confidence is more than an experimentation tool: it's a true platform. This means that you can do everything described above, including all flag management, via the APIs. You can integrate the entire management needed throughout the lifecycle of a flag, including overrides, coordination, and varying types of allocation, into your other tools: machine learning platforms, content and marketing platforms, and more. The possibilities of using Confidence this way are endless, and truly give you an opportunity to power up your work close to where it happens.

What's next

This post is part of a series that showcases how you can use Confidence. Coming up in the series are posts on analysis of experiments, metrics, workflows, and more.

Confidence is currently available in private beta. If you haven't signed up already, sign up today.