Reduce Dilution and Improve Sensitivity with Trigger Analysis

Mårten Schultzberg, Staff Data Scientist
Sebastian Ankargren, Senior Data Scientist

If you want to experiment like Spotify, check out our experimentation platform Confidence. It's currently available in selected markets, and we're gradually adding more as we go.

Introduction

Experimentation has become a cornerstone of how digital companies operate. After getting started with experimentation, many companies shift their focus to learning more and maximizing the value of each experiment. Ideally, an experiment maximizes value, measured by the information it produces, while minimizing associated costs—specifically, the number of users exposed. Reducing the number of exposed users decreases the risk of exposing users to poor experiences, and enables a higher velocity through more parallel experiments. More experiments lead to more frequent iterations, accelerating the transformation of good ideas into great products.

The most common method to increase the efficiency of experiments is through variance reduction techniques, such as CUPED, which have become standard practice. This approach uses data collected before users are exposed to reduce variance, which directly results in fewer users being required for experiments.

There are also specialized methods that are more efficient for certain types of experiments. For example, experiments on ranking algorithms can use interleaving, where comparisons within users enable more efficient treatment evaluation. For an introduction to this method, see, for example, this Netflix blog post. In this blog post, we discuss a technique for increasing efficiency that goes by several names, including exposure filters, trigger analysis, and exposure based on counterfactual logging (Stone, 2022). In this post, we refer to it collectively as trigger analysis. To see what others have written about trigger analysis and related topics, see the blog posts by Booking.com and DoorDash, and the book chapter by Kohavi, Tang, and Xu (2020). In this post, we focus on the relation between trigger analysis and exposure logging, how the implementation of feature flags affects the need for trigger analysis, and the trade-offs and implications of using trigger analysis, and we discuss the most common pitfall.

What is exposure logging?

In online experiments, an exposure event determines when users are considered exposed to an experiment. In a nutshell, users interact with a website or an app. At some point during this interaction, the website or app queries the feature flagging service to decide which variant to show to the user. This query is commonly referred to as resolving. It could involve decisions about which text to present, which recommendation algorithm to use, or some other change to the user experience. The feature flagging service emits an event every time a flag is resolved. The event records which user received what experience at what time. These events are the foundation for analyzing experiments, as they identify which users belong to each treatment group and when each user first encountered the new experience. To evaluate the change introduced in an experiment, pipelines combine exposure events with business or user behavior data, such as streams, purchases, or transactions, to produce the experiment's metrics.
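
To make the join concrete, here's a minimal Python sketch of how an analysis pipeline might combine exposure events with behavior data. The DataFrames, column names, and numbers are made up for illustration and don't reflect Spotify's actual pipelines.

```python
import pandas as pd

# Hypothetical exposure events emitted by the feature flagging service:
# one row per resolve, recording user, variant, and timestamp.
exposures = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "variant": ["treatment", "treatment", "control", "treatment"],
    "resolved_at": pd.to_datetime(
        ["2024-05-01 10:00", "2024-05-02 09:30", "2024-05-01 11:15", "2024-05-01 12:00"]
    ),
})

# Keep each user's first exposure: it defines group membership and exposure time.
first_exposure = (
    exposures.sort_values("resolved_at")
    .groupby("user_id", as_index=False)
    .first()
)

# Hypothetical behavior data, e.g. streams per user and session.
streams = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "streamed_at": pd.to_datetime(
        ["2024-05-02 08:00", "2024-05-01 12:00", "2024-05-03 18:00", "2024-05-01 13:00"]
    ),
    "streams": [5, 2, 7, 1],
})

# Join behavior data onto exposures and keep only measurements
# observed after the user's first exposure.
metric_input = streams.merge(first_exposure, on="user_id")
metric_input = metric_input[metric_input["streamed_at"] >= metric_input["resolved_at"]]

# Per-variant metric, e.g. average streams per exposed user.
per_user = metric_input.groupby(["variant", "user_id"])["streams"].sum()
print(per_user.groupby("variant").mean())
```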

The fact that a user has emitted an exposure event doesn't necessarily mean they've actually experienced the change.

While the procedure sounds simple enough, the fact that a user has emitted an exposure event doesn't necessarily mean they've actually experienced the change. If the user doesn't experience the change, there's nothing that can cause a change in user behavior. In more technical jargon, there's no way for the treatment to have an effect. The importance of this issue varies depending on the type of client that resolves the feature flag. This distinction isn't just important for trigger analysis; the difference has much broader implications for generalizability.

Mobile and web applications

Mobile and web applications often resolve feature flags in batch on app start to avoid flickering. As an example, consider a feature flag that determines whether to show a "You might also like" section at the bottom of playlists in the Spotify mobile app. Not all users will visit a playlist, and not all will scroll to the bottom. This means that all users who start the app emit an exposure event for the feature flag, regardless of whether they ever see the new section.

The image above shows a schematic illustration of the user journey and when the exposure event occurs. For a mobile or web application, the exposure event is typically emitted on app startup. Later actions, like visiting a playlist or scrolling to the bottom of it, aren't taken into account.
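
To illustrate the batch-resolve behavior, here's a small Python sketch with a hypothetical FlagClient class (not a real SDK): on app start, every flag is resolved and an exposure event is emitted, whether or not the user ever reaches the changed surface.

```python
# A minimal sketch of batch resolution on app start; FlagClient and the flag
# name are invented for illustration, not an actual SDK or flag.

class FlagClient:
    def __init__(self, assignments, event_log):
        self.assignments = assignments  # flag name -> variant for this user
        self.event_log = event_log      # sink for exposure events

    def resolve_all(self, user_id):
        # Resolving every flag in one batch on app start: an exposure event is
        # emitted per flag, whether or not the user ever sees the feature.
        for flag, variant in self.assignments.items():
            self.event_log.append({"user": user_id, "flag": flag, "variant": variant})
        return dict(self.assignments)


event_log = []
client = FlagClient({"playlist-you-might-also-like": "treatment"}, event_log)

# On app start the flag resolves and the exposure event is recorded,
# even if the user never visits a playlist or scrolls to its bottom.
client.resolve_all(user_id=42)
print(event_log)
```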

Backend services

Feature flags in backend services are typically resolved when a request is made to the backend. This implies that the exposure event is generally more specific. Only users who interact with the feature powered by the backend resolve the flag and emit exposure events. For example, consider a feature flag that controls what recommendation algorithm to use for recommending tracks to add to playlists. The Recommended section is displayed at the bottom of user-created playlists. The feature flag is resolved when a user visits a playlist. This means that only users who visit a playlist emit an exposure event for the feature flag.

Contrasting the image above with the previous one, the exposure event for a backend application occurs at a later point in the user journey. The exposure event now happens when a user visits a playlist, as that's the point where the Recommended section at the bottom of the playlist needs to be populated. That still doesn't mean that the user scrolls to the bottom of the playlist.
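
By contrast, a backend service might resolve the flag inside the request handler, as in the hypothetical sketch below; the resolve helper, flag name, and track lists are placeholders, not Spotify's actual implementation.

```python
# A minimal sketch of on-request resolution in a backend service. The point is
# that the exposure event is only emitted when the feature is actually requested.

event_log = []

def resolve(user_id: int, flag: str) -> str:
    """Resolve a flag for a user and emit an exposure event (placeholder logic)."""
    variant = "treatment" if user_id % 2 == 0 else "control"
    event_log.append({"user": user_id, "flag": flag, "variant": variant})
    return variant

def recommend_tracks(user_id: int) -> list[str]:
    """Handle the request made when a user visits a playlist."""
    variant = resolve(user_id, "playlist-recommendation-algorithm")
    if variant == "treatment":
        return ["track_x", "track_y"]   # new algorithm (placeholder)
    return ["track_a", "track_b"]       # current algorithm (placeholder)

# Only users whose playlist visit reaches the backend emit an exposure event.
recommend_tracks(user_id=42)
print(event_log)
```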

What is trigger analysis?

Trigger analysis narrows the broad definition of exposure to enable more focused analyses of populations of interest. For example, you could redefine exposure as the first time a user visits the page where the change occurs after the feature flag has been resolved. That is, the exposure event is only triggered when the user interacts with part of the experience that the experiment modifies. This will also change the time of exposure to be the first time a user had a chance to experience the change.

The image above shows how the exposure and triggering events occur in sequence to define the two types of exposure. In the previous section, we described how the resolving behavior differs between different types of clients. For a mobile or web application, the resolving event often occurs on app start. The broad exposure definition that only relies on the exposure events can be further narrowed down by using trigger analysis. Two possible triggers for the playlist example are the events of visiting a playlist or scrolling to the bottom of it.
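
As a rough illustration of how a trigger-based exposure could be derived from logged events, the sketch below takes each user's first trigger event (here, a playlist visit) that occurs after the flag was resolved as the new exposure time; the data and column names are invented.

```python
import pandas as pd

# Hypothetical data: each user's first flag resolve, and trigger events
# (playlist visits) that mark actual interaction with the changed surface.
first_exposure = pd.DataFrame({
    "user_id": [1, 2, 3],
    "variant": ["treatment", "control", "treatment"],
    "resolved_at": pd.to_datetime(
        ["2024-05-01 10:00", "2024-05-01 11:15", "2024-05-01 12:00"]
    ),
})
triggers = pd.DataFrame({
    "user_id": [1, 1, 3],
    "visited_at": pd.to_datetime(
        ["2024-05-01 09:00", "2024-05-01 10:30", "2024-05-02 08:00"]
    ),
})

# Keep only trigger events that happen after the flag was resolved, then take
# the first one per user: this becomes the trigger-based exposure time.
candidates = triggers.merge(first_exposure, on="user_id")
candidates = candidates[candidates["visited_at"] >= candidates["resolved_at"]]
trigger_exposure = (
    candidates.sort_values("visited_at")
    .groupby("user_id", as_index=False)
    .first()
    .rename(columns={"visited_at": "exposed_at"})
)

# Users who never trigger (user 2 here) drop out of the analysis, and the
# exposure time shifts to the first real chance to experience the change.
print(trigger_exposure[["user_id", "variant", "exposed_at"]])
```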

More generally speaking, two main flavors of trigger analysis exist: user-level inclusion and event-level inclusion.

User-level inclusion

User-level inclusion refers to the exposure filter described earlier. This filter focuses the analysis only on users who interacted with the part of the experience that the experiment modified. If a user is logged as exposed under this narrower definition, metric results include all their measurements.
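
A minimal sketch of user-level inclusion, assuming a hypothetical table of trigger-exposed users: once a user is in that table, all of their measurements enter the metric.

```python
import pandas as pd

# Hypothetical inputs: trigger-exposed users (e.g. from the previous sketch)
# and per-user metric measurements.
trigger_exposed = pd.DataFrame({"user_id": [1, 3], "variant": ["treatment", "control"]})
measurements = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "streams": [5, 3, 2, 7],
})

# User-level inclusion: all measurements of trigger-exposed users are kept;
# non-triggered users (user 2 here) are dropped entirely.
included = measurements.merge(trigger_exposed, on="user_id")
print(included.groupby("variant")["streams"].sum())
```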

Event-level inclusion

Event-level inclusion is more granular than user-level inclusion. It only includes measurements of a user's behavior that happened when they interacted with the part of the experience that the experiment changed.

Event-level inclusion is particularly useful for experiments on recommendation algorithms. Counterfactual logging tracks whether a user's recommendation would've been different if they had been in the other treatment group. Consider a personalized search model as an example. We send the user's search query to both the control and treatment models and compare their responses. If the responses differ, we include the measurement of whether the user succeeded in finding what they were looking for in the metric results. If the search results are the same, we discard it. This way, only behavior that follows results affected by the treatment is included in the results. In highly optimized personalized systems, experiments often aim to make minor adjustments for certain types of queries. If the systems perform as expected, most query responses should be the same between the treatment and control models for a given query and user. By removing queries where the treatment effect is zero, we refine the comparison and increase the efficiency of the experiment. For more details on experimentation and methods for evaluating recommendation systems, see Schultzberg and Ottens (2024).
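
The sketch below illustrates the counterfactual-logging idea with two placeholder ranker functions; the real models and logging infrastructure are of course more involved. A measurement is only kept when the two rankers would have responded differently.

```python
# A minimal sketch of counterfactual logging for a search experiment; the
# ranker functions and their results are invented for illustration.

def control_ranker(query: str) -> list[str]:
    # Placeholder for the control model's ranked results.
    return ["track_a", "track_b", "track_c"]

def treatment_ranker(query: str) -> list[str]:
    # Placeholder for the treatment model: only reorders some queries.
    results = ["track_a", "track_b", "track_c"]
    if "workout" in query:
        results.reverse()
    return results

def log_counterfactual(user_id: int, variant: str, query: str,
                       success: bool, log: list) -> None:
    """Keep the measurement only if the two models would have responded differently."""
    if control_ranker(query) != treatment_ranker(query):
        # The user could have been affected by the treatment: keep the measurement.
        log.append({"user": user_id, "variant": variant, "query": query, "success": success})
    # Otherwise the responses are identical and the measurement is discarded.

log: list[dict] = []
log_counterfactual(1, "treatment", "workout playlist", success=True, log=log)   # kept
log_counterfactual(2, "control", "jazz classics", success=False, log=log)       # discarded
print(log)
```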

The pros and cons of trigger analysis

Removing unaffected users reduces dilution

The primary purpose of trigger analysis is to reduce dilution. Filtering out users (or events) that didn't experience the change can lead to large efficiency gains. The treatment effect for users who had no opportunity to experience the treatment is zero, and including them only introduces noise and dilutes the effect. Consider an experiment where only 10% of the users who resolved the experiment's feature flag actually had the opportunity to experience the change. Now imagine that among the users who experienced the change, the average treatment effect was 2 metric units. Because the users who didn't experience the change dilute the effect, the observed effect is 10% of 2 units, or only 0.2 units.
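
Written out, with notation introduced here (p for the share of exposed users who experience the change, τ for treatment effects), the diluted effect is simply the triggered effect scaled by the trigger share, under the assumption that the effect is exactly zero for everyone else:

\[
\tau_{\text{observed}} = p \cdot \tau_{\text{triggered}} + (1 - p) \cdot 0 = 0.10 \times 2 + 0.90 \times 0 = 0.2.
\]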

Filtering out users changes the interpretation

The downside of trigger analysis is that it changes the type of treatment effect that the experiment estimates. Consider an item at the bottom of a page, such as the footer of a webpage. If we only trigger the analysis for users who scroll all the way down, the experiment estimates a treatment effect for the population of users who scroll to the bottom. This group likely differs in important user characteristics, such as age and activity level, from users who don't scroll to the bottom. This implies that the effect estimated from this subset of users isn't an unbiased estimator of the average treatment effect of rolling the change out to everyone, because not everyone would interact with it. In other words, an estimated effect of 10% on trigger-exposed users doesn't translate to a 10% increase in the entire population. The filtering of users results in a change of estimand.

An estimated effect of 10% on trigger-exposed users doesn't translate to a 10% increase in the entire population.

In the previous section, we shared an example of how an effect of 2 units for 10% of the population translates to an effect of 0.2 units for the entire population. In such a scenario, a change of 0.2 units is the effect we should expect if we roll out the change to everyone. With trigger analysis, we'd estimate the treatment effect to be 2 units. This effect is the average treatment effect for the population of users who behave in such a way that they experience the change. The image above shows hypothetical distributions of the metric values in a case like this. With broad exposure, the large population of users who don't experience the change pulls the sample mean towards zero. With trigger-based exposure, those users are excluded, and the sample mean is centered on the mean of the users who experience the change.
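
A small simulation, with made-up metric values, shows the same thing numerically: the broad estimate lands near the diluted 0.2 units, while the trigger-based estimate recovers the 2-unit effect among triggered users.

```python
import numpy as np

rng = np.random.default_rng(7)

n = 200_000              # users per group under broad exposure
trigger_rate = 0.10      # share of exposed users who experience the change
effect_triggered = 2.0   # treatment effect among users who experience it

# Each user independently triggers (e.g. scrolls to the bottom) or not.
triggered_control = rng.random(n) < trigger_rate
triggered_treatment = rng.random(n) < trigger_rate

control = rng.normal(loc=10.0, scale=5.0, size=n)
treatment = rng.normal(loc=10.0, scale=5.0, size=n) + effect_triggered * triggered_treatment

# Broad exposure: everyone who resolved the flag is included, diluting the effect.
print("broad estimate:    ", round(treatment.mean() - control.mean(), 2))  # ~0.2

# Trigger-based exposure: only users who could have experienced the change.
print("triggered estimate:", round(treatment[triggered_treatment].mean()
                                   - control[triggered_control].mean(), 2))  # ~2.0
```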

The change in population leads to a more complex interpretation when making decisions based on experiment results that involve trigger analysis. While trigger analysis improves sensitivity and makes it easier to find effects that truly exist, a cost-benefit analysis can't take the estimated change at face value and naively scale it to the size of the entire user population. Deng and Hu (2015) give a more detailed explanation of how to translate effects estimated with trigger analysis to a larger population. For an introduction to different types of estimators and their estimands, an introductory book in econometrics, like Wooldridge (2019), is a great place to start.

Impacting the probability of triggering causes imbalanced groups

A common problem with trigger analysis is that the treatment experience impacts the inclusion criterion, making it more or less likely that a user triggers the inclusion event. If this happens, the observed traffic split between control and treatment won't follow the desired split, leading to a sample ratio mismatch. For example, consider again an experiment that changes an item at the bottom of a webpage. Imagine that users in the treatment group see a banner at the top of the page prompting them to scroll all the way down. In this scenario, users in the treatment group are more likely to scroll to the bottom, and the probability of being exposed isn't independent of treatment status. The problem is that users who'd normally not have scrolled down will now do so—but only in the treatment group. Because of this, the control and treatment groups are systematically different, violating the whole randomization-based foundation of the experiment.

The example with a banner that prompts users to scroll all the way down is clearly problematic. In practice, problems are usually much more subtle. In fact, at Spotify, the most common source of sample ratio mismatches in experiments is poorly designed exposure filters. A more subtle variation of the problem is that the new change increases the load time of the page. A slight increase in load time can cause impatient users to abandon the page and not scroll down, which they would have done had they not experienced the increased load time. In this case, instead of increasing the probability of being logged as exposed (scrolling down) in the treatment group, the change decreases the probability. To learn from experiments efficiently, it's crucial to think carefully about the definitions of exposure filters and trigger analyses. Instead of boosting your experiment's efficiency, a poorly designed trigger analysis can lead to a loss in what you can learn.
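
One practical safeguard is to routinely test the observed split among trigger-exposed users against the designed split, for example with a chi-square test; the counts below are made up for illustration.

```python
from scipy.stats import chisquare

# Hypothetical trigger-exposed counts after applying the exposure filter,
# in an experiment designed for a 50/50 split.
observed = [10_412, 9_588]               # control, treatment
expected = [sum(observed) / 2] * 2       # expected counts under the designed split

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {stat:.1f}, p = {p_value:.2e}")
# A very small p-value indicates a sample ratio mismatch: the trigger
# condition may well be affected by the treatment itself.
```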

Summary

From our experience, trigger analysis is an important feature for efficient experimentation. To summarize the main takeaways from this post:

  • Trigger analysis is especially valuable in already highly optimized products or in experiments that aim to improve parts of a product that only a small proportion of the total user base interacts with.
  • Translating treatment effects from trigger-based analyses to average treatment effects, which describe the expected change when everyone receives the new feature, is notoriously difficult for some types of metrics.
  • We recommend using results from trigger analyses primarily as strong indications of the direction of change rather than as precise estimates of the size of the improvement.
  • Trigger analysis helps find systematic improvements in already highly optimized systems, but at the cost of making it harder to weigh the business value of the improvement against the added cost of the new variant.

If you want to try Confidence yourself, sign up today. Now available in selected markets.