Lesson 3: Time considerations
In this lesson, you learn how time windows shape metric behavior and interpretation in experiments. You start with the simplest cumulative approach, then explore how to configure exposure offsets and aggregation windows, understand the differences between closed and cumulative windows, and learn how to choose the right time configuration for your use case.
Why time windows matter in metrics
Every metric aggregates measurements over some time window. The way you configure this time window fundamentally affects what you're measuring and when you can interpret your results. Consider a simple metric like "streams per user" in the first week after signup. This seemingly straightforward metric raises several questions: Does "first week" start immediately at signup or after some delay? Should you wait until all users complete their full week before including them in results, or include them as they progress through the week?
The answers to these questions determine not just what you measure, but how quickly you get results and how easy those results are to interpret. A metric that waits for all users to complete their measurement window gives you the cleanest interpretation but delays your results. A metric that includes users as soon as they're exposed gives you faster feedback but requires more careful interpretation.
The simplest approach: cumulative metrics
The most straightforward way to configure a metric is to include all measurements from all exposed entities as soon as they're exposed, with no upper limit on the time period. This cumulative approach with no window starts counting immediately at exposure and continues indefinitely.

With this approach, a user exposed on day 1 has 10 days of measurements on day 10, while a user exposed on day 10 has only 1 day of measurements. This means different users are always measured over different time periods: you're comparing users who have had very different amounts of time to exhibit the behavior you're measuring.
In cumulative metrics without a window, some entities are always measured over longer periods than others because not all entities are exposed at the same time. This makes these metrics hard to interpret. Only use these metrics when entities are short-lived.
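To make the unequal measurement periods concrete, here is a minimal sketch. It assumes days are numbered from 1 and that the exposure day itself counts as the first day of measurement. The function name `days_observed` is hypothetical and is not part of Confidence.

```python
def days_observed(exposure_day: int, current_day: int) -> int:
    """Days of measurements collected so far with no window.

    Assumption: the exposure day counts as the first measured day.
    """
    if current_day < exposure_day:
        return 0  # not exposed yet, so nothing has been measured
    return current_day - exposure_day + 1


# On day 10 of the experiment, compare users exposed at different times.
for exposure_day in (1, 5, 10):
    print(f"exposed day {exposure_day}: {days_observed(exposure_day, 10)} day(s) measured")
```

The earlier a user is exposed, the more days they contribute, which is why averages over such users mix very different measurement periods.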
This approach is primarily useful when you're working with short-lived entities like cookie-based users who may not return for multiple sessions. In these cases, trying to follow the same entity over a multi-day window doesn't make sense because the entity itself may not exist that long.
An e-commerce site runs experiments on anonymous visitors identified by cookies. Many visitors never return after their first session. A metric measuring "average order value" configured with no window makes sense here because:
- Following individual cookies over multiple days often means measuring nothing (they don't return)
- The business cares about the total order value from all visitors, regardless of when they were exposed
- The short-lived nature of the entity makes time windows less meaningful
For persistent entities like logged-in users, this approach creates problems. Users exposed early in the experiment always have more time to contribute measurements than users exposed later, making fair comparisons difficult. This is where time windows become essential.
How time windows work
Time windows solve the problem of unequal measurement periods by defining a fixed time period for each user relative to their exposure. Window-based metrics use two key parameters: the exposure offset and the aggregation window.
In Confidence, you configure the exposure offset and aggregation window when defining a metric.

The exposure offset defines how long to wait after a user's exposure before starting to collect their measurements. Think of it as a waiting period. If you set an offset of 7 days, the first week of behavior is ignored and measurements only start from day 8 onward.
The aggregation window defines the length of time over which you collect measurements for each user. A 7-day aggregation window means you're collecting data for exactly 7 days (relative to each user's exposure time plus any offset). The window is always relative to when each individual user was exposed, not to calendar time.

The figure illustrates how time windows work to aggregate values within entities. The horizontal bar shows when the user was first exposed to the experiment. The dashed part of the box is the exposure offset, and the solid box is the aggregation window. The metric calculation includes the measurements that fall within the aggregation window.
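The relationship between exposure, offset, and window can be sketched in a few lines. This is an illustration only, assuming days are numbered from 1 and windows are expressed in whole days. `WindowConfig` and `window_bounds` are hypothetical names, not the Confidence API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowConfig:
    offset_days: int  # waiting period after exposure
    window_days: int  # length of the aggregation window


def window_bounds(exposure_day: int, config: WindowConfig) -> tuple[int, int]:
    """Return the first and last day (inclusive) of a user's aggregation window."""
    start = exposure_day + config.offset_days
    end = start + config.window_days - 1
    return start, end


second_week = WindowConfig(offset_days=7, window_days=7)
print(window_bounds(1, second_week))  # user exposed on day 1
print(window_bounds(3, second_week))  # user exposed on day 3
```

Because the bounds are computed from each user's own exposure day, two users exposed on different calendar days get windows of the same length that are shifted in time.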
When to use exposure offset: novelty and change-aversion effects
Sometimes the effect you want to measure isn't the immediate response to a change but how users behave after they've had time to adapt to it. This is where exposure offset becomes valuable.
Two opposing time-dependent effects can distort early measurements. The first is novelty effects: users encountering something new temporarily change their behavior—a redesigned interface might see increased engagement simply because users are exploring the change, even if the long-term effect is neutral or negative.
The second is change aversion (also called primacy effects in the experimentation literature): users accustomed to the old experience may initially resist or underuse the new one, suppressing the apparent effect until they adapt. Novelty and change aversion are among the most common sources of short-term metric distortion in online experiments.
These two forces work in opposite directions. Novelty inflates early results; change aversion suppresses them. Either can mislead you about the true long-term impact of a change.
A music streaming service redesigns its playlist creation flow and configures two metrics:
- Immediate response: 0-day offset, 7-day window—measures first week behavior
- Sustained impact: 7-day offset, 7-day window—measures second week behavior after the initial adjustment period
The immediate response metric shows a 15% increase in playlist creation. The sustained impact metric shows only a 3% increase. This suggests most of the initial lift was novelty. Had the redesign created friction for habitual users, the immediate metric might instead have shown a decrease that recovered as users adapted—an example of change aversion.
Using an offset lets you deliberately measure behavior after the initial adjustment period. This is particularly useful when your hypothesis is about sustained behavior change rather than immediate response.
When you suspect novelty or change-aversion effects, configure multiple metrics with different offsets: one capturing immediate response and one capturing sustained behavior after the initial period. Comparing the two gives you a more complete picture of your change's true impact.
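As a rough sketch of how two offsets split the same user's activity into an immediate and a sustained measurement, consider the following. The event data is made up for illustration and `aggregate_in_window` is a hypothetical helper. Days are numbered from 1 and the exposure day is assumed to count as the first day.

```python
def aggregate_in_window(
    events: list[tuple[int, int]],
    exposure_day: int,
    offset_days: int,
    window_days: int,
) -> int:
    """Sum event values whose day falls inside the user's aggregation window."""
    start = exposure_day + offset_days
    end = start + window_days - 1
    return sum(value for day, value in events if start <= day <= end)


# Hypothetical (day, playlists_created) events for one user exposed on day 1.
events = [(1, 3), (2, 2), (8, 1), (15, 4)]

immediate = aggregate_in_window(events, exposure_day=1, offset_days=0, window_days=7)
sustained = aggregate_in_window(events, exposure_day=1, offset_days=7, window_days=7)
print(immediate, sustained)
```

For this made-up user, most activity falls in the first week, and the event on day 15 lies outside both windows, so neither metric counts it.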
Two ways to use windows
Now that you understand what a window is—a fixed time period for each user defined by offset and duration—the question becomes: when should users appear in your metric results? Two approaches exist for window-based metrics, each representing a different trade-off between interpretability and speed of results.
Closed windows: at the end of a window
Metrics that include entities at the end of a window wait until each user completes their full measurement window before including them in the results. This is the most rigorous approach because every user included in the metric results has been measured over exactly the same time period relative to their exposure.

The illustration shows how this works over time. On any given day of your experiment, the metric results include only users who have completed their full aggregation window. Users exposed on day 1 appear in the results after exposure offset + aggregation window days. Users exposed on day 2 appear one day later, and so on.
This approach has an important implication for when you see results. If you create a metric measuring behavior during the second week after exposure (offset of 7 days, window of 7 days), you won't see any results until 14 days after launch. Before that, no user has completed their second week yet. It's also worth noting that closed windows only include users exposed early enough to complete the full window—users exposed near the end of the experiment are excluded. For long windows (30+ days), this means your results increasingly reflect earlier-exposed users, which can introduce bias if user composition shifts over the course of the experiment.
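The exclusion of late-exposed users can be quantified with a small sketch. It assumes days are numbered from 1, that the experiment is analyzed on its last day, and, for the share calculation only, that the same number of users is exposed each day. Both function names are hypothetical.

```python
def last_eligible_exposure_day(
    experiment_days: int, offset_days: int, window_days: int
) -> int:
    """Latest exposure day whose full window still fits inside the experiment.

    Returns 0 when no user can complete the window.
    """
    return max(0, experiment_days - offset_days - window_days + 1)


def eligible_share(experiment_days: int, offset_days: int, window_days: int) -> float:
    """Share of exposure days that end up in a closed-window metric.

    Assumption: exposure is spread evenly over the experiment.
    """
    last_day = last_eligible_exposure_day(experiment_days, offset_days, window_days)
    return last_day / experiment_days


# 28-day experiment measuring the second week after exposure.
print(last_eligible_exposure_day(28, offset_days=7, window_days=7))
print(round(eligible_share(28, offset_days=7, window_days=7), 2))
```

Under these assumptions, the longer the offset plus window is relative to the experiment, the smaller the share of exposed users that the final result describes.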
A streaming platform tests a new recommendation algorithm. They create a metric measuring "streams in the second week after exposure" configured as:
- Exposure offset: 7 days
- Aggregation window: 7 days
- Include users: At the end of a window
This metric shows no data for the first 13 days. On day 14, the first exposed users complete their second week and appear in the results. Each subsequent day adds more users who have completed their full second week.
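The inclusion rule for this example can be sketched as a simple predicate. This is an illustration under the lesson's day numbering (days start at 1, and a user exposed on day 1 with a 7-day offset and 7-day window completes the window on day 14). The names are hypothetical, not the Confidence API.

```python
def included_at_end_of_window(
    exposure_day: int, current_day: int, offset_days: int, window_days: int
) -> bool:
    """True once the user's full aggregation window has passed."""
    window_end = exposure_day + offset_days + window_days - 1
    return current_day >= window_end


def included_users(exposure_days: list[int], current_day: int) -> list[int]:
    """Exposure days of the users that appear in the results on current_day."""
    return [
        day
        for day in exposure_days
        if included_at_end_of_window(day, current_day, offset_days=7, window_days=7)
    ]


exposure_days = [1, 2, 3, 4, 5]  # one hypothetical user exposed per day
for current_day in (13, 14, 16):
    print(current_day, included_users(exposure_days, current_day))
```

Every user that passes the predicate contributes exactly seven days of data, which is what makes the closed-window result straightforward to interpret.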
Cumulative windows: include users during a window
Metrics that include entities cumulatively during a window include users as soon as they enter their measurement window, even if they haven't completed it yet. This gives you earlier results but requires more careful interpretation because different users in your metric results have been measured over different amounts of time.

With this approach, you start seeing results as soon as the first users reach the start of the window (after the exposure offset). A user exposed on day 1 appears in the results on day 8 (after a 7-day offset), even though they've only been measured for one day of their 7-day window. The next day, that same user has two days of data in the window, and so on.
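This means the metric results change their meaning over time. While users are still entering or progressing through their windows, the metric represents an average of users measured over different fractions of the full window. A given user has their full window measured only once exposure offset + aggregation window days have passed since their own exposure, so the results contain partially measured users for as long as recently exposed users keep entering them.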
Using the same recommendation algorithm example with cumulative inclusion:
- Exposure offset: 7 days
- Aggregation window: 7 days
- Include users: Cumulatively during a window
This metric shows first results on day 8, when the earliest exposed users enter their second week. On day 8, these users have only 1 day of second-week data. On day 9, they have 2 days of data. By day 14, the earliest users have their full 7 days of second-week data, but newer users still have partial windows.
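The progression described above can be sketched by counting how many days of in-window data each user has on a given day. The sketch assumes days are numbered from 1 and uses a hypothetical helper name, `days_in_window`.

```python
def days_in_window(
    exposure_day: int, current_day: int, offset_days: int, window_days: int
) -> int:
    """Days of in-window data a user has contributed by current_day.

    Returns 0 before the window opens and is capped at window_days.
    """
    window_start = exposure_day + offset_days
    if current_day < window_start:
        return 0
    return min(window_days, current_day - window_start + 1)


# On day 14, users exposed on different days are at different stages.
for exposure_day in (1, 3, 5, 7):
    measured = days_in_window(exposure_day, 14, offset_days=7, window_days=7)
    print(f"exposed day {exposure_day}: {measured} of 7 window days measured")
```

With cumulative inclusion, every user with at least one in-window day is part of the result, so the printed mix of partial and complete windows is what the metric averages over on that day.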
The right time configuration
The decision of which time configuration to use follows a logical progression: first consider your entity type, then decide whether to use windows, and finally choose how to include users in results.

Step 1: consider your entity type
For short-lived entities like cookies or anonymous sessions that rarely return, use cumulative metrics with no window. Time windows are impractical when entities don't persist long enough to complete them. Accept that different measurement periods are unavoidable in this case.
For persistent entities like logged-in users, accounts, or devices, use window-based metrics for fair comparisons. Windows ensure each user is measured over the same time period, regardless of when they were exposed. Continue to step 2 to choose your window approach.
Step 2: choose your window approach (for persistent entities)
| Approach | When to Use | Trade-off |
|---|---|---|
| Closed windows (at the end of window) | Interpretability is paramount and you can wait for complete data. Best for success metrics when making final decisions. | Slower results, but cleanest interpretation—every user measured over exactly the same period. |
| Cumulative windows (during window) | You need earlier results for monitoring. Common for guardrail metrics or when you want to monitor success metrics before they reach full maturity. | Faster results, but more complex interpretation—users at different stages of their windows. |
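Steps 1 and 2 can be summarized as a small decision helper. This is a simplified sketch of the guidance in this lesson, not a Confidence feature. The function name, its parameters, and the returned labels are chosen for illustration.

```python
def choose_time_configuration(persistent_entity: bool, need_early_results: bool) -> str:
    """Map the two questions from steps 1 and 2 to a time configuration."""
    if not persistent_entity:
        # Short-lived entities such as cookies rarely complete a window.
        return "cumulative, no window"
    if need_early_results:
        # Monitoring use cases, for example guardrail metrics.
        return "cumulatively during a window"
    # Final decisions on success metrics favor interpretability.
    return "at the end of a window"


print(choose_time_configuration(persistent_entity=False, need_early_results=True))
print(choose_time_configuration(persistent_entity=True, need_early_results=True))
print(choose_time_configuration(persistent_entity=True, need_early_results=False))
```

Real decisions often combine approaches, for example monitoring a success metric cumulatively while reserving the final decision for the closed-window version.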
Step 3: configure offset and window duration
Exposure offset controls when measurement starts. Use 0 days to measure immediate response, or 7+ days to skip novelty or change-aversion effects and measure sustained behavior. Consider using multiple metrics with different offsets to understand both short-term and long-term impact.
Window duration should match the natural cycle of your user behavior and how long it takes for the effect to manifest. Make sure your window is compatible with your fact data frequency—don't use hourly windows if your facts are generated daily.
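One possible reading of "compatible" is that the window is at least as long as the fact data frequency and a whole multiple of it. The check below sketches that interpretation. It is an assumption for illustration, not how Confidence validates metrics, and `is_window_compatible` is a hypothetical name.

```python
from datetime import timedelta


def is_window_compatible(window: timedelta, fact_frequency: timedelta) -> bool:
    """True if the window spans a whole number of fact data periods."""
    if fact_frequency <= timedelta(0):
        raise ValueError("fact_frequency must be positive")
    return window >= fact_frequency and window % fact_frequency == timedelta(0)


daily_facts = timedelta(days=1)
print(is_window_compatible(timedelta(days=7), daily_facts))   # weekly window, daily facts
print(is_window_compatible(timedelta(hours=1), daily_facts))  # hourly window, daily facts
```

An hourly window on daily facts fails the check because no window boundary can fall between two daily data points.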
What does the exposure offset parameter control in a metric?
A metric configured to include users 'at the end of a window' with a 7-day offset and 7-day window will show first results on which day after experiment launch?
When should you use metrics with no window (cumulative without window)?
What is the main trade-off between 'at the end of a window' and 'cumulatively during a window'?
Notes for nerds
Cumulative metrics are tricky both from an interpretation perspective and an inference perspective. Spotify's engineering team has dug deep into both topics:
- Bringing Sequential Testing to Experiments with Longitudinal Data (Part 1): The Peeking Problem 2.0
- Bringing Sequential Testing to Experiments with Longitudinal Data (Part 2): Sequential Testing
- It's About Time: What A/B Test Metrics Estimate
Survivorship bias in cookie-based experiments. The recommendation to use cumulative no-window metrics for short-lived cookie entities has a hidden cost: users who don't return after their first session are excluded from your analysis entirely. These non-returning users are precisely those most likely to have had a poor experience. By measuring only returning users, your metric reflects a survivor population—the happy path—rather than the full distribution of user outcomes. Keep this in mind when interpreting results from anonymous-visitor experiments.
Variance non-stationarity. Metric variance is not constant over the life of an experiment. In the early days, variance is typically elevated—novelty effects, an evolving mix of new versus returning users, and exploratory user behavior all contribute to higher noise. Variance tends to stabilize as the experiment matures. This matters in practice: sample size estimates based on historical variance may be optimistic for early experiment phases, and sequential testing methods that assume fixed variance need to account for this non-stationarity when applied to longitudinal data.