Alpha spending is a method for distributing a fixed significance budget (alpha, typically 5%) across multiple interim analyses in a group sequential test. Instead of using the full 5% at each look and inflating false positive rates, an alpha spending function specifies how much of the total alpha to "spend" at each interim analysis, so the cumulative false positive probability never exceeds the planned level.
The concept solves a concrete problem. Teams want to look at experimental results before the experiment is done, but the peeking problem means that doing so with standard fixed-horizon tests inflates the false positive rate. Alpha spending makes interim looks valid by tightening the significance threshold at each look in a principled way, so the total type I error rate across all looks stays at 5%.
How does an alpha spending function work?
An alpha spending function is a non-decreasing function that maps the information fraction (the proportion of planned data observed so far) to the cumulative alpha spent up to that point. At information fraction 0, no alpha is spent. At information fraction 1 (the final analysis), all alpha has been spent.
Between those endpoints, the shape of the function determines how aggressively the test can reject the null hypothesis at early looks versus later ones.
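Formally, a spending function f maps the information fraction t ∈ [0, 1] to the cumulative alpha spent, with f(0) = 0, f(1) = α, and f non-decreasing. The alpha available at the k-th look is the increment f(t_k) − f(t_{k−1}), which is then converted into a rejection boundary for that look.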
Two well-known spending functions illustrate the range.
O'Brien-Fleming spending allocates very little alpha early and concentrates most of it at the final analysis. At 50% of the planned sample, an O'Brien-Fleming boundary might require a z-statistic of ~2.8 to reject (compared to ~1.96 for a single-look test). This means early stopping requires very strong evidence, which rarely happens for small or moderate effects. The benefit: if the experiment runs to completion, the final analysis boundary is close to the fixed-horizon threshold, so you lose almost no power.
Pocock spending distributes alpha more evenly across looks. The boundaries are roughly equal at each interim analysis, making early stopping more likely. The cost is a stricter final boundary, which reduces power if the experiment runs to full sample size.
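To make the shapes concrete, here is a minimal sketch of both functions in their common Lan-DeMets spending forms (platforms may use different parameterizations; the function names here are illustrative):

```python
import numpy as np
from scipy import stats

ALPHA = 0.05  # total two-sided significance budget

def obf_spend(t, alpha=ALPHA):
    """Lan-DeMets O'Brien-Fleming-type spending: almost nothing early."""
    return 2 * (1 - stats.norm.cdf(stats.norm.ppf(1 - alpha / 2) / np.sqrt(t)))

def pocock_spend(t, alpha=ALPHA):
    """Lan-DeMets Pocock-type spending: roughly even across looks."""
    return alpha * np.log(1 + (np.e - 1) * t)

for t in (0.25, 0.50, 0.75, 1.00):
    print(f"t={t:.2f}  OBF spent: {obf_spend(t):.5f}  "
          f"Pocock spent: {pocock_spend(t):.5f}")
```

By the halfway point the O'Brien-Fleming-type function has spent roughly 0.006 of the 0.05 budget, which is why its interim boundary sits near z ≈ 2.8, while the Pocock-type function has already spent about 0.031.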
Most product experimentation scenarios favor O'Brien-Fleming-style spending. At Spotify, where the majority of the 10,000+ annual experiments run close to their planned duration, preserving power at the final analysis is more valuable than maximizing the probability of early stopping.
Why not just divide alpha equally across looks?
Dividing alpha equally (e.g., testing at 1% significance at each of five looks for a total of 5%) is a valid but crude approach. It's the Bonferroni correction applied to sequential looks. It controls the false positive rate, but it's overly conservative because it doesn't account for the correlation between test statistics at successive looks (each analysis includes all previously observed data).
Alpha spending functions exploit this correlation. Because the test statistic at look 3 contains all the information from looks 1 and 2, the probability of a false positive at look 3, given that you didn't reject at looks 1 and 2, is smaller than if the looks were independent. Spending functions use this structure to allocate alpha more efficiently than Bonferroni, resulting in higher power.
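A short simulation makes the conservatism visible (a sketch assuming equally sized, independent batches of data between looks): per-look Bonferroni thresholds leave part of the 5% budget unspent, while the classic Pocock constant boundary of about 2.413 for five two-sided looks spends it fully.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_sims, n_looks = 200_000, 5

# Under the null, simulate the z-statistic at each of 5 equally spaced
# looks: Z_k is the standardized cumulative sum, so corr(Z_j, Z_k) = sqrt(j/k).
increments = rng.standard_normal((n_sims, n_looks))
z = np.cumsum(increments, axis=1) / np.sqrt(np.arange(1, n_looks + 1))

bonferroni = stats.norm.ppf(1 - 0.05 / (2 * n_looks))  # ≈ 2.576 per look
pocock = 2.413  # classic Pocock constant for 5 two-sided looks, alpha = 0.05

print("Bonferroni type I:", (np.abs(z) > bonferroni).any(axis=1).mean())
print("Pocock type I:    ", (np.abs(z) > pocock).any(axis=1).mean())
```

The Bonferroni rule lands visibly below 5%: alpha left on the table that a boundary derived from the joint distribution converts into lower thresholds and higher power.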
How does alpha spending interact with the information fraction?
Alpha spending functions are defined in terms of information fractions, not calendar time or raw sample counts. This design choice has a practical benefit: if the actual interim analyses happen at slightly different information fractions than originally planned (because enrollment was faster or slower than expected), the spending function still produces correct boundaries.
For example, if you planned analyses at information fractions 0.25, 0.5, 0.75, and 1.0, but your first look actually happens at 0.3 because enrollment was faster than expected, the spending function evaluates at 0.3 and returns the appropriate boundary. No adjustment is needed.
This flexibility is important in practice. Experiments rarely accumulate data at exactly the predicted rate. Confidence computes the actual information fraction at each analysis point, accounting for variance reduction and the realized sample sizes, then applies the spending function to determine the correct boundary.
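As an illustration (a minimal sketch with a hypothetical obf_spend helper, not Confidence's implementation), evaluating an O'Brien-Fleming-type spending function at the realized fraction 0.3 yields the first-look boundary directly:

```python
from scipy import stats

ALPHA = 0.05  # total two-sided budget

def obf_spend(t, alpha=ALPHA):
    """Lan-DeMets O'Brien-Fleming-type spending function."""
    return 2 * (1 - stats.norm.cdf(stats.norm.ppf(1 - alpha / 2) / t**0.5))

# Planned first look at t = 0.25, but data arrived faster: t = 0.30.
# The spending function is simply evaluated at the realized fraction.
spent = obf_spend(0.30)
first_boundary = stats.norm.ppf(1 - spent / 2)
print(f"alpha spent by t=0.30: {spent:.5f}, "
      f"first-look |z| boundary: {first_boundary:.2f}")

# Later looks spend the increment f(t_k) - f(t_{k-1}) and require the joint
# distribution of the correlated z-statistics to solve for their boundaries.
```

Only the first-look boundary follows directly from the spent alpha; subsequent boundaries must condition on not having rejected at earlier looks.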
What are spending function boundaries for guardrail metrics?
In Confidence's decision framework, success metrics and guardrail metrics serve different roles. Success metrics test whether the treatment improved the target outcome. Guardrail metrics test whether the treatment harmed something you want to protect.
The sequential testing approach, including alpha spending, applies to both. But the direction of the test differs: success metrics use a superiority test (is the treatment better?), while guardrail metrics use an inferiority test (is the treatment worse?). The spending function and boundaries are computed separately for each metric role, reflecting the risk-aware decision framework that distinguishes which error rates must be controlled for each role.
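Schematically, the two roles apply the same kind of boundary in opposite directions. This sketch uses a hypothetical sequential_decision helper and a sign convention (positive z means the treatment looks better); it is illustrative, not Confidence's actual decision logic:

```python
def sequential_decision(z, boundary, role):
    """Hypothetical helper showing the direction of each one-sided test."""
    if role == "success":    # superiority: is the treatment better?
        return "significant improvement" if z > boundary else "keep collecting"
    if role == "guardrail":  # inferiority: is the treatment worse?
        return "harm detected" if z < -boundary else "no detected harm"
    raise ValueError(f"unknown role: {role}")

interim_boundary = 2.8  # e.g., from an O'Brien-Fleming-type spending function

print(sequential_decision(3.1, interim_boundary, "success"))     # stop for success
print(sequential_decision(-3.0, interim_boundary, "guardrail"))  # stop for harm
```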