Lesson 2: The Spotlight
In this lesson, you learn what each Spotlight recommendation means, what drives each one, and how the Spotlight synthesizes many metric results into a single actionable signal.
The Spotlight is the first thing you see on the results page. It gives a recommendation for each treatment variant: Ship, Continue, End, or Abort. Each recommendation has a precise meaning, and knowing what drives each one lets you read the state of your experiment as soon as you open it.
The four recommendations
Ship
Ship means Confidence has found sufficient evidence that the treatment variant is worth rolling out. Three conditions must all be true for a Ship recommendation:
- At least one success metric has improved significantly in the intended direction.
- All guardrail metrics meet their tolerance levels: either significantly non-inferior (if you use non-inferiority margins), or showing no evidence of deterioration (if you do not).
- No health check has failed. There is no sample ratio mismatch (SRM) and no evidence of metric deterioration.
Ship is a positive signal, but it is a recommendation, not a mandate. You should still exercise judgment. Consider whether the effect size is practically meaningful, whether the experiment ran long enough to rule out novelty effects, and whether the results make sense given your understanding of the product.
Continue
Continue means there is not yet enough evidence to ship, and no reason to stop. This is the most common recommendation for an experiment that is on track and has not yet reached its required sample size. It means:
- No success metric has significantly improved yet.
- No health checks have failed.
- No metrics have deteriorated.
The right response to Continue is to let the experiment run until it reaches the required sample size. Stopping an experiment early because results are not yet significant is a common mistake: it produces biased, unreliable results.
End
End appears when an experiment has collected enough data to be considered powered for all success metrics, but none of those metrics has shown a significant improvement. This is different from Continue.
Continue means "we do not know yet." End means "we have collected enough data, and there is no signal."
An End recommendation is a genuine null result. The treatment variant does not appear to improve the metrics you care about, and you have enough data to be fairly confident in that conclusion. The right response is to stop the experiment and treat this as a real finding: the treatment variant did not work as hypothesized.
A null result is a valuable result. Knowing that a change did not improve metrics saves engineering and design resources that would otherwise go into shipping and maintaining a change that does not help users. Do not dismiss null results.
Abort
Abort means something has gone wrong that makes continuing the experiment harmful or pointless. This recommendation appears when:
- A health check has failed (most commonly a sample ratio mismatch, or SRM), which means the results cannot be trusted.
- One or more metrics have deteriorated significantly, meaning there is statistical evidence the treatment variant is harming something you care about.
When you see Abort, stop the experiment. If the cause is an SRM, investigate the exposure logic before relaunching. If the cause is metric deterioration, the treatment variant may be harmful and should not be shipped.
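To make the SRM health check concrete, here is a minimal sketch of one common way to test for a sample ratio mismatch: a chi-square goodness-of-fit test on the observed group sizes. This is an illustration only, not necessarily the test Confidence runs. The function names, the 50/50 expected split, and the 0.001 threshold are all assumptions.

```python
import math


def srm_p_value(control_count: int, treatment_count: int,
                expected_treatment_share: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value for a two-group split.

    Hypothetical helper: compares the observed group sizes with the
    sizes expected from the configured allocation.
    """
    total = control_count + treatment_count
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = ((control_count - expected_control) ** 2 / expected_control
            + (treatment_count - expected_treatment) ** 2 / expected_treatment)
    # With two groups there is 1 degree of freedom, and the chi-square
    # survival function reduces to erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


def has_srm(control_count: int, treatment_count: int,
            expected_treatment_share: float = 0.5,
            alpha: float = 0.001) -> bool:
    """Flag a mismatch when the p-value falls below alpha (assumed threshold)."""
    p = srm_p_value(control_count, treatment_count, expected_treatment_share)
    return p < alpha
```

Under these assumptions, a split of 50,000 versus 48,000 users with an intended 50/50 allocation would be flagged, while 5,050 versus 4,950 would not.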
After the experiment ends
After you end an experiment, the Continue and End recommendations merge into a single Don't ship label. The Abort recommendation may also appear if there was a health check failure. The Ship recommendation remains if the evidence for shipping was already established before the experiment ended.
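The relabeling described above can be sketched as a small mapping. The function name is hypothetical, and the sketch assumes that Ship and Abort carry over unchanged once the experiment has ended.

```python
def label_after_end(recommendation: str) -> str:
    """Map a running-experiment recommendation to its post-experiment label.

    Assumption: only Continue and End are relabeled; Ship and Abort are
    passed through as they are.
    """
    if recommendation in ("Continue", "End"):
        return "Don't ship"
    return recommendation
```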
The full picture
The Spotlight is a synthesis. It takes the outcomes of all success metrics, all guardrail metrics, and all health checks, and compresses them into one recommendation per treatment variant. Understanding the individual components (significance, confidence interval width, health checks, evaluation strategy) lets you look at a Spotlight recommendation and trace it back to its causes.
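One way to picture this synthesis is as a decision rule over a handful of summary flags. The sketch below follows the conditions described in this lesson. It is not Confidence's actual implementation, and the class, field, and function names are hypothetical. Combinations the lesson does not spell out, such as a success metric that has improved while guardrails are not yet within tolerance, fall through to Continue here, which is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Recommendation(Enum):
    SHIP = "Ship"
    CONTINUE = "Continue"
    END = "End"
    ABORT = "Abort"


@dataclass
class VariantState:
    """Hypothetical summary of one treatment variant's results."""
    health_checks_passed: bool             # False if, for example, an SRM was detected
    any_metric_deteriorated: bool          # significant deterioration in any metric
    any_success_metric_improved: bool      # significant improvement, intended direction
    guardrails_within_tolerance: bool      # all guardrails meet their tolerance levels
    powered_for_all_success_metrics: bool  # required sample size reached


def recommend(state: VariantState) -> Recommendation:
    # Abort: the results cannot be trusted, or the variant is causing harm.
    if not state.health_checks_passed or state.any_metric_deteriorated:
        return Recommendation.ABORT
    # Ship: at least one success metric improved and every guardrail holds.
    if state.any_success_metric_improved and state.guardrails_within_tolerance:
        return Recommendation.SHIP
    # End: enough data has been collected, but no success metric moved.
    if state.powered_for_all_success_metrics and not state.any_success_metric_improved:
        return Recommendation.END
    # Continue: everything else (an assumption for cases the lesson leaves open).
    return Recommendation.CONTINUE
```

Checking Abort first reflects the point that a failed health check or a deteriorated metric overrides any positive signal from the success metrics.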
As an example, suppose an experiment shows a Continue recommendation in the Spotlight, with one success metric showing "Not significant" (+1.4%) and all guardrail metrics showing "Has not deteriorated." The experiment is halfway through its planned duration.
The correct interpretation: the experiment is healthy and on track. The success metric has not yet moved enough to be statistically significant, but "not significant" at the halfway point is expected. Continue running until the required sample size is reached before drawing conclusions.
The Spotlight shows 'End' for a running experiment. What is the most accurate interpretation?
Which of the following must be true for Confidence to recommend 'Ship'?
Notes for nerds
The Spotlight's synthesis of many metric results into a single recommendation is a non-trivial statistical problem. When you have many metrics, each tested at some significance level, the probability that at least one shows a spurious significant result grows quickly. Spotify has published research on how to make principled, risk-aware product decisions under exactly this kind of multi-metric setting. The underlying ideas are described in a paper by Schultzberg, Ankargren, and Frånberg (2024). You can read the engineering post at engineering.atspotify.com.
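To see how quickly that probability grows, consider a simplified case in which every metric is tested independently at the same significance level and none has a true effect. The chance of at least one false positive is then 1 - (1 - alpha)^m for m metrics. The sketch below computes it. The independence assumption and the function name are illustrative, and this is not the method from the paper.

```python
def prob_any_false_positive(num_metrics: int, alpha: float = 0.05) -> float:
    """Probability that at least one of num_metrics independent tests,
    each run at level alpha, is spuriously significant when no metric
    has a true effect."""
    return 1 - (1 - alpha) ** num_metrics
```

Under these assumptions, 10 metrics tested at alpha = 0.05 give roughly a 40% chance of at least one spurious significant result, compared with 5% for a single metric.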