Lesson 2: The Spotlight

The Spotlight is the first thing you see on the results page. It gives a recommendation for each treatment variant: Ship, Continue, End, or Abort. Each recommendation has a precise meaning, and knowing what drives each one lets you read the state of your experiment as soon as you open it.

The four recommendations

Ship

Ship means Confidence has found sufficient evidence that the treatment variant is worth rolling out. Three conditions must all be true for a Ship recommendation:

  1. At least one success metric has improved significantly in the intended direction.
  2. All guardrail metrics meet their tolerance levels: either significantly non-inferior (if you use non-inferiority margins), or showing no evidence of deterioration (if you do not).
  3. No health check has failed. There is no sample ratio mismatch (SRM) and no evidence of metric deterioration.
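
The three conditions combine with a logical AND, where the first is an "any" over the success metrics and the other two are an "all" over guardrails and health checks. The sketch below illustrates that shape only. The VariantSummary class, its field names, and qualifies_for_ship are hypothetical names made up for this example and are not part of Confidence.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VariantSummary:
    """Hypothetical per-variant summary; not a Confidence data structure."""
    success_improved: List[bool]      # one flag per success metric
    guardrails_ok: List[bool]         # one flag per guardrail metric
    health_checks_passed: List[bool]  # one flag per health check


def qualifies_for_ship(v: VariantSummary) -> bool:
    # Condition 1: at least one success metric improved significantly.
    has_win = any(v.success_improved)
    # Condition 2: every guardrail metric meets its tolerance level.
    guardrails_hold = all(v.guardrails_ok)
    # Condition 3: no health check has failed.
    healthy = all(v.health_checks_passed)
    return has_win and guardrails_hold and healthy


# One success metric improved, all guardrails and health checks are fine.
example = VariantSummary([True, False], [True, True], [True])
print(qualifies_for_ship(example))  # True
```

Note that a single failing guardrail or health check is enough to block Ship, no matter how many success metrics have improved.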

Ship is a positive signal, but it is a recommendation, not a mandate. You should still exercise judgment. Consider whether the effect size is practically meaningful, whether the experiment ran long enough to rule out novelty effects, and whether the results make sense given your understanding of the product.

Continue

Continue means there is not yet enough evidence to ship, and no reason to stop. This is the most common recommendation for an experiment that is on track. It means:

  • No success metric has significantly improved yet.
  • No health checks have failed.
  • No metrics have deteriorated.

The right response to Continue is to let the experiment run until it reaches the required sample size. Stopping an experiment early because results are not yet significant is a common mistake: it produces biased, unreliable results.

End

End appears when an experiment has collected enough data to be considered powered for all success metrics, but none of those metrics has shown a significant improvement. This is different from Continue.

Continue means "we do not know yet." End means "we have collected enough data, and there is no signal."

An End recommendation is a genuine null result. The treatment variant does not appear to improve the metrics you care about, and you have enough data to be fairly confident in that conclusion. The right response is to stop the experiment and treat this as a real finding: the treatment variant did not work as hypothesized.

Abort

Abort means something has gone wrong that makes continuing the experiment harmful or pointless. This recommendation appears when:

  • A health check has failed (most commonly a sample ratio mismatch), which means the results cannot be trusted.
  • One or more metrics have deteriorated significantly, meaning there is statistical evidence the treatment variant is harming something you care about.

When you see Abort, stop the experiment. If the cause is an SRM, investigate the exposure logic before relaunching. If the cause is metric deterioration, the treatment variant may be harmful and should not be shipped.
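
This lesson does not describe how Confidence computes its SRM check. A common textbook way to test for a sample ratio mismatch between two groups is a chi-square goodness-of-fit test against the intended split, and the sketch below shows that approach. The function name srm_p_value, its parameters, and the example counts are assumptions for illustration.

```python
import math


def srm_p_value(control_users: int, treatment_users: int,
                expected_control_share: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value for a two-group split (1 degree of freedom)."""
    total = control_users + treatment_users
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((control_users - expected_control) ** 2 / expected_control
            + (treatment_users - expected_treatment) ** 2 / expected_treatment)
    # With 1 degree of freedom, the chi-square survival function
    # equals erfc(sqrt(chi2 / 2)), so no external library is needed.
    return math.erfc(math.sqrt(chi2 / 2.0))


# Made-up counts for an intended 50/50 split.
p = srm_p_value(control_users=50_000, treatment_users=51_000)
print(f"SRM p-value: {p:.4f}")
```

A very small p-value suggests that the observed split is unlikely under the intended allocation, which is the situation that calls for investigating the exposure logic.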

After the experiment ends

After you end an experiment, the Continue and End recommendations merge into a single "Don't ship" label. The Abort recommendation may also appear if there was a health check failure. The Ship recommendation remains if the evidence for shipping was already established before the experiment ended.
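
The relabeling can be pictured as a small mapping. This is a minimal sketch with a hypothetical function name. It assumes that Abort and Ship carry over unchanged, which simplifies the conditions described in the paragraph above.

```python
def label_after_end(recommendation: str) -> str:
    """Map a live recommendation to the label shown once the experiment has ended."""
    # Continue and End both collapse into the same label.
    if recommendation in ("Continue", "End"):
        return "Don't ship"
    # Assumption: Ship and Abort are shown as they were.
    return recommendation


for live in ("Ship", "Continue", "End", "Abort"):
    print(f"{live} -> {label_after_end(live)}")
```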

The full picture

The Spotlight is a synthesis. It takes the outcomes of all success metrics, all guardrail metrics, and all health checks, and compresses them into one recommendation per treatment variant. Understanding the individual components (significance, confidence interval width, health checks, evaluation strategy) lets you look at a Spotlight recommendation and trace it back to its causes.
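
One way to internalize the synthesis is to read the four definitions as a decision procedure. The sketch below is a hypothetical encoding of the rules as stated in this lesson, not Confidence's actual implementation. The function name, its boolean inputs, and the order of the checks are assumptions. The lesson also does not say what happens when a success metric has improved but a guardrail has not yet met its tolerance level, so the sketch assumes Continue for that case.

```python
def recommend(*, any_success_improved: bool, guardrails_ok: bool,
              health_check_failed: bool, metric_deteriorated: bool,
              powered_for_all_success_metrics: bool) -> str:
    """Return a Spotlight-style recommendation for one treatment variant."""
    # Abort: a failed health check or a significant deterioration.
    if health_check_failed or metric_deteriorated:
        return "Abort"
    # Ship: a significant win with every guardrail within tolerance.
    if any_success_improved and guardrails_ok:
        return "Ship"
    # End: enough data collected, but no success metric has improved.
    if powered_for_all_success_metrics and not any_success_improved:
        return "End"
    # Continue: everything else, including the case the lesson leaves
    # unspecified (a win while a guardrail is still unresolved).
    return "Continue"


print(recommend(any_success_improved=False, guardrails_ok=True,
                health_check_failed=False, metric_deteriorated=False,
                powered_for_all_success_metrics=True))  # End
```

Tracing a recommendation back to its causes amounts to asking which of these branches fired and which input made it fire.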

Notes for nerds

The Spotlight's synthesis of many metric results into a single recommendation is a non-trivial statistical problem. When you have many metrics, each tested at some significance level, the probability that at least one shows a spurious significant result grows quickly. Spotify has published research on how to make principled, risk-aware product decisions under exactly this kind of multi-metric setting. The underlying ideas are described in a paper by Schultzberg, Ankargren, and Frånberg (2024). You can read the engineering post at engineering.atspotify.com.
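
To see how quickly the problem grows, consider the simplest case: every metric has no true effect, the tests are independent, and no multiple-testing correction is applied. The chance of at least one spurious significant result is then 1 - (1 - alpha)^m for m metrics tested at level alpha. The sketch below computes this. The independence assumption and the function name are simplifications for illustration and do not describe how Confidence handles the problem.

```python
def familywise_error_rate(alpha: float, num_metrics: int) -> float:
    """P(at least one false positive) across independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** num_metrics


for m in (1, 5, 10, 20):
    print(m, round(familywise_error_rate(0.05, m), 3))
# 1 0.05
# 5 0.226
# 10 0.401
# 20 0.642
```

With 20 uncorrected metrics at the 5% level, a false alarm somewhere is more likely than not, which is why a principled decision rule is needed.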