Experiment Analysis

What is the garden of forking paths?

The garden of forking paths refers to the many implicit analytical choices a researcher or analyst makes during an experiment's lifecycle, each of which could have gone differently and each of which affects the final result. These choices include which metric to focus on, how to define that metric, which users to include, when to look at results, which segments to examine, whether to remove outliers, and how to handle missing data. Every choice is a fork, and the path you take through these forks determines what you find. The problem: if you make these choices after seeing the data, the false positive rate can be far higher than the nominal 5%.

This concept, named by statisticians Andrew Gelman and Eric Loken after a short story by Jorge Luis Borges, explains why two honest analysts can reach opposite conclusions from the same experiment. Neither is fabricating results. They're walking different paths through the same garden.

Why do forking paths inflate false positives?

A standard significance test assumes the analysis was specified before the data was collected. The 5% false positive rate applies to that one, pre-specified analysis. Every additional choice you make after seeing the data is an implicit additional test.

If you run your analysis, see a non-significant result, then try a different metric definition, exclude a few outliers, restrict to a specific segment, and find significance, you haven't necessarily discovered a real effect. You've searched across multiple analyses and found one that happened to cross the threshold.

The scale of the problem is larger than most teams realize. A researcher with five reasonable metric definitions, three plausible inclusion criteria, and two outlier-handling approaches has 5 × 3 × 2 = 30 distinct analyses to choose from. Even if every analysis is "defensible," choosing the one that gives the best result inflates the effective false positive rate well beyond 5%.
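To get a feel for the size of the inflation, here is a minimal simulation sketch of an A/A test (no true effect) analyzed along 30 paths. Everything in it is an illustrative assumption, not taken from any real pipeline: the five metric windows, the "mobile" and "new" segments, the 95th-percentile cap, the exponential daily values, and the normal-approximation z-test. If the 30 analyses were fully independent, the chance of at least one false positive would be 1 - 0.95^30, or about 0.79. Real forks are correlated because they reuse the same users, so the simulated rate typically lands below that bound but still above the nominal 5%.

```python
import itertools
import math
import random
from statistics import NormalDist

ALPHA = 0.05
WINDOWS = [1, 3, 7, 14, 28]             # five metric definitions (days counted), hypothetical
POPULATIONS = ["all", "mobile", "new"]  # three inclusion criteria, hypothetical
OUTLIER_RULES = ["keep", "cap_p95"]     # two outlier-handling approaches, hypothetical


def independent_bound(n_paths, alpha=ALPHA):
    """Chance of at least one false positive if all paths were independent tests."""
    return 1 - (1 - alpha) ** n_paths


def make_user(rng):
    """One user with no treatment effect: 28 daily values and two segment flags."""
    return {
        "daily": [rng.expovariate(1.0) for _ in range(28)],
        "mobile": rng.random() < 0.5,
        "new": rng.random() < 0.5,
    }


def mean_var(xs):
    """Sample mean and unbiased sample variance."""
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def p_value(a, b):
    """Two-sided z-test on the difference in means (normal approximation)."""
    if len(a) < 2 or len(b) < 2:
        return 1.0
    (mean_a, var_a), (mean_b, var_b) = mean_var(a), mean_var(b)
    se = math.sqrt(var_a / len(a) + var_b / len(b))
    if se == 0:
        return 1.0
    z = (mean_a - mean_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def analyze(control, treatment, window, population, outlier_rule):
    """Run one path through the garden and return its p-value."""
    def metric(users):
        kept = [u for u in users if population == "all" or u[population]]
        return [sum(u["daily"][:window]) for u in kept]

    a, b = metric(control), metric(treatment)
    if outlier_rule == "cap_p95" and a and b:
        pooled = sorted(a + b)
        cap = pooled[int(0.95 * (len(pooled) - 1))]
        a = [min(x, cap) for x in a]
        b = [min(x, cap) for x in b]
    return p_value(a, b)


def simulate(n_sims=200, n_per_arm=200, seed=1):
    """Return (number of paths, pre-specified FPR, best-of-all-paths FPR)."""
    rng = random.Random(seed)
    paths = list(itertools.product(WINDOWS, POPULATIONS, OUTLIER_RULES))
    single_hits = forked_hits = 0
    for _ in range(n_sims):
        control = [make_user(rng) for _ in range(n_per_arm)]
        treatment = [make_user(rng) for _ in range(n_per_arm)]
        p_values = [analyze(control, treatment, *path) for path in paths]
        single_hits += int(p_values[0] < ALPHA)   # the one path fixed in advance
        forked_hits += int(min(p_values) < ALPHA)  # report whichever path "worked"
    return len(paths), single_hits / n_sims, forked_hits / n_sims


if __name__ == "__main__":
    n_paths, single, forked = simulate()
    print(f"paths: {n_paths}")
    print(f"pre-specified path false positive rate: {single:.3f}")
    print(f"best-of-{n_paths} false positive rate: {forked:.3f}")
    print(f"independence upper bound: {independent_bound(n_paths):.3f}")
```

Because the pre-specified path is one of the 30, the best-of-30 rate can never be lower than the single-path rate. The gap between the two is the cost of choosing after seeing the data.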

How do you stay on one path?

Pre-registration is the primary defense. Decide your success metric, your analysis population, your outlier rules, and your stopping criteria before the experiment starts. Write them down. Confidence encourages this by requiring teams to configure metrics and analysis settings before an experiment launches, not after results come in.
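One lightweight way to "write them down" is to freeze the plan as data and store a fingerprint of it before launch. The sketch below is an assumption-laden illustration, not Confidence's API: the AnalysisPlan class, its field names, and the example values are all hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class AnalysisPlan:
    """Hypothetical pre-registration record; field names are illustrative."""
    success_metric: str
    metric_window_days: int
    population: str
    outlier_rule: str
    max_runtime_days: int
    alpha: float = 0.05

    def fingerprint(self) -> str:
        """Stable SHA-256 hash of the plan's contents."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def matches_registration(plan: AnalysisPlan, registered_fingerprint: str) -> bool:
    """True only if the plan used for analysis is the one registered before launch."""
    return plan.fingerprint() == registered_fingerprint


# Before launch: decide everything and record the fingerprint somewhere durable.
plan = AnalysisPlan(
    success_metric="conversion_rate",
    metric_window_days=7,
    population="all_assigned_users",
    outlier_rule="none",
    max_runtime_days=14,
)
registered = plan.fingerprint()

# After results: a post hoc switch to a 14-day window no longer matches.
post_hoc = replace(plan, metric_window_days=14)
print(matches_registration(plan, registered))      # True
print(matches_registration(post_hoc, registered))  # False
```

The fingerprint does not prevent anyone from running other analyses. It only makes it visible when the reported analysis differs from the one that was planned.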

Automate the analysis. When the analysis pipeline is standardized and runs the same way for every experiment, there's little room for ad hoc choices. Confidence's automated experiment analysis is designed with this in mind: every experiment gets the same statistical tests, the same SRM check, the same trigger analysis, the same multiple testing corrections. The analyst doesn't choose which tests to run.

Separate exploration from confirmation. Looking at segment breakdowns, alternative metrics, and different time windows is useful for generating hypotheses. It becomes dangerous when those explorations are presented as confirmatory results. The garden of forking paths doesn't mean you can't explore. It means exploration findings need to be validated in a subsequent, pre-specified experiment.

Where do the forks hide in practice?

Some forks are obvious: choosing which metric to report, or deciding to exclude a user segment that "doesn't make sense." Others are subtle:

Defining the metric window. "Did we count conversions within 7 days or 14 days?" Both are reasonable. Choosing after seeing which window gives significance is a fork.

Handling the ramp period. "We excluded the first 3 days because users were still adapting." If that decision was made after seeing that early data looked different, it's a fork.

Choosing the denominator. "We focused on active users" vs. "all assigned users." Both are legitimate analysis populations. The choice should be made before results are available.

At Spotify, the standardization of analysis across 10,000+ experiments per year is partly motivated by the forking paths problem. When every experiment runs through the same pipeline, the garden shrinks to a single well-lit path.