A hypothesis is a testable prediction about the effect of a specific product change on a specific metric. It takes the form: "We believe that changing X will cause metric Y to move by at least Z, because of reason R." The hypothesis is the bridge between a product idea and an experiment. Without one, you're running a test with no criteria for success.
A hypothesis does three things. It forces the team to commit to what they expect before seeing the data, which prevents post-hoc rationalization. It defines what "success" means for the experiment, which determines the metrics and the analysis plan. And it captures the reasoning behind the change, which means the team learns something whether the experiment succeeds or fails. At Spotify, the learning rate across experiments is roughly 64%, far higher than the 12% win rate. Most of that learning comes from experiments where the hypothesis was clear enough that a negative result still sharpened the team's understanding of user behavior.
What makes a good hypothesis?
A good hypothesis is specific, measurable, and falsifiable.
Specific means it names the change, the metric, and the mechanism. "We think the new checkout flow will improve conversion" is too vague. "We think simplifying the checkout form from five fields to three will increase purchase completion rate by at least 2 percentage points, because user research shows that form length is the primary drop-off point" is specific enough to test and learn from.
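The "X, Y, Z, R" template can be made concrete as a small data structure. This is an illustrative sketch, not Confidence's actual data model; the `Hypothesis` class and its field names are hypothetical:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hypothesis:
    """A testable prediction: change X moves metric Y by at least Z, because R."""
    change: str        # X: the product change being tested
    metric: str        # Y: the success metric
    min_effect: float  # Z: the minimum effect the team cares about
    reason: str        # R: the mechanism behind the expected effect

    def statement(self) -> str:
        return (
            f"We believe that {self.change} will cause {self.metric} "
            f"to move by at least {self.min_effect:.0%}, because {self.reason}."
        )

# The checkout example from above, expressed in the template:
checkout = Hypothesis(
    change="simplifying the checkout form from five fields to three",
    metric="purchase completion rate",
    min_effect=0.02,  # 2 percentage points
    reason="user research shows that form length is the primary drop-off point",
)
print(checkout.statement())
```

Forcing every field to be filled in is the point: a vague idea like "the new checkout flow will improve conversion" cannot be written down in this form without answering the questions that make it testable.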
Measurable means the success metric exists, can be computed from the data you have, and has a known baseline. If you can't measure the outcome, the hypothesis is untestable. Confidence connects experiment design to warehouse metrics, so you can verify that the metric you're committing to is actually available before the experiment starts.
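A measurability check can be sketched as a pre-flight validation before the experiment starts. The `warehouse_metrics` catalog and metric names below are purely illustrative assumptions, not Confidence's actual API:

```python
# Hypothetical metric catalog: in practice this would come from the
# warehouse, as described above; the entries here are made up.
warehouse_metrics = {
    "purchase_completion_rate": {"baseline": 0.31},
    "session_length_minutes": {"baseline": 12.4},
}

def is_measurable(metric_name: str) -> bool:
    """A metric is measurable if it exists in the catalog
    and has a known baseline to compare against."""
    entry = warehouse_metrics.get(metric_name)
    return entry is not None and entry.get("baseline") is not None

print(is_measurable("purchase_completion_rate"))  # metric exists with a baseline
print(is_measurable("net_user_delight"))          # no data, so the hypothesis is untestable
```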
Falsifiable means the experiment can produce a result that would reject the hypothesis. If no realistic outcome would cause the team to abandon the idea, the experiment is confirmation theater. The hypothesis should include an implicit "and if this doesn't happen, we'll do something different."
How does a hypothesis differ from a guess?
A guess is "I think users will like this." A hypothesis is "I think this specific change will move this specific metric by this much, and here's why." The difference is accountability.
When a guess fails, you shrug. When a hypothesis fails, you learn. The "because" clause in the hypothesis is where the learning happens. If you hypothesized that simplifying the form would increase completions because form length is the drop-off point, and the experiment shows no effect, you've learned that form length probably isn't the bottleneck. That's a finding that informs the next hypothesis.
Spotify's Experiments with Learning framework tracks this distinction explicitly. The framework measures not just whether experiments produce positive results (the win rate) but whether they produce learning (the learning rate). A well-formed hypothesis is what makes a negative result informative instead of just disappointing.
How does the hypothesis connect to the experiment design?
The hypothesis determines every downstream decision in the experiment design.
The change being tested (the treatment) comes from the hypothesis. The success metric comes from the "metric Y" in the hypothesis. The minimum detectable effect (MDE) comes from the "at least Z" part: how large an effect does the team care about? The guardrail metrics come from asking "what could go wrong if the hypothesis is correct but has side effects?"
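The "at least Z" clause feeds directly into the power calculation: the smaller the effect the team cares about, the more users the experiment needs. A minimal sketch of the standard two-proportion sample-size approximation, not Confidence's actual implementation:

```python
import math
from statistics import NormalDist

def sample_size_per_arm(baseline: float, mde: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users per arm needed to detect an absolute lift of
    `mde` on a conversion rate, for a two-sided test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    p1, p2 = baseline, baseline + mde
    p_bar = (p1 + p2) / 2
    n = ((z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
          + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) / mde ** 2
    return math.ceil(n)

# Detecting a 2-point lift on a 31% baseline takes several thousand
# users per arm; halving the MDE roughly quadruples the requirement.
print(sample_size_per_arm(baseline=0.31, mde=0.02))
print(sample_size_per_arm(baseline=0.31, mde=0.01))
```

This is why committing to "at least Z" up front matters: an MDE chosen after the fact lets the team retroactively declare any observed effect meaningful.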
In Confidence, the experiment setup captures the hypothesis alongside the metric selection and power calculation. This creates a record that links the original reasoning to the final result. When a team reviews past experiments, they can trace not just what happened, but what they expected and why.