How to Set Up a Hypothesis
A well-formulated hypothesis is a specific assumption that can be conclusively tested through an experiment. Not all hypotheses are equally effective. An effective hypothesis should be:
- a statement, not a question
- clear about what experiment outcomes would support or weaken it
- clear about the key variables
- grounded in past research/learnings
- written with as few assumptions as possible
A hypothesis statement generally follows this template: doing this/building this feature/creating this experience for these people/personas should result in a change in their behavior, as measured by success metrics. The data supports the hypothesis if the success metrics change by at least the minimum detectable effect. You can read more about minimum detectable effects (MDE) on the effect sizes page.
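To make the MDE concrete, here is a minimal sketch of how an MDE translates into a required sample size, assuming a binary success metric. The 20% baseline rate, significance level, and power are hypothetical planning inputs, not fixed rules:

```python
# Sketch: turning an MDE into a per-group sample size (assumed inputs).
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.20  # assumed control-group rate (hypothetical)
mde = 0.025      # minimum detectable effect: 2.5 percentage points

# Standardized effect size (Cohen's h) for baseline vs. baseline + MDE
effect_size = proportion_effectsize(baseline + mde, baseline)

# Per-group sample size at 5% significance and 80% power
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.80, alternative="two-sided"
)
print(f"Users needed per group: {n_per_group:.0f}")
```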
Example Hypothesis
Imagine your team is building an autoplay feature for the Spotify mobile app. Your team’s goals are:
- Make it easier for people to continue listening when their content ends.
- Lead users to listen to more content curated by Spotify.
Based on these goals, the hypothesis could read: continuing to play music or podcasts when a play context ends for all users should result in users listening to more of Spotify’s curated content rather than searching for something else to play themselves, as measured by percent programmed content. The data supports the hypothesis if percent programmed content increases by 2.5pp (percentage points).
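Here is a sketch of how this hypothesis could be evaluated, assuming percent programmed content is available as a per-user value. The simulated data, group sizes, and test choice are illustrative assumptions:

```python
# Sketch: checking the Autoplay hypothesis against the 2.5pp MDE
# using simulated per-user percent programmed content.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
control = rng.normal(loc=20.0, scale=8.0, size=5000)    # assumed ~20% baseline
treatment = rng.normal(loc=22.5, scale=8.0, size=5000)  # assumed +2.5pp shift

lift = treatment.mean() - control.mean()
t_stat, p_value = stats.ttest_ind(treatment, control)

print(f"Estimated lift: {lift:.2f}pp, p-value: {p_value:.4f}")
# The data supports the hypothesis if the lift is statistically
# significant and at least as large as the 2.5pp MDE.
print("Supports hypothesis:", p_value < 0.05 and lift >= 2.5)
```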
Composite Hypothesis
Many experiments use one or two success metrics and a few guardrail metrics. In this scenario, write a hypothesis statement for each success metric; for the guardrails, it’s generally enough to state the hypothesis that the treatment does not deteriorate the guardrail metrics by more than the acceptable margins (known as non-inferiority margins).

Consider the earlier Autoplay example. The guardrail metrics are the skip rate of programmed content and the app crash rate. To also include the guardrail metrics, change the hypothesis statement as follows: continuing to play music or podcasts when a play context ends for all users should result in users listening to more of Spotify’s curated content rather than searching for something else to play themselves, as measured by percent programmed content. The data supports the hypothesis if percent programmed content increases by 2.5pp, while the app crash rate and the programmed content skip rate don’t increase by more than the acceptable margins.

The hypothesis statements for the success metrics intend to capture a change in user behavior that is measurable by some metric. For guardrail metrics, expect no change, or only a small one. In settings like this, you need to define the decision rule for a successful experiment upfront, as in the sketch below. For example, if the treatment significantly improves one success metric, but there is no evidence of non-inferiority on the guardrail metrics, should you ship this variant or not? Have you found enough evidence that this variant is better than the current default version? Read more about the decision rules.
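Such a decision rule can be encoded explicitly. The following sketch ships the treatment only if the success metric shows a significant improvement and every guardrail stays within its non-inferiority margin; the metric values, margins, and normal-approximation confidence intervals are illustrative assumptions:

```python
# Sketch: an upfront decision rule combining a superiority check on the
# success metric with non-inferiority checks on the guardrail metrics.
import numpy as np
from scipy import stats

def diff_ci(treat, ctrl, alpha=0.05):
    """Two-sided (1 - alpha) CI for the difference in means (normal approx.)."""
    diff = treat.mean() - ctrl.mean()
    se = np.sqrt(treat.var(ddof=1) / len(treat) + ctrl.var(ddof=1) / len(ctrl))
    z = stats.norm.ppf(1 - alpha / 2)
    return diff - z * se, diff + z * se

rng = np.random.default_rng(7)
# Hypothetical per-user observations: (treatment, control) for each metric
pct_programmed = (rng.normal(22.5, 8, 5000), rng.normal(20.0, 8, 5000))
skip_rate      = (rng.normal(10.1, 4, 5000), rng.normal(10.0, 4, 5000))
crash_rate     = (rng.normal(0.51, 0.3, 5000), rng.normal(0.50, 0.3, 5000))

# Success metric: the CI for the lift must sit entirely above zero.
lift_lo, _ = diff_ci(*pct_programmed)
success_ok = lift_lo > 0

# Guardrails: the upper CI bound of the change must stay below the margin
# (non-inferiority for metrics where an increase is harmful).
margins = {"skip_rate": 0.5, "crash_rate": 0.05}  # illustrative margins
_, skip_hi = diff_ci(*skip_rate)
_, crash_hi = diff_ci(*crash_rate)
guardrails_ok = skip_hi < margins["skip_rate"] and crash_hi < margins["crash_rate"]

print("Ship the treatment:", success_ok and guardrails_ok)
```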

