Two Questions Every Experiment Should Answer

Mårten Schultzberg, Staff Data Scientist

If you want to experiment like Spotify, check out our experimentation platform Confidence and get a personalized demo.


For many experimenting teams, there is nothing more frustrating than a flat line. You had a hypothesis. You built the feature. You ran the test. And… nothing happened. The result is null. Neutral results can be informative, but only if you know why they happened and what product decision they should lead to. The problem is that many teams design experiments that set them up for failure-to-learn before they even launch.

If you didn't think carefully before the experiment, a neutral result can leave you stuck with an annoying new problem: ambiguity. Was the experiment neutral because the underlying idea was not strong enough to provoke a reaction? Or because your implementation was too timid to provoke one?

The Trap of Conservative Implementation

When we have an idea, like improving onboarding, changing the tone of messaging, or improving a personalization algorithm, we usually implement a version that feels reasonable, professional, and "safe." We run the experiment, and no effect is detectable.

This happens because the two phases of product development are often confused:

  • Existence (Building the Right Thing): Is this a lever that actually moves user behavior when pulled? Do users notice or care when you touch this part of the product?

  • Optimization (Building the Thing Right): What is the optimal setting for this lever?

It's a big mistake to try to answer the second question before you have answered the first. You are trying to perfect a change without confirming that the change as a whole actually matters to users.

The Two Dimensions of Rigorous Tests

To run experiments that yield clear learnings every time, you need to align two critical dimensions with your intent:

  1. Implementation Boldness: Is your variant provocative enough to detect an effect if one exists?
  2. Statistical Power: Is your sample size large enough to reliably detect a plausible effect from your change?

Both matter, but they matter differently depending on what you're trying to learn. When identifying whether a lever exists, you need a provocative implementation AND high enough power. When optimizing a proven lever, you naturally test more conservative variations, but you still need adequate power to detect meaningful differences.
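To make the interplay between boldness and power concrete, here is a minimal pre-experiment sample-size sketch. It assumes a conversion-rate metric with made-up numbers (a 20% baseline, lifts of +0.5pp and +2pp) and uses statsmodels for the power math; treat it as an illustration, not a prescription:

```python
# Minimal power-analysis sketch: users needed per group to detect a given lift
# on a 20% baseline conversion rate at alpha = 0.05 and power = 0.8.
# Baseline and lifts are illustrative assumptions, not real product numbers.
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import zt_ind_solve_power

baseline = 0.20  # assumed control conversion rate

scenarios = [
    ("timid variant, plausible lift +0.5pp", 0.005),
    ("provocative variant, plausible lift +2pp", 0.02),
]
for label, lift in scenarios:
    effect = proportion_effectsize(baseline + lift, baseline)  # Cohen's h
    n_per_group = zt_ind_solve_power(
        effect_size=effect, alpha=0.05, power=0.8, alternative="two-sided"
    )
    print(f"{label}: ~{n_per_group:,.0f} users per group")

# The timid variant needs roughly 15x as many users per group, because the
# required sample size grows roughly with 1 / effect_size**2.
```

In other words, a bolder implementation does part of the statistical work for you: the larger the plausible effect, the smaller the sample you need to detect it.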

The flow chart below illustrates the path from idea to experiment.


The Case for Provocative Variants

One of our Confidence customers, the Swedish company Mentimeter, has developed a simple but brilliant step in their pre-experiment review to systematically solve this.

In the words of Sandra Knutsson, growth designer at Mentimeter:

A key difference between designing for an experiment and designing for traditional product development is that you're not only trying to solve a problem. You're designing a solution that helps you accept or reject a hypothesis. Sometimes that means you need to amplify the changes in the experience more than you would in a final implementation, so the change is detectable enough for a user behavior to change.

Before launching a test in Confidence, they explicitly ask: "Is this change bold enough to change human behavior?" Often, they realize that the first version of the proposed change is too polite. Even if the hypothesis is correct, the plausible difference in behavior between control and treatment is too small to detect without a massive sample size.

The solution: the Provocative Variant. Implement a version of the idea specifically intended to be loud. This doesn't mean being reckless or breaking the UX. It means finding the Maximum Viable Change: the loudest possible version of the idea that still functions as a user experience. Naturally, judgment is essential here. Push too far, and you'll only confirm that users dislike obviously flawed experiences. In our experience, there is usually ample room between a conservative first version and a credible provocative variant.

The maximum viable change gives you a clear answer. If it produces no movement, you have strong evidence this isn't a lever worth pulling. If it does show an effect, even a negative one, you've confirmed that this dimension influences user behavior. Now you can invest in finding the right expression.

Combine the Right Implementation with the Right Sample Size

Of course, even a maximum viable change requires a sample size large enough to detect a plausible effect. If the experiment is underpowered, you will always have ambiguity—regardless of the implementation.
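To illustrate why, here is a small simulation sketch: a real lift exists, but an underpowered test rarely flags it, so a null read tells you very little. The baseline rate, lift, and sample sizes are made-up assumptions chosen for illustration:

```python
# Simulation sketch: how often does a two-proportion z-test flag a real +1pp lift
# at different sample sizes? All numbers are illustrative assumptions.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(7)
baseline, lift = 0.20, 0.01  # assumed true rates: 20% in control, 21% in treatment
n_sims = 2_000

for n_per_group in (2_000, 50_000):
    rejections = 0
    for _ in range(n_sims):
        control = rng.binomial(n_per_group, baseline)
        treatment = rng.binomial(n_per_group, baseline + lift)
        _, p_value = proportions_ztest([treatment, control], [n_per_group, n_per_group])
        rejections += p_value < 0.05
    print(f"n = {n_per_group:>6} per group: lift detected in {rejections / n_sims:.0%} of runs")

# With 2,000 users per group the real effect is missed most of the time, so a
# null result says almost nothing; with 50,000 per group it is detected reliably.
```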

  • If a bold change creates no movement: You have strong evidence that this lever doesn't matter. You can safely abandon the idea and move on.
  • If a bold change tanks metrics: Great! You have proven that this dimension influences behavior. Now you can dial it back and focus on optimization.

This can be summarized in the following table:

| Intent | Outcome | Provocative + Powered | Provocative + Underpowered | Conservative + Powered | Conservative + Underpowered |
|---|---|---|---|---|---|
| Identifying lever | Null result | High-value: Lever doesn't exist. Move on. | Ambiguous: Can't distinguish "no effect" from "can't detect". | Ambiguous: No lever, or weak implementation? | No learning: Complete ambiguity. |
| Identifying lever | Effect found | High-value: Lever exists. Now optimize. | Useful: But estimates may be noisy. | High-value: Lever is more sensitive than expected. | Useful: Lever is more sensitive than expected, estimates may be noisy. |
| Optimizing lever | Null result | Useful: Even bold versions have no effect. | Ambiguous: Can't distinguish "no effect" from "can't detect". | High-value: This variation isn't better than the existing one. | Ambiguous: Can't distinguish "no effect" from "can't detect". |
| Optimizing lever | Effect found | High-value: Found an improvement. | Useful: But estimates may be noisy. | High-value: Found an improvement. | Useful: Lever is more sensitive than expected, estimates may be noisy. |

The pattern is clear: when identifying levers, you need both provocative implementations and adequate power. When optimizing levers, you can test conservative variations, but power remains essential for reliable conclusions.

This table might suggest that you will win most of the time, but of course not all cells are equally likely. Most companies have quite low rates of experiments that move the needle in any way. See this earlier blog post for details on Spotify's learning rate.

Teams that excel at experimentation know that you cannot optimize a lever that doesn't exist. They design experiments to establish existence first, both by coming up with a version of the change that is provocative enough and by ensuring that the experiment is adequately powered.

What This Means for Your Next Experiment

Before you implement your next test variant:

  1. Establish what you're testing. Are you checking if this type of change matters, or optimizing how to implement it? Be explicit.

  2. Make it bold enough to learn from. If you haven't established that this lever matters, your variant needs to be different enough that users would react if they care at all. Side-by-side, control and treatment should feel noticeably different.

  3. Ensure adequate statistical power. Run a power analysis based on the smallest plausible effect of this change. If you can't achieve adequate power with the available traffic, consider testing something else, or increasing the sample size by changing the targeting or duration of the experiment (see the sketch after this list).

  4. Know what you'll learn from each outcome. In a well-designed experiment, you should learn something from every outcome. If a neutral result would be useless, you should not run the experiment.

  5. Maximize your chance of learning by considering testing multiple variants: your best guess at the optimal implementation, and a deliberately provocative version designed to test whether anyone cares.
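As a companion to step 3, here is a minimal sketch of the reverse calculation: given the traffic you expect over the planned duration, what is the smallest lift the experiment can reliably detect? The baseline rate, weekly traffic, and 50/50 split are illustrative assumptions:

```python
# Sketch: smallest detectable lift given the traffic you expect (illustrative numbers).
import numpy as np
from statsmodels.stats.power import zt_ind_solve_power

baseline = 0.20          # assumed control conversion rate
weekly_traffic = 40_000  # assumed eligible users per week
weeks = 2                # planned experiment duration
n_per_group = weekly_traffic * weeks // 2  # 50/50 split between control and treatment

# Solve for the minimum detectable effect size (Cohen's h) at alpha = 0.05, power = 0.8.
h = zt_ind_solve_power(
    effect_size=None, nobs1=n_per_group, alpha=0.05, power=0.8, alternative="two-sided"
)

# Convert Cohen's h back to an absolute lift on the baseline conversion rate.
p_treatment = np.sin(np.arcsin(np.sqrt(baseline)) + h / 2) ** 2
print(f"Minimum detectable lift: about {(p_treatment - baseline) * 100:.1f} percentage points")

# If the plausible effect of your change is smaller than this, make the variant
# bolder, widen the targeting, or run the experiment for longer.
```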

The Real Cost of Timid Tests

The most expensive outcome in experimentation isn't a failed test. It's a neutral result that you can't learn from. You've spent resources on implementing and running a test, but you haven't learned whether to invest more in this direction or move on entirely. You've bought yourself ambiguity when you needed clarity.

In our experience, experiment design is often framed as a statistical problem, when it is really a product problem first. Designing an experiment is about two things: coming up with an implementation of an idea that can have a reasonably sized effect, and then ensuring that the sample size is large enough to detect that effect.