Experiment design is the plan for how an experiment will be structured, sized, and analyzed before it begins. It covers the hypothesis, the metric set, the traffic allocation, the statistical power calculation, and the decision criteria for shipping or rolling back. A good experiment design answers: what are we testing, how will we know if it worked, and how long will it take to get a trustworthy answer?
Getting the design right before the experiment starts is more important than any amount of post-hoc analysis. A poorly designed experiment wastes experiment bandwidth, one of the scarcest resources a product organization has. At Spotify, where 58 teams ran 520 experiments on the mobile home screen alone in a single year, a badly sized or poorly scoped experiment doesn't just fail to produce learning. It blocks the next experiment in the queue.
What goes into an experiment design?
Five decisions need to be made before you start.
The hypothesis. A specific, testable prediction: "Changing X will improve metric Y because Z." The hypothesis forces the team to commit to what they expect and why. Without it, the experiment becomes a fishing expedition where any positive metric gets declared a win.
The metric set. A success metric measures what you're trying to improve. Guardrail metrics monitor what you're trying not to break. Quality metrics provide context for interpreting the result. Confidence's decision framework formally distinguishes these metric types and applies different statistical treatments to each. Success metrics are corrected for multiple testing; guardrail metrics are tested for non-inferiority without correction, because missing a regression is more costly than a false alarm.
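To make the two treatments concrete, here is a minimal sketch using a Bonferroni correction as a simple stand-in for a multiple-testing correction and a one-sided z-test for non-inferiority. The metric names, p-values, margin, and effect estimates are made up, and Confidence's actual correction method and test statistics may differ.

```python
from scipy import stats

alpha = 0.05

# Success metrics: correct for the number of success metrics tested.
# Bonferroni shown as a simple stand-in; p-values are hypothetical.
success_p_values = {"conversion": 0.011, "minutes_played": 0.032}
corrected_alpha = alpha / len(success_p_values)
for metric, p in success_p_values.items():
    print(f"{metric}: significant after correction? {p < corrected_alpha}")

# Guardrail metric: one-sided non-inferiority test at the uncorrected level.
# H0: treatment is worse than control by more than the margin.
margin = 0.01   # largest regression we are willing to tolerate (hypothetical)
diff = -0.002   # observed treatment-minus-control difference (hypothetical)
se = 0.004      # standard error of that difference (hypothetical)
z = (diff + margin) / se
p_noninferior = 1 - stats.norm.cdf(z)
print(f"guardrail non-inferior? {p_noninferior < alpha}")
```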
The traffic allocation. How much of your user base participates, and how it's split between control and treatment. A 50/50 split maximizes statistical power. Smaller treatment allocations (90/10, 95/5) reduce risk exposure but require more total traffic. The right ratio depends on how confident you are that the change won't cause harm.
The power calculation. Given your metric's variance, your traffic, and the minimum detectable effect (MDE) you care about, how long will the experiment need to run? An underpowered experiment produces ambiguous null results that teach nothing: you can't tell whether the change had no effect or whether the test was too small to detect it. Confidence calculates required sample sizes using the actual metric variance from your warehouse, accounting for variance reduction from CUPED if enabled.
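As a rough illustration of the arithmetic, here is a back-of-the-envelope sample size calculation using the standard normal-approximation formula, which also shows how the traffic split affects the total traffic required. The standard deviation, MDE, and allocation shares are hypothetical; Confidence's calculator, which uses the actual variance from your warehouse and accounts for CUPED and sequential testing, will give different numbers.

```python
from scipy import stats

def required_total_sample(sigma, mde, treatment_share, alpha=0.05, power=0.8):
    """Total users needed to detect an absolute difference of `mde` in a
    metric with per-user standard deviation `sigma`, at the given treatment
    allocation (two-sided fixed-horizon test, normal approximation)."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_power = stats.norm.ppf(power)
    # Unequal splits inflate the variance of the difference estimate:
    # 1/share + 1/(1 - share) is minimized at a 50/50 split.
    allocation_factor = 1 / treatment_share + 1 / (1 - treatment_share)
    return (z_alpha + z_power) ** 2 * sigma ** 2 * allocation_factor / mde ** 2

# Hypothetical metric: standard deviation 0.45, MDE of one percentage point.
# CUPED would shrink sigma**2 (roughly by 1 - rho**2 for a pre-experiment
# covariate with correlation rho), reducing these numbers accordingly.
for share in (0.50, 0.10, 0.05):
    n = required_total_sample(sigma=0.45, mde=0.01, treatment_share=share)
    print(f"{share:.0%} treatment / {1 - share:.0%} control -> ~{n:,.0f} users")
```

With these example numbers, moving from a 50/50 split to a 95/5 split requires over five times the total traffic to detect the same effect, which is the trade-off behind conservative allocations.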
The decision criteria. What will you do with the result? Ship if the success metric is significant and no guardrails are violated? Iterate if the result is inconclusive? This should be agreed upon before the data arrives.
Why does pre-experiment design matter more than post-experiment analysis?
The most common source of misleading experiment results isn't bad statistics. It's decisions made after seeing the data. If you choose your metric after the experiment, you'll gravitate toward whichever one looks best. If you pick the analysis window after seeing the time series, you'll unconsciously anchor on the period that shows significance. If you decide which user segments to examine after seeing the top-level result, you'll find a subgroup that confirms your preference.
This is the garden of forking paths: every analytical choice you make after seeing the data inflates your false positive rate, even if each individual choice seems reasonable.
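A quick calculation shows how fast this inflation compounds if, purely for illustration, each fork is treated as an independent test at a significance level of 0.05. Real forks are correlated, so the exact numbers differ, but the direction is the same.

```python
# Probability of at least one spurious "significant" result across k
# independent analytical forks, each tested at alpha = 0.05.
alpha = 0.05
for forks in (1, 3, 5, 10, 20):
    p_any_false_positive = 1 - (1 - alpha) ** forks
    print(f"{forks:>2} forks -> {p_any_false_positive:.0%} chance of a spurious win")
```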
Pre-registering the experiment design, as Confidence encourages, locks in these decisions before the data exists. The hypothesis, the metrics, the analysis plan, and the decision criteria are recorded in the experiment setup. When results arrive, the analysis runs against the pre-specified plan.
How does Confidence support experiment design?
Confidence structures experiment creation around the design decisions that matter. When you set up an experiment in Confidence, you define the hypothesis, select success and guardrail metrics, specify the traffic allocation, and run a power calculation that tells you how long the experiment will need to run.
The platform then automates the parts of the analysis that should be consistent across every experiment: sequential testing for safe interim looks, multiple testing corrections for the success metrics, SRM checks for randomization integrity, and variance reduction via CUPED to shorten runtime. These aren't optional add-ons. They're the default analysis, built from fifteen years of learning at Spotify about what goes wrong when they're missing.
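For a sense of what two of these defaults involve, here is an illustrative sketch of an SRM check (a chi-square goodness-of-fit test on assignment counts) and a CUPED adjustment on synthetic data. It shows the general techniques, not Confidence's implementation, and the counts and metric values are made up.

```python
import numpy as np
from scipy import stats

# SRM check: compare observed assignment counts to the configured 50/50 split
# with a chi-square goodness-of-fit test. Counts here are hypothetical.
observed = np.array([100_800, 99_000])
expected = observed.sum() * np.array([0.5, 0.5])
chi2, p_value = stats.chisquare(observed, f_exp=expected)
print(f"SRM p-value: {p_value:.2e}")  # a very small p-value means the split is broken

# CUPED: regress the in-experiment metric on a pre-experiment covariate and
# subtract the explained part, shrinking the variance of the estimate.
rng = np.random.default_rng(7)
pre = rng.normal(10, 3, size=50_000)              # pre-exposure metric (covariate)
post = 0.6 * pre + rng.normal(4, 2, size=50_000)  # in-experiment metric
cov = np.cov(pre, post)
theta = cov[0, 1] / cov[0, 0]
post_cuped = post - theta * (pre - pre.mean())
print(f"CUPED variance reduction: {1 - post_cuped.var() / post.var():.0%}")
```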
The design decisions that require judgment, like which metric to use as the success metric or how aggressive the implementation should be, stay with the product team. Mentimeter calls this the "Maximum Viable Change": testing the boldest credible version of the idea first, so you know whether the lever exists before you optimize the implementation.