Core Experimentation

What is an Experiment Design?

Experiment design is the plan for how an experiment will be structured, sized, and analyzed before it begins.

It covers the hypothesis, the metric set, the traffic allocation, the statistical power calculation, and the decision criteria for shipping or rolling back. A good experiment design answers: what are we testing, how will we know if it worked, and how long will it take to get a trustworthy answer?

Getting the design right before the experiment starts is more important than any amount of post-hoc analysis. A poorly designed experiment wastes experiment bandwidth, one of the scarcest resources a product organization has. At Spotify, where 58 teams ran 520 experiments on the mobile home screen alone in a single year, a badly sized or poorly scoped experiment doesn't just fail to produce learning. It blocks the next experiment in the queue.

What goes into an experiment design?

Five decisions need to be made before you start.

The hypothesis. A specific, testable prediction: "Changing X will improve metric Y because Z." The hypothesis forces the team to commit to what they expect and why. Without it, the experiment becomes a fishing expedition where any positive metric gets declared a win.

The metric set. A success metric measures what you're trying to improve. Guardrail metrics monitor what you're trying not to break. Quality metrics provide context for interpreting the result. Confidence's decision framework formally distinguishes these metric types and applies different statistical treatments to each. Success metrics are corrected for multiple testing; guardrail metrics are tested for non-inferiority without correction, because missing a regression is more costly than a false alarm.
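To make the different treatments concrete, here is a minimal sketch of per-metric significance thresholds. It uses a plain Bonferroni correction for illustration; Confidence's actual correction method may differ, and the metric names are hypothetical.

```python
def metric_thresholds(success_metrics, guardrail_metrics, alpha=0.05):
    """Per-metric significance thresholds.

    Success metrics share the alpha budget (Bonferroni, for illustration);
    guardrail metrics keep the full alpha, since a missed regression is
    more costly than a false alarm.
    """
    thresholds = {}
    for name in success_metrics:
        thresholds[name] = alpha / len(success_metrics)  # corrected
    for name in guardrail_metrics:
        thresholds[name] = alpha  # uncorrected non-inferiority test
    return thresholds

# Hypothetical metric set: two success metrics split the 5% budget,
# guardrails are each tested at the full 5%.
print(metric_thresholds(["conversion", "revenue"], ["latency", "crash_rate"]))
```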

The traffic allocation. How much of your user base participates, and how it's split between control and treatment. A 50/50 split maximizes statistical power. Smaller treatment allocations (90/10, 95/5) reduce risk exposure but require more total traffic. The right ratio depends on how confident you are that the change won't cause harm.
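The traffic cost of an unequal split follows directly from the variance of the difference in means: with a fraction p in treatment, total traffic must grow by (1/p + 1/(1-p))/4 relative to a 50/50 split to keep the same power. A small sketch:

```python
def total_traffic_factor(treatment_share):
    """How much more total traffic a treatment_share/(1-treatment_share)
    split needs than a 50/50 split, at equal statistical power.

    Derived from Var(diff) ~ 1/n_t + 1/n_c; a 50/50 split minimizes it.
    """
    p = treatment_share
    return (1.0 / p + 1.0 / (1.0 - p)) / 4.0

# A 90/10 split needs roughly 2.8x the total traffic of a 50/50 split
for share in (0.5, 0.2, 0.1, 0.05):
    print(f"treatment share {share:.0%}: {total_traffic_factor(share):.2f}x")
```

This is why low-risk changes usually run at 50/50, while riskier ones accept the traffic penalty of a small treatment arm.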

The power calculation. Given your metric's variance, your traffic, and the minimum effect size you care about (MDE), how long will the experiment need to run? An underpowered experiment produces ambiguous null results that teach nothing: you can't tell whether the change had no effect or whether the test was too small to detect it. Confidence calculates required sample sizes using the actual metric variance from your warehouse, accounting for variance reduction from CUPED if enabled.
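A simplified closed-form version of that calculation, for a conversion-rate metric, looks like this. It ignores warehouse-measured variance and CUPED adjustments, which Confidence accounts for, so treat it as a back-of-the-envelope sketch:

```python
import math
from statistics import NormalDist

def required_n_per_arm(baseline_rate, relative_mde, alpha=0.05, power=0.8):
    """Per-arm sample size for a two-sided test on a conversion rate.

    baseline_rate: control conversion rate, e.g. 0.10
    relative_mde:  smallest relative lift worth detecting, e.g. 0.05 for +5%
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)
    delta = p2 - p1
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha=0.05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for power=0.8
    variance = p1 * (1 - p1) + p2 * (1 - p2)       # unpooled approximation
    return math.ceil((z_alpha + z_beta) ** 2 * variance / delta ** 2)

# Detecting a +5% relative lift on a 10% baseline takes roughly 58k users
# per arm; divide by your daily eligible traffic to estimate runtime.
print(required_n_per_arm(0.10, 0.05))
```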

The decision criteria. What will you do with the result? Ship if the success metric is significant and no guardrails are violated? Iterate if the result is inconclusive? This should be agreed upon before the data arrives.

Why does pre-experiment design matter more than post-experiment analysis?

The most common source of misleading experiment results isn't bad statistics. It's decisions made after seeing the data. If you choose your metric after the experiment, you'll gravitate toward whichever one looks best. If you pick the analysis window after seeing the time series, you'll unconsciously anchor on the period that shows significance. If you decide which user segments to examine after seeing the top-level result, you'll find a subgroup that confirms your preference.

This is the garden of forking paths: every analytical choice you make after seeing the data inflates your false positive rate, even if each individual choice seems reasonable.
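The inflation is easy to demonstrate. Under the null hypothesis every metric's p-value is uniform on (0, 1), so picking "whichever metric looks best" after the fact drives the false positive rate well above the nominal 5%. A minimal simulation:

```python
import random

random.seed(42)

ALPHA = 0.05
N_METRICS = 5          # the team cherry-picks the best-looking metric
N_EXPERIMENTS = 20_000

false_positives = 0
for _ in range(N_EXPERIMENTS):
    # Under the null, each metric's p-value is Uniform(0, 1)
    p_values = [random.random() for _ in range(N_METRICS)]
    if min(p_values) < ALPHA:  # declare a win if ANY metric clears 5%
        false_positives += 1

rate = false_positives / N_EXPERIMENTS
# Analytically 1 - 0.95**5 ≈ 0.226: a ~23% false positive rate,
# not the 5% the team thinks it is running at.
print(f"false positive rate: {rate:.3f}")
```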

Pre-registering the experiment design, as Confidence encourages, locks in these decisions before the data exists. The hypothesis, the metrics, the analysis plan, and the decision criteria are recorded in the experiment setup. When results arrive, the analysis runs against the pre-specified plan.

How does Confidence support experiment design?

Confidence structures experiment creation around the design decisions that matter. When you set up an experiment in Confidence, you define the hypothesis, select success and guardrail metrics, specify the traffic allocation, and run a power calculation that tells you how long the experiment will need to run.

The platform then automates the parts of the analysis that should be consistent across every experiment: sequential testing for safe interim looks, multiple testing corrections for the success metrics, SRM checks for randomization integrity, and variance reduction via CUPED to shorten runtime. These aren't optional add-ons. They're the default analysis, built from fifteen years of learning at Spotify about what goes wrong when they're missing.
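Of those checks, the SRM check is the simplest to illustrate: a chi-square test on observed versus planned group sizes. This is a generic sketch of the technique, not Confidence's implementation; the traffic counts are made up.

```python
def srm_chi_square(n_control, n_treatment, expected_control_share=0.5):
    """Chi-square statistic (df = 1) for a sample ratio mismatch check.

    Compares observed group sizes against the planned traffic split.
    """
    total = n_control + n_treatment
    expected = [total * expected_control_share,
                total * (1 - expected_control_share)]
    observed = [n_control, n_treatment]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts: a 50/50 experiment that collected 50,000 vs 50,700
# users. SRM alerts typically use a strict threshold (e.g. the df=1
# critical value 10.83 at alpha=0.001) to avoid false alarms.
stat = srm_chi_square(50_000, 50_700)
print(f"chi-square = {stat:.2f}")
```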

The design decisions that require judgment, like which metric to use as the success metric or how aggressive the implementation should be, stay with the product team. Mentimeter calls this the "Maximum Viable Change": testing the boldest credible version of the idea first, so you know whether the lever exists before you optimize the implementation.

Related terms

Core Experimentation
Hypothesis

A hypothesis is a testable prediction about the effect of a specific product change on a specific metric.

Core Experimentation
Null Hypothesis

The null hypothesis is the default assumption in a statistical test that there is no difference between the treatment and control groups.

Statistical Methods
Statistical Power

Statistical power is the probability that an experiment will detect a real effect when one exists.

Statistical Methods
Minimum Detectable Effect (MDE)

The minimum detectable effect (MDE) is the smallest treatment effect an experiment is designed to reliably detect at a given significance level and power.

Statistical Methods
Sample Size

Sample size is the number of experimental units (typically users) needed in an A/B test to detect a given effect with a specified level of confidence and power.

Metrics
Success Metric

A success metric is the primary metric an experiment is designed to move.

Metrics
Guardrail Metric

A guardrail metric is a metric monitored during an experiment to ensure the change doesn't cause unintended harm, even when the success metric improves.

Core Experimentation
A/B Testing

An A/B test is a randomized controlled experiment that splits users into two groups: one sees the current experience (control), the other sees a changed version (treatment).

Core Experimentation
Experiment Bandwidth

Experiment bandwidth is an organization's capacity to run concurrent experiments, constrained by available traffic, metric infrastructure, statistical rigor, and team coordination.