A randomized controlled trial (RCT) is an experimental design in which participants are randomly assigned to a treatment group or a control group, and outcomes are compared between the groups to measure the causal effect of the treatment. RCTs are considered the gold standard for establishing causality because randomization removes systematic differences between the groups, so that, up to chance, any observed difference in outcomes can be attributed to the treatment itself.
A/B tests in product development are digital RCTs. The logic is identical to a clinical trial in medicine: define a population, randomly assign them to conditions, measure the outcome, compare. The main differences are practical. Medical RCTs test drugs on thousands of patients over months or years. Product RCTs test feature changes on millions of users over days or weeks. Confidence runs these digital RCTs at Spotify's scale, where 300+ teams use the same randomization and analysis infrastructure to validate over 10,000 product decisions per year.
What makes randomization the critical ingredient?
Randomization is what separates an RCT from observational analysis. When you assign users randomly, the treatment and control groups are statistically equivalent, in expectation, on every characteristic: age, usage patterns, device type, geography, and every other factor you haven't thought to measure. This means that any difference in outcomes between the groups beyond what chance would produce is caused by the treatment, not by pre-existing differences.
Without randomization, you're stuck with observational data. If you launch a feature to users who opted in and compare them to users who didn't, the groups differ in ways that have nothing to do with the feature. Opt-in users are typically more engaged, more technically savvy, and more forgiving. Any improvement you observe could be a selection effect rather than a treatment effect.
This is why the experiment design in Confidence uses deterministic hashing for assignment. A hash of the user ID and an experiment-specific salt places each user into a group. The assignment is effectively random (user IDs carry no information about how a user will respond to the experiment), repeatable (the same user always gets the same group), and doesn't require storing state.
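A minimal sketch of that salted-hash bucketing, with illustrative names (this is not Confidence's actual implementation):

```python
import hashlib

def assign_variant(user_id: str, experiment_salt: str, treatment_share: float = 0.5) -> str:
    """Deterministically bucket a user into 'treatment' or 'control'."""
    # Hash the salt + user ID; the same inputs always yield the same digest.
    digest = hashlib.sha256(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    # Map the first 8 hex characters to a number in [0, 1).
    bucket = int(digest[:8], 16) / 16**8
    return "treatment" if bucket < treatment_share else "control"

# Repeatable: the same user always lands in the same group, with no stored state.
assert assign_variant("user-123", "exp-42-salt") == assign_variant("user-123", "exp-42-salt")
```

Because the salt differs per experiment, the same user can land in different groups across experiments, which keeps assignments independent from one test to the next.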
How does the RCT tradition influence product experimentation?
The methodological foundations of A/B testing come directly from the RCT tradition in statistics and medicine, dating back to Ronald Fisher's agricultural experiments in the 1920s and the first clinical trials in the 1940s. Modern product experimentation inherits several principles from this lineage.
Pre-registration of hypotheses. In clinical trials, researchers register their hypothesis and analysis plan before the trial begins to prevent post-hoc rationalization. In product experimentation, Confidence encourages teams to define their hypothesis, success metrics, and guardrail metrics before the experiment starts. The experiment design captures these decisions so the analysis plan is locked in before results arrive.
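As a rough illustration, a locked-in analysis plan can be thought of as an immutable record defined before launch; the fields below are hypothetical, not Confidence's schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExperimentPlan:
    """A pre-registered analysis plan, frozen before any results arrive."""
    hypothesis: str
    success_metrics: tuple[str, ...]
    guardrail_metrics: tuple[str, ...]
    minimum_detectable_effect: float  # smallest relative lift worth detecting

plan = ExperimentPlan(
    hypothesis="The new home-feed ranking increases weekly listening time",
    success_metrics=("minutes_played_per_user",),
    guardrail_metrics=("crash_rate", "skip_rate"),
    minimum_detectable_effect=0.01,
)
```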
Control of error rates. Clinical trials carefully manage false positive rates (concluding a drug works when it doesn't) and false negative rates (missing a drug that works). Product experiments face the same trade-offs. Confidence's sequential testing, multiple testing corrections, and power calculations all descend from methods developed for clinical RCTs.
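To make the false positive / false negative trade-off concrete, here is a textbook fixed-horizon power calculation for a two-sample z-test (a simplification; Confidence's sequential tests use different machinery):

```python
from statistics import NormalDist

def sample_size_per_group(mde: float, sigma: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users per group for a two-sided, two-sample z-test.

    mde:   minimum detectable absolute difference in means
    sigma: standard deviation of the metric
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # controls the false positive rate
    z_beta = NormalDist().inv_cdf(power)           # controls the false negative rate
    return round(2 * (z_alpha + z_beta) ** 2 * sigma**2 / mde**2)

# Detecting a 0.5-unit lift in a metric with standard deviation 10:
print(sample_size_per_group(mde=0.5, sigma=10))  # roughly 6,300 users per group
```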
Intention-to-treat analysis. Medical RCTs analyze patients based on their assigned group, not on whether they actually took the medication. Product experiments face the same issue: a user assigned to the treatment group who never encountered the changed feature dilutes the measured effect. Confidence supports both intention-to-treat analysis and trigger analysis (restricting to exposed users), with clear documentation of what each approach estimates.
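A toy comparison of the two estimands, using made-up data and illustrative column names:

```python
import pandas as pd

# Toy assignment, exposure, and outcome data.
df = pd.DataFrame({
    "group":   ["treatment", "treatment", "treatment", "control", "control", "control"],
    "exposed": [True, False, True, True, False, True],   # did the user reach the changed surface?
    "metric":  [5.0, 2.0, 4.0, 3.0, 2.0, 2.5],
})

# Intention-to-treat: compare users by assigned group, exposed or not.
itt = df.groupby("group")["metric"].mean()

# Trigger analysis: restrict to users who actually encountered the change
# (or, in control, would have), estimating the effect on exposed users.
trig = df[df["exposed"]].groupby("group")["metric"].mean()

print(itt["treatment"] - itt["control"])    # diluted by never-exposed users
print(trig["treatment"] - trig["control"])  # larger effect among exposed users
```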
Where do product RCTs differ from medical RCTs?
Product experiments have advantages that medical trials don't. Sample sizes are often millions instead of thousands. Experiments can run for days instead of years. Iteration is fast: if the first test is inconclusive, you can redesign and rerun within a week.
But product RCTs also face unique challenges. User behavior is noisier than clinical endpoints. Metrics like session duration or click-through rate have high variance compared to "patient survived / didn't survive." That noise requires techniques like CUPED variance reduction, which uses pre-experiment data to tighten confidence intervals, and trigger analysis, which restricts the analysis to users who actually encountered the change.
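A minimal sketch of the standard CUPED adjustment, assuming a single pre-experiment covariate such as pre-period listening time:

```python
import numpy as np

def cuped_adjust(y: np.ndarray, x_pre: np.ndarray) -> np.ndarray:
    """Apply the standard CUPED adjustment with one pre-experiment covariate.

    y_adj = y - theta * (x_pre - mean(x_pre)), with theta = cov(x_pre, y) / var(x_pre).
    The adjusted metric keeps the same mean as y but has lower variance whenever
    x_pre is correlated with y, which tightens confidence intervals.
    """
    theta = np.cov(x_pre, y, ddof=1)[0, 1] / np.var(x_pre, ddof=1)
    return y - theta * (x_pre - x_pre.mean())

# Example: pre-experiment listening hours are strongly correlated with
# in-experiment listening hours, so the adjustment removes most of the noise.
rng = np.random.default_rng(0)
x_pre = rng.normal(10.0, 3.0, size=10_000)
y = x_pre + rng.normal(0.0, 2.0, size=10_000)
print(np.var(y), np.var(cuped_adjust(y, x_pre)))  # variance drops from ~13 to ~4
```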
At Spotify, 42% of experiments are rolled back after guardrail metrics detect regressions. That high rollback rate is only possible because the RCT design gives teams enough confidence in the causal claim to act on it. Without randomization, you'd never be sure whether the regression was caused by the change or by something else happening at the same time.