Definitions of A/B testing, experimentation, and statistical terms. From the team that has run experimentation at Spotify for over a decade.
The fundamental ideas behind controlled experiments and A/B testing.
An A/A test is a randomized experiment where both groups receive the identical experience.
An A/B test is a randomized controlled experiment that splits users into two groups: one sees the current experience (control), the other sees a changed version (treatment).
An A/B/n test is a randomized experiment that compares more than two variants simultaneously: one control and two or more treatments.
The average treatment effect (ATE) is the mean difference in outcomes between the treatment group and the control group, averaged across the entire experimental population.
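As a sketch in standard potential-outcomes notation, the ATE and its usual difference-in-means estimate in an A/B test are

\[ \mathrm{ATE} = \mathbb{E}\bigl[\,Y(1) - Y(0)\,\bigr] \;\approx\; \bar{Y}_{\mathrm{treatment}} - \bar{Y}_{\mathrm{control}} \]

where Y(1) and Y(0) denote a user's outcome with and without the change.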
The control group is the set of users in an experiment who see the unchanged, current experience.
Experiment bandwidth is an organization's capacity to run concurrent experiments, constrained by available traffic, metric infrastructure, statistical rigor, and team coordination.
Experiment design is the plan for how an experiment will be structured, sized, and analyzed before it begins.
A holdout group is a segment of users permanently excluded from receiving a feature or set of features, maintained over time to measure cumulative long-term impact.
A hypothesis is a testable prediction about the effect of a specific product change on a specific metric.
Interleaving is an experiment technique where results from treatment and control are mixed within a single user session, rather than assigning each user entirely to one group.
A multivariate test (MVT) is a randomized experiment that changes multiple variables simultaneously and measures both the individual effect of each variable and the interaction effects between them.
Mutually exclusive experiments are experiments designed so that no user participates in more than one at the same time.
The null hypothesis is the default assumption in a statistical test that there is no difference between the treatment and control groups.
An online controlled experiment is the formal term for an A/B test conducted in a live digital product.
A product builder is anyone who builds and ships product: engineers, product managers, designers, data scientists, and the increasingly blended roles between them.
Product experimentation is the practice of using controlled experiments to validate product changes before full rollout.
A product platform is the shared infrastructure and tooling that enables product teams to build, test, and ship features systematically.
A randomized controlled trial (RCT) is an experimental design where participants are randomly assigned to a treatment group or a control group, and outcomes are compared between the groups to measure the causal effect of the treatment.
Scientific product development is a product development approach that treats every product change as a hypothesis to be tested with experimental evidence before it ships.
A target audience is the subset of users eligible for an experiment, defined by targeting rules that filter on user attributes like country, platform, subscription tier, account age, or behavioral signals.
A treatment effect is the measured difference in a metric between the treatment group (users who see a change) and the control group (users who see the current experience).
The treatment group is the set of users in an experiment who see the changed experience.
Read moreThe math that makes experiments trustworthy.
Bayesian A/B testing is an approach to experiment analysis that starts with a prior belief about the treatment effect and updates that belief using observed data, producing a posterior distribution over the likely size of the effect.
A binary metric is a metric that takes one of two values for each user in an experiment: 1 (the event happened) or 0 (it didn't).
Causal inference is the set of statistical methods used to determine whether a change actually caused an observed effect, rather than merely being correlated with it.
A confidence interval is a range of values that, at a given confidence level, is expected to contain the true treatment effect.
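As a sketch, the standard large-sample two-sided interval for a difference in means (assuming approximately normal sampling distributions) is

\[ (\bar{Y}_t - \bar{Y}_c) \;\pm\; z_{1-\alpha/2}\,\sqrt{\frac{s_t^2}{n_t} + \frac{s_c^2}{n_c}} \]

where the subscripts t and c denote the treatment and control groups.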
CUPED (Controlled-experiment Using Pre-Existing Data) is a variance reduction method that uses data from before an experiment started to remove predictable noise from metric estimates, producing tighter confidence intervals from the same traffic.
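A minimal CUPED sketch in Python, assuming y holds the in-experiment metric per user and x holds the same metric for the same users from before the experiment (array names are illustrative):

```python
import numpy as np

def cuped_adjust(y: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Return CUPED-adjusted values y - theta * (x - mean(x))."""
    theta = np.cov(y, x)[0, 1] / np.var(x, ddof=1)  # regression slope of y on x
    return y - theta * (x - x.mean())
```

The analysis then compares the adjusted means between treatment and control exactly as in a plain difference-in-means test.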
The difference-in-means estimator is the simplest and most common estimator of the treatment effect in an A/B test.
Effect size is the magnitude of the difference in a metric between treatment and control groups.
The false negative rate, also called the Type II error rate or beta, is the probability of failing to detect a real treatment effect.
The false positive rate, also called the Type I error rate, is the probability of concluding that a treatment had an effect when it actually didn't.
Frequentist A/B testing is the classical approach to experiment analysis that evaluates results using p-values and confidence intervals, asking: how likely is the observed data (or something more extreme) if the null hypothesis were true?
Metric capping (also called winsorization) is a variance reduction technique that clips extreme metric values at a chosen threshold, reducing the outsized influence of outliers on experiment results.
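A minimal capping sketch, assuming a per-user metric and an illustrative cap at the 99th percentile:

```python
import numpy as np

def cap_metric(values: np.ndarray, upper_percentile: float = 99.0) -> np.ndarray:
    """Clip per-user metric values at the given percentile to limit outlier influence."""
    cap = np.percentile(values, upper_percentile)
    return np.minimum(values, cap)
```

The threshold is often chosen from historical data rather than from the experiment itself.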
The minimum detectable effect (MDE) is the smallest treatment effect an experiment is designed to reliably detect at a given significance level and power.
A p-value is the probability of observing a result at least as extreme as the one measured, assuming the null hypothesis is true (that is, assuming the change had no real effect).
Sample size is the number of experimental units (typically users) needed in an A/B test to detect a given effect with a specified level of confidence and power.
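A standard approximation for the per-group sample size in a two-sample comparison of means (a sketch, not the only valid formula) is

\[ n \;\approx\; \frac{2\,(z_{1-\alpha/2} + z_{1-\beta})^2\,\sigma^2}{\delta^2} \]

where σ is the metric's standard deviation, δ is the minimum detectable effect, and the z terms come from the chosen significance level and power. For example, α = 0.05 and 80% power give z-values of roughly 1.96 and 0.84.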
The sampling distribution is the probability distribution of a statistic (like a sample mean or a difference in means) computed across all possible random samples of a given size from a population.
The signal-to-noise ratio (SNR) in A/B testing is the ratio of the treatment effect (the signal) to the variability of the metric being measured (the noise).
The significance level, commonly called alpha, is the maximum false positive rate you're willing to accept in an experiment.
Statistical power is the probability that an experiment will detect a real effect when one exists.
Statistical significance is the determination that an observed difference between experiment groups is unlikely to have occurred by chance alone.
Variance reduction is a set of statistical techniques that tighten the confidence intervals of an A/B test without requiring more traffic.
Read moreMethods for monitoring experiments before the planned end date.
Alpha spending is the method of distributing a fixed significance budget (alpha, typically 5%) across multiple interim analyses in a group sequential test.
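As an illustration, one widely cited spending function (the Pocock-type function in the Lan–DeMets family) allocates alpha as a function of the information fraction t:

\[ \alpha(t) = \alpha \,\ln\bigl(1 + (e - 1)\,t\bigr), \qquad 0 \le t \le 1 \]

so no alpha is spent before data arrives and the full budget α has been spent once t reaches 1.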
Always-valid inference (AVI) is a class of sequential testing methods that construct confidence intervals that remain valid at any stopping time, without requiring the experimenter to pre-plan when or how often to look at the data.
A fixed-power design is a sequential experiment plan where the stopping rule is based on achieving a pre-specified level of statistical power rather than on observing a statistically significant result.
A group sequential test (GST) is a sequential testing method that pre-plans a fixed number of interim analyses at specific points during an experiment, using an alpha spending function to distribute the false positive budget across those analyses.
The information fraction is the proportion of the total planned statistical information that has been observed so far in a sequential experiment.
Optional stopping is the practice of ending an experiment based on observed results rather than a pre-determined stopping rule.
The peeking problem is the inflation of false positive rates that occurs when experimenters check statistical results before the planned sample size has been reached and stop the experiment early because the interim results look significant.
Sequential testing is a statistical framework that allows experimenters to make valid decisions at multiple analysis points during an experiment, rather than waiting for a single final evaluation.
Read moreControlling error rates when evaluating many metrics or comparisons.
The Benjamini-Hochberg (BH) correction is a multiple testing procedure that controls the false discovery rate (FDR): the expected proportion of false positives among all results declared significant.
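A minimal sketch of the BH procedure in Python, where the input is simply the list of p-values from one correction family (variable names are illustrative):

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean reject decision for each p-value under BH FDR control."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    # Largest rank k whose sorted p-value is at most (k / m) * alpha.
    below = p[order] <= (np.arange(1, m + 1) / m) * alpha
    k = int(np.nonzero(below)[0].max()) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject
```

With p-values [0.001, 0.01, 0.03, 0.2] and alpha 0.05, the first three are declared significant, whereas a Bonferroni threshold of 0.0125 would admit only the first two.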
The Bonferroni correction adjusts significance thresholds for multiple testing by dividing the target alpha by the number of tests.
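In formula form, each individual test is evaluated at

\[ \alpha_{\text{per test}} = \frac{\alpha}{m} \]

so with a 5% budget and m = 10 metrics, each p-value must fall below 0.005 to be declared significant.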
A correction family is the set of hypothesis tests grouped together for a multiple testing adjustment.
False discovery rate (FDR) is the expected proportion of false positives among all results declared statistically significant.
Family-wise error rate (FWER) is the probability of making at least one false positive across a set of hypothesis tests.
The Holm correction (also called Holm-Bonferroni) is a step-down multiple testing procedure that controls the family-wise error rate (FWER) while being uniformly more powerful than the Bonferroni correction.
The Hommel correction is a multiple testing procedure that controls the family-wise error rate (FWER) while being more powerful than both the Bonferroni correction and the Holm correction.
A multiple testing correction is an adjustment to significance thresholds that accounts for evaluating more than one hypothesis in the same experiment.
Read moreHow to define and measure what matters.
Conversion rate is the fraction of users who complete a desired action out of those who had the opportunity to complete it.
Daily active users (DAU) is the count of unique users who engage with a product within a single calendar day.
A guardrail metric is a metric monitored during an experiment to ensure the change doesn't cause unintended harm, even when the success metric improves.
An inferiority test checks whether a treatment is worse than control by more than a specified margin on a guardrail metric.
Longitudinal guardrails track guardrail metrics across many experiments over time to detect slow, cumulative harm that no individual experiment would flag.
Metric drift is a gradual, non-experiment-related change in a metric's baseline value over time.
Metric interaction effects occur when the effect of one experiment on a metric depends on whether another experiment is also running.
Metric sensitivity is how responsive a metric is to real product changes.
Monthly active users (MAU) is the count of unique users who engage with a product within a 30-day (or calendar month) window.
A non-inferiority margin (NIM) is the maximum amount of deterioration in a guardrail metric that a team is willing to accept in exchange for a gain on their success metric.
A non-inferiority test confirms that a treatment is not meaningfully worse than control on a guardrail metric.
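A common decision rule (a sketch, assuming a higher-is-better guardrail metric and a pre-registered non-inferiority margin NIM) declares non-inferiority when the lower bound of the confidence interval on the treatment effect clears the margin:

\[ \mathrm{CI}_{\text{lower}}\bigl(\bar{Y}_t - \bar{Y}_c\bigr) > -\,\mathrm{NIM} \]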
A north star metric is the single metric that best captures the value a product delivers to its users.
A primary metric is the main metric used to decide whether to ship an experiment's treatment to all users.
A secondary metric is a supporting metric that provides additional context about an experiment's results but doesn't drive the ship decision.
Ship rate is the proportion of experiments whose results lead to shipping the treatment to all users.
Weekly active users (WAU) is the count of unique users who engage with a product within a rolling or fixed 7-day window.
Read moreThe delivery mechanism that connects experiments to code.
Auto-rollback is an automatic rollback triggered when a guardrail metric violates a predefined threshold during a rollout.
Bucket hashing is the mechanism that maps a user into a numbered bucket, which then determines their variant in a feature flag or experiment.
A canary release exposes a change to a tiny fraction of traffic first, typically 1% or less, to detect severe problems before wider rollout.
Deterministic assignment is a method of assigning users to experiment variants by hashing a stable identifier (typically the user ID) combined with a salt, so that the same user always maps to the same variant.
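A minimal sketch of deterministic bucket hashing in Python, assuming a hypothetical 100-bucket space and a 50/50 split (the salt, bucket count, and variant names are illustrative):

```python
import hashlib

def assign_bucket(user_id: str, salt: str, num_buckets: int = 100) -> int:
    """Hash a stable identifier plus a salt into a bucket number."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

def assign_variant(user_id: str, salt: str) -> str:
    """Buckets 0-49 map to control, 50-99 to treatment in this 50/50 example."""
    return "control" if assign_bucket(user_id, salt) < 50 else "treatment"
```

Because the hash is a pure function of the identifier and the salt, the same user lands in the same variant on every evaluation, which is also what makes assignments sticky across sessions and devices.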
A dynamic config is a feature flag that carries structured data (JSON objects, typed fields) instead of a simple boolean.
Edge resolvers evaluate feature flags at the CDN or edge compute layer, before a request reaches the origin server.
An experimentation platform is the end-to-end system that powers controlled experiments at scale: feature flags for assignment, metric pipelines for measurement, a statistical engine for analysis, and reporting for decision making.
A feature flag is a runtime switch that controls whether a feature is active for a given user, without deploying new code.
A feature toggle is another name for a feature flag: a runtime switch that controls whether a feature is active for a given user, without deploying new code.
A flag rule is a conditional expression that determines which users see which variant of a feature flag.
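As a sketch of what a flag rule evaluates (the attribute names, values, and variant names here are hypothetical, not any particular platform's schema):

```python
def evaluate_flag(user: dict) -> str:
    """Return the variant for one hypothetical rule: premium iOS users in Sweden."""
    matches = (
        user.get("country") == "SE"
        and user.get("platform") == "ios"
        and user.get("tier") == "premium"
    )
    return "new_home" if matches else "default"
```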
A holdback is a subset of users intentionally kept on the old experience after a feature has shipped to everyone else.
A kill switch is a feature flag designed specifically for instant emergency deactivation of a feature.
Local evaluation resolves feature flags on the client or server without making a network call to a central service at evaluation time.
OpenFeature is a CNCF (Cloud Native Computing Foundation) open standard that defines a vendor-agnostic API for feature flag management.
Overrides manually force a specific user or group into a particular feature flag variant, bypassing the normal assignment logic.
A phased rollout is a progressive rollout organized into discrete stages, each with a predefined percentage and explicit criteria that must be met before advancing to the next stage.
A progressive rollout is the practice of gradually increasing the percentage of users exposed to a feature over time, rather than releasing it to everyone at once.
A rollback reverts a feature to its previous state when problems are detected during a release.
A rollout is the process of releasing a feature to users in controlled stages using feature flags.
Sticky assignments ensure that a user who has been assigned to a variant in a feature flag or experiment continues to see that same variant across sessions, devices, and app restarts.
Targeting conditions are the criteria a feature flag evaluates to decide which variant a user receives.
Warehouse-native experimentation is an architecture where experiment data (assignments, exposures, metric events, and analysis results) lives in the customer's own data warehouse rather than being copied into a vendor's separate system.
Read moreTechniques for analyzing experiment results and avoiding pitfalls.
A confounding variable is a factor that influences both the treatment assignment and the outcome being measured, creating a spurious association that can be mistaken for a causal effect.
Counterfactual logging is the practice of recording what a system would have shown a user under an alternative policy or variant, alongside what was actually shown.
Dilution is the weakening of an observed treatment effect that occurs when users who were never exposed to the changed feature are included in the experiment analysis.
Exposure filters are criteria applied during experiment analysis to include or exclude users based on their exposure to the treatment.
Exposure logging is the practice of recording exactly when and whether each user was actually exposed to a specific experiment variant.
The garden of forking paths refers to the many implicit analytical choices a researcher or analyst makes during an experiment's lifecycle, each of which could have gone differently and each of which can shift the reported result.
A multi-armed bandit is an adaptive experiment design that shifts traffic allocation toward better-performing variants during the experiment, rather than splitting traffic evenly for the entire duration.
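A minimal Thompson-sampling sketch for binary-outcome arms (one of several bandit strategies; the counts below are illustrative):

```python
import random

def thompson_pick(successes: list[int], failures: list[int]) -> int:
    """Pick the arm whose Beta posterior draws the highest sample."""
    draws = [random.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return draws.index(max(draws))

# Example: with 40/200 conversions on arm 0 and 55/200 on arm 1,
# thompson_pick([40, 55], [160, 145]) selects arm 1 most of the time.
```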
Observational bias is systematic error introduced when the data collection or analysis process produces results that consistently differ from the truth.
A sample ratio mismatch (SRM) occurs when the observed number of users in each experiment group differs from the intended allocation ratio by more than chance alone would explain.
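A minimal SRM check using a chi-square goodness-of-fit test (the strict alpha of 0.001 is a common convention, not a universal rule):

```python
from scipy.stats import chisquare

def srm_detected(observed_counts, expected_ratios, alpha=0.001):
    """Flag a sample ratio mismatch when observed counts deviate from the planned split."""
    total = sum(observed_counts)
    expected = [r * total for r in expected_ratios]
    _, p_value = chisquare(f_obs=observed_counts, f_exp=expected)
    return p_value < alpha
```

For instance, srm_detected([50_420, 49_580], [0.5, 0.5]) is not flagged, while a 51/49 split across a million users would be.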
Segment analysis breaks down experiment results by user subgroups to detect heterogeneous treatment effects: cases where the change helps some users, hurts others, or has no effect on a particular segment.
Simpson's paradox is a statistical phenomenon where a trend that appears in several subgroups reverses or disappears when the subgroups are combined.
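A worked example using the classic kidney-stone dataset (textbook numbers, included only to illustrate the reversal):

```python
# (successes, total) per treatment within each subgroup
small_stones = {"A": (81, 87), "B": (234, 270)}
large_stones = {"A": (192, 263), "B": (55, 80)}

for name in ("A", "B"):
    s1, n1 = small_stones[name]
    s2, n2 = large_stones[name]
    print(name, round(s1 / n1, 2), round(s2 / n2, 2), round((s1 + s2) / (n1 + n2), 2))
# A: 0.93 within small, 0.73 within large, 0.78 pooled
# B: 0.87 within small, 0.69 within large, 0.83 pooled -> A wins each subgroup, B wins pooled
```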
Trigger analysis is an experiment analysis technique that restricts the evaluation to users who actually encountered the changed feature, rather than analyzing every user assigned to the experiment.
An unbiased estimator is a statistical estimator whose expected value equals the true parameter it's estimating.
Read moreOrganizational patterns that make experimentation work at scale.
Build-measure-learn is the iterative product development loop introduced by Eric Ries in The Lean Startup.
Continuous discovery is an ongoing cycle of learning from users and experiments, where research, hypothesis formation, and validation happen continuously alongside product development rather than in a one-off upfront research phase.
Cumulative holdback evaluation is a method for measuring the aggregate impact of many shipped features by maintaining a long-running holdout group that doesn't receive any of them.
Dogfooding is the practice of using your own product internally before releasing it to customers.
Experiment coordination is the practice of managing interactions, priorities, and resource allocation across concurrent experiments.
Experimentation culture is the organizational norm of testing product ideas with data before committing to them.
An experimentation maturity model is a framework for assessing how advanced an organization's experimentation practice is, from ad-hoc tests run by a few teams to a fully integrated system where experiments inform every product decision.
Experimentation theatre is the practice of running experiments without the rigor or organizational commitment to act on results.
Hypothesis-driven development is a product development approach where each change is framed as a testable hypothesis before it's built.
Multi-metric decision making is the practice of evaluating experiment results across multiple metrics simultaneously rather than basing ship decisions on a single success metric.
The product loop is the recurring cycle of insight, hypothesis, experiment, decision, and insight that drives how a product improves over time.
Read moreImproving conversion rates through systematic testing.
Client-side testing is the practice of running experiments in the browser using injected JavaScript that modifies the page after it loads.
Conversion rate optimization (CRO) is the practice of systematically improving the percentage of users who complete a desired action: signing up, purchasing, upgrading, completing onboarding, or any other action the business cares about.
Server-side testing is the practice of running experiments in backend code rather than in the browser.
Website optimization is the practice of improving a website's performance, user experience, and conversion outcomes through testing and iteration.