Definitions of A/B testing, experimentation, and statistical terms. From the team that has run experimentation at Spotify for over a decade.
The fundamental ideas behind controlled experiments and A/B testing.
An A/A test is a randomized experiment where both groups receive the identical experience.
An A/B test is a randomized controlled experiment that splits users into two groups: one sees the current experience (control), the other sees a changed version (treatment).
An A/B/n test is a randomized experiment that compares more than two variants simultaneously: one control and two or more treatments.
The average treatment effect (ATE) is the mean difference in outcomes between the treatment group and the control group, averaged across the entire experimental population.
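As a sketch in standard potential-outcomes notation, the ATE and its usual difference-in-means estimate in an A/B test are

\[ \mathrm{ATE} = \mathbb{E}\bigl[\,Y(1) - Y(0)\,\bigr] \;\approx\; \bar{Y}_{\mathrm{treatment}} - \bar{Y}_{\mathrm{control}} \]

where Y(1) and Y(0) denote a user's outcome with and without the change.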
The control group is the set of users in an experiment who see the unchanged, current experience.
Experiment bandwidth is an organization's capacity to run concurrent experiments, constrained by available traffic, metric infrastructure, statistical rigor, and team coordination.
Experiment design is the plan for how an experiment will be structured, sized, and analyzed before it begins.
A holdout group is a segment of users permanently excluded from receiving a feature or set of features, maintained over time to measure cumulative long-term impact.
A hypothesis is a testable prediction about the effect of a specific product change on a specific metric.
Interleaving is an experiment technique where results from treatment and control are mixed within a single user session, rather than assigning each user entirely to one group.
A multivariate test (MVT) is a randomized experiment that changes multiple variables simultaneously and measures both the individual effect of each variable and the interaction effects between them.
Mutually exclusive experiments are experiments designed so that no user participates in more than one at the same time.
The null hypothesis is the default assumption in a statistical test that there is no difference between the treatment and control groups.
An online controlled experiment is the formal term for an A/B test conducted in a live digital product.
A product builder is anyone who builds and ships product: engineers, product managers, designers, data scientists, and the increasingly blended roles between them.
Product experimentation is the practice of using controlled experiments to validate product changes before full rollout.
A product platform is the shared infrastructure and tooling that enables product teams to build, test, and ship features systematically.
A randomized controlled trial (RCT) is an experimental design where participants are randomly assigned to a treatment group or a control group, and outcomes are compared between the groups to measure the causal effect of the treatment.
Scientific product development is a product development approach that treats every product change as a hypothesis to be tested with experimental evidence before it ships.
A target audience is the subset of users eligible for an experiment, defined by targeting rules that filter on user attributes like country, platform, subscription tier, account age, or behavioral signals.
A treatment effect is the measured difference in a metric between the treatment group (users who see a change) and the control group (users who see the current experience).
The treatment group is the set of users in an experiment who see the changed experience.
Read moreThe math that makes experiments trustworthy.
Bayesian A/B testing is an approach to experiment analysis that starts with a prior belief about the treatment effect and updates that belief using observed data, producing a posterior distribution over the likely size of the effect.
A binary metric is a metric that takes one of two values for each user in an experiment: 1 (the event happened) or 0 (it didn't).
Causal inference is the set of statistical methods used to determine whether a change actually caused an observed effect, rather than merely being correlated with it.
A confidence interval is a range of values that, at a given confidence level, is expected to contain the true treatment effect.
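As a sketch, the standard large-sample two-sided interval for a difference in means (assuming approximately normal sampling distributions) is

\[ (\bar{Y}_t - \bar{Y}_c) \;\pm\; z_{1-\alpha/2}\,\sqrt{\frac{s_t^2}{n_t} + \frac{s_c^2}{n_c}} \]

where the subscripts t and c denote the treatment and control groups.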
CUPED (Controlled-experiment Using Pre-Existing Data) is a variance reduction method that uses data from before an experiment started to remove predictable noise from metric estimates, producing tighter confidence intervals from the same traffic.
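A minimal CUPED sketch in Python, assuming y holds the in-experiment metric per user and x holds the same metric for the same users from before the experiment (array names are illustrative):

```python
import numpy as np

def cuped_adjust(y: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Return CUPED-adjusted values y - theta * (x - mean(x))."""
    theta = np.cov(y, x)[0, 1] / np.var(x, ddof=1)  # regression slope of y on x
    return y - theta * (x - x.mean())
```

The analysis then compares the adjusted means between treatment and control exactly as in a plain difference-in-means test.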
The difference-in-means estimator is the simplest and most common estimator of the treatment effect in an A/B test.
Effect size is the magnitude of the difference in a metric between treatment and control groups.
The false negative rate, also called the Type II error rate or beta, is the probability of failing to detect a real treatment effect.
The false positive rate, also called the Type I error rate, is the probability of concluding that a treatment had an effect when it actually didn't.
Frequentist A/B testing is the classical approach to experiment analysis that evaluates results using p-values and confidence intervals, asking: how likely is the observed data (or something more extreme) if the null hypothesis were true?
Metric capping (also called winsorization) is a variance reduction technique that clips extreme metric values at a chosen threshold, reducing the outsized influence of outliers on experiment results.
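A minimal capping sketch, assuming a per-user metric and an illustrative cap at the 99th percentile:

```python
import numpy as np

def cap_metric(values: np.ndarray, upper_percentile: float = 99.0) -> np.ndarray:
    """Clip per-user metric values at the given percentile to limit outlier influence."""
    cap = np.percentile(values, upper_percentile)
    return np.minimum(values, cap)
```

The threshold is often chosen from historical data rather than from the experiment itself.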
The minimum detectable effect (MDE) is the smallest treatment effect an experiment is designed to reliably detect at a given significance level and power.
A p-value is the probability of observing a result at least as extreme as the one measured, assuming the null hypothesis is true (that is, assuming the change had no real effect).
Sample size is the number of experimental units (typically users) needed in an A/B test to detect a given effect with a specified level of confidence and power.
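A standard approximation for the per-group sample size in a two-sample comparison of means (a sketch, not the only valid formula) is

\[ n \;\approx\; \frac{2\,(z_{1-\alpha/2} + z_{1-\beta})^2\,\sigma^2}{\delta^2} \]

where σ is the metric's standard deviation, δ is the minimum detectable effect, and the z terms come from the chosen significance level and power. For example, α = 0.05 and 80% power give z-values of roughly 1.96 and 0.84.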
The sampling distribution is the probability distribution of a statistic (like a sample mean or a difference in means) computed across all possible random samples of a given size from a population.
The signal-to-noise ratio (SNR) in A/B testing is the ratio of the treatment effect (the signal) to the variability of the metric being measured (the noise).
The significance level, commonly called alpha, is the maximum false positive rate you're willing to accept in an experiment.
Statistical power is the probability that an experiment will detect a real effect when one exists.
Statistical significance is the determination that an observed difference between experiment groups is unlikely to have occurred by chance alone.
Variance reduction is a set of statistical techniques that tighten the confidence intervals of an A/B test without requiring more traffic.
Read moreMethods for monitoring experiments before the planned end date.
Alpha spending is the method of distributing a fixed significance budget (alpha, typically 5%) across multiple interim analyses in a group sequential test.
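As an illustration, one widely cited spending function (the Pocock-type function in the Lan–DeMets family) allocates alpha as a function of the information fraction t:

\[ \alpha(t) = \alpha \,\ln\bigl(1 + (e - 1)\,t\bigr), \qquad 0 \le t \le 1 \]

so no alpha is spent before data arrives and the full budget α has been spent once t reaches 1.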
Always-valid inference (AVI) is a class of sequential testing methods that construct confidence intervals that remain valid at any stopping time, without requiring the experimenter to pre-plan when or how often to look at the data.
A fixed-power design is a sequential experiment plan where the stopping rule is based on achieving a pre-specified level of statistical power rather than on observing a statistically significant result.
A group sequential test (GST) is a sequential testing method that pre-plans a fixed number of interim analyses at specific points during an experiment, using an alpha spending function to distribute the false positive budget across those analyses.
The information fraction is the proportion of the total planned statistical information that has been observed so far in a sequential experiment.
Optional stopping is the practice of ending an experiment based on observed results rather than a pre-determined stopping rule.
The peeking problem is the inflation of false positive rates that occurs when experimenters check statistical results before the planned sample size has been reached and stop the experiment early because the interim results look significant.
Sequential testing is a statistical framework that allows experimenters to make valid decisions at multiple analysis points during an experiment, rather than waiting for a single final evaluation.
Read moreControlling error rates when evaluating many metrics or comparisons.
The Benjamini-Hochberg (BH) correction is a multiple testing procedure that controls the false discovery rate (FDR): the expected proportion of false positives among all results declared significant.
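A minimal sketch of the BH procedure in Python, where the input is simply the list of p-values from one correction family (variable names are illustrative):

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean reject decision for each p-value under BH FDR control."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    # Largest rank k whose sorted p-value is at most (k / m) * alpha.
    below = p[order] <= (np.arange(1, m + 1) / m) * alpha
    k = int(np.nonzero(below)[0].max()) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject
```

With p-values [0.001, 0.01, 0.03, 0.2] and alpha 0.05, the first three are declared significant, whereas a Bonferroni threshold of 0.0125 would admit only the first two.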
The Bonferroni correction adjusts significance thresholds for multiple testing by dividing the target alpha by the number of tests.
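In formula form, each individual test is evaluated at

\[ \alpha_{\text{per test}} = \frac{\alpha}{m} \]

so with a 5% budget and m = 10 metrics, each p-value must fall below 0.005 to be declared significant.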
A correction family is the set of hypothesis tests grouped together for a multiple testing adjustment.
False discovery rate (FDR) is the expected proportion of false positives among all results declared statistically significant.
Family-wise error rate (FWER) is the probability of making at least one false positive across a set of hypothesis tests.
The Holm correction (also called Holm-Bonferroni) is a step-down multiple testing procedure that controls the family-wise error rate (FWER) while being uniformly more powerful than the Bonferroni correction.
The Hommel correction is a multiple testing procedure that controls the family-wise error rate (FWER) while being more powerful than both the Bonferroni correction and the Holm correction.
A multiple testing correction is an adjustment to significance thresholds that accounts for evaluating more than one hypothesis in the same experiment.
Read moreHow to define and measure what matters.
Conversion rate is the fraction of users who complete a desired action out of those who had the opportunity to complete it.
Daily active users (DAU) is the count of unique users who engage with a product within a single calendar day.
A guardrail metric is a metric monitored during an experiment to ensure the change doesn't cause unintended harm, even when the success metric improves.
An inferiority test checks whether a treatment is worse than control by more than a specified margin on a guardrail metric.
Longitudinal guardrails track guardrail metrics across many experiments over time to detect slow, cumulative harm that no individual experiment would flag.
Metric drift is a gradual, non-experiment-related change in a metric's baseline value over time.
Metric interaction effects occur when the effect of one experiment on a metric depends on whether another experiment is also running.
Metric sensitivity is how responsive a metric is to real product changes.
Monthly active users (MAU) is the count of unique users who engage with a product within a 30-day (or calendar month) window.
A non-inferiority margin (NIM) is the maximum amount of deterioration in a guardrail metric that a team is willing to accept in exchange for a gain on their success metric.
A non-inferiority test confirms that a treatment is not meaningfully worse than control on a guardrail metric.
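A common decision rule (a sketch, assuming a higher-is-better guardrail metric and a pre-registered non-inferiority margin NIM) declares non-inferiority when the lower bound of the confidence interval on the treatment effect clears the margin:

\[ \mathrm{CI}_{\text{lower}}\bigl(\bar{Y}_t - \bar{Y}_c\bigr) > -\,\mathrm{NIM} \]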
A north star metric is the single metric that best captures the value a product delivers to its users.
A primary metric is the main metric used to decide whether to ship an experiment's treatment to all users.
A secondary metric is a supporting metric that provides additional context about an experiment's results but doesn't drive the ship decision.
Ship rate is the proportion of experiments whose results lead to shipping the treatment to all users.
Weekly active users (WAU) is the count of unique users who engage with a product within a rolling or fixed 7-day window.
Read moreThe delivery mechanism that connects experiments to code.
Auto-rollback is an automatic rollback triggered when a guardrail metric violates a predefined threshold during a rollout.
Bucket hashing is the mechanism that maps a user into a numbered bucket, which then determines their variant in a feature flag or experiment.
A canary release exposes a change to a tiny fraction of traffic first, typically 1% or less, to detect severe problems before wider rollout.
Deterministic assignment is a method of assigning users to experiment variants by hashing a stable identifier (typically the user ID) combined with a salt, so that the same user always maps to the same variant.
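A minimal sketch of deterministic bucket hashing in Python, assuming a hypothetical 100-bucket space and a 50/50 split (the salt, bucket count, and variant names are illustrative):

```python
import hashlib

def assign_bucket(user_id: str, salt: str, num_buckets: int = 100) -> int:
    """Hash a stable identifier plus a salt into a bucket number."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

def assign_variant(user_id: str, salt: str) -> str:
    """Buckets 0-49 map to control, 50-99 to treatment in this 50/50 example."""
    return "control" if assign_bucket(user_id, salt) < 50 else "treatment"
```

Because the hash is a pure function of the identifier and the salt, the same user lands in the same variant on every evaluation, which is also what makes assignments sticky across sessions and devices.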
A dynamic config is a feature flag that carries structured data (JSON objects, typed fields) instead of a simple boolean.
Edge resolvers evaluate feature flags at the CDN or edge compute layer, before a request reaches the origin server.
An experimentation platform is the end-to-end system that powers controlled experiments at scale: feature flags for assignment, metric pipelines for measurement, a statistical engine for analysis, and reporting for decision making.
A feature flag is a runtime switch that controls whether a feature is active for a given user, without deploying new code.
A feature toggle is another name for a feature flag: a runtime switch that controls whether a feature is active for a given user, without deploying new code.
A flag rule is a conditional expression that determines which users see which variant of a feature flag.
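As a sketch of what a flag rule evaluates (the attribute names, values, and variant names here are hypothetical, not any particular platform's schema):

```python
def evaluate_flag(user: dict) -> str:
    """Return the variant for one hypothetical rule: premium iOS users in Sweden."""
    matches = (
        user.get("country") == "SE"
        and user.get("platform") == "ios"
        and user.get("tier") == "premium"
    )
    return "new_home" if matches else "default"
```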
A holdback is a subset of users intentionally kept on the old experience after a feature has shipped to everyone else.
A kill switch is a feature flag designed specifically for instant emergency deactivation of a feature.
Local evaluation resolves feature flags on the client or server without making a network call to a central service at evaluation time.
OpenFeature is a CNCF (Cloud Native Computing Foundation) open standard that defines a vendor-agnostic API for feature flag management.
Overrides manually force a specific user or group into a particular feature flag variant, bypassing the normal assignment logic.
A phased rollout is a progressive rollout organized into discrete stages, each with a predefined percentage and explicit criteria that must be met before advancing to the next stage.
A progressive rollout is the practice of gradually increasing the percentage of users exposed to a feature over time, rather than releasing it to everyone at once.
A rollback reverts a feature to its previous state when problems are detected during a release.
A rollout is the process of releasing a feature to users in controlled stages using feature flags.
Sticky assignments ensure that a user who has been assigned to a variant in a feature flag or experiment continues to see that same variant across sessions, devices, and app restarts.
Targeting conditions are the criteria a feature flag evaluates to decide which variant a user receives.
Warehouse-native experimentation is an architecture where experiment data (assignments, exposures, metric events, and analysis results) lives in the customer's own data warehouse rather than being copied into a vendor's separate system.
Read moreTechniques for analyzing experiment results and avoiding pitfalls.
A confounding variable is a factor that influences both the treatment assignment and the outcome being measured, creating a spurious association that can be mistaken for a causal effect.
Counterfactual logging is the practice of recording what a system would have shown a user under an alternative policy or variant, alongside what was actually shown.
Dilution is the weakening of an observed treatment effect that occurs when users who were never exposed to the changed feature are included in the experiment analysis.
Exposure filters are criteria applied during experiment analysis to include or exclude users based on their exposure to the treatment.
Exposure logging is the practice of recording exactly when and whether each user was actually exposed to a specific experiment variant.
The garden of forking paths refers to the many implicit analytical choices a researcher or analyst makes during an experiment's lifecycle, each of which could have gone differently and each of which can shift the reported result.
A multi-armed bandit is an adaptive experiment design that shifts traffic allocation toward better-performing variants during the experiment, rather than splitting traffic evenly for the entire duration.
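A minimal Thompson-sampling sketch for binary-outcome arms (one of several bandit strategies; the counts below are illustrative):

```python
import random

def thompson_pick(successes: list[int], failures: list[int]) -> int:
    """Pick the arm whose Beta posterior draws the highest sample."""
    draws = [random.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return draws.index(max(draws))

# Example: with 40/200 conversions on arm 0 and 55/200 on arm 1,
# thompson_pick([40, 55], [160, 145]) selects arm 1 most of the time.
```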
Observational bias is systematic error introduced when the data collection or analysis process produces results that consistently differ from the truth.
A sample ratio mismatch (SRM) occurs when the observed number of users in each experiment group differs from the intended allocation ratio by more than chance alone would explain.
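A minimal SRM check using a chi-square goodness-of-fit test (the strict alpha of 0.001 is a common convention, not a universal rule):

```python
from scipy.stats import chisquare

def srm_detected(observed_counts, expected_ratios, alpha=0.001):
    """Flag a sample ratio mismatch when observed counts deviate from the planned split."""
    total = sum(observed_counts)
    expected = [r * total for r in expected_ratios]
    _, p_value = chisquare(f_obs=observed_counts, f_exp=expected)
    return p_value < alpha
```

For instance, srm_detected([50_420, 49_580], [0.5, 0.5]) is not flagged, while a 51/49 split across a million users would be.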
Segment analysis breaks down experiment results by user subgroups to detect heterogeneous treatment effects: cases where the change helps some users, hurts others, or has no effect on a particular segment.
Simpson's paradox is a statistical phenomenon where a trend that appears in several subgroups reverses or disappears when the subgroups are combined.
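A worked example using the classic kidney-stone dataset (textbook numbers, included only to illustrate the reversal):

```python
# (successes, total) per treatment within each subgroup
small_stones = {"A": (81, 87), "B": (234, 270)}
large_stones = {"A": (192, 263), "B": (55, 80)}

for name in ("A", "B"):
    s1, n1 = small_stones[name]
    s2, n2 = large_stones[name]
    print(name, round(s1 / n1, 2), round(s2 / n2, 2), round((s1 + s2) / (n1 + n2), 2))
# A: 0.93 within small, 0.73 within large, 0.78 pooled
# B: 0.87 within small, 0.69 within large, 0.83 pooled -> A wins each subgroup, B wins pooled
```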
Trigger analysis is an experiment analysis technique that restricts the evaluation to users who actually encountered the changed feature, rather than analyzing every user assigned to the experiment.
An unbiased estimator is a statistical estimator whose expected value equals the true parameter it's estimating.
Read moreOrganizational patterns that make experimentation work at scale.
Build-measure-learn is the iterative product development loop introduced by Eric Ries in The Lean Startup.
Continuous discovery is an ongoing cycle of learning from users and experiments, where research, hypothesis formation, and validation happen continuously alongside product development rather than in a one-off upfront research phase.
Cumulative holdback evaluation is a method for measuring the aggregate impact of many shipped features by maintaining a long-running holdout group that doesn't receive any of them.
Dogfooding is the practice of using your own product internally before releasing it to customers.
Experiment coordination is the practice of managing interactions, priorities, and resource allocation across concurrent experiments.
Experimentation culture is the organizational norm of testing product ideas with data before committing to them.
An experimentation maturity model is a framework for assessing how advanced an organization's experimentation practice is, from ad-hoc tests run by a few teams to a fully integrated system where experiments inform every product decision.
Experimentation theatre is the practice of running experiments without the rigor or organizational commitment to act on results.
Hypothesis-driven development is a product development approach where each change is framed as a testable hypothesis before it's built.
Multi-metric decision making is the practice of evaluating experiment results across multiple metrics simultaneously rather than basing ship decisions on a single success metric.
The product loop is the recurring cycle of insight, hypothesis, experiment, decision, and insight that drives how a product improves over time.
Read moreImproving conversion rates through systematic testing.
Client-side testing is the practice of running experiments in the browser using injected JavaScript that modifies the page after it loads.
Conversion rate optimization (CRO) is the practice of systematically improving the percentage of users who complete a desired action: signing up, purchasing, upgrading, completing onboarding, or any other action the business cares about.
Server-side testing is the practice of running experiments in backend code rather than in the browser.
Website optimization is the practice of improving a website's performance, user experience, and conversion outcomes through testing and iteration.