Experimentation Glossary

Definitions of A/B testing, experimentation, and statistical terms. From the team that has run experimentation at Spotify for over a decade.

Core Experimentation Concepts

The fundamental ideas behind controlled experiments and A/B testing.

A/A Testing

An A/A test is a randomized experiment where both groups receive an identical experience.

A/B Testing

An A/B test is a randomized controlled experiment that splits users into two groups: one sees the current experience (control), the other sees a changed version (treatment).

A/B/n Testing

An A/B/n test is a randomized experiment that compares more than two variants simultaneously: one control and two or more treatments.

Average Treatment Effect (ATE)

The average treatment effect (ATE) is the mean difference in outcomes between the treatment group and the control group, averaged across the entire experimental population.

Control Group

The control group is the set of users in an experiment who see the unchanged, current experience.

Experiment Bandwidth

Experiment bandwidth is an organization's capacity to run concurrent experiments, constrained by available traffic, metric infrastructure, statistical rigor, and team coordination.

Experiment Design

Experiment design is the plan for how an experiment will be structured, sized, and analyzed before it begins.

Holdout Group

A holdout group is a segment of users permanently excluded from receiving a feature or set of features, maintained over time to measure cumulative long-term impact.

Hypothesis

A hypothesis is a testable prediction about the effect of a specific product change on a specific metric.

Interleaving

Interleaving is an experiment technique where results from treatment and control are mixed within a single user session, rather than assigning each user entirely to one group.

Multivariate Testing

A multivariate test (MVT) is a randomized experiment that changes multiple variables simultaneously and measures both the individual effect of each variable and the interaction effects between them.

Mutually Exclusive Experiments

Mutually exclusive experiments are experiments designed so that no user participates in more than one at the same time.

Null Hypothesis

The null hypothesis is the default assumption in a statistical test that there is no difference between the treatment and control groups.

Online Controlled Experiment

An online controlled experiment is the formal term for an A/B test conducted in a live digital product.

Product Builder

A product builder is anyone who builds and ships product: engineers, product managers, designers, data scientists, and the increasingly blended roles between them.

Product Experimentation

Product experimentation is the practice of using controlled experiments to validate product changes before full rollout.

Product Platform

A product platform is the shared infrastructure and tooling that enables product teams to build, test, and ship features systematically.

Randomized Controlled Trial

A randomized controlled trial (RCT) is an experimental design where participants are randomly assigned to a treatment group or a control group, and outcomes are compared between the groups to measure the causal effect of the treatment.

Scientific Product Development

Scientific product development is a product development approach that treats every product change as a hypothesis to be tested with experimental evidence before it ships.

Split Testing

A split test is another name for an A/B test.

Target Audience

A target audience is the subset of users eligible for an experiment, defined by targeting rules that filter on user attributes like country, platform, subscription tier, account age, or behavioral signals.

Treatment Effect

A treatment effect is the measured difference in a metric between the treatment group (users who see a change) and the control group (users who see the current experience).

Treatment Group

The treatment group is the set of users in an experiment who see the changed experience.
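A minimal sketch of the loop these terms describe: randomize users into control and treatment, observe a binary metric, and estimate the average treatment effect with the difference-in-means estimator. The conversion rates below are made up for illustration.

```python
import random

random.seed(42)

# Simulated per-user binary outcomes (hypothetical data):
# control converts at ~10%, treatment at ~12%.
control = [1 if random.random() < 0.10 else 0 for _ in range(10_000)]
treatment = [1 if random.random() < 0.12 else 0 for _ in range(10_000)]

# Difference-in-means estimator of the average treatment effect (ATE).
ate = sum(treatment) / len(treatment) - sum(control) / len(control)
print(f"estimated ATE: {ate:+.4f}")
```

With real traffic the assignment would come from a deterministic hashing scheme rather than a simulation, but the estimator is the same.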

Statistical Methods & Inference

The math that makes experiments trustworthy.

Bayesian A/B Testing

Bayesian A/B testing is an approach to experiment analysis that starts with a prior belief about the treatment effect and updates that belief using observed data, producing a posterior distribution over the treatment effect.

Binary Metric

A binary metric is a metric that takes one of two values for each user in an experiment: 1 (the event happened) or 0 (it didn't).

Causal Inference

Causal inference is the set of statistical methods used to determine whether a change actually caused an observed effect, rather than merely being correlated with it.

Confidence Interval

A confidence interval is a range of values that, at a given confidence level, is expected to contain the true treatment effect.

CUPED

CUPED (Controlled-experiment Using Pre-Existing Data) is a variance reduction method that uses data from before an experiment started to remove predictable noise from metric estimates, producing tighter confidence intervals from the same amount of traffic.

Difference-in-Means Estimator

The difference-in-means estimator is the simplest and most common estimator of the treatment effect in an A/B test.

Effect Size

Effect size is the magnitude of the difference in a metric between treatment and control groups.

False Negative Rate (Type II Error)

The false negative rate, also called the Type II error rate or beta, is the probability of failing to detect a real treatment effect.

False Positive Rate (Type I Error)

The false positive rate, also called the Type I error rate, is the probability of concluding that a treatment had an effect when it actually didn't.
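The CUPED adjustment described above can be sketched in a few lines: estimate a coefficient theta from the covariance between the pre-experiment and in-experiment metrics, then subtract the predictable component. The data here is simulated under the assumption that the pre-experiment metric is strongly predictive.

```python
import random

random.seed(0)

# Hypothetical data: each user's pre-experiment metric x predicts
# their in-experiment metric y, plus independent noise.
n = 5_000
x = [random.gauss(10, 2) for _ in range(n)]
y = [xi + random.gauss(0, 1) for xi in x]

def mean(v): return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((vi - m) ** 2 for vi in v) / (len(v) - 1)

mx, my = mean(x), mean(y)
cov_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)
theta = cov_xy / var(x)

# CUPED-adjusted metric: subtract the part of y predicted by pre-data.
y_cuped = [yi - theta * (xi - mx) for xi, yi in zip(x, y)]

print(f"variance before: {var(y):.2f}, after CUPED: {var(y_cuped):.2f}")
```

Because the adjustment only removes variation explained by pre-experiment data, the mean of the adjusted metric (and hence the treatment effect estimate) is unchanged in expectation.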
Frequentist A/B Testing

Frequentist A/B testing is the classical approach to experiment analysis that evaluates results using p-values and confidence intervals, asking: how likely is the observed data (or something more extreme) if the null hypothesis were true?

Metric Capping

Metric capping (also called winsorization) is a variance reduction technique that clips extreme metric values at a chosen threshold, reducing the outsized influence of outliers on experiment results.

Minimum Detectable Effect (MDE)

The minimum detectable effect (MDE) is the smallest treatment effect an experiment is designed to reliably detect at a given significance level and power.

P-value

A p-value is the probability of observing a result at least as extreme as the one measured, assuming the null hypothesis is true (that is, assuming the change had no real effect).

Sample Size

Sample size is the number of experimental units (typically users) needed in an A/B test to detect a given effect with a specified level of confidence and power.

Sampling Distribution

The sampling distribution is the probability distribution of a statistic (like a sample mean or a difference in means) computed across all possible random samples of a given size from a population.

Signal-to-Noise Ratio

The signal-to-noise ratio (SNR) in A/B testing is the ratio of the treatment effect (the signal) to the variability of the metric being measured (the noise).

Significance Level (Alpha)

The significance level, commonly called alpha, is the maximum false positive rate you're willing to accept in an experiment.

Statistical Power

Statistical power is the probability that an experiment will detect a real effect when one exists.

Statistical Significance

Statistical significance is the determination that an observed difference between experiment groups is unlikely to have occurred by chance alone.

Variance

Variance is a measure of how much a metric's values spread out across users.

Variance Reduction

Variance reduction is a set of statistical techniques that tighten the confidence intervals of an A/B test without requiring more traffic.

Z-Test

A z-test is a hypothesis test that uses the standard normal distribution to determine whether the observed difference between two groups is statistically significant.
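Several of the terms above (z-test, p-value, confidence interval, significance level) come together in one calculation. A sketch of a two-proportion z-test on a binary metric, using the standard normal CDF via math.erf; the conversion counts are hypothetical.

```python
from math import erf, sqrt

def two_proportion_ztest(conv_c, n_c, conv_t, n_t):
    """Z-test for the difference between two conversion rates."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    # Pooled standard error under the null of no difference.
    p_pool = (conv_c + conv_t) / (n_c + n_t)
    se_pool = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se_pool
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    # Unpooled SE for a 95% confidence interval around the difference.
    se = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    ci = (p_t - p_c - 1.96 * se, p_t - p_c + 1.96 * se)
    return z, p_value, ci

# 10.0% vs 11.0% conversion on 10,000 users per group.
z, p, ci = two_proportion_ztest(1_000, 10_000, 1_100, 10_000)
print(f"z = {z:.2f}, p = {p:.4f}, 95% CI = ({ci[0]:+.4f}, {ci[1]:+.4f})")
```

At alpha = 0.05 this hypothetical result is statistically significant: the p-value falls below 0.05 and the confidence interval excludes zero.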

Sequential Testing & the Peeking Problem

Methods for monitoring experiments before the planned end date.

Alpha Spending

Alpha spending is the method of distributing a fixed significance budget (alpha, typically 5%) across multiple interim analyses in a group sequential test.

Always-Valid Inference

Always-valid inference (AVI) is a class of sequential testing methods that construct confidence intervals remaining valid at any stopping time, without requiring the experimenter to pre-plan when or how often to look at the data.

Fixed-Power Design

A fixed-power design is a sequential experiment plan where the stopping rule is based on achieving a pre-specified level of statistical power rather than on observing a statistically significant result.

Group Sequential Test

A group sequential test (GST) is a sequential testing method that pre-plans a fixed number of interim analyses at specific points during an experiment, using an alpha spending function to distribute the false positive budget across those analyses.

Information Fraction

The information fraction is the proportion of the total planned statistical information that has been observed so far in a sequential experiment.

Optional Stopping

Optional stopping is the practice of ending an experiment based on observed results rather than a pre-determined stopping rule.

Peeking Problem

The peeking problem is the inflation of false positive rates that occurs when experimenters check statistical results before the planned sample size has been reached and stop the experiment early based on those interim results.

Sequential Testing

Sequential testing is a statistical framework that allows experimenters to make valid decisions at multiple analysis points during an experiment, rather than waiting for a single final evaluation.
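The peeking problem can be demonstrated directly by simulation: under a true null, testing once at the planned sample size holds the false positive rate near alpha, while testing at every interim look and stopping at the first significant result inflates it. A self-contained sketch with made-up parameters (2,000 simulated experiments, 10 interim looks):

```python
import random
from math import erf, sqrt

random.seed(7)

def significant(values, alpha=0.05):
    """Two-sided z-test that the mean of unit-variance data differs from 0."""
    n = len(values)
    z = (sum(values) / n) * sqrt(n)  # SE of the mean is 1/sqrt(n)
    p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return p < alpha

peeked, fixed = 0, 0
for _ in range(2_000):
    data = [random.gauss(0, 1) for _ in range(1_000)]  # the null is true
    # Fixed-horizon: a single test at the planned sample size.
    fixed += significant(data)
    # Peeking: test every 100 observations, stop at the first "significant".
    peeked += any(significant(data[:n]) for n in range(100, 1_001, 100))

print(f"false positive rate, fixed: {fixed / 2000:.3f}, "
      f"with peeking: {peeked / 2000:.3f}")
```

Sequential methods such as group sequential tests or always-valid inference exist precisely to allow those interim looks without the inflation this simulation shows.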

Multiple Testing & Error Control

Controlling error rates when evaluating many metrics or comparisons.

Benjamini-Hochberg Correction

The Benjamini-Hochberg (BH) correction is a multiple testing procedure that controls the false discovery rate (FDR): the expected proportion of false positives among all results declared significant.

Bonferroni Correction

The Bonferroni correction adjusts significance thresholds for multiple testing by dividing the target alpha by the number of tests.

Correction Family

A correction family is the set of hypothesis tests grouped together for a multiple testing adjustment.

False Discovery Rate

False discovery rate (FDR) is the expected proportion of false positives among all results declared statistically significant.

Family-Wise Error Rate

Family-wise error rate (FWER) is the probability of making at least one false positive across a set of hypothesis tests.

Holm Correction

The Holm correction (also called Holm-Bonferroni) is a step-down multiple testing procedure that controls the family-wise error rate (FWER) while being uniformly more powerful than the Bonferroni correction.

Hommel Correction

The Hommel correction is a multiple testing procedure that controls the family-wise error rate (FWER) while being more powerful than both the Bonferroni correction and the Holm correction.

Multiple Testing Correction

A multiple testing correction is an adjustment to significance thresholds that accounts for evaluating more than one hypothesis in the same experiment.
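The Bonferroni and Benjamini-Hochberg procedures above differ in what they control (FWER vs. FDR) and in power. A sketch of both on the same hypothetical list of p-values:

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H_i when p_i <= alpha / m; controls the FWER."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up BH procedure; controls the FDR at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * alpha,
    # then reject the k smallest p-values.
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k = rank
    rejected = [False] * m
    for i in order[:k]:
        rejected[i] = True
    return rejected

p_vals = [0.001, 0.008, 0.022, 0.041, 0.20]  # hypothetical results
print("Bonferroni:", bonferroni(p_vals))
print("BH:        ", benjamini_hochberg(p_vals))
```

On these values Bonferroni (threshold 0.05 / 5 = 0.01) rejects two hypotheses, while BH rejects three: the usual trade of stricter FWER control against the greater power of FDR control.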

Metrics

How to define and measure what matters.

Conversion Rate

Conversion rate is the fraction of users who complete a desired action out of those who had the opportunity to complete it.

Daily Active Users (DAU)

Daily active users (DAU) is the count of unique users who engage with a product within a single calendar day.

Guardrail Metric

A guardrail metric is a metric monitored during an experiment to ensure the change doesn't cause unintended harm, even when the success metric improves.

Inferiority Test

An inferiority test checks whether a treatment is worse than control by more than a specified margin on a guardrail metric.

Longitudinal Guardrails

Longitudinal guardrails track guardrail metrics across many experiments over time to detect slow, cumulative harm that no individual experiment would flag.

Metric Drift

Metric drift is a gradual, non-experiment-related change in a metric's baseline value over time.

Metric Interaction Effects

Metric interaction effects occur when the effect of one experiment on a metric depends on whether another experiment is also running.

Metric Sensitivity

Metric sensitivity is how responsive a metric is to real product changes.

Monthly Active Users (MAU)

Monthly active users (MAU) is the count of unique users who engage with a product within a 30-day (or calendar month) window.

Non-inferiority Margin

A non-inferiority margin (NIM) is the maximum amount of deterioration in a guardrail metric that a team is willing to accept in exchange for a gain on their success metric.

Non-inferiority Test

A non-inferiority test confirms that a treatment is not meaningfully worse than control on a guardrail metric.

North Star Metric

A north star metric is the single metric that best captures the value a product delivers to its users.

Primary Metric

A primary metric is the main metric used to decide whether to ship an experiment's treatment to all users.

Proxy Metric

A proxy metric is a measurable stand-in for a harder-to-measure outcome.

Secondary Metric

A secondary metric is a supporting metric that provides additional context about an experiment's results but doesn't drive the ship decision.

Ship Rate

Ship rate is the proportion of experiments whose results lead to shipping the treatment to all users.

Success Metric

A success metric is the primary metric an experiment is designed to move.

Weekly Active Users (WAU)

Weekly active users (WAU) is the count of unique users who engage with a product within a rolling or fixed 7-day window.

Win Rate

Win rate is the proportion of experiments that produce a statistically significant positive result on their primary success metric.
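A non-inferiority test on a guardrail metric can be sketched as a one-sided confidence bound compared against the non-inferiority margin. The conversion counts and the 1-percentage-point margin below are hypothetical:

```python
from math import sqrt

def non_inferior(conv_c, n_c, conv_t, n_t, nim):
    """One-sided non-inferiority test on a binary guardrail metric.

    Passes when the lower one-sided 95% confidence bound for
    (treatment - control) stays above -nim.
    """
    p_c, p_t = conv_c / n_c, conv_t / n_t
    se = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z_alpha = 1.645  # one-sided critical value at the 5% level
    lower = (p_t - p_c) - z_alpha * se
    return lower > -nim

# Hypothetical guardrail: accept at most a 1pp drop in conversion.
print(non_inferior(980, 10_000, 975, 10_000, nim=0.01))   # small dip: passes
print(non_inferior(980, 10_000, 900, 10_000, nim=0.01))   # large dip: fails
```

Note the asymmetry with a regular significance test: passing means the data rules out harm larger than the margin, not that the treatment is better.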

Feature Flags & Rollouts

The delivery mechanism that connects experiments to code.

Auto-Rollback

Auto-rollback is an automatic rollback triggered when a guardrail metric violates a predefined threshold during a rollout.

Bucket Hashing

Bucket hashing is the mechanism that maps a user into a numbered bucket, which then determines their variant in a feature flag or experiment.

Canary Release

A canary release exposes a change to a tiny fraction of traffic first, typically 1% or less, to detect severe problems before wider rollout.

Deterministic Assignment

Deterministic assignment is a method of assigning users to experiment variants by hashing a stable identifier (typically the user ID) combined with a salt, so that the same user always maps to the same variant.

Dynamic Config

A dynamic config is a feature flag that carries structured data (JSON objects, typed fields) instead of a simple boolean.

Edge Resolvers

Edge resolvers evaluate feature flags at the CDN or edge compute layer, before a request reaches the origin server.

Experimentation Platform

An experimentation platform is the end-to-end system that powers controlled experiments at scale: feature flags for assignment, metric pipelines for measurement, a statistical engine for analysis, and tooling to act on the results.

Feature Flag

A feature flag is a runtime switch that controls whether a feature is active for a given user, without deploying new code.

Feature Toggle

A feature toggle is a runtime switch that controls whether a feature is active for a given user, without deploying new code.

Flag Rule

A flag rule is a conditional expression that determines which users see which variant of a feature flag.

Holdback

A holdback is a subset of users intentionally kept on the old experience after a feature has shipped to everyone else.

Kill Switch

A kill switch is a feature flag designed specifically for instant emergency deactivation of a feature.

Local Evaluation

Local evaluation resolves feature flags on the client or server without making a network call to a central service at evaluation time.

OpenFeature

OpenFeature is a CNCF (Cloud Native Computing Foundation) open standard that defines a vendor-agnostic API for feature flag management.

Overrides

Overrides manually force a specific user or group into a particular feature flag variant, bypassing the normal assignment logic.

Phased Rollout

A phased rollout is a progressive rollout organized into discrete stages, each with a predefined percentage and explicit criteria that must be met before advancing to the next stage.

Progressive Rollout

A progressive rollout is the practice of gradually increasing the percentage of users exposed to a feature over time, rather than releasing it to everyone at once.

Rollback

A rollback reverts a feature to its previous state when problems are detected during a release.

Rollout

A rollout is the process of releasing a feature to users in controlled stages using feature flags.

Sticky Assignments

Sticky assignments ensure that a user who has been assigned to a variant in a feature flag or experiment continues to see that same variant across sessions, devices, and app restarts.

Targeting Conditions

Targeting conditions are the criteria a feature flag evaluates to decide which variant a user receives.

Warehouse-Native Experimentation

Warehouse-native experimentation is an architecture where experiment data (assignments, exposures, metric events, and analysis results) lives in the customer's own data warehouse rather than being copied into a vendor's systems.
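Deterministic assignment and bucket hashing are commonly implemented by hashing a stable identifier together with a per-flag salt. A sketch (the salt name is hypothetical, and real platforms vary in hash choice and bucket count):

```python
import hashlib

def bucket(user_id: str, salt: str, num_buckets: int = 100) -> int:
    """Deterministically map a user to a bucket in [0, num_buckets)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % num_buckets

def variant(user_id: str, salt: str = "checkout-redesign-v1") -> str:
    """50/50 split: buckets 0-49 -> control, 50-99 -> treatment."""
    return "control" if bucket(user_id, salt) < 50 else "treatment"

# The same user and salt always land in the same bucket, so assignment
# is sticky without storing any state; a different salt re-randomizes,
# which keeps concurrent experiments independent of each other.
print(variant("user-123"), variant("user-123"))
```

This is why deterministic assignment, sticky assignments, and mutually exclusive experiments are all, under the hood, properties of how the salt and bucket ranges are chosen.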

Experiment Analysis & Methodology

Techniques for analyzing experiment results and avoiding pitfalls.

Confounding Variables

A confounding variable is a factor that influences both the treatment assignment and the outcome being measured, creating a spurious association that can be mistaken for a causal effect.

Counterfactual Logging

Counterfactual logging is the practice of recording what a system would have shown a user under an alternative policy or variant, alongside what was actually shown.

Dilution

Dilution is the weakening of an observed treatment effect that occurs when users who were never exposed to the changed feature are included in the experiment analysis.

Estimand

An estimand is the precise quantity an experiment is designed to estimate.

Exposure Filters

Exposure filters are criteria applied during experiment analysis to include or exclude users based on their exposure to the treatment.

Exposure Logging

Exposure logging is the practice of recording exactly when and whether each user was actually exposed to a specific experiment variant.

Garden of Forking Paths

The garden of forking paths refers to the many implicit analytical choices a researcher or analyst makes during an experiment's lifecycle, each of which could have gone differently, and which together can inflate the rate of false findings even without deliberate p-hacking.

Multi-Armed Bandit

A multi-armed bandit is an adaptive experiment design that shifts traffic allocation toward better-performing variants during the experiment, rather than splitting traffic evenly for the entire duration.

Observational Bias

Observational bias is systematic error introduced when the data collection or analysis process produces results that consistently differ from the truth.

Sample Ratio Mismatch

A sample ratio mismatch (SRM) occurs when the observed number of users in each experiment group differs from the intended allocation ratio by more than chance alone would explain.

Segment Analysis

Segment analysis breaks down experiment results by user subgroups to detect heterogeneous treatment effects: cases where the change helps some users, hurts others, or has no effect on a particular segment.

Simpson's Paradox

Simpson's paradox is a statistical phenomenon where a trend that appears in several subgroups reverses or disappears when the subgroups are combined.

Trigger Analysis

Trigger analysis is an experiment analysis technique that restricts the evaluation to users who actually encountered the changed feature, rather than analyzing every user assigned to the experiment.

Unbiased Estimator

An unbiased estimator is a statistical estimator whose expected value equals the true parameter it's estimating.
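A sample ratio mismatch check is typically a chi-square goodness-of-fit test against the intended allocation. A sketch for a 50/50 split; with one degree of freedom the p-value can be computed from the normal CDF, and the user counts below are hypothetical:

```python
from math import erf, sqrt

def srm_check(n_control, n_treatment, expected_ratio=0.5, threshold=1e-3):
    """Chi-square test (df = 1) for sample ratio mismatch on a two-way split."""
    total = n_control + n_treatment
    exp_c = total * expected_ratio
    exp_t = total * (1 - expected_ratio)
    chi2 = ((n_control - exp_c) ** 2 / exp_c
            + (n_treatment - exp_t) ** 2 / exp_t)
    # For df = 1, the chi-square p-value equals a two-sided normal p-value.
    p = 2 * (1 - 0.5 * (1 + erf(sqrt(chi2) / sqrt(2))))
    return p, p < threshold  # flag SRM at a deliberately strict threshold

# 50.6% vs 49.4% on a million users: a tiny imbalance, but far too
# unlikely by chance -- a classic sign of a broken assignment or
# logging pipeline rather than a real experimental result.
p, srm = srm_check(506_000, 494_000)
print(f"p = {p:.2e}, SRM detected: {srm}")
```

A strict threshold (well below the usual 0.05) is common here because SRM checks run on every experiment, and a flagged SRM means the results should be investigated rather than interpreted.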

Organizational & Cultural Concepts

Organizational patterns that make experimentation work at scale.

Build-Measure-Learn

Build-measure-learn is the iterative product development loop introduced by Eric Ries in The Lean Startup.

Continuous Discovery

Continuous discovery is an ongoing cycle of learning from users and experiments, where research, hypothesis formation, and validation happen continuously alongside product development rather than in a separate upfront research phase.

Cumulative Holdback Evaluation

Cumulative holdback evaluation is a method for measuring the aggregate impact of many shipped features by maintaining a long-running holdout group that doesn't receive any of them.

Dogfooding

Dogfooding is the practice of using your own product internally before releasing it to customers.

Experiment Coordination

Experiment coordination is the practice of managing interactions, priorities, and resource allocation across concurrent experiments.

Experimentation Culture

Experimentation culture is the organizational norm of testing product ideas with data before committing to them.

Experimentation Maturity Model

An experimentation maturity model is a framework for assessing how advanced an organization's experimentation practice is, from ad-hoc tests run by a few teams to a fully integrated system where experiments inform most product decisions.

Experimentation Theatre

Experimentation theatre is the practice of running experiments without the rigor or organizational commitment to act on results.

Hypothesis-Driven Development

Hypothesis-driven development is a product development approach where each change is framed as a testable hypothesis before it's built.

Multi-Metric Decision Making

Multi-metric decision making is the practice of evaluating experiment results across multiple metrics simultaneously rather than basing ship decisions on a single success metric.

Product Loop

The product loop is the recurring cycle of insight, hypothesis, experiment, decision, and insight that drives how a product improves over time.

Conversion & Optimization

Improving conversion rates through systematic testing.

Client-Side Testing

Client-side testing is the practice of running experiments in the browser using injected JavaScript that modifies the page after it loads.

Conversion Rate Optimization

Conversion rate optimization (CRO) is the practice of systematically improving the percentage of users who complete a desired action: signing up, purchasing, upgrading, completing onboarding, or any other action that matters to the business.

Server-Side Testing

Server-side testing is the practice of running experiments in backend code rather than in the browser.

Website Optimization

Website optimization is the practice of improving a website's performance, user experience, and conversion outcomes through testing and iteration.