- can have multiple treatments, which is sometimes referred to as an A/B/n test
- use both success and guardrail metrics to identify experiences that improve some metrics without negatively impacting others
- let you learn and find promising ideas
- have a fixed allocation that doesn’t change
- can use either a fixed or sequential design, where you view results upon conclusion or continuously during the experiment
Most A/B tests evaluate product changes to understand whether you should roll out the changes or whether they need further development.
A learning experiment is another type of A/B test that aims to learn about user behavior or to
measure a strategic baseline for the product.
This learning is typically achieved by removing a product or feature from the experience or degrading the
experience in some other way. Such a test helps inform future product prioritization
by breaking down which parts of the existing product have the most impact on
user behavior or the business.
Learning experiments can also be exploratory, aiming only to determine whether a variant has a causal relationship to an outcome, regardless of direction.
The Anatomy of an Experiment
An A/B test has different parts. This section gives a high-level overview of these concepts.

The Hypothesis is the Product Foundation of the Test
A hypothesis is a specific assumption that can be conclusively tested when subjected to an experiment, and is the basis for a good experiment. It guides the experiment from a product perspective, and makes the anticipated impact and value of the experiment clear.

A/B Tests Distribute Different Experiences Through Variants
An A/B test evaluates how users react after exposure to a new experience. Variants describe the different user experiences you test. For example, there could be different variants of a button color. One variant sets the button color to red, another to blue. A variant in an experiment is often referred to as a treatment. These variants often introduce new features, innovations, or changes that should improve the experience for the user. Typically, an experiment has one variant representing the current default (in production) experience, usually called control or the control treatment.

Randomization Makes Differences Causal
Users in an experiment are randomly assigned a variant. The variant is the only difference in experience between the control and treatment groups, so any observed change in behavior can be attributed to the treatment. If the treatment group outperforms the control group on the target metric, the treatment variant improves the user experience. Randomization ensures that the groups are similar: external factors, such as seasonality, other feature launches, and competitor moves, affect control and treatment evenly and have no impact on the results of the experiment.

The treatment effect estimated in an A/B test is only valid for the time of the test.
The estimated effect doesn’t necessarily generalize to other future points in time.
The same treatment can have a widely different impact depending on when you run the test.
For example, recommending Christmas songs in July might not have the same effect as in December.
The randomization only ensures that the groups are similar during the experiment.
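In practice, random assignment is often implemented as a deterministic hash of the user ID and an experiment-specific salt, so the same user always sees the same variant. A minimal sketch of that idea — the salt, bucket count, and function name are illustrative assumptions, not Confidence's actual implementation:

```python
import hashlib

def assign_variant(user_id: str, experiment_salt: str,
                   variants=("control", "treatment")) -> str:
    """Deterministically map a user to a variant.

    Hashing user_id with an experiment-specific salt yields a stable
    pseudo-random bucket, so the same user always gets the same variant.
    """
    digest = hashlib.sha256(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000          # 10,000 buckets allow fine-grained splits
    return variants[bucket * len(variants) // 10_000]

# The same user always receives the same variant:
assert assign_variant("user-42", "button-color-v1") == assign_variant("user-42", "button-color-v1")
```

Because the hash is salted per experiment, the same user can land in different groups across different experiments, which keeps experiments independent of each other.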
Metrics Measure the Effect of the Treatments
Every A/B test needs at least one metric. Metrics help prove or disprove the hypothesis and support a business decision based on the outcome of the test. In other words, your metrics help answer whether the change is good enough to release widely. Confidence supports two types of metrics:

- Success metrics are metrics that should improve with the treatment
- Guardrail metrics are metrics that don’t need to improve, but shouldn’t deteriorate
It’s common and strongly recommended to use both success and guardrail metrics.
This guards against effects such as cannibalization: an experiment might aim to increase engagement with a new feature, but not by cannibalizing engagement with another feature.
In this case, the engagement in the new feature would be the success metric, while the engagement in the related feature is
the guardrail metric.
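To make this concrete, here is one way a success metric and a guardrail metric might feed into a ship decision for conversion-style metrics. This is a simplified sketch using a standard two-proportion z-test, not Confidence's actual analysis; the thresholds and function names are assumptions:

```python
from math import sqrt, erf

def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Two-sided two-proportion z-test: returns (difference, p-value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # 2 * (1 - Phi(|z|))
    return p_b - p_a, p_value

def ship_decision(success, guardrail, alpha=0.05):
    """Ship only if the success metric improves significantly and the
    guardrail metric shows no significant regression."""
    s_diff, s_p = two_proportion_z(*success)
    g_diff, g_p = two_proportion_z(*guardrail)
    success_improved = s_p < alpha and s_diff > 0
    guardrail_regressed = g_p < alpha and g_diff < 0
    return success_improved and not guardrail_regressed

# success: new-feature engagement   (control_conv, control_n, treatment_conv, treatment_n)
# guardrail: related-feature engagement
print(ship_decision(success=(1000, 10000, 1200, 10000),
                    guardrail=(3000, 10000, 2950, 10000)))  # → True
```

Here the success metric improves significantly (10% → 12%) while the small dip in the guardrail metric (30% → 29.5%) is not statistically significant, so the sketch recommends shipping.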
Statistical Analysis Tells the Answer
Experimentation uses statistical analysis to reach a conclusion. A statistical test is a formal procedure that assesses whether the observed difference between two groups is large enough to conclude that there is an effect. The goal of the statistical test is to distinguish the actual effect of the treatment from noise due to random sampling. The statistical tests analyze each metric and ultimately summarize the results as a recommendation for the product decision.

Roll Out a Successful Experiment
Convert the A/B test to a rollout when the A/B test completes and you have a winning variant. The rollout targets the exact same users, with all the metrics and configuration from the A/B test. If the A/B test used less than 100% of the allocation, you can scale up to more users. To avoid reassigning users, the control and treatment groups must keep the same proportions. For example, suppose an A/B test ran at 10% of the population with a 50/50 split between control and treatment. When you increase the rollout percentage to 50%, all users are in either control or treatment. You can't continue to track metrics when you increase the rollout percentage beyond this point. Read more about rollouts.

Experiment Lifecycle
An A/B test moves through different states during its lifecycle. Each state has specific actions available.

| State | Available actions |
|---|---|
| Draft | Launch, Archive, Delete, Clone |
| Live | Roll out, End, Clone |
| Ended | Archive, Clone |
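The lifecycle table above can be sketched as a small state machine that validates actions against the current state. The state and action names come from the table; the helper function is illustrative, not part of any API:

```python
# Allowed actions per experiment state, from the lifecycle table above.
ACTIONS = {
    "Draft": {"Launch", "Archive", "Delete", "Clone"},
    "Live": {"Roll out", "End", "Clone"},
    "Ended": {"Archive", "Clone"},
}

# State changes caused by actions; actions like Clone or Archive
# leave the experiment's own state unchanged in this sketch.
TRANSITIONS = {
    ("Draft", "Launch"): "Live",
    ("Live", "End"): "Ended",
}

def apply_action(state: str, action: str) -> str:
    """Return the next state, rejecting actions the table doesn't allow."""
    if action not in ACTIONS[state]:
        raise ValueError(f"Cannot {action!r} an experiment in state {state!r}")
    return TRANSITIONS.get((state, action), state)

print(apply_action("Draft", "Launch"))  # → Live
```

Encoding the table this way makes the rule explicit that, for example, an ended experiment can be cloned or archived but never relaunched directly.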
Clone an Experiment
You can clone any A/B test to create a new draft with the same configuration. Clone an experiment to run a similar test without configuring it from scratch. To clone an experiment, open the A/B test detail page and select Clone from the top of the page. The cloned experiment starts as a new draft that you can change before launch.

Planned and Actual Runtime
The Planning section on the A/B test detail page shows information about the experiment’s runtime:

- Planned runtime: The expected duration of the experiment. Click Edit to set or update the planned runtime. For draft experiments, planned runtime shows “Not set” until you configure it.
- Actual runtime: The time the experiment has been running, calculated automatically from the launch and end dates.

