# Analyze Results

> Understand what your results mean and what to do next.

Confidence produces metric results by comparing treatments with
hypothesis tests to determine whether the differences are statistically
significant. The exact nature of the tests varies depending
on the type and role of each metric. Ultimately, Confidence gives an
overall shipping recommendation that condenses the multidimensional
results into a single recommendation.

## Spotlight

For both running and ended experiments, you need to make a decision: whether a
tested feature is good enough to ship to the full market, or whether an ongoing
experiment should continue or stop. When deciding what to do,
take a step back and consider all the pros and cons of the
decision. Involve people with different roles in these
decisions.

To help you in deciding what to do, Confidence provides a recommendation in the **Spotlight**
section on the **Result** tab. The recommendation summarizes the outcomes for the multiple metrics
used in the experiment. For an experiment that is live and configured to display results
continuously, the possible recommendations are:

* **Ship**. Confidence recommends shipping the change if:
  * at least one success metric has evidence of improvement
  * all guardrail metrics meet their tolerance levels if you use a non-inferiority margin; if not,
    they should show no evidence of deterioration
  * there is no evidence of deterioration in any metric or of a sample ratio mismatch

  In this case, there is conclusive evidence that the change you are testing improves at least one
  metric while the guardrail metrics remain acceptable.
* **Continue**. If there's no evidence that you should ship, Confidence recommends continuing the
  experiment as long as there are no signs of deterioration or sample ratio mismatch.
* **End**. Confidence recommends ending the experiment if you use success metrics with minimum
  detectable effects and all success metrics are powered but none is significant.
* **Abort**. Confidence recommends stopping the experiment if there is evidence of
  deterioration or a sample ratio mismatch.

When you end an experiment, the recommendations focus on what to do next. The **Continue**, **End**
and **Abort** recommendations change into a **Don't ship** recommendation, as there is no evidence
of an improvement that would suggest shipping.
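The rules above can be read as a small decision procedure. The following is a minimal sketch of that logic, not Confidence's implementation: the `MetricResult` fields, the `recommend` function, and the order in which the conditions are checked are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MetricResult:
    """Hypothetical summary of one metric's test outcome."""
    role: str                   # "success" or "guardrail"
    improved: bool = False      # success metric: evidence of improvement
    deteriorated: bool = False  # any metric: evidence of deterioration
    has_nim: bool = False       # guardrail: uses a non-inferiority margin
    within_nim: bool = False    # guardrail: significantly within its margin
    has_mde: bool = False       # success metric: minimum detectable effect set
    powered: bool = False       # success metric: required power reached


def recommend(metrics: List[MetricResult], srm_detected: bool, ended: bool = False) -> str:
    """Return a Spotlight-style recommendation (illustrative only)."""
    success = [m for m in metrics if m.role == "success"]
    guardrails = [m for m in metrics if m.role == "guardrail"]

    unhealthy = srm_detected or any(m.deteriorated for m in metrics)
    # Guardrails with a margin must be within it; others must not deteriorate.
    guardrails_ok = all(
        m.within_nim if m.has_nim else not m.deteriorated for m in guardrails
    )
    any_improved = any(m.improved for m in success)

    if not unhealthy and guardrails_ok and any_improved:
        return "Ship"
    if ended:
        # After the experiment ends, every non-ship outcome collapses to this.
        return "Don't ship"
    if unhealthy:
        return "Abort"
    if success and not any_improved and all(m.has_mde and m.powered for m in success):
        return "End"
    return "Continue"
```

For example, a live experiment with one improved success metric and one guardrail within its margin yields `"Ship"`, while the same experiment with a sample ratio mismatch yields `"Abort"`.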

For ended experiments, the Spotlight section includes an **Explore** option. Click it to create an
[Exploration](./exploration) directly from the Spotlight recommendation. This lets you dig deeper
into the results that informed the recommendation.

## Health Checks

Confidence provides health checks to help you understand the quality of your experiment.

### Incoming Traffic

The incoming traffic health check verifies that your experiment receives traffic. The check confirms
that the flag rule your experiment controls receives resolves from clients, that these resolves are
also applied, and that all groups in the experiment have exposure calculated.

### Balanced Traffic

The balanced traffic health check verifies that the proportion of exposure attributed to each group
follows the allocation that you set up for the experiment. If there is an imbalance, the results are
not reliable. This check uses what is commonly referred to as a sample ratio mismatch test.
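As an illustration of what a sample ratio mismatch test does, here is a minimal sketch using a chi-square goodness-of-fit test for a two-group experiment. The function name `srm_p_value` and the 0.001 threshold in the usage note are assumptions for this example; the test Confidence runs may be constructed differently.

```python
import math
from typing import Sequence


def srm_p_value(observed: Sequence[int], expected_share: Sequence[float]) -> float:
    """Chi-square goodness-of-fit p-value for a two-group experiment."""
    if len(observed) != 2 or len(expected_share) != 2:
        raise ValueError("this sketch only handles two groups")
    total = sum(observed)
    chi2 = sum(
        (obs - total * share) ** 2 / (total * share)
        for obs, share in zip(observed, expected_share)
    )
    # With two groups the statistic has one degree of freedom, where the
    # chi-square survival function reduces to erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


# A 50/50 allocation that ended up with 5,200 vs. 4,800 exposed users.
p = srm_p_value([5200, 4800], [0.5, 0.5])
mismatch = p < 0.001  # assumed threshold, chosen only for illustration
```

A small p-value means the observed split is unlikely under the configured allocation, which is the situation the balanced traffic check flags.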

### No Metric Deterioration

The no metric deterioration health check verifies that the metrics you track,
including both success and guardrail metrics, do not show any evidence of
deterioration. If a metric deteriorates, you have a clear sign that the
treatment isn't working as intended.

## Metrics

Confidence presents the results for individual metrics in various ways to help you
learn as much as possible from your tests.

### Significance

A result is significant when, if there were no true effect, a difference at
least as large as the one observed would be unlikely to arise from the natural
variation in the data alone. Alpha specifies the threshold for significance,
and thus the expected rate of false positives. The default alpha in Confidence
is 5%, which is further adjusted to
account for [multiple comparisons](/docs/experiments/statistical-settings).
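To give a sense of what adjusting alpha for multiple comparisons means, the sketch below applies a Bonferroni correction, which splits alpha evenly across the comparisons. This is a generic textbook illustration with a hypothetical function name; the correction Confidence actually applies is described on the linked statistical settings page and may differ.

```python
def bonferroni_alpha(alpha: float, n_comparisons: int) -> float:
    """Per-comparison alpha under a Bonferroni correction (illustrative)."""
    if n_comparisons < 1:
        raise ValueError("n_comparisons must be at least 1")
    return alpha / n_comparisons


# With the default alpha of 5% and five comparisons, each comparison
# is tested at a stricter threshold than the overall 5%.
adjusted = bonferroni_alpha(0.05, 5)
```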

You should interpret significance differently for success and guardrail metrics:

|                                | Significant                                                                                                                                | Not significant                                                                                        |
| :----------------------------- | :----------------------------------------------------------------------------------------------------------------------------------------- | :----------------------------------------------------------------------------------------------------- |
| Success metric                 | You have statistical evidence for a change/increase/decrease due to the treatment.                                                         | You lack statistical evidence for a change/increase/decrease due to the treatment.                     |
| Guardrail metric (with NIM)    | You have statistical evidence that the metric has not increased more than/not decreased more than the non-inferiority margin (NIM).        | You lack statistical evidence that the metric has not increased more than/not decreased more than NIM. |
| Guardrail metric (without NIM) | You have statistical evidence that the metric deteriorates.                                                                                | You lack statistical evidence that the metric deteriorates.                                            |
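One way to make the table concrete is to express each row as a check on the confidence interval of the relative effect. The sketch below assumes that higher metric values are better, that the interval was already computed at the appropriate adjusted alpha, and that the margin is given as a positive fraction. The function `is_significant` and its parameters are hypothetical.

```python
from typing import Optional


def is_significant(role: str, ci_low: float, ci_high: float,
                   nim: Optional[float] = None) -> bool:
    """Interpret a relative-effect interval by metric role (illustrative)."""
    if role == "success":
        # Evidence of an increase: the whole interval lies above zero.
        return ci_low > 0.0
    if role == "guardrail":
        if nim is not None:
            # Evidence that the metric has not decreased by more than the margin.
            return ci_low > -nim
        # Without a margin, significance means evidence of deterioration.
        return ci_high < 0.0
    raise ValueError(f"unknown role: {role}")
```

Note that "significant" is good news in the first two rows and bad news in the third: `is_significant("guardrail", -0.03, -0.01)` is `True` because the interval shows a decrease.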

### Results Estimates and Confidence Intervals

The results for the comparisons between treatment and control give a point
estimate and a confidence interval. The point estimate always lies in the
middle of the confidence interval. The estimated effect of the treatment is
uncertain, and the confidence interval describes the degree of uncertainty. If
you repeated the experiment 100 times with alpha set to 5%, you would expect
the confidence intervals to cover the true effect of the treatment in about 95
of the experiments. Equivalently, with the same alpha, a given experiment's
confidence interval covers the true effect of the treatment with 95%
confidence.

Confidence displays the point estimate and confidence interval on the *relative* scale. The effects
are always reported as a % change relative to the control group. This makes it easier to
compare and visualize effect sizes across metrics.
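As a rough illustration of a relative estimate and its interval, the sketch below computes the relative difference in means and a symmetric interval using the delta method for two independent groups. This is a standard fixed-sample approximation and an assumption for this example, not a description of how Confidence computes its intervals. The function and parameter names are hypothetical.

```python
import math
from statistics import NormalDist
from typing import Tuple


def relative_effect_ci(mean_c: float, var_c: float, n_c: int,
                       mean_t: float, var_t: float, n_t: int,
                       alpha: float = 0.05) -> Tuple[float, float, float]:
    """Return (estimate, lower, upper) for the relative change vs. control."""
    estimate = (mean_t - mean_c) / mean_c
    # Delta-method variance of mean_t / mean_c for independent groups.
    variance = var_t / (n_t * mean_c**2) + (mean_t**2 * var_c) / (n_c * mean_c**4)
    half_width = NormalDist().inv_cdf(1 - alpha / 2) * math.sqrt(variance)
    return estimate, estimate - half_width, estimate + half_width


# Control mean 10.0, treatment mean 10.5, 1,000 users per group.
estimate, lower, upper = relative_effect_ci(10.0, 4.0, 1000, 10.5, 4.0, 1000)
```

Here `estimate` is a 5% relative increase, and the point estimate sits in the middle of the interval by construction.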

### Result Visualization

You can view the difference between treatments and control on the **Result** tab
of the experiment.

For each comparison between a treatment and control, you see the results visualized using a
confidence interval. If you're analyzing the results sequentially, you can click the expand icon to
see a timeline graph.

### Detailed Results

If you click **Detailed results**, you can see more details about the analysis of the metrics
in your experiment. Here you can find the following information:

* **Powered effect** shows the effect size that you have the power to detect with the
  current sample size. For example, if you set the power for the experiment to 80%, then a powered
  effect of 10% means that, based on the users exposed to the experiment so far,
  you have 80% power to detect a 10% effect size. Note that
  the 10% is *relative* to the control group.
* **Sample size** shows the number of exposed users in each group.
* **Time** shows the last time point at which Confidence analyzed the metric.
* **Adjusted alpha** shows the alpha for the metric after the multiple testing correction.
* **Adjusted power** shows the power for the metric after the decision rule correction.
* **Variance reduction** shows the percentage of variance removed by using pre-exposure data
  (Confidence displays `N/A` if you disable variance reduction).
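To build intuition for the powered effect, the sketch below computes the smallest relative effect detectable with a given sample size under a two-sided, fixed-sample z-test with equal group sizes and equal variances. These are simplifying assumptions for illustration. Confidence's own calculation accounts for its adjusted alpha and power and for the test type in use, so its numbers can differ. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def powered_effect(mean_c: float, variance: float, n_per_group: int,
                   alpha: float = 0.05, power: float = 0.80) -> float:
    """Smallest relative effect detectable at the given alpha and power."""
    z = NormalDist().inv_cdf
    # Standard error of the difference in means for two equal-sized groups.
    std_error = math.sqrt(2 * variance / n_per_group)
    absolute_effect = (z(1 - alpha / 2) + z(power)) * std_error
    # Express the effect relative to the control group mean.
    return absolute_effect / mean_c


# Control mean 10.0, variance 4.0, 1,000 exposed users per group.
effect = powered_effect(10.0, 4.0, 1000)
```

With these inputs the powered effect is about 2.5% relative to control, and it shrinks as more users are exposed.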

### Learn More

To learn more about your results, use an
[Exploration](./exploration). Here you can add more metrics and split the results
by various dimensions.

You can also click **Detailed results** to get more details about the metrics in
your experiment. Choose between different types of visualizations, and add more
columns to see more details.

## Record Your Decision

After an experiment ends, you should record the outcome and reasoning. This creates an
institutional record of experiment learnings that your team can reference later.

On the **Result** tab of an ended experiment, the **Decision** section appears at the top of the
page. It contains:

* **Outcome**—a dropdown list where you select what you decided to do based on the results (for
  example, ship or don't ship). By default, this shows "Not selected".
* **Conclusion**—a text field where you write a brief summary of the decision and the reasoning
  behind it.

## Roll Out a Successful Variant

When your results show a clear winner, you can roll out the winning variant directly from the experiment. This option is available for A/B tests with two treatments (control and one treatment variant). Click **Roll out** in the actions section of a live A/B test to convert it to a [rollout](./workflows/rollouts). This distributes the winning variant to all users without manually configuring the flag.

The rollout preserves the experiment's metrics and configuration so you can continue to monitor the impact as you scale up.

## Related Resources

<CardGroup cols={2}>
  <Card title="Exploration" href="/docs/experiments/exploration">
    Deep dive into results analysis
  </Card>

  <Card title="Statistical Tests" href="/docs/experiments/stats/stat-tests">
    Understand the statistics
  </Card>

  <Card title="Explore Results" href="/docs/how-to-guides/explore-results">
    Step-by-step exploration guide
  </Card>
</CardGroup>
