Confidence provides metric results by comparing the treatments using hypothesis tests to see if the differences are statistically significant. The exact nature of the tests varies depending on the type and role of the metrics. Ultimately, Confidence summarizes the multidimensional results into a single overall shipping recommendation.

Spotlight

For both running and ended experiments, you need to make a decision: whether a tested feature is good enough to ship to the full market, or whether an ongoing experiment should continue or stop. When deciding what to do, always take a step back and consider the pros and cons of each option, and involve people with different roles in the decision. To help you decide, Confidence provides a recommendation in the Spotlight section on the Result tab. The recommendation summarizes the outcomes for the multiple metrics used in the experiment. For an experiment that is live and configured to display results continuously, the possible recommendations are:
  • Ship. Confidence recommends shipping the change if:
    • at least one success metric has evidence of improvement
    • all guardrail metrics meet their tolerance levels if you use a non-inferiority margin; if not, they show no evidence of deterioration
    • there is no evidence of deterioration in any metric or of a sample ratio mismatch
    In this case, there is conclusive evidence that the change you are testing improves at least one metric while the guardrails remain acceptable.
  • Continue. If there’s no evidence that you should ship, Confidence recommends continuing the experiment as long as there are no signs of deterioration or sample ratio mismatch.
  • End. Confidence recommends ending the experiment if you use success metrics with minimum detectable effects and all success metrics are powered but none is significant.
  • Abort. Confidence recommends stopping the experiment if there is evidence of deterioration or a sample ratio mismatch.
When you end an experiment, the recommendations focus on what to do next. The Continue, End and Abort recommendations change into a Don’t ship recommendation, as there is no evidence of an improvement that would suggest shipping.
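
The recommendation rules above can be read as a small decision function. The following is only a sketch of that logic as this page describes it, not the actual Confidence implementation. All names (`ExperimentState`, `recommend`, and the field names) are hypothetical, and the order in which the conditions are checked is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Recommendation(Enum):
    SHIP = "Ship"
    CONTINUE = "Continue"
    END = "End"
    ABORT = "Abort"
    DONT_SHIP = "Don't ship"


@dataclass
class ExperimentState:
    # Hypothetical fields that mirror the conditions described in the text.
    any_success_improved: bool   # at least one success metric shows improvement
    guardrails_ok: bool          # NIM tolerance met, or no deterioration without NIM
    any_deterioration: bool      # evidence of deterioration in any metric
    sample_ratio_mismatch: bool  # the balanced traffic check failed
    uses_mde: bool               # success metrics have minimum detectable effects
    all_success_powered: bool    # every success metric is powered
    ended: bool = False          # the experiment has been ended


def recommend(state: ExperimentState) -> Recommendation:
    """Map the experiment state to one recommendation (assumed precedence)."""
    if state.any_deterioration or state.sample_ratio_mismatch:
        rec = Recommendation.ABORT
    elif state.any_success_improved and state.guardrails_ok:
        rec = Recommendation.SHIP
    elif (state.uses_mde and state.all_success_powered
          and not state.any_success_improved):
        # Powered, but no success metric is significant.
        rec = Recommendation.END
    else:
        rec = Recommendation.CONTINUE

    # After the experiment ends, everything except Ship becomes Don't ship.
    if state.ended and rec is not Recommendation.SHIP:
        return Recommendation.DONT_SHIP
    return rec
```

For example, a live experiment with no significant success metric and no powered metrics yields `Recommendation.CONTINUE`, and the same state with `ended=True` yields `Recommendation.DONT_SHIP`.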

Health Checks

Confidence provides health checks to help you understand the quality of your experiment.

Incoming Traffic

The incoming traffic health check verifies that your experiment receives traffic. The check confirms that the flag rule your experiment controls receives resolves from clients, that these resolves are also applied, and that all groups in the experiment have exposure calculated.

Balanced Traffic

The balanced traffic health check verifies that the proportion of exposure attributed to each group follows the allocation that you set up for the experiment. If there is an imbalance, the results are not reliable. This check uses what is commonly referred to as a sample ratio mismatch test.
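
A sample ratio mismatch test is commonly implemented as a chi-square goodness-of-fit test of the observed group sizes against the configured allocation. The sketch below shows that general technique using only the Python standard library. This page does not describe the exact test or threshold Confidence uses, so the function names and the `threshold` default of 0.001 are assumptions made for illustration.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square distribution, integer df >= 1."""
    h = x / 2.0
    # Closed-form starting points for df = 1 (odd) and df = 2 (even).
    if df % 2 == 1:
        a, q = 0.5, math.erfc(math.sqrt(h))
    else:
        a, q = 1.0, math.exp(-h)
    # Recurrence Q(a + 1, h) = Q(a, h) + h^a * e^(-h) / Gamma(a + 1).
    while a < df / 2.0:
        q += math.exp(-h) * h ** a / math.gamma(a + 1.0)
        a += 1.0
    return min(q, 1.0)


def sample_ratio_mismatch(observed, allocation, threshold=0.001):
    """Return (statistic, p_value, mismatch_flag) for observed group counts.

    observed:   exposed users per group, e.g. [5000, 5500]
    allocation: intended share per group, e.g. [0.5, 0.5]
    threshold:  assumed p-value cutoff for flagging a mismatch
    """
    total = sum(observed)
    weight = sum(allocation)
    expected = [total * w / weight for w in allocation]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = chi2_sf(stat, df=len(observed) - 1)
    return stat, p_value, p_value < threshold


stat, p_value, mismatch = sample_ratio_mismatch([5000, 5500], [0.5, 0.5])
print(f"chi2={stat:.2f}, p={p_value:.2g}, mismatch={mismatch}")
```

With a 50/50 allocation, observing 5,000 and 5,500 exposed users is far enough from the expected 5,250 per group that the sketch flags a mismatch.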

No Metric Deterioration

The no metric deterioration health check verifies that the metrics you track, including both success and guardrail metrics, do not show any evidence of deterioration. If a metric deteriorates, you have a clear sign that the treatment isn’t working as intended.

Metrics

Confidence presents the results for individual metrics in various ways to help you learn as much as possible from your tests.

Significance

A significant result means that, if there were no true effect, a result at least as extreme as the observed one would be unlikely to arise from the natural variation in the data alone. Alpha specifies the threshold for significance, and thus the expected rate of false positives. The default alpha for Confidence is 5%, which is further adjusted to account for multiple comparisons. You should interpret significance differently for success and guardrail metrics:
| Metric type | Significant | Not significant |
| --- | --- | --- |
| Success metric | You have statistical evidence for a change/increase/decrease due to the treatment. | You lack statistical evidence for a change/increase/decrease due to the treatment. |
| Guardrail metric (with NIM) | You have statistical evidence that the metric has not increased more than/not decreased more than the specified non-inferiority margin (NIM). | You lack statistical evidence that the metric has not increased more than/not decreased more than the NIM. |
| Guardrail metric (without NIM) | You have statistical evidence that the metric deteriorates. | You lack statistical evidence that the metric deteriorates. |
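
The three interpretations of significance correspond to three different null hypotheses. The sketch below illustrates the difference with simple one-sided z-tests, assuming that a higher metric value is better, that the estimate is approximately normal, and that `alpha` is already the adjusted alpha. The function names are hypothetical, and the sketch ignores sequential testing, so it is not a description of how Confidence computes its results.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def success_significant(estimate, std_error, alpha=0.05):
    """Success metric: evidence that the effect is greater than zero."""
    p_value = 1.0 - _STD_NORMAL.cdf(estimate / std_error)
    return p_value < alpha


def guardrail_within_nim(estimate, std_error, nim, alpha=0.05):
    """Guardrail with NIM: evidence that the metric has not decreased by more
    than the margin (the null hypothesis is effect <= -nim)."""
    p_value = 1.0 - _STD_NORMAL.cdf((estimate + nim) / std_error)
    return p_value < alpha


def guardrail_deteriorated(estimate, std_error, alpha=0.05):
    """Guardrail without NIM: evidence that the metric has decreased."""
    p_value = _STD_NORMAL.cdf(estimate / std_error)
    return p_value < alpha


# Same estimate, different questions (effects as fractions of control).
print(guardrail_within_nim(-0.005, 0.01, nim=0.03))  # margin is comfortably met
print(guardrail_deteriorated(-0.005, 0.01))          # small drop, inconclusive
```

Note that for a guardrail metric with a NIM, a significant result is the desirable outcome, while for a guardrail metric without a NIM, a significant result signals a problem.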

Results Estimates and Confidence Intervals

The results for the comparisons between treatment and control give a point estimate and a confidence interval. The point estimate always lies in the middle of the confidence interval. The estimated effect of the treatment is uncertain, and the confidence interval describes the degree of uncertainty. If you repeated the experiment 100 times with an alpha of 5%, you would expect the confidence interval to cover the true effect of the treatment in about 95 of the experiments. With the same alpha, a given experiment’s confidence interval covers the true effect of the treatment with 95% confidence. Confidence displays the point estimate and confidence interval on the relative scale, so effects are always reported as a percentage change relative to the control group. This makes it easier to compare and visualize effect sizes across metrics.
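
To make the relative scale concrete, the sketch below computes a relative point estimate and a symmetric confidence interval from group summaries. It uses a delta-method approximation for the variance of the ratio of two independent means, which is one common approach. This page does not state how Confidence computes the interval, so treat the method and the function name `relative_effect_ci` as assumptions.

```python
import math
from statistics import NormalDist


def relative_effect_ci(mean_t, var_t, n_t, mean_c, var_c, n_c, alpha=0.05):
    """Return (estimate, lower, upper) for the relative change vs. control.

    mean_*, var_*, n_* are the sample mean, sample variance and number of
    exposed users in the treatment (t) and control (c) groups.
    """
    estimate = (mean_t - mean_c) / mean_c
    # Delta-method variance of mean_t / mean_c for independent groups.
    var_rel = (var_t / (n_t * mean_c ** 2)
               + (mean_t ** 2 * var_c) / (n_c * mean_c ** 4))
    half_width = NormalDist().inv_cdf(1.0 - alpha / 2.0) * math.sqrt(var_rel)
    return estimate, estimate - half_width, estimate + half_width


est, lower, upper = relative_effect_ci(11.0, 4.0, 1000, 10.0, 4.0, 1000)
print(f"{est:+.1%} [{lower:+.1%}, {upper:+.1%}]")
```

In this example, the treatment mean of 11 against a control mean of 10 gives a point estimate of +10%, and the interval is centered on that estimate.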

Result Visualization

You can view the difference between treatments and control on the Result tab of the experiment. For each comparison between a treatment and control, you see the results visualized using a confidence interval. If you’re analyzing the results sequentially, you can click the expand icon to see a timeline graph.

Detailed Results

If you click Detailed results, you can see more details about the analysis of the metrics in your experiment. Here you can find the following information:
  • Powered effect: the effect size that you have the power to detect with the current sample size. For example, if you set the power for the experiment to 80%, then a powered effect of 10% means that, based on the users that have been exposed to the experiment so far, you have 80% power to detect a 10% effect size. Note that the 10% is relative to the control group.
  • Sample size: the number of exposed users in each group.
  • Time: the most recent time point at which Confidence analyzed the metric.
  • Adjusted alpha: the alpha for the metric after correction for multiple testing.
  • Adjusted power: the power for the metric after correction for the decision rule.
  • Variance reduction: the percentage of variance removed by using pre-exposure data (Confidence displays N/A if you disable variance reduction).
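
As an illustration of what a powered effect means, the sketch below uses the textbook formula for the smallest effect a two-sided z-test can detect with a given power, and expresses it relative to the control mean. It assumes two equally sized groups with the same variance and no sequential testing or variance reduction. The function name `powered_effect` is hypothetical, and the formula may differ from what Confidence uses.

```python
import math
from statistics import NormalDist


def powered_effect(mean_c, variance, n_per_group, alpha=0.05, power=0.80):
    """Relative effect detectable with the given power at the current sample size."""
    std_normal = NormalDist()
    # Standard error of the difference between two group means.
    std_error = math.sqrt(2.0 * variance / n_per_group)
    z_total = std_normal.inv_cdf(1.0 - alpha / 2.0) + std_normal.inv_cdf(power)
    return z_total * std_error / mean_c


# Control mean 10, variance 400, 10,000 exposed users per group.
print(f"{powered_effect(10.0, 400.0, 10_000):.1%}")  # about 7.9%
```

Because the standard error shrinks with the square root of the sample size, exposing four times as many users halves the powered effect.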

Learn More

To learn more about your results, use an Exploration. Here you can add more metrics and split the results by various dimensions. You can also click Detailed results to get more details about the metrics in your experiment. Choose between different types of visualizations, and add more columns to see more details.