Course wrap up

Congratulations! You have finished Interpreting experiment results!

You can now open any experiment results page in Confidence and know exactly what you are looking at. To recap what you have covered:

  • The results page has three sections: Spotlight, Health checks, and Metrics, each answering a different question.
  • The control and treatment variant means are averages of the outcomes actually observed for real users in each group. Effects are always shown as relative % changes so that they are comparable across metrics.
  • A confidence interval (CI) tells you both where the effect likely lies and how precisely you have measured it. A wide CI means your estimate is imprecise and you likely need more data; it does not mean there is no effect.
  • Status labels differ between success and guardrail metrics because they answer different questions: "did it improve?" versus "did it break anything?"
  • The sample ratio mismatch (SRM) check is the most critical health check. If it fails, no metric result can be trusted.
  • Variance reduction makes estimates more precise by using pre-experiment behavior to remove noise. The numbers look slightly adjusted, but you interpret them the same way.
  • Your choice of evaluation strategy determines when results are valid to act on. Deterioration checks always run sequentially regardless of that choice.
  • The Spotlight synthesizes everything (health, success metrics, and guardrail metrics) into one recommendation per treatment variant.
  • Explorations are for learning and hypothesis generation, not for deciding whether an experiment succeeded.

What to explore next

If you want to go deeper on the statistical foundations behind what you learned here, the "A primer on hypothesis testing" course covers the mechanics of how hypothesis tests work and where p-values and significance thresholds come from.

To learn about more advanced experiment configurations, including guardrail metrics with non-inferiority margins and how to choose between sequential and non-sequential tests, check out the "Advance your experimentation" course.

Go back to the "My learning" page to keep learning!


© Copyright 2026. All rights reserved.
