# Statistical Settings

> Tune the statistical settings to control the risk of false positives and false negatives.

For a successful, well-planned experiment, commit beforehand to a strategy for
when and how to evaluate the results. This avoids the well-known pitfalls of
peeking. Similarly, results are only trustworthy if you know how much traffic
you need to identify the effects of interest with high probability.

## Test Evaluation Frequency

When setting up an experiment, you need to select how often to evaluate the results of the
test. You have two options:

* View results continuously
* View results upon conclusion

Viewing results continuously means that you get the results updated
and presented hourly or daily using sequential tests.
Viewing results only upon conclusion separates the data
collection and analysis phases of the experiment.
With this choice, you can view the results after you end the experiment using fixed horizon tests.
Read more about the details on the [sequential tests](./stats/sequential-tests) page.

Selecting a strategy before launching the experiment is crucial, as it makes it possible to control
the risk of finding false positives regardless of the choice made.
Looking at the results at times the chosen strategy doesn't allow for, and acting on them, inflates
the false positive rate. This is commonly referred to as the "peeking problem."
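
To see why peeking matters, the following sketch simulates A/A tests (no true effect) and compares the false positive rate of a single final analysis with that of naively re-running the same test at every look. It uses a plain z-test with known variance and made-up parameters (`n_looks`, `users_per_look`). It illustrates the problem only and is not how Confidence's sequential tests work.

```python
import math
import random
from statistics import NormalDist


def simulate_false_positive_rates(n_sims=500, n_looks=10, users_per_look=100,
                                  alpha=0.05, seed=1):
    """Simulate A/A tests and return (fixed_rate, peeking_rate)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    fixed_hits = 0
    peeking_hits = 0
    for _sim in range(n_sims):
        sum_a = sum_b = 0.0
        n = 0
        z = 0.0
        rejected_at_any_look = False
        for _look in range(n_looks):
            # Both groups come from the same N(0, 1) distribution: no true effect.
            sum_a += sum(rng.gauss(0, 1) for _ in range(users_per_look))
            sum_b += sum(rng.gauss(0, 1) for _ in range(users_per_look))
            n += users_per_look
            # z-statistic for the difference in means with known unit variance.
            z = (sum_a / n - sum_b / n) / math.sqrt(2 / n)
            if abs(z) > z_crit:
                rejected_at_any_look = True
        fixed_hits += abs(z) > z_crit         # only the final look counts
        peeking_hits += rejected_at_any_look  # a rejection at any look counts
    return fixed_hits / n_sims, peeking_hits / n_sims


fixed_rate, peeking_rate = simulate_false_positive_rates()
print(f"fixed horizon: {fixed_rate:.3f}, naive peeking: {peeking_rate:.3f}")
```

Because the naive approach counts a rejection at any look, its false positive rate can never be lower than that of the final analysis alone, and it typically ends up well above the nominal alpha. Sequential tests are designed to account for this.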

<Note>
  If you choose to view results only at the end of the experiment, Confidence still uses sequential
  tests to run daily checks on all your metrics to ensure they have not deteriorated. Read more about
  that in the [monitoring](./monitoring) section.
</Note>

The benefit of viewing results only at the end of the experiment is higher
precision compared to a test whose results update daily. Sticking to that
approach leads to less uncertainty in the final estimates of the treatment
effects.

## Alpha and Power

The false positive rate is also known as *alpha* and defaults to 5%. The
power level, also known as the true positive rate, defaults to 80%.
For a given sample size, a lower alpha makes false positives less likely, but it
also lowers the chance of finding an effect if there is one. The power level sets
the desired probability of finding an effect if there is one. See the section on [power analysis](./sample-size-calculator#power-analysis-and-the-required-sample-size) for how the required
sample size helps you determine how much traffic you need to achieve the
desired level of power.
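
The trade-off between alpha and power can be illustrated with a small sketch. It uses the textbook power approximation for a two-sided, two-sample z-test with a fixed sample size. The function name and the values for `effect`, `sigma`, and `n_per_group` are assumptions chosen for illustration, and the formula is not necessarily what Confidence computes internally.

```python
from statistics import NormalDist


def approximate_power(alpha, effect, sigma, n_per_group):
    """Approximate power of a two-sided, two-sample z-test (textbook formula)."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    standard_error = sigma * (2 / n_per_group) ** 0.5
    # Probability of exceeding the critical value in the direction of the effect.
    # The negligible chance of rejecting in the wrong direction is ignored.
    return std_normal.cdf(abs(effect) / standard_error - z_crit)


# Hypothetical experiment: true effect 0.1, standard deviation 1.0, 1570 users per group.
for alpha in (0.10, 0.05, 0.01):
    power = approximate_power(alpha, effect=0.1, sigma=1.0, n_per_group=1570)
    print(f"alpha={alpha:.2f} -> power={power:.2f}")
```

With the sample size held fixed, the printed power drops as alpha gets stricter, which is the trade-off described above.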

<Note>
  Confidence adjusts the selected alpha and power levels for multiple comparisons using a Bonferroni
  correction that handles success and guardrail metrics differently.
  The corrections ensure that the error rates for your decision to ship the feature are at most the
  error rates implied by the false positive rate, determined by alpha, and the power level you set.
  Read more about [adjustment for multiple comparisons](./stats/adjustment-multiple-comparisons).
</Note>
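
As a rough illustration of the idea behind a Bonferroni correction, the sketch below divides the overall alpha evenly across a number of comparisons. This is the plain textbook version with a hypothetical helper name. It does not reproduce how Confidence treats success and guardrail metrics differently.

```python
def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison alpha under a plain Bonferroni correction."""
    if n_comparisons < 1:
        raise ValueError("n_comparisons must be at least 1")
    return alpha / n_comparisons


overall_alpha = 0.05
for n_comparisons in (1, 2, 5):
    per_test = bonferroni_alpha(overall_alpha, n_comparisons)
    print(f"{n_comparisons} comparisons -> per-test alpha {per_test:.3f}")
```

Each individual comparison is tested at a stricter level so that the overall false positive rate stays at or below the alpha you selected.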

<BehindFlag disableFlag="sample-size-calculator">
  ## Power Analysis and the Required Sample Size

  The sample size calculator is a tool that assists in planning the length and
  size of an experiment. The tool calculates what sample size you need to
  achieve the requested level of power given the setup of the experiment. The
  required sample size differs across metrics. The tool also displays the largest
  required sample size across all metrics. Having a large enough sample size is
  important to ensure that the experiment has enough sensitivity to detect
  meaningful effects. For more about what affects the required sample size,
  see the [power](./design/power) page.

  <Note>
    The sample size calculator doesn't take audience targeting into account. If
    you target a subset of the population, the variance of the metrics might
    differ for that subset. A larger variance increases the number of users
    required to power a certain [MDE/NIM](./design/effect-sizes), while a
    smaller variance decreases it.
  </Note>

  ### Calculate the Required Sample Size

  To calculate the required sample size for an experiment:

  1. Configure the experiment as described in the earlier sections.
  2. In the **Required sample size** section on the right sidebar, click **Calculate**.

  The calculation takes some time.
  When it finishes, the required sample size for each metric appears in the right sidebar.

  <img src="https://mintcdn.com/confidence-7c0fec1b/KTPKB6kyq9KGua3d/images/required-sample-size.png?fit=max&auto=format&n=KTPKB6kyq9KGua3d&q=85&s=419ac1b203162a5c3f32e43dd77ce930" alt="Required sample size" width="720" height="242" data-path="images/required-sample-size.png" />

  In the preceding example, the first metric requires 77,000 users given the
  setup of the experiment. The second metric requires 407,000 users, so
  powering all metrics requires at least 407,000 users.

  ### Adjust the Required Sample Size

  If the required sample size is too large compared to the available population,
  you can either try to expand the population or reduce the required sample size.

  To reduce the required sample size, you can do one or more of the following:

  * **Increase Alpha setting**. Alpha is the probability of a false positive. A higher
    alpha requires a smaller sample size, but means the risk of finding significance
    when there really is no effect increases.

  * **Lower Power setting**. Power is the probability of a true positive. The
    higher the power, the lower the probability of a false negative. A lower power
    requires a smaller sample size, but lowers the chance of finding a true
    effect. Lower power also increases the risk of sign and magnitude errors (type
    S and type M errors). In general, power that is too low makes it hard to
    reproduce the results of an experiment.

  * **Increase metric MDEs and NIMs**. The MDE and NIM are the
    effect sizes that you and your stakeholders care about. The larger the MDE and
    NIM, the smaller the required sample size.
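
The direction of these three adjustments can be checked with a textbook approximation. The sketch below computes the per-group sample size for a two-sided, fixed-horizon, two-sample z-test on an MDE. The function name, the unit standard deviation, and the scenario values are assumptions for illustration. The sample size calculator in Confidence relies on historical data for your metrics and may give different numbers.

```python
import math
from statistics import NormalDist


def required_n_per_group(alpha, power, mde, sigma=1.0):
    """Textbook per-group sample size for a two-sided, two-sample z-test."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_power = std_normal.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) * sigma / mde) ** 2)


baseline = required_n_per_group(alpha=0.05, power=0.80, mde=0.10)
scenarios = {
    "baseline (alpha=0.05, power=0.80, MDE=0.10)": baseline,
    "increase alpha to 0.10": required_n_per_group(0.10, 0.80, 0.10),
    "lower power to 0.70": required_n_per_group(0.05, 0.70, 0.10),
    "increase MDE to 0.20": required_n_per_group(0.05, 0.80, 0.20),
}
for label, n in scenarios.items():
    print(f"{label}: {n} users per group")
```

Under this approximation the required sample size scales with the inverse square of the MDE, so doubling the MDE cuts the requirement to roughly a quarter.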

  ### Sample Size for New Metrics

  When calculating the required sample size for an experiment, Confidence
  looks at historical data for the metrics in the experiment.

  There needs to be at least 14 days (plus the aggregation window and exposure
  offsets) of historical data for the metric in order for Confidence to be able
  to calculate the required sample size. For example, if you have an experiment
  with a metric that has a 7-day aggregation window and a 7-day exposure offset,
  you need at least 28 days of historical data. If there is not enough
  historical data, Confidence can't calculate the required sample
  size.
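
The arithmetic in the example above can be written as a small helper. The function and parameter names are hypothetical and only mirror the rule as stated here: 14 days plus the aggregation window plus the exposure offset.

```python
def min_history_days(aggregation_window_days, exposure_offset_days, base_days=14):
    """Minimum days of historical data needed, per the rule described above."""
    return base_days + aggregation_window_days + exposure_offset_days


# A metric with a 7-day aggregation window and a 7-day exposure offset.
print(min_history_days(aggregation_window_days=7, exposure_offset_days=7))  # 28
```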
</BehindFlag>

## Related Resources

<CardGroup cols={2}>
  <Card title="Sample Size Calculator" href="/docs/experiments/sample-size-calculator">
    Calculate required sample sizes
  </Card>

  <Card title="Effect Sizes" href="/docs/experiments/design/effect-sizes">
    Configure MDE and NIM settings
  </Card>

  <Card title="Power Analysis" href="/docs/experiments/design/power">
    Understand statistical power
  </Card>

  <Card title="Statistical Tests" href="/docs/experiments/stats/stat-tests">
    Learn about the statistics
  </Card>
</CardGroup>
