The platform provides tests for differences between the means of the treatment
groups and the control group. Tests for success metrics and tests for guardrail
metrics are interpreted slightly differently.
Superiority Tests
Confidence uses superiority tests for success metrics and for deterioration tests.
A success metric test can be significant or non-significant. Significant means that a difference of
means as large as the one observed between the groups would be unlikely if there were no effect. All
success metric tests are against the null hypothesis that the treatment effect is zero. Three types
of tests are available for success metrics.
Any change
- Significant result: The data shows evidence that the treatment caused a change in the metric.
- Insignificant result: The data shows no evidence that the treatment caused a change in the metric.
The statistical hypotheses used in the test are:
- H0:δ=0
- H1:δ≠0
where δ is the treatment effect.
Increase
- Significant result: The data shows evidence that the treatment caused an increase in the metric.
- Insignificant result: The data shows no evidence that the treatment caused an increase in the metric.
The statistical hypotheses used in the test are:
- H0:δ=0
- H1:δ>0
where δ is the treatment effect.
Decrease
- Significant result: The data shows evidence that the treatment caused a decrease in the metric.
- Insignificant result: The data shows no evidence that the treatment caused a decrease in the metric.
The statistical hypotheses used in the test are:
- H0:δ=0
- H1:δ<0
where δ is the treatment effect.
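The three superiority tests above differ only in the alternative hypothesis. A minimal sketch of how the corresponding p-values could be computed follows. It assumes a normal approximation (a fixed-sample z-test), which may not be the procedure Confidence actually uses. The function name `superiority_p_value` and its parameters are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist


def superiority_p_value(mean_treatment, mean_control, se_diff, alternative="any"):
    """Return the p-value for H0: delta = 0 under a normal approximation.

    Assumption: se_diff is the standard error of the difference in means,
    and the estimate is approximately normally distributed.
    alternative is one of "any", "increase", "decrease".
    """
    z = (mean_treatment - mean_control) / se_diff
    cdf = NormalDist().cdf
    if alternative == "any":
        # H1: delta != 0, two-sided
        return 2 * (1 - cdf(abs(z)))
    if alternative == "increase":
        # H1: delta > 0, evidence comes from large positive z
        return 1 - cdf(z)
    if alternative == "decrease":
        # H1: delta < 0, evidence comes from large negative z
        return cdf(z)
    raise ValueError(f"unknown alternative: {alternative!r}")
```

A result would be called significant when the p-value falls below the chosen significance level. With the same data, the one-sided "increase" test yields half the two-sided p-value when the observed difference is positive.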
Non-Inferiority Tests
Confidence uses non-inferiority tests for guardrail metrics.
For non-inferiority tests, the null hypothesis is that the metric has moved in the undesired direction by at least the NIM (non-inferiority margin).
You must select a direction for a non-inferiority test.
- Significant result: The data shows evidence that the metric hasn’t decreased by more than NIM in the treatment group.
- Insignificant result: The data shows no evidence that the metric hasn’t decreased by more than NIM in the treatment group.
The statistical hypotheses used in the test are:
- H0:δ≤−NIM
- H1:δ>−NIM
where δ is the treatment effect.
- Significant result: The data shows evidence that the metric hasn’t increased by more than NIM in the treatment group.
- Insignificant result: The data shows no evidence that the metric hasn’t increased by more than NIM in the treatment group.
The statistical hypotheses used in the test are:
- H0:δ≥NIM
- H1:δ<NIM
where δ is the treatment effect.
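A non-inferiority test can be viewed as a one-sided test whose null value is shifted from zero to the margin. The sketch below illustrates this for both directions, again assuming a normal approximation rather than the exact procedure Confidence runs. The name `non_inferiority_p_value` and the `guard_against` parameter are hypothetical.

```python
from statistics import NormalDist


def non_inferiority_p_value(delta_hat, se_diff, nim, guard_against="decrease"):
    """Return the one-sided p-value for a non-inferiority test.

    Assumptions: delta_hat is the estimated treatment effect (treatment mean
    minus control mean), se_diff is its standard error, and nim is a
    non-negative absolute margin.
    """
    if nim < 0:
        raise ValueError("nim must be non-negative")
    cdf = NormalDist().cdf
    if guard_against == "decrease":
        # H0: delta <= -NIM  vs  H1: delta > -NIM
        z = (delta_hat + nim) / se_diff
        return 1 - cdf(z)
    if guard_against == "increase":
        # H0: delta >= NIM  vs  H1: delta < NIM
        z = (delta_hat - nim) / se_diff
        return cdf(z)
    raise ValueError(f"unknown direction: {guard_against!r}")
```

Note that an estimated effect of exactly zero can still give a significant result if the margin is large relative to the standard error, because the test only needs to rule out a change worse than the NIM.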
Inferiority Tests
Confidence uses inferiority tests for unintended negative effects in success and guardrail metrics. The inferiority test checks for a move in the direction opposite to the intended one.
For inferiority tests, the test is against the null hypothesis that the treatment effect is zero.
You must select a direction for an inferiority test.
- Significant result: The data shows evidence that the treatment caused a decrease in the metric.
- Insignificant result: The data shows no evidence that the treatment caused a decrease in the metric.
The statistical hypotheses used in the test are:
- H0:δ=0
- H1:δ<0
where δ is the treatment effect.
- Significant result: The data shows evidence that the treatment caused an increase in the metric.
- Insignificant result: The data shows no evidence that the treatment caused an increase in the metric.
The statistical hypotheses used in the test are:
- H0:δ=0
- H1:δ>0
where δ is the treatment effect.
Relative Values
Confidence performs tests on absolute values, but lets you specify NIMs on a relative scale.
Relative values are converted into absolute values using the mean of the baseline group, which is typically the control group.
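As an illustration of the conversion described above, the sketch below assumes the absolute NIM is the relative NIM multiplied by the baseline mean. The source does not spell out the exact formula, so treat this as an assumption. The function name `absolute_nim` is hypothetical.

```python
def absolute_nim(relative_nim, baseline_mean):
    """Convert a relative NIM (e.g. 0.02 for 2%) into an absolute margin.

    Assumption: the absolute margin is the relative margin scaled by the
    mean of the baseline group (typically the control group).
    """
    if relative_nim < 0:
        raise ValueError("relative_nim must be non-negative")
    return relative_nim * baseline_mean
```

Under this assumption, a 2% relative NIM with a control-group mean of 50 corresponds to an absolute NIM of 1 in the metric's own units.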
Tests for Success Metrics
Success metrics always use a superiority test. The test is against the null hypothesis of zero mean difference between the groups.
Tests for Guardrail Metrics
You can test guardrail metrics in two different ways:
- Use an inferiority test. This test evaluates whether there is evidence that the guardrail
metric does worse in the treatment group compared to the control group.
- Use a non-inferiority test. This test instead evaluates whether there is evidence that the
guardrail metric does better than a pre-defined threshold in the treatment group compared to the
control group.
Tests for Deterioration
Confidence tests all success and guardrail metrics for deterioration. For
success metrics, this means testing for inferiority and superiority separately.
For guardrail metrics, this means testing for inferiority and non-inferiority if
the guardrail metric uses a non-inferiority test.
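The rules above can be summarized as a small lookup. The sketch below is illustrative only: the function name `deterioration_tests` and the string labels are hypothetical, and the case of a guardrail metric that uses an inferiority test is assumed to run just that inferiority test, which the text implies but does not state.

```python
def deterioration_tests(metric_role, guardrail_test=None):
    """Return the tests run for a metric, following the rules in the text.

    metric_role: "success" or "guardrail".
    guardrail_test: "inferiority" or "non-inferiority" (guardrails only).
    """
    if metric_role == "success":
        # Success metrics: inferiority and superiority are tested separately.
        return ["inferiority", "superiority"]
    if metric_role == "guardrail":
        if guardrail_test == "non-inferiority":
            return ["inferiority", "non-inferiority"]
        # Assumption: a guardrail using an inferiority test runs only that test.
        return ["inferiority"]
    raise ValueError(f"unknown metric role: {metric_role!r}")
```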