Feature Flags

What is auto-rollback?

Auto-rollback is an automatic rollback triggered when a guardrail metric violates a predefined threshold during a rollout. Instead of waiting for a human to notice a regression and manually flip a flag, the platform detects the violation and reverts the feature to 0% exposure on its own. The time between "something went wrong" and "users are protected" shrinks from minutes or hours to seconds.

Manual rollback depends on someone watching the dashboard, recognizing the problem, and taking action. That works during business hours when the team is paying attention. It works poorly at 3 AM on a Saturday, when the on-call engineer is asleep and the regression is quietly accumulating harm. Auto-rollback covers the gaps that human attention can't.

How does auto-rollback work?

The mechanism has three parts: metric computation, threshold evaluation, and the rollback action.

During a rollout, the platform continuously computes guardrail metrics by comparing the exposed group to the unexposed group. These are the same statistical comparisons used in A/B test analysis: the platform isn't checking raw metric values against an absolute number; it's testing whether the treatment group is significantly worse than the control group.
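
As a rough illustration, the sketch below runs a one-sided two-proportion z-test on crash counts from the exposed and unexposed groups. It's a simplified stand-in for the platform's actual statistics; the function names and numbers are invented for the example.

    import math

    def one_sided_p_value(z: float) -> float:
        """P(Z >= z) for a standard normal variate."""
        return 0.5 * math.erfc(z / math.sqrt(2))

    def crash_rate_z_test(crashes_t: int, n_t: int,
                          crashes_c: int, n_c: int) -> float:
        """One-sided two-proportion z-test: is the treatment (exposed)
        crash rate higher than the control (unexposed) rate?"""
        p_t, p_c = crashes_t / n_t, crashes_c / n_c
        pooled = (crashes_t + crashes_c) / (n_t + n_c)
        se = math.sqrt(pooled * (1 - pooled) * (1 / n_t + 1 / n_c))
        return one_sided_p_value((p_t - p_c) / se)

    # Treatment crashing at 1.2% vs. control at 1.0%, 50k users per group.
    print(crash_rate_z_test(600, 50_000, 500, 50_000))  # ~0.001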

Each guardrail metric has a threshold. This might be an inferiority test (is the treatment statistically worse than control?) or a non-inferiority margin (is the treatment more than X% worse?). When a metric crosses its threshold, the system triggers the rollback.
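
In code, the two threshold styles might look like the sketch below. It's deliberately simplified: a production system would compare a confidence interval against the margin rather than a point estimate, and the names and default values here are assumptions.

    def inferiority_violated(p_value: float, alpha: float = 0.01) -> bool:
        """Inferiority test: the treatment is statistically worse."""
        return p_value < alpha

    def non_inferiority_violated(treatment: float, control: float,
                                 margin_pct: float = 5.0) -> bool:
        """Non-inferiority margin: the treatment is more than margin_pct%
        worse than control ('worse' meaning larger here, e.g. a latency)."""
        return treatment > control * (1 + margin_pct / 100)

    print(inferiority_violated(0.002))             # True
    print(non_inferiority_violated(220.0, 200.0))  # True: 10% worse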

The rollback action is the same as a manual rollback: set the feature flag to 0%. In Confidence, flag evaluation happens in-process and takes 10 to 50 microseconds, so the change propagates quickly. Users start seeing the old experience on their next request.
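
Put together, the three parts form a loop like the one sketched below. FlagAdminClient, set_rollout, and the polling scheme are hypothetical stand-ins for whatever flag-management API a platform exposes; the shape of the loop is the point.

    import time
    from typing import Callable, Dict

    class FlagAdminClient:
        """Hypothetical admin client; a real one would call the platform."""
        def set_rollout(self, flag_name: str, percent: int) -> None:
            print(f"{flag_name} rollout set to {percent}%")

    def monitor_rollout(flag_name: str,
                        guardrail_p_values: Callable[[], Dict[str, float]],
                        client: FlagAdminClient,
                        alpha: float = 0.01,
                        interval_s: int = 60) -> str:
        """Poll guardrails; on any violation, revert to 0% exposure."""
        while True:
            for metric, p in guardrail_p_values().items():  # 1. compute
                if p < alpha:                               # 2. evaluate
                    client.set_rollout(flag_name, 0)        # 3. roll back
                    return f"rolled back on {metric} (p={p:.4f})"
            time.sleep(interval_s)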

Which metrics should trigger auto-rollback?

Not all guardrail metrics should trigger automatic rollback. The right candidates are metrics where regression is clearly harmful and where waiting for human review adds risk without adding judgment.

Good candidates for auto-rollback:

  • Crash rates. A statistically significant increase in crashes is never acceptable, and the signal leaves no room for interpretation.
  • Error rates (server-side 5xx, client exceptions). Same logic: more errors is always bad.
  • Severe latency regressions. A P99 latency doubling during a rollout is a clear signal.

Better left to human judgment:

  • Engagement metrics. A dip in sessions or clicks might reflect a real regression or might reflect a healthy change in user behavior (users finding what they need faster, for example). Auto-rolling back on engagement drops can punish genuinely good features.
  • Revenue metrics. Revenue signals are noisy, especially at low rollout percentages. A human should review the context before deciding.

The principle: auto-rollback for metrics where "worse" has an unambiguous meaning. Human review for metrics where interpretation matters.
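
One way to encode that principle is per-metric policy configuration, as in the sketch below. The schema is an illustrative assumption, not a documented Confidence feature.

    GUARDRAIL_POLICY = {
        # "worse" is unambiguous: act without waiting for a human
        "crash_rate":        {"on_violation": "auto_rollback"},
        "server_5xx_rate":   {"on_violation": "auto_rollback"},
        "p99_latency":       {"on_violation": "auto_rollback"},
        # interpretation matters: surface it, let a human decide
        "sessions_per_user": {"on_violation": "flag_for_review"},
        "revenue_per_user":  {"on_violation": "flag_for_review"},
    }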

What are the risks of auto-rollback?

False positives. Statistical monitoring generates some false alarms by design. If your guardrail uses a 5% significance level and you're monitoring ten metrics, you'll see roughly one false positive every other rollout. A false positive auto-rollback wastes time (the team investigates a non-problem) but doesn't harm users.
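
The arithmetic behind that estimate, assuming the ten metrics are roughly independent:

    alpha, n_metrics = 0.05, 10
    expected_false_alarms = alpha * n_metrics       # 0.5 per rollout
    p_at_least_one = 1 - (1 - alpha) ** n_metrics   # ~0.40 per rollout
    print(expected_false_alarms, round(p_at_least_one, 2))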

To manage false positive rates, teams can set stricter thresholds for auto-rollback triggers than for manual review. A metric might flag for human review at p < 0.05 but trigger auto-rollback only at p < 0.01. Confidence's multiple testing correction framework helps keep overall error rates controlled.
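
A two-tier scheme combined with a simple Bonferroni correction might look like the sketch below. The Bonferroni choice is an assumption for illustration; Confidence's own multiple testing framework may adjust differently.

    def classify(p_value: float, n_metrics: int,
                 review_alpha: float = 0.05,
                 rollback_alpha: float = 0.01) -> str:
        """Bonferroni-adjusted two-tier thresholds across n metrics."""
        if p_value < rollback_alpha / n_metrics:
            return "auto_rollback"
        if p_value < review_alpha / n_metrics:
            return "flag_for_review"
        return "ok"

    print(classify(0.0005, n_metrics=10))  # auto_rollback
    print(classify(0.0030, n_metrics=10))  # flag_for_review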

Masking systemic issues. If a feature keeps getting auto-rolled back and the team keeps fixing and re-rolling out, the auto-rollback is treating symptoms rather than causes. The fix might be to rethink the feature, not to keep iterating on it.

Over-reliance. Auto-rollback is a safety net, not a substitute for careful rollout practices. Teams that trust auto-rollback to catch everything might skip canary phases or advance rollout stages too quickly. The system catches metric regressions. It doesn't catch regressions in metrics you forgot to monitor.