Observational bias is systematic error introduced when the data collection or analysis process produces results that consistently differ from the truth. In experimentation, it most commonly appears when users aren't randomly assigned to groups, when measurement differs between treatment and control, or when the analysis selects a non-representative subset of the data.
Randomized experiments exist specifically to eliminate observational bias. When bias enters an experiment through broken instrumentation, differential measurement, or post-hoc analysis choices, it undermines the entire purpose of running the test.
What forms does observational bias take in experiments?
Selection bias occurs when the process that determines which users enter the analysis is correlated with the outcome. Comparing users who opted into a feature against those who didn't tells you about the users, not the feature. Self-selected groups differ systematically: early adopters are more engaged, more technical, and more tolerant of rough edges. Any comparison between these groups confounds the treatment effect with the selection effect.
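As a sketch (with invented numbers), here is a small simulation in which a feature has zero true effect, yet an opt-in comparison shows a lift because the same latent engagement drives both opting in and converting:

```python
import random

random.seed(0)

# Hypothetical population: a latent "engagement" trait drives both the
# decision to opt in and the conversion outcome. The feature itself has
# zero true effect on conversion in this simulation.
def simulate_user():
    engagement = random.random()                     # latent user trait
    opted_in = random.random() < engagement          # engaged users opt in more
    converted = random.random() < 0.2 + 0.3 * engagement
    return opted_in, converted

users = [simulate_user() for _ in range(100_000)]

def rate(group):
    return sum(c for _, c in group) / len(group)

opt_in = [u for u in users if u[0]]
opt_out = [u for u in users if not u[0]]

# Opt-in users convert noticeably more, despite a zero treatment effect:
print(f"opt-in conversion:  {rate(opt_in):.3f}")
print(f"opt-out conversion: {rate(opt_out):.3f}")
```

The gap here is pure selection effect: comparing the two self-selected groups measures who the users are, not what the feature did.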
Measurement bias occurs when the instrumentation captures data differently across variants. If the treatment variant logs events at a different point in the code path than control, completion rates can differ for reasons that have nothing to do with user behavior. A team at Spotify discovered a checkout experiment showing a 2% lift that disappeared once a logging discrepancy was fixed: the treatment variant fired its conversion event slightly earlier in the flow.
Survivorship bias appears when your analysis only includes users who completed the full experiment period, excluding those who dropped off. If the treatment causes more dropoffs (or fewer), the surviving population is no longer representative, and the comparison is biased.
Attrition bias is the specific case where users leave the experiment at different rates across variants. This often manifests as a sample ratio mismatch, which is why Confidence checks for SRM automatically on every experiment.
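A minimal SRM check can be sketched as a chi-square goodness-of-fit test against the expected split. The function below is an illustration, not Confidence's actual implementation; it assumes a 50/50 assignment by default and the conventional 5% significance threshold:

```python
# Sample-ratio-mismatch (SRM) check as a chi-square goodness-of-fit test
# with 1 degree of freedom; 3.841 is the critical value at alpha = 0.05.

def srm_detected(n_control: int, n_treatment: int,
                 expected_ratio: float = 0.5,
                 critical_value: float = 3.841) -> bool:
    total = n_control + n_treatment
    expected_control = total * expected_ratio
    expected_treatment = total * (1 - expected_ratio)
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_treatment - expected_treatment) ** 2 / expected_treatment)
    return chi2 > critical_value

print(srm_detected(50_000, 50_200))  # small imbalance: no SRM flagged
print(srm_detected(50_000, 52_000))  # ~2% excess on 100k users: SRM flagged
```

Note how sensitive the test becomes at scale: a 2% imbalance that would be unremarkable in a small sample is an unmistakable red flag at 100,000 users.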
How do randomized experiments prevent observational bias?
Randomization ensures that, in expectation, every user characteristic is balanced across groups. There's no selection process that could favor one group over another. The only systematic difference between treatment and control is the treatment itself.
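This balance shows up directly in simulation. The covariate below (prior weekly sessions, with an invented skewed distribution) is never consulted during assignment, yet its mean ends up nearly identical in both groups:

```python
import random

random.seed(1)

# Hypothetical covariate: prior weekly sessions, drawn from a skewed
# (exponential) distribution with mean 5. Assignment ignores it entirely.
population = [random.expovariate(1 / 5.0) for _ in range(200_000)]

control, treatment = [], []
for sessions in population:
    # Coin-flip assignment, independent of the covariate.
    (treatment if random.random() < 0.5 else control).append(sessions)

def mean(xs):
    return sum(xs) / len(xs)

print(f"control mean sessions:   {mean(control):.2f}")
print(f"treatment mean sessions: {mean(treatment):.2f}")
```

The same logic applies to every covariate at once, observed or not, which is what makes randomization so powerful.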
But randomization only prevents bias at the point of assignment. Bias can still enter through:
- Differential attrition after assignment
- Asymmetric instrumentation between variants
- Analysis choices made after seeing the data (the garden of forking paths)
- Post-treatment filtering that conditions on affected outcomes
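Post-treatment filtering is easy to underestimate, so here is a sketch with hypothetical numbers: the treatment's true effect on conversion is zero, but it also makes users more "active", and restricting the analysis to active users manufactures a difference:

```python
import random

random.seed(2)

# Sketch of post-treatment filtering bias. A latent user trait drives both
# activity and conversion; treatment boosts activity but NOT conversion.
def simulate(treated: bool):
    trait = random.random()                               # latent user trait
    active = random.random() < trait + (0.2 if treated else 0.0)
    converted = random.random() < 0.1 + 0.4 * trait
    return active, converted

control = [simulate(False) for _ in range(100_000)]
treatment = [simulate(True) for _ in range(100_000)]

def conv_rate(group, active_only):
    rows = [c for a, c in group if a or not active_only]
    return sum(rows) / len(rows)

# Full randomized comparison: no difference (true effect is zero).
print(conv_rate(control, False), conv_rate(treatment, False))
# Filtered to active users only: a spurious difference appears.
print(conv_rate(control, True), conv_rate(treatment, True))
```

Filtering after assignment undoes the randomization: the treatment pulls lower-trait users into the "active" subset, so the filtered groups are no longer comparable even though the original assignment was random.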
This is why a complete experimentation platform doesn't just randomize. It also checks for sample ratio mismatches, logs exposure symmetrically for both variants, and runs pre-specified analyses. Confidence automates these checks because each one closes a specific channel through which bias could enter.
How is observational bias different from random error?
Random error (noise) averages out with larger samples. Observational bias does not. If your logging is broken in a way that overcounts conversions in treatment by 0.3%, that bias persists no matter how many users you add. More data makes the biased estimate more precise, not more accurate.
This distinction matters because underpowered experiments and biased experiments look similar on the surface (both produce unreliable results), but the fixes are completely different. A larger sample reduces noise; it does not reduce bias.
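A short simulation makes the distinction concrete (the rates are invented; the 0.3% overcount mirrors the logging example above). As the sample grows, the estimate converges ever more tightly on the wrong number:

```python
import random

random.seed(3)

# Noise shrinks with sample size; bias does not. The true conversion rate
# is 10%, but broken logging overcounts by an absolute 0.3%.
TRUE_RATE, LOGGING_BIAS = 0.10, 0.003

def measured_rate(n):
    # Every logged "conversion" includes the instrumentation bias.
    hits = sum(random.random() < TRUE_RATE + LOGGING_BIAS for _ in range(n))
    return hits / n

for n in (1_000, 100_000, 1_000_000):
    print(f"n={n:>9,}  measured={measured_rate(n):.4f}  (truth=0.1000)")
```

The estimates tighten around 10.3%, not 10%: precision improves with every added user, accuracy never does.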