Culture & Organization

What is experimentation theatre?

Experimentation theatre is the practice of running experiments without the rigor or organizational commitment to act on results. The experiments happen. The dashboards exist. But the outcomes don't change what ships. It's the organizational equivalent of going through the motions: the form of experimentation without the substance.

This matters because experimentation theatre is expensive. It consumes engineering time to instrument, analyst time to review, and experiment bandwidth that could go to tests the organization would actually learn from. Worse, it creates a false sense of confidence. Teams believe they're making evidence-based decisions because they ran an experiment, even when the experiment was designed, analyzed, or interpreted in ways that couldn't produce trustworthy evidence.

What does experimentation theatre look like in practice?

It has several recognizable patterns.

Running experiments after the decision is already made. The roadmap is set, the feature is built, and the experiment exists to provide post-hoc justification. The hypothesis is reverse-engineered from the implementation. If the results come back negative, the team ships anyway because the commitment was made months ago. The experiment becomes a rubber stamp.

Underpowered tests that can't detect realistic effects. A team runs an experiment with 5% of traffic for two weeks, checks the dashboard, sees no statistical significance, and concludes "no effect." In reality, the test had 15% power: it was unlikely to detect a real effect even if one existed. The team learns nothing but believes it learned that the feature doesn't matter.
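To see how a team can diagnose this, here is a minimal post-hoc power check using statsmodels. The baseline rate, effect size, and traffic numbers are hypothetical illustrations, not figures from the example above.

```python
# Post-hoc power check for a two-proportion z-test.
# All numbers here are hypothetical illustrations.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.10       # assumed baseline conversion rate
target = 0.105        # smallest effect worth detecting: +0.5pp
n_per_arm = 2_000     # roughly what 5% of traffic for two weeks might yield

effect = proportion_effectsize(target, baseline)  # Cohen's h
analysis = NormalIndPower()
power = analysis.power(effect_size=effect, nobs1=n_per_arm,
                       alpha=0.05, ratio=1.0, alternative="two-sided")
print(f"achieved power: {power:.0%}")  # well below the conventional 80% target

# Invert the question: how many users per arm would 80% power require?
needed = analysis.solve_power(effect_size=effect, power=0.8,
                              alpha=0.05, alternative="two-sided")
print(f"required per arm: {needed:,.0f}")
```

Running the sample-size calculation before the test, rather than reading the dashboard after it, is what separates a real experiment from the theatre version.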

Cherry-picking metrics after the fact. The experiment has ten metrics. Nine show no effect. One shows p < 0.05. The team declares victory on that one metric without applying multiple testing corrections. The probability of at least one false positive across ten independent metrics at alpha = 0.05 is roughly 40%.
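The arithmetic is worth making explicit. With k independent metrics each tested at significance level alpha, the chance that at least one crosses the threshold by luck alone is 1 - (1 - alpha)^k. A Bonferroni correction, one standard fix, tests each metric at alpha / k instead:

```python
# Family-wise error rate for ten independent metrics at alpha = 0.05,
# and the same rate after a Bonferroni correction.
alpha, k = 0.05, 10

fwer = 1 - (1 - alpha) ** k
print(f"uncorrected family-wise error rate: {fwer:.1%}")  # ~40.1%

bonferroni_alpha = alpha / k
fwer_corrected = 1 - (1 - bonferroni_alpha) ** k
print(f"per-metric threshold: {bonferroni_alpha}")                 # 0.005
print(f"corrected family-wise error rate: {fwer_corrected:.1%}")   # ~4.9%
```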

Ignoring guardrail regressions. The success metric improved, but a guardrail metric (latency, error rate, retention) degraded. The team ships because the success metric is what was promised to leadership. The guardrail regression gets filed as "something to monitor." At Spotify, 42% of experiments are rolled back based on guardrail regressions. In organizations practicing experimentation theatre, the rollback rate is close to zero, not because the features are better, but because the guardrails aren't enforced.
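What enforcement looks like is simple to express in code. The sketch below treats any statistically significant guardrail regression as a hard block on shipping; the metric names, sign convention, and numbers are hypothetical, not Spotify's actual rollback policy.

```python
# A ship/no-ship gate that treats guardrails as binding.
# Convention: relative_change is oriented so positive is always good
# (a latency increase therefore shows up as a negative change).
from dataclasses import dataclass

@dataclass
class MetricResult:
    name: str
    relative_change: float  # +0.02 means 2% better
    significant: bool       # after multiple-testing correction

def ship_decision(success: MetricResult,
                  guardrails: list[MetricResult]) -> str:
    # A significant guardrail regression blocks the ship outright,
    # no matter how good the success metric looks.
    regressions = [g for g in guardrails
                   if g.significant and g.relative_change < 0]
    if regressions:
        names = ", ".join(g.name for g in regressions)
        return f"rollback: guardrail regression in {names}"
    if success.significant and success.relative_change > 0:
        return "ship: success metric improved, guardrails intact"
    return "no ship: no detectable improvement"

print(ship_decision(
    MetricResult("conversion", +0.031, True),
    [MetricResult("p95_latency", -0.040, True),      # latency regressed
     MetricResult("day7_retention", -0.002, False)],
))  # -> rollback: guardrail regression in p95_latency
```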

Testing only what's safe. The organization runs experiments on button colors and copy changes but ships major architectural changes, pricing decisions, and algorithm overhauls without testing. The experiment program exists but is limited to low-stakes decisions where the results don't threaten anyone's plans.

Why does experimentation theatre persist?

The incentives are misaligned. Product managers are rewarded for shipping features, not for learning that features don't work. Engineering teams are evaluated on velocity, not on decision quality. Leadership wants the narrative of being "data-driven" without the organizational discomfort of having data contradict their strategy.

There's also a skills gap. Running a trustworthy experiment requires understanding statistical power, multiple testing, metric selection, and the difference between correlation and causation. When these skills are concentrated in a small analytics team that reviews experiments after the fact, the structural decisions (sample size, duration, metric choice) have already been made by people who may not understand the implications.

How do you move from theatre to real experimentation?

Three structural changes help.

First, require hypotheses before implementation starts, not after. A team that writes "we believe changing X will improve Y because Z" before building anything is forced to think about what they're testing and why. Hypothesis-driven development makes it harder to reverse-engineer justifications.
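One lightweight way to enforce this is to make the hypothesis a required artifact rather than a prose convention. The sketch below is illustrative only; the field names are hypothetical.

```python
# A hypothesis-first intake check: the record cannot be created until
# "we believe changing X will improve Y because Z" is actually filled in.
from dataclasses import dataclass

@dataclass(frozen=True)
class Hypothesis:
    change: str     # X: what will be changed
    metric: str     # Y: the metric expected to move
    rationale: str  # Z: why the change should move it

    def __post_init__(self):
        missing = [name for name, value in [("change", self.change),
                                            ("metric", self.metric),
                                            ("rationale", self.rationale)]
                   if not value.strip()]
        if missing:
            raise ValueError(f"hypothesis incomplete: missing {missing}")

# A complete hypothesis passes; an empty rationale would raise
# before any implementation work is scheduled.
h = Hypothesis(change="shorten the signup form to three fields",
               metric="signup completion rate",
               rationale="drop-off concentrates on the optional fields")
```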

Second, make the platform enforce statistical rigor by default. If sample size calculations, guardrail metrics, and multiple testing corrections are built into the workflow rather than optional add-ons, teams can't accidentally skip them. Confidence encodes fifteen years of Spotify's statistical methodology into the defaults for this reason.
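Here is what "enforced by default" can mean in practice: the workflow computes the required sample size itself and refuses to launch a test that can't reach adequate power. The sketch below illustrates the principle only; it is not Confidence's actual API, and all numbers are hypothetical.

```python
# Hypothetical pre-launch gate: compute the required duration from the
# minimum detectable effect and refuse to start an underpowered test.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

def required_days(baseline_rate: float, mde_abs: float,
                  daily_users_per_arm: int, max_days: int = 28,
                  alpha: float = 0.05, power: float = 0.80) -> int:
    effect = proportion_effectsize(baseline_rate + mde_abs, baseline_rate)
    needed = NormalIndPower().solve_power(effect_size=effect, power=power,
                                          alpha=alpha,
                                          alternative="two-sided")
    days = -(-needed // daily_users_per_arm)  # ceiling division
    if days > max_days:
        raise RuntimeError(
            f"underpowered: needs {needed:,.0f} users/arm "
            f"({days:.0f} days at current traffic); widen the MDE, "
            f"raise the traffic allocation, or don't run the test")
    return int(days)

print(required_days(baseline_rate=0.10, mde_abs=0.005,
                    daily_users_per_arm=4_000))  # -> 15
```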

Third, celebrate learning, not just wins. Spotify's Experiments with Learning framework measures learning rate (64%) separately from win rate (12%). When the organization treats "we tested this and learned it doesn't work" as a valuable outcome, the incentive to run rubber-stamp experiments disappears. The experiment's job is to produce evidence, not to produce a green light.