Lesson 3: Why you need randomized controlled experiments

Example: The effectiveness of a weight-loss program

Let's imagine that you own a gym, and you want to offer a weight-loss program to your customers. You want to know how effective your program is, so you design an experiment.

Experiment design 1: Uncontrolled trial

You stand at the entrance of your gym and look for volunteers to participate in your program so you can measure its effectiveness. You weigh the people who want to join before they enroll. After a 6-week program consisting of exercise schedules and dietary advice, you weigh them again. You calculate the average difference in weight before and after the program, and find that people lost 3 kg on average. You celebrate a great success and start advertising your program!

The problem with an uncontrolled trial

You don't know what would have happened if people hadn't enrolled in your program. All participants wanted to lose weight, and maybe they would have done so without your program. People's weight fluctuates over time. People who had just gained some weight (for example, after the holidays) may be more motivated to sign up. They may then lose that weight again simply because they returned to their normal lifestyle. If you want to know the effectiveness of your program, you need to compare it with a situation without it.
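The following Python simulation is a minimal sketch of that argument, not a model of real gym members. Every number in it is an assumption made up for illustration: a usual weight around 80 kg, day-to-day fluctuations of about 2 kg, and the rule that only people currently more than 1 kg above their usual weight volunteer. The program is assumed to do nothing at all.

```python
import random
import statistics

def simulate_uncontrolled_trial(n_people=10_000, seed=0):
    """Average weight change (kg) among volunteers when the program has NO effect."""
    rng = random.Random(seed)
    changes = []
    for _ in range(n_people):
        usual_weight = rng.gauss(80, 10)         # hypothetical long-run weight
        before = usual_weight + rng.gauss(0, 2)  # weight on the day you recruit
        after = usual_weight + rng.gauss(0, 2)   # weight 6 weeks later
        # Assumption: only people currently above their usual weight sign up.
        if before > usual_weight + 1:
            changes.append(after - before)
    return statistics.mean(changes)

avg_change = simulate_uncontrolled_trial()
print(f"Average change with a program that does nothing: {avg_change:.2f} kg")
```

Under these assumptions the volunteers lose weight on average even though the program has no effect, because they were recruited at a temporary high point and drift back toward their usual weight.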

Experiment design 2: You need a control group

You stand at the entrance of your gym and look for volunteers to participate in your program so you can measure its effectiveness. You enroll the people who want to join. You ask the people who don't want to participate to be part of a control group. You weigh both groups before and after the program. After the program, you calculate the change in weight for each group, and then the difference between the two groups' changes. You find that your treatment group lost more weight than the control group! You celebrate a great success and start advertising your program!

The problem with observational control groups

Your treatment and control groups are not comparable. The treatment group wanted to lose weight, and the control group didn't. The control group may contain people who joined the gym to become stronger, and who may even have gained weight from building muscle! This mechanism is called selection bias, and it happens when groups are selected in a way that biases the result. Selection bias makes the groups incomparable, so the difference between them no longer tells you what the program did. To avoid selection bias, you need a method of assigning people to the treatment and control groups that can't be correlated with the outcome you plan to measure.
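To make selection bias concrete, here is a small Python sketch of design 2. The numbers are invented assumptions: half of the gym members want to lose weight and drift down by 1 kg over 6 weeks on their own, the other half train for strength and gain 1 kg, and the program itself adds nothing. People choose their own group, exactly as in the design above.

```python
import random
import statistics

def simulate_self_selected_groups(n_people=10_000, seed=1):
    """Apparent program effect (kg) when people choose their own group."""
    rng = random.Random(seed)
    treatment_changes, control_changes = [], []
    for _ in range(n_people):
        wants_to_lose_weight = rng.random() < 0.5
        # Assumed natural 6-week trend, independent of the program.
        trend = -1.0 if wants_to_lose_weight else 1.0
        change = trend + rng.gauss(0, 2)  # the program contributes 0 kg
        if wants_to_lose_weight:          # self-selection into the program
            treatment_changes.append(change)
        else:
            control_changes.append(change)
    # Difference between the two groups' average weight changes.
    return statistics.mean(treatment_changes) - statistics.mean(control_changes)

apparent_effect = simulate_self_selected_groups()
print(f"Apparent effect of a program that does nothing: {apparent_effect:.2f} kg")
```

The comparison reports a sizeable "effect" that comes entirely from who ended up in which group, not from the program.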

Experiment design 3: Randomized controlled trial

You want to get an unbiased estimate of the effectiveness of your weight-loss program. For this, you need to compare people who took the program to a control group that is comparable in all other relevant aspects, except for the fact that they have not taken the program. For this example, you could do the following instead. You again stand at the entrance of your gym and ask people if they are interested in participating in your weight-loss program. If they say "no", then they don't participate in the trial. If they say "yes", you weigh them and then flip a coin. Based on the coin flip, you either enroll them right away, or you tell them "The program starts in 6 weeks, come back then!". This way you remove the selection bias between the groups. The random assignment makes sure that, on average, the groups are similar across all characteristics except for the treatment that you give them. Of course, you still need to make sure that you can collect the data from everybody in both the treatment and the control group after 6 weeks!
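The coin-flip design can be sketched in Python as follows. This is an illustration under made-up assumptions, not a recipe for a real trial: volunteers weigh about 85 kg, they would lose about 1 kg on their own, and the program has a hypothetical true effect of -1.5 kg. The function name and parameters are chosen here for the example.

```python
import random
import statistics

def run_randomized_trial(n_volunteers=10_000, true_effect=-1.5, seed=2):
    """Return (baseline weight gap, estimated effect) for a coin-flip trial."""
    rng = random.Random(seed)
    baseline = {"treatment": [], "control": []}
    change = {"treatment": [], "control": []}
    for _ in range(n_volunteers):
        weight_before = rng.gauss(85, 10)  # weighed before the coin flip
        natural_change = rng.gauss(-1, 2)  # change without the program
        # The coin flip: assignment cannot depend on anything about the person.
        group = "treatment" if rng.random() < 0.5 else "control"
        effect = true_effect if group == "treatment" else 0.0
        baseline[group].append(weight_before)
        change[group].append(natural_change + effect)
    baseline_gap = (statistics.mean(baseline["treatment"])
                    - statistics.mean(baseline["control"]))
    estimated_effect = (statistics.mean(change["treatment"])
                        - statistics.mean(change["control"]))
    return baseline_gap, estimated_effect

baseline_gap, estimated_effect = run_randomized_trial()
print(f"Baseline weight gap between groups: {baseline_gap:.2f} kg")
print(f"Estimated program effect: {estimated_effect:.2f} kg")
```

Because the coin decides the group, the starting weights of the two groups should be nearly identical, and the difference in weight change should land close to the assumed true effect.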

Randomized controlled trials

Experiments, like A/B tests and rollouts, split users into two (or more) groups by random assignment. The random assignment makes sure that the groups are, on average, similar in all aspects except for the change you want to test. For example, if you randomly split all Spotify users into two groups, the two groups should be very similar in terms of dimensions like demographics, connection speed, and music taste. One group gets the status of a "treatment" group and receives the new feature. The other group, the control group, keeps the default experience. You can then observe the users over time while they receive the two different experiences, and measure some outcome of interest, for example churn, daily activity, or the number of minutes played. At the end of the experiment, you run a statistical test to check whether the difference between the groups is larger than what you would expect from chance alone if the change had no effect.
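One way to run such a statistical test is a permutation test, sketched below in Python. It is only one of several possible tests and is picked here because it follows directly from the random assignment: if the change had no effect, the group labels are arbitrary, so you can reshuffle them many times and see how often a difference at least as large as the observed one appears. The "minutes played" data are simulated, and the function name and parameters are hypothetical.

```python
import random

def mean(values):
    return sum(values) / len(values)

def permutation_test(treatment, control, n_permutations=2_000, seed=3):
    """Two-sided permutation test for a difference in means."""
    rng = random.Random(seed)
    observed = mean(treatment) - mean(control)
    pooled = list(treatment) + list(control)
    n_treatment = len(treatment)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = mean(pooled[:n_treatment]) - mean(pooled[n_treatment:])
        if abs(diff) >= abs(observed):
            at_least_as_extreme += 1
    # Add-one correction so the p-value is never exactly zero.
    p_value = (at_least_as_extreme + 1) / (n_permutations + 1)
    return observed, p_value

# Simulated minutes played per user (invented numbers, not real data).
data_rng = random.Random(42)
treatment_minutes = [data_rng.gauss(65, 10) for _ in range(200)]
control_minutes = [data_rng.gauss(60, 10) for _ in range(200)]

observed_diff, p_value = permutation_test(treatment_minutes, control_minutes)
print(f"Observed difference: {observed_diff:.2f} minutes, p-value: {p_value:.4f}")
```

A small p-value means that a difference this large would rarely show up if the new feature made no difference, which is the comparison described in the paragraph above.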
