A Collection of Data Science Take-Home Challenges
We are launching a new driver app with a better UI. The goal is to increase driver earnings by increasing the number of trips drivers take. Outline a testing strategy to see if the new app is better than the old one.
The naive answer here would be: I pick a few markets that are representative of the entire population and, in each market, I randomly split drivers into test and control. Then I run a statistical test on the target metric and check if the new app is winning. The reason this would fail is that test and control wouldn't be independent.
Let's say I take all drivers in San Francisco, give 50% of them the new app, and let the other 50% keep using the old app. If the new app is actually making drivers take more trips, that will result in more competition for the drivers using the old app and, therefore, will affect their earnings. The opposite is also true. If the new app sucks and those drivers drive less, this will also affect old app driver earnings, since there is now less competition. It is extremely hard to design an A/B test in marketplaces or social networks since users are all connected.
If you get a question like this, a quick way to check whether you can simply split users at random is to think through the extreme cases. Let's say the new product has a bug and is unusable. Or the new product is amazing and test users will use it 24/7. Would either of these have any effect on the control group? If the answer is yes (as it obviously is in the Uber case), you can't just randomly split users.
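To make the extreme-case check concrete, here is a minimal Python sketch of a toy marketplace. It is only an illustration under strong assumptions that are not part of the question: rider demand in the city is fixed, trips are shared among drivers in proportion to how active they are, and the function name `expected_trips_per_driver` and all the numbers are made up.

```python
def expected_trips_per_driver(n_test, n_control, demand, test_activity=1.0):
    """Share a fixed pool of rider requests in proportion to driver activity.

    Assumption (toy model): control drivers have activity 1.0 and test
    drivers have activity `test_activity` (0.0 = app unusable, 2.0 = drivers
    are twice as active). Returns (trips per test driver, trips per control
    driver).
    """
    total_activity = n_test * test_activity + n_control * 1.0
    if total_activity == 0:
        return 0.0, 0.0
    trips_per_unit = demand / total_activity
    return test_activity * trips_per_unit, trips_per_unit


DEMAND = 10_000  # hypothetical daily rider requests in one city

# Baseline: 1,000 drivers, all on the old app.
_, baseline = expected_trips_per_driver(0, 1000, DEMAND)

# Extreme case 1: 50/50 split and the new app is unusable.
_, control_if_bug = expected_trips_per_driver(500, 500, DEMAND, test_activity=0.0)

# Extreme case 2: 50/50 split and the new app doubles driver activity.
test_if_great, control_if_great = expected_trips_per_driver(
    500, 500, DEMAND, test_activity=2.0
)

# Full rollout of the "great" app: every driver is twice as active.
rollout, _ = expected_trips_per_driver(1000, 0, DEMAND, test_activity=2.0)

print("baseline trips per driver:", baseline)
print("control trips if the new app is broken:", control_if_bug)
print("test vs control if the new app is great:", test_if_great, control_if_great)
print("trips per driver after a full rollout:", rollout)
```

In this toy model the control group moves in both extreme cases: control drivers go from 10 trips to 20 when the new app is broken, and drop to about 6.7 when it is great. The 50/50 split would report test drivers doing twice as many trips as control, yet a full rollout leaves every driver at 10 trips, because demand did not change. Real markets are not purely demand-constrained, so the true bias would be smaller, but the direction of the problem is the same.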
Best way to answer:
- State why you can't randomly split users: this is really what the question is looking for.
- Therefore, decide you will test by market. That is, you will match comparable markets in pairs and, for each pair, give everyone in one market the new app and everyone in the other one the old app. Comparable means that, during the test period, the main metrics would be expected to be very similar if there were no test.
- To choose how many markets you need, define in advance the sample size needed for a t-test. To identify the required sample size, choose the power, the significance level, the minimum difference between test and control you want to detect, and the standard deviation of the metric. Review how to estimate t-test sample size. It is often asked.
- Run the test and, after having reached the required sample size, check if results are significant.
- [Bonus]: check for novelty effect. Users use a product more when it is new. Not because it is better, but simply because it is new. As the novelty wears off, they will use it less. This is called the novelty effect and it often makes tests look like winners when they are not.
You can control for this by subsetting your results to drivers for whom this is their first experience with the app. The novelty effect obviously doesn't affect new users. If a test is winning overall, but it is not winning when comparing new users in test vs new users in control, it is a big warning that there might be a novelty effect [more on the novelty effect in other questions in this ebook, since it is a common topic, often asked like this: "We ran a test. It won by 5%, but, after making the change for all users and waiting for a couple of weeks, we didn't see any improvement in our metric. Why?"].
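Going back to the sample size step in the list above, here is a minimal Python sketch of the usual calculation. It uses the normal approximation to the t-test, which is an assumption on my side and slightly understates the number needed when the sample is small. The function names and the example inputs (a minimum difference of 1 trip per driver per day and a standard deviation of 2) are hypothetical. Since the test is by market, the "sample" here is markets, not drivers.

```python
from math import ceil
from statistics import NormalDist


def z_sum(alpha, power):
    """Normal quantile for a two-sided test at `alpha` plus the quantile for `power`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return z_alpha + z_power


def markets_per_group(min_diff, std_dev, alpha=0.05, power=0.8):
    """Two independent groups: n per group = 2 * ((z_alpha + z_power) * sd / diff)^2."""
    return ceil(2 * (z_sum(alpha, power) * std_dev / min_diff) ** 2)


def market_pairs(min_diff, std_dev_of_pair_diff, alpha=0.05, power=0.8):
    """Matched pairs: n pairs = ((z_alpha + z_power) * sd of pair differences / diff)^2."""
    return ceil((z_sum(alpha, power) * std_dev_of_pair_diff / min_diff) ** 2)


# Hypothetical inputs: detect a difference of 1 trip per driver per day,
# with a standard deviation of 2 across markets (or across pair differences).
print("markets per group (unpaired):", markets_per_group(min_diff=1.0, std_dev=2.0))
print("market pairs (paired):", market_pairs(min_diff=1.0, std_dev_of_pair_diff=2.0))
```

The thing to remember for the interview is how each input moves the answer: a smaller minimum difference, a larger standard deviation, a lower significance level, or a higher power all require more markets. Halving the minimum difference roughly quadruples the sample size. Good market matching helps because it shrinks the standard deviation of the within-pair differences, which is the one that matters in the paired version.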