Why the uplift in A/B tests often differs from real-world results


Allon Korem
Chief Executive Officer

As someone who has worked with numerous clients over the years, I've frequently encountered the same frustration: A/B tests show promising results, but once the feature is launched, the anticipated uplift just doesn’t materialize.

This disconnect can be puzzling and disappointing, especially when decisions and expectations are built around these tests. Understanding why the uplift seen in A/B tests often differs from real-world outcomes is essential for product managers, data scientists, and stakeholders.

Today, we'll explore some of the key reasons behind this discrepancy and offer insights on how to manage expectations effectively.

Human bias in analysis and interpretation

Human bias plays a significant role in the disconnect between A/B test results and real-world performance. The natural inclination to “win” a test can introduce bias into both the analysis and interpretation of results. Confirmation bias, where analysts favor information that supports their preconceptions, can lead to selective reporting of positive outcomes while overlooking negative or neutral results.

A common example I’ve encountered with clients involves tests that yield inconclusive (non-significant) results. Instead of ending the test, they often decide to let it run longer to “give the treatment an opportunity to win.” If the results later turn positive (significant), they immediately close the test, interpreting the delayed success as a true positive effect. This approach, however, is highly susceptible to bias and can lead to misleading conclusions.
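
Extending a test and re-checking whenever it happens to turn significant is a form of repeated peeking, and it erodes the error guarantees quickly. Here is a minimal simulation sketch (illustrative only, with made-up conversion rates): it runs an A/A test in which both groups share the same conversion rate, then compares a fixed-horizon analysis against one that peeks after every batch of users and stops at the first "significant" result.

```python
# Illustrative A/A simulation: both groups share the same conversion rate, so any
# "significant" result is a false positive. "Peeking" re-tests after every batch
# of users and stops at the first significant look; the fixed-horizon version
# tests only once, at the end.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def one_tailed_p(conv_a, n_a, conv_b, n_b):
    """One-tailed z-test p-value for 'treatment converts better than control'."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 1 - stats.norm.cdf(z)

def false_positive(base_rate=0.05, batch=1000, looks=10, alpha=0.05, peek=True):
    conv_a = conv_b = n = 0
    for _ in range(looks):
        conv_a += rng.binomial(batch, base_rate)
        conv_b += rng.binomial(batch, base_rate)  # same rate: there is nothing to find
        n += batch
        if peek and one_tailed_p(conv_a, n, conv_b, n) < alpha:
            return True  # stopped early and declared a (false) win
    return one_tailed_p(conv_a, n, conv_b, n) < alpha

sims = 2000
with_peeking = np.mean([false_positive(peek=True) for _ in range(sims)])
fixed_horizon = np.mean([false_positive(peek=False) for _ in range(sims)])
print(f"False-positive rate with peeking:   {with_peeking:.1%}")  # well above 5%
print(f"False-positive rate, fixed horizon: {fixed_horizon:.1%}")  # close to 5%
```

Even though nothing differs between the groups, the peeking version declares a win far more often than the nominal 5% would suggest.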

Blind analysis and peer review are two strategies that can help counteract this bias. Blind analysis involves analyzing data without knowing which group is the control or treatment, reducing the chance of biased interpretations. Peer review adds an additional layer of scrutiny, where other experts review the methodology and findings, helping to catch any biases or errors that might have been overlooked.
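
As a concrete illustration of blinding, here is a minimal sketch (the function name and workflow are my own, not a standard library API): the experiment owner replaces the real variant names with neutral codes, the analyst works only on the coded data, and the mapping is revealed only after the conclusions are written down.

```python
# Minimal blinding sketch (illustrative workflow): the experiment owner keeps the
# key; the analyst only ever sees neutral codes.
import numpy as np
import pandas as pd

def blind_variants(df, variant_col="variant", seed=None):
    """Return a copy of df with variant labels replaced by shuffled neutral codes,
    plus the key needed to unblind later."""
    rng = np.random.default_rng(seed)
    labels = list(df[variant_col].unique())
    codes = [f"group_{i}" for i in range(len(labels))]
    rng.shuffle(codes)
    key = dict(zip(labels, codes))  # held back from the analyst
    blinded = df.assign(**{variant_col: df[variant_col].map(key)})
    return blinded, key

# Hypothetical usage:
# blinded_df, key = blind_variants(raw_results)  # analyst sees only group_0 / group_1
# ... lift is computed and conclusions are written down on blinded_df ...
# print(key)                                     # unblind only after conclusions are fixed
```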

False positives

False positives occur when an A/B test incorrectly indicates that a change has a significant effect when it doesn’t. This error can mislead stakeholders into believing that a feature will perform better than it actually will post-launch.

Consider this example: suppose only 10% of your A/B test ideas actually have a positive effect, a base rate several large companies have reported. If you run tests with 80% statistical power and a 5% significance level (one-tailed), then 8% of your tests will be true positives (10% × 80%) and 4.5% will be false positives (90% × 5%). In other words, more than a third of your positive, significant results are false: 4.5% / (4.5% + 8%) = 36%. That is a substantial share of misleading results.

While reducing the significance level can decrease the number of false positives, it would also require longer test durations, which may not always be feasible.
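
For readers who want to plug in their own numbers, here is the same arithmetic as a small helper (the names are illustrative, not from any library): it returns the share of significant "wins" that are actually false, given how often your ideas truly work, the test's power, and the significance level.

```python
# The arithmetic above as a reusable helper: the share of significant "wins"
# that are actually false positives.
def share_of_wins_that_are_false(prior_true=0.10, power=0.80, alpha=0.05):
    true_positives = prior_true * power         # good ideas the test detects
    false_positives = (1 - prior_true) * alpha  # null ideas that still come out significant
    return false_positives / (true_positives + false_positives)

print(f"{share_of_wins_that_are_false():.0%}")            # 36% with the numbers above
print(f"{share_of_wins_that_are_false(alpha=0.01):.0%}")  # ~10% with a stricter alpha
```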

Sequential testing and overstated effect sizes

Sequential testing, where data is analyzed at multiple points during the experiment, is a common practice in A/B testing. It allows teams to monitor results and potentially stop a test early if results seem favorable. However, research has shown that even when done properly, sequential testing can introduce bias and overstate effect sizes.

This overestimation occurs because stopping a test at the first sign of positive results might capture a momentary peak rather than a true, long-term effect. While there are methods to correct this bias and create an unbiased estimator, we won’t delve into those techniques in this post. It's crucial, however, to be aware that sequential testing can lead to inflated expectations, which may not hold up in the real world.
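
A quick way to convince yourself of this is a simulation like the sketch below (illustrative assumptions: a normally distributed metric, a true lift of 0.1 standard deviations, and a check after every batch of observations). It uses naive peeking for brevity, which exaggerates the inflation, but even properly corrected sequential designs condition on crossing a boundary and share the same upward bias in the estimate at the moment of stopping.

```python
# Illustrative "winner's curse" simulation: runs that stop early because they
# crossed the significance threshold report, on average, a larger lift than the
# true one.
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
TRUE_LIFT, BATCH, LOOKS, ALPHA = 0.1, 200, 10, 0.05

lifts_at_stop = []
for _ in range(2000):
    a = np.empty(0)
    b = np.empty(0)
    for _ in range(LOOKS):
        a = np.concatenate([a, rng.normal(0.0, 1.0, BATCH)])
        b = np.concatenate([b, rng.normal(TRUE_LIFT, 1.0, BATCH)])
        _, p = stats.ttest_ind(b, a, alternative="greater")
        if p < ALPHA:  # stop at the first significant look
            lifts_at_stop.append(b.mean() - a.mean())
            break

print(f"True lift: {TRUE_LIFT}")
print(f"Average lift reported by tests that stopped early: {np.mean(lifts_at_stop):.3f}")
```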

Novelty effect and user behavior

Another key reason for the discrepancy is the novelty effect. When users are exposed to something new, especially in a controlled testing environment, they often react positively simply because it’s different or exciting. This effect is particularly pronounced among existing users who are already familiar with the product.

For example, imagine a new user interface feature that initially boosts engagement during the A/B test phase. Users, intrigued by the fresh design, may interact with it more frequently. However, as time passes and the novelty wears off, their behavior tends to revert to their usual patterns. The initial uplift observed in the test diminishes, leading to less impressive results post-launch.
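
One practical check, anticipating "Checking Effect Sizes Over Time" in the strategies below, is to slice the lift by how long each user has been exposed to the change. The sketch below assumes hypothetical column names (user_id, variant, days_since_first_exposure, converted); a lift that shrinks as exposure age grows is a hint that you are looking at novelty rather than a durable effect.

```python
# Hypothetical per-user data with columns: user_id, variant ('control'/'treatment'),
# days_since_first_exposure, converted (0/1). Slice the lift by exposure age to see
# whether the effect holds up once the change is no longer new.
import pandas as pd

def lift_by_exposure_day(df: pd.DataFrame) -> pd.DataFrame:
    rates = (df.groupby(["days_since_first_exposure", "variant"])["converted"]
               .mean()
               .unstack("variant"))
    rates["lift"] = rates["treatment"] / rates["control"] - 1
    return rates

# A 'lift' column that shrinks as days_since_first_exposure grows suggests the
# early uplift was novelty rather than a durable change in behavior.
```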

External validity and real-world factors

Another critical factor to consider is external validity, which refers to how well the results of an A/B test generalize to real-world settings. A/B tests are typically conducted in controlled environments where variables can be carefully managed. However, the real world is far more complex, with numerous external factors influencing user behavior.

For instance, seasonality, marketing efforts, or competitive actions can strongly affect a feature's performance post-launch. A feature that performed well during a test in a quiet period might not do as well during a busy season, or vice versa. This variability can open a sizable gap between test results and real-world outcomes.

Limited exposure in testing

The limited exposure of some A/B tests can also lead to discrepancies between test results and real-world performance. Tests are often conducted in specific parts of a product or funnel, meaning that not all users are exposed to the treatment. This can limit the generalizability of the test results.

For example, suppose there are two pages where a product can be purchased, and 50% of purchases are made on each page. If you test only one page and see a significant positive lift in purchases, the real effect on overall purchases, assuming independence between the pages, will be halved. This underlines the importance of considering the full user experience when interpreting A/B test results.
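
The dilution in that example is easy to compute in general, assuming the untested page is unaffected: the sitewide lift is approximately the observed lift multiplied by the tested surface's share of purchases. A tiny helper (my own naming, for illustration) makes this explicit:

```python
# The dilution arithmetic from the example, generalized (assumes the untested
# page is unaffected and behaves independently of the tested one).
def overall_lift(observed_lift, tested_share_of_purchases):
    """Approximate sitewide lift when only part of the funnel was tested."""
    return observed_lift * tested_share_of_purchases

print(f"{overall_lift(0.10, 0.50):.1%}")  # a 10% lift on one of two equal pages ≈ 5.0% overall
```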

Strategies for mitigating discrepancies

To better align test results with real-world performance, teams can consider several strategies:

  1. Repeated Tests: Running repeated tests involves conducting the same test multiple times to verify the results. This approach is excellent for mitigating human bias, false positives, novelty effects, and external validity issues. However, it requires more time and operational resources, which may not always be feasible.
  2. Using Smaller Significance Levels: Reducing the significance level (e.g., from 0.05 to 0.01) directly decreases the likelihood of false positives. This is an easy strategy to implement, but it will extend the test duration, which could be a drawback in time-sensitive situations.
  3. Employing Holdout Groups: Holdout groups involve keeping a segment of users who are not exposed to the test, serving as a control group. While setting this up internally can be challenging, many A/B testing platforms like Statsig and Eppo offer this as a feature. This method has a similar effect to repeated tests, providing more reliable results.
  4. Maintaining a Healthy Skepticism About Test Results: Always approach test results with a critical eye, especially when the outcomes are unexpectedly positive. This mindset is particularly important in avoiding human bias and ensuring objective analysis.
  5. Conducting Blind Analyses: Analyzing data without knowing which group is the control or treatment helps reduce bias. This technique ensures that conclusions are drawn based on the data alone, without any preconceived notions influencing the outcome.
  6. Involving Peer Reviews: Peer review adds an additional layer of scrutiny to your analysis, helping to identify potential biases or errors. This collaborative approach can significantly improve the reliability of your conclusions.
  7. Checking Effect Sizes Over Time: Monitoring effect sizes over time can reveal trends that might not be apparent in a single snapshot. This approach helps identify whether observed effects are stable or if they diminish as the novelty wears off or other factors come into play.
  8. Correcting Biases in Sequential Testing: Being aware of and correcting for biases introduced by sequential testing can help ensure that the effect sizes reported are accurate and reliable.
  9. Calculating the Overall Effect When There Is Limited Exposure in Testing: When tests are conducted in limited areas of a product or funnel, it’s crucial to calculate the overall effect across all relevant areas. This approach ensures that the impact on the entire user base is accurately represented.

By incorporating these strategies, teams can set more realistic expectations and improve the accuracy of their predictions, leading to better decision-making and ultimately more successful product launches.

Takeaways

Understanding the reasons behind the discrepancy between A/B test results and real-world outcomes is crucial for anyone involved in product development and decision-making. By being aware of factors like human bias, false positives, sequential testing, novelty effects, and external validity, and by implementing strategies to mitigate these issues, you can better manage expectations and achieve more reliable results. Ultimately, these practices lead to more informed decisions and successful product launches.
