Note: This post was written in collaboration with Oryah Lancry-Dayan, Lead Statistician
As A/B testing analysts, we often encounter skewed distributions for our Key Performance Indicators (KPIs), especially when dealing with revenue metrics. It’s easy to accept these distributions as they are, but the presence of outliers (extremely high or low values) can quietly disrupt the validity of our tests. Outliers inflate variance, which in turn reduces statistical power, making it harder to detect real effects and leading to misleading conclusions.
The goal of this blog is to build intuition around why outliers can be problematic, explore the challenges of identifying and handling them, and provide practical guidelines for dealing with them effectively. So, if you’ve ever struggled with a skewed KPI distribution (let’s be honest, who hasn’t?), this one is for you.
A fundamental principle in statistical inference is that while it’s impossible to determine whether you’ve made an error in a single experiment, you can control the likelihood of errors across many experiments. Hypothesis testing, the leading statistical procedure in A/B testing, accounts for two main types of errors:
- Type I error (false positive): rejecting the null hypothesis when no real effect exists.
- Type II error (false negative): failing to reject the null hypothesis when a real effect does exist.
The probability of a Type I error (α) is set by analysts before conducting the test, usually at 0.05. However, the probability of a Type II error (β) depends on multiple factors, including sample size, effect size, and variance. This is where outliers can have a significant impact: extreme values increase data variability, inflating variance and potentially leading to underpowered tests—statistical tests that are less likely to detect true effects. Thus, when the data contains outliers and the variance is high, the test may fail to identify meaningful improvements, increasing the risk of a Type II error and making the experiment less reliable.
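Power calculators work with a standardized effect size (Cohen's d, the raw difference divided by the standard deviation), so as outliers inflate the standard deviation, d shrinks and power drops with it. Here is a minimal sketch using statsmodels; the revenue lift and sample size are hypothetical numbers chosen for illustration:

```python
from statsmodels.stats.power import TTestIndPower

lift = 2.0           # hypothetical revenue lift per user
n_per_group = 1_000  # hypothetical sample size per group

analysis = TTestIndPower()
for sd in (10, 20, 40):  # outliers inflate the standard deviation
    d = lift / sd        # Cohen's d shrinks as the variance grows
    power = analysis.power(effect_size=d, nobs1=n_per_group, alpha=0.05)
    print(f"sd={sd:>3}: d={d:.3f}, power={power:.2f}")
```

Holding the lift and sample size fixed, each doubling of the standard deviation halves the standardized effect size, dragging power down sharply.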
Methods for identifying outliers generally fall into two categories: visual and statistical. Visual methods, such as boxplots, scatter plots, and histograms, provide an intuitive way to detect unusual data points by highlighting deviations from expected patterns. These techniques are particularly useful for gaining a quick overview of data distribution and identifying extreme values. You can see a demonstration of these visualization techniques in Figure 1.
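As a quick sketch of what these visual checks look like in code (the revenue sample below is synthetic, generated purely for illustration):

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
# Toy revenue data: roughly 80% zeros plus a heavy right tail.
revenue = np.where(rng.random(5_000) < 0.8, 0.0, rng.lognormal(4, 1.5, 5_000))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(revenue, bins=100)       # a long right tail hints at outliers
ax1.set(title="Histogram", xlabel="revenue")
ax2.boxplot(revenue, vert=False)  # points beyond the whiskers stand out
ax2.set(title="Boxplot", xlabel="revenue")
plt.tight_layout()
plt.show()
```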
However, outlier detection requires a more systematic approach beyond visualization. Statistical techniques, such as Z-scores, quantify how far a data point deviates from the mean in terms of standard deviations. A common threshold considers observations with an absolute Z-score greater than 3 as outliers. Alternatively, quantile-based methods, such as the Interquartile Range (IQR) approach, identify outliers based on data dispersion. The IQR method flags values that fall beyond the first and third quartiles, often using 1.5 times the IQR as a cutoff. Another quantile-based approach treats the highest or lowest 1% of observations as potential outliers, providing a flexible way to account for extreme values in skewed distributions.
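Here is a minimal sketch of these three rules for a generic NumPy array of KPI values; the thresholds (3 standard deviations, 1.5 times the IQR, the top 1%) are the conventional defaults mentioned above rather than universal constants:

```python
import numpy as np

def zscore_outliers(x, threshold=3.0):
    """Flag points more than `threshold` standard deviations from the mean."""
    z = (x - x.mean()) / x.std()
    return np.abs(z) > threshold

def iqr_outliers(x, k=1.5):
    """Flag points more than k * IQR below Q1 or above Q3."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

def top_percent_outliers(x, pct=1.0):
    """Flag the highest pct% of observations (one-sided, for skewed KPIs)."""
    return x > np.percentile(x, 100 - pct)
```

Each function returns a boolean mask, so `x[iqr_outliers(x)]` pulls out the flagged observations directly.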
So, I’ve found out my dataset contains outliers—now what? The answer depends on their source. Some outliers result from errors like data entry mistakes or sensor malfunctions. These should be removed to prevent distortion.
However, our focus is on legitimate outliers (extreme yet valid observations). These present a challenge: they inflate variance and reduce statistical power, but they may also provide critical insights into the effect. For example, a treatment’s impact on revenue might be most noticeable among high-spending players, where behavioral changes are more pronounced. Therefore, handling outliers requires careful consideration, typically following one of two main approaches:
- Trimming: removing the extreme observations from the analysis entirely.
- Winsorization: capping the extreme observations at a chosen percentile, so they stay in the data but with limited influence.
If you're unsure how to apply winsorization, don’t worry, we’ve got you covered! Follow this step-by-step guide to implement it effectively:
1. Choose the winsorization level, i.e., which share of extreme observations to cap (e.g., the top 1% or top 0.1%).
2. Compute the corresponding percentile (e.g., the 99th or 99.9th) on the pooled data from all experiment groups.
3. Replace every observation beyond that percentile with the percentile value itself.
4. Run your statistical test on the winsorized data.
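In code, the whole procedure takes only a few lines. Below is a minimal NumPy sketch of a one-sided cap on the top 1% (the `control` and `test` arrays are synthetic stand-ins); SciPy also ships a ready-made `scipy.stats.mstats.winsorize` if you prefer an off-the-shelf version:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for per-user revenue in each group.
control = rng.lognormal(3, 1.5, 10_000)
test = rng.lognormal(3, 1.5, 10_000) * 1.05

# Steps 1-2: pick a level (top 1%) and compute the cap on the pooled sample.
cap = np.percentile(np.concatenate([control, test]), 99.0)

# Step 3: replace everything above the cap with the cap itself.
control_w = np.minimum(control, cap)
test_w = np.minimum(test, cap)
```

One design note: computing the cap on the pooled sample rather than per group keeps the capping rule identical for both groups, as in the simulation described below.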
To get a better sense of the impact of winsorization, let’s look at a revenue dataset from a gaming company. This is what the data looks like:
A quick look at the histogram makes the outliers apparent: most revenue values are zero and 95% of the data falls below 800, yet a few users generate far more extreme revenues.
To quantify the impact of winsorization on this type of data, we ran a simulation and computed the proportion of times the null hypothesis was rejected (when a true effect exists, this proportion is the statistical power). To achieve this, we randomly divided the data into test and control groups. In the test group, we modified the data by scaling each value by a factor of 1 plus a specified effect size (0%, 1%, 5%, or 10%).
Next, we applied winsorization at three levels—none, top 1%, and top 0.1%—using the pooled sample of both the control and test groups. That is, we calculated the 99th or 99.9th percentile and replaced any more extreme observations with that value. After applying the winsorization, we conducted a t-test to evaluate statistical significance.
This entire process was repeated 1,000 times, allowing us to compute the proportion of simulations where the null hypothesis was rejected, providing insight into how outliers and winsorization impact the power of the test.
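For the curious, here is a condensed sketch of that simulation loop. The revenue sample below is synthetic (mostly zeros with a heavy right tail, mimicking the shape described above), but the procedure follows the steps just described: split, scale the test group, winsorize on the pooled sample, run a t-test, repeat.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Synthetic stand-in for the revenue data: mostly zeros, heavy right tail.
revenue = np.where(rng.random(20_000) < 0.8, 0.0, rng.lognormal(4, 1.5, 20_000))

def run_simulation(effect=0.05, upper_pct=99.0, n_sims=1_000, alpha=0.05):
    """Return the t-test rejection rate after winsorizing at upper_pct."""
    rejections = 0
    for _ in range(n_sims):
        # Randomly split users into control and test; apply the effect to test.
        mask = rng.random(revenue.size) < 0.5
        control, test = revenue[~mask], revenue[mask] * (1 + effect)
        # Winsorize: cap both groups at a percentile of the pooled sample.
        cap = np.percentile(np.concatenate([control, test]), upper_pct)
        control, test = np.minimum(control, cap), np.minimum(test, cap)
        # Two-sample t-test on the winsorized data.
        rejections += stats.ttest_ind(control, test).pvalue < alpha
    return rejections / n_sims

# upper_pct=100 caps nothing, so it serves as the no-winsorization baseline.
print(run_simulation(effect=0.05, upper_pct=99.0))
print(run_simulation(effect=0.05, upper_pct=100.0))
```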
The graph below illustrates the results of our simulation. As expected, when no modification was applied to the data, the proportion of rejections remained close to the significance level (5%). When an effect was introduced in the test group, we observed a clear trend: as the effect size increased, statistical power also increased. Most notably, applying winsorization significantly enhanced power, improving the test's ability to detect true effects. This highlights the crucial role of outlier management in boosting sensitivity and ensuring more reliable A/B test outcomes.
Note that while a winsorization limit of 1% proved superior to one of 0.1% in our simulation, this may not hold for all datasets. Winsorization limits should be chosen carefully, taking into account the specific characteristics of the data at hand.
Outliers can be a real headache: what should you do with them? On one hand, they may be where your effect is most evident; on the other, they can undermine your ability to detect that effect. This duality calls for a balanced approach, one that minimizes the impact of outliers on variance without disregarding the valuable information they may provide. Winsorization offers exactly that: it retains outlier observations while limiting their extremity, and as our simulation shows, it can substantially improve statistical power. So, don’t let a few extreme observations spoil your experiment: just winsorize them!