Note: This post was written in collaboration with Oryah Lancry-Dayan, Lead Statistician
As A/B testing analysts, we often encounter skewed distributions for our Key Performance Indicators (KPIs), especially when dealing with revenue metrics. It’s easy to accept these distributions as they are, but the presence of outliers (extremely high or low values) can quietly disrupt the validity of our tests. Outliers inflate variance, which in turn reduces statistical power, making it harder to detect real effects and leading to misleading conclusions.
The goal of this blog is to build intuition around why outliers can be problematic, explore the challenges of identifying and handling them, and provide practical guidelines for dealing with them effectively. So, if you’ve ever struggled with a skewed KPI distribution (let’s be honest, who hasn’t?), this one is for you.
A fundamental principle in statistical inference is that while it’s impossible to determine whether you’ve made an error in a single experiment, you can control the likelihood of errors across many experiments. Hypothesis testing, the leading statistical procedure in A/B testing, accounts for two main types of errors:
- Type I error (false positive): rejecting the null hypothesis when no real effect exists.
- Type II error (false negative): failing to reject the null hypothesis when a real effect does exist.
The probability of a Type I error (α) is set by analysts before conducting the test, usually at 0.05. However, the probability of a Type II error (β) depends on multiple factors, including sample size, effect size, and variance. This is where outliers can have a significant impact: extreme values increase data variability, inflating variance and potentially leading to underpowered tests—statistical tests that are less likely to detect true effects. Thus, when the data contains outliers and the variance is high, the test may fail to identify meaningful improvements, increasing the risk of a Type II error and making the experiment less reliable.
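Power calculators work with a standardized effect size (Cohen's d, the raw difference divided by the standard deviation), so as outliers inflate the standard deviation, d shrinks and power drops with it. Here is a minimal sketch using statsmodels; the revenue lift and sample size are hypothetical numbers chosen for illustration:

```python
from statsmodels.stats.power import TTestIndPower

lift = 2.0           # hypothetical revenue lift per user
n_per_group = 1_000  # hypothetical sample size per group

analysis = TTestIndPower()
for sd in (10, 20, 40):  # outliers inflate the standard deviation
    d = lift / sd        # Cohen's d shrinks as the variance grows
    power = analysis.power(effect_size=d, nobs1=n_per_group, alpha=0.05)
    print(f"sd={sd:>3}: d={d:.3f}, power={power:.2f}")
```

Holding the lift and sample size fixed, each doubling of the standard deviation halves the standardized effect size, dragging power down sharply.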
Methods for identifying outliers generally fall into two categories: visual and statistical. Visual methods, such as boxplots, scatter plots, and histograms, provide an intuitive way to detect unusual data points by highlighting deviations from expected patterns. These techniques are particularly useful for gaining a quick overview of data distribution and identifying extreme values. You can see a demonstration of these visualization techniques in Figure 1.
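As a quick sketch of what these visual checks look like in code (the revenue sample below is synthetic, generated purely for illustration):

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
# Toy revenue data: roughly 80% zeros plus a heavy right tail.
revenue = np.where(rng.random(5_000) < 0.8, 0.0, rng.lognormal(4, 1.5, 5_000))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(revenue, bins=100)       # a long right tail hints at outliers
ax1.set(title="Histogram", xlabel="revenue")
ax2.boxplot(revenue, vert=False)  # points beyond the whiskers stand out
ax2.set(title="Boxplot", xlabel="revenue")
plt.tight_layout()
plt.show()
```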
However, outlier detection requires a more systematic approach beyond visualization. Statistical techniques, such as Z-scores, quantify how far a data point deviates from the mean in terms of standard deviations. A common threshold considers observations with an absolute Z-score greater than 3 as outliers. Alternatively, quantile-based methods, such as the Interquartile Range (IQR) approach, identify outliers based on data dispersion. The IQR method flags values that fall beyond the first and third quartiles, often using 1.5 times the IQR as a cutoff. Another quantile-based approach treats the highest or lowest 1% of observations as potential outliers, providing a flexible way to account for extreme values in skewed distributions.
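Here is a minimal sketch of these three rules for a generic NumPy array of KPI values; the thresholds (3 standard deviations, 1.5 times the IQR, the top 1%) are the conventional defaults mentioned above rather than universal constants:

```python
import numpy as np

def zscore_outliers(x, threshold=3.0):
    """Flag points more than `threshold` standard deviations from the mean."""
    z = (x - x.mean()) / x.std()
    return np.abs(z) > threshold

def iqr_outliers(x, k=1.5):
    """Flag points more than k * IQR below Q1 or above Q3."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

def top_percent_outliers(x, pct=1.0):
    """Flag the highest pct% of observations (one-sided, for skewed KPIs)."""
    return x > np.percentile(x, 100 - pct)
```

Each function returns a boolean mask, so `x[iqr_outliers(x)]` pulls out the flagged observations directly.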
So, I’ve found out my dataset contains outliers—now what? The answer depends on their source. Some outliers result from errors like data entry mistakes or sensor malfunctions. These should be removed to prevent distortion.
However, our focus is on legitimate outliers (extreme yet valid observations). These present a challenge: they inflate variance and reduce statistical power, but they may also provide critical insights into the effect. For example, a treatment’s impact on revenue might be most noticeable among high-spending players, where behavioral changes are more pronounced. Therefore, handling outliers requires careful consideration, typically following one of two main approaches:
- Trimming: removing the extreme observations from the analysis entirely.
- Winsorization: capping the extreme observations at a chosen percentile, so they stay in the data but with limited influence.
If you're unsure how to apply winsorization, don’t worry, we’ve got you covered! Follow this step-by-step guide to implement it effectively:
1. Choose the winsorization level, i.e., which share of extreme observations to cap (e.g., the top 1% or top 0.1%).
2. Compute the corresponding percentile (e.g., the 99th or 99.9th) on the pooled data from all experiment groups.
3. Replace every observation beyond that percentile with the percentile value itself.
4. Run your statistical test on the winsorized data.
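In code, the whole procedure takes only a few lines. Below is a minimal NumPy sketch of a one-sided cap on the top 1% (the `control` and `test` arrays are synthetic stand-ins); SciPy also ships a ready-made `scipy.stats.mstats.winsorize` if you prefer an off-the-shelf version:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for per-user revenue in each group.
control = rng.lognormal(3, 1.5, 10_000)
test = rng.lognormal(3, 1.5, 10_000) * 1.05

# Steps 1-2: pick a level (top 1%) and compute the cap on the pooled sample.
cap = np.percentile(np.concatenate([control, test]), 99.0)

# Step 3: replace everything above the cap with the cap itself.
control_w = np.minimum(control, cap)
test_w = np.minimum(test, cap)
```

One design note: computing the cap on the pooled sample rather than per group keeps the capping rule identical for both groups, as in the simulation described below.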
To get a better sense of the impact of winsorization, let’s look at a revenue dataset from a gaming company. This is what the data looks like:
A quick look at the histogram makes the outliers apparent: most revenue values are zero and 95% of the data falls below 800, yet a few users generate far more extreme revenues.
To quantify the impact of winsorization on this type of data, we ran a simulation and computed the proportion of times the null hypothesis was rejected (when a true effect exists, this proportion is the statistical power). To achieve this, we randomly divided the data into test and control groups. In the test group, we modified the data by scaling each value by a factor of 1 plus a specified effect size (0%, 1%, 5%, or 10%).
Next, we applied winsorization at three levels—none, top 1%, and top 0.1%—using the pooled sample of both the control and test groups. That is, we calculated the 99th or 99.9th percentile and replaced any more extreme observations with that value. After applying the winsorization, we conducted a t-test to evaluate statistical significance.
This entire process was repeated 1,000 times, allowing us to compute the proportion of simulations where the null hypothesis was rejected, providing insight into how outliers and winsorization impact the power of the test.
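For the curious, here is a condensed sketch of that simulation loop. The revenue sample below is synthetic (mostly zeros with a heavy right tail, mimicking the shape described above), but the procedure follows the steps just described: split, scale the test group, winsorize on the pooled sample, run a t-test, repeat.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Synthetic stand-in for the revenue data: mostly zeros, heavy right tail.
revenue = np.where(rng.random(20_000) < 0.8, 0.0, rng.lognormal(4, 1.5, 20_000))

def run_simulation(effect=0.05, upper_pct=99.0, n_sims=1_000, alpha=0.05):
    """Return the t-test rejection rate after winsorizing at upper_pct."""
    rejections = 0
    for _ in range(n_sims):
        # Randomly split users into control and test; apply the effect to test.
        mask = rng.random(revenue.size) < 0.5
        control, test = revenue[~mask], revenue[mask] * (1 + effect)
        # Winsorize: cap both groups at a percentile of the pooled sample.
        cap = np.percentile(np.concatenate([control, test]), upper_pct)
        control, test = np.minimum(control, cap), np.minimum(test, cap)
        # Two-sample t-test on the winsorized data.
        rejections += stats.ttest_ind(control, test).pvalue < alpha
    return rejections / n_sims

# upper_pct=100 caps nothing, so it serves as the no-winsorization baseline.
print(run_simulation(effect=0.05, upper_pct=99.0))
print(run_simulation(effect=0.05, upper_pct=100.0))
```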
The graph below illustrates the results of our simulation. As expected, when no modification was applied to the data, the proportion of rejections remained close to the significance level (5%). When an effect was introduced in the test group, we observed a clear trend: as the effect size increased, statistical power also increased. Most notably, applying winsorization significantly enhanced power, improving the test's ability to detect true effects. This highlights the crucial role of outlier management in boosting sensitivity and ensuring more reliable A/B test outcomes.
Note that while a winsorization limit of 1% proved superior to one of 0.1% in our simulation, this may not hold for all datasets. Winsorization limits should be chosen carefully, taking into account the specific characteristics of the data at hand.
Outliers can be a real headache: what should you do with them? On one hand, they may be where your effect is most evident; on the other, they can undermine your ability to detect that effect. This duality calls for a balanced approach, one that minimizes the impact of outliers on variance without disregarding the valuable information they may provide. Winsorization offers exactly that: it retains outlier observations while limiting their extremity, and as our simulation shows, it can substantially improve statistical power. So, don’t let a few extreme observations spoil your experiment: just winsorize them!