As an eager analyst, you’ve just received the data for an A/B test. Wasting no time, you dive into the analysis: selecting the appropriate statistical test and meticulously sidestepping pitfalls like data peeking. To your delight, the results reveal a significant improvement in the treatment group. However, despite following best practices, something fundamental is still missing from your process. Any idea what it could be?
The missing piece in your process is a Sample Ratio Mismatch (SRM) check — or simply put, verifying whether the actual allocation of participants to groups matches the intended split. This post aims to explain why checking for SRM is crucial, explore common reasons it occurs, and guide you on how to detect, diagnose, and address SRM issues effectively.
In each A/B test, users are divided into at least two groups. Before the test begins, the analyst determines the proportion of users assigned to each group. The best practice is to split users evenly between the groups, though other distributions are also acceptable. While minor deviations from the planned allocation are common, when the discrepancy becomes substantial, it results in a "sample ratio mismatch".
To isolate the impact of the product variation on the Key Performance Indicator (KPI), the control and test groups must be equivalent across all parameters except the one being manipulated in the test. How can we ensure this? The simplest approach is random allocation — since users are assigned to groups at random, there should be no systematic difference between the characteristics of users in each group.
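To see what random allocation might look like in code, here is a minimal sketch of deterministic, hash-based assignment, a common way to implement it. The md5-based scheme and the assign_group name are illustrative assumptions, not a reference implementation:

```python
import hashlib

def assign_group(user_id: str, experiment_id: str,
                 groups=("control", "treatment")) -> str:
    """Deterministically assign a user to a group by hashing user and experiment IDs."""
    # The same user always lands in the same group within a given experiment,
    # while the hash spreads users across groups effectively at random.
    digest = hashlib.md5(f"{user_id}:{experiment_id}".encode()).hexdigest()
    return groups[int(digest, 16) % len(groups)]

print(assign_group("user_42", "exp_checkout_v2"))  # stable per user, ~50/50 overall
```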
While more advanced techniques, such as stratified sampling, can further ensure group similarity, random allocation is often sufficient for our purposes. However, discovering SRM in the dataset can seriously undermine the principle of random allocation, as it suggests that users may be disproportionately excluded from one of the groups. This imbalance can introduce bias into the user characteristics represented in each group, potentially compromising the test's validity. Take a look at the following illustration to gain a clearer understanding of this point:
Figure 1. Suppose you're analyzing the effect of a change in your game on revenue. Your population includes both game-addicted players (purple) and regular players (black), split evenly between control and test groups. Due to technical issues, loading times are longer in the treatment version, causing regular players to quit—especially in the treatment group. This leads to a smaller test group with a higher proportion of addicted players, as shown in the final sample. While the treatment group shows higher average revenue, the result is inconclusive because it's unclear whether the new version or the different player profiles are driving the revenue increase.
Thus, detecting an SRM in our dataset may signal a violation of one of the fundamental assumptions of statistical inference: the assurance of random allocation. Without it, any observed differences in the data could be attributed to underlying group characteristics rather than the variation being tested, undermining the validity of the results.
SRM detection involves comparing the expected and actual sample sizes for each variant. While perfect alignment isn’t expected, there should be reasonable consistency between the planned and observed allocations. To evaluate whether discrepancies are acceptable, analysts typically use the chi-squared goodness-of-fit test, which compares the planned and actual group proportions to assess whether they differ significantly. The null hypothesis here is that the actual allocation follows the planned ratio. Thus, unlike standard KPI analysis, the goal is not to reject the null hypothesis, but to obtain a non-significant result, indicating that the actual allocation aligns with the planned one.
To quantify the difference between what you planned and what you actually observed, we use the chi-squared statistic, calculated with the formula:

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$

Where:

- $O_i$ is the observed number of users in group $i$
- $E_i$ is the expected number of users in group $i$ under the planned allocation
- $k$ is the number of groups
For example, suppose you're running an A/B test with 200 users. You expect 100 users in each group, but the actual allocation is 90 users in the control group and 110 in the treatment group. The chi-squared calculation would be:

$$\chi^2 = \frac{(90 - 100)^2}{100} + \frac{(110 - 100)^2}{100} = 1 + 1 = 2$$
This value follows a chi-squared distribution with 1 degree of freedom (since we have two groups). Using this statistic, we can calculate the p-value, which is approximately 0.157. Since the p-value is greater than typical significance levels used for SRM checks (e.g., 0.1), we fail to reject the null hypothesis. This suggests that the allocation to groups is acceptable and there is no significant deviation. If you want to gain more intuition about the idea of chi-squared, try to repeat this calculation with an actual allocation of 130-70. What will be the conclusion in this case?
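In practice, this check takes only a couple of lines with SciPy's chisquare. The following sketch reproduces the numbers above (and spoils the 130/70 exercise, so pause here if you'd like to work it out first):

```python
from scipy.stats import chisquare

observed = [90, 110]   # actual allocation: control vs. treatment
expected = [100, 100]  # planned 50/50 split of 200 users

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-squared = {stat:.2f}, p-value = {p_value:.3f}")
# chi-squared = 2.00, p-value = 0.157 -> no evidence of SRM

# The 130/70 exercise from the text:
stat, p_value = chisquare(f_obs=[130, 70], f_exp=expected)
print(f"chi-squared = {stat:.2f}, p-value = {p_value:.5f}")
# chi-squared = 18.00, p-value = 0.00002 -> strong evidence of SRM
```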
While it's crucial to conduct SRM analysis at the overall sample level, it can also be valuable to examine subgroups within the population. For example, you might check whether the division of users between the control and test groups is consistent across operating systems. In this case, we use a different variant: the chi-squared test of independence, which identifies whether allocation discrepancies exist in specific subgroups of the sample.
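Here is a minimal sketch of such a subgroup check using SciPy's chi2_contingency; the per-OS counts below are made up purely for illustration:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical user counts per group, broken down by operating system
#                  control  treatment
counts = np.array([[4980,   5010],    # iOS
                   [5050,   4430],    # Android: noticeably fewer treatment users
                   [1010,    990]])   # other

stat, p_value, dof, expected = chi2_contingency(counts)
print(f"chi-squared = {stat:.1f}, dof = {dof}, p-value = {p_value:.2g}")
# A significant result means the control/treatment split depends on the OS,
# i.e., the allocation discrepancy is concentrated in a specific segment.
```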
Once SRM is detected, it’s crucial to identify where in the test process the imbalance arises. Specifically, there are two key points where problems might occur:
1. Randomization Mechanism: Sometimes, the discrepancy between the planned and actual allocation is due to the procedure used to assign users to groups. For example, if a randomization function generates numbers from 1 to 256 based on user and experiment IDs for a test with three groups (each allocated 33%), the 256 buckets cannot be divided evenly three ways, leading to SRM (see the short sketch after this list). Fortunately, this type of SRM doesn’t indicate any systematic differences between groups, so it doesn’t undermine the test’s validity. In such cases, recognizing this as the cause of SRM allows you to proceed with the analysis without concern.
2. Variables Confounded with the Treatment: SRM becomes problematic when the cause is related to the treatment itself. In this case, the initial group allocation might be equal, but imbalances can emerge later in the process. There are two main types of factors to consider:
In some cases, the treatment may only affect certain subsets of users, for example, longer loading times for mobile users but not for desktop users. In such situations, it's critical to examine group allocation across various subpopulations (e.g., mobile vs. desktop, different browsers) to see if the distribution is influenced by the treatment. If SRM appears in specific segments, it could signal issues in the experiment’s infrastructure or flow.
Once such patterns are identified, the next step is to investigate whether there are differences in technical properties, like loading times, across these segments. Although running many comparisons increases the chance of false positives, this is less concerning here: the tests are used to flag potential issues for further investigation, not to draw final conclusions.
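To make the randomization-mechanism scenario in point 1 concrete, here is the short sketch promised there; the rule of handing the leftover bucket to a single variant is an assumed simplification:

```python
# 256 hash buckets split across three variants, as in the example above
n_buckets, n_groups = 256, 3
print(n_buckets // n_groups, n_buckets % n_groups)  # 85 buckets each, 1 left over

# If the leftover bucket is handed to one variant, that variant is expected
# to receive 86/256 of the traffic instead of exactly one third:
print(f"{86 / 256:.4f} vs. {1 / 3:.4f}")  # 0.3359 vs. 0.3333
# With enough users, even this small, harmless skew will flag the SRM check.
```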
Even if the source of the SRM cannot be identified, it is still likely that something is distorting the proportion of users, rendering the data invalid for analysis. In this case, you may consider re-running the test. If the SRM was due to random chance (i.e., a false positive) or an unknown issue tied to specific dates or particular traffic, it’s possible that it will not recur, even without fixing anything in the test.
A fundamental requirement for reliable statistical inference is that the groups differ only by the factor manipulated by the treatment. Detecting an SRM in the data strongly suggests that this assumption has been violated. Therefore, when SRM is identified, the data cannot be reliably analyzed until the source of the imbalance is investigated and resolved. In this blog, we covered how to check for SRM, explored its potential causes, and discussed strategies to identify its source. We hope you found this guide helpful—may your experiments stay balanced, and may what you plan be what you get!