Why Are Most A/B Test Results A Lie?
In a report on A/B testing, Martin Goodson at Qubit suggests that “most A/B winning test results are illusory”. Andre Morys at Web Arts goes even further, arguing that “90% of test results are a lie.” If their estimates are correct, then a lot of decisions are being made on the basis of invalid A/B test results. Does this explain why many non-CRO managers are sceptical about the sustainability of A/B test results? Why are some A/B test results invalid, and what can we do about it?
1. Confirmation Bias:

Andre Morys suggests that confirmation bias produces many false positives. Optimisers naturally base most of their test hypotheses and designs on their own attitudes and opinions, and they ignore information that contradicts their ideas. As a result they become too emotionally attached to their designs and stop experiments as soon as the testing software indicates they have a winner.
Stopping a test as soon as statistical confidence is achieved can be highly misleading, because it does not allow for the length of the business cycle. It is important to also consider the source of traffic and any current marketing campaigns. Run tests for a minimum of two business cycles to make allowance for day-to-day fluctuations and changing behaviour at weekends. In practice this means tests should run for at least 2 to 4 weeks, depending upon your test design and business cycle.
2. Survivorship Bias:

Survivorship bias refers to our propensity to focus on people who survive a journey (e.g. returning visitors or VIP customers) while ignoring the fact that the process they have been through influences their characteristics and behaviour. Even returning visitors have “survived” in the sense that they weren’t put off by any negative aspects of the user experience. VIP customers may be your most profitable users, but they are not a fixed pool of visitors, and their level of intent will often be much higher than that of your normal user.
The danger is that if you include returning users in a landing page experiment, they will behave very differently from new users who have never seen the site before. For tests relating to existing users, excluding outliers can reduce problems with VIP customers influencing test results. Consider excluding VIP customers completely from your A/B tests, as they do not reflect normal users.
For more on survivorship bias see my post: Don’t let this bias destroy your optimisation strategy!
3. Statistical Power:
Statistical power refers to the probability of identifying a genuine difference between two experiences in an experiment. To achieve a high level of statistical power, we have to build up a sufficiently large sample size and generate a reasonable uplift. However, when working in a commercial organisation there is often a lot of pressure to achieve quick results and move on to the next idea. Unfortunately this can sometimes undermine the testing process.
Before you begin a test you should estimate the sample size required to achieve a high level of statistical power, normally around 90%. This means that you would detect 9 out of 10 genuine differences between variants. Because we only observe a sample of visitors, natural random variation means tests will also sometimes generate false positives. By convention the false positive rate (the significance level) is normally set at 5%.
According to an analysis of 1,700 A/B tests by convert.com, only around 10% of tests achieve a statistically significant uplift. This means that if we run 100 tests we might expect 10 of those tests to generate a genuine uplift. Suppose that, given current traffic levels, we estimate we would need to run each test for two months to achieve 90% power. In theory we should then identify 90% of the genuine uplifts, or 9 tests. Using the p-value cut-off of 5% we would also anticipate around 5 false positives. Thus our tests generate 9 + 5 = 14 winning tests.
The danger here is that people are too impatient to allow a test showing an uplift to run for the full two months, and so they decide to stop the test after just two weeks. The problem is that the much smaller sample size decreases the power of the test from 90% to perhaps as low as 30%. Under this scenario we would now expect only 3 genuine uplifts and still around 5 false positives, meaning that 63% of your winning tests (5 out of 8) are not genuine uplifts.
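The arithmetic above can be checked in a few lines, using the article's round numbers (100 tests, 10 genuine effects, roughly 5 false positives at the conventional 5% level):

```python
# Illustrative numbers from the text: 100 tests, ~10 genuine uplifts,
# and roughly 5 false positives at the conventional 5% significance level.
tests, genuine = 100, 10
false_wins = tests * 0.05            # ~5 false positives

results = {}
for power in (0.9, 0.3):             # full-length test vs. stopped early
    true_wins = genuine * power
    share_false = false_wins / (true_wins + false_wins)
    results[power] = share_false
    print(f"power={power:.0%}: {true_wins:.0f} real + {false_wins:.0f} false "
          f"winners -> {share_false:.1%} are not genuine")
```

At 90% power roughly a third of your winners are false; stop early at 30% power and that share rises to well over half.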
Before running a test, always use a sample size calculator to estimate how long the test needs to run to achieve your required statistical power. This allows you to consider the implications for the power of the test if you do decide to cut it short, and to assume a much greater risk of a false positive if you stop tests early.
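If you don't have a calculator to hand, the standard normal-approximation formula behind most two-proportion calculators is easy to sketch. The 3% baseline and 10% relative uplift below are hypothetical figures for illustration, not from the article:

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline, rel_uplift, alpha=0.05, power=0.9):
    """Visitors needed per variant to detect a relative uplift with a
    two-sided, two-proportion z-test (normal approximation)."""
    p1, p2 = baseline, baseline * (1 + rel_uplift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # significance threshold
    z_beta = NormalDist().inv_cdf(power)            # power requirement
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# e.g. 3% baseline conversion, detecting a 10% relative uplift at 90% power
n = sample_size_per_variant(0.03, 0.10)
print(n)   # roughly 70,000 visitors per variant
```

Dividing the result by your daily traffic per variant gives the minimum run time, which you can then sanity-check against the two-business-cycle guideline above.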
4. Simpson's Paradox:

Once you begin a test, avoid altering the settings or the designs of the variant and the Control, and don't change the traffic allocated to the variants during the experiment. Adjusting the traffic split during a test can undermine the result because of a phenomenon known as Simpson's Paradox. This occurs when a trend that appears in different groups of data vanishes or reverses when the groups are combined.
Experimenters at Microsoft experienced this issue when they allocated just 1% of traffic to a test variant on Friday, but increased this to 50% on Saturday. The site received one million daily visitors. Although the variant had a higher conversion rate than the Control on both days, when the data was aggregated the variant appeared to have a lower overall conversion rate.
This occurs because we are dealing with weighted averages. Saturday had a much lower conversion rate, and because the variant received 50 times more traffic on Saturday than it did on Friday, Saturday had a much greater impact on the variant's overall result.
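A small sketch makes the weighted-average effect concrete. The figures below are illustrative, chosen to mimic the Friday/Saturday pattern described, not Microsoft's actual data:

```python
# (conversions, visitors) per cell; the variant gets 1% of traffic on
# Friday and 50% on Saturday, as in the example above.
data = {
    "Friday":   {"control": (20_000, 990_000), "variant": (230, 10_000)},
    "Saturday": {"control": (5_000, 500_000),  "variant": (6_000, 500_000)},
}

def rate(conversions, visitors):
    return conversions / visitors

# The variant wins on each individual day...
for day, cells in data.items():
    print(f"{day}: control {rate(*cells['control']):.2%} "
          f"vs variant {rate(*cells['variant']):.2%}")

# ...but aggregating flips the result: the variant now looks worse overall.
agg_control = rate(20_000 + 5_000, 990_000 + 500_000)
agg_variant = rate(230 + 6_000, 10_000 + 500_000)
print(f"Aggregate: control {agg_control:.2%} vs variant {agg_variant:.2%}")
```

The variant beats the Control on both days yet loses in the combined figures, purely because its traffic was concentrated on the low-converting Saturday.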
Simpson’s Paradox occurs when sampling is not uniform, so avoid making decisions about sub-groups (e.g. different sources of traffic or types of device) using aggregate data. This demonstrates the benefit of targeted tests, for example where you only look at a single traffic source or type of device.
When you do run tests across multiple sources of traffic or user segments, it is best to avoid using aggregate data. Instead, treat each source or page as a separate test variant. You can then run the test for each variant until you achieve a statistically significant result.
Altering traffic allocation during a test will also bias your results because it changes the sampling of your returning visitors. Traffic allocation only affects new users, so a change in the share of traffic won’t adjust for any difference in returning visitor numbers that the initial traffic split generated.
5. Don’t validate your A/B testing software:
Companies sometimes begin using A/B testing software without properly validating that it accurately measures all key metrics across all user journeys. It is not uncommon for A/B testing software not to be universally integrated, as different teams are often responsible for the platform, registration and checkout.
During the process of integration be careful to check that all different user journeys have been included. People have a tendency to prioritise what they perceive to be the most important paths. However, users rarely, if ever, follow the preferred “happy path”.
Once integration is complete, validate it either by running A/A tests or by using web analytics to confirm that metrics are being measured correctly. It is also advisable to check that both your testing software and web analytics align with your data warehouse. If there is any discrepancy, it is better to know before you run a test than when you present results to senior management.
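An A/A test can also be simulated to sanity-check the statistics themselves: with identical variants, roughly 5% of runs should still come out "significant" at the 5% level. A minimal sketch using a two-proportion z-test (the 5% conversion rate and traffic figures are arbitrary assumptions):

```python
import random
from statistics import NormalDist

random.seed(42)
p, n, alpha, runs = 0.05, 5_000, 0.05, 400
false_alarms = 0
for _ in range(runs):
    # identical "A" and "A" groups drawn from the same 5% conversion rate
    a = sum(random.random() < p for _ in range(n))
    b = sum(random.random() < p for _ in range(n))
    pooled = (a + b) / (2 * n)
    se = (2 * pooled * (1 - pooled) / n) ** 0.5
    z = abs(a - b) / n / se
    p_value = 2 * (1 - NormalDist().cdf(z))
    false_alarms += p_value < alpha
print(f"{false_alarms / runs:.1%} of A/A runs flagged as significant")
```

If your real A/A tests flag winners far more often than this, something in the integration or the statistics is broken.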
6. Regression To The Mean:

There is a real danger that in the first few days of a test you see a large uplift (or fall) and mention it to your boss or other members of the team. Everyone gets excited and there is pressure on you to end the test early, either to benefit from the uplift or to limit the damage. Often, though, this large difference in performance gradually evaporates over a number of days or weeks.
Never fall into this trap: what you are seeing is regression to the mean, the phenomenon whereby a metric that is extreme the first time it is measured tends to move towards the mean on subsequent observations. Small sample sizes naturally generate extreme results, so be careful not to read anything into your conversion rate when a test first starts to generate numbers.
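The effect is easy to reproduce in simulation. Below, the variant has no real effect at all, yet the tests that happen to look like big winners after two days regress sharply once more data arrives (the traffic figures are made up for illustration):

```python
import random

random.seed(1)
p, daily, tests = 0.05, 500, 1_000   # true rate identical in both arms

def conversions(visitors):
    return sum(random.random() < p for _ in range(visitors))

early, late = [], []
for _ in range(tests):
    a2, b2 = conversions(2 * daily), conversions(2 * daily)  # 2 days of data
    if a2 and (b2 - a2) / a2 > 0.20:                         # "wow, +20%!"
        a28 = a2 + conversions(26 * daily)                   # run on to 28 days
        b28 = b2 + conversions(26 * daily)
        early.append((b2 - a2) / a2)
        late.append((b28 - a28) / a28)

print(f"{len(early)} early 'winners' selected")
print(f"mean uplift at 2 days:  {sum(early) / len(early):+.1%}")
print(f"mean uplift at 28 days: {sum(late) / len(late):+.1%}")
```

The early "winners" were selected precisely because noise flattered them, so on average their advantage all but disappears by day 28.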
7. The Fallacy Of Session Based Metrics:
Most A/B testing software uses standard statistical tests to determine whether the performance of a variant is likely to be significantly different from the Control. This is based upon the assumption that each observation is independent.
However, if you use session level metrics such as conversion per session you have a problem. A/B testing software allocates users into either group A or B to prevent the same visitor seeing both variants and to ensure a consistent user experience. Sessions are therefore not independent as a user can have multiple sessions.
Analysis by Skyscanner has shown that a visitor is more likely to have converted if they have had multiple sessions. On the other hand an individual session is less likely to have converted if made by a user with many sessions.
This lack of independence is a concern. Skyscanner simulated how it affects conversion rate estimates and discovered that when users (rather than sessions) are randomly assigned, as occurs in an A/B test, the variance of session-level metrics is greater than assumed by standard significance calculations.
Skyscanner found that the effect is greater for longer experiments, because the average number of sessions per user is higher. What this means is that month-long tests that randomise on users but measure session conversion rate would have three times as many false positives as normally expected. However, when the metric matched the unit of randomisation (user conversion rate with user-level randomisation), the variance conformed to that predicted by significance calculations.
Furthermore, this problem occurs whenever you use a rate metric whose denominator is not the unit you randomised on. So per-page-view, per-click or click-through-rate metrics will be subject to the same problem if you are randomising on users. The team at Skyscanner suggest three ways of avoiding being misled by this statistical phenomenon.
- Keep to user-level metrics when randomising on users and you will normally avoid an increased rate of false positives.
- When it is necessary to use a metric that is subject to this increased propensity for false positives, there are methods of estimating the true variance and calculating accurate p-values. Check this paper from the Microsoft team and also this one.
- Estimating the true variance and calculating accurate p-values can be computationally complex and time-consuming. Alternatively, you can accept a higher false positive rate and use A/A tests to estimate how much the metric variance is inflated by this statistical phenomenon.
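One such correction, used in the Microsoft papers, is the delta method: treat the session conversion rate as a ratio of per-user totals and compute its variance at the user level. The sketch below uses simulated data with assumed user-level heterogeneity; the delta-method variance comes out larger than the naive per-session formula, which is exactly the inflation Skyscanner observed:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users = 50_000
# Simulated per-user data: session counts plus a user-specific conversion
# propensity, so sessions from the same user are correlated (assumption).
sessions = rng.poisson(3, n_users) + 1
propensity = rng.beta(1, 19, n_users)            # ~5% mean conversion rate
conversions = rng.binomial(sessions, propensity)

r = conversions.sum() / sessions.sum()           # session conversion rate

# Naive variance: pretends every session is an independent Bernoulli trial.
naive_var = r * (1 - r) / sessions.sum()

# Delta-method variance: respects that randomisation is at the user level,
# treating the rate as a ratio of per-user totals sum(X_i) / sum(Y_i).
mu_y = sessions.mean()
var_x = conversions.var(ddof=1)
var_y = sessions.var(ddof=1)
cov_xy = np.cov(conversions, sessions, ddof=1)[0, 1]
delta_var = (var_x - 2 * r * cov_xy + r**2 * var_y) / (n_users * mu_y**2)

print(f"variance inflation: {delta_var / naive_var:.2f}x")
```

Significance tests built on the naive variance would understate the uncertainty by this inflation factor, which is how the extra false positives arise.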
When trying to avoid these pitfalls of A/B testing, the key is to have a strong process for developing and running your experiments. A good testing framework ensures that hypotheses are based on evidence rather than intuition and that you agree the test parameters upfront. Make sure you calculate the sample size and how long your test will need to run to achieve the statistical power you want.
There is an opportunity cost to running tests to their full length, so sometimes you may want to end tests early. This can be fine provided you allow for the lower level of statistical confidence and the increased risk of a false positive. There is also evidence to suggest that if you are continuously running tests, this can largely compensate for any increase in the rate of false positives.