This page is a summary version of the statistical methods used by GrowthBook. For more detail, see the full white paper PDF.
Bayesian methods offer some distinct advantages over the frequentist approach. With frequentist methods, you must decide on the sample size before running an experiment, which is known as a fixed horizon. This means you cannot stop a test early or run it for longer than planned. If you look at the results partway through and act on them, you run into the peeking problem, which drastically increases your chance of a Type I error (false positive). Furthermore, the results of a frequentist experiment are p-values and confidence intervals. These measures are difficult to interpret correctly, and even harder to explain to others.
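To make the peeking problem concrete, here is a minimal simulation sketch. It is not part of GrowthBook; the function names, the 10% conversion rate, the ten looks, and the pooled two-proportion z-test are all assumptions chosen for illustration. It runs many A/A tests (no true difference) and compares how often a frequentist test reports significance when you check only once at the end versus at every look.

```python
import math
import random


def z_test_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no conversions (or all conversions): nothing to detect
    z = (conv_b / n_b - conv_a / n_a) / se
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_aa_tests(n_sims=400, looks=10, users_per_look=100,
                      rate=0.1, alpha=0.05, seed=0):
    """Return (false positive rate with one final look,
               false positive rate when stopping at any significant look)."""
    rng = random.Random(seed)
    fixed_hits = 0
    peeking_hits = 0
    for _ in range(n_sims):
        conv_a = conv_b = n = 0
        significant_at_any_look = False
        last_p = 1.0
        for _ in range(looks):
            # Both arms share the same true rate, so any "win" is a false positive.
            conv_a += sum(rng.random() < rate for _ in range(users_per_look))
            conv_b += sum(rng.random() < rate for _ in range(users_per_look))
            n += users_per_look
            last_p = z_test_p_value(conv_a, n, conv_b, n)
            if last_p < alpha:
                significant_at_any_look = True
        fixed_hits += last_p < alpha
        peeking_hits += significant_at_any_look
    return fixed_hits / n_sims, peeking_hits / n_sims


fixed_rate, peeking_rate = simulate_aa_tests()
print(f"fixed horizon: {fixed_rate:.3f}, peeking at every look: {peeking_rate:.3f}")
```

Because the final look is one of the peeks, the peeking rate can never be lower than the fixed-horizon rate, and in practice it is usually several times higher.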
Bayesian methods help with both of these issues. There are no fixed horizons, and although Bayesian methods are not completely immune to the peeking problem, it is much less of a concern. You can generally stop an experiment whenever you want without a huge Type I error increase. In addition, the results are very easy to explain and interpret. Everything has some probability of being true and you adjust the probabilities as you gather data and learn more about the world. This matches up with how most people think about experiments - "there’s a 95% chance this new button is better and a 5% chance it’s worse."
Priors and Posteriors
At GrowthBook, we use an Uninformative Prior. This simply means that before an experiment runs, we assume both variations have an equal chance of being higher/lower than the other one. As the experiment runs and you gather data, the Prior is updated to create a Posterior distribution. For Binomial metrics (simple yes/no conversion events) we use a Beta-Binomial Prior. For count, duration, and revenue metrics, we use a Gaussian (or Normal) Prior.
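As a rough illustration of how a prior becomes a posterior, the sketch below uses textbook conjugate updates. The specific choices are assumptions, not confirmed GrowthBook internals: a uniform Beta(1, 1) prior for binomial metrics, and a flat prior for Gaussian metrics so that the posterior of the mean is approximately Normal(sample mean, sample std / sqrt(n)). All names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class BetaPosterior:
    alpha: float
    beta: float

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)


def update_beta(conversions, users, prior_alpha=1.0, prior_beta=1.0):
    """Beta prior + binomial data -> Beta posterior.

    Beta(1, 1) is the uniform prior (an assumption for this sketch).
    """
    return BetaPosterior(prior_alpha + conversions,
                         prior_beta + (users - conversions))


def gaussian_posterior_flat_prior(sample_mean, sample_std, n):
    """Approximate posterior (mean, std) of the metric mean under a flat prior."""
    return sample_mean, sample_std / math.sqrt(n)


posterior = update_beta(conversions=20, users=100)
print(posterior, round(posterior.mean, 4))
print(gaussian_posterior_flat_prior(sample_mean=10.0, sample_std=4.0, n=400))
```

With little data the posterior stays wide; as users accumulate, both posteriors narrow around the observed rate or mean.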
GrowthBook uses fast estimation techniques to quickly generate inferential statistics at scale for every metric in an experiment - Chance to Beat Control, Relative Uplift, and Risk (or expected loss).
Chance to Beat Control is straightforward. It is simply the probability that a variation is better than the control. You typically want to wait until this reaches 95% (or 5% if the variation is worse).
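One simple way to approximate Chance to Beat Control for a binomial metric is Monte Carlo sampling from each variation's posterior. This is a substitute technique for illustration only: GrowthBook uses fast estimation techniques, which this sketch does not reproduce. The Beta(1, 1) prior and the function name are assumptions.

```python
import random


def chance_to_beat_control(control_conv, control_users,
                           variation_conv, variation_users,
                           draws=20000, seed=0):
    """Estimate P(variation rate > control rate) by sampling Beta posteriors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior under an assumed uniform Beta(1, 1) prior.
        p_control = rng.betavariate(1 + control_conv,
                                    1 + control_users - control_conv)
        p_variation = rng.betavariate(1 + variation_conv,
                                      1 + variation_users - variation_conv)
        wins += p_variation > p_control
    return wins / draws


print(chance_to_beat_control(100, 1000, 150, 1000))  # variation clearly ahead
print(chance_to_beat_control(100, 1000, 100, 1000))  # identical data, near 0.5
```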
Relative Uplift is similar to a frequentist Confidence Interval. Instead of showing a fixed 95% interval, we show the full probability distribution using a violin plot:
We have found this tends to lead to more accurate interpretations. For example, instead of just reading the above as "it’s 17% better", people tend to factor in the error bars ("it’s about 17% better, but there’s a lot of uncertainty still").
Risk (or expected loss) can be interpreted as “If I stop the test now and choose X and it’s actually worse, how much am I expected to lose?”. This is shown as a relative percent change - so if your baseline metric value is $5/user, a 10% risk equates to losing $0.50. You can specify your risk tolerance thresholds on a per-metric basis within GrowthBook.
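The sketch below shows one way to estimate a relative risk figure and convert it to metric units. It is an assumption-laden illustration, not GrowthBook's formula: risk is taken here as the expected shortfall of the variation against the control, E[max(control - variation, 0)], divided by the expected control rate, and it is estimated by Monte Carlo sampling from Beta posteriors with an assumed Beta(1, 1) prior. All function names are hypothetical.

```python
import random


def relative_risk_of_choosing_variation(control_conv, control_users,
                                        variation_conv, variation_users,
                                        draws=20000, seed=0):
    """Expected loss from shipping the variation, relative to the control rate."""
    rng = random.Random(seed)
    loss_sum = 0.0
    control_sum = 0.0
    for _ in range(draws):
        p_control = rng.betavariate(1 + control_conv,
                                    1 + control_users - control_conv)
        p_variation = rng.betavariate(1 + variation_conv,
                                      1 + variation_users - variation_conv)
        # A loss only occurs in draws where the variation is actually worse.
        loss_sum += max(p_control - p_variation, 0.0)
        control_sum += p_control
    return loss_sum / control_sum


def risk_in_metric_units(baseline_value, relative_risk):
    """Convert a relative risk into the metric's own units."""
    return baseline_value * relative_risk


# The arithmetic from the text: a $5/user baseline with 10% risk.
print(risk_in_metric_units(5.00, 0.10))
print(relative_risk_of_choosing_variation(100, 1000, 150, 1000))
```

Risk complements Chance to Beat Control: a variation can have a modest chance of winning but still carry very little risk if the possible downside is small.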
GrowthBook gives the human decision maker everything they need to weigh the results against external factors to determine when to stop an experiment and which variation to declare the winner.
Data Quality Checks
In addition, GrowthBook performs automatic data quality checks to ensure the statistical inferences are valid and ready for interpretation. We currently run a number of checks and plan to add even more in the future.
- Sample Ratio Mismatch (SRM) detects when the traffic split doesn't match what you expect (e.g. a 48/52 split when you expect 50/50)
- Multiple Exposures alerts you if too many users were exposed to multiple variations of a single experiment (e.g. someone saw both A and B)
- Guardrail Metrics help ensure an experiment isn't inadvertently hurting core metrics like error rate or page load time
- Minimum Data Thresholds keep you from drawing conclusions too early (e.g. when it's 5 vs 2 conversions)
- Variation Id Mismatch detects missing or improperly tagged rows in your data warehouse
- Suspicious Uplift Detection alerts you when a metric changes by too much in a single experiment, indicating a likely bug
Many of these checks are customizable at a per-metric level. For example, you can have stricter quality checks for revenue than for less important metrics.
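As an example of what one of these checks involves, a Sample Ratio Mismatch check is commonly done with a chi-squared goodness-of-fit test. The sketch below covers the two-variation case only and uses an alert threshold of 0.001; both the threshold and the function names are assumptions, not GrowthBook's documented settings.

```python
import math


def srm_p_value(users_a, users_b, expected_share_a=0.5):
    """Chi-squared goodness-of-fit p-value for a two-variation traffic split."""
    total = users_a + users_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = ((users_a - expected_a) ** 2 / expected_a
            + (users_b - expected_b) ** 2 / expected_b)
    # Survival function of a chi-squared distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


def has_srm(users_a, users_b, expected_share_a=0.5, threshold=0.001):
    """Flag the split when the p-value falls below the (assumed) threshold."""
    return srm_p_value(users_a, users_b, expected_share_a) < threshold


print(has_srm(48, 52))      # 48/52 with 100 users: plausible by chance
print(has_srm(4800, 5200))  # 48/52 with 10,000 users: very unlikely by chance
```

The same 48/52 split is harmless noise with 100 users but strong evidence of a problem with 10,000, which is why the check uses a statistical test rather than a fixed percentage tolerance.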
There is often a desire to drill down into results to see how segments or dimensions of your users were affected by an A/B test variation. This is especially useful for finding bugs (e.g. if Safari is down, but the other browsers are up) and for identifying ideas for follow-up experiments (e.g. "European countries seem to be responding really well to this test, let's try a dedicated variation for them").
However, too much slicing and dicing of data can lead to what is known as the Multiple Testing Problem. If you look at the data in enough ways, one of them will look significant just by random chance.
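A quick calculation shows how fast this grows. Assuming each slice is an independent test with a 5% false positive rate (a simplification, since real slices overlap and correlate), the chance that at least one slice looks significant by accident is 1 - (1 - 0.05)^k for k slices. The function name is hypothetical.

```python
def chance_of_any_false_positive(num_slices, alpha=0.05):
    """Probability that at least one of several independent tests is a false positive."""
    return 1 - (1 - alpha) ** num_slices


for k in (1, 2, 5, 20, 50):
    print(k, round(chance_of_any_false_positive(k), 3))
```

At 20 slices the chance of at least one spurious "significant" result is already close to two in three.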
GrowthBook does not run any statistical corrections (e.g. Bonferroni). Instead, we change how much we show depending on the cardinality of the dimension. For example, if your dimension has only a few distinct values (e.g. "free" or "paid") we show the full statistical analysis since the impact to the false positive rate is low. However, if your dimension has many distinct values (e.g. country) we only show the raw conversion numbers without any statistical inferences at all.
In addition, we apply automatic grouping to very high-cardinality dimensions. In the country example, only the top 20 countries will be shown individually. The rest will be lumped together into an
We have found this to be a good trade-off between false positives and false negatives.
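The cardinality rules above could be sketched roughly as follows. This is only an illustration: the cutoff for "a few distinct values" (`max_values_for_stats`) and the label used for the combined group (`rest_label`) are hypothetical placeholders, since this page does not state them. Only the top-20 limit comes from the text.

```python
def show_full_stats(distinct_values, max_values_for_stats=5):
    """Decide whether a dimension is low-cardinality enough for full statistics.

    The default cutoff of 5 is a placeholder assumption.
    """
    return distinct_values <= max_values_for_stats


def group_top_values(users_by_value, top_n=20, rest_label="<remaining values>"):
    """Keep the top_n largest dimension values and combine the rest into one row."""
    ranked = sorted(users_by_value.items(), key=lambda item: item[1], reverse=True)
    grouped = dict(ranked[:top_n])
    rest_total = sum(count for _, count in ranked[top_n:])
    if rest_total:
        grouped[rest_label] = rest_total
    return grouped


# 25 made-up countries with decreasing traffic.
traffic = {f"country_{i:02d}": 1000 - 30 * i for i in range(25)}
grouped = group_top_values(traffic)
print(len(grouped), show_full_stats(len(traffic)))
```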
GrowthBook utilizes a combination of Bayesian statistics, fast estimation techniques, and data quality checks to robustly analyze A/B tests at scale and provide intuitive results to decision makers. The implementation is fully open source under an MIT license and available on GitHub. You can also read more about the statistics and equations used in the white paper PDF.