GrowthBook's Statistics

GrowthBook provides both Bayesian and frequentist approaches to experiment analysis. We default to Bayesian statistics because they provide a more intuitive framework for decision making for most customers, but we provide ample tools to select between both approaches based on your experimentation needs. You can choose between the two statistics engines at the Organization or Project level.

Bayesian Statistics

Bayesian methods provide a few key advantages over frequentist approaches.

First, they provide more intuitive results. Instead of p-values and confidence intervals, you get probabilities and distributions of likely outcomes. These values allow you to make statements like "there’s a 95% chance this new button is better and a 5% chance it’s worse", statements that have no direct analog in a frequentist framework.

Second, Bayesian methods allow you to encode prior knowledge about experiment effects, which helps ensure that you do not over-interpret small sample sizes and lets past learnings reduce uncertainty in new experiments.

Third, Bayesian results are still valid even if you stop an experiment early. While they can suffer from the same "peeking" problems as frequentist statistics, at least the main probabilities and statistical results that you see are not invalidated by stopping early. However, this is something of a distinction without a difference, as the decision to stop an experiment early can still result in inflated false positive rates.

That said, Bayesian methods can require a bit more fine-tuning to get right. In GrowthBook, however, both engines are very similar under the hood, so picking one comes largely down to personal preference and familiarity. In fact, tools like CUPED are available for both engines.

Priors and Posteriors

At GrowthBook, we default to an improper, uninformative prior. This means that, by default, prior information does not affect your experiment results. We do this so that people who want to use the Bayesian engine to the fullest can enable proper priors and reap their benefits, without results being automatically affected when experimenters first begin using GrowthBook.

A prior works by providing additional information to your analysis for what kinds of results are likely based on past evidence. Our statistics engine will combine it with the actual results from your experiment to come up with our final distribution (our "posterior"). This represents the most likely outcomes of your experiment combining prior knowledge and the experiment data. Priors can be very helpful in reducing uncertainty in small sample sizes and in ensuring that you do not over-interpret results that are unlikely to be reliable.

You can easily turn on proper priors by visiting the organization settings, going to the Bayesian engine settings, and turning on the "Proper Prior". By default, we use a Normal distribution with mean 0 and standard deviation 0.3. This prior implies that about 68% of effects are between -30% and 30%, 95% of effects are between -60% and 60%, and the average effect is 0%. In effect, it will shrink positive and negative results towards 0, but eventually will be overcome as more data is collected.
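
To build intuition for how this shrinkage works, here is a minimal Python sketch of a standard normal-normal update using the default Normal(0, 0.3) prior and a hypothetical experiment result; the exact formulas our engine uses are described in our detailed documentation.

    # Minimal sketch of a normal-normal update with the default proper prior.
    # The observed lift and its standard error below are hypothetical.
    prior_mean, prior_sd = 0.0, 0.3    # Normal(0, 0.3) prior on the relative effect
    lift, lift_se = 0.10, 0.08         # observed +10% lift with an 8% standard error

    # Precision-weighted average of the prior and the data
    prior_prec, data_prec = 1 / prior_sd**2, 1 / lift_se**2
    post_var = 1 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * lift)
    post_sd = post_var**0.5

    print(f"Posterior lift: {post_mean:.1%} +/- {post_sd:.1%}")
    # ~9.3% +/- 7.7%: the estimate is shrunk toward 0 while data is scarce,
    # and approaches the observed lift as the standard error falls.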

The choice of 0 and 0.3 corresponds roughly to the distribution of effects that we have actually observed on GrowthBook and is both:

  1. Weak enough to not shrink experiments with large sample sizes
  2. Strong enough to ensure that experiments with small sample sizes are not over-interpreted

You can read more about how we use the prior and your experiment data to produce experiment results in our detailed documentation.

Inferential Statistics

GrowthBook uses fast estimation techniques to quickly generate inferential statistics at scale for every metric in an experiment: Chance to Win, Relative Uplift, and Risk (or expected loss).

Chance to Win is straightforward. It is simply the probability that a variation is better. You typically want to wait until this reaches 95% (or 5% if it's worse).
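
As an illustration, if the posterior distribution for the relative effect is approximately normal, Chance to Win is just the posterior probability mass above zero. The numbers in this sketch are hypothetical:

    from scipy.stats import norm

    # Hypothetical posterior for variation B's relative lift over the control
    post_mean, post_sd = 0.05, 0.02

    chance_to_win = 1 - norm.cdf(0, loc=post_mean, scale=post_sd)
    print(f"Chance to Win: {chance_to_win:.1%}")  # ~99.4% in this example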

Relative Uplift is similar to a frequentist Confidence Interval. Instead of showing a fixed 95% interval, we show the full probability distribution using a violin plot:

Violin plot of a metric's change

We have found this tends to lead to more accurate interpretations. For example, instead of just reading the above as "it’s 17% better", people tend to factor in the error bars ("it’s about 17% better, but there’s a lot of uncertainty still").

Risk (or expected loss) can be interpreted as “If I stop the test now and choose X and it’s actually worse, how much am I expected to lose?”. This is shown as a relative percent change - so if your baseline metric value is $5/user, a 10% risk equates to losing $0.50/user. You can specify your risk tolerance thresholds on a per-metric basis within GrowthBook.
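
For an approximately normal posterior, this expected loss has a closed form. The sketch below is illustrative only; the posterior numbers are hypothetical, not our exact implementation:

    from scipy.stats import norm

    # Hypothetical posterior for the relative lift if we ship variation B
    post_mean, post_sd = 0.05, 0.02

    # Expected loss E[max(0, -lift)]: the average relative amount we give up
    # if we ship B and the true effect turns out to be negative
    risk = post_sd * norm.pdf(post_mean / post_sd) - post_mean * norm.cdf(-post_mean / post_sd)
    print(f"Risk: {risk:.3%} relative")  # tiny here, because B is very likely better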

GrowthBook gives the human decision maker everything they need to weigh the results against external factors to determine when to stop an experiment and which variation to declare the winner.

Frequentist Statistics

Frequentist statistics are familiar to many practitioners, power much of our statistics engine, and have certain advantages. Their widespread adoption has spurred important developments in variance reduction, heterogeneous treatment effect detection, and corrections for peeking issues (e.g. sequential testing) that make frequentist statistics less problematic and, at times, more valuable.

The current frequentist engine computes two-sample t-tests for relative percent change; you can reduce variance via CUPED and enable sequential testing to mitigate concerns with peeking.
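
As a rough sketch of the kind of calculation involved, the example below runs a two-sided test on the relative percent change using a delta-method standard error. The per-variation summary statistics are hypothetical, and at this sample size the normal approximation to the t-test is used for brevity:

    from scipy.stats import norm

    # Hypothetical per-variation summary statistics for a mean metric
    n_a, mean_a, var_a = 10_000, 5.00, 4.0   # control
    n_b, mean_b, var_b = 10_000, 5.15, 4.2   # treatment

    # Relative percent change and its delta-method standard error
    lift = mean_b / mean_a - 1
    se = (var_b / (n_b * mean_a**2) + (mean_b**2 * var_a) / (n_a * mean_a**4)) ** 0.5

    z = lift / se
    p_value = 2 * (1 - norm.cdf(abs(z)))           # two-sided test
    ci = (lift - 1.96 * se, lift + 1.96 * se)      # 95% confidence interval
    print(f"lift={lift:.1%}, p={p_value:.2g}, 95% CI=({ci[0]:.1%}, {ci[1]:.1%})")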

You can read more in our detailed documentation.

Data Quality Checks

In addition, GrowthBook performs automatic data quality checks to ensure the statistical inferences are valid and ready for interpretation. We currently run a number of checks and plan to add even more in the future.

  1. Sample Ratio Mismatch (SRM) detects when the traffic split doesn't match what you are expecting (e.g. a 48/52 split when you expect it to be 50/50); see the sketch after this list
  2. Multiple Exposures which alerts you if too many users were exposed to multiple variations of a single experiment (e.g. someone saw both A and B)
  3. Guardrail Metrics help ensure an experiment isn't inadvertently hurting core metrics like error rate or page load time
  4. Minimum Data Thresholds so you aren't drawing conclusions too early (e.g. when it's 5 vs 2 conversions)
  5. Variation Id Mismatch which can detect missing or improperly-tagged rows in your data warehouse
  6. Suspicious Uplift Detection which alerts you when a metric changes by too much in a single experiment, indicating a likely bug
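
To make the first check concrete, an SRM check is typically a chi-squared goodness-of-fit test on the observed traffic split. This is a minimal sketch with hypothetical counts and a commonly used conservative threshold, not necessarily our exact cutoff:

    from scipy.stats import chisquare

    # Hypothetical observed traffic for an experiment designed as a 50/50 split
    observed = [48_100, 51_900]
    expected = [sum(observed) * w for w in (0.5, 0.5)]

    _, p_value = chisquare(observed, f_exp=expected)
    if p_value < 0.001:  # very small p-value: this imbalance is unlikely to be chance
        print(f"Possible SRM (p={p_value:.2g}); check assignment and your data pipeline")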

Many of these checks are customizable at a per-metric level, so you can, for example, have stricter quality checks for revenue than for less important metrics.

Dimensional Analysis

There is often a desire to drill down into results to see how segments or dimensions of your users were affected by an A/B test variation. This is especially useful for finding bugs (e.g. if Safari is down, but the other browsers are up) and for identifying ideas for follow-up experiments (e.g. "European countries seem to be responding really well to this test, let's try a dedicated variation for them").

However, too much slicing and dicing of data can lead to what is known as the Multiple Testing Problem. If you look at the data in enough ways, one of them will look significant just by random chance.

GrowthBook only provides multiple testing corrections for the frequentist engine, but we have a few guardrails at the metric level to ensure that results are only shown when there's at least enough data to reliably learn about a specific dimension.
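
For illustration, one common correction is Holm-Bonferroni, which adjusts p-values upward based on how many comparisons were made. The p-values below are hypothetical, and this sketch is not necessarily the exact procedure we apply:

    # Holm-Bonferroni: a simple step-down correction for multiple comparisons.
    # Hypothetical p-values from slicing one metric across several dimension values.
    p_values = [0.003, 0.020, 0.045, 0.300]

    m = len(p_values)
    adjusted = [None] * m
    running_max = 0.0
    for rank, (i, p) in enumerate(sorted(enumerate(p_values), key=lambda x: x[1])):
        running_max = max(running_max, min(1.0, (m - rank) * p))
        adjusted[i] = running_max

    print(adjusted)  # only the first slice stays below 0.05 after correction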

In addition, we apply automatic grouping to very high-cardinality dimensions. In the country example, only the top 20 countries will be shown individually. The rest will be lumped together into an (other) category.

We have found this to be a good trade-off between false positives and false negatives.

Conclusion

GrowthBook utilizes a combination of Bayesian and frequentist statistics, fast estimation techniques, and data quality checks to robustly analyze A/B tests at scale and provide intuitive results to decision makers. The implementation is fully open source under an MIT license and available on GitHub.