
Multiple Testing Corrections

note

Multiple testing corrections are only implemented for the Frequentist statistics engine.

If you test a bunch of things at once, you will naturally get some results that look good but aren't real! This page explains this problem in detail and how you can control it with GrowthBook.

What is the multiple testing problem?

In the frequentist framework, testing more than one hypothesis at a time increases the probability of finding a false positive beyond the user's specified rate $\alpha$, often 0.05. This is often known as the multiple testing, or multiple comparisons, problem.

In online AB testing, experimenters will often be running hundreds, if not tens of thousands, of tests at the same time. In this setting, a "test" tends to be the comparison of variation A vs variation B, for one group, for one metric. If you are running 10 experiments with 2 variations, each with 10 metrics, you are running 100 tests at one time. Even if no variation had an effect on any metric, having this many tests will dramatically increase the chance of seeing a false positive.
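To see how fast this compounds, here is a quick back-of-the-envelope calculation (a sketch assuming all tests are independent and each is run at $\alpha = 0.05$):

```python
# Chance of at least one false positive among m independent tests,
# each run at significance level alpha, when no real effects exist.
alpha = 0.05
for m in [1, 10, 100]:
    fwer = 1 - (1 - alpha) ** m
    print(f"{m} tests -> P(at least one false positive) = {fwer:.3f}")

# 1 tests -> P(at least one false positive) = 0.050
# 10 tests -> P(at least one false positive) = 0.401
# 100 tests -> P(at least one false positive) = 0.994
```

With 100 tests and no real effects at all, you are virtually guaranteed at least one significant-looking result.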

Error rates

To understand approaches to solving this issue, let's first define two error rates: the False Discovery Rate (FDR) and the Family-Wise Error Rate (FWER).

In plain English, the FDR is the proportion of your significant results that are false. If you control your FDR at 0.05 (or 5%), and you get 20 significant results, on average only 1 of these should be a false positive.

The FWER is the probability of at least one test being a false positive. This tends to be much stricter as the number of tests grows.

We can write both as follows, where $V$ is the number of statistically significant tests that are false positives and $R$ is the total number of statistically significant tests (with the ratio taken to be 0 when $R = 0$).

$$\text{False Discovery Rate} = \text{FDR} = \text{E}\left[\frac{V}{R}\right]$$

$$\text{Family-wise Error Rate} = \text{FWER} = \text{Pr}(V \geq 1)$$

Corrections in GrowthBook

Statistical methods

GrowthBook allows you to choose no multiple testing correction, a correction that controls the Family-Wise Error Rate (Holm-Bonferroni), or a correction that controls the False Discovery Rate (Benjamini-Hochberg).

Which correction you choose depends on your tolerance for false positives, the number of tests you are running, and your comfort with each procedure. As general guidance: if your analysis is more exploratory, controlling the FDR and accepting a slightly higher false positive rate may be better, while controlling the FWER gives you stronger guarantees that your significant results are reliable. However, with enough tests, controlling the FWER can severely undermine your statistical power.

You can set this for your organization in the Organization → Settings page, as seen below:

Org-level Multiple Testing Correction Settings

Once you make a selection, adjustments happen on the fly in experiment results and reports. For now, these adjusted p-values are only available in the UI and in downloadable CSVs; they are not exportable via the API.

tip

The corrections we implement do not have directly analogous confidence intervals. Nonetheless, we construct confidence intervals in an ad-hoc manner after adjusting p-values: we take each adjusted p-value and back out what the experiment effect standard deviation would need to be to produce it, and we then construct CIs using this adjusted standard deviation (see the sketch after this tip). In this way, adjusted p-values below your p-value threshold should always coincide with confidence intervals that do not cross zero.

Some adjusted p-values are 1, in which case the ad-hoc confidence interval constructed in this way is undefined.

Unadjusted confidence intervals are available in the metric tooltip.
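Here is a minimal sketch of that back-out logic, assuming a two-sided z-test; `adjusted_ci` is a hypothetical helper for illustration, not GrowthBook's internal API:

```python
from scipy.stats import norm

def adjusted_ci(effect, p_adjusted, alpha=0.05):
    """Ad-hoc CI implied by an adjusted p-value (two-sided z-test assumed).

    Hypothetical helper for illustration; not GrowthBook's internal code.
    """
    if p_adjusted >= 1:
        # The implied z-score is 0, so the implied standard error
        # is undefined (matching the caveat above).
        return None
    # Back out the standard error that would produce this adjusted p-value
    z = norm.ppf(1 - p_adjusted / 2)
    se_adjusted = abs(effect) / z
    # Build the interval with the usual critical value
    halfwidth = norm.ppf(1 - alpha / 2) * se_adjusted
    return (effect - halfwidth, effect + halfwidth)

# An adjusted p-value below alpha yields a CI that excludes zero
print(adjusted_ci(effect=0.03, p_adjusted=0.04))
# (0.0014, 0.0586) approximately
```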

Holm-Bonferroni (controlling FWER)

To control the FWER, we implement the Holm-Bonferroni method. This method is an adaptation of the well-known Bonferroni method, which simply multiplies p-values by the number of tests in the family.

The Holm-Bonferroni method controls the FWER just as well as the Bonferroni method, but it is less conservative. The main trade-off is that the implementation is slightly more complex, and one cannot adjust the confidence intervals in a meaningful way. There are other approaches that are even less conservative, but they require making assumptions about the dependence between tests to guarantee FWER control.
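For concreteness, here is a minimal sketch of the standard Holm-Bonferroni step-down adjustment (the textbook procedure, not GrowthBook's exact code):

```python
import numpy as np

def holm_bonferroni(pvalues):
    """Holm-Bonferroni adjusted p-values (standard step-down procedure)."""
    p = np.asarray(pvalues, dtype=float)
    m = len(p)
    order = np.argsort(p)
    # Step-down multipliers: m, m-1, ..., 1 applied to sorted p-values
    adjusted_sorted = p[order] * np.arange(m, 0, -1)
    # Enforce monotonicity: an adjusted p-value can't be smaller
    # than that of a smaller raw p-value, and is capped at 1
    adjusted_sorted = np.minimum(1, np.maximum.accumulate(adjusted_sorted))
    # Return adjusted p-values in the original order
    adjusted = np.empty(m)
    adjusted[order] = adjusted_sorted
    return adjusted

print(holm_bonferroni([0.01, 0.04, 0.03, 0.20]))
# [0.04 0.09 0.09 0.2 ]
```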

Benjamini-Hochberg (controlling FDR)

To control the FDR, we implement the Benjamini-Hochberg procedure. This method does assume that the tests are independent or positively correlated. There are methods that make fewer assumptions (such as the Benjamini-Yekutieli method), but they can be even more conservative than Bonferroni corrections.

Therefore, in order to provide a reasonably powered approach that controls the FDR in some conditions, and to select an approach that has widespread adoption, we implemented the Benjamini-Hochberg procedure.
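For comparison, here is a minimal sketch of the standard Benjamini-Hochberg step-up adjustment (again, not GrowthBook's exact code):

```python
import numpy as np

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (standard step-up procedure)."""
    p = np.asarray(pvalues, dtype=float)
    m = len(p)
    order = np.argsort(p)
    ranks = np.arange(1, m + 1)
    # Scale each sorted p-value by m / rank
    scaled = p[order] * m / ranks
    # Step-up: take the running minimum from the largest p-value down,
    # capping at 1
    adjusted_sorted = np.minimum(1, np.minimum.accumulate(scaled[::-1])[::-1])
    # Return adjusted p-values in the original order
    adjusted = np.empty(m)
    adjusted[order] = adjusted_sorted
    return adjusted

print(benjamini_hochberg([0.01, 0.04, 0.03, 0.20]))
# [0.04 0.0533 0.0533 0.2] approximately
```

Note how the same raw p-values receive smaller adjustments here than under Holm-Bonferroni, reflecting the FDR's looser (and more powerful) guarantee.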

Defining a family of tests

Controlling for multiple comparisons across all of the tests one can run in GrowthBook, even within the context of one experiment, is impractical and can be prohibitively costly in terms of statistical power. Instead, GrowthBook corrects across all metrics and variations (and dimension values, if available) in a particular result view.

tip

We correct using all p-values that you can see on one experiment results page, excluding guardrail metrics.

On overall results

On the experiment results overall page, we collect all tests across non-guardrail metrics and variations to be part of a family, and correct within those values. The following image circles the p-values that we collect and adjust for on the overall results page for an experiment.

Demonstrating Correction in Results

Within dimensional slices

If you look within dimensional slices, to account for the additional increase in tests from dimensional splits, we correct within the family of tests across all dimension values, non-guardrail metrics, and variations. The following video scrolls through a corrected set of dimension results; the adjusted values are those for non-guardrail metrics, and all adjusted p-values in this video (across dimensions and metrics) belong to one family for correction.
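As a rough illustration of how quickly the family grows under dimension splits (the counts below are hypothetical):

```python
# Hypothetical family size when viewing dimension results:
# all dimension values x non-guardrail metrics x vs-control comparisons
# are pooled into one family before adjustment.
n_dimension_values = 5   # e.g., five countries
n_metrics = 10           # non-guardrail metrics only
n_comparisons = 2        # three variations -> two comparisons vs. control

family_size = n_dimension_values * n_metrics * n_comparisons
print(family_size)  # 100 p-values corrected together
```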

tip

We never correct across overall and dimension splits at the same time.