Holdout Experiments in GrowthBook
What are holdout experiments?
Holdout experiments (holdouts) are an approach to measuring the long-term impact of one feature or a set of features. Essentially, you take some set of users and keep them from seeing new features; you then use them as a control group to measure against some other set of users who are getting all of the features you have launched.
Who uses holdouts?
Large tech companies often use them to measure both the long-term impact of individual features as well as the general evolution of a product that some team owns. We advocate that everyone uses holdouts at some level, even if only to test the long-term impacts of a single feature every now and then, to begin to understand how they work and what they say about the persistence of experiment effects at your company.
Why would I want to run a holdout?
- Holdouts are a great way to measure long-term impact. The impact of features changes over time, as does user behavior, and running an experiment for an extended period can help you understand these effects.
- Holdouts can help you measure the impact of multiple features at once. Experiments can interact with each other in unexpected ways and these interactions can change as time goes on.
Why wouldn't I want to run a holdout?
- Holdouts require you to keep a certain set of users behind the rest in terms of functionality.
- Holdouts require you to maintain feature flags in your codebase for their duration.
- Holdouts work best for logged-in experimentation, but can still be useful with anonymous traffic.
Can I run a holdout experiment in GrowthBook?
Yes!
Holdouts are a special class of experiment. While the GrowthBook team plans to build dedicated support for holdouts to make them even easier to run, this document will show you how to run them today.
How to run a holdout experiment in GrowthBook
Background
For this tutorial, we will assume the following goals:
- You want to hold out 5% of users, for some period of time, from seeing a set of features (the 5% is customizable)
- You want to measure the impact of launching one or more features to the general population
To achieve these goals we will essentially split our traffic into 2 groups:
- 10% of traffic: your holdout population, split into two sub-groups
  - A holdout control group - 5% of global traffic that will never see new features and serves as our holdout control
  - A holdout treatment group - 5% of global traffic that sees all new features, but only once they are released, not while they are being experimented on/tested
- Our general population - 90% of traffic that gets experimented on and receives features as they are released (see the sketch after this list)
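To make the split concrete, here is a minimal sketch of a deterministic, hash-based assignment that carves traffic into those three groups. It is illustrative only; the user id attribute, hash function, and exact thresholds are assumptions, and in practice GrowthBook's SDKs do this bucketing for you.

```ts
// Illustrative only: a simplified, deterministic bucketing sketch.
// GrowthBook's SDKs handle this hashing internally; the numbers mirror the
// 5% / 5% / 90% split described above.
import { createHash } from "node:crypto";

type Bucket = "holdout-control" | "holdout-treatment" | "general-population";

function assignBucket(userId: string): Bucket {
  // Map the user id to a stable number in [0, 1)
  const hash = createHash("sha256").update(userId).digest();
  const n = hash.readUInt32BE(0) / 0x100000000;

  if (n < 0.05) return "holdout-control"; // 5%: never sees unreleased features
  if (n < 0.1) return "holdout-treatment"; // 5%: sees features only at release
  return "general-population"; // 90%: gets experimented on and released to
}

console.log(assignBucket("user-123")); // the same user always lands in the same bucket
```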
1. Create a Holdout Experiment
- Create an experiment called, for example, “Global Holdout Q1 2024”.
- Splits: Set coverage to 10% and use a 50/50 split. This gives you the two 5% holdout groups described above (both the coverage and the split are customizable).
- Metrics: Likely you do not want to use conversion windows with a holdout experiment. Users get exposed to the holdout experiment as soon as you start testing the feature, so conversion windows may expire before the holdout treatment group ever actually sees the released feature (the timing issue is sketched after this list). There are two solutions:
- Use your regular metrics and set Conversion Window Override to Ignore Conversion Windows in your Experiment Analysis settings
- Use metrics that have no conversion window OR use a Lookback Window to only measure the last X days of the holdout. For example, if you want to run a holdout for 2 months, but only measure effects in the last month, you can use metrics with 30 day Lookback Windows (you can use metric overrides within the experiment to do this, or create versions of your metrics that use Lookback Windows).
- Start the experiment, even though it doesn't have any linked features.
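To see why conversion windows are a poor fit, consider a hypothetical two-month holdout; the dates and window lengths below are assumptions chosen only to make the arithmetic concrete, not GrowthBook defaults.

```ts
// Hypothetical timeline for a two-month holdout, just to make the
// conversion-window problem concrete.
const MS_PER_DAY = 24 * 60 * 60 * 1000;

const holdoutStart = new Date("2024-01-01"); // holdout users are bucketed as soon as testing begins
const featureRelease = new Date("2024-03-01"); // holdout treatment group first sees the feature
const conversionWindowDays = 3; // a typical short conversion window

const daysBeforeRelease =
  (featureRelease.getTime() - holdoutStart.getTime()) / MS_PER_DAY;

// ~60 days pass before release, so a 3-day conversion window anchored at
// first exposure closed long before the treatment group saw anything new.
console.log(daysBeforeRelease > conversionWindowDays); // true -> window already expired

// A 30-day lookback window instead restricts the analysis to the final
// 30 days of the holdout, i.e. behavior measured while the feature is live.
```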
2. Add Holdout Experiment to All Future Features
- Create a feature that you want to add to your holdout. Before testing or launching the feature, add the above Holdout Experiment as an experiment rule ABOVE any feature experiment. This will ensure your holdout population never gets the feature until you choose to release it to them.
- Ensure that the holdout control and holdout treatment groups get the same value as one another, and that this value is the same as the default/control behavior for your feature test in the general population.
- To then test your new feature, add an experiment rule to your feature, where the control gets the same value as the holdout groups (see the rule-order sketch after this list).
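For illustration, here is roughly what the feature definition for "New Checkout Flow" could look like at this point, with the holdout experiment rule sitting above the feature experiment rule and both holdout variations still serving the control value. The feature key, experiment keys, and field shapes are assumptions that only approximate the GrowthBook SDK payload, not an exact export.

```ts
// Rough sketch of the "New Checkout Flow" feature definition during the
// testing phase. Field names only approximate the GrowthBook SDK payload;
// treat keys and values as placeholders.
const newCheckoutFlow = {
  defaultValue: false,
  rules: [
    // 1) Holdout experiment rule - evaluated first. 10% of traffic matches,
    //    and BOTH variations serve the control value while the feature is
    //    still being tested.
    {
      key: "global-holdout-q1-2024",
      coverage: 0.1,
      hashAttribute: "id",
      variations: [false, false], // [holdout control, holdout treatment]
      weights: [0.5, 0.5],
    },
    // 2) Regular feature experiment rule - only the remaining 90%
    //    (the general population) ever reaches this rule.
    {
      key: "new-checkout-flow-experiment",
      coverage: 1.0,
      hashAttribute: "id",
      variations: [false, true], // [control, new checkout flow]
      weights: [0.5, 0.5],
    },
  ],
};
```

Application code never needs to know about the holdout; it checks the flag as usual, for example with `growthbook.isOn("new-checkout-flow")` in the JavaScript SDK, and the rule order does the rest.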
See the image below for an example of the state the test feature (here called “New Checkout Flow”) should be in now:
3. Launch Features
Once you are ready to launch your feature, you have to take two steps:
- Release the feature in the general population as you would normally. You can do this by enabling “Temporary rollout” in your feature experiment (navigate to the experiment, click “Stop Experiment”, and then pick the winning variation with temporary rollout enabled), or by editing the feature experiment rule to roll it out to all users manually.
- Update the holdout experiment rule manually to roll out the winning variation to the holdout treatment group: just update the feature flag value that the holdout treatment group is getting. Then you'll be in the state below (also sketched in code after this list).
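Continuing the earlier sketch, the post-launch state might look roughly like this: the feature experiment rule has been replaced by a rollout of the winning value, and the holdout treatment variation has been flipped to that value while the holdout control keeps the old behavior. As before, the keys and field shapes are illustrative assumptions.

```ts
// Rough sketch of the same feature definition after launch; again, the
// field names and keys are illustrative, not an exact GrowthBook export.
const newCheckoutFlowAfterLaunch = {
  defaultValue: false,
  rules: [
    // 1) Holdout experiment rule - unchanged coverage and split, but the
    //    treatment variation now serves the released value while the
    //    control variation keeps the old behavior.
    {
      key: "global-holdout-q1-2024",
      coverage: 0.1,
      hashAttribute: "id",
      variations: [false, true], // [holdout control, holdout treatment]
      weights: [0.5, 0.5],
    },
    // 2) The feature experiment is stopped; the winning value is simply
    //    forced for everyone who is not in the holdout.
    {
      force: true,
    },
  ],
};
```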
4. Monitor your Holdout Experiment!
That's it! Initially, both holdout groups will be seeing the same version of the app, so this experiment will show no differences. As you begin rolling out features, you should begin to see differences between the two groups. At a high level, here's how this setup would look for one quarter in which you ran three tests and rolled out two features:
A couple of notes:
- As you can see, the double blue and orange shaded region will serve as the full test of both shipped features together. You may even want to extend the measurement period beyond the quarter (without adding any new features) in order to let all of the features be live for the holdout treatment group for a period.
- You'll also note that the holdout treatment group only gets features once they are released to everyone. This helps keep the measurement of the impact of the features clean, as you are not measuring the impact of a feature being tested with different settings, but rather the impact of the final feature being released. That said, the alternative is also interesting, where you compare the general population to your holdout control group, but that isn't currently supported in GrowthBook.