Experiments are the core of GrowthBook. This page covers several topics:
- Starting and Stopping
When you create a new experiment, it starts out as a draft and remains fully editable until you start it.
Experiment drafts are a great place to collaborate between PMs, designers, and engineers.
PMs can spec out the requirements, designers can upload mockups and screenshots, and engineers can work on implementation.
Use the built-in discussion thread to add comments or collect feedback.
There are a few different ways to limit your experiment to a subset of users:
- Select specific user groups to include (e.g. internal employees, beta testers)
- Limit to logged-in users only
- Specify a URL regex pattern
- Custom targeting rules (e.g. age > 18)
All of these rules are evaluated locally in your app at runtime and no HTTP requests are made to the GrowthBook servers.
It's also possible to use a completely custom implementation (or another library like PlanOut). The only requirement is that you track in your datasource when users are put into an experiment and which variation they received.
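As a rough illustration of what a custom implementation could look like, the sketch below assigns users deterministically by hashing and reports every assignment through a tracking callback. It is not GrowthBook's or PlanOut's algorithm; the function names, the SHA-256 bucketing scheme, and the `track` callback signature are all assumptions made for this example.

```python
import hashlib
from typing import Callable, Optional, Sequence


def hash_to_unit_interval(user_id: str, experiment_key: str) -> float:
    """Map a (user, experiment) pair to a stable number in [0, 1)."""
    digest = hashlib.sha256(f"{experiment_key}:{user_id}".encode("utf-8")).hexdigest()
    # Use the first 32 bits of the digest as the bucket value.
    return int(digest[:8], 16) / 0x100000000


def assign_variation(
    user_id: str,
    experiment_key: str,
    weights: Sequence[float],
    track: Callable[[str, str, int], None],
    is_eligible: bool = True,
) -> Optional[int]:
    """Return the variation index for this user, or None if they are not in the test.

    `is_eligible` is the result of whatever targeting rule the app evaluates
    locally (for example, age > 18). `track` is a hypothetical callback that
    writes the exposure to your data source.
    """
    if not is_eligible:
        return None

    bucket = hash_to_unit_interval(user_id, experiment_key)
    cumulative = 0.0
    for index, weight in enumerate(weights):
        cumulative += weight
        if bucket < cumulative:
            # Record who was put into the experiment and which variation they got.
            track(user_id, experiment_key, index)
            return index

    # Weights that sum to less than 1 leave the remaining users out of the test.
    return None
```

A call might look like `assign_variation(user["id"], "new-checkout", [0.5, 0.5], track, is_eligible=user["age"] > 18)`, where `user` and `"new-checkout"` are placeholders. Because the hash is deterministic, the same user always lands in the same variation without any network request. The sketch calls `track` on every evaluation, so de-duplicating exposures is left to the data source.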
Starting and Stopping
When you start an experiment, you will be prompted for how you want to split traffic between the variations and who you want to roll the test out to.
When stopping an experiment, you'll be prompted to enter which variation (if any) won and why you are stopping the test.
Client Library Integration
If you are using the Client Libraries to implement experiments, there are some additional steps you must take.
The Client Libraries never communicate with the GrowthBook servers. That means as soon as you deploy the A/B test code to production, people will start getting put into the experiment immediately, and the experiment will continue until you remove the code and do another deploy.
This separation has huge performance and reliability benefits (if GrowthBook goes down, it has no effect on your app), but it can be a bit unintuitive when you press the "Stop" button in the UI and people continue to be put into the experiment.
To get the best of both worlds, you can store a cached copy of experiments in Redis (or similar) and keep it up to date either by periodically hitting the GrowthBook API or by setting up a Webhook Endpoint. Then your app can query the cache at runtime to get the latest experiment statuses.
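A minimal sketch of the caching idea, using an in-process dictionary as a stand-in for Redis so the example stays self-contained. The `fetch_statuses` callable is hypothetical: it represents whatever code you write to pull experiment statuses from the GrowthBook API or to read what a webhook handler stored. The `"running"` status string is also an assumption.

```python
import time
from typing import Callable, Dict, Optional


class ExperimentStatusCache:
    """In-process stand-in for a shared cache such as Redis."""

    def __init__(
        self,
        fetch_statuses: Callable[[], Dict[str, str]],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch_statuses = fetch_statuses
        self._ttl_seconds = ttl_seconds
        self._clock = clock
        self._statuses: Dict[str, str] = {}
        self._fetched_at: Optional[float] = None

    def _refresh_if_stale(self) -> None:
        now = self._clock()
        if self._fetched_at is not None and now - self._fetched_at < self._ttl_seconds:
            return
        try:
            self._statuses = dict(self._fetch_statuses())
        except Exception:
            # If the fetch fails, keep serving the last known copy so an
            # outage upstream does not affect the app.
            pass
        # Wait a full TTL before trying again, whether or not the fetch worked.
        self._fetched_at = now

    def is_running(self, experiment_key: str) -> bool:
        """Return True only if the cached status for this experiment is "running"."""
        self._refresh_if_stale()
        return self._statuses.get(experiment_key) == "running"
```

The app would then wrap its experiment code in a check such as `if cache.is_running("new-checkout"): ...`, so pressing "Stop" takes effect within one refresh interval instead of waiting for the next deploy.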
Each row of this table is a different metric.
Risk tells you how much you are predicted to lose if you choose the selected variation as the winner and you are wrong. Anything below 0.25% is highlighted green, indicating the risk is very low and it's safe to call the experiment. You can use the dropdown to see the risk of choosing a different winner.
Value is the conversion rate or average value per user. In small print you can see the raw numbers used to calculate this.
Chance to Beat Control tells you the probability that the variation is better than the control. Anything above 95% is highlighted green, indicating a very clear winner. Anything below 5% is highlighted red, indicating a very clear loser. Anything in between is grayed out, indicating it's inconclusive. If that's the case, there's either no measurable difference or you haven't gathered enough data yet.
Percent Change shows how much better/worse the variation is compared to the control. It is a probability density graph and the thicker the area, the more likely the true percent change will be there. As you collect more data, the tails of the graphs will shorten, indicating more certainty around the estimates.
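The highlighting thresholds for Risk and Chance to Beat Control can be summarized in a few lines. This sketch only restates the cutoffs described above; the function names and the convention of passing values as fractions (0.95 for 95%) are assumptions made for this example.

```python
def chance_to_beat_control_label(chance: float) -> str:
    """Classify a Chance to Beat Control value given as a fraction in [0, 1]."""
    if chance > 0.95:
        return "winner"        # highlighted green
    if chance < 0.05:
        return "loser"         # highlighted red
    return "inconclusive"      # grayed out


def is_low_risk(risk: float) -> bool:
    """Return True when risk, given as a fraction, is below the 0.25% cutoff."""
    return risk < 0.0025
```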
Sample Ratio Mismatch (SRM)
Every experiment automatically checks for a Sample Ratio Mismatch and warns you if one is found. This happens when you expect a certain traffic split (e.g. 50/50) but you see something significantly different (e.g. 46/54). We only show this warning if the p-value is less than 0.001, which means it's extremely unlikely to occur by chance.
Like the warning says, you shouldn't trust the results since they are likely misleading. Instead, find and fix the source of the bug and restart the experiment.
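To get a feel for the 0.001 threshold, here is a sketch of an SRM check for a two-variation experiment. It uses a chi-squared goodness-of-fit test, which is a common choice for this kind of check; the text above does not say which test GrowthBook runs, so treat the method and the function name as assumptions.

```python
import math
from typing import Sequence


def srm_p_value(observed: Sequence[int], expected_weights: Sequence[float]) -> float:
    """Chi-squared goodness-of-fit p-value for exactly two variations (1 degree of freedom)."""
    if len(observed) != 2 or len(expected_weights) != 2:
        raise ValueError("this sketch only handles two variations")

    total = sum(observed)
    if total == 0:
        return 1.0  # no users yet, so there is nothing to flag

    weight_sum = sum(expected_weights)
    chi2 = 0.0
    for count, weight in zip(observed, expected_weights):
        expected = total * weight / weight_sum
        chi2 += (count - expected) ** 2 / expected

    # For 1 degree of freedom, the chi-squared survival function reduces to erfc.
    return math.erfc(math.sqrt(chi2 / 2.0))


# The same 46/54 split is unremarkable with 1,000 users but far below the
# 0.001 threshold with 10,000 users.
small_sample = srm_p_value([460, 540], [0.5, 0.5])
large_sample = srm_p_value([4600, 5400], [0.5, 0.5])
```

The comparison shows why the warning depends on sample size as well as on the observed split: a small deviation from the expected ratio becomes strong evidence of a bug only once enough users have been counted.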
Guardrail metrics are ones that you want to keep an eye on, but aren't trying to specifically improve with your experiment. For example, if you are trying to improve page load times, you may add revenue as a guardrail since you don't want to inadvertently harm it.
Guardrail results show up beneath the main table of metrics and you can click on one to expand it and show more info. They are colored based on "Chance of Being Worse", which is just the complement of "Chance to Beat Control". If there are more than 2 variations, the max value is used to determine the overall color. A "Chance of Being Worse" less than 65% is green and of no concern. Between 65% and 90% is yellow and should be watched as more data comes in. Above 90% is red and you may consider stopping the experiment. If we don't have enough data to accurately predict the "Chance of Being Worse", we will color the metric grey.
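The coloring rule for guardrails can be sketched as follows. The function name is hypothetical, values are passed as fractions, and using `None` to represent "not enough data" is an assumption made for this example.

```python
from typing import Optional, Sequence


def guardrail_color(chances_to_beat_control: Sequence[Optional[float]]) -> str:
    """Pick the overall guardrail color from each variation's Chance to Beat Control.

    Each entry is a fraction in [0, 1], or None when there is not enough data
    to estimate it for that variation.
    """
    if not chances_to_beat_control or any(c is None for c in chances_to_beat_control):
        return "grey"

    # "Chance of Being Worse" is the complement of "Chance to Beat Control".
    # With several variations, the largest value decides the overall color.
    worst = max(1.0 - c for c in chances_to_beat_control)

    if worst < 0.65:
        return "green"
    if worst <= 0.90:
        return "yellow"
    return "red"
```

For example, with two variations whose Chance to Beat Control values are 0.5 and 0.2, the Chance of Being Worse values are 0.5 and 0.8, so the metric is colored yellow.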
If you have defined dimensions for your users, you can use the Dimension dropdown to drill down into your results. This is very useful for debugging (e.g. if Safari is down, but the other browsers are fine, you may have an implementation bug).
Be careful. The more metrics and dimensions you look at, the more likely you are to see a false positive. If you find something that looks surprising, it's often worth a dedicated follow-up experiment to verify that it's real.