Data Sources

Data Sources are how GrowthBook connects to your analytics tool or data warehouse to automatically pull metrics and experiment results.

You can use GrowthBook without a Data Source, but the user experience is not as smooth since you must enter all data manually.

Below are the currently supported Data Sources:

  • Redshift
  • Snowflake
  • BigQuery
  • ClickHouse
  • AWS Athena
  • Postgres
  • MySQL/MariaDB
  • PrestoDB (and Trino)
  • Mixpanel
  • Google Analytics

Configuration Settings

To effectively use GrowthBook, you'll need to tell us a little about the shape of your data.

SQL Sources

GrowthBook can work with your existing SQL data, no matter what shape or format it is in, whether you have a strongly normalized schema, a single “events” table with JSON fields, or something in between.

GrowthBook just needs two base SQL queries plus one additional SQL query for each metric. Writing this SQL is a (mostly) one-time setup task. After defining these base queries and metrics, they can be reused across many experiments.

Don’t worry about the potentially huge number of rows returned by these raw queries. They are never run directly as-is and are instead combined, filtered, and aggregated as part of larger queries. Most of the final queries run by GrowthBook result in fewer than 10 rows returned.

In the spirit of transparency, any time a query is run, you should see a View Queries link in the app to view the raw SQL sent to the data warehouse. This can help with debugging or let you move a query into a tool like Mode Analytics for more advanced analysis.

1. Experiments Query

This is the most important query for GrowthBook. It needs to return which user was part of which experiment, what variation they saw, and when they saw it. The end result of the query should return data like this:

user_idanonymous_idtimestampexperiment_idvariation_id
123abcdef2021-08-23-10:53:04my-button-test0
456fedcba2021-08-23 10:53:06my-button-test1

If you use Segment to populate your data warehouse, the SQL will look something like this:

SELECT
  user_id,
  anonymous_id,
  received_at as timestamp,
  experiment_id,
  variation_id
FROM
  experiment_viewed

If you have all of your events in a single denormalized table, the query will be similar with the addition of a WHERE clause to filter down to just the variation assignment event type. If you have a custom data structure, you may even need to also add some JOINs.

Don’t worry about returning duplicate rows or ordering the result. We handle that for you automatically. Whatever you do, just make sure to return the exact column names shown above. If your table’s columns use a different name, add an alias in the SELECT list (e.g. SELECT original_column as new_column).

Experiment Dimensions

In addition to the standard 5 columns above, you can return additional dimension columns. For example, browser or referrer. These extra columns can be used to drill down into experiment results.

2. Page Views Query

This needs to return one row for every page view (or similar event) in your app. This is used to analyze historical user behavior. For example, we can use this data to predict how long an experiment will take before you start. As another example, we can use this to plot your overall metric conversion rates over time.

The end result should return data like this:

user_idanonymous_idtimestampurl
123abcdef2021-08-23-10:53:04/search
456fedcba2021-08-23 10:53:06/

If you use Segment to populate your data warehouse, you can use the default pages table and your query will look like this:

SELECT
  user_id,
  anonymous_id,
  received_at as timestamp,
  path as url
FROM
  pages

Similar to the experiment query, don’t worry about duplicate rows or ordering. Just make sure the column names you select exactly match the above.

3. Jupyter Notebook Query Runner

This setting is only required if you want to export experiment results as a Jupyter Notebook.

There is no one standard way to store credentials or run SQL queries from Jupyter notebooks, so GrowthBook lets you define your own Python function.

It needs to be called runQuery, accept a single string argument named sql, and return a pandas data frame.

Here's an example for a Postgres (or Redshift) data source:

import os
import psycopg2
import pandas as pd
from sqlalchemy import create_engine, text

# Use environment variables or similar for passwords!
password = os.getenv('POSTGRES_PW')
connStr = f'postgresql+psycopg2://user:{password}@localhost'
dbConnection = create_engine(connStr).connect();

def runQuery(sql):
  return pd.read_sql(text(sql), dbConnection)

Note: This python source is stored as plain text in the database. Do not hard-code passwords or sensitive info. Use environment variables (shown above) or another credential store instead.

Mixpanel

We query Mixpanel using JQL. We have sensible defaults for the event and property names, but you can change them if you need to.

  • Experiments
    • View Experiments Event - The name of the event you are firing when a user is put into a variation
    • Experiment Id Property - The property name that stores the experiment tracking key
    • Variation Id Property - The property name that stores the variation the user was assigned
    • Variation Id Format - What format the variation id is stored in.
      1. Numeric (0 = control, 1 = variation 1, etc.)
      2. Unique String Keys (e.g. "blue", "random-uuid", etc.)
  • Page Views
    • Page Views Event - the name of the event you are firing for every page view on your site
    • URL Path Property - in the event, the property name that stores the URL path for the pageview
    • User Agent Property - In the event, the property name that stores the user agent for the pageview

Connection Info

Connection info is encrypted twice - once within the app and again by the database when persisting to disk.

GrowthBook only runs SELECT queries (or the equivalent for non-SQL data sources). We still always recommend creating read-only users with as few permissions as possible.

If you are using GrowthBook Cloud (https://app.growthbook.io), make sure to whitelist the ip address 52.70.79.40 if applicable.

Most data sources have straight forward connection parameters like host, port, username, password. A few of the data sources, documented below, require some extra work to connect.

AWS Athena

Unlike other database engines with their own user management system, Athena uses IAM for authentication.

We recommend creating a new IAM user with readonly permissions for GrowthBook. The managed Quick Sight Policy is a good starting point.

For the S3 results url, we recommend naming your bucket with the prefix aws-athena-query-results-

BigQuery

You must first create a Service Account in Google with the proper permissions. To connect GrowthBook, we just need the JSON key file for that account. It should contain the project_id, client_email, and private_key.

Mixpanel

You must first create a Service Account in Mixpanel under your Project Settings.

To add the datasource in GrowthBook, you will need:

  1. The service account username
  2. The service account secret
  3. Your project id (found on the Project Settings Overview page)

Google Analytics

Because of Google Analytics tracking limitations, a user can only be in a single experiment at a time. We highly recommend using a more full-featured data source for serious A/B testing.

We require 3 things to query the Google Analytics API:

  1. OAuth Authorization
  2. View ID (found in Admin -> View Settings)
  3. Custom Dimension Index

When tracking experiment views, the custom dimension value must be formatted as experiment-key:variation-index. For example: my-test:0 for the control and my-test:1 for the 1st variation.