Skip to main content

Data Source Configuration

Overview

Data Sources are how GrowthBook connects to your data warehouse so that it can pull those aggregated statistics in order to compute metrics and experiment results. Each Data Source defines how to connect to your data, what version of SQL to use when querying your data, and can provide templates for what the SQL to connect to your Data Source should look like depending upon which event tracker software you use. GrowthBook works with your existing SQL data, no matter where it is located and no matter what shape or format it is in, whether you have a strongly normalized schema, a single “events” table with JSON fields, or something in between.

Supported Event Schemas

When adding a new Data Source in Growthbook from /datasources page we first guide you to select what event tracker software you use, or you can choose custom if you don't use any of the popular third party event trackers that we support. Telling us which event tracker you use, gives us an idea on the likely shape of your data. This will help us generate the correct sql to extract out the aggregated statistics from your site with as little modification on your end as possible. Here is Growthbook's current list of event trackers that we support with links for more details on how to set them up with GrowthBook:

If you do not use any of those you choose Custom Data Source and define some of the sql yourself.

Configuration Settings

Once you have chosen your event tracker and data source type and successfully connected, you will be given an opportunity to modify your configuration settings. For many applications Growthbook will have choosen the correct configuration settings straight out of the box based upon which event tracker you choose. In some instances, you may need to tweak them slightly, or in the case of using a custom datasource, define them more explicitly.

Identifier Types

These are all of the types of identifiers you use to split traffic in an experiment and track metric conversions. Common examples are user_id, anonymous_id, device_id, and ip_address.

There are some cases where a single database column isn't enough to uniquely identify a subject. For example, you might need the combination of company and user_id. In this case, we recommend creating a synthetic identifier by concatenating all of the fields together. For example, you can create a company_user identifier and then in your SQL, select it as follows: CONCAT(company, user_id) as company_user.

Experiment Assignment Queries

An experiment assignment query returns which users were part of which experiment, what variation they saw, and when they saw it. Each assignment query is tied to a single identifier type (defined above). You can also have multiple assignment queries if you store that data in different tables, for example, one from your email system and one from your back-end.

tip

Assignment queries are one-half of the queries that are used to generate experiment results, the other being metric queries. Assignment queries can be edited from the Metrics and DataData Sources page.

The end result of the query should return data like this:

user_idtimestampexperiment_idvariation_id
1232021-08-23-10:53:04my-button-test0
4562021-08-23 10:53:06my-button-test1

The above assumes the identifier type you are using is user_id. If you are using a different identifier, you would use a different column name.

Here's an example query you might use:

SELECT
user_id,
received_at as timestamp,
experiment_id,
variation_id
FROM
events
WHERE
event_type = 'viewed experiment'

Make sure to return the exact column names that GrowthBook is expecting. If your table’s columns use a different name, add an alias in the SELECT list (e.g. SELECT original_column as new_column).

Duplicate Rows

If a user sees an experiment multiple times, you should return multiple rows in your assignment query, one for each time the user was exposed to the experiment.

This helps us detect when users were exposed to more than one variation, and eventually may be useful in helping build interesting time series.

Experiment Dimensions

In addition to the standard 4 columns above, you can also select additional dimension columns. For example, browser or referrer. These extra columns can be used to drill down into experiment results.

Identifier Join Tables

If you have multiple identifier types and want to be able to auto-merge them together during analysis, you also need to define identifier join tables. For example, if your experiment is assigned based on device_id, but the conversion metric only has a user_id column.

These queries are very simple and just need to return columns for each of the identifier types being joined. For example:

SELECT user_id, device_id FROM logins

SQL Templates

We use {{Handlebars}} to compile the assignment sql, identity queries, etc. into what is actually called to your database.

You can use any of the in-built variables that Growthbook automatically sets:

  • startDate - yyyy-MM-dd HH:mm:ss of the earliest data that needs to be included
  • startDateISO - yyyy-MM-dd'T'HH:mm:ss.SSS'Z' of the startDate in ISO format. This can then be used with the date helper to achieve whatever format you like (ex. {{date startDateISO "yyyyMMdd"}})
  • endDate - yyyy-MM-dd HH:mm:ss of the latest data that needs to be included
  • endDateISO - yyyy-MM-dd'T'HH:mm:ss.SSS'Z' of the endDate in ISO format. This can then be used with the date helper to achieve whatever format you like (ex. {{date endDateISO "yyyyMMdd"}})
  • experimentId - Either a specific experiment id OR % if you should include all experiments

You can also use any of the in-built helper functions:

  • camelcase [str] - ex. {{camelcase "My database"}} compiles to myDatabase.
  • dotcase [str] - ex. {{dotcase "My database"}} compiles to my.database.
  • kebabcase [str] - ex. {{kebabcase "My database"}} compiles to my-database.
  • lowercase [str] - ex. {{lowercase "My database"}} compiles to my database.
  • pascalcase [str] - ex. {{pascalcase "My database"}} compiles to MyDatabase.
  • replace [str] [pattern] [replacement] - Replace all occurences of a regular expression with something else. ex. {{replace "My%%%Database!" "\[^a-zA-Z\]" ""}} compiles to MyDatabase
  • snakecase [str] - ex. {{pascalcase "My database"}} compiles to my_database.
  • uppercase [str] - ex. {{uppercase "My database"}} compiles to MY DATABASE.
  • date [date] [format] - Format an ISO date according to this format, being careful not to mix up months (MM) and minutes (mm). ex. {{date startDateISO "yyyyMMdd"}} might compile to 20230130. The most common codes are:
    codemeaning
    yyyyyear
    MMmonth
    ddday
    HHhour
    mmminutes
    ssseconds
    ttimestamp

For example:

SELECT
user_id,
anonymous_id,
received_at as timestamp,
experiment_id,
variation_id
FROM
events_*
WHERE
_TABLE_SUFFIX BETWEEN '{{date startDateISO "yyyyMMdd"}}' AND '{{date endDateISO "yyyyMMdd"}}'
AND event_name = 'experiment_viewed'
AND experiment_id LIKE '{{ experimentId }}'
note

The inserted values do not have surrounding quotes, so you must add those yourself (e.g. use '{{ startDate }}' instead of just {{ startDate }})

Jupyter Notebook Query Runner

This setting is only required if you want to export experiment results as a Jupyter Notebook.

There is no one standard way to store credentials or run SQL queries from Jupyter notebooks, so GrowthBook lets you define your own Python function.

It needs to be called runQuery, accept a single string argument named sql, and return a pandas data frame.

Here's an example for a Postgres (or Redshift) data source:

import os
import psycopg2
import pandas as pd
from sqlalchemy import create_engine, text

# Use environment variables or similar for passwords!
password = os.getenv('POSTGRES_PW')
connStr = f'postgresql+psycopg2://user:{password}@localhost'
dbConnection = create_engine(connStr).connect();

def runQuery(sql):
return pd.read_sql(text(sql), dbConnection)

Note: This python source is stored as plain text in the database. Do not hard-code passwords or sensitive info. Use environment variables (shown above) or another credential store instead.

Schema Browser

When you connect a supported data source to GrowthBook, we automatically generate metadata that is used by our Schema Browser. The Schema Browser is a user-friendly interface that makes writing queries easier as you can easily explore information about the datasource such as databases, schemas, tables, columns, and data types.

GrowthBook Schema Browser

Below are the data sources that currently support the Schema Browser:

  • AWS Athena - Requires a Default Catalog
  • BigQuery - Requires a Project Name and Default Dataset
  • ClickHouse
  • Databricks - Currently only supported on version 10.2 and above with a Unity Catalog
  • MsSQL/SQL Server
  • MySQL/MariaDB
  • Postgres
  • PrestoDB (and Trino) - Requires a Default Catalog
  • Redshift
  • Snowflake
note

If you added a supported data source prior to GrowthBook v2.0, you can generate the schema manually by clicking "Data Sources" on the left-nav, selecting the data source, and then clicking the "View Schema Browser" button and following the on-screen prompt.