Technical post-stratification details
Here we document the technical details behind GrowthBook regression adjustment and post-stratification. This approach permits estimation of absolute and relative effects, and unadjusted and CUPED inference, for binomial, count, and ratio metrics.
Throughout the document we describe CUPED estimation for ratio metrics, and then discuss simpler cases (e.g., non-adjusted estimates, count metrics).
We assume data are available for each cell; that is, either the number of cells is not too large, or we have already aggregated some cells together.
For each case there are four steps:

1. In Regression we describe how to construct regression estimates of the treatment effect and control mean for each cell (i.e., dimension level).
2. In Cell moments we describe how to construct cell-specific estimates of absolute treatment effects and control means using cell-specific summary statistics.
3. In Combining cell estimates we describe how to combine estimates across cells to estimate population effects and population control means.
4. Finally, in Delta method we transform the combined estimates into estimates of lift, ratio parameters, and so on.
Regression
Below we describe regression models for each cell, or dimension level. The regression output will be used in the next section to construct the joint sampling distribution of effect estimates and control means within a stratification cell.
We do this for ratio metrics, and discuss along the way the simpler case of count metrics.
Define $m_{i1}$ ($d_{i1}$) as the numerator (denominator) outcome for the $i^{\text{th}}$ user, $i = 1, 2, \ldots, N$.
Define $x_{im}$ ($x_{id}$) as the pre-exposure numerator (denominator) variable for the $i^{\text{th}}$ user.
Define $w_{i}$ as the binary treatment assignment for the $i^{\text{th}}$ user.
Define the covariate vector $\textbf{x}_{i} = \left(1, w_{i}, x_{im}, x_{id}\right)$.
Define the $N \times 4$ design matrix $\tilde{\textbf{X}}$ whose $i^{\text{th}}$ row equals $\textbf{x}_{i}$.
Define the $2N \times 8$ design matrix $\textbf{X} = \textbf{I}_{2} \otimes \tilde{\textbf{X}}$.
Define the $2N$-length vector $\boldsymbol{Y} = \left(m_{11}, m_{21}, \ldots, m_{N1}, d_{11}, d_{21}, \ldots, d_{N1}\right)^{\top}$, with the numerator outcomes stacked on top of the denominator outcomes to match the block structure of $\textbf{X}$.
Define the regression coefficients as $\boldsymbol{\gamma}$; its first four elements are the numerator coefficients and its last four are the denominator coefficients.
Our model is of the form

$$
\textbf{Y} = \textbf{X}\boldsymbol{\gamma} + \textbf{E}.
$$
The least squares solution for the $8 \times 1$ vector of regression coefficients $\boldsymbol{\gamma}$ is

$$
\hat{\boldsymbol{\gamma}} = \left(\textbf{X}^{\top}\textbf{X}\right)^{-1}\textbf{X}^{\top}\textbf{Y}.
$$
Define $\tilde{\boldsymbol{E}}$ as the $N \times 2$ matrix of residuals, whose first column holds the residuals for the numerator and whose second column holds the residuals for the denominator.
Define the $2 \times 2$ covariance of the rows of $\tilde{\boldsymbol{E}}$ (i.e., of the per-user error pairs in $\textbf{E}$) as $\boldsymbol{\Psi}$.
The covariance of $\hat{\boldsymbol{\gamma}}$ is

$$
\boldsymbol{\Sigma}_{\boldsymbol{\gamma}} = \text{Cov}\left(\hat{\boldsymbol{\gamma}}\right) = \boldsymbol{\Psi}\otimes\left(\tilde{\textbf{X}}^{\top}\tilde{\textbf{X}}\right)^{-1}.
$$
By Lyapunov's central limit theorem,

$$
\hat{\boldsymbol{\gamma}} \sim \mathcal{N}\left(\boldsymbol{\gamma}, \boldsymbol{\Sigma}_{\boldsymbol{\gamma}}\right).
$$
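The construction above reduces to two ordinary least squares fits on a shared design matrix. Below is a minimal numpy sketch of this step; the function name `cell_regression` and its argument layout are illustrative, not part of the GrowthBook codebase, and the residual degrees-of-freedom correction is one simple choice.

```python
import numpy as np

def cell_regression(m, d, w, x_m, x_d):
    """Within-cell regression for a ratio metric.

    Returns the stacked coefficient vector gamma_hat (numerator
    coefficients first, then denominator coefficients) and its 8x8
    covariance Psi kron (X~' X~)^{-1}.
    """
    n = len(m)
    X_tilde = np.column_stack([np.ones(n), w, x_m, x_d])  # N x 4 design matrix
    xtx_inv = np.linalg.inv(X_tilde.T @ X_tilde)

    # The two outcomes share the design, so fit each one by OLS.
    Y = np.column_stack([m, d])                 # N x 2: numerator, denominator
    coefs = xtx_inv @ X_tilde.T @ Y             # 4 x 2 coefficient matrix
    resid = Y - X_tilde @ coefs                 # N x 2 residual matrix E~

    # 2x2 residual covariance Psi (one simple degrees-of-freedom choice).
    psi = resid.T @ resid / (n - X_tilde.shape[1])

    gamma_hat = coefs.T.ravel()                 # length-8 stacked coefficients
    sigma_gamma = np.kron(psi, xtx_inv)         # 8 x 8 Cov(gamma_hat)
    return gamma_hat, sigma_gamma
```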
Cell moments
In this section we describe how to use the regression output from the previous section to construct the joint sampling distribution of effect estimates and control means within a stratification cell.
In the $k^{\text{th}}$ cell, our inferential focus is the vector $\boldsymbol{\alpha}_{k}$, which has four elements:

- the numerator control mean for the $k^{\text{th}}$ cell
- the numerator absolute effect estimate for the $k^{\text{th}}$ cell
- the denominator control mean for the $k^{\text{th}}$ cell
- the denominator absolute effect estimate for the $k^{\text{th}}$ cell

This ordering matches the rows of the contrast matrix defined below.
Now that we have our summary statistics in the form of a multivariate CLT, we linearly transform them to create our estimates of numerator and denominator effects and control means.
Define $\bar{x}_{m}$ ($\bar{x}_{d}$) as the sample mean of the pre-exposure numerator (denominator) variable.
Define $\mu_{xm}$ and $\mu_{xd}$ as their population counterparts.
Define the $4 \times 8$ contrast matrix

$$
\textbf{A}_{k, reg} =
\begin{pmatrix}
1 & 0 & \bar{x}_{m} & \bar{x}_{d} & 0 & 0 & 0 & 0\\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 1 & 0 & \bar{x}_{m} & \bar{x}_{d}\\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0
\end{pmatrix}.
$$
We estimate $\boldsymbol{\alpha}_{k}$ with $\hat{\boldsymbol{\alpha}}_{k} = \textbf{A}_{k, reg}\hat{\boldsymbol{\gamma}}_{k}$, where $\hat{\boldsymbol{\gamma}}_{k}$ is the coefficient estimate from the $k^{\text{th}}$ cell's regression.
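Continuing the illustrative sketch above, the contrast step might look as follows, assuming the same stacked coefficient ordering; the row order matches the definition of $\boldsymbol{\alpha}_{k}$.

```python
import numpy as np

def cell_alpha(gamma_hat, xbar_m, xbar_d):
    """Apply the 4x8 contrast A_{k,reg} to the stacked coefficients.

    Row order matches alpha_k: numerator control mean, numerator effect,
    denominator control mean, denominator effect.
    """
    A = np.zeros((4, 8))
    A[0, [0, 2, 3]] = 1.0, xbar_m, xbar_d  # numerator control mean at mean covariates
    A[1, 1] = 1.0                          # numerator treatment effect
    A[2, [4, 6, 7]] = 1.0, xbar_m, xbar_d  # denominator control mean
    A[3, 5] = 1.0                          # denominator treatment effect
    return A, A @ gamma_hat                # (A_{k,reg}, alpha_hat_k)
```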
We now calculate the covariance of $\hat{\boldsymbol{\alpha}}_{k}$, denoted $\boldsymbol{\Sigma}_{k}$.
Many readers may want to skip to the next section, Combining cell estimates, where we describe how to combine estimates across cells to estimate population absolute effects and control means.
One subtlety is that $\textbf{A}_{k, reg}$ has random components, which must be accounted for.
For inference within a cell, we condition on the sample size for that cell; we deal with the assignment randomness in the next section.
Technically, each of the covariances and expectations below is conditional on $n_{k}$, but we suppress this notation for clarity.
Below we describe how to calculate the means of, and covariances between, individual rows of $\textbf{A}_{k, reg}$.
The first moment of $\textbf{A}_{k, reg}$ is

$$
\begin{aligned}
E\left[\textbf{A}_{k, reg}\right]
&=
E\left[
\begin{pmatrix}
1 & 0 & \bar{x}_{m} & \bar{x}_{d} & 0 & 0 & 0 & 0\\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 1 & 0 & \bar{x}_{m} & \bar{x}_{d}\\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0
\end{pmatrix}
\right]
\\&=
\begin{pmatrix}
1 & 0 & \mu_{xm} & \mu_{xd} & 0 & 0 & 0 & 0\\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 1 & 0 & \mu_{xm} & \mu_{xd}\\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0
\end{pmatrix}.
\end{aligned}
$$
We also need the covariance between individual rows of $\textbf{A}_{k, reg}$.
Note that there is nothing random in the second and fourth rows of $\textbf{A}_{k, reg}$, so any covariance involving those rows is zero.
There are only four cases we need to consider.
Define $\sigma_{xm}^{2}$ and $\sigma_{xd}^{2}$ as the variances of the pre-exposure numerator and denominator variables, $\sigma_{xmd}$ as their covariance, and write $n$ for the cell sample size $n_{k}$.
We start with the covariance of the first row of $\textbf{A}_{k, reg}$ with itself:
$$
\begin{aligned}
\text{Cov}\left(\textbf{A}_{k, reg}[1,], \textbf{A}_{k, reg}[1,]\right) &=
E\left[\textbf{A}_{k, reg}[1,]^{\top}\textbf{A}_{k, reg}[1,]\right] -
E\left[\textbf{A}_{k, reg}[1,]\right]^{\top}E\left[\textbf{A}_{k, reg}[1,]\right]
\\&=
\begin{pmatrix}
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & \sigma_{xm}^{2}/n & \sigma_{xmd}/n & 0 & 0 & 0 & 0\\
0 & 0 & \sigma_{xmd}/n & \sigma_{xd}^{2}/n & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix}.
\end{aligned}
$$
Using a similar argument for the (3, 3) case:
$$
\text{Cov}\left(\textbf{A}_{k, reg}[3,], \textbf{A}_{k, reg}[3,]\right)
=
\begin{pmatrix}
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & \sigma_{xm}^{2}/n & \sigma_{xmd}/n\\
0 & 0 & 0 & 0 & 0 & 0 & \sigma_{xmd}/n & \sigma_{xd}^{2}/n
\end{pmatrix}.
$$
For the (1, 3) case:
$$
\begin{aligned}
\text{Cov}\left(\textbf{A}_{k, reg}[1,], \textbf{A}_{k, reg}[3,]\right) &=
E\left[\textbf{A}_{k, reg}[1,]^{\top}\textbf{A}_{k, reg}[3,]\right] -
E\left[\textbf{A}_{k, reg}[1,]\right]^{\top}E\left[\textbf{A}_{k, reg}[3,]\right]
\\&=
\begin{pmatrix}
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & \sigma_{xm}^{2}/n & \sigma_{xmd}/n\\
0 & 0 & 0 & 0 & 0 & 0 & \sigma_{xmd}/n & \sigma_{xd}^{2}/n\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix}.
\end{aligned}
$$
For the (3, 1) case:
$$
\text{Cov}\left(\textbf{A}_{k, reg}[3,], \textbf{A}_{k, reg}[1,]\right) =
\text{Cov}\left(\textbf{A}_{k, reg}[1,], \textbf{A}_{k, reg}[3,]\right)^{\top}.
$$
Define $\boldsymbol{\mu}_{k, reg}$ as the mean of $\textbf{A}_{k, reg}$, i.e., $\boldsymbol{\mu}_{k, reg} = E\left[\textbf{A}_{k, reg}\right]$ above.
By the law of total covariance,

$$
\begin{aligned}
\text{Cov}\left(\hat{\boldsymbol{\alpha}}_{k}\right) &= \text{Cov}\left(\textbf{A}_{k, reg}\hat{\boldsymbol{\gamma}}_{k}\right)
\\&= E\left[\text{Cov}\left(\textbf{A}_{k, reg}\hat{\boldsymbol{\gamma}}_{k}|\textbf{A}_{k, reg}\right)\right] +
\text{Cov}\left[E\left(\textbf{A}_{k, reg}\hat{\boldsymbol{\gamma}}_{k}|\textbf{A}_{k, reg}\right)\right]
\\&= E\left[\textbf{A}_{k, reg}\text{Cov}\left(\hat{\boldsymbol{\gamma}}_{k}\right)\textbf{A}_{k, reg}^{\top}\right] +
\text{Cov}\left[\textbf{A}_{k, reg}\boldsymbol{\gamma}_{k}\right].
\end{aligned}
$$
The first term has $(i, j)^{\text{th}}$ element equal to

$$
\begin{aligned}
E\left[\textbf{A}_{k, reg}\text{Cov}\left(\hat{\boldsymbol{\gamma}}_{k}\right)\textbf{A}_{k, reg}^{\top}\right][i, j] &= E\left[\textbf{A}_{k, reg}[i,]\,\text{Cov}\left(\hat{\boldsymbol{\gamma}}_{k}\right)\textbf{A}_{k, reg}[j,]^{\top}\right]
\\&= E\left[\text{trace}\left(\textbf{A}_{k, reg}[i,]\,\text{Cov}\left(\hat{\boldsymbol{\gamma}}_{k}\right)\textbf{A}_{k, reg}[j,]^{\top}\right)\right]
\\&= E\left[\text{trace}\left(\text{Cov}\left(\hat{\boldsymbol{\gamma}}_{k}\right)\textbf{A}_{k, reg}[j,]^{\top}\textbf{A}_{k, reg}[i,]\right)\right]
\\&= \text{trace}\left(\text{Cov}\left(\hat{\boldsymbol{\gamma}}_{k}\right)E\left[\textbf{A}_{k, reg}[j,]^{\top}\textbf{A}_{k, reg}[i,]\right]\right),
\end{aligned}
$$

where the second equality holds because the argument is a scalar, and the third uses the cyclic property of the trace.
A similar argument applies to the second term.
Therefore,

$$
\begin{aligned}
\boldsymbol{\Sigma}_{k}[i, j] = \text{Cov}\left(\hat{\boldsymbol{\alpha}}_{k}\right)[i, j] &=
\text{trace}\left(\text{Cov}\left(\hat{\boldsymbol{\gamma}}_{k}\right)E\left[\textbf{A}_{k, reg}[j,]^{\top}\textbf{A}_{k, reg}[i,]\right]\right)
\\&+
\text{trace}\left(\boldsymbol{\gamma}_{k}\boldsymbol{\gamma}_{k}^{\top}\text{Cov}\left(\textbf{A}_{k, reg}[i,], \textbf{A}_{k, reg}[j,]\right)\right),
\end{aligned}
$$

where $E\left[\textbf{A}_{k, reg}[j,]^{\top}\textbf{A}_{k, reg}[i,]\right] = \text{Cov}\left(\textbf{A}_{k, reg}[j,], \textbf{A}_{k, reg}[i,]\right) + \boldsymbol{\mu}_{k, reg}[j,]^{\top}\boldsymbol{\mu}_{k, reg}[i,]$, using the row covariances computed above.
In practice, we substitute $\hat{\boldsymbol{\gamma}}_{k}$ for the unknown $\boldsymbol{\gamma}_{k}$.
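Putting this section together, here is a minimal sketch of the $\boldsymbol{\Sigma}_{k}$ computation; it plugs sample quantities in for the unknown population moments, and the function name and argument layout are illustrative.

```python
import numpy as np

def cell_sigma(A, sigma_gamma, gamma_hat, S_x, n):
    """Sigma_k = Cov(alpha_hat_k) via the trace formula.

    A:           4x8 contrast evaluated at the sample means (plug-in for E[A]).
    sigma_gamma: 8x8 Cov(gamma_hat); gamma_hat is plugged in for gamma_k.
    S_x:         2x2 sample covariance of (x_m, x_d); n: cell sample size.
    """
    # Rows 0 and 2 (0-indexed) carry the random sample means (xbar_m, xbar_d).
    xbar_cols = {0: slice(2, 4), 2: slice(6, 8)}

    def cov_rows(i, j):
        # Cov(A[i,], A[j,]): nonzero only when both rows contain sample means.
        C = np.zeros((8, 8))
        if i in xbar_cols and j in xbar_cols:
            C[xbar_cols[i], xbar_cols[j]] = S_x / n
        return C

    gg = np.outer(gamma_hat, gamma_hat)
    sigma_k = np.zeros((4, 4))
    for i in range(4):
        for j in range(4):
            # E[A[j,]' A[i,]] = Cov(A[j,], A[i,]) + E[A[j,]]' E[A[i,]]
            second_moment = cov_rows(j, i) + np.outer(A[j], A[i])
            sigma_k[i, j] = (np.trace(sigma_gamma @ second_moment)
                             + np.trace(gg @ cov_rows(i, j)))
    return sigma_k
```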
Combining cell estimates
At a high level, for each cell we now have estimates of population means (e.g., the control mean and the absolute effect), and uncertainty about those estimates.
In this section we describe how to combine these estimates across cells to estimate population absolute effects and control means.
This algorithm can be used for count or ratio metrics, unadjusted or adjusted (e.g., CUPED), and absolute or relative inference.
Define the population (sample) proportion for the $k^{\text{th}}$ stratification cell as $\nu_{k}$ ($\hat{\nu}_{k}$).
Under stratified sampling, the $\nu_{k}$ are deterministic, and we could define $\hat{\boldsymbol{\alpha}} = \sum_{k=1}^{K}\nu_{k}\hat{\boldsymbol{\alpha}}_{k}$ and $\hat{\boldsymbol{\Sigma}} = \sum_{k=1}^{K}\nu_{k}^{2}n_{k}^{-1}\hat{\boldsymbol{\Sigma}}_{k}$. However, we do not conduct stratified sampling in GrowthBook. Under simple random sampling the $\hat{\nu}_{k}$ are multinomial random variables, and we could define $\hat{\boldsymbol{\alpha}} = \sum_{k=1}^{K}\hat{\nu}_{k}\hat{\boldsymbol{\alpha}}_{k}$.
Define the $4 \times K$ matrix $\boldsymbol{\alpha}_{M}$ as the matrix whose $k^{\text{th}}$ column is $\boldsymbol{\alpha}_{k}$, and define $\hat{\boldsymbol{\alpha}}_{M}$ analogously.
Define the vector $\hat{\boldsymbol{\nu}} = \left(\hat{\nu}_{1}, \ldots, \hat{\nu}_{K}\right)^{\top}$.
Our point estimate is $\hat{\boldsymbol{\alpha}}_{M}\hat{\boldsymbol{\nu}} = \sum_{k=1}^{K}\hat{\nu}_{k}\hat{\boldsymbol{\alpha}}_{k}$, and its expected value is

$$
\begin{aligned}
E\left(\hat{\boldsymbol{\alpha}}_{M}\hat{\boldsymbol{\nu}}\right) &= E_{\hat{\boldsymbol{\nu}}}\left(E_{\hat{\boldsymbol{\alpha}}}\left(\hat{\boldsymbol{\alpha}}_{M}\hat{\boldsymbol{\nu}}|\hat{\boldsymbol{\nu}}\right)\right)
\\&= \boldsymbol{\alpha}_{M}E_{\hat{\boldsymbol{\nu}}}\left(\hat{\boldsymbol{\nu}}\right)
\\&= \boldsymbol{\alpha}_{M}\boldsymbol{\nu}.
\end{aligned}
$$

Below we derive its covariance.
The naive covariance is $\hat{\boldsymbol{\Sigma}} = \sum_{k=1}^{K}\hat{\nu}_{k}^{2}n_{k}^{-1}\hat{\boldsymbol{\Sigma}}_{k}$, where here and below $\boldsymbol{\Sigma}_{k}$ denotes the scaled, per-user covariance $n_{k}\text{Cov}\left(\hat{\boldsymbol{\alpha}}_{k}\right)$.
Alternatively, we can use Equation 15 in (Xie and Aurisset 2016) to define

$$
\hat{\boldsymbol{\Sigma}} = n^{-1}\sum_{k=1}^{K}\left(\boldsymbol{\nu}[k] + \frac{1-\boldsymbol{\nu}[k]}{n}\right)\boldsymbol{\Sigma}_{k}.
$$
Both approaches assume the population cell proportions $\nu_{k}$ are known.
For GrowthBook experiments, the $\hat{\nu}_{k}$ are random variables, so this assumption is not met: there is dependence between the $n_{k}$ (or equivalently, between the $\hat{\nu}_{k}$) that is not accounted for when estimating the variance.
We show below in the section Derivation of conditional covariance that $\text{Cov}\left(\hat{\boldsymbol{\alpha}}_{M}\hat{\boldsymbol{\nu}}\right)$ is:
$$
\text{Cov}\left(\hat{\boldsymbol{\alpha}}_{M}\hat{\boldsymbol{\nu}}\right) =
\boldsymbol{\alpha}_{M}\text{Cov}\left(\hat{\boldsymbol{\nu}}\right)\boldsymbol{\alpha}_{M}^{\top}
+
n^{-1}\sum_{k=1}^{K}\boldsymbol{\nu}[k]\boldsymbol{\Sigma}_{k}.
$$
Note that $\hat{\boldsymbol{\nu}}$ is a multinomial count vector divided by $n$, so its $K \times K$ covariance matrix has $k^{\text{th}}$ diagonal element equal to $\nu_{k}\left(1-\nu_{k}\right)/n$ and $(i, j)^{\text{th}}$ off-diagonal element equal to $-\nu_{i}\nu_{j}/n$.
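A minimal sketch of the combination step under simple random sampling, plugging $\hat{\nu}_{k}$ in for the unknown $\nu_{k}$; note that with $\boldsymbol{\Sigma}_{k} = n_{k}\text{Cov}\left(\hat{\boldsymbol{\alpha}}_{k}\right)$, the second term of the covariance reduces to $\sum_{k}\hat{\nu}_{k}^{2}\text{Cov}\left(\hat{\boldsymbol{\alpha}}_{k}\right)$. The function name is illustrative.

```python
import numpy as np

def combine_cells(alpha_hat_ks, cov_alpha_ks, n_ks):
    """Combine per-cell estimates across K cells.

    alpha_hat_ks: K vectors alpha_hat_k; cov_alpha_ks: K matrices
    Cov(alpha_hat_k); n_ks: K cell sample sizes.
    """
    n = sum(n_ks)
    nu_hat = np.array(n_ks) / n                 # estimated cell proportions
    alpha_M = np.column_stack(alpha_hat_ks)     # 4 x K matrix alpha_M

    alpha_hat = alpha_M @ nu_hat                # point estimate alpha_M nu_hat

    # Multinomial covariance of nu_hat, with nu_hat plugged in for nu.
    cov_nu = (np.diag(nu_hat) - np.outer(nu_hat, nu_hat)) / n

    # Second term n^{-1} sum_k nu_k Sigma_k, with Sigma_k = n_k Cov(alpha_hat_k),
    # simplifies to sum_k nu_hat_k^2 Cov(alpha_hat_k).
    sigma = alpha_M @ cov_nu @ alpha_M.T
    sigma = sigma + sum(v**2 * c for v, c in zip(nu_hat, cov_alpha_ks))
    return alpha_hat, sigma
```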
Delta method
To recapitulate, we now have an estimate of the joint sampling distribution of the vector $\boldsymbol{\alpha}$, which has four elements:

- the numerator control mean
- the numerator absolute effect estimate
- the denominator control mean
- the denominator absolute effect estimate
To estimate lift (relative effects), we use the delta method.
Delta method for ratio metrics
By the central limit theorem,

$$
\hat{\boldsymbol{\alpha}}
=\begin{pmatrix}
\hat{\boldsymbol{\alpha}}_{1}\\
\hat{\boldsymbol{\alpha}}_{2}\\
\hat{\boldsymbol{\alpha}}_{3}\\
\hat{\boldsymbol{\alpha}}_{4}
\end{pmatrix}\sim\mathcal{N}\left(\boldsymbol{\alpha}=\begin{pmatrix}
\boldsymbol{\alpha}_{1}\\
\boldsymbol{\alpha}_{2}\\
\boldsymbol{\alpha}_{3}\\
\boldsymbol{\alpha}_{4}
\end{pmatrix},\boldsymbol{\Sigma}\right).
$$
Define $g_{abs}(\boldsymbol{\alpha}) = \frac{\boldsymbol{\alpha}[1] + \boldsymbol{\alpha}[2]}{\boldsymbol{\alpha}[3] + \boldsymbol{\alpha}[4]} - \frac{\boldsymbol{\alpha}[1]}{\boldsymbol{\alpha}[3]}$.
Define

$$
\begin{aligned}
g_{rel}(\boldsymbol{\alpha}) &= \frac{\frac{\boldsymbol{\alpha}[1] + \boldsymbol{\alpha}[2]}{\boldsymbol{\alpha}[3] + \boldsymbol{\alpha}[4]} - \frac{\boldsymbol{\alpha}[1]}{\boldsymbol{\alpha}[3]}}{\boldsymbol{\alpha}[1] / \boldsymbol{\alpha}[3]}
\\&= \frac{\frac{\boldsymbol{\alpha}[1] + \boldsymbol{\alpha}[2]}{\boldsymbol{\alpha}[3] + \boldsymbol{\alpha}[4]}}{\boldsymbol{\alpha}[1] / \boldsymbol{\alpha}[3]} - 1
\\&= \frac{\boldsymbol{\alpha}[3]\left(\boldsymbol{\alpha}[1] + \boldsymbol{\alpha}[2]\right)}{\boldsymbol{\alpha}[1]\left(\boldsymbol{\alpha}[3] + \boldsymbol{\alpha}[4]\right)} - 1
\\&= \frac{g_{rel, N}}{g_{rel, D}} - 1.
\end{aligned}
$$
Define $g \in \left\{g_{abs}, g_{rel}\right\}$.
Define the length-$4$ vector of partial derivatives $\boldsymbol{\nabla} = \frac{\partial g}{\partial \boldsymbol{\alpha}}$.
If $g = g_{abs}$ then set $\boldsymbol{\nabla}$ equal to $\boldsymbol{\nabla}_{abs}$, where

$$
\begin{aligned}
\boldsymbol{\nabla}_{abs}[1] &= \frac{1}{\boldsymbol{\alpha}[3] + \boldsymbol{\alpha}[4]} - \frac{1}{\boldsymbol{\alpha}[3]}\\
\boldsymbol{\nabla}_{abs}[2] &= \frac{1}{\boldsymbol{\alpha}[3] + \boldsymbol{\alpha}[4]}\\
\boldsymbol{\nabla}_{abs}[3] &= -\frac{\boldsymbol{\alpha}[1] + \boldsymbol{\alpha}[2]}{\left(\boldsymbol{\alpha}[3] + \boldsymbol{\alpha}[4]\right)^{2}} + \frac{\boldsymbol{\alpha}[1]}{\boldsymbol{\alpha}[3]^{2}}\\
\boldsymbol{\nabla}_{abs}[4] &= -\frac{\boldsymbol{\alpha}[1] + \boldsymbol{\alpha}[2]}{\left(\boldsymbol{\alpha}[3] + \boldsymbol{\alpha}[4]\right)^{2}}.
\end{aligned}
$$
If $g = g_{rel}$ then set $\boldsymbol{\nabla}$ equal to $\boldsymbol{\nabla}_{rel}$, where

$$
\begin{aligned}
\boldsymbol{\nabla}_{rel}[1] &= \frac{\boldsymbol{\alpha}[3]g_{rel, D} - \left(\boldsymbol{\alpha}[3] + \boldsymbol{\alpha}[4]\right)g_{rel, N}}{g_{rel, D}^{2}}\\
\boldsymbol{\nabla}_{rel}[2] &= \frac{\boldsymbol{\alpha}[3]}{g_{rel, D}}\\
\boldsymbol{\nabla}_{rel}[3] &= \frac{\left(\boldsymbol{\alpha}[1] + \boldsymbol{\alpha}[2]\right)g_{rel, D} - \boldsymbol{\alpha}[1]g_{rel, N}}{g_{rel, D}^{2}}\\
\boldsymbol{\nabla}_{rel}[4] &= -\frac{\boldsymbol{\alpha}[3]\left(\boldsymbol{\alpha}[1] + \boldsymbol{\alpha}[2]\right)}{\boldsymbol{\alpha}[1]\left(\boldsymbol{\alpha}[3] + \boldsymbol{\alpha}[4]\right)^{2}}.
\end{aligned}
$$
By the delta method,

$$
\hat{\Delta} = g(\hat{\boldsymbol{\alpha}}) \sim \mathcal{N}\left(\Delta = g\left(\boldsymbol{\alpha}\right), \boldsymbol{\nabla}^{\top}\boldsymbol{\Sigma}\boldsymbol{\nabla}\right).
$$
In summary, the steps for the algorithm are:

1. Compute the point estimate $\hat{\Delta} = g(\hat{\boldsymbol{\alpha}})$.
2. Compute the estimated variance $\hat{v} = \boldsymbol{\nabla}^{\top}\boldsymbol{\Sigma}\boldsymbol{\nabla}$, evaluating $\boldsymbol{\nabla}$ at $\hat{\boldsymbol{\alpha}}$.
3. Return $(\hat{\Delta}, \hat{v})$.
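A minimal sketch of these steps for ratio metrics, implementing $g_{abs}$, $g_{rel}$, and their gradients exactly as defined above (the function name and `relative` flag are illustrative):

```python
import numpy as np

def delta_method_ratio(alpha, sigma, relative=True):
    """Lift for a ratio metric via the delta method.

    alpha = (num control mean, num effect, den control mean, den effect);
    sigma is its 4x4 covariance. Returns (estimate, variance).
    """
    cm_n, eff_n, cm_d, eff_d = alpha
    if relative:
        g_num = cm_d * (cm_n + eff_n)                # g_{rel,N}
        g_den = cm_n * (cm_d + eff_d)                # g_{rel,D}
        estimate = g_num / g_den - 1.0
        grad = np.array([
            (cm_d * g_den - (cm_d + eff_d) * g_num) / g_den**2,
            cm_d / g_den,
            ((cm_n + eff_n) * g_den - cm_n * g_num) / g_den**2,
            -cm_d * (cm_n + eff_n) / (cm_n * (cm_d + eff_d)**2),
        ])
    else:
        estimate = (cm_n + eff_n) / (cm_d + eff_d) - cm_n / cm_d
        grad = np.array([
            1.0 / (cm_d + eff_d) - 1.0 / cm_d,
            1.0 / (cm_d + eff_d),
            -(cm_n + eff_n) / (cm_d + eff_d)**2 + cm_n / cm_d**2,
            -(cm_n + eff_n) / (cm_d + eff_d)**2,
        ])
    return estimate, grad @ sigma @ grad
```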
Delta method for count metrics
Define $\hat{\boldsymbol{\alpha}}$ as the $2 \times 1$ vector with the control sample mean and the numerator effect estimate.
Define $\hat{\boldsymbol{\Sigma}}$ as the $2 \times 2$ covariance of $\hat{\boldsymbol{\alpha}}$.
By the central limit theorem,

$$
\hat{\boldsymbol{\alpha}}
=\begin{pmatrix}
\hat{\boldsymbol{\alpha}}_{1}\\
\hat{\boldsymbol{\alpha}}_{2}
\end{pmatrix}\sim\mathcal{N}\left(\boldsymbol{\alpha}=\begin{pmatrix}
\boldsymbol{\alpha}_{1}\\
\boldsymbol{\alpha}_{2}
\end{pmatrix},\boldsymbol{\Sigma}\right).
$$
Define $g_{abs}(\boldsymbol{\alpha}) = \boldsymbol{\alpha}[2]$.
Define $g_{rel}(\boldsymbol{\alpha}) = \frac{\boldsymbol{\alpha}[2]}{\boldsymbol{\alpha}[1]}$.
Define $g \in \left\{g_{abs}, g_{rel}\right\}$.
Define the length-$2$ vector of partial derivatives $\boldsymbol{\nabla} = \frac{\partial g}{\partial \boldsymbol{\alpha}}$.
If $g = g_{abs}$ then set $\boldsymbol{\nabla}$ equal to $\boldsymbol{\nabla}_{abs}$, where

$$
\begin{aligned}
\boldsymbol{\nabla}_{abs}[1] &= 0\\
\boldsymbol{\nabla}_{abs}[2] &= 1.
\end{aligned}
$$
If $g = g_{rel}$ then set $\boldsymbol{\nabla}$ equal to $\boldsymbol{\nabla}_{rel}$, where

$$
\begin{aligned}
\boldsymbol{\nabla}_{rel}[1] &= -\frac{\boldsymbol{\alpha}[2]}{\boldsymbol{\alpha}[1]^{2}}\\
\boldsymbol{\nabla}_{rel}[2] &= \frac{1}{\boldsymbol{\alpha}[1]}.
\end{aligned}
$$
By the delta method,

$$
\hat{\Delta} = g(\hat{\boldsymbol{\alpha}}) \sim \mathcal{N}\left(\Delta = g\left(\boldsymbol{\alpha}\right), \boldsymbol{\nabla}^{\top}\boldsymbol{\Sigma}\boldsymbol{\nabla}\right).
$$
In summary, the steps for the algorithm are:

1. Compute the point estimate $\hat{\Delta} = g(\hat{\boldsymbol{\alpha}})$.
2. Compute the estimated variance $\hat{v} = \boldsymbol{\nabla}^{\top}\boldsymbol{\Sigma}\boldsymbol{\nabla}$, evaluating $\boldsymbol{\nabla}$ at $\hat{\boldsymbol{\alpha}}$.
3. Return $(\hat{\Delta}, \hat{v})$.
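The count-metric case is the two-dimensional specialization of the same recipe; a minimal sketch:

```python
import numpy as np

def delta_method_count(alpha, sigma, relative=True):
    """Lift for a count metric: alpha = (control mean, effect), sigma is 2x2."""
    cm, eff = alpha
    if relative:
        estimate = eff / cm
        grad = np.array([-eff / cm**2, 1.0 / cm])
    else:
        estimate = eff
        grad = np.array([0.0, 1.0])
    return estimate, grad @ sigma @ grad
```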
Appendix
Derivation of conditional covariance
In this section we derive the covariance of $\hat{\boldsymbol{\alpha}}_{M}\hat{\boldsymbol{\nu}}$ using standard results from linear models.
Recall that if $\textbf{A}$ is a matrix and $\textbf{Z}$ is a random vector with mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Psi}$, then $E\left(\textbf{Z}^{\top}\textbf{A}\textbf{Z}\right) = \boldsymbol{\mu}^{\top}\textbf{A}\boldsymbol{\mu} + \text{tr}\left(\textbf{A}\boldsymbol{\Psi}\right)$.
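As a quick sanity check of this identity, one can compare both sides by simulation; the numbers below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary 2-dimensional example of E(Z' A Z) = mu' A mu + tr(A Psi).
mu = np.array([1.0, -2.0])
psi = np.array([[2.0, 0.5], [0.5, 1.0]])
A = np.array([[3.0, 1.0], [0.0, 2.0]])

Z = rng.multivariate_normal(mu, psi, size=200_000)
lhs = np.einsum("ni,ij,nj->n", Z, A, Z).mean()  # Monte Carlo E(Z' A Z)
rhs = mu @ A @ mu + np.trace(A @ psi)
print(lhs, rhs)  # agree up to simulation noise
```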
Define $\boldsymbol{\alpha}_{l}$ as the $l^{\text{th}}$ row of $\boldsymbol{\alpha}_{M}$, and analogously define $\hat{\boldsymbol{\alpha}}_{m}$ as the $m^{\text{th}}$ row of $\hat{\boldsymbol{\alpha}}_{M}$.
Below we derive $\text{Cov}\left(\hat{\boldsymbol{\alpha}}_{l}, \hat{\boldsymbol{\alpha}}_{m}|\hat{\boldsymbol{\nu}}\right)$.
First we need the following result:
$$
\begin{aligned}
E\left(\hat{\boldsymbol{\alpha}}_{l}\hat{\boldsymbol{\alpha}}_{m}^{\top}|\hat{\boldsymbol{\nu}}\right)
&=
\begin{bmatrix}
E\left(\hat{\boldsymbol{\alpha}}_{l}[1]\hat{\boldsymbol{\alpha}}_{m}[1]|\hat{\boldsymbol{\nu}}\right) & E\left(\hat{\boldsymbol{\alpha}}_{l}[1]\hat{\boldsymbol{\alpha}}_{m}[2]|\hat{\boldsymbol{\nu}}\right) & \ldots & E\left(\hat{\boldsymbol{\alpha}}_{l}[1]\hat{\boldsymbol{\alpha}}_{m}[K]|\hat{\boldsymbol{\nu}}\right) \\
E\left(\hat{\boldsymbol{\alpha}}_{l}[2]\hat{\boldsymbol{\alpha}}_{m}[1]|\hat{\boldsymbol{\nu}}\right) & E\left(\hat{\boldsymbol{\alpha}}_{l}[2]\hat{\boldsymbol{\alpha}}_{m}[2]|\hat{\boldsymbol{\nu}}\right) & \ldots & E\left(\hat{\boldsymbol{\alpha}}_{l}[2]\hat{\boldsymbol{\alpha}}_{m}[K]|\hat{\boldsymbol{\nu}}\right) \\
\vdots & \vdots & \ddots & \vdots \\
E\left(\hat{\boldsymbol{\alpha}}_{l}[K]\hat{\boldsymbol{\alpha}}_{m}[1]|\hat{\boldsymbol{\nu}}\right) & E\left(\hat{\boldsymbol{\alpha}}_{l}[K]\hat{\boldsymbol{\alpha}}_{m}[2]|\hat{\boldsymbol{\nu}}\right) & \ldots & E\left(\hat{\boldsymbol{\alpha}}_{l}[K]\hat{\boldsymbol{\alpha}}_{m}[K]|\hat{\boldsymbol{\nu}}\right)
\end{bmatrix}
\\&=
\begin{bmatrix}
E\left(\hat{\boldsymbol{\alpha}}_{l}[1]\hat{\boldsymbol{\alpha}}_{m}[1]|\hat{\boldsymbol{\nu}}\right) & E\left(\hat{\boldsymbol{\alpha}}_{l}[1]|\hat{\boldsymbol{\nu}}\right)E\left(\hat{\boldsymbol{\alpha}}_{m}[2]|\hat{\boldsymbol{\nu}}\right) & \ldots & E\left(\hat{\boldsymbol{\alpha}}_{l}[1]|\hat{\boldsymbol{\nu}}\right)E\left(\hat{\boldsymbol{\alpha}}_{m}[K]|\hat{\boldsymbol{\nu}}\right) \\
E\left(\hat{\boldsymbol{\alpha}}_{l}[2]|\hat{\boldsymbol{\nu}}\right)E\left(\hat{\boldsymbol{\alpha}}_{m}[1]|\hat{\boldsymbol{\nu}}\right) & E\left(\hat{\boldsymbol{\alpha}}_{l}[2]\hat{\boldsymbol{\alpha}}_{m}[2]|\hat{\boldsymbol{\nu}}\right) & \ldots & E\left(\hat{\boldsymbol{\alpha}}_{l}[2]|\hat{\boldsymbol{\nu}}\right)E\left(\hat{\boldsymbol{\alpha}}_{m}[K]|\hat{\boldsymbol{\nu}}\right) \\
\vdots & \vdots & \ddots & \vdots \\
E\left(\hat{\boldsymbol{\alpha}}_{l}[K]|\hat{\boldsymbol{\nu}}\right)E\left(\hat{\boldsymbol{\alpha}}_{m}[1]|\hat{\boldsymbol{\nu}}\right) & E\left(\hat{\boldsymbol{\alpha}}_{l}[K]|\hat{\boldsymbol{\nu}}\right)E\left(\hat{\boldsymbol{\alpha}}_{m}[2]|\hat{\boldsymbol{\nu}}\right) & \ldots & E\left(\hat{\boldsymbol{\alpha}}_{l}[K]\hat{\boldsymbol{\alpha}}_{m}[K]|\hat{\boldsymbol{\nu}}\right)
\end{bmatrix},
\end{aligned}
$$

where the second equality holds because, conditional on $\hat{\boldsymbol{\nu}}$, estimates from different cells are independent.
Therefore,
$$
\begin{aligned}
\text{Cov}\left(\hat{\boldsymbol{\alpha}}_{l}, \hat{\boldsymbol{\alpha}}_{m}|\hat{\boldsymbol{\nu}}\right) &=
E\left(\hat{\boldsymbol{\alpha}}_{l}\hat{\boldsymbol{\alpha}}_{m}^{\top}|\hat{\boldsymbol{\nu}}\right) -
E\left(\hat{\boldsymbol{\alpha}}_{l}|\hat{\boldsymbol{\nu}}\right)E\left(\hat{\boldsymbol{\alpha}}_{m}^{\top}|\hat{\boldsymbol{\nu}}\right)
\\&= \begin{bmatrix}
\text{Cov}\left(\hat{\boldsymbol{\alpha}}_{l}[1], \hat{\boldsymbol{\alpha}}_{m}[1]|\hat{\boldsymbol{\nu}}\right) & 0 & \ldots & 0 \\
0 & \text{Cov}\left(\hat{\boldsymbol{\alpha}}_{l}[2], \hat{\boldsymbol{\alpha}}_{m}[2]|\hat{\boldsymbol{\nu}}\right) & \ldots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \ldots & \text{Cov}\left(\hat{\boldsymbol{\alpha}}_{l}[K], \hat{\boldsymbol{\alpha}}_{m}[K]|\hat{\boldsymbol{\nu}}\right)
\end{bmatrix}
\\&=
n^{-1}\begin{bmatrix}
\boldsymbol{\Sigma}_{1}[l, m]/\hat{\boldsymbol{\nu}}[1] & 0 & \ldots & 0 \\
0 & \boldsymbol{\Sigma}_{2}[l, m]/\hat{\boldsymbol{\nu}}[2] & \ldots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \ldots & \boldsymbol{\Sigma}_{K}[l, m]/\hat{\boldsymbol{\nu}}[K]
\end{bmatrix}.
\end{aligned}
$$
To get the $(l, m)^{\text{th}}$ element of the covariance, we first calculate the $(l, m)^{\text{th}}$ element of the second moment, which is $E\left(\hat{\boldsymbol{\alpha}}_{l}\hat{\boldsymbol{\nu}}\hat{\boldsymbol{\nu}}^{\top}\hat{\boldsymbol{\alpha}}_{m}^{\top}\right)$:
$$
\begin{aligned}
E\left(\hat{\boldsymbol{\alpha}}_{l}\hat{\boldsymbol{\nu}}\hat{\boldsymbol{\nu}}^{\top}\hat{\boldsymbol{\alpha}}_{m}^{\top}\right)
&=
E_{\hat{\boldsymbol{\nu}}}\left(E_{\hat{\boldsymbol{\alpha}}}\left(\hat{\boldsymbol{\alpha}}_{l}\hat{\boldsymbol{\nu}}\hat{\boldsymbol{\nu}}^{\top}\hat{\boldsymbol{\alpha}}_{m}^{\top}|\hat{\boldsymbol{\nu}}\right)\right)
\\&=
E_{\hat{\boldsymbol{\nu}}}\left(\boldsymbol{\alpha}_{l}\hat{\boldsymbol{\nu}}\hat{\boldsymbol{\nu}}^{\top}\boldsymbol{\alpha}_{m}^{\top}\right)
+
E_{\hat{\boldsymbol{\nu}}}\left(\text{tr}\left(\text{Cov}\left(\hat{\boldsymbol{\alpha}}_{l}, \hat{\boldsymbol{\alpha}}_{m}|\hat{\boldsymbol{\nu}}\right)\hat{\boldsymbol{\nu}}\hat{\boldsymbol{\nu}}^{\top}\right)\right)
\\&=
E_{\hat{\boldsymbol{\nu}}}\left(\boldsymbol{\alpha}_{l}\hat{\boldsymbol{\nu}}\hat{\boldsymbol{\nu}}^{\top}\boldsymbol{\alpha}_{m}^{\top}\right)
+
E_{\hat{\boldsymbol{\nu}}}\left(n^{-1}\sum_{k=1}^{K}\hat{\boldsymbol{\nu}}[k]\boldsymbol{\Sigma}_{k}[l, m]\right)
\\&=
\boldsymbol{\alpha}_{l}\left(\text{Cov}\left(\hat{\boldsymbol{\nu}}\right) + \boldsymbol{\nu}\boldsymbol{\nu}^{\top}\right)\boldsymbol{\alpha}_{m}^{\top}
+
n^{-1}\sum_{k=1}^{K}\boldsymbol{\nu}[k]\boldsymbol{\Sigma}_{k}[l, m].
\end{aligned}
$$

The second equality applies a bilinear version of the quadratic-form identity above with $\textbf{A} = \hat{\boldsymbol{\nu}}\hat{\boldsymbol{\nu}}^{\top}$. The third equality uses the fact that the trace of the product of a diagonal matrix and another matrix is the sum of the products of their diagonal elements: $\text{Cov}\left(\hat{\boldsymbol{\alpha}}_{l}, \hat{\boldsymbol{\alpha}}_{m}|\hat{\boldsymbol{\nu}}\right)$ is diagonal with $k^{\text{th}}$ entry $n^{-1}\boldsymbol{\Sigma}_{k}[l, m]/\hat{\boldsymbol{\nu}}[k]$, and the $k^{\text{th}}$ diagonal entry of $\hat{\boldsymbol{\nu}}\hat{\boldsymbol{\nu}}^{\top}$ is $\hat{\boldsymbol{\nu}}[k]^{2}$.
In summary, subtracting $E\left(\hat{\boldsymbol{\alpha}}_{l}\hat{\boldsymbol{\nu}}\right)E\left(\hat{\boldsymbol{\alpha}}_{m}\hat{\boldsymbol{\nu}}\right) = \boldsymbol{\alpha}_{l}\boldsymbol{\nu}\boldsymbol{\nu}^{\top}\boldsymbol{\alpha}_{m}^{\top}$ from the second moment gives the $(l, m)^{\text{th}}$ element of $\text{Cov}\left(\hat{\boldsymbol{\alpha}}_{M}\hat{\boldsymbol{\nu}}\right)$:

$$
\text{Cov}\left(\hat{\boldsymbol{\alpha}}_{M}\hat{\boldsymbol{\nu}}\right)[l, m] =
\boldsymbol{\alpha}_{l}\text{Cov}\left(\hat{\boldsymbol{\nu}}\right)\boldsymbol{\alpha}_{m}^{\top}
+
n^{-1}\sum_{k=1}^{K}\boldsymbol{\nu}[k]\boldsymbol{\Sigma}_{k}[l, m].
$$
Therefore, the covariance of $\hat{\boldsymbol{\alpha}}_{M}\hat{\boldsymbol{\nu}}$ is

$$
\text{Cov}\left(\hat{\boldsymbol{\alpha}}_{M}\hat{\boldsymbol{\nu}}\right) =
\boldsymbol{\alpha}_{M}\text{Cov}\left(\hat{\boldsymbol{\nu}}\right)\boldsymbol{\alpha}_{M}^{\top}
+
n^{-1}\sum_{k=1}^{K}\boldsymbol{\nu}[k]\boldsymbol{\Sigma}_{k}.
$$