Fitting GLM with large datasets

Author: Dr Chibisi Chima-Okereke Created: December 15, 2014 15:22:00 GMT Published: December 15, 2014 15:22:00 GMT

Update on 2020-11-13 18:07 GMT. This article describes an approach that is a useful starting point, but my thinking on this topic has developed markedly. The bigReg package uses the principles described here and was created some time ago from another, more comprehensive package (also based on the same principles). There are currently no plans to update the R package, but a new implementation, offering better performance on large data sets and more solver options, is being written in the D programming language and should be available for future release on a commercial basis. Enjoy the article.

bigReg is our package for fitting GLMs to large data sets. Today we outline the basic idea behind the fitting procedure of this package. Writing algorithms that fit GLMs to large data sets is increasingly important. The basic idea centers around multiplying two matrices together. Consider the process of obtaining a matrix \(C\) by multiplying two matrices \(A\) and \(B\) together:

$$C = A B$$

It turns out that the matrix \(C\) can be computed by considering subsets of the matrices \(A\) and \(B\), i.e. we can do operations on sub-matrices \(A^k\) and \(B^k\) to obtain matrices \(C^k\). Let \(A\) have dimensions \(m \times n\) and \(B\) have dimensions \(n \times p\). We split the columns of matrix \(A\) into \(K\) sub-matrices \(A^k\), each with \(n_k\) columns, where \(k = 1, 2, \ldots, K\), and we split the rows of matrix \(B\) into \(K\) corresponding sub-matrices \(B^k\), each with \(n_k\) rows. Note that the number of columns in each of the matrices \(A^k\) is the same as the number of rows in the corresponding matrix \(B^k\), so we can carry out the matrix product on each pair of subsets to obtain \(C^k\):

$$C^k = A^k B^k$$

Each of the matrices \(C^k\) is the same size as the matrix \(C\), and each element \(C_{ij}\) is obtained by summing over the corresponding elements of the sub-matrices, \(C^k_{ij}\):

$$C_{ij} = \sum^{K}_{k = 1}{C^k_{ij}}$$

From now on we will represent this equivalence using the form:

$$AB = \sum_{k = 1}^{K}{A^k B^k}$$

Below is example code demonstrating this equivalence.

# Create a function that generates matrices
genMat <- function(rows, cols) matrix(runif(rows*cols), nrow = rows)
# Demo the function
genMat(3, 4)
          [,1]      [,2]       [,3]      [,4]
[1,] 0.6002032 0.4055390 0.55304991 0.2820599
[2,] 0.3571885 0.2586743 0.89608278 0.4897254
[3,] 0.4946626 0.2257442 0.08907279 0.1204725
# Create two lists of 10 matrices each: the A[[k]] are 10 x 10
#	and the B[[k]] are 5 x 10
A <- lapply(rep(10, 10), genMat, cols = 10)
B <- lapply(rep(5, 10), genMat, cols = 10)
 
AA <- do.call(rbind, A)   # 100 x 10: the A^k stacked as row-blocks
BB <- do.call(cbind, B)   # 5 x 100: the B^k joined as column-blocks
 
# Compute the full product. Note that here BB plays the role of the
#	left matrix (split by columns) and AA the right (split by rows)
CC <- BB %*% AA
 
C <- 0
for(i in 1:10)
 C <- C + B[[i]] %*% A[[i]]   # accumulate the chunk products B^k A^k
# Print the matrices C and CC obtained with the two methods:
C
         [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]     [,8]
[1,] 21.21956 23.69982 23.00338 23.60277 24.66198 20.83243 21.46384 25.44557
[2,] 28.48572 29.96976 30.04569 29.28665 29.22631 26.84277 27.92835 29.80939
[3,] 24.05694 26.88173 26.13994 25.94221 24.12093 21.98697 23.80476 26.24499
[4,] 23.94928 27.20216 25.89008 22.74377 26.02501 23.70129 21.93890 25.71572
[5,] 22.21175 26.18588 25.71810 24.11232 24.98093 21.36459 24.52625 24.55646
         [,9]    [,10]
[1,] 22.23165 23.30815
[2,] 27.49436 27.57304
[3,] 24.21939 25.43667
[4,] 25.63084 25.16481
[5,] 24.99900 24.25997
CC
         [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]     [,8]
[1,] 21.21956 23.69982 23.00338 23.60277 24.66198 20.83243 21.46384 25.44557
[2,] 28.48572 29.96976 30.04569 29.28665 29.22631 26.84277 27.92835 29.80939
[3,] 24.05694 26.88173 26.13994 25.94221 24.12093 21.98697 23.80476 26.24499
[4,] 23.94928 27.20216 25.89008 22.74377 26.02501 23.70129 21.93890 25.71572
[5,] 22.21175 26.18588 25.71810 24.11232 24.98093 21.36459 24.52625 24.55646
         [,9]    [,10]
[1,] 22.23165 23.30815
[2,] 27.49436 27.57304
[3,] 24.21939 25.43667
[4,] 25.63084 25.16481
[5,] 24.99900 24.25997
# The two results differ only by floating-point rounding error
sum(C - CC)
[1] 4.973799e-14

Application to GLM

I guess we could stop the blog here given the above demonstration and simply ask those interested to apply the equivalence to GLM calculations, but let's show how it can be applied to fitting the coefficients. We have a design matrix \(X\) with \(n\) rows and \(p\) columns, and we would like to calculate the coefficient estimates \(\hat{\beta}\). If \(W\) is the \(n \times n\) diagonal weights matrix, the iterative GLM fitting process at iteration \(t\) can be represented by:

$$\hat{\beta}^{(t)} = (X^T W X)^{-1} X^T W z^{(t)}$$

where \(z^{(t)}\) is an \(n \times 1\) column vector, the working response, and is a function of \(\hat{\eta}^{(t-1)}\), \(\hat{\beta}^{(t-1)}\), and our dependent variable \(y\), i.e. \(z^{(t)} = f(\hat{\eta}^{(t-1)}, \hat{\beta}^{(t-1)}, y)\). Here \(\hat{\eta}\) is the estimate of the linear predictor.
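To make this concrete, below is a minimal sketch of a single update for logistic regression with the canonical (logit) link. This is an illustration under our own assumptions, not bigReg's implementation; the function name iwls_update and the crossprod formulation are choices made here for clarity.

# Sketch: one iteration of the update above for logistic regression
# (canonical logit link). Illustrative only, not the bigReg code.
iwls_update <- function(X, y, beta) {
  eta <- as.vector(X %*% beta)      # linear predictor eta-hat
  mu  <- 1 / (1 + exp(-eta))        # fitted means via the inverse link
  w   <- mu * (1 - mu)              # diagonal entries of W
  z   <- eta + (y - mu) / w         # working response z
  # Solve (X^T W X) beta = X^T W z without forming the n x n matrix W
  solve(crossprod(X, w * X), crossprod(X, w * z))
}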

The GLM fitting process repeats this calculation until either the change in the deviance \(D\) or the Manhattan (L1) norm of the change in the coefficients falls below some small tolerance, \(\epsilon_a\) or \(\epsilon_b\) (we use the latter measure):

$$\mid D^{(t)} - D^{(t - 1)} \mid \lt \epsilon_a$$

or

$$\sum{\mid \hat{\beta}^{(t)} - \hat{\beta}^{(t - 1)}\mid} \lt \epsilon_b$$
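Wrapping the update in a loop with the Manhattan-norm stopping rule gives a sketch of the whole fitting process; the starting values, tolerance, and iteration cap below are arbitrary choices for illustration.

# Sketch: iterate the update until the L1 change in the coefficients
# falls below the tolerance eps_b
fit_glm <- function(X, y, eps_b = 1e-8, maxit = 25) {
  beta <- rep(0, ncol(X))
  for (t in seq_len(maxit)) {
    beta_new <- as.vector(iwls_update(X, y, beta))
    if (sum(abs(beta_new - beta)) < eps_b) break  # Manhattan-norm criterion
    beta <- beta_new
  }
  beta_new
}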

Consider now that we split the \(n\) rows of \(X\) and \(z\) (and the corresponding diagonal blocks of \(W\)) into \(K\) parts, \(k = 1, \ldots, K\), where

$$z^{(t)}_{k} = f(\hat{\eta}_k^{(t-1)}, \hat{\beta}^{(t-1)}, y_k)$$

It can be shown that:

$$\sum^{K}_{k=1}{X^T_k W_k X_k} = X^T W X$$
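As a quick numerical check of this identity, we can reuse the genMat helper from above; the dimensions and chunk sizes here are arbitrary.

# Numerical check of the chunked identity for X^T W X
p <- 3; K <- 4
Xk <- lapply(rep(20, K), genMat, cols = p)      # K row-chunks of X
wk <- lapply(rep(20, K), function(n) runif(n))  # matching weight chunks
X <- do.call(rbind, Xk)
w <- unlist(wk)
full  <- crossprod(X, w * X)                    # X^T W X on the full data
parts <- Reduce(`+`, Map(function(Xi, wi) crossprod(Xi, wi * Xi), Xk, wk))
max(abs(full - parts))                          # should be ~ machine precision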

Now \(X^T W X\) is a \(p \times p\) matrix, much smaller than \(X\), especially for large \(n\). Also,

$$\sum^{K}_{k=1}{X^T_k W_k z_k} = X^T W z$$

The above holds since \(W\) and \(W_k\) are square diagonal matrices of dimension \(n\) and \(n_k\) respectively, so both products decompose block by block over the row chunks.
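Putting the pieces together, a chunked version of the update accumulates \(X^T_k W_k X_k\) and \(X^T_k W_k z_k\) across the chunks and then solves the small \(p \times p\) system. Below is a sketch, again for logistic regression; chunks is assumed to be a list of list(X = ..., y = ...) pieces, for example read from disk one at a time.

# Sketch: chunked version of the update. Accumulate X_k^T W_k X_k and
# X_k^T W_k z_k over the K chunks, then solve the small p x p system.
chunked_update <- function(chunks, beta) {
  p <- length(beta)
  XtWX <- matrix(0, p, p)
  XtWz <- numeric(p)
  for (ch in chunks) {                # only one chunk in memory at a time
    eta <- as.vector(ch$X %*% beta)
    mu  <- 1 / (1 + exp(-eta))
    w   <- mu * (1 - mu)
    z   <- eta + (ch$y - mu) / w
    XtWX <- XtWX + crossprod(ch$X, w * ch$X)          # X_k^T W_k X_k
    XtWz <- XtWz + as.vector(crossprod(ch$X, w * z))  # X_k^T W_k z_k
  }
  solve(XtWX, XtWz)
}

Since only the current chunk's rows are ever held in memory, the same iteration can be run over data sets far larger than RAM, which is the principle behind bigReg.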