Fitting GLMs to large datasets

Active Analytics Ltd: posted 15 December 15:22 PST by Chibisi Chima-Okereke

Introduction

ActiveReg is our package for fitting GLMs to large data sets. Today we outline the basic idea behind the package's fitting procedure.

Writing algorithms that fit GLMs to large data sets is what I would term an "open secret" for data scientists with enough technical competence. The basic idea centers around multiplying two matrices together. Consider the process of obtaining a matrix, \(C\), by multiplying two matrices \(A\) and \(B\) together:

\(C = A B\)

It turns out that the matrix \(C\) can be computed from subsets of the matrices \(A\) and \(B\), i.e. we can carry out operations on sub-matrices \(A^k\) and \(B^k\) to obtain matrices \(C^k\). Suppose \(A\) has \(m\) rows and \(n\) columns, and \(B\) has \(n\) rows and \(p\) columns. We split the columns of matrix \(A\) into \(K\) sub-matrices so that each split \(A^k\) has \(n_k\) columns, where \(k = 1, 2, \ldots, K\). We also split the rows of matrix \(B\) into \(K\) corresponding sub-matrices \(B^k\), each with \(n_k\) rows. Note that the number of columns in each of the matrices \(A^k\) is the same as the number of rows in the corresponding matrices \(B^k\). We can carry out the matrix multiplication on each pair \(A^k\) and \(B^k\) to obtain \(C^k\):

\(C^k = A^k B^k\)

Each of the matrices \(C^k\) is the same size as the matrix \(C\), and each element \(C_{ij}\) is obtained by summing over the corresponding elements of the sub-matrices, \(C^k_{ij}\):

\(C_{ij} = \sum^{N}_{k = 1}{C^k_{ij}}\)

From now on we will represent this equivalence using the form:

\(AB = \sum_{k = 1}^{K}{A^kB^k}\)

Below is example code demonstrating this equivalence.

# Create a function that generates matrices
genMat <- function(rows, cols) matrix(runif(rows * cols), nrow = rows)

# Demo the function
genMat(3, 4)
          [,1]      [,2]       [,3]      [,4]
[1,] 0.6002032 0.4055390 0.55304991 0.2820599
[2,] 0.3571885 0.2586743 0.89608278 0.4897254
[3,] 0.4946626 0.2257442 0.08907279 0.1204725

# Create two lists of 10 matrices each: the column blocks A^k of A,
#	each 5 rows by 10 columns, and the row blocks B^k of B,
#	each 10 rows by 10 columns
A <- lapply(rep(5, 10), genMat, cols = 10)
B <- lapply(rep(10, 10), genMat, cols = 10)

AA <- do.call(cbind, A)   # the full 5 x 100 matrix A
BB <- do.call(rbind, B)   # the full 100 x 10 matrix B

# Full multiplication
CC <- AA %*% BB

# Block-wise multiplication and summation
C <- 0
for(k in 1:10)
 C <- C + A[[k]] %*% B[[k]]

# Print the matrices C and CC obtained with the two methods:

C
         [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]     [,8]
[1,] 21.21956 23.69982 23.00338 23.60277 24.66198 20.83243 21.46384 25.44557
[2,] 28.48572 29.96976 30.04569 29.28665 29.22631 26.84277 27.92835 29.80939
[3,] 24.05694 26.88173 26.13994 25.94221 24.12093 21.98697 23.80476 26.24499
[4,] 23.94928 27.20216 25.89008 22.74377 26.02501 23.70129 21.93890 25.71572
[5,] 22.21175 26.18588 25.71810 24.11232 24.98093 21.36459 24.52625 24.55646
         [,9]    [,10]
[1,] 22.23165 23.30815
[2,] 27.49436 27.57304
[3,] 24.21939 25.43667
[4,] 25.63084 25.16481
[5,] 24.99900 24.25997

CC
         [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]     [,8]
[1,] 21.21956 23.69982 23.00338 23.60277 24.66198 20.83243 21.46384 25.44557
[2,] 28.48572 29.96976 30.04569 29.28665 29.22631 26.84277 27.92835 29.80939
[3,] 24.05694 26.88173 26.13994 25.94221 24.12093 21.98697 23.80476 26.24499
[4,] 23.94928 27.20216 25.89008 22.74377 26.02501 23.70129 21.93890 25.71572
[5,] 22.21175 26.18588 25.71810 24.11232 24.98093 21.36459 24.52625 24.55646
         [,9]    [,10]
[1,] 22.23165 23.30815
[2,] 27.49436 27.57304
[3,] 24.21939 25.43667
[4,] 25.63084 25.16481
[5,] 24.99900 24.25997


# Rounding error from the change in summation order
sum(C - CC)
[1] 4.973799e-14
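Summing the block products changes the order of the floating-point additions, so the two results differ only at machine precision; a tolerance-based comparison confirms they match:

all.equal(C, CC)
[1] TRUE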

Application to GLM

We could stop the post here given the above demonstration and simply ask those interested to apply the equivalence to the GLM calculations, but let's show how it can be applied to fitting the coefficients. We have a design matrix \(X\) with \(n\) rows and \(p\) columns, and we would like to calculate the coefficient estimates \(\hat{\beta}\). If \(W\) is the \(n \times n\) diagonal weights matrix, the iterative GLM fitting process at iteration \(t\) can be represented by:

\(\hat{\beta}^{(t)} = (X^T W X)^{-1} (X^T W z^{(t)})\)

where \(z^{(t)}\) is a column vector with \(n\) rows and is a function of \(\hat{\eta}^{(t-1)}\), \(\hat{\beta}^{(t-1)}\), and our dependent variable \(y\), i.e. \(z^{(t)} = f(\hat{\eta}^{(t-1)}, \hat{\beta}^{(t-1)}, y)\). Here \(\hat{\eta}\) is the estimate of the linear predictor.
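To make the update concrete, here is a minimal sketch of a single iteration for a logistic regression with the canonical logit link; the data and starting values are placeholders for illustration, and the weight and working-response formulas are specific to this family and link:

# One IRLS update for logistic regression (a minimal, illustrative sketch)
set.seed(1)
n <- 1000; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), nrow = n))
y <- rbinom(n, 1, 0.5)
beta <- rep(0, p)              # previous estimate beta^(t-1)

eta <- X %*% beta              # linear predictor eta^(t-1)
mu  <- 1 / (1 + exp(-eta))     # mean via the inverse logit link
w   <- as.vector(mu * (1 - mu))  # diagonal of W for the logit link
z   <- eta + (y - mu) / w      # working response z^(t)

# beta^(t) = (X'WX)^{-1} (X'Wz); w * X scales each row of X by its weight
beta <- solve(crossprod(X, w * X), crossprod(X, w * z))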

The GLM fitting process repeats this calculation until either the deviance \(D\) or the Manhattan norm of the coefficients changes by less than some small tolerance \(\epsilon\) (we use the latter measure):

\(\mid D^{(t)} - D^{(t - 1)} \mid \lt \epsilon_a\)

or

\(\sum{\mid \hat{\beta}^{(t)} - \hat{\beta}^{(t - 1)}\mid} \lt \epsilon_b \)

Consider now that we split the \(n\) rows of each of the matrices \(X\), \(W\), and \(z\) into \(k = 1, \ldots, K\) parts; since \(W\) is diagonal, splitting it amounts to taking the corresponding \(n_k \times n_k\) diagonal blocks \(W_k\). For the working response we have

\(z^{(t)}_{k} = f(\hat{\eta}_k^{(t-1)}, \hat{\beta}^{(t-1)}, y_k)\)

It can be shown that:

\(\sum^{K}_{k=1}{X^T_k W_k X_k} = X^T W X\)
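As with the earlier demonstration, this identity is easy to check numerically by splitting the rows of \(X\) and taking the matching diagonal blocks of \(W\) (here represented by the weight vector w):

# Check that the sum of X_k' W_k X_k over row chunks equals X' W X
n <- 100; p <- 4; K <- 10
X <- genMat(n, p)
w <- runif(n)                               # the diagonal of W
chunks <- split(1:n, rep(1:K, each = n/K))  # row indices of the K parts

full  <- crossprod(X, w * X)                # X' W X on the full data
parts <- Reduce(`+`, lapply(chunks, function(idx)
 crossprod(X[idx, ], w[idx] * X[idx, ])))   # sum of X_k' W_k X_k

sum(abs(full - parts))                      # again only rounding error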

Now \(X^T W X\) is a \(p \times p\) matrix, much smaller than \(X\), especially for large \(n\). Also,

\(\sum^{K}_{k=1}{(X^T_k W_k z_k)} = X^T Wz\)

The above holds since \(W\) and \(W_k\) are square diagonal matrices of dimension \(n\) and \(n_k\) respectively.
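Putting this together: each iteration makes one pass over the data, accumulating the \(p \times p\) matrix \(X^T W X\) and the \(p \times 1\) vector \(X^T W z\) chunk by chunk, then solves for \(\hat{\beta}^{(t)}\), repeating until the Manhattan-norm criterion is met. Below is a sketch for the logistic case; the function name is ours, and the in-memory lists Xs and ys stand in for chunks that would be streamed from disk in a real large-data setting:

# Chunked IRLS for logistic regression: a sketch of the idea, not
# ActiveReg's actual implementation. Xs and ys are lists of row chunks.
chunkedLogitFit <- function(Xs, ys, tol = 1e-8, maxit = 25) {
  p <- ncol(Xs[[1]])
  beta <- rep(0, p)
  for (i in 1:maxit) {
    XtWX <- matrix(0, p, p)
    XtWz <- matrix(0, p, 1)
    for (k in seq_along(Xs)) {            # one pass over the chunks
      eta <- Xs[[k]] %*% beta             # linear predictor for chunk k
      mu  <- 1 / (1 + exp(-eta))
      w   <- as.vector(mu * (1 - mu))     # diagonal block W_k
      z   <- eta + (ys[[k]] - mu) / w     # working response chunk z_k
      XtWX <- XtWX + crossprod(Xs[[k]], w * Xs[[k]])
      XtWz <- XtWz + crossprod(Xs[[k]], w * z)
    }
    betaNew <- solve(XtWX, XtWz)
    converged <- sum(abs(betaNew - beta)) < tol  # Manhattan norm criterion
    beta <- betaNew
    if (converged) break
  }
  as.vector(beta)
}

# Usage sketch: 10 chunks of 1000 rows each, simulated with known coefficients
Xs <- lapply(rep(1000, 10), genMat, cols = 4)
b  <- c(-0.5, 1, 0.25, -1)
ys <- lapply(Xs, function(X) rbinom(nrow(X), 1, 1/(1 + exp(-X %*% b))))
chunkedLogitFit(Xs, ys)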

Summary

We have revealed that the main "trick" behind fitting GLMs to large data sets is partitioning the matrix multiplication operation. There are many other things to consider when carrying out the fitting process, such as singularity issues in the design matrix and convergence, but those exist for both large and small data sets. Given the technical details outlined here, it is a (fairly) straightforward (and maybe tedious?) matter to parallelize the fitting process on a cluster, or to implement it with MapReduce or other suitable frameworks, as sketched below.
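The per-chunk contributions are independent of one another, so they map directly onto a map step (compute each chunk's cross-products) and a reduce step (sum them). As a hypothetical illustration using R's parallel package (the forking in mclapply requires a Unix-alike), with Xs, ys, and beta as in the sketch above:

library(parallel)

# Map step: one chunk's contributions to X'WX and X'Wz (hypothetical helper)
chunkContrib <- function(k, Xs, ys, beta) {
  eta <- Xs[[k]] %*% beta
  mu  <- 1 / (1 + exp(-eta))
  w   <- as.vector(mu * (1 - mu))
  z   <- eta + (ys[[k]] - mu) / w
  list(XtWX = crossprod(Xs[[k]], w * Xs[[k]]),
       XtWz = crossprod(Xs[[k]], w * z))
}

# Run the map step in parallel, then reduce by summation
beta     <- rep(0, ncol(Xs[[1]]))
contribs <- mclapply(seq_along(Xs), chunkContrib, Xs = Xs, ys = ys,
                     beta = beta, mc.cores = 2)
XtWX <- Reduce(`+`, lapply(contribs, `[[`, "XtWX"))
XtWz <- Reduce(`+`, lapply(contribs, `[[`, "XtWz"))
betaNew <- solve(XtWX, XtWz)   # one IRLS update, computed in parallel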

Data Science Consulting & Software Training

Active Analytics Ltd. is a data science consultancy and Open Source Statistical Software Training company. Please contact us for more details or to comment on the blog.

Dr. Chibisi Chima-Okereke, R Training, Statistics and Data Analysis.