Some data I/O benchmarks in R

Active Analytics Ltd: posted 15 Jan 2013 18:50 by Chibisi Chima-Okereke [ updated 22 May 2013 05:38 ]

Introduction

This blog post is an attempt to provide a look as some benchmarks for read/write times using data formats that can be read and written with basic R. In this exercise I have used "native R" and have not attempted to optimize it in anyway regarding parallel processing or specialized packages that optimize data read/write speeds.

The file formats covered here are CSV, RData, dBase, and bin (binary) files. The interested R programmer should see the data i/o manual for more details. I attempt to use MySQL using RODBC, but the connection, but the data set I used contained too many columns to be written, so I swiftly abandoned this idea. The thought of having to create a routine that splices the table down to hundreds of parts and recompile them only to tell me what I already know – database will be the slowest was not worth the time investment.

Machine Specifications: OS: Windows 7, Architecture: 64 bit, RAM: 16GB @1600 MHz, CPU: i7-3612QM, Hard disk type: Solid state

The data set

The KDD cup is an annual data mining competition, where a data mining problem is set for anyone interested to take part. Kind of like an Olympic event for geeks! The reason I mention this is that the data set I have used comes from KDD 2009 where the competition was on churn modelling. Incidentally, Allan Englehardt has written an interesting article on churn modelling. 2 out of 5 chunks of the large training data set was used. This amounts to a table with 19,999 rows and 15,000 columns, almost 300 million items (the first row of the raw chunk1 are the column names). As is, the raw 2 data chunks total to 642 MB in size. The first 14,740 columns are numeric variables, and the rest 260 columns are categorical.

This is clearly a large data set, and worse still, it is very wide. I chose a hard data set on purpose, it is of a size is where R starts to struggle with data manipulation and read/write operations, and it’s width makes it very unwieldy. Once you start trying to throw around data of this of size and shape, it is apparent that you need to think differently about these operations.

One thing that R gets criticized for is its memory hungriness; this is quite true. Everything is stored in memory, and once the memory is allocated, it’s pretty difficult to unallocated it regardless of whether you rm() the object and gc() any number of times. This may be to do with the way operations are carried out and where data is left over. Part of the problem also is that adding gc() inside an iterative read function will slow down the process and defeat the purpose of a benchmark. I tried this but did not notice any change in the memory usage. But memory in R is an issue for another day.

The Benchmarks

Introduction

Note that in these benchmarks before the read operation was done, R was restarted.

CSV file I/O

This benchmarks start with reading CSV files using the read.csv() function in R. Amongst the methods that worked, this took the longest. Some people may or may not be surprised to hear this. I wrote and read the data in a naïve way not bother to use any explicit chunking of the files leaving the operation to the read.csv() function.

# Reading in the CSV file
system.time(mainTable <- read.csv(file = paste(path, "mainTable.csv", sep = "")))
# user system elapsed 
# 191.63 2.14 193.80

# Writing the CSV file
system.time(write.csv(x = mainTable, file = paste(path, "mainTable.csv", sep = ""), row.names = FALSE))
# user system elapsed 
# 424.43 5.53 430.45 

The size on disk of the CSV file was 658 MB

dBase file I/O

Using this format required the foreign package. The size on disk of the file was 4 GB and the writing (therefore the reading) could not be done to completion.

system.time(write.dbf(mainTable, file = paste(path, "mainTable.dbf", sep = "/")))
# user system elapsed 
# 581.03 4.74 621.29 
# There were 50 or more warnings (use warnings() to see the first 50)

warnings()
# Warning messages:
# 1: In min(x) : no non-missing arguments to min; returning Inf, ...

system.time(mainTable <- read.dbf(file = paste(path, "mainTable.dbf", sep = "/")))
# user system elapsed 
# 5.84 0.78 8.76 
# Does not return all the data!
dim(mainTable)
# [1] 19999 664

RData/RDS format

The RData and RDS file formats are essentially the same and no difference in read/write speeds was shown. The size of the file on disk is 70.6 MB.

# Reading RData format
system.time(load(file = paste(path, "mainTable.RData", sep = "/")))
# user system elapsed 
# 10.63 0.44 11.06
# Writing RData format
system.time(save(mainTable, file = paste(path, "mainTable.RData", sep = "/")))
# user system elapsed 
# 38.11 0.42 38.58

Binary file format

The file was made up of categorical and numerical data which both amount to being numerical. The binary data format can therefore be purely numerical. Size of the data on the disk was 2.21 GB

We first take the table of factors, and extract all the factor levels.

# This is a table of the factors
tempTable <- mainTable[,14741:15000]

# First create the mapping layer for the factors, and save factors as integers
factorMap <- lapply(1:260, function(x){
  factLevels <- levels(tempTable[,x])
  # Note the global assign
  tempTable[,x] <<- as.integer(tempTable[,x])
  if(x %% 10 == 0)print(x)
  return(factLevels)
})
mainTable[,14741:15000] <- tempTable

Now we create a file for each of the 15,000 columns

# Writing binary files to folder
itemNames <- names(mainTable)
system.time(for(i in seq(along = itemNames))writeBin(mainTable[,i], 
            con = paste(path, "nativeBin\\", itemNames[i], ".bin", sep = ""), 
            endian = "little"))
# user  system elapsed 
# 2.90    6.13   12.67

# Saving the mapping layer
binLayer <- list("itemNames" = itemNames, "factorMap" = factorMap)
system.time(save(binLayer, file = paste(path, "binLayer.RData", sep = "")))
# user  system elapsed 
# 0.07    0.00    0.06 

Now we can read the binary files back (after restarting R). We read back the bin files into a list object, and bind it back into a data frame.

# The mapping layer
system.time(load(file = paste(path, "binLayer.RData", sep = "")))
# user  system elapsed 
# 0.03    0.00    0.03

# mainTable <- NULL
itemNames <- binLayer[[1]]
factorMap <- binLayer[[2]]

# Reading back bin files into a list
system.time(mainTable <- lapply(seq(along = itemNames), function(x){
  readData <- readBin(con = paste(path, "nativeBin\\", 
                              itemNames[x], ".bin", sep = ""), 
                  endian = "little", what = ifelse(x < 14741, "numeric", "integer"),
                  n = 19999)
  if(x > 14740){
    readData <- factor(readData, labels = factorMap[[(x - 14740)]])
  }
  
  return(readData)
  #rm(list = ls())
  #gc()
}))
# user  system elapsed 
# 6.11    2.71    9.77

# renaming the list
system.time(names(mainTable) <- itemNames)
# user  system elapsed 
# 1.06    0.59    1.65

# Binding the columns together
system.time(temp1 <- as.data.frame(mainTable[1:14740]))
# user  system elapsed 
# 4.23    1.04    5.27
system.time(temp2 <- as.data.frame(mainTable[14741:15000]))
# user  system elapsed 
# 0.03    0.00    0.04 
system.time(mainTable <- cbind(temp1, temp2))
# user  system elapsed 
# 1.17    0.03    1.21

# user  system elapsed 
# 1.40    0.67    2.07
# 17.94s all together

Summary

One of the main surprises from this exercise was that the RData format performed very will indeed. On more standard table dimensions, even up to 10 million rows but with less than 40 columns, I have seen the binary format read in a fraction of the time as RData. In this exercise I tried to write the binary file as a single string of numbers and reformat it back into a table of numbers and factors once back into R, however R died every time I attempted the write operation. However even if the read/write time might be faster, the time required for formatting the data back into a table with the factors etc. would be prohibitive – if at all possible on my system since extensive data manipulation such as folding a vector into a data frame and including the factor levels etc. takes up rather a lot of memory.

There are many R packages that cater to large file format, the high performance computing task views would be a good place to start investigating for those interested.

Data Science Consulting & Software Training

Active Analytics Ltd. is a data science consultancy, and Open Source Statistical Software Training company. Please contact us for more details or to comment on the blog.

Dr. Chibisi Chima-Okereke, R Training, Statistics and Data Analysis.