activeH5 benchmarks: Data Frame I/O

Author: Dr Chibisi Chima-Okereke Created: September 5, 2014 00:00:00 GMT Published: September 5, 2014 00:00:00 GMT

The philosophy of the activeH5 package is to make reading and writing the large amounts of data used in statistical models as easy and fast as possible. Our aim is to abstract away complexity and provide a simple programming interface that lets users access functionality quickly. We have built on HDF5 technology, developed at NCSA and now maintained by The HDF Group, to make our package back end reliable and fast, and we are fully committed to continuing to improve its performance and ease of use. As far as we are aware, activeH5 is the only open source R package for easily and reliably reading and writing large data frames to file - and certainly the only one that also allows the backing data to be held completely in memory.

We have built this package for storing data that will be used in models. Our approach to the internals of a data set is that everything is a number: statistical algorithms run on numbers, and categorical data is simply numerical data with a lookup. If you open one of our HDF5 files in a viewer you will therefore only see numbers in the main data; everything else is metadata that goes towards shaping those numbers into a recognisable data set.
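
As a rough illustration of this idea in plain R (this mirrors the principle only, not activeH5's actual on-disk layout), a categorical column can be reduced to integer codes plus a separate lookup of labels:

    # Categorical data as numbers plus a lookup (illustration only)
    carrier <- factor(c("AA", "UA", "AA", "DL"))

    codes  <- as.integer(carrier)   # what would sit in the main data: 1 3 1 2
    lookup <- levels(carrier)       # metadata: "AA" "DL" "UA"

    # Rebuilding the recognisable column from the numbers and the metadata
    reconstructed <- factor(lookup[codes], levels = lookup)
    identical(reconstructed, carrier)   # TRUE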

Economically, the fact is that non-volatile memory will remain much less expensive than RAM for the foreseeable future and has the advantage of persisting. However, whether you choose to store and read data from hard disk or from memory, activeH5 can accommodate you. Our package allows users to write large data frames and matrices to hard disk and RAM in chunks, quickly and easily. The package is open source with a generous MIT license and is located in our GitHub repository. If you are carrying out computations with large amounts of data in R, we encourage you to try activeH5.

Apples and Oranges

Creating benchmarks is tricky because the comparisons may not be equivalent and different packages have different emphases. For instance, the rhdf5 package's strength is in the range of objects that can be stored, and it also acts as an API for HDF5; the emphasis of the ff package is on speed; and the emphasis of the activeH5 package is on convenience, performance, and reliability geared towards matrices and data frames. These packages are trying to do related but different things; however, they all allow the user to store large amounts of data on disk and retrieve it relatively quickly. We encourage the reader to bear this in mind when viewing the benchmarks.

Today’s Benchmarks

Today we present benchmarks for reading and writing data frames using our activeH5 package, and compare it against other packages in R, Python, and Julia. The next blog post on this package will present benchmarks for writing large matrices; today we concentrate on reading and writing data frames, or the equivalent table-like structures in other programming languages.

The benchmarks were run on a computer with an Intel Core i7-3820QM CPU @ 2.7 GHz, 32 GB of RAM, and a Seagate ATA ST9500423AS 500 GB hard drive, running the Ubuntu 14.04 operating system. We used R version 3.1.1 with activeH5 version 0.0.1, ff version 2.2-13, and rhdf5 version 2.8.0; Julia version 0.3.0 with the DataFrames package version 0.5.7 and the HDF5 package version 0.4.0; and Python version 2.7.6 with pandas version 0.13.1, h5py version 2.2.1, numpy version 1.8.1, and scipy version 0.13.3. The code for the benchmarks is located on GitHub.

The task being benchmarked is the time taken to read/write the airlines data set for year 2007. We read the data from the CSV file into a data frame and measure how quickly we can write the data set to file; we then restart the host program and measure how long it takes to read the file back into a data frame object in R, or its equivalent in the other environments. The packages we have chosen are meant for working with large tables, which allows us to compare their performance with that of the activeH5 package. Note also that the read and write times include any time required to prepare files prior to writing the data.
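
For the native RData row of the table below, the timing pattern looks roughly like the following; file names here are placeholders, and the exact scripts are in the GitHub repository:

    # Native RData benchmark pattern (sketch; paths are placeholders)
    airlines <- read.csv("2007.csv")                    # airlines data for 2007

    writeTime <- system.time(save(airlines, file = "airlines.RData"))["elapsed"]

    # ... restart R here so the read is not helped by a warm session ...

    readTime <- system.time(load("airlines.RData"))["elapsed"]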

For the R benchmarks specifically, the read and write times posted are those required to read from file into a data frame object and to write a data frame object to file. The ff package provides an intermediate ffdf data-frame-like object through which users access the data, so the write times shown reflect the time taken to convert an in-memory data frame to this object and then write the object to file. In the same way, the read time reflects the time taken to read the object from file and then convert it to a data frame.
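
One way to express that round trip with ff's documented functions is sketched below; the exact calls in our benchmark scripts (on GitHub) may differ:

    # ffdf round trip (sketch): data frame -> ffdf -> file, then back again
    library(ff)

    airlines <- read.csv("2007.csv")

    # Write: convert the in-memory data frame to an ffdf object, then save it
    writeTime <- system.time({
      airlines_ffdf <- as.ffdf(airlines)
      ffsave(airlines_ffdf, file = "airlines_ff")   # writes airlines_ff.RData / airlines_ff.ffData
    })["elapsed"]

    # Read (after restarting R): restore the ffdf object, then convert to a data frame
    readTime <- system.time({
      ffload(file = "airlines_ff")
      airlines2 <- as.data.frame(airlines_ffdf)
    })["elapsed"]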

Benchmarks

Below is a table showing the time taken to read/write the airlines data set for year 2007 using the different packages.

Environment   Package          Read Time (s)   Write Time (s)
R             Native (RData)           3.615          33.860
R             activeH5                 4.287           4.577
R             ff                       5.690           4.746
R             rhdf5                    4.153           9.829
Julia         HDF5                    11.577          11.928
Python        h5py                     2.552           5.536

RData format

The write time for the RData (also called rda) format was long; however, the read time was unexpectedly short - shorter than those of the other packages in R. We temper this by noting that you cannot append to the rda format, so only data that fits into memory can be written. The length of the write time also limits the possible applications when dealing with very large data sets.

activeH5 package

We are aware that there may be question marks over how objectively any company can review its own product, especially when benchmarking it against others, but we will endeavour to be as objective as we can.

Our package gives good performance and reliability for writing and reading large data tables to and from file. In addition, extra rows can be appended to existing data tables; no other open source R package allows this as simply as we do. Data can either be read back in chunks from the HDF5 file or the whole table can be read at once. The other two R packages, ff and rhdf5, come with an option for compressing the files. The activeH5 package does not have a compression option; however, since all the data written is numeric, the files created compress at a high ratio. For instance, the HDF5 file created by activeH5 in these benchmarks is 1.7 GB in size, from a 702.9 MB CSV file; when the HDF5 file is compressed it is 128.3 MB in size.
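
If the file size matters, the HDF5 file can simply be compressed externally after it has been written; for example, from R (the file name here is a placeholder, and this is not part of the activeH5 API):

    # External compression of the HDF5 file produced by the benchmark
    system("gzip -k airlines_activeH5.h5")   # keeps the original and writes airlines_activeH5.h5.gz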

ff package

The ff package was the most frustrating to work with. Even though the functions are documented, the examples presented do not give a clear indication of how to work with data frames. When we got the package working, we found that although the performance was good, the reliability was poor. We found that you could write a data frame to an ffdf object on file, but we had problems reading the data back, especially after restarting R. Most of the time the read-back worked after the first restart following writing, but never after subsequent restarts. We managed to transfer a file that we wrote to another computer (with the same OS and versions of R and ff) and were able to read the data back once, but every subsequent read after restarting R failed.

When it works, the ff package is fast, and the convenience of being able to index the ffdf object directly using R's bracket notation is great. However, we are not sure how useful a package for working with very large data sets can be if the data cannot be read back from file reliably.

rhdf5 package

The rhdf5 package allows a variety of data structures to be written to file, including data frames; however, we found that factor/character columns could not be recovered - we got numbers back instead. The write speed was very slow in comparison with the other methods, but the read-back speed was comparable.
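
The pattern we benchmarked against was roughly the following (the file and dataset names are placeholders; see the GitHub scripts for the exact calls):

    # rhdf5 round trip (sketch)
    library(rhdf5)

    airlines <- read.csv("2007.csv")

    h5createFile("airlines_rhdf5.h5")
    writeTime <- system.time(h5write(airlines, "airlines_rhdf5.h5", "airlines"))["elapsed"]

    # After restarting R:
    readTime <- system.time(airlines2 <- h5read("airlines_rhdf5.h5", "airlines"))["elapsed"]
    H5close()

    # In our runs, the factor/character columns of airlines2 came back as
    # numeric codes rather than their original labels.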

Julia’s HDF5 & DataFrames packages

Julia’s HDF5 package allows users to store data frame (table) objects as jld files, which are HDF5 files with a specific structure and a different file extension. In this benchmark we used the HDF5 and DataFrames packages. The read/write times for this combination were the slowest, and the file size was the largest, at 2.9 GB (273.7 MB compressed) - about twice the size of the activeH5 file. We would comment that this is one of the least mature packages, and we expect, or hope, that future updates will be accompanied by better performance.

Python’s h5py & pandas packages

The performance of Python’s h5py for HDF5 files was the best in terms of read speed; however, its write time is slower than activeH5’s. This package was used in combination with the pandas package, which provides the DataFrame object and allows users to write table objects to file. The h5py Python package has been around for some time now; we have blogged about it before and we like it. It is easy to use and reliable, with very good documentation. However, if your data has missing values in categorical variables, Python may not be the best solution until numpy develops support for NA integers.

The read performance of the h5py package is something we would like to aim for, so that our users can have the advantage of working in an R environment with performance commensurate with this package.