Some time ago, we created a blog about differential expression in drosophilla. Part of the analysis requires reading text files containing count data. This can be done using R’s read.delim
function. In this blog entry we present a function readFileOfCounts
that is an alternative method for reading files in this format. It is written in C++ using the Rcpp package and there is an R wrapper function that gives access to the functionality. It reads files in about a third of the time that the native R function takes.
Note that the function is designed to read the count file data where the first column represents the symbol column and the other columns contain count data. The link to the code in this article is given in our GitHub page.
The code presented in this section is the C++ code that actually reads the contents of the file. The output is a list containing a matrix of the numerical data and the column names and rownames of the table.
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::plugins(cpp11)]]
#include <RcppArmadillo.h>
#include <iostream>
#include <vector>
#include <string>
#include <fstream>
#include <sstream>
using namespace Rcpp;
using namespace std;
// [[Rcpp::export(".readFileOfCounts")]]
SEXP readFileOfCounts(std::string filePath, std::string sep) {
ifstream input(filePath);
string lineStr; // string for the line
string colStr; // string for the columns
stringstream strStream {};
const char * colChar;
int iCol, iRow = 0; // to capture the current column
// The separator
char _sep;
const char * tsep;
tsep = sep.c_str();
if(tsep[0] == '\\'){
_sep = '\t';
}else{
_sep = tsep[0];
}
vector<string> colNames; // column names
// Read the first line into a string vector
getline(input, lineStr);
strStream.str(lineStr);
while(getline(strStream, colStr, _sep))
{
colNames.push_back(colStr);
}
strStream.clear();
vector<string> rowNames; // row names
vector<vector<double>> countData;
vector<double> rowData;
iCol = colNames.size() - 1;
double countValue;
// Now we read the rest of the file
while(getline(input, lineStr))
{
strStream.str(lineStr);
getline(strStream, colStr, _sep); // read the rowname name
rowNames.push_back(colStr);
while(getline(strStream, colStr, _sep))
{
colChar = colStr.c_str();
countValue = atof(colChar);
rowData.push_back(countValue);
}
countData.push_back(rowData);
strStream.clear();
rowData.clear();
++iRow;
}
NumericMatrix rMat(iRow, iCol);
for(int i = 0; i < iRow; ++i){
rowData = countData[i];
for(int j = 0; j < iCol; ++j){
rMat(i, j) = rowData[j];
}
rowData.clear();
}
CharacterVector _rowNames = wrap(rowNames);
CharacterVector _colNames = wrap(colNames);
List output; output["mat"] = rMat; output["rowNames"] = _rowNames;
output["colNames"] = _colNames;
return wrap(output);
}
The R wrapper function contains takes the output list from the C++ function and constructs a data frame from the matrix data and row and column names.
readFileOfCounts <- function(filePath, sep){
cout <- .readFileOfCounts(filePath, sep)
output <- cout[["mat"]]
dimnames(output) <- list(cout[["rowNames"]], cout[["colNames"]][-1])
return(as.data.frame(output))
}
The usage of these functions is given below:
require(Rcpp)
sourceCpp("readFileOfCounts.cpp")
source("readFileOfCounts.R")
mb <- microbenchmark::microbenchmark
benchmarks <- mb(readFileOfCounts("count_table.txt", sep = "\t"), read.delim("count_table.txt", row.names="gene"))
Unit: milliseconds
expr min lq mean median uq max neval
readFileOfCounts("count_table.txt", sep = "\\t") 254.419 284.8199 292.3839 288.3123 293.8638 408.4934 100
read.delim("count_table.txt", row.names = "gene") 798.816 817.1757 838.4695 831.1102 862.1574 895.1991 100