Speed read counts from RNA-Seq

Active Analytics Ltd: posted 7 December 2014 14:21 PST 2014 by Chibisi Chima-Okereke

Introduction

Some time ago, we created a blog about differential expression in drosophilla. Part of the analysis requires reading text files containing count data. This can be done using R's read.delim function. In this blog entry we present a function readFileOfCounts that is an alternative method for reading files in this format. It is written in C++ using the Rcpp package and there is an R wrapper function that gives access to the functionality. It reads files in about a third of the time that the native R function takes.

Note that the function is designed to read the count file data where the first column represents the symbol column and the other columns contain count data. The link to the code in this article is given in our GitHub page.

The C++ Code

The code presented in this section is the C++ code that actually reads the contents of the file. The output is a list containing a matrix of the numerical data and the column names and rownames of the table.

// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::plugins(cpp11)]]
#include <RcppArmadillo.h>
#include <iostream>
#include <vector>
#include <string>
#include <fstream>
#include <sstream>

using namespace Rcpp;
using namespace std;

// [[Rcpp::export(".readFileOfCounts")]]
SEXP readFileOfCounts(std::string filePath, std::string sep) {
    
    ifstream input(filePath);
    string lineStr; // string for the line
    string colStr; // string for the columns
    stringstream strStream {};
    const char * colChar;
    
    int iCol, iRow = 0; // to capture the current column
    // The separator
    char _sep;
    const char * tsep;
    tsep = sep.c_str();
    if(tsep[0] == '\\'){
      _sep = '\t';
      }else{
      _sep = tsep[0];
    }
    
    vector colNames; // column names
    
    // Read the first line into a string vector
    getline(input, lineStr);
    strStream.str(lineStr);
    while(getline(strStream, colStr, _sep))
    {
      colNames.push_back(colStr);
    }
    strStream.clear();
    vector rowNames; // row names
    vector> countData;
    vector rowData;
    iCol = colNames.size() - 1;
    double countValue;
    // Now we read the rest of the file
    while(getline(input, lineStr))
    {
      strStream.str(lineStr);
      getline(strStream, colStr, _sep); // read the rowname name
      rowNames.push_back(colStr);
      while(getline(strStream, colStr, _sep))
      {
        colChar = colStr.c_str();
        countValue = atof(colChar);
  	    rowData.push_back(countValue);
      }
      countData.push_back(rowData);
      strStream.clear();
      rowData.clear();
      ++iRow;
    }
    
    NumericMatrix rMat(iRow, iCol);
    for(int i = 0; i < iRow; ++i){
      rowData = countData[i];
      for(int j = 0; j < iCol; ++j){
        rMat(i, j) = rowData[j];
      }
      rowData.clear();
    }
    
    CharacterVector _rowNames = wrap(rowNames);
    CharacterVector _colNames = wrap(colNames);
    
    List output; output["mat"] = rMat; output["rowNames"] = _rowNames;
    output["colNames"] = _colNames;
    return wrap(output);
}

The R wrapper function

The R wrapper function contains takes the output list from the C++ function and constructs a data frame from the matrix data and row and column names.

readFileOfCounts <- function(filePath, sep){
  cout <- .readFileOfCounts(filePath, sep)
  output <- cout[["mat"]]
  dimnames(output) <- list(cout[["rowNames"]], cout[["colNames"]][-1])
  return(as.data.frame(output))
}

Usage

The usage of these functions is given below:

require(Rcpp)
sourceCpp("readFileOfCounts.cpp")
source("readFileOfCounts.R")

mb <- microbenchmark::microbenchmark
benchmarks <- mb(readFileOfCounts("count_table.txt", sep = "\t"), read.delim("count_table.txt", row.names="gene"))
Unit: milliseconds
                                              expr     min       lq     mean   median       uq      max neval
  readFileOfCounts("count_table.txt", sep = "\\t") 254.419 284.8199 292.3839 288.3123 293.8638 408.4934   100
 read.delim("count_table.txt", row.names = "gene") 798.816 817.1757 838.4695 831.1102 862.1574 895.1991   100

Data Science Consulting & Software Training

Active Analytics Ltd. is a data science consultancy, and Open Source Statistical Software Training company. Please contact us for more details or to comment on the blog.

Dr. Chibisi Chima-Okereke, R Training, Statistics and Data Analysis.