[Rcpp-devel] Read csv and export object in R

ogami musashi uragami at hotmail.com
Tue Apr 21 11:01:17 CEST 2015


Hello Dirk,

Got it sorted. The basic problem was that the output matrix's dimensions 
have to be defined precisely.

I had some problems with the first line (column names) and the first column 
(row names), but it works now.

Benchmarks against fread show that the code I use returns a lighter object 
(a simple matrix) and thus runs faster.

Reading 400 files of 16.2 MB each on 6 cores took 177.949 seconds with the 
C++ function and 228.231 seconds with fread.

Needless to say, both are considerably faster than read.table (which took 21669 
seconds!) and read_csv from the readr package (which took about the same time 
as read.table).


I know it would be better to contribute this to the Rcpp Gallery, but for now I 
only have time to post the code here:

#include <Rcpp.h>
#include <fstream>
#include <sstream>
#include <string>
using namespace Rcpp;


// The function takes a path to a numeric file and returns the same data
// in a NumericMatrix object

// [[Rcpp::export]]
NumericMatrix readfilecpp(std::string path)
{

NumericMatrix output(20,46749); // output matrix (specifying the size is
                                // critical, otherwise R crashes)

std::ifstream myfile(path.c_str()); // Opens the file. Before C++11, c_str() is
                                    // needed so that ifstream accepts the path

std::string line;
std::getline(myfile,line,'\n'); // skip the first line (col names in our
                                // case). Remove this call if not necessary


for (int row=0; row<20; ++row) // basic idea: getline() reads lines row=0:19 and,
                               // for each line, puts the values separated by ','
                               // into 46749 columns
{
     std::string line;
     if (!std::getline(myfile,line,'\n')) // starts at the second line because the
         break;                           // first one was skipped above; stops when
                                          // there are no more rows (a last line
                                          // without a trailing newline is still read)

     std::stringstream iss(line); // wrap the line in a stringstream
     std::string val;
     std::getline(iss,val,','); // skip the first column (row names)

     for (int col=0; col<46749; ++col)
     {
         std::getline(iss,val,','); // read the next of the 46749 values
                                    // (delimited by ',' in the stringstream)

         std::stringstream convertor(val); // put the value into another
                                           // stringstream 'convertor'
         convertor >> output(row,col); // parse it into the output matrix
                                       // at the current row and col
     }
}
return(output);
}



On 20/04/15 13:16, Dirk Eddelbuettel wrote:
> On 20 April 2015 at 12:01, ogami musashi wrote:
> | Problem is..i have 400 object of 16,5 Mb each. and it take about 6 hours
> | to reimport in R! I use the readr package as this is the fastest base
> | function in R.
>
> a) readr != base R
>
> b) fread in package data.table is considered the fastest reader function
>
> | I adapted a C++ code to use Rcpp, it compiles but when using it it
> | crashes R:
>
> I fear you may have to debug that yourself.  As for speed, you won't be able
> to beat fread which has been optimised for this for years and uses mmap and
> other tricks.
>
> Dirk
>


