<div dir="ltr">As promised, I did some testing.  The results (described in detail below) are mixed, but suggest that compression is useful for some large data sets, and that if this is a serious issue for someone, they need to do some careful testing before committing to anything (I know, that should be obvious, but...).  Also, my results pretty clearly show that fread() crushes read.csv, regardless of whether the csv file is compressed.  Nice job Matthew!<div>

<br></div><div style>I start with Current Population Survey data from the Bureau of Labor Statistics.</div><div style>The file I used can be accessed here: <a href="ftp://ftp.bls.gov/pub/time.series/ln/ln.data.1.AllData">ftp://ftp.bls.gov/pub/time.series/ln/ln.data.1.AllData</a></div>

<div style><br></div><div style>I converted it to a csv file using StatTransfer 8 (I'm lazy), with no quoting of strings.  I then compressed the csv file using 7-Zip (gzip, Normal).  The resulting</div><div style>files, both with 4,937,221 observations and 5 variables, are:</div>

<div style><div>ln_data_1.csv :    133625 KB</div><div>ln_data_1.csv.gz : 17528 KB</div><div><br></div><div style>Given the file size disparity, this should demonstrate any improvements via compression.  Also, for comparison, I show fread below.  I've made some</div>

<div style>formatting changes, but changed nothing else.</div><div style><br></div><div style><div>for(i in 1:5) {</div><div>  t1 <- system.time(cps1 <- read.csv("ln_data_1.csv"))</div><div>  print(t1)</div>

<div>}</div><div>   user  system elapsed </div><div>  12.32    0.53   12.90  </div><div>  12.51    0.44   13.00 </div><div>  12.39    0.47   12.89 </div><div>  12.36    0.55   12.96 </div><div>  12.43    0.36   12.94 </div>

<div>  </div><div>for(i in 1:5) {</div><div>  t2 <- system.time(cps1 <- read.csv("ln_data_1.csv.gz"))</div><div>  print(t2)</div><div>}</div><div>   user  system elapsed </div><div>  14.04    0.26   14.43 </div>

<div>  14.00    0.27   14.34 </div><div>  14.07    0.31   14.44  </div><div>  13.93    0.28   14.23 </div><div>  14.02    0.32   14.35 </div><div>  </div><div>for(i in 1:5) {</div><div>  t3 <- system.time(cps1 <- fread("ln_data_1.csv"))</div>

<div>  print(t3)</div><div>}</div><div>   user  system elapsed </div><div>   2.89    0.04    2.94  </div><div>   2.92    0.07    2.98 </div><div>   2.88    0.03    2.95  </div><div>   2.87    0.06    2.95 </div><div>   2.91    0.03    2.95 </div>

<div><br></div><div style>While the gzipped version uses less system time, user and elapsed time both increased somewhat (elapsed time went from about 12.9 to about 14.4 seconds).  The fread function from data.table is dramatically</div><div style>faster, at just under 3 seconds elapsed.  While this isn't strictly a fair comparison because fread produces</div>

<div style>a data.table while read.csv produces a data.frame, the bias is against fread,</div><div style>not for it.</div><div style><br></div><div style>Next, I produce a random 2,000,000x10 matrix, write it to csv, and then read it back into memory as a data.frame (or data.table, for fread).  I again use 7-Zip for compression.  The resulting files are:</div>

<div style>test2.csv :      375086 KB</div><div style>test2.csv.gz : 165477 KB</div><div style><br></div><div style><div>> matr <- replicate(10,rnorm(2000000))</div><div>> write.csv(matr,"test2.csv")</div>

<div>> t1 <- system.time(df <- read.csv("test2.csv"))</div><div>> t2 <- system.time(df <- read.csv("test2.csv.gz"))</div><div>> t3 <- system.time(df <- fread("test2.csv"))</div>

<div>      </div><div>> t1</div><div>   user  system elapsed </div><div> 165.32    0.36  166.25 </div><div>> t2</div><div>   user  system elapsed </div><div> 116.24    0.16  117.08 </div><div>> t3</div><div>   user  system elapsed </div>

<div>  17.64    0.06   17.83 </div><div><br></div><div style>The switch to strictly floating point numbers makes a large difference.  Compression is a significant improvement--about 49 seconds or about 30%--although nowhere near enough for read.csv to be comparable to fread.</div>
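<div style><br></div><div style>For anyone who wants to try this without 7-Zip, here is a small-scale sketch of the same test.  It is not what I ran above: the row count is cut to 2,000, the seed and the temporary file paths are my own assumptions, and the gzip file is written through R's gzfile connection rather than an external tool.</div>

```r
# Small-scale sketch of the test above (2,000 rows instead of 2,000,000).
set.seed(1)  # assumption: the original run did not fix a seed
matr <- replicate(10, rnorm(2000))

# Hypothetical temporary paths standing in for test2.csv and test2.csv.gz.
csv_path <- tempfile(fileext = ".csv")
gz_path  <- paste0(csv_path, ".gz")

# Same call as above; note that write.csv also writes a row-name column.
write.csv(matr, csv_path)

# Write identical content through a gzip connection instead of using 7-Zip.
gz_con <- gzfile(gz_path, "w")
write.csv(matr, gz_con)
close(gz_con)

# read.csv detects the gzip compression on its own when given the file name.
t_plain <- system.time(df_plain <- read.csv(csv_path))
t_gz    <- system.time(df_gz    <- read.csv(gz_path))

# Both reads should give the same data frame; only the timings should differ.
same_data  <- identical(df_plain, df_gz)
size_ratio <- file.size(gz_path) / file.size(csv_path)
```

<div style>At this size the timings are too small to mean much; the point is only to check that both files hold the same data before scaling the row count back up.</div>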

</div><div style><br></div><div style>Finally, I produce a 20000x1000 matrix.  The resulting files are:</div><div style>test1.csv :      354854 KB</div><div style>test1.csv.gz : 157975 KB</div><div style><br></div><div style>

<div>> matr <- replicate(1000,rnorm(20000))</div><div>> write.csv(matr,"test1.csv")</div><div>> t1 <- system.time(df <- read.csv("test1.csv"))</div><div>> t2 <- system.time(df <- read.csv("test1.csv.gz"))</div>

<div>> t3 <- system.time(df <- fread("test1.csv"))</div><div>> t1</div><div>   user  system elapsed </div><div> 206.80    1.14  208.60 </div><div>> t2</div><div>   user  system elapsed </div><div>

 123.42    0.27  123.99   </div><div>> t3</div><div>   user  system elapsed </div><div>  17.24    0.09   17.37 </div></div><div style><br></div><div style>Here, compression is an even larger win, improving by about 83 seconds of user time (about 85 seconds elapsed), or roughly 40%.  The fread function is again dramatically faster, and unlike read.csv, fread's performance is similar regardless of the shape of the matrix.</div>
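<div style><br></div><div style>The savings quoted for the two random-matrix tests can be recomputed from the timings above.  This is just the arithmetic; the data frame and column names are made up for illustration, and the numbers are copied from the output shown.</div>

```r
# Timings copied from the read.csv runs above (seconds).
timings <- data.frame(
  test        = c("2,000,000 x 10", "20,000 x 1,000"),
  csv_user    = c(165.32, 206.80),
  gz_user     = c(116.24, 123.42),
  csv_elapsed = c(166.25, 208.60),
  gz_elapsed  = c(117.08, 123.99)
)

# Seconds saved by reading the gzipped file, and the saving as a percentage
# of the uncompressed read time, for both user and elapsed time.
timings$saved_user    <- timings$csv_user - timings$gz_user
timings$pct_user      <- 100 * timings$saved_user / timings$csv_user
timings$saved_elapsed <- timings$csv_elapsed - timings$gz_elapsed
timings$pct_elapsed   <- 100 * timings$saved_elapsed / timings$csv_elapsed

print(timings[, c("test", "saved_user", "pct_user",
                  "saved_elapsed", "pct_elapsed")], digits = 3)
```

<div style>User and elapsed time tell the same story here: roughly 30% saved for the tall matrix and roughly 40% for the wide one.</div>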

<div style><br></div><div style>We could create more detailed tests, varying the number of columns vs rows</div><div style>and their type (strings vs integers vs floats, etc.) to get better details, but the</div><div style>

basic result is that compression can be a noticeable improvement in performance, but a superior read algorithm trumps that.  If it's feasible to</div><div style>combine fread's behavior with gzip, bzip2, or xz compression, it could be a</div>

<div style>big win for some files, but not for all of them.  The advice from</div><div style><a href="http://blog.revolutionanalytics.com/2009/12/r-tip-save-time-and-space-by-compressing-data-files.html">http://blog.revolutionanalytics.com/2009/12/r-tip-save-time-and-space-by-compressing-data-files.html</a> to compress csv files appears to hold, although<br>

</div><div style>it may not save much time if you have a lot of non-float values or few columns.</div><div style><br></div></div></div></div><div class="gmail_extra"><br clear="all"><div>-------<br>Nathaniel Graham<br><a href="mailto:npgraham1@gmail.com" target="_blank">npgraham1@gmail.com</a><br>

<a href="mailto:npgraham1@uky.edu" target="_blank">npgraham1@uky.edu</a></div>

<br><br><div class="gmail_quote">On Wed, Apr 3, 2013 at 4:20 PM, Nathaniel Graham <span dir="ltr"><<a href="mailto:npgraham1@gmail.com" target="_blank">npgraham1@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div dir="ltr">Subjectively, the difference seems substantial, with large loads taking<div>half or a third as long.  Whether I use gzip or not, CPU usage isn't</div><div>especially high, suggesting that I'm either waiting on the hard drive</div>


<div>or that the whole process is memory bound.  I was all set to produce</div><div>some timings for comparison, but I'm working from home today and</div><div>my home machine struggles to accommodate large files---any difference</div>


<div>in load times gets swamped by swapping and general flailing on the</div><div>part of the OS (I've only got 4GB of RAM at home).  Hopefully I'll get</div><div>around to doing some timings on my work machine sometime this</div>


<div>week, since I've got no issues with memory there.</div></div><div class="gmail_extra"><div class="im"><br clear="all"><div>-------<br>Nathaniel Graham<br><a href="mailto:npgraham1@gmail.com" target="_blank">npgraham1@gmail.com</a><br>


<a href="mailto:npgraham1@uky.edu" target="_blank">npgraham1@uky.edu</a></div>

<br><br></div><div><div class="h5"><div class="gmail_quote">On Wed, Apr 3, 2013 at 4:58 AM, Matthew Dowle <span dir="ltr"><<a href="mailto:mdowle@mdowle.plus.com" target="_blank">mdowle@mdowle.plus.com</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<u></u>

<div>

<p> </p>

<p>Interesting.  How much do you find read.csv is sped up by reading gzip'd files?</p><div><div>

<p> </p>

<p>On 02.04.2013 20:36, Nathaniel Graham wrote:</p>

<blockquote type="cite" style="padding-left:5px;border-left:#1010ff 2px solid;margin-left:5px;width:100%">

<div dir="ltr">Thanks, but I suspect that it would take longer to set up and then remove

<div>a ramdisk than it would to use read.csv and data.table.  My files are</div>

<div>moderately large (between 200 MB and 3 GB when compressed), but not </div>

<div>enormous; I gzip not so much to save space on disk but to speed up reads.</div>

</div>

<div class="gmail_extra"><br clear="all">


<br><br>

<div class="gmail_quote">On Tue, Apr 2, 2013 at 3:12 PM, Matthew Dowle <span><<a href="mailto:mdowle@mdowle.plus.com" target="_blank">mdowle@mdowle.plus.com</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span style="text-decoration:underline"></span>

<div>

<p> </p>

<p>Hi,</p>

<p>fread memory maps the entire uncompressed file, and this is baked into the way it works (e.g. skipping to the beginning, middle and last 5 rows to detect column types before starting to read the rows in); it is also where the convenience and speed come from.</p>


<p>You could uncompress the .gz to a ramdisk first and then fread the uncompressed file from that ramdisk; that is probably the fastest way.  It should still be pretty quick, and I guess it is unlikely to be much slower than anything we could build into fread (provided you use a ramdisk).</p>
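<p>A minimal sketch of that decompress-then-read idea, using only base R connections.  The helper names gunzip_to and read_gz_via_temp are hypothetical, not part of data.table.  The reader defaults to read.csv so the sketch runs without extra packages; passing reader = data.table::fread is the intended use.  Pointing tmp_dir at a ramdisk mount (for example /dev/shm on many Linux systems) is an assumption about your machine.</p>

```r
# Hypothetical helper: stream-decompress a .gz file to out_path in chunks.
gunzip_to <- function(gz_path, out_path, chunk_bytes = 1048576L) {
  src <- gzfile(gz_path, "rb")          # reading decompresses on the fly
  on.exit(close(src), add = TRUE)
  dst <- file(out_path, "wb")
  on.exit(close(dst), add = TRUE)
  repeat {
    chunk <- readBin(src, what = raw(), n = chunk_bytes)
    if (length(chunk) == 0L) break      # end of the compressed stream
    writeBin(chunk, dst)
  }
  invisible(out_path)
}

# Hypothetical helper: decompress into tmp_dir, read the uncompressed copy
# with `reader`, then delete the temporary file.
read_gz_via_temp <- function(gz_path, reader = read.csv, tmp_dir = tempdir()) {
  tmp <- tempfile(tmpdir = tmp_dir, fileext = ".csv")
  on.exit(unlink(tmp), add = TRUE)
  gunzip_to(gz_path, tmp)
  reader(tmp)
}

# Tiny demonstration with a made-up gzipped csv.
demo_gz <- tempfile(fileext = ".csv.gz")
con <- gzfile(demo_gz, "w")
write.csv(data.frame(x = 1:3, y = c(0.5, 1.5, 2.5)), con, row.names = FALSE)
close(con)

demo <- read_gz_via_temp(demo_gz)
# With data.table installed: read_gz_via_temp(demo_gz, reader = data.table::fread)
```

<p>Whether this beats read.csv on the .gz directly depends on how long the decompression step takes relative to the read, so it is worth timing on your own files.</p>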


<p>Matthew</p>

<div>

<div>

<p> </p>

<p>On 02.04.2013 19:30, Nathaniel Graham wrote:</p>

<blockquote style="padding-left:5px;border-left:#1010ff 2px solid;margin-left:5px;width:100%">

<div dir="ltr">I have a moderately large csv file that's gzipped, but not in a tar

<div>archive, so it's "filename.csv.gz" that I want to read into a data.table.</div>

<div>I'd like to use fread(), but I can't seem to make it work.  I'm currently</div>

<div>using the following:</div>

<div>data.table(read.csv(gzfile("filename.csv.gz","r")))</div>

<div>Various combinations of gzfile, gzcon, file, readLines, and</div>

<div>textConnection all produce an error (invalid input).  Is there a better</div>

<div>way to read in large, compressed files?</div>

<div>

<div>


</div>

</div>

</div>

</blockquote>

<p> </p>

<div> </div>

</div>

</div>

</div>

</blockquote>

</div>

</div>

</blockquote>

<p> </p>

<div> </div>

</div></div></div>

</blockquote></div><br></div></div></div>

</blockquote></div><br></div>