[datatable-help] fread on gzipped files

Nathaniel Graham npgraham1 at gmail.com
Fri Apr 5 20:59:47 CEST 2013


As promised, I did some testing.  The results (described in detail below)
are mixed, but suggest that compression is useful for some large data sets,
and that if this is a serious issue for someone, they need to do some
careful testing before committing to anything (I know, that should be
obvious, but...).  Also, my results pretty clearly show that fread()
crushes read.csv, regardless of whether the csv file is compressed.  Nice
job Matthew!

I start with Current Population Survey data from the Bureau of Labor
Statistics.  The file I used can be accessed here:
ftp://ftp.bls.gov/pub/time.series/ln/ln.data.1.AllData

I converted it to a csv file using StatTransfer 8 (I'm lazy), with no
quoting of strings.  I then compressed the csv file using 7-Zip (gzip,
Normal).  The resulting files, each with 4937221 obs of 5 variables, are:
ln_data_1.csv :    133625 KB
ln_data_1.csv.gz :  17528 KB

Given the file size disparity, this should reveal any improvement from
compression.  For comparison, I also show fread() timings below.  I've made
some formatting changes to the output, but changed nothing else.

for(i in 1:5) {
  t1 <- system.time(cps1 <- read.csv("ln_data_1.csv"))
  print(t1)
}
   user  system elapsed
  12.32    0.53   12.90
  12.51    0.44   13.00
  12.39    0.47   12.89
  12.36    0.55   12.96
  12.43    0.36   12.94

for(i in 1:5) {
  t2 <- system.time(cps1 <- read.csv("ln_data_1.csv.gz"))
  print(t2)
}
   user  system elapsed
  14.04    0.26   14.43
  14.00    0.27   14.34
  14.07    0.31   14.44
  13.93    0.28   14.23
  14.02    0.32   14.35

library(data.table)
for(i in 1:5) {
  t3 <- system.time(cps1 <- fread("ln_data_1.csv"))
  print(t3)
}
   user  system elapsed
   2.89    0.04    2.94
   2.92    0.07    2.98
   2.88    0.03    2.95
   2.87    0.06    2.95
   2.91    0.03    2.95

While the gzipped version uses less system time, total and user time have
increased somewhat.  The fread function from data.table is dramatically
faster.  While this isn't strictly a fair comparison, because fread produces
a data.table while read.csv produces a data.frame, the bias is against
fread, not for it.
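
If someone wants the comparison strictly apples-to-apples, the conversion
can be folded into the timed expression.  A minimal sketch (I didn't run
these; the timings above stand on their own):

# read.csv plus the conversion to the data.table that fread returns anyway
t1b <- system.time(dt1 <- data.table(read.csv("ln_data_1.csv")))
# fread plus a conversion back to the data.frame that read.csv returns
t3b <- system.time(df1 <- as.data.frame(fread("ln_data_1.csv")))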

Next, I produce a random 2,000,000x10 matrix, write it to csv, and then
read it back into memory as a data.frame (or data.table, for fread).  I
again use 7-Zip for compression.  The resulting files are:
test2.csv :    375086 KB
test2.csv.gz : 165477 KB

> matr <- replicate(10,rnorm(2000000))
> write.csv(matr,"test2.csv")
> t1 <- system.time(df <- read.csv("test2.csv"))
> t2 <- system.time(df <- read.csv("test2.csv.gz"))
> t3 <- system.time(df <- fread("test2.csv"))

> t1
   user  system elapsed
 165.32    0.36  166.25
> t2
   user  system elapsed
 116.24    0.16  117.08
> t3
   user  system elapsed
  17.64    0.06   17.83

The switch to strictly floating-point numbers is significant.  Compression
is a significant improvement--about 49 seconds, or roughly 30%--although
nowhere near enough for read.csv to be comparable to fread.

Finally, I produce a 20000x1000 matrix.  The resulting files are:
test1.csv :      354854 KB
test1.csv.gz : 157975 KB

> matr <- replicate(1000,rnorm(20000))
> write.csv(matr,"test1.csv")
> t1 <- system.time(df <- read.csv("test1.csv"))
> t2 <- system.time(df <- read.csv("test1.csv.gz"))
> t3 <- system.time(df <- fread("test1.csv"))
> t1
   user  system elapsed
 206.80    1.14  208.60
> t2
   user  system elapsed
 123.42    0.27  123.99
> t3
   user  system elapsed
  17.24    0.09   17.37

Here, compression is an even larger win, improving by about 83 seconds or
roughly 40%.  The fread function is again dramatically faster, and unlike
read.csv, fread's performance is similar regardless of the shape of the
matrix.

We could create more detailed tests, varying the number of columns versus
rows and their types (strings vs integers vs floats, etc.) to get finer
detail, but the basic result is that compression can be a noticeable
performance improvement, and that a superior read algorithm trumps it.  If
it's feasible to combine fread's behavior with gzip, bzip2, or xz
compression, it could be a big win for some files, but not for all of them.
The advice from
http://blog.revolutionanalytics.com/2009/12/r-tip-save-time-and-space-by-compressing-data-files.html
to compress csv files appears to hold, although it may not save much time
if you have a lot of non-float values or few columns.
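
In the meantime, a workaround along the lines Matthew suggests below is to
decompress to a temporary file first and fread that.  A minimal sketch
(fread.gz is a hypothetical helper name, not part of data.table):

library(data.table)

fread.gz <- function(path, ...) {
  # Decompress the .gz to a temporary file, fread it, then clean up.
  tmp <- tempfile(fileext = ".csv")
  on.exit(unlink(tmp))
  con.in  <- gzfile(path, "rb")   # the connection yields decompressed bytes
  con.out <- file(tmp, "wb")
  while (length(chunk <- readBin(con.in, "raw", n = 1024^2)) > 0)
    writeBin(chunk, con.out)
  close(con.in)
  close(con.out)
  fread(tmp, ...)
}

# e.g.: dt <- fread.gz("ln_data_1.csv.gz")

Whether this beats read.csv(gzfile(...)) will depend on how fast the disk
(or ramdisk) holding tempdir() is.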


-------
Nathaniel Graham
npgraham1 at gmail.com
npgraham1 at uky.edu


On Wed, Apr 3, 2013 at 4:20 PM, Nathaniel Graham <npgraham1 at gmail.com> wrote:

> Subjectively, the difference seems substantial, with large loads taking
> half or a third as long.  Whether I use gzip or not, CPU usage isn't
> especially high, suggesting that I'm either waiting on the hard drive
> or that the whole process is memory bound.  I was all set to produce
> some timings for comparison, but I'm working from home today and
> my home machine struggles to accommodate large files---any difference
> in load times gets swamped by swapping and general flailing on the
> part of the OS (I've only got 4GB of RAM at home).  Hopefully I'll get
> around to doing some timings on my work machine sometime this
> week, since I've got no issues with memory there.
>
> -------
> Nathaniel Graham
> npgraham1 at gmail.com
> npgraham1 at uky.edu
>
>
> On Wed, Apr 3, 2013 at 4:58 AM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>
>>
>> Interesting.  How much do you find read.csv is sped up by reading gzip'd
>> files?
>>
>>
>>
>> On 02.04.2013 20:36, Nathaniel Graham wrote:
>>
>> Thanks, but I suspect that it would take longer to set up and then remove
>> a ramdisk than it would to use read.csv and data.table.  My files are
>> moderately large (between 200 MB and 3 GB when compressed), but not
>> enormous; I gzip not so much to save space on disk but to speed up reads.
>>
>> -------
>> Nathaniel Graham
>> npgraham1 at gmail.com
>> npgraham1 at uky.edu
>>
>>
>> On Tue, Apr 2, 2013 at 3:12 PM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>>
>>>
>>>
>>> Hi,
>>>
>>> fread memory maps the entire uncompressed file, and this is baked into
>>> the way it works (e.g. skipping to the beginning, middle, and last 5 rows
>>> to detect column types before starting to read the rows in); it's where
>>> the convenience and speed come from.
>>>
>>> Uncompressing the .gz to a ramdisk first, and then freading the
>>> uncompressed file from that ramdisk, is probably the fastest way.  That
>>> should still be pretty quick, and I guess unlikely to be much slower than
>>> anything we could build into fread (provided you use a ramdisk).
>>>
>>> Matthew
>>>
>>>
>>>
>>> On 02.04.2013 19:30, Nathaniel Graham wrote:
>>>
>>> I have a moderately large csv file that's gzipped, but not in a tar
>>> archive, so it's "filename.csv.gz" that I want to read into a data.table.
>>> I'd like to use fread(), but I can't seem to make it work.  I'm currently
>>> using the following:
>>> data.table(read.csv(gzfile("filename.csv.gz","r")))
>>> Various combinations of gzfile, gzcon, file, readLines, and
>>> textConnection all produce an error (invalid input).  Is there a better
>>> way to read in large, compressed files?
>>> -------
>>> Nathaniel Graham
>>> npgraham1 at gmail.com
>>> npgraham1 at uky.edu
>>>
>>>
>>>
>>>
>>
>>
>>
>
>

