[datatable-help] fread on gzipped files

Matthew Dowle mdowle at mdowle.plus.com
Fri Apr 5 21:38:40 CEST 2013

Fantastic, great job here, thanks! 

One thing to note is that read.csv is much faster when using the standard tricks (colClasses, nrows, etc.). That's why the speed comparisons in ?fread are careful to link to online resources that list what the tricks are, and then compare read.csv to fread both with and without them. Of course, the "friendly" part of fread is that you don't need to learn or know any tricks, so from that point of view it may well be fair to compare no-frills read.csv to fread as you've done. Good to state that, so that nobody can accuse us of unfair comparisons. But even with the tricks applied, fread is still much faster. With-tricks on a compressed file would be interesting for completeness.
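
For anyone unfamiliar, the with-tricks version looks roughly like this (a sketch only; the sampling size is arbitrary and I've reused your ln_data_1.csv file and row count for illustration):

  # read a small sample to establish the column types, then pass them
  # explicitly so read.csv doesn't have to guess them while reading
  sample <- read.csv("ln_data_1.csv", nrows = 100)
  classes <- sapply(sample, class)
  DF <- read.csv("ln_data_1.csv", colClasses = classes,
                 nrows = 4937221, comment.char = "")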

Thinking about it, I suppose fread could read .gz directly. Difficult, but possible. For convenience if nothing else. I'll add it to the list to investigate ...
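
In the meantime, a stopgap along these lines should work (an untested sketch, not a data.table API; it streams the .gz to a temporary file that fread can then memory-map):

  library(data.table)
  fread.gz <- function(path, ...) {
    tmp <- tempfile(fileext = ".csv")
    on.exit(unlink(tmp))
    inp <- gzfile(path, "rb")    # gzfile decompresses transparently
    out <- file(tmp, "wb")
    # stream in 1MB chunks so the decompressed file is never held in memory
    while (length(chunk <- readBin(inp, what = raw(), n = 1e6)) > 0)
      writeBin(chunk, out)
    close(inp); close(out)
    fread(tmp, ...)
  }

Then DT <- fread.gz("filename.csv.gz") behaves like fread on the uncompressed file, at the cost of the decompression time and a temporary copy on disk.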

Matthew 

On 05.04.2013 19:59, Nathaniel Graham wrote:

> As promised, I did some testing. The results (described in detail below) are mixed, but suggest that compression is useful for some large data sets, and that if this is a serious issue for someone, they need to do some careful testing before committing to anything (I know, that should be obvious, but...). Also, my results pretty clearly show that fread() crushes read.csv, regardless of whether the csv file is compressed. Nice job Matthew!
> I start with Current Population Survey data from the Bureau of Labor Statistics.
> The file I used can be accessed here: ftp://ftp.bls.gov/pub/time.series/ln/ln.data.1.AllData [9]
> I converted it to a csv file using StatTransfer 8 (I'm lazy), with no quoting of strings. I then compressed the csv file using 7-Zip (gzip, Normal). The resulting files, both with 4937221 obs. of 5 variables, are:
> 
> ln_data_1.csv    : 133625 KB
> ln_data_1.csv.gz :  17528 KB
> Given the file size disparity, this should demonstrate any improvements via compression. Also, for comparison, I show fread below. I've made some formatting changes, but changed nothing else.
> 
> for(i in 1:5) {
>   t1 <- system.time(read.csv("ln_data_1.csv"))            # plain csv (assignment reconstructed)
>   print(t1)
> }
>    user  system elapsed
>   12.32    0.53   12.90
>   12.51    0.44   13.00
>   12.39    0.47   12.89
>   12.36    0.55   12.96
>   12.43    0.36   12.94
> 
> for(i in 1:5) {
>   t2 <- system.time(read.csv(gzfile("ln_data_1.csv.gz"))) # gzipped csv (assignment reconstructed)
>   print(t2)
> }
>    user  system elapsed
>   14.04    0.26   14.43
>   14.00    0.27   14.34
>   14.07    0.31   14.44
>   13.93    0.28   14.23
>   14.02    0.32   14.35
> 
> for(i in 1:5) {
>   t3 <- system.time(fread("ln_data_1.csv"))               # data.table (assignment reconstructed)
>   print(t3)
> }
>    user  system elapsed
>    2.89    0.04    2.94
>    2.92    0.07    2.98
>    2.88    0.03    2.95
>    2.87    0.06    2.95
>    2.91    0.03    2.95
> While the gzipped version uses less system time, total and user time have increased somewhat. The fread function from data.table is dramatically faster. While this isn't strictly a fair comparison because fread produces a data.table while read.csv produces a data.frame, the bias is against fread, not for it.
> Next, I produce a random 2,000,000x10 matrix, write it to csv, and then read it back into memory as a data.frame (or data.table, for fread). I again use 7-Zip for compression. The resulting files are:
> 
> test2.csv    : 375086 KB
> test2.csv.gz : 165477 KB
> 
>> matr <- matrix(rnorm(2e7), ncol = 10)   # random 2,000,000 x 10 matrix (rnorm assumed; original call lost)
>> write.csv(matr, "test2.csv")
>> t1 <- system.time(read.csv("test2.csv"))
>> t2 <- system.time(read.csv(gzfile("test2.csv.gz")))
>> t3 <- system.time(fread("test2.csv"))
> 
>> t1
>    user  system elapsed
>  165.32    0.36  166.25
>> t2
>    user  system elapsed
>  116.24    0.16  117.08
>> t3
>    user  system elapsed
>   17.64    0.06   17.83
> The switch to strictly floating point numbers is significant. Compression is a significant improvement (about 49 seconds, or about 30%), although nowhere near enough for read.csv to be comparable to fread.
> Finally, I produce a 20000x1000 matrix. The resulting files are:
> 
> test1.csv    : 354854 KB
> test1.csv.gz : 157975 KB
> 
>> matr <- matrix(rnorm(2e7), ncol = 1000)   # same size, reshaped to 20,000 x 1,000 (rnorm assumed)
>> write.csv(matr, "test1.csv")
>> t1 <- system.time(read.csv("test1.csv"))
>> t2 <- system.time(read.csv(gzfile("test1.csv.gz")))
>> t3 <- system.time(fread("test1.csv"))
>> t1
>    user  system elapsed
>  206.80    1.14  208.60
>> t2
>    user  system elapsed
>  123.42    0.27  123.99
>> t3
>    user  system elapsed
>   17.24    0.09   17.37
> Here, compression is an even larger win, improving by about 83 seconds or roughly 40%. The fread function is again dramatically faster, and unlike read.csv, fread's performance is similar regardless of the shape of the matrix.
> We could create more detailed tests, varying the number of columns vs rows and their type (strings vs integers vs floats, etc.) to get better details, but the basic result is that compression can be a noticeable performance improvement, while a superior read algorithm trumps it. If it's feasible to combine fread's behavior with gzip, bzip2, or xz compression, it could be a big win for some files, but not for all of them. The advice from http://blog.revolutionanalytics.com/2009/12/r-tip-save-time-and-space-by-compressing-data-files.html [10] to compress csv files appears to hold, although it may not save much time if you have a lot of non-float values or few columns.
> 
> -------
> Nathaniel Graham
> npgraham1 at gmail.com [11]
> npgraham1 at uky.edu [12]
> 
> On Wed, Apr 3, 2013 at 4:20 PM, Nathaniel Graham <npgraham1 at gmail.com [13]> wrote:
> 
>> Subjectively, the difference seems substantial, with large loads taking half or a third as long. Whether I use gzip or not, CPU usage isn't especially high, suggesting that I'm either waiting on the hard drive or that the whole process is memory bound. I was all set to produce some timings for comparison, but I'm working from home today and my home machine struggles to accommodate large files---any difference in load times gets swamped by swapping and general flailing on the part of the OS (I've only got 4GB of RAM at home). Hopefully I'll get around to doing some timings on my work machine sometime this week, since I've got no issues with memory there.
>> 
>> -------
>> Nathaniel Graham
>> npgraham1 at gmail.com [6]
>> npgraham1 at uky.edu [7]
>> 
>> On Wed, Apr 3, 2013 at 4:58 AM, Matthew Dowle <mdowle at mdowle.plus.com [8]> wrote:
>> 
>>> Interesting. How much do you find read.csv is sped up by reading gzip'd files?
>>> 
>>> On 02.04.2013 20:36, Nathaniel Graham wrote:
>>> 
>>>> Thanks, but I suspect that it would take longer to set up and then remove a ramdisk than it would to use read.csv and data.table. My files are moderately large (between 200 MB and 3 GB when compressed), but not enormous; I gzip not so much to save space on disk but to speed up reads.
>>>> 
>>>> -------
>>>> Nathaniel Graham
>>>> npgraham1 at gmail.com [3]
>>>> npgraham1 at uky.edu [4]
>>>> 
>>>> On Tue, Apr 2, 2013 at 3:12 PM, Matthew Dowle <mdowle at mdowle.plus.com [5]> wrote:
>>>> 
>>>>> Hi, 
>>>>> 
>>>>> fread memory-maps the entire uncompressed file, and this is baked into the way it works (e.g. skipping to the beginning, middle, and last 5 rows to detect column types before starting to read the rows in); it's where the convenience and speed come from.
>>>>> 
>>>>> Uncompressing the .gz to a ramdisk first, and then fread-ing the uncompressed file from that ramdisk, is probably the fastest way. That should still be pretty quick, and I'd guess not much slower than anything we could build into fread (provided you use a ramdisk).
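>>>>> 
>>>>> For example (a Linux sketch, untested; the mount point and size are placeholders):
>>>>> 
>>>>> # in a shell, once: create a tmpfs ramdisk
>>>>> #   sudo mkdir -p /mnt/ram
>>>>> #   sudo mount -t tmpfs -o size=4g tmpfs /mnt/ram
>>>>> system("gunzip -c filename.csv.gz > /mnt/ram/filename.csv")
>>>>> DT <- fread("/mnt/ram/filename.csv")
>>>>> unlink("/mnt/ram/filename.csv")   # free the RAM when done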
>>>>> 
>>>>> Matthew 
>>>>> 
>>>>> On 02.04.2013 19:30, Nathaniel Graham wrote:
>>>>> 
>>>>>> I have a moderately large csv file that's gzipped, but not in a tar archive, so it's "filename.csv.gz" that I want to read into a data.table.
>>>>>> I'd like to use fread(), but I can't seem to make it work. I'm currently using the following:
>>>>>> data.table(read.csv(gzfile("filename.csv.gz","r")))
>>>>>> Various combinations of gzfile, gzcon, file, readLines, and textConnection all produce an error (invalid input). Is there a better way to read in large, compressed files?
>>>>>> 
>>>>>> -------
>>>>>> Nathaniel Graham
>>>>>> npgraham1 at gmail.com [1]
>>>>>> npgraham1 at uky.edu [2]

 

Links:
------
[1] mailto:npgraham1 at gmail.com
[2] mailto:npgraham1 at uky.edu
[3] mailto:npgraham1 at gmail.com
[4] mailto:npgraham1 at uky.edu
[5] mailto:mdowle at mdowle.plus.com
[6] mailto:npgraham1 at gmail.com
[7] mailto:npgraham1 at uky.edu
[8] mailto:mdowle at mdowle.plus.com
[9] ftp://ftp.bls.gov/pub/time.series/ln/ln.data.1.AllData
[10] http://blog.revolutionanalytics.com/2009/12/r-tip-save-time-and-space-by-compressing-data-files.html
[11] mailto:npgraham1 at gmail.com
[12] mailto:npgraham1 at uky.edu
[13] mailto:npgraham1 at gmail.com