[datatable-help] fread on gzipped files

Nathaniel Graham npgraham1 at gmail.com
Wed Apr 3 22:20:55 CEST 2013


Subjectively, the difference seems substantial, with large loads taking
half or a third as long.  Whether I use gzip or not, CPU usage isn't
especially high, which suggests I'm either waiting on the hard drive or
the whole process is memory bound.  I was all set to produce some
timings for comparison, but I'm working from home today and my home
machine struggles to accommodate large files: any difference in load
times gets swamped by swapping and general flailing on the part of the
OS (I've only got 4 GB of RAM at home).  Hopefully I'll get around to
doing some timings on my work machine sometime this week, since I've
got no memory issues there.
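
Roughly, the comparison I have in mind looks like the sketch below (a
minimal sketch only: the filenames are placeholders, and it assumes an
uncompressed copy of the same data sits next to the .gz):

    library(data.table)

    # current approach: read.csv on a gzfile connection, then convert
    system.time(dt1 <- data.table(read.csv(gzfile("filename.csv.gz", "r"))))

    # for comparison: fread on the uncompressed copy of the same file
    system.time(dt2 <- fread("filename.csv"))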

-------
Nathaniel Graham
npgraham1 at gmail.com
npgraham1 at uky.edu


On Wed, Apr 3, 2013 at 4:58 AM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:

>
>
>
>
> Interesting.  How much do you find read.csv is sped up by reading gzip'd
> files?
>
>
>
> On 02.04.2013 20:36, Nathaniel Graham wrote:
>
> Thanks, but I suspect that it would take longer to set up and then remove
> a ramdisk than it would to use read.csv and data.table.  My files are
> moderately large (between 200 MB and 3 GB when compressed), but not
> enormous; I gzip not so much to save space on disk but to speed up reads.
>
> -------
> Nathaniel Graham
> npgraham1 at gmail.com
> npgraham1 at uky.edu
>
>
> On Tue, Apr 2, 2013 at 3:12 PM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>
>>
>>
>> Hi,
>>
>> fread memory-maps the entire uncompressed file; that is baked into the
>> way it works (e.g. skipping to the beginning, middle, and last 5 rows to
>> detect column types before starting to read the rows in) and is where
>> the convenience and speed come from.
>>
>> Uncompressing the .gz to a ramdisk first and then fread-ing the
>> uncompressed file from that ramdisk is probably the fastest way.  That
>> should still be pretty quick, and I'd guess not much slower than anything
>> we could build into fread (provided you use a ramdisk).
>>
>> Matthew
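
A minimal sketch of that uncompress-then-fread approach (assuming a
Linux-style ramdisk such as /dev/shm is available; the paths and
filenames are placeholders):

    library(data.table)

    # uncompress to the ramdisk, then fread the plain csv from there
    tmp <- "/dev/shm/filename.csv"
    system(sprintf("gunzip -c filename.csv.gz > %s", tmp))
    DT <- fread(tmp)
    unlink(tmp)  # remove the ramdisk copy when done
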
>>
>>
>>
>> On 02.04.2013 19:30, Nathaniel Graham wrote:
>>
>> I have a moderately large csv file that's gzipped, but not in a tar
>> archive, so it's "filename.csv.gz" that I want to read into a data.table.
>> I'd like to use fread(), but I can't seem to make it work.  I'm currently
>> using the following:
>> data.table(read.csv(gzfile("filename.csv.gz","r")))
>> Various combinations of gzfile, gzcon, file, readLines, and
>> textConnection all produce an error (invalid input).  Is there a better
>> way to read in large, compressed files?
>>  -------
>> Nathaniel Graham
>> npgraham1 at gmail.com
>> npgraham1 at uky.edu
>>
>>
>>
>>
>
>
>

