[datatable-help] fread on gzipped files

Matthew Dowle mdowle at mdowle.plus.com
Wed Apr 3 10:58:24 CEST 2013


 

Interesting. How much do you find read.csv is sped up by reading
gzip'd files? 
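
A rough way to measure that, as a sketch only: the data below is made up, the sizes are far smaller than yours, and a file that has just been written will probably still be in the OS cache, so the numbers say little about a disk-bound read. Run the two timing lines on your real files for a meaningful answer.

```r
# Generate a throwaway CSV, write it both plain and gzipped, and time
# read.csv on each. All names and sizes here are illustrative.
set.seed(1)
n  <- 1e5
df <- data.frame(id = seq_len(n),
                 x  = rnorm(n),
                 g  = sample(letters, n, replace = TRUE))

plain_path <- tempfile(fileext = ".csv")
gz_path    <- tempfile(fileext = ".csv.gz")

write.csv(df, plain_path, row.names = FALSE)
con <- gzfile(gz_path, open = "wt")
write.csv(df, con, row.names = FALSE)
close(con)

# Passing an unopened connection lets read.csv open and close it itself.
t_plain <- system.time(a <- read.csv(plain_path))[["elapsed"]]
t_gz    <- system.time(b <- read.csv(gzfile(gz_path)))[["elapsed"]]

cat("plain:", t_plain, "s   gz:", t_gz, "s\n")
cat("size on disk, plain vs gz (bytes):",
    file.size(plain_path), file.size(gz_path), "\n")
```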

On 02.04.2013 20:36, Nathaniel Graham wrote: 

> Thanks, but I suspect that it would take longer to set up and then
> remove a ramdisk than it would to use read.csv and data.table. My files
> are moderately large (between 200 MB and 3 GB when compressed), but not
> enormous; I gzip not so much to save space on disk but to speed up
> reads.
> 
> -------
> Nathaniel Graham
> npgraham1 at gmail.com [3]
> npgraham1 at uky.edu [4]
> 
> On Tue, Apr 2, 2013 at 3:12 PM, Matthew Dowle <mdowle at mdowle.plus.com [5]> wrote:
> 
>> Hi, 
>> 
>> fread memory maps the entire uncompressed file. This is baked into the
>> way it works (e.g. skipping to the beginning, middle and last 5 rows to
>> detect column types before starting to read the rows in) and is where
>> the convenience and speed come from.
>>
>> You could uncompress the .gz to a ramdisk first and then fread the
>> uncompressed file from that ramdisk; that is probably the fastest way.
>> It should still be pretty quick, and I guess unlikely to be much slower
>> than anything we could build into fread (provided you use a ramdisk).
>> 
>> Matthew 
>> 
>> On 02.04.2013 19:30, Nathaniel Graham wrote:
>> 
>>> I have a moderately large csv file that's gzipped, but not in a tar
>>> archive, so it's "filename.csv.gz" that I want to read into a data.table.
>>> I'd like to use fread(), but I can't seem to make it work. I'm currently
>>> using the following:
>>>
>>>     data.table(read.csv(gzfile("filename.csv.gz","r")))
>>>
>>> Various combinations of gzfile, gzcon, file, readLines, and
>>> textConnection all produce an error (invalid input). Is there a better
>>> way to read in large, compressed files?
>>> 
>>> -------
>>> Nathaniel Graham
>>> npgraham1 at gmail.com [1]
>>> npgraham1 at uky.edu [2]
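
The decompress-then-fread route from the quoted message can be sketched roughly as below. This is an illustration under assumptions, not something built into fread: `read_gz_csv`, `scratch_dir` and `chunk_bytes` are hypothetical names, and `scratch_dir` would need to point at a ramdisk mount on your system to get the benefit discussed above (it defaults to the ordinary temp directory here). If data.table is not installed, the sketch falls back to read.csv so that it still runs.

```r
# Sketch: decompress a .gz file to a scratch location, then read the
# plain file. Hypothetical helper, not part of data.table.
read_gz_csv <- function(gz_path, scratch_dir = tempdir(),
                        chunk_bytes = 16 * 1024^2) {
  out_path <- tempfile(tmpdir = scratch_dir, fileext = ".csv")
  on.exit(unlink(out_path), add = TRUE)  # remove the scratch copy afterwards

  src <- gzfile(gz_path, open = "rb")    # yields decompressed bytes
  dst <- file(out_path, open = "wb")
  # Stream in chunks so the whole file is never held in memory as raw bytes.
  repeat {
    buf <- readBin(src, what = "raw", n = chunk_bytes)
    if (length(buf) == 0L) break
    writeBin(buf, dst)
  }
  close(src)
  close(dst)

  if (requireNamespace("data.table", quietly = TRUE)) {
    data.table::fread(out_path)
  } else {
    read.csv(out_path)  # fallback when data.table is unavailable
  }
}

# Small demo on a generated file.
csv_gz <- tempfile(fileext = ".csv.gz")
con <- gzfile(csv_gz, open = "wt")
write.csv(data.frame(id = 1:5, x = c(0.5, 1.5, 2.5, 3.5, 4.5)),
          con, row.names = FALSE)
close(con)

dt <- read_gz_csv(csv_gz)
print(dt)
```

Whether this beats read.csv on a gzfile connection will depend on how long the decompression step and the scratch write take for files of your size, so it is worth timing both.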

 

Links:
------
[1] mailto:npgraham1 at gmail.com
[2] mailto:npgraham1 at uky.edu
[3] mailto:npgraham1 at gmail.com
[4] mailto:npgraham1 at uky.edu
[5] mailto:mdowle at mdowle.plus.com