[datatable-help] fread on gzipped files

Nathaniel Graham npgraham1 at gmail.com
Tue Apr 2 21:36:07 CEST 2013


Thanks, but I suspect it would take longer to set up and then remove a
ramdisk than it would to use read.csv and data.table.  My files are
moderately large (between 200 MB and 3 GB when compressed), but not
enormous; I gzip them not so much to save disk space as to speed up reads.
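
For comparison, a rough sketch of the two approaches (untested; /dev/shm
assumes a Linux tmpfs mount and the file names are placeholders):

    library(data.table)

    ## 1. Decompress to a ramdisk (or any fast temp dir), then fread the plain csv
    tmp <- file.path("/dev/shm", "filename.csv")   # or tempfile(fileext = ".csv")
    system(paste("gunzip -c filename.csv.gz >", shQuote(tmp)))
    DT <- fread(tmp)
    unlink(tmp)

    ## 2. The read.csv route, converted to a data.table afterwards
    DT2 <- as.data.table(read.csv(gzfile("filename.csv.gz")))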

-------
Nathaniel Graham
npgraham1 at gmail.com
npgraham1 at uky.edu


On Tue, Apr 2, 2013 at 3:12 PM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:

>
> Hi,
>
> fread memory-maps the entire uncompressed file, and this is baked into the
> way it works (e.g. skipping to the beginning, middle, and last 5 rows to
> detect column types before starting to read the rows in); that's where the
> convenience and speed come from.
>
> Uncompressing the .gz to a ramdisk first and then fread-ing the
> uncompressed file from that ramdisk is probably the fastest way.  That
> should still be pretty quick, and I'd guess it's unlikely to be much
> slower than anything we could build into fread (provided you use a ramdisk).
>
> Matthew
>
>
>
> On 02.04.2013 19:30, Nathaniel Graham wrote:
>
> I have a moderately large csv file that's gzipped, but not in a tar
> archive, so it's "filename.csv.gz" that I want to read into a data.table.
> I'd like to use fread(), but I can't seem to make it work.  I'm currently
> using the following:
> data.table(read.csv(gzfile("filename.csv.gz","r")))
> Various combinations of gzfile, gzcon, file, readLines, and
> textConnection all produce an error (invalid input).  Is there a better
> way to read in large, compressed files?
>  -------
> Nathaniel Graham
> npgraham1 at gmail.com
> npgraham1 at uky.edu
>
>
>
>

