[datatable-help] fread suggestion

stat quant statquant at outlook.com
Mon Mar 11 15:12:23 CET 2013


Filed as #2605
About your ultimate goal... why would you want on-disk tables rather than
RAM (apart from being able to read files larger than RAM)? Wouldn't RAM
always be quicker?
I think data.table::fread is priceless because it is way faster than any
other read function.
I just benchmarked fread reading a csv file against R loading its own
.RData binary format, and shockingly fread is much faster!
I think it is too bad that R doesn't provide a very fast way of loading
objects saved from a previous R session (well, why don't I write it myself
if it is so easy...)
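
The comparison was along these lines (a rough sketch; the data, sizes and
file names here are made up):

    library(data.table)

    # One million rows of hypothetical data, written out both ways
    DT <- data.table(a = rnorm(1e6), b = sample(letters, 1e6, replace = TRUE))
    write.csv(DT, "test.csv", row.names = FALSE)
    save(DT, file = "test.RData")

    system.time(DT1 <- fread("test.csv"))   # parsing text with fread
    system.time(load("test.RData"))         # R's own binary format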




2013/3/11 stat quant <mail.statquant at gmail.com>

> On my way to file it.
>
> About your ultimate goal... why would you want on-disk tables rather than
> RAM (apart from being able to read files larger than RAM)? Wouldn't RAM
> always be quicker?
>
> I think data.table::fread is priceless because it is way faster than any
> other read function.
> I just benchmarked fread reading a csv file against R loading its own
> .RData binary format, and shockingly fread is much faster!
> I think it is too bad that R doesn't provide a very fast way of loading
> objects saved from a previous R session (well, why don't I write it myself
> if it is so easy...)
>
>
>
> 2013/3/11 Matthew Dowle <mdowle at mdowle.plus.com>
>
>>
>> Good idea statquant, please file it then. How about something more
>> general, e.g.
>>
>>     fread(input, chunk.nrows=10000, chunk.filter = <anything acceptable
>> to i of DT[i]>)
>>
>> That <anything> could be grep() or any expression of column names. It
>> wouldn't be efficient to call that for every row one by one, and it
>> couldn't be called on the whole DT either, since the point is that DT is
>> larger than RAM. So some batch size needs to be defined, hence
>> chunk.nrows=10000. That filter would then be called for each chunk, and
>> any rows passing would make it into the final table.
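>>
>> In the meantime something close can be emulated in user code with fread's
>> existing skip and nrows arguments. A rough, untested sketch (fread_filtered
>> is a made-up name and chunk.filter doesn't exist; only skip, nrows and
>> header are real fread arguments):
>>
>>     library(data.table)
>>
>>     # Read a big csv in fixed-size chunks, keeping only the rows that
>>     # pass a filter expression, so the whole file never sits in RAM.
>>     fread_filtered <- function(file, filter, chunk.nrows = 10000L) {
>>         hdr  <- names(fread(file, nrows = 0L))  # column names from header
>>         out  <- list()
>>         skip <- 1L                              # skip the header line
>>         repeat {
>>             chunk <- tryCatch(
>>                 fread(file, skip = skip, nrows = chunk.nrows, header = FALSE),
>>                 error = function(e) NULL)       # ran past the end of file
>>             if (is.null(chunk) || nrow(chunk) == 0L) break
>>             setnames(chunk, hdr)
>>             out[[length(out) + 1L]] <- chunk[eval(filter)]  # the i of DT[i]
>>             if (nrow(chunk) < chunk.nrows) break
>>             skip <- skip + chunk.nrows
>>         }
>>         rbindlist(out)
>>     }
>>
>>     # e.g. keep only rows whose price column exceeds 100:
>>     # DT <- fread_filtered("big.csv", quote(price > 100))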
>>
>> read.ffdf has something like this I believe, and Jens already suggested
>> it when I ran the timings in example(fread) past him. We should probably
>> follow his lead on argument names etc.
>>
>> Perhaps the chunk should be defined in terms of RAM, e.g. chunk="100MB",
>> since that is how it needs to be handled internally, in terms of the number
>> of pages to map. Or maybe both, so that either a row count or an amount of
>> RAM would be acceptable.
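>>
>> A RAM budget could be turned into a row count by sampling, e.g. (purely
>> illustrative; the 1000-row sample and the 100MB budget are arbitrary):
>>
>>     # Estimate in-RAM bytes per row from a small sample, then size chunks
>>     sample1k      <- fread("big.csv", nrows = 1000L)
>>     bytes_per_row <- as.numeric(object.size(sample1k)) / nrow(sample1k)
>>     chunk.nrows   <- floor(100e6 / bytes_per_row)   # ~100MB per chunk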
>>
>> Ultimately (maybe in 5 years!) we're heading towards fread reading into
>> on-disk tables rather than RAM. Filtering in chunks will always be a good
>> option to have even then, though, as you might want to filter what makes
>> it into the on-disk table.
>>
>> Matthew
>>
>>
>>
>> On 11.03.2013 12:53, MICHELE DE MEO wrote:
>>
>> Very interesting request. I would also be interested in this possibility.
>> Cheers
>>
>>
>> 2013/3/11 stat quant <statquant at outlook.com>
>>
>>> Hello list,
>>> We like fread because it is very fast, yet sometimes files are huge and
>>> R cannot handle that much data. Some packages handle this limitation, but
>>> they do not provide a function similar to fread.
>>> Yet sometimes only a subset of a file is really needed, a subset that
>>> could fit into RAM.
>>>
>>> So what about adding a grep option to fread that would allow loading only
>>> the lines that match a regular expression?
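>>>
>>> The effect I have in mind, as an untested sketch (this only works while
>>> the file still fits through readLines; the real thing would filter while
>>> streaming):
>>>
>>>     library(data.table)
>>>
>>>     lines <- readLines("big.csv")        # hypothetical file
>>>     keep  <- grepl("2013-03-11", lines)  # hypothetical pattern
>>>     keep[1] <- TRUE                      # always keep the header line
>>>     DT <- fread(paste(lines[keep], collapse = "\n"))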
>>>
>>> I'll add a request if you think the idea is worth implementing.
>>>
>>> Cheers
>>>
>>>
>>
>>
>>
>> --
>> Michele De Meo, Ph.D
>> Statistical and data mining solutions
>> http://micheledemeo.blogspot.com/
>> skype: demeo.michele
>>
>>
>>
>>
>
>

