[datatable-help] fread suggestion

Matthew Dowle mdowle at mdowle.plus.com
Mon Mar 11 15:51:01 CET 2013


 

Exactly, RAM would always be quicker. But maybe you want to read data
from an on-disk data.table using data.table syntax, rather than from some
other database or flat text file, i.e. the on-disk data.table would not
need to fit in RAM.

Benchmark sounds intriguing. Please share if you can.
save() uses compress=TRUE by default, so maybe the decompression takes
the time, though.
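
A minimal sketch of how such a benchmark could be set up, to separate the
cost of decompression from the cost of load() itself. The test data, its
size and the file names are all made up for illustration, and the timings
will depend on the machine, the data and the data.table version, so none
are quoted here.

```r
library(data.table)

# Hypothetical test data: size and column types are arbitrary choices.
set.seed(1)
n  <- 1e5
DT <- data.table(id = 1:n, x = rnorm(n), g = sample(letters, n, replace = TRUE))

csv_file  <- tempfile(fileext = ".csv")
rda_comp  <- tempfile(fileext = ".RData")
rda_plain <- tempfile(fileext = ".RData")

write.csv(DT, csv_file, row.names = FALSE)    # plain base R writer
save(DT, file = rda_comp)                     # compress = TRUE is the default
save(DT, file = rda_plain, compress = FALSE)  # same object, no compression

# Load into a scratch environment so the original DT is not overwritten.
scratch <- new.env()

timings <- c(
  fread         = system.time(from_csv <- fread(csv_file))[["elapsed"]],
  load_compress = system.time(load(rda_comp,  envir = scratch))[["elapsed"]],
  load_plain    = system.time(load(rda_plain, envir = scratch))[["elapsed"]]
)
print(timings)
```

If load_plain comes out much faster than load_compress, decompression is
the likely culprit. A single run is noisy, so repeating each timing a few
times would give a fairer picture.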


On 11.03.2013 14:12, stat quant wrote: 

> Filed as #2605.
> About your ultimate goal... why would you want on-disk tables rather than
> RAM (apart from being able to read a file larger than RAM)? Wouldn't RAM
> always be quicker?
> I think data.table::fread is priceless because it is way faster than any
> other read function.
> I just benchmarked fread reading a csv file against R loading its own
> .RData binary format, and shockingly fread is much faster!
> I think it is too bad R doesn't provide a very fast way of loading objects
> saved from a previous R session (well, why don't I do it if it is so easy...)
> 
> 2013/3/11 stat quant <mail.statquant at gmail.com [6]>
> 
>> On my way to fill it in.
>> 
>> About your ultimate goal... why would you want on-disk tables rather than
>> RAM (apart from being able to read a file larger than RAM)? Wouldn't RAM
>> always be quicker?
>> 
>> I think data.table::fread is priceless because it is way faster than any
>> other read function.
>> I just benchmarked fread reading a csv file against R loading its own
>> .RData binary format, and shockingly fread is much faster!
>> I think it is too bad R doesn't provide a very fast way of loading objects
>> saved from a previous R session (well, why don't I do it if it is so easy...)
>> 
>> 2013/3/11 Matthew Dowle <mdowle at mdowle.plus.com [5]>
>> 
>>> Good idea statquant, please file it then. How about something more
>>> general e.g.
>>> 
>>> fread(input, chunk.nrows=10000, chunk.filter =)

>>> 
>>> That could be grep() or any expression of column names. It wouldn't be
>>> efficient to call that for every row one by one and similarly it
>>> couldn't be called for the whole DT, since the point is that DT is
>>> greater than RAM. So some batch size needs to be defined, hence
>>> chunk.nrows=10000. That filter would then be called for each chunk and
>>> any rows passing would make it into the final table.
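
The chunk-and-filter idea above can be roughly emulated today with
arguments fread already has. This is only a sketch: fread_chunked is a
hypothetical helper, not part of data.table, and the chunk.nrows and
chunk.filter names are borrowed from the proposal, not from any released
API. It assumes a file with a single header line, no embedded newlines
inside quoted fields, and a data.table version where fread supports skip,
nrows and header.

```r
library(data.table)

# Hypothetical helper: read `file` in chunks of `chunk.nrows` rows and keep
# only the rows for which `chunk.filter(chunk)` returns TRUE.
fread_chunked <- function(file, chunk.nrows = 10000L, chunk.filter) {
  col_names <- names(fread(file, nrows = 0L))  # read the header only
  kept <- list()
  skip <- 1L                                   # lines to skip: the header line
  repeat {
    chunk <- tryCatch(
      fread(file, skip = skip, nrows = chunk.nrows, header = FALSE),
      error = function(e) NULL                 # treat "skipped past the end" as done
    )
    if (is.null(chunk) || nrow(chunk) == 0L) break
    setnames(chunk, col_names)
    kept[[length(kept) + 1L]] <- chunk[chunk.filter(chunk)]  # rows passing the filter
    if (nrow(chunk) < chunk.nrows) break       # short chunk means end of file
    skip <- skip + nrow(chunk)
  }
  rbindlist(kept)
}

# Usage on a small made-up file: keep rows whose `sym` starts with "A".
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id  = 1:25,
                     sym = rep(c("AAA", "BBB", "ABC", "XYZ", "CCC"), 5)),
          tmp, row.names = FALSE)
res <- fread_chunked(tmp, chunk.nrows = 10L,
                     chunk.filter = function(d) grepl("^A", d$sym))
```

Only one chunk plus the kept rows are in memory at a time, which is the
point of the proposal. The sketch is not efficient, though: each call has
to find its starting line again, and column types are guessed per chunk, so
a real implementation inside fread would map the file once and reuse the
detected types.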
>>> 
>>> read.ffdf has something like this I believe, and Jens already suggested
>>> that when I ran the timings in example(fread) past him. We should
>>> probably follow his lead on that in terms of argument names etc.
>>> 
>>> Perhaps chunk should be defined in terms of RAM e.g. chunk=100MB, since
>>> that is how it needs to be internally, in terms of number of pages to
>>> map. Or maybe both would be acceptable, either nrows or MB.
>>> 
>>> Ultimately (maybe in 5 years!) we're heading towards fread reading into
>>> on-disk tables rather than RAM. Filtering in chunks will always be a
>>> good option to have though, even then, as you might want to filter what
>>> makes it to the on-disk table.
>>> 
>>> Matthew 
>>> 
>>> On 11.03.2013 12:53, MICHELE DE MEO wrote:
>>> 
>>>> Very interesting request. I also would be interested in this
>>>> possibility.
>>>> Cheers 
>>>> 
>>>> 2013/3/11 stat quant <statquant at outlook.com [3]>
>>>> 
>>>>> Hello list,
>>>>> We like fread because it is very fast, yet sometimes files are huge and
>>>>> R cannot handle that much data. Some packages handle this limitation
>>>>> but they do not provide a function similar to fread.
>>>>> Yet sometimes only subsets of a file are really needed, subsets that
>>>>> could fit into RAM.

>>>>> 
>>>>> So what about adding a grep option to fread that would allow loading
>>>>> only lines that match a regular expression?
>>>>>

>>>>> I'll add a request if you think the idea is worth implementing.

>>>>> 
>>>>> Cheers 
>>>>> 
>>>>> _______________________________________________
>>>>> datatable-help mailing list
>>>>> datatable-help at lists.r-forge.r-project.org [1]
>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [2]
>>>> 
>>>> -- 
>>>> 
>>>> _*************************************************************_
>>>> _MICHELE DE MEO, PH.D_
>>>> Statistical and data mining solutions
>>>> http://micheledemeo.blogspot.com/ [4]
>>>> skype: demeo.michele




Links:
------
[1] mailto:datatable-help at lists.r-forge.r-project.org
[2] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
[3] mailto:statquant at outlook.com
[4] http://micheledemeo.blogspot.com/
[5] mailto:mdowle at mdowle.plus.com
[6] mailto:mail.statquant at gmail.com