[datatable-help] fread suggestion

Matthew Dowle mdowle at mdowle.plus.com
Mon Mar 11 14:09:29 CET 2013


 

Good idea statquant, please file it then. How about something more
general, e.g.

 fread(input, chunk.nrows=10000, chunk.filter = <anything acceptable to i of DT[i]>)

That <anything> could be grep() or any expression of column names. It
wouldn't be efficient to call it for every row one by one, and it
couldn't be called on the whole DT either, since the point is that DT is
larger than RAM. So some batch size needs to be defined, hence
chunk.nrows=10000. The filter would then be called on each chunk, and
any rows passing would make it into the final table.
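
To make the chunk-then-filter idea concrete, here is a rough sketch in
base R. It is not how fread works or would implement this: chunk.nrows
and chunk.filter are only the argument names proposed in this thread,
and read_filtered() is a hypothetical helper. It assumes a simple CSV
with a header line and no quoted fields containing newlines.

```r
# Sketch only: chunk.nrows and chunk.filter are proposed names from this
# thread, not existing fread arguments. read_filtered() is hypothetical.
read_filtered <- function(path, chunk.nrows = 10000L,
                          chunk.filter = function(d) rep(TRUE, nrow(d))) {
  con <- file(path, open = "r")
  on.exit(close(con))
  header <- readLines(con, n = 1L)        # header is reused to parse each chunk
  kept <- list()
  repeat {
    lines <- readLines(con, n = chunk.nrows)
    if (length(lines) == 0L) break         # end of file
    chunk <- read.csv(text = c(header, lines), stringsAsFactors = FALSE)
    keep <- chunk.filter(chunk)            # logical vector, one value per row
    kept[[length(kept) + 1L]] <- chunk[keep, , drop = FALSE]
  }
  if (length(kept) == 0L) return(NULL)     # file had no data rows
  out <- do.call(rbind, kept)              # only the passing rows are ever combined
  rownames(out) <- NULL
  out
}

# Usage: 25 rows read in chunks of 10, keeping only rows where grp is "a".
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:25, grp = rep(c("a", "b", "c", "d", "e"), 5)),
          tmp, row.names = FALSE)
res <- read_filtered(tmp, chunk.nrows = 10L,
                     chunk.filter = function(d) grepl("^a$", d$grp))
```

Only one chunk plus the rows already kept are held in memory at any
time, which is the point of defining a batch size.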


read.ffdf has something like this I believe, and Jens already
suggested that when I ran the timings in example(fread) past him. We
should probably follow his lead on that in terms of argument names etc.


Perhaps chunk should be defined in terms of RAM, e.g. chunk=100MB, since
that is how it needs to be handled internally (in terms of the number of
pages to map). Or maybe both could be accepted: either nrows or MB.
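
One way the two units could be related, as a sketch under assumptions:
sample some lines, take their average size in bytes, and divide the byte
budget by it. chunk_rows_for_budget() is a hypothetical helper, not part
of data.table, and it measures bytes on disk, which is only a rough
proxy for the RAM a parsed chunk would need.

```r
# Hypothetical helper: estimate how many rows fit in a byte budget.
chunk_rows_for_budget <- function(path, budget.bytes = 100 * 1024^2,
                                  sample.lines = 1000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  readLines(con, n = 1L)                          # skip the header line
  smp <- readLines(con, n = sample.lines)         # sample of data lines
  if (length(smp) == 0L) return(0L)               # no data rows to measure
  avg <- mean(nchar(smp, type = "bytes") + 1)     # +1 for the newline
  max(1L, as.integer(floor(budget.bytes / avg)))  # at least one row per chunk
}

# Usage: 50 data lines of 9 bytes each (10 with newline), 1000-byte budget.
tmp2 <- tempfile(fileext = ".csv")
writeLines(c("x", rep("abcdefghi", 50)), tmp2)
n_rows <- chunk_rows_for_budget(tmp2, budget.bytes = 1000)
```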

Ultimately (maybe in 5 years!) we're heading towards fread reading into
on-disk tables rather than RAM. Filtering in chunks will always be a good
option to have though, even then, as you might want to filter what makes
it to the on-disk table.

Matthew 

On 11.03.2013 12:53, MICHELE DE MEO wrote:

> Very interesting request. I also would be interested in this possibility.
> Cheers
>
> 2013/3/11 stat quant <statquant at outlook.com [3]>
> 
>> Hello list,
>> We like fread because it is very fast, yet sometimes files are huge
>> and R cannot handle that much data. Some packages handle this
>> limitation, but they do not provide a function similar to fread.
>> Yet sometimes only a subset of a file is really needed, a subset that
>> could fit into RAM.
>>
>> So what about adding a grep option to fread that would allow loading
>> only the lines that match a regular expression?
>>
>> I'll add a request if you think the idea is worth implementing.
>>
>> Cheers
>>
>> _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org [1]
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [2]
> 
> --
>
> _*************************************************************_
> _MICHELE DE MEO, PH.D_
> Statistical and data mining solutions
> http://micheledemeo.blogspot.com/ [4]
> skype: demeo.michele




Links:
------
[1] mailto:datatable-help at lists.r-forge.r-project.org
[2] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
[3] mailto:statquant at outlook.com
[4] http://micheledemeo.blogspot.com/