[datatable-help] fread suggestion

Matthew Dowle mdowle at mdowle.plus.com
Mon Mar 11 16:10:32 CET 2013


 

Also, fread works by first memory mapping the file. The first time it
does this for a particular file is therefore slower (you may have
noticed the longer pause before the percentage counter starts on the
first run). The time to memory map is reported when verbose=TRUE (but
you need the formatting fix in v1.8.9 to see that time, as the
formatted number is messed up in v1.8.8). If you repeat the same fread
call it won't spend as long memory mapping since the file is already
mapped, depending on whether you did anything else memory intensive on
this computer/server in the meantime.
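
For example (a minimal sketch; the file name is just a placeholder and
the timings will vary):

    library(data.table)

    system.time(DT <- fread("big.csv", verbose=TRUE))  # first read: includes
                                                       # the memory map time
    system.time(DT <- fread("big.csv", verbose=TRUE))  # repeat: mapping is much
                                                       # quicker if the OS still
                                                       # has the file cached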

I don't know if base R's load() memory maps, but if it doesn't it'll
need to read from disk each time. So to be strictly fair, the time to
compare is a "cold" read after a reboot, and the first run only of
fread. But in practice we often do tend to read the same file several
times, so fread benefits from this. The OS caches the file in RAM for
you, basically. It might do this anyway. It's all very OS and usage
dependent! It may also depend on how your particular R environment has
been compiled.

I don't think a fresh R session is enough to reproduce this effect. You
need a reboot, as it's the OS that caches/maps the file, not
R/data.table.

So in short - to report the very fast time along with the time to
memory map the file from cold would be the fairest and most complete
way to compare.
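
A sketch of that kind of comparison (file and object names are
placeholders; on Linux, dropping the page cache as root with
"sync; echo 3 > /proc/sys/vm/drop_caches" is a rough stand-in for a
full reboot):

    library(data.table)

    # Cold timings: run straight after the reboot (or cache drop)
    system.time(DT <- fread("big.csv"))   # includes memory mapping the file
    system.time(load("big.RData"))        # load of the equivalent saved object

    # Warm timings: repeat in the same session once the OS has cached the files
    system.time(DT <- fread("big.csv"))
    system.time(load("big.RData"))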

Matthew


On 11.03.2013 14:51, Matthew Dowle wrote: 

> Exactly - RAM would always be quicker. But maybe you want to read data
> from an on-disk data.table using data.table syntax, rather than from
> some other database or flat text file; i.e. the on-disk data.table
> would not need to fit in RAM.
>
> Benchmark sounds intriguing. Please share if you can. compress=TRUE by
> default in save(), so maybe the decompression takes time, though.
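
(A quick way to check how much of load()'s time is decompression - just
a sketch, with made-up object and file names - is to save the same
object both ways and time the two loads:)

    save(DT, file="dt_compressed.RData")                     # compress=TRUE is the default
    save(DT, file="dt_uncompressed.RData", compress=FALSE)
    system.time(load("dt_compressed.RData"))
    system.time(load("dt_uncompressed.RData"))
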
> 
> On 11.03.2013 14:12, stat quant wrote:
> 
>> Filed as #2605.
>> About your ultimate goal... why would you want on-disk tables rather
>> than RAM (apart from being able to read files larger than RAM)?
>> Wouldn't RAM always be quicker?
>> I think data.table::fread is priceless because it is way faster than
>> any other read function.
>> I just benchmarked fread reading a csv file against R loading its own
>> .RData binary format, and shockingly fread is much faster!
>> I think it is too bad R doesn't provide a very fast way of loading
>> objects saved from a previous R session (well, why don't I do it if
>> it is so easy...)
>> 
>> 2013/3/11 stat quant <mail.statquant at gmail.com>
>> 
>>> On my way to fill it in. 
>>>
>>> About your ultimate goal... why would you want on-disk tables rather
>>> than RAM (apart from being able to read files larger than RAM)?
>>> Wouldn't RAM always be quicker?
>>> 
>>> I think data.table::fread is priceless because it is way faster than
>>> any other read function.
>>> I just benchmarked fread reading a csv file against R loading its own
>>> .RData binary format, and shockingly fread is much faster!
>>> I think it is too bad R doesn't provide a very fast way of loading
>>> objects saved from a previous R session (well, why don't I do it if
>>> it is so easy...)
>>>
>>> 2013/3/11 Matthew Dowle <mdowle at mdowle.plus.com>
>>> 
>>>> Good idea statquant, please file it then. How about something more
>>>> general, e.g.
>>>>
>>>> fread(input, chunk.nrows=10000, chunk.filter=)
>>>>
>>>> That could be grep() or any expression of column names. It wouldn't
>>>> be efficient to call that for every row one by one, and similarly it
>>>> couldn't be called for the whole DT, since the point is that DT is
>>>> greater than RAM. So some batch size needs to be defined, hence
>>>> chunk.nrows=10000. That filter would then be called for each chunk
>>>> and any rows passing would make it into the final table.
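
(A rough emulation of that idea with today's fread, just as a sketch:
the chunk.nrows/chunk.filter arguments above don't exist yet, and
read_filtered below is a made-up helper. Only one chunk plus the rows
that pass the filter are held in RAM at a time, though column types are
re-detected per chunk, which is one reason a native implementation
would be better.)

    library(data.table)

    read_filtered <- function(file, chunk.nrows=10000, filter) {
      nms  <- names(fread(file, nrows=0))      # column names from the header
      out  <- list()
      skip <- 1                                # skip the header line
      repeat {
        chunk <- tryCatch(
          fread(file, skip=skip, nrows=chunk.nrows, header=FALSE),
          error = function(e) NULL)            # past the end of the file
        if (is.null(chunk) || nrow(chunk) == 0L) break
        setnames(chunk, nms)
        out[[length(out)+1L]] <- chunk[filter(chunk)]  # keep rows passing the filter
        if (nrow(chunk) < chunk.nrows) break
        skip <- skip + chunk.nrows
      }
      rbindlist(out)
    }

    # e.g. keep only rows whose first column matches a pattern:
    # DT <- read_filtered("big.csv", 10000, function(d) grepl("^A", d[[1]]))
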
>>>> 
>>>> read.ffdf has something like this I believe, and Jens already
>>>> suggested that when I ran the timings in example(fread) past him.
>>>> We should probably follow his lead on that in terms of argument
>>>> names etc.
>>>>
>>>> Perhaps chunk should be defined in terms of RAM, e.g. chunk=100MB,
>>>> since that is how it needs to be internally, in terms of the number
>>>> of pages to map. Or maybe both, so that either nrows or MB would be
>>>> acceptable.
>>>> 
>>>> Ultimately (maybe in 5 years!) we're heading towards fread reading
>>>> into on-disk tables rather than RAM. Filtering in chunks will always
>>>> be a good option to have though, even then, as you might want to
>>>> filter what makes it to the on-disk table.
>>>>
>>>> Matthew
>>>>
>>>> On 11.03.2013 12:53, MICHELE DE MEO wrote:
>>>>
>>>>> Very interesting request. I would also be interested in this
>>>>> possibility.
>>>>> Cheers
>>>>>
>>>>> 2013/3/11 stat quant <statquant at outlook.com>
>>>>>
>>>>>> Hello list,
>>>>>> We like fread because it is very fast, yet sometimes files are
>>>>>> huge and R cannot handle that much data. Some packages handle this
>>>>>> limitation, but they do not provide a function similar to fread.
>>>>>> Yet sometimes only a subset of a file is really needed, a subset
>>>>>> that could fit into RAM.
>>>>>>
>>>>>> So what about adding a grep option to fread that would allow
>>>>>> loading only the lines that match a regular expression?
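
(One way to approximate that today, as a sketch: pre-filter with grep
and fread the smaller result. This assumes a Unix-like shell and a
pattern that does not need to respect quoting or embedded newlines; the
file name and pattern are placeholders.)

    library(data.table)

    header <- readLines("big.csv", n=1)
    tmp <- tempfile(fileext=".csv")
    writeLines(header, tmp)                  # keep the header line
    system(paste("tail -n +2 big.csv | grep -E 'pattern_to_keep' >>", shQuote(tmp)))
    DT <- fread(tmp)
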
>>>>>>
>>>>>> I'll add a request if you think the idea is worth implementing.
>>>>>>
>>>>>> Cheers

>>>>>
>>>>> --
>>>>> MICHELE DE MEO, PH.D
>>>>> Statistical and data mining solutions
>>>>> http://micheledemeo.blogspot.com/
>>>>> skype: demeo.michele

 
