[datatable-help] A new package multitable (data.list) remind me of a long existing feature request #202 and discussion thread

Chris Neff caneff at gmail.com
Fri Oct 7 20:10:40 CEST 2011


The biggest speed tweak is to pass in the colClasses argument in
read.csv.  I have a little function that reads the first N lines,
guesses the column types based on that, and passes that into read.csv
to read the full file.  This is much faster than the defaults.
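
A minimal sketch of that kind of helper, assuming a plain comma-separated file with a header row. The name read_csv_guessed and the default of 100 sample rows are made up for illustration; this is not the actual function described above.

```r
# Hypothetical helper: guess column classes from the first n rows,
# then reuse them when reading the whole file.
read_csv_guessed <- function(file, n = 100) {
  # Read a small sample and let read.csv infer the column types.
  # as.is = TRUE keeps strings as character rather than factor.
  sample_rows <- read.csv(file, nrows = n, as.is = TRUE)

  # A column can carry several classes; keep the first one of each.
  classes <- vapply(sample_rows, function(col) class(col)[1], character(1))

  # Supplying colClasses lets read.csv skip type detection on the full file.
  read.csv(file, colClasses = unname(classes))
}
```

The guess is only as good as the sample: if a column looks like integer in the first n rows but holds decimals or text further down, the full read will fail, so a larger n (or a fallback to the default read) may be needed.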

Two other tweaks: specifying a guess at the number of rows via nrows
(though I don't think that can be made generic), and specifying
comment.char="". See:
http://www.biostat.jhsph.edu/~rpeng/docs/R-large-tables.html
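
For illustration, a small sketch of those two arguments in a read.table call. The demo file, the column classes and the row estimate of 1100 are all assumptions made up for this example. Note that read.csv already defaults to comment.char="", so that setting mainly matters when calling read.table directly.

```r
# Build a small demo file so the example is self-contained.
path <- tempfile(fileext = ".csv")
write.csv(data.frame(a = 1:1000, b = (1:1000) / 4), path, row.names = FALSE)

dat <- read.table(path, header = TRUE, sep = ",",
                  colClasses = c("integer", "numeric"),  # known column types
                  nrows = 1100,       # rough over-estimate of the row count
                  comment.char = "")  # turn off scanning for comments
```

A mild over-estimate for nrows is fine; rows beyond the end of the file are simply not there to read.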

I can't time the differences because I'm not on my normal machine at
the moment.  But looking forward to this.

On 7 October 2011 10:58, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
> Yes, single delimiter files too. Yes it should be faster than normal speed
> tweaks on read.table.
>
> One (very very basic) test so far has shown 4 times faster for a 7.5MB file
> on disk (5.5s down to 1.3s).   The code and test are already in the package
> (so you can run that test now), see data.table:::read  (3 colons), and the
> 2 source files :
> https://r-forge.r-project.org/scm/viewvc.php/pkg/R/read.R?view=markup&root=datatable
> https://r-forge.r-project.org/scm/viewvc.php/pkg/src/readfile.c?view=markup&root=datatable
>
> But, it doesn't look like I did the speed tweaks for read.csv in that
> comparison.  What are they again?    Any help with this feature would be
> great.
>
> Matthew
>
> "Chris Neff" <caneff at gmail.com> wrote in message
> news:CAAuY0RVoWwFGbcPrCOd6gT4GoamozGuexqSLioS3QHuFhF0c4g at mail.gmail.com...
> On 6 October 2011 00:15, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>> Indeed. Or columns 11 and 12 of BED files (genomics). Near on the agenda
>> is a fast file loader straight into data.table and list columns
>> (dual-delimited files such as BED).
>>
>
> Is this a fast file loader for any files that could be read using
> read.table, or just dual delimited files?  If you can make a way to
> load things that is faster than read.table with the normal speed
> tweaks that get mentioned for it, I'd be ecstatic.
>
>
>> I don't believe SQL has an analogous concept to list columns? To achieve
>> that people may be using comma delimited strings in varchar columns, I
>> guess.
>>
>> On Wed, 2011-10-05 at 16:19 -0500, Branson Owen wrote:
>>> Thank you very, very much Matthew. I think this is a very valuable (at
>>> least to me), and unique feature for more powerful calculation. A very
>>> useful application I can immediately think of is for options chains
>>> and order book modeling. It's much easier to track and model the whole
>>> option chains or order book for each time stamp or symbol, and also
>>> save a lot of replicating time stamps and symbols.
>>>
>>> 2011/10/4 Matthew Dowle <mdowle at mdowle.plus.com>
>>> On Sun, 2011-10-02 at 15:14 +0800, Branson Owen wrote:
>>>
>>> > Oh, sorry, I was testing the syntax like:
>>> >
>>> > DT = data.table(A = 1:2, B = list('a', 2i))
>>> >
>>> > It didn't work, and I thought this feature had not been
>>> > implemented.
>>> > Thank you for pointing it out with a good example.
>>>
>>>
>>> Natural to assume that should work. Now in 1.6.7 :
>>>
>>> o data.table() now accepts list columns directly rather than
>>> needing to add list columns to an existing data.table;
>>> e.g.,
>>>
>>> DT = data.table(x=1:3,y=list(4:6,3.14,matrix(1:12,3)))
>>>
>>> Thanks to Branson Owen for reminding.
>>>
>>> Accordingly, one item has been added to FAQ 2.17 (differences
>>> between data.frame and data.table) :
>>> "data.frame(list(1:2,"k",1:4)) creates 3 columns, data.table
>>> creates one list column"
>>>
>>> As before, list columns can be created via grouping; e.g.,
>>>
>>> DT = data.table(x=c(1,1,2,2,2,3,3),y=1:7)
>>> DT2 = DT[,list(list(unique(y))),by=x]
>>> DT2
>>>      x V1
>>> [1,] 1 1, 2
>>> [2,] 2 3, 4, 5
>>> [3,] 3 6, 7
>>>
>>> and list columns can be grouped; e.g.,
>>>
>>> DT2[,sum(unlist(V1)),by=list(x%%2)]
>>>      x V1
>>> [1,] 1 16
>>> [2,] 0 12
>>>
>>>
>>>
>>>
>>>
>>
>>
>> _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>

