[datatable-help] A new package multitable (data.list) remind me of a long existing feature request #202 and discussion thread

Fri Oct 7 23:17:39 CEST 2011

Thanks for the reminders; that's great. Your function that guesses
column types from the first N rows would be very useful, as I was going
to have to write that myself. Any chance you could contribute it?
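
For reference, a minimal sketch of that idea in R. This is not Chris's actual function (which has not been posted to the list), just one way it could look: read the first n rows with read.csv, take the class of each column, and pass the result as colClasses for the full read. The name guess_col_classes, the default n = 100 and the demo file are all hypothetical.

```r
# Hypothetical helper (not from the thread): guess column classes from the
# first n rows of a delimited file so they can be reused for the full read.
guess_col_classes <- function(file, n = 100, ...) {
  head_rows <- read.csv(file, nrows = n, stringsAsFactors = FALSE, ...)
  # Take the first class of each column, e.g. "integer", "numeric", "character".
  unname(vapply(head_rows, function(col) class(col)[1], character(1)))
}

# Small self-contained demo on a temporary file (made-up data).
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id    = 1:5,
                     price = c(1.5, 2, 3.25, 4, 5.5),
                     sym   = c("a", "b", "c", "d", "e")),
          tmp, row.names = FALSE)

classes <- guess_col_classes(tmp, n = 3)

# Full read using the guessed classes, plus the comment.char tweak.
full <- read.csv(tmp, colClasses = classes, comment.char = "")
```

One caveat with this approach: if a column looks integer in the first n rows but contains decimals or text further down, the full read will fail, so a larger n or a fallback to "character" may be needed.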

> Also specifying a guess at number of rows (but I don't think that can
> be made generic)

Yes, that's important. I'm planning to divide the file size in bytes by
the average row size in bytes over the first 10 rows to make a good
guess at the number of rows, and then add a 5% margin. dogroups already
works in a similar way and grows vectors efficiently if the guess turns
out to be insufficient. Even if the guess is too low (e.g. when there are
a lot of varying-length character strings), by the time it knows that,
it will have read most of the file and will be able to make one single
(much more accurate) guess to finish off the load.

That should be faster than sweeping through the whole file counting \n
(which is what wc -l does, and does very well, but on unix or cygwin only).
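
A rough sketch of the row-count guess described above, written in plain R rather than in the C loader. It assumes one header line and "\n" line endings; the name estimate_rows, its arguments and the demo data are hypothetical, and it does not implement the "grow and re-guess" step.

```r
# Hypothetical helper (not the package code): estimate the number of data
# rows from the file size and the average size of the first few rows.
estimate_rows <- function(file, sample_rows = 10, margin = 0.05) {
  lines <- readLines(file, n = sample_rows + 1)       # header + sample rows
  stopifnot(length(lines) >= 2)                       # need at least one data row
  header_bytes <- nchar(lines[1], type = "bytes") + 1 # +1 for the assumed "\n"
  sample_bytes <- nchar(lines[-1], type = "bytes") + 1
  body_bytes <- file.info(file)$size - header_bytes
  # Rows ~= remaining bytes / average row size, plus a safety margin.
  ceiling(body_bytes / mean(sample_bytes) * (1 + margin))
}

# Small self-contained demo on a temporary file (made-up data).
tmp <- tempfile(fileext = ".csv")
n_actual <- 1000
write.csv(data.frame(id = seq_len(n_actual),
                     x  = round(seq_len(n_actual) / 7, 3)),
          tmp, row.names = FALSE)

n_guess <- estimate_rows(tmp)

# nrows is an upper bound in read.csv, so an over-estimate is harmless.
full <- read.csv(tmp, nrows = n_guess, comment.char = "")
```

In this demo the first 10 rows are shorter than later ones, so the guess comes out high, which is the safe direction. The opposite case (early rows longer than average) is where the guess falls short and the vectors have to grow.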

On Fri, 2011-10-07 at 11:10 -0700, Chris Neff wrote:
> The biggest speed tweak is to pass in the colClasses argument in
> read.csv.  I have a little function that reads the first N lines,
> guesses the column types based on that, and passes that into read.csv
> to read the full file.  This is much faster than the defaults.
> 
> Also specifying a guess at number of rows (but I don't think that can
> be made generic), and specifying comment.char="". See:
> http://www.biostat.jhsph.edu/~rpeng/docs/R-large-tables.html
> 
> I can't time the differences because I'm not on my normal machine at
> the moment.  But looking forward to this.
> 
> On 7 October 2011 10:58, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
> > Yes, single delimiter files too. Yes it should be faster than normal speed
> > tweaks on read.table.
> >
> > One (very very basic) test so far has shown 4 times faster for a 7.5MB file
> > on disk (5.5s down to 1.3s).   The code and test is already in the package
> > (so you can run that test now), see data.table:::read  (3 colons), and the
> > 2 source files :
> > https://r-forge.r-project.org/scm/viewvc.php/pkg/R/read.R?view=markup&root=datatable
> > https://r-forge.r-project.org/scm/viewvc.php/pkg/src/readfile.c?view=markup&root=datatable
> >
> > But, it doesn't look like I did the speed tweaks for read.csv in that
> > comparison.  What are they again?    Any help with this feature would be
> > great.
> >
> > Matthew
> >
> > "Chris Neff" <caneff at gmail.com> wrote in message
> > news:CAAuY0RVoWwFGbcPrCOd6gT4GoamozGuexqSLioS3QHuFhF0c4g at mail.gmail.com...
> > On 6 October 2011 00:15, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
> >> Indeed. Or columns 11 and 12 of BED files (genomics). Near the top of the
> >> agenda is a fast file loader straight into data.table and list columns
> >> (dual-delimited files such as BED).
> >>
> >
> > Is this a fast file loader for any files that could be read using
> > read.table, or just dual delimited files?  If you can make a way to
> > load things that is faster than read.table with the normal speed
> > tweaks that get mentioned for it, I'd be ecstatic.
> >
> >
> >> I don't believe SQL has an analogous concept to list columns? To achieve
> >> that people may be using comma delimited strings in varchar columns, I
> >> guess.
> >>
> >> On Wed, 2011-10-05 at 16:19 -0500, Branson Owen wrote:
> >>> Thank you very, very much Matthew. I think this is a very valuable (at
> >>> least to me) and unique feature for more powerful calculations. A very
> >>> useful application I can immediately think of is option chain
> >>> and order book modeling. It's much easier to track and model the whole
> >>> option chain or order book for each time stamp or symbol, and it also
> >>> avoids replicating a lot of time stamps and symbols.
> >>>
> >>> 2011/10/4 Matthew Dowle <mdowle at mdowle.plus.com>
> >>> On Sun, 2011-10-02 at 15:14 +0800, Branson Owen wrote:
> >>>
> >>> > Oh, sorry, I was testing the syntax like:
> >>> >
> >>> > DT = data.table(A = 1:2, B = list('a', 2i))
> >>> >
> >>> > It didn't work, and I thought this feature had not been
> >>> > implemented.
> >>> > Thank you for pointing it out with a good example.
> >>>
> >>>
> >>> Natural to assume that should work. Now in 1.6.7 :
> >>>
> >>> o data.table() now accepts list columns directly rather than
> >>> needing to add list columns to an existing data.table;
> >>> e.g.,
> >>>
> >>> DT = data.table(x=1:3,y=list(4:6,3.14,matrix(1:12,3)))
> >>>
> >>> Thanks to Branson Owen for reminding.
> >>>
> >>> Accordingly, one item has been added to FAQ 2.17 (differences
> >>> between data.frame and data.table) :
> >>> "data.frame(list(1:2,"k",1:4))
> >>> creates 3 columns, data.table creates one list column"
> >>>
> >>> As before, list columns can be created via grouping; e.g.,
> >>>
> >>> DT = data.table(x=c(1,1,2,2,2,3,3),y=1:7)
> >>> DT2 = DT[,list(list(unique(y))),by=x]
> >>> DT2
> >>> x V1
> >>> [1,] 1 1, 2
> >>> [2,] 2 3, 4, 5
> >>> [3,] 3 6, 7
> >>>
> >>> and list columns can be grouped; e.g.,
> >>>
> >>> DT2[,sum(unlist(V1)),by=list(x%%2)]
> >>> x V1
> >>> [1,] 1 16
> >>> [2,] 0 12
> >>>
> >>
> >>
> >> _______________________________________________
> >> datatable-help mailing list
> >> datatable-help at lists.r-forge.r-project.org
> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> >>