[datatable-help] List-valued column

Matthew Dowle mdowle at mdowle.plus.com
Thu Oct 27 09:55:15 CEST 2011


I think I know what you mean but just to first check a few things. This
nestled deep in ?data.table :

   "If i also has a key, it is i's key columns that are used to match to
x's key columns and a binary merge of the two tables is carried out."

By 'binary merge', what ?data.table means is that each row of i isn't
looked up in x using binary search from scratch. When i has a key (too)
the lower bound of binary search is not reset to 0 after each row in i
(as it is when i isn't keyed), it's left at the row in x matched by the
previous row of i. The upper bound is reset to nrow(x) for each row of
i, though. It's kind of a half-way approach. Yes, the assumption so far
has been that nrow(i)<<nrow(x).

By "preexisting keys" you do mean both tables have a key, right? Some
(incorrectly) refer to the key columns as keys (plural). The terminology
is that a data.table has at most one key (singular). A key has many
columns. Just checking no confusion before proceeding ...

I'd be very very interested to see a script and timing results. It would
make sense to do a linear search where it makes more sense, no problem.
That could be an internal switch (if we know what the rule should be)
that could be overridden by user. The worst case is X[X], isn't it.

Also, if you know x's key is unique then setting mult="first" may be
faster. I'd be very interested in the difference that makes, perhaps on
the worst case X[X] vs X[X,mult="first"].

Matthew


On Wed, 2011-10-26 at 17:57 -0400, Stavros Macrakis wrote:
> Some quick testing seems to show that the cost of joining table A to
> table B (on preexisting keys) is proportional to
> size(A)*log2(size(B)), that is, each element of table A is looked up
> (using binary search) in table B.  This is very efficient when size(A)
> << size(B), but when they're of similar size, a linear merge can be
> done in time size(A)+size(B), which can be several times faster.
>  Binary merge has a smooth transition from one regime to the other,
> but a simple cross-over is pretty good already.
> 
> 
> I would put together a nice script to show this explicitly, but that
> will take some time, so I wanted to check whether this is a known
> issue or if I need to demonstrate this issue in more detail.  (Or if
> I've made some stupid blunder.)
> 
> 
> Thanks!
> 
> 
>              -s
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help




More information about the datatable-help mailing list