[datatable-help] List-valued column
Stavros Macrakis
macrakis at alum.mit.edu
Thu Oct 27 17:55:05 CEST 2011
Sorry, I was abusing the term "binary merge". What I was actually thinking
of was algorithms like the ones discussed in
"Adaptive Set Intersections, Unions, and
Differences<http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.36.9963>"
(2000) and
"Faster Adaptive Set Intersections for Text
Searching<http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.87.9365>"
(2006).The algorithm you describe is good for nrow(i) << nrow(x) (though
not as good as the algorithms described above) , but as you say, not very
good when nrow(i) ~ nrow(x), in which case it will be roughly
log2(nrow(x))/2 times slower than simple linear merge. Of course, the
detailed speed comparisons will depend on the distribution of values and the
exact code in the inner loops.> By "preexisting keys" you do mean both
tables have a key, right?Yes.> I'd be very very interested to see a script
and timing results. I could take that on as a project this weekend. But is
it worth it? The papers cited above have lots of performance tests showing
that those algorithms are better than binary search merge or linear merge
and everything in between. -s
On Thu, Oct 27, 2011 at 03:55, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
> I think I know what you mean but just to first check a few things. This
> nestled deep in ?data.table :
>
> "If i also has a key, it is i's key columns that are used to match to
> x's key columns and a binary merge of the two tables is carried out."
>
> By 'binary merge', what ?data.table means is that each row of i isn't
> looked up in x using binary search from scratch. When i has a key (too)
> the lower bound of binary search is not reset to 0 after each row in i
> (as it is when i isn't keyed), it's left at the row in x matched by the
> previous row of i. The upper bound is reset to nrow(x) for each row of
> i, though. It's kind of a half-way approach. Yes, the assumption so far
> has been that nrow(i)<<nrow(x).
>
> By "preexisting keys" you do mean both tables have a key, right? Some
> (incorrectly) refer to the key columns as keys (plural). The terminology
> is that a data.table has at most one key (singular). A key has many
> columns. Just checking no confusion before proceeding ...
>
> I'd be very very interested to see a script and timing results. It would
> make sense to do a linear search where it makes more sense, no problem.
> That could be an internal switch (if we know what the rule should be)
> that could be overridden by user. The worst case is X[X], isn't it.
>
> Also, if you know x's key is unique then setting mult="first" may be
> faster. I'd be very interested in the difference that makes, perhaps on
> the worst case X[X] vs X[X,mult="first"].
>
> Matthew
>
>
> On Wed, 2011-10-26 at 17:57 -0400, Stavros Macrakis wrote:
> > Some quick testing seems to show that the cost of joining table A to
> > table B (on preexisting keys) is proportional to
> > size(A)*log2(size(B)), that is, each element of table A is looked up
> > (using binary search) in table B. This is very efficient when size(A)
> > << size(B), but when they're of similar size, a linear merge can be
> > done in time size(A)+size(B), which can be several times faster.
> > Binary merge has a smooth transition from one regime to the other,
> > but a simple cross-over is pretty good already.
> >
> >
> > I would put together a nice script to show this explicitly, but that
> > will take some time, so I wanted to check whether this is a known
> > issue or if I need to demonstrate this issue in more detail. (Or if
> > I've made some stupid blunder.)
> >
> >
> > Thanks!
> >
> >
> > -s
> > _______________________________________________
> > datatable-help mailing list
> > datatable-help at lists.r-forge.r-project.org
> >
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20111027/82da150c/attachment.htm>
More information about the datatable-help
mailing list