[datatable-help] List-valued column

Matthew Dowle mdowle at mdowle.plus.com
Thu Oct 27 18:51:59 CEST 2011


Really interesting, thanks. I skimmed those papers. Initial thoughts are:

  i) their data is natural language text (web search engines), and they
assume a uniform distribution. Typical identifiers and time series used in
data.table might not be uniform, but yes, we could detect that (at some
cost) and adapt.

  ii) they seem to consider a single-column sort; data.table is tuned for
multi-column keys, see below

  iii) neither paper contains the word "page" (as in memory page) other
than in the website context

data.table's binary search goes column by column. So when I said the upper
bound is set to n for each row of i, what I meant was n *for this column
of the key*, not n of the whole table. See 'prevupp' in the C code. That
upper bound may well sit on the same page in L2 a lot of the time.
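
In rough R terms, the carried lower bound looks something like this (a
single-column sketch with made-up names, not the actual C code; the
per-column localisation of the upper bound isn't shown):

# Sketch: when i is keyed too, the lower bound 'lb' carries over from the
# previous match instead of resetting, while the upper bound resets to n.
binary_merge_sketch <- function(xcol, icol) {
  n <- length(xcol)               # xcol and icol are both sorted (keyed)
  ans <- integer(length(icol))
  lb <- 1L                        # carried lower bound
  for (k in seq_along(icol)) {
    lo <- lb; hi <- n             # upper bound reset to n for each row of i
    while (lo < hi) {
      mid <- (lo + hi) %/% 2L
      if (xcol[mid] < icol[k]) lo <- mid + 1L else hi <- mid
    }
    ans[k] <- if (xcol[lo] == icol[k]) lo else NA_integer_
    if (!is.na(ans[k])) lb <- ans[k]  # start the next search from here
  }
  ans
}

binary_merge_sketch(c(1, 3, 3, 5, 9), c(3, 5, 7))   # 2 4 NA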

I didn't know the terms 'galloping' search or 'one-sided' binary search.
Perhaps I did that anyway without knowing its name. I've only skimmed, so
I'm not sure.
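
From the little I've read, 'galloping' (exponential) search doubles the
stride until it overshoots, then binary searches within the bracket.
Something like this, I think (my own illustration, not data.table code):

# Returns the first position pos >= from with x[pos] >= v, or length(x)+1.
gallop <- function(x, v, from = 1L) {
  n <- length(x)
  if (from > n || x[from] >= v) return(from)
  step <- 1L
  lo <- from                      # invariant: x[lo] < v
  while (lo + step <= n && x[lo + step] < v) {
    lo <- lo + step
    step <- step * 2L             # the 'galloping' part: double the stride
  }
  hi <- min(lo + step, n + 1L)    # first candidate >= v (or past the end)
  while (lo + 1L < hi) {          # ordinary binary search within (lo, hi]
    mid <- (lo + hi) %/% 2L
    if (x[mid] < v) lo <- mid else hi <- mid
  }
  hi
}

gallop(c(1, 3, 5, 7, 9, 11), 7)   # 4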

I think the multi-column aspect, and the localised upper bound within each
column, are significant differences which mean that, yes, timing tests
would be worth it. My guess is that X[X] is significantly better when X's
key has two columns instead of one, because of the localised upper bound.

Finally, the advantage of linear search may not be so clear in the case of
multiple columns: it has to look at each column for each row, which skips
across RAM and isn't L2-cache efficient. Those papers don't tackle that
dimension, as far as I could see in a quick skim.

Definitely worth some timing tests, I think!
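
For example, something along these lines (just a sketch; the sizes, number
of key columns and value distributions would all need varying):

library(data.table)

n <- 1e6
X <- data.table(a = sample(1e4, n, replace = TRUE),
                b = sample(1e4, n, replace = TRUE),
                v = rnorm(n))
setkey(X, a, b)                    # two-column key

system.time(X[X])                  # worst case: keyed i, nrow(i) == nrow(x)
system.time(X[X, mult = "first"])  # may help when x's key is known unique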


> Sorry, I was abusing the term "binary merge". What I was actually
> thinking of was algorithms like the ones discussed in "Adaptive Set
> Intersections, Unions, and Differences"
> <http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.36.9963> (2000)
> and "Faster Adaptive Set Intersections for Text Searching"
> <http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.87.9365> (2006).
>
> The algorithm you describe is good for nrow(i) << nrow(x) (though not as
> good as the algorithms described above), but as you say, not very good
> when nrow(i) ~ nrow(x), in which case it will be roughly
> log2(nrow(x))/2 times slower than a simple linear merge. Of course, the
> detailed speed comparisons will depend on the distribution of values and
> the exact code in the inner loops.
>
>> By "preexisting keys" you do mean both tables have a key, right?
>
> Yes.
>
>> I'd be very very interested to see a script and timing results.
>
> I could take that on as a project this weekend. But is it worth it? The
> papers cited above have lots of performance tests showing that those
> algorithms are better than binary search merge or linear merge and
> everything in between.
>
>            -s
>
> On Thu, Oct 27, 2011 at 03:55, Matthew Dowle <mdowle at mdowle.plus.com>
> wrote:
>
>> I think I know what you mean, but first let me check a few things. This
>> is nestled deep in ?data.table:
>>
>>   "If i also has a key, it is i's key columns that are used to match to
>> x's key columns and a binary merge of the two tables is carried out."
>>
>> By 'binary merge', what ?data.table means is that each row of i isn't
>> looked up in x using binary search from scratch. When i has a key (too)
>> the lower bound of binary search is not reset to 0 after each row in i
>> (as it is when i isn't keyed), it's left at the row in x matched by the
>> previous row of i. The upper bound is reset to nrow(x) for each row of
>> i, though. It's kind of a half-way approach. Yes, the assumption so far
>> has been that nrow(i)<<nrow(x).
>>
>> By "preexisting keys" you do mean both tables have a key, right? Some
>> (incorrectly) refer to the key columns as keys (plural). The terminology
>> is that a data.table has at most one key (singular). A key has many
>> columns. Just checking no confusion before proceeding ...
>>
>> I'd be very very interested to see a script and timing results. It would
>> make sense to do a linear search where that's the faster approach, no
>> problem. That could be an internal switch (if we know what the rule
>> should be) which the user could override. The worst case is X[X], isn't
>> it?
>>
>> Also, if you know x's key is unique then setting mult="first" may be
>> faster. I'd be very interested in the difference that makes, perhaps on
>> the worst case X[X] vs X[X, mult="first"].
>>
>> Matthew
>>
>>
>> On Wed, 2011-10-26 at 17:57 -0400, Stavros Macrakis wrote:
>> > Some quick testing seems to show that the cost of joining table A to
>> > table B (on preexisting keys) is proportional to
>> > size(A)*log2(size(B)), that is, each element of table A is looked up
>> > (using binary search) in table B.  This is very efficient when size(A)
>> > << size(B), but when they're of similar size, a linear merge can be
>> > done in time size(A)+size(B), which can be several times faster.
>> >  Binary merge has a smooth transition from one regime to the other,
>> > but a simple cross-over is pretty good already.
>> >
>> >
>> > I would put together a nice script to show this explicitly, but that
>> > will take some time, so I wanted to check whether this is a known
>> > issue or if I need to demonstrate this issue in more detail.  (Or if
>> > I've made some stupid blunder.)
>> >
>> >
>> > Thanks!
>> >
>> >
>> >              -s
>> > _______________________________________________
>> > datatable-help mailing list
>> > datatable-help at lists.r-forge.r-project.org
>> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>
>>
>>
>



