[datatable-help] datatable-help Digest, Vol 30, Issue 1

Tue Aug 7 13:43:57 CEST 2012

David,

Another point on this ...

>> From a design philosophy perspective, I see data.table as designed around
>> specific ways of doing things (e.g. assuming that keys are set for joins,
>> assuming that the first columns of Y match the key of X or that Y has a
>> key, etc.).  If your needs don't match these design assumptions, then you
>> have to make modifications.  An alternative approach is to write the
>> syntax for the general case, but implement important optimizations.  For
>> example, always perform a natural join (matching corresponding names in X
>> and Y) for any X[Y], add something like X[Y, by=(Xcol1=Ycol1, ...)] and if
>> the matching uses the key for X, then all the better.
>>
>
> True and agreed. But merge() may be that general case. This is why merge's
> performance has been improved in recent versions, so its speed is
> comparable to X[Y] but with more flexible capabilities.
>
> Also see FAQ 1.12 "What is the difference between X[Y] and merge(X,Y)?"
>

and this ...

>>> If joins matched by name, then the implementation could check if the key
>>> was sufficiently satisfied to be used and otherwise it would just
>>> perform a more conventional non-key'd join.
>
> Just to check here you know that i doesn't have to be keyed. Just x. It's
> not a match by column name, but by position, though (which I prefer since
> I find it onerous to make sure column names match). Note this from
> ?data.table :
>
>    " When i is a data.table, x must have a key. i is joined to x using x's
> key and the rows in x that match are returned. An equi-join is
> performed between each column in i to each column in x's key; i.e.,
> column 1 of i is matched to the 1st column of x's key, column 2 to the
> second, etc. "
>
> Note that "If i also has a key" comes later in that paragraph; i.e., i
> doesn't have to be keyed.
>
> However, #2175 now added ("Add natural joins i.e. X[Y] when X has no key
> could use common column names"), thanks. I think it might be covered by
> merge() though as that's what that does. Maybe not in combination with j
> (that's the efficiency limitation of merge) :
> https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2175&group_id=240&atid=978
>

is that the following idiom:

    X[Y[,list(c,d,e)]]

matches Y$c to key(X)[1], Y$d to key(X)[2] and Y$e to key(X)[3], and

    X[Y[,list(d,c,e)]]

matches Y$d to key(X)[1], Y$c to key(X)[2] and Y$e to key(X)[3].

So it isn't necessary to change the key of Y just to get X[Y] to join on
different columns from Y.
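
As a minimal sketch of that positional matching (the tables and values
here are made up for illustration, and X is assumed to be keyed on two
columns a and b):

```r
library(data.table)

# Hypothetical example data; X is keyed on (a, b).
X <- data.table(a = c(1L, 1L, 2L, 2L),
                b = c(1L, 2L, 1L, 2L),
                v = c(10, 20, 30, 40))
setkey(X, a, b)

# Y is deliberately left unkeyed.
Y <- data.table(c = c(1L, 1L), d = c(1L, 2L))

# Column order in i decides which key column of X each one is matched to.
r1 <- X[Y[, list(c, d)]]  # Y$c -> key(X)[1] (a), Y$d -> key(X)[2] (b)
r2 <- X[Y[, list(d, c)]]  # Y$d -> key(X)[1] (a), Y$c -> key(X)[2] (b)
```

Here r1 picks the rows of X with (a,b) equal to (1,1) and (1,2), while r2
picks (1,1) and (2,1), without Y's key ever being set or changed.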

It's only necessary to change the key of X to join to different columns of
X, and that's where Yike's idiom using setkey(), merge.data.table, or
manual secondary keys come in. Once secondary keys are built into the
syntax, you'd be able to set2key(X,...) and then have some way to join to
that key of X rather than the primary key of X.
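
A rough sketch of the first two options, re-keying X with setkey() and
using merge(). The data is hypothetical, this is not necessarily the exact
idiom referred to above, and the by.x/by.y arguments assume a reasonably
recent data.table release:

```r
library(data.table)

# Hypothetical example data.
X <- data.table(a = c(1L, 2L, 3L),
                b = c(3L, 2L, 1L),
                v = c(10, 20, 30))
Y <- data.table(c = c(1L, 3L))

setkey(X, a)
by_a <- X[Y]   # Y$c is matched to X$a

setkey(X, b)   # re-key X by reference; X is now sorted by b
by_b <- X[Y]   # the same Y is now matched to X$b

# merge() names the join columns explicitly instead of relying on the key.
m <- merge(X, Y, by.x = "b", by.y = "c")
```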

True, Y[,list(c,d,e)] currently does copy each column, but i) that copy is
faster than base R's copy, because base R's duplicate.c doesn't use memcpy
yet (see FAQ 1.8), ii) the intention is to change it to a shallow copy in
future, which won't even need a memcpy (data.table would then keep track
and copy the columns on change, if needed), and iii) Y often has fewer
rows than X, so copying Y's columns is often, though not always, a minor
cost.

Finally, an aside. Note that in the example above, if key(X) had length 2,
then only c and d of Y would be used in the join. e would then be
available to X's j via 'join inherited scope'. Join inherited scope avoids
allocating and recycling the items in each row of Y to match the number of
rows in the group of X that the row matched.
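
A small sketch of that aside, with made-up data. It assumes a recent
data.table release, where by = .EACHI asks for j to be evaluated once per
row of i:

```r
library(data.table)

# Hypothetical example data; key(X) has length 2.
X <- data.table(a = c(1L, 1L, 1L, 2L),
                b = c(1L, 1L, 2L, 1L),
                v = c(10, 20, 30, 40))
setkey(X, a, b)

Y <- data.table(c = c(1L, 1L),
                d = c(1L, 2L),
                e = c(100, 200))

# Only c and d take part in the join; e is not a join column, so j can
# refer to it directly. Within each group e is a single value, so it is
# not expanded to the number of matching rows of X.
res <- X[Y[, list(c, d, e)], list(s = sum(v) * e), by = .EACHI]
```

For the first row of Y the matching group of X has v = 10 and 20, so s is
30 * 100; for the second row the group has v = 30, so s is 30 * 200.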

Matthew