[datatable-help] merge/join/match

Eduard Antonyan eduard.antonyan at gmail.com
Fri May 3 16:57:19 CEST 2013


A correction - the parameter is called "nomatch", not "match".

This use case seems like something a user shouldn't really do - in an ideal
world both tables would be keyed by a column of the same name.

As is, my view is that data.table is correcting the user's mistake of
naming the column in Y "y" instead of "x", so the output makes sense. I
don't see the need to complicate the behavior by adding more cases one
has to go through to figure out what the output columns will be. It is
similar to asking for X[J(c("b", "c", "d"))] - you wouldn't want an
anonymous column there, would you?
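
For concreteness, here is a minimal sketch of the behavior under discussion,
using the X and Y from the quoted example. It assumes a data.table version in
which X[Y] joins an unkeyed Y on its first column; the names outer, inner,
via_J and renamed are only illustrative.

```r
library(data.table)

X <- data.table(x = c("a","a","b","b","b","c","c"), foo = 1:7, key = "x")
Y <- data.table(y = c("b","c","d"), bar = c(4, 2, 3))

# Default (nomatch = NA): every row of Y is kept, so "d" appears with foo = NA.
outer <- X[Y]

# nomatch = 0: rows of Y with no match in X are dropped.
inner <- X[Y, nomatch = 0]

# Joining on an unnamed J() column: the join column still takes X's name, x.
via_J <- X[J(c("b", "c", "d"))]

# If the join column should carry Y's name instead, rename it explicitly.
# copy() is used so that outer itself is left untouched.
renamed <- copy(outer)
setnames(renamed, "x", "y")
```

With nomatch = 0 the values in the first column all occur in both X$x and
Y$y, so either name is accurate; with the default they come from Y$y.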



On Fri, May 3, 2013 at 6:18 AM, Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:

> I am moving this discussion which started with mdowle to the list.
>
> Consider this example slightly modified from the data.table FAQ:
>
> > X = data.table(x=c("a","a","b","b","b","c","c"), foo=1:7, key="x")
> > Y = data.table(y=c("b","c","d"), bar=c(4,2,3))
> > out <- X[Y]; out
>        x foo bar
> 1:     b   3   4
> 2:     b   4   4
> 3:     b   5   4
> 4:     c   6   2
> 5:     c   7   2
> 6:     d  NA   3
>
> Note that the first column of the output is labelled x even though the
> data to produce it comes from y, e.g. "d" in out$x is not in X$x but
> does appear in Y$y so clearly the data is coming from y as opposed to
> x.  In terms of SQL the above would be written:
>
>     select Y.y as x, ...
>
> and the need to rename the first column of out suggests that there
> may be a deeper problem here.
>
> Here are some ideas to address this (they would require changes to
> data.table):
>
> - the default of X[Y,, match=NA] would be changed to a default of
> X[Y,,match=0] so that it corresponds to the defaults in R's merge and
> in SQL joins.
>
> - the column name of the first column in the example above would be
> changed to y if match=0 but be left at x if match=NA.  In the case
> that match=0 (the proposed new default) x and y are equal so the first
> column can be validly labelled as x but in the case that match=NA they
> are not so y would be used as the column name.
>
> - the name match= does seem a bit misleading since R's match only
> matches one item in the target whereas in data.table match matches
> many if mult="all" and that is the default.  Perhaps some thought
> should be given to a name change here?
>
> The above would seem to correspond more closely to R's merge and SQL
> join defaults.  Any use cases or other comments?
>
> --
> Statistics & Software Consulting
> GKX Group, GKX Associates Inc.
> tel: 1-877-GKX-GROUP
> email: ggrothendieck at gmail.com
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>