[datatable-help] merge/join/match
Gabor Grothendieck
ggrothendieck at gmail.com
Fri May 3 13:18:59 CEST 2013
I am moving this discussion which started with mdowle to the list.
Consider this example slightly modified from the data.table FAQ:
> X = data.table(x=c("a","a","b","b","b","c","c"), foo=1:7, key="x")
> Y = data.table(y=c("b","c","d"), bar=c(4,2,3))
> out <- X[Y]; out
x foo bar
1: b 3 4
2: b 4 4
3: b 5 4
4: c 6 2
5: c 7 2
6: d NA 3
Note that the first column of the output is labelled x even though the
data to produce it comes from y, e.g. "d" in out$x is not in X$x but
does appear in Y$y so clearly the data is coming from y as opposed to
x . In terms of SQL the above would be written:
select Y.y as x, ...
and the need to renamne the first column of out suggests that there
may be a deeper problem here.
Here are some ideas to address this (they would require changes to data.table):
- the default of X[Y,, match=NA] would be changed to a default of
X[Y,,match=0] so that it corresponds to the defaults in R's merge and
in SQL joins.
- the column name of the first column in the example above would be
changed to y if match=0 but be left at x if match=NA. In the case
that match=0 (the proposed new default) x and y are equal so the first
column can be validly labelled as x but in the case that match=NA they
are not so y would be used as the column name.
- the name match= does seem a bit misleading since R's match only
matches one item in the target whereas in data.table match matches
many if mult="all" and that is the default. Perhaps some thought
should be given to a name change here?
The above would seem to correspond more closely to R's merge and SQL
join defaults. Any use cases or other comments?
--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
More information about the datatable-help
mailing list