[datatable-help] merge/join/match

Gabor Grothendieck ggrothendieck at gmail.com
Fri May 3 17:57:45 CEST 2013


In my last post it should have read:

   That X[Y] is not the same as Y[X] is analogous to the fact that
   merge(X, Y, all.y=TRUE) is not the same as merge(Y, X, all.y=TRUE)

On Fri, May 3, 2013 at 11:55 AM, Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:
> Assuming same-named keys, then these are all the same except possibly
> for row and column order:
>
> X[Y,,nomatch=0]
> Y[X,,nomatch=0]
> merge(X, Y)
> merge(Y, X)
>
> That X[Y] is not the same as Y[X] is analogous to the fact that
> merge(X, Y, all.x=TRUE) is not the same as merge(Y, X, all.x=TRUE)
>
> On Fri, May 3, 2013 at 11:46 AM, Arunkumar Srinivasan
> <aragorn168b at gmail.com> wrote:
>> Gabor,
>>
>> X[Y] and Y[X] are not necessarily the same operations (meaning, they don't
>> *have* to give the same output). However, merge(X,Y) and merge(Y,X) *have*
>> to provide the same output (except for the column order and names). In that
>> sense, a join is a bit different from a merge, no?
>>
>> Arun
>>
>> On Friday, May 3, 2013 at 5:36 PM, Gabor Grothendieck wrote:
>>
>> Yes, except that is not really what happens since match() only matches
>> one row whereas with mult="all", the default, all rows are matched
>> which is not really matching in the sense of match(). The current
>> naming confuses matching with joining and its really the latter that
>> is being done.
>>
>> Regarding the existence of merge the advantage of [ is that it will
>> automatically only take the columns needed so merge is not really
>> equivalent to [ in all respects. Furthermore having to use different
>> constructs for different types of merge seems awkward.
>>
>>
>> On Fri, May 3, 2013 at 11:27 AM, Eduard Antonyan
>> <eduard.antonyan at gmail.com> wrote:
>>
>> Btw the way I think about the "nomatch" name is as follows - normally X[Y]
>> tries to match rows of Y with rows of X, and then "nomatch" tells it what to
>> do when there is *no match*.
>>
>>
>> On Fri, May 3, 2013 at 10:23 AM, Eduard Antonyan <eduard.antonyan at gmail.com>
>> wrote:
>>
>>
>> To clarify - that behavior is already implemented in merge (more
>> specifically merge.data.table). I don't really have a view on having it in
>> X[Y] as well - I don't like all.x and all.y as the names, since there are no
>> params named 'x' and 'y' in [.data.table (as opposed to merge), but some
>> param that would do a full outer join could certainly be added.
>>
>>
>> On Fri, May 3, 2013 at 10:09 AM, Gabor Grothendieck
>> <ggrothendieck at gmail.com> wrote:
>>
>>
>> Yes, sorry. Its nomatch= which presumably derives from the parameter
>> of the same name in the match() function. If the idea of the nomatch=
>> name was to leverage off existing argument names in R then I would
>> prefer all.y= to be consistent with merge() in place of nomatch= since
>> we are really merging/joining rather than just matching. That would
>> also allow extension to all types of join by adding all.an x= argument
>> too.
>>
>> On Fri, May 3, 2013 at 10:59 AM, Eduard Antonyan
>> <eduard.antonyan at gmail.com> wrote:
>>
>> I would prefer nomatch=0 as a default though, simply because that's
>> what I
>> do most of the time :)
>>
>>
>> On Fri, May 3, 2013 at 9:57 AM, Eduard Antonyan
>> <eduard.antonyan at gmail.com>
>> wrote:
>>
>>
>> A correction - the param is called "nomatch", not "match".
>>
>> This use case seems like smth a user shouldn't really do - in an ideal
>> world you should have them both keyed by the same-name column.
>>
>> As is, my view on it is that data.table is correcting the user mistake
>> of
>> naming the column in Y - y, instead of x, and so the output makes
>> sense and
>> I don't see the need of complicating the behavior by adding more cases
>> one
>> has to go through to figure out what the output columns would be.
>> Similar to
>> asking for X[J(c("b", "c", "d"))] - you wouldn't want an anonymous
>> column
>> there, would you?
>>
>>
>>
>> On Fri, May 3, 2013 at 6:18 AM, Gabor Grothendieck
>> <ggrothendieck at gmail.com> wrote:
>>
>>
>> I am moving this discussion which started with mdowle to the list.
>>
>> Consider this example slightly modified from the data.table FAQ:
>>
>> X = data.table(x=c("a","a","b","b","b","c","c"), foo=1:7, key="x")
>> Y = data.table(y=c("b","c","d"), bar=c(4,2,3))
>> out <- X[Y]; out
>>
>> x foo bar
>> 1: b 3 4
>> 2: b 4 4
>> 3: b 5 4
>> 4: c 6 2
>> 5: c 7 2
>> 6: d NA 3
>>
>> Note that the first column of the output is labelled x even though
>> the
>> data to produce it comes from y, e.g. "d" in out$x is not in X$x but
>> does appear in Y$y so clearly the data is coming from y as opposed to
>> x . In terms of SQL the above would be written:
>>
>> select Y.y as x, ...
>>
>> and the need to renamne the first column of out suggests that there
>> may be a deeper problem here.
>>
>> Here are some ideas to address this (they would require changes to
>> data.table):
>>
>> - the default of X[Y,, match=NA] would be changed to a default of
>> X[Y,,match=0] so that it corresponds to the defaults in R's merge and
>> in SQL joins.
>>
>> - the column name of the first column in the example above would be
>> changed to y if match=0 but be left at x if match=NA. In the case
>> that match=0 (the proposed new default) x and y are equal so the
>> first
>> column can be validly labelled as x but in the case that match=NA
>> they
>> are not so y would be used as the column name.
>>
>> - the name match= does seem a bit misleading since R's match only
>> matches one item in the target whereas in data.table match matches
>> many if mult="all" and that is the default. Perhaps some thought
>> should be given to a name change here?
>>
>> The above would seem to correspond more closely to R's merge and SQL
>> join defaults. Any use cases or other comments?
>>
>> --
>> Statistics & Software Consulting
>> GKX Group, GKX Associates Inc.
>> tel: 1-877-GKX-GROUP
>> email: ggrothendieck at gmail.com
>> _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org
>>
>>
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>
>>
>>
>>
>> --
>> Statistics & Software Consulting
>> GKX Group, GKX Associates Inc.
>> tel: 1-877-GKX-GROUP
>> email: ggrothendieck at gmail.com
>>
>>
>>
>>
>> --
>> Statistics & Software Consulting
>> GKX Group, GKX Associates Inc.
>> tel: 1-877-GKX-GROUP
>> email: ggrothendieck at gmail.com
>> _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>
>>
>
>
>
> --
> Statistics & Software Consulting
> GKX Group, GKX Associates Inc.
> tel: 1-877-GKX-GROUP
> email: ggrothendieck at gmail.com



--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com


More information about the datatable-help mailing list