[datatable-help] merge/join/match

Gabor Grothendieck ggrothendieck at gmail.com
Fri May 3 18:50:23 CEST 2013


I was thinking of all.y= would be implemented first to replace
nomatch= and then all.x= later (as the latter involves more work
whereas the former is trivial).  It would be possible to implement
by.x, by.y and by too such that they default to the current
functionality and that might be nice too.

On Fri, May 3, 2013 at 12:45 PM, Arunkumar Srinivasan
<aragorn168b at gmail.com> wrote:
> Gabor,
>
> Very true. I suppose your request is that the x[i] where `i` is a data.table
> should have the same set of options like R's base `merge` function, like,
> by.y=TRUE, by.x=TRUE or all=TRUE. I like the idea by itself. However, I am
> not able to think of a way to do this. I mean, I find the syntax X[Y,
> by.x=TRUE] weird / not making sense. That is, to me even though
>
> X[Y] is equal to Y[X, by.y=TRUE] (or) X[Y, by.x=TRUE] (ignoring the
> reordered columns) the latter 2 don't seem to make sense/is redundant (maybe
> it's because I am used to this syntax).
>
> Arun
>
> On Friday, May 3, 2013 at 5:57 PM, Gabor Grothendieck wrote:
>
> In my last post it should have read:
>
> That X[Y] is not the same as Y[X] is analogous to the fact that
> merge(X, Y, all.y=TRUE) is not the same as merge(Y, X, all.y=TRUE)
>
> On Fri, May 3, 2013 at 11:55 AM, Gabor Grothendieck
> <ggrothendieck at gmail.com> wrote:
>
> Assuming same-named keys, then these are all the same except possibly
> for row and column order:
>
> X[Y,,nomatch=0]
> Y[X,,nomatch=0]
> merge(X, Y)
> merge(Y, X)
>
> That X[Y] is not the same as Y[X] is analogous to the fact that
> merge(X, Y, all.x=TRUE) is not the same as merge(Y, X, all.x=TRUE)
>
> On Fri, May 3, 2013 at 11:46 AM, Arunkumar Srinivasan
> <aragorn168b at gmail.com> wrote:
>
> Gabor,
>
> X[Y] and Y[X] are not necessarily the same operations (meaning, they don't
> *have* to give the same output). However, merge(X,Y) and merge(Y,X) *have*
> to provide the same output (except for the column order and names). In that
> sense, a join is a bit different from a merge, no?
>
> Arun
>
> On Friday, May 3, 2013 at 5:36 PM, Gabor Grothendieck wrote:
>
> Yes, except that is not really what happens since match() only matches
> one row whereas with mult="all", the default, all rows are matched
> which is not really matching in the sense of match(). The current
> naming confuses matching with joining and its really the latter that
> is being done.
>
> Regarding the existence of merge the advantage of [ is that it will
> automatically only take the columns needed so merge is not really
> equivalent to [ in all respects. Furthermore having to use different
> constructs for different types of merge seems awkward.
>
>
> On Fri, May 3, 2013 at 11:27 AM, Eduard Antonyan
> <eduard.antonyan at gmail.com> wrote:
>
> Btw the way I think about the "nomatch" name is as follows - normally X[Y]
> tries to match rows of Y with rows of X, and then "nomatch" tells it what to
> do when there is *no match*.
>
>
> On Fri, May 3, 2013 at 10:23 AM, Eduard Antonyan <eduard.antonyan at gmail.com>
> wrote:
>
>
> To clarify - that behavior is already implemented in merge (more
> specifically merge.data.table). I don't really have a view on having it in
> X[Y] as well - I don't like all.x and all.y as the names, since there are no
> params named 'x' and 'y' in [.data.table (as opposed to merge), but some
> param that would do a full outer join could certainly be added.
>
>
> On Fri, May 3, 2013 at 10:09 AM, Gabor Grothendieck
> <ggrothendieck at gmail.com> wrote:
>
>
> Yes, sorry. Its nomatch= which presumably derives from the parameter
> of the same name in the match() function. If the idea of the nomatch=
> name was to leverage off existing argument names in R then I would
> prefer all.y= to be consistent with merge() in place of nomatch= since
> we are really merging/joining rather than just matching. That would
> also allow extension to all types of join by adding all.an x= argument
> too.
>
> On Fri, May 3, 2013 at 10:59 AM, Eduard Antonyan
> <eduard.antonyan at gmail.com> wrote:
>
> I would prefer nomatch=0 as a default though, simply because that's
> what I
> do most of the time :)
>
>
> On Fri, May 3, 2013 at 9:57 AM, Eduard Antonyan
> <eduard.antonyan at gmail.com>
> wrote:
>
>
> A correction - the param is called "nomatch", not "match".
>
> This use case seems like smth a user shouldn't really do - in an ideal
> world you should have them both keyed by the same-name column.
>
> As is, my view on it is that data.table is correcting the user mistake
> of
> naming the column in Y - y, instead of x, and so the output makes
> sense and
> I don't see the need of complicating the behavior by adding more cases
> one
> has to go through to figure out what the output columns would be.
> Similar to
> asking for X[J(c("b", "c", "d"))] - you wouldn't want an anonymous
> column
> there, would you?
>
>
>
> On Fri, May 3, 2013 at 6:18 AM, Gabor Grothendieck
> <ggrothendieck at gmail.com> wrote:
>
>
> I am moving this discussion which started with mdowle to the list.
>
> Consider this example slightly modified from the data.table FAQ:
>
> X = data.table(x=c("a","a","b","b","b","c","c"), foo=1:7, key="x")
> Y = data.table(y=c("b","c","d"), bar=c(4,2,3))
> out <- X[Y]; out
>
> x foo bar
> 1: b 3 4
> 2: b 4 4
> 3: b 5 4
> 4: c 6 2
> 5: c 7 2
> 6: d NA 3
>
> Note that the first column of the output is labelled x even though
> the
> data to produce it comes from y, e.g. "d" in out$x is not in X$x but
> does appear in Y$y so clearly the data is coming from y as opposed to
> x . In terms of SQL the above would be written:
>
> select Y.y as x, ...
>
> and the need to renamne the first column of out suggests that there
> may be a deeper problem here.
>
> Here are some ideas to address this (they would require changes to
> data.table):
>
> - the default of X[Y,, match=NA] would be changed to a default of
> X[Y,,match=0] so that it corresponds to the defaults in R's merge and
> in SQL joins.
>
> - the column name of the first column in the example above would be
> changed to y if match=0 but be left at x if match=NA. In the case
> that match=0 (the proposed new default) x and y are equal so the
> first
> column can be validly labelled as x but in the case that match=NA
> they
> are not so y would be used as the column name.
>
> - the name match= does seem a bit misleading since R's match only
> matches one item in the target whereas in data.table match matches
> many if mult="all" and that is the default. Perhaps some thought
> should be given to a name change here?
>
> The above would seem to correspond more closely to R's merge and SQL
> join defaults. Any use cases or other comments?
>
> --
> Statistics & Software Consulting
> GKX Group, GKX Associates Inc.
> tel: 1-877-GKX-GROUP
> email: ggrothendieck at gmail.com
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
>
>
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
>
>
>
> --
> Statistics & Software Consulting
> GKX Group, GKX Associates Inc.
> tel: 1-877-GKX-GROUP
> email: ggrothendieck at gmail.com
>
>
>
>
> --
> Statistics & Software Consulting
> GKX Group, GKX Associates Inc.
> tel: 1-877-GKX-GROUP
> email: ggrothendieck at gmail.com
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
>
>
>
> --
> Statistics & Software Consulting
> GKX Group, GKX Associates Inc.
> tel: 1-877-GKX-GROUP
> email: ggrothendieck at gmail.com
>
>
>
>
> --
> Statistics & Software Consulting
> GKX Group, GKX Associates Inc.
> tel: 1-877-GKX-GROUP
> email: ggrothendieck at gmail.com
>
>



-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com


More information about the datatable-help mailing list