[datatable-help] merge/join/match

Eduard Antonyan eduard.antonyan at gmail.com
Fri May 3 18:56:07 CEST 2013


yeah, I disagree with this view. I don't think [] should pursue
compatibility with merge.


On Fri, May 3, 2013 at 11:54 AM, Gabor Grothendieck <ggrothendieck at gmail.com
> wrote:

> I think that from the viewpoint of compatibility and convenience it
> would be best to implement all.x and all.y and not rely on swapping X
> and Y.  SQLite did something like this (they implemented left join but
> not right join based on the idea that all you have to do is swap join
> arguments) but the problem with it is that it adds a layer of mental
> specification effort if the actual problem is better stated in the
> unsupported orientation.
>
> On Fri, May 3, 2013 at 12:49 PM, Eduard Antonyan
> <eduard.antonyan at gmail.com> wrote:
> > Arun, it only needs the addition of smth like X[Y, keep.all = TRUE], all
> of
> > the other merge options already exist as either X[Y] or Y[X] with or
> without
> > nomatch = 0/NA.
> >
> >
> > On Fri, May 3, 2013 at 11:45 AM, Arunkumar Srinivasan
> > <aragorn168b at gmail.com> wrote:
> >>
> >> Gabor,
> >>
> >> Very true. I suppose your request is that the x[i] where `i` is a
> >> data.table should have the same set of options like R's base `merge`
> >> function, like, by.y=TRUE, by.x=TRUE or all=TRUE. I like the idea by
> itself.
> >> However, I am not able to think of a way to do this. I mean, I find the
> >> syntax X[Y, by.x=TRUE] weird / not making sense. That is, to me even
> though
> >>
> >> X[Y] is equal to Y[X, by.y=TRUE] (or) X[Y, by.x=TRUE] (ignoring the
> >> reordered columns) the latter 2 don't seem to make sense/is redundant
> (maybe
> >> it's because I am used to this syntax).
> >>
> >> Arun
> >>
> >> On Friday, May 3, 2013 at 5:57 PM, Gabor Grothendieck wrote:
> >>
> >> In my last post it should have read:
> >>
> >> That X[Y] is not the same as Y[X] is analogous to the fact that
> >> merge(X, Y, all.y=TRUE) is not the same as merge(Y, X, all.y=TRUE)
> >>
> >> On Fri, May 3, 2013 at 11:55 AM, Gabor Grothendieck
> >> <ggrothendieck at gmail.com> wrote:
> >>
> >> Assuming same-named keys, then these are all the same except possibly
> >> for row and column order:
> >>
> >> X[Y,,nomatch=0]
> >> Y[X,,nomatch=0]
> >> merge(X, Y)
> >> merge(Y, X)
> >>
> >> That X[Y] is not the same as Y[X] is analogous to the fact that
> >> merge(X, Y, all.x=TRUE) is not the same as merge(Y, X, all.x=TRUE)
> >>
> >> On Fri, May 3, 2013 at 11:46 AM, Arunkumar Srinivasan
> >> <aragorn168b at gmail.com> wrote:
> >>
> >> Gabor,
> >>
> >> X[Y] and Y[X] are not necessarily the same operations (meaning, they
> don't
> >> *have* to give the same output). However, merge(X,Y) and merge(Y,X)
> *have*
> >> to provide the same output (except for the column order and names). In
> >> that
> >> sense, a join is a bit different from a merge, no?
> >>
> >> Arun
> >>
> >> On Friday, May 3, 2013 at 5:36 PM, Gabor Grothendieck wrote:
> >>
> >> Yes, except that is not really what happens since match() only matches
> >> one row whereas with mult="all", the default, all rows are matched
> >> which is not really matching in the sense of match(). The current
> >> naming confuses matching with joining and its really the latter that
> >> is being done.
> >>
> >> Regarding the existence of merge the advantage of [ is that it will
> >> automatically only take the columns needed so merge is not really
> >> equivalent to [ in all respects. Furthermore having to use different
> >> constructs for different types of merge seems awkward.
> >>
> >>
> >> On Fri, May 3, 2013 at 11:27 AM, Eduard Antonyan
> >> <eduard.antonyan at gmail.com> wrote:
> >>
> >> Btw the way I think about the "nomatch" name is as follows - normally
> X[Y]
> >> tries to match rows of Y with rows of X, and then "nomatch" tells it
> what
> >> to
> >> do when there is *no match*.
> >>
> >>
> >> On Fri, May 3, 2013 at 10:23 AM, Eduard Antonyan
> >> <eduard.antonyan at gmail.com>
> >> wrote:
> >>
> >>
> >> To clarify - that behavior is already implemented in merge (more
> >> specifically merge.data.table). I don't really have a view on having it
> in
> >> X[Y] as well - I don't like all.x and all.y as the names, since there
> are
> >> no
> >> params named 'x' and 'y' in [.data.table (as opposed to merge), but some
> >> param that would do a full outer join could certainly be added.
> >>
> >>
> >> On Fri, May 3, 2013 at 10:09 AM, Gabor Grothendieck
> >> <ggrothendieck at gmail.com> wrote:
> >>
> >>
> >> Yes, sorry. Its nomatch= which presumably derives from the parameter
> >> of the same name in the match() function. If the idea of the nomatch=
> >> name was to leverage off existing argument names in R then I would
> >> prefer all.y= to be consistent with merge() in place of nomatch= since
> >> we are really merging/joining rather than just matching. That would
> >> also allow extension to all types of join by adding all.an x= argument
> >> too.
> >>
> >> On Fri, May 3, 2013 at 10:59 AM, Eduard Antonyan
> >> <eduard.antonyan at gmail.com> wrote:
> >>
> >> I would prefer nomatch=0 as a default though, simply because that's
> >> what I
> >> do most of the time :)
> >>
> >>
> >> On Fri, May 3, 2013 at 9:57 AM, Eduard Antonyan
> >> <eduard.antonyan at gmail.com>
> >> wrote:
> >>
> >>
> >> A correction - the param is called "nomatch", not "match".
> >>
> >> This use case seems like smth a user shouldn't really do - in an ideal
> >> world you should have them both keyed by the same-name column.
> >>
> >> As is, my view on it is that data.table is correcting the user mistake
> >> of
> >> naming the column in Y - y, instead of x, and so the output makes
> >> sense and
> >> I don't see the need of complicating the behavior by adding more cases
> >> one
> >> has to go through to figure out what the output columns would be.
> >> Similar to
> >> asking for X[J(c("b", "c", "d"))] - you wouldn't want an anonymous
> >> column
> >> there, would you?
> >>
> >>
> >>
> >> On Fri, May 3, 2013 at 6:18 AM, Gabor Grothendieck
> >> <ggrothendieck at gmail.com> wrote:
> >>
> >>
> >> I am moving this discussion which started with mdowle to the list.
> >>
> >> Consider this example slightly modified from the data.table FAQ:
> >>
> >> X = data.table(x=c("a","a","b","b","b","c","c"), foo=1:7, key="x")
> >> Y = data.table(y=c("b","c","d"), bar=c(4,2,3))
> >> out <- X[Y]; out
> >>
> >> x foo bar
> >> 1: b 3 4
> >> 2: b 4 4
> >> 3: b 5 4
> >> 4: c 6 2
> >> 5: c 7 2
> >> 6: d NA 3
> >>
> >> Note that the first column of the output is labelled x even though
> >> the
> >> data to produce it comes from y, e.g. "d" in out$x is not in X$x but
> >> does appear in Y$y so clearly the data is coming from y as opposed to
> >> x . In terms of SQL the above would be written:
> >>
> >> select Y.y as x, ...
> >>
> >> and the need to renamne the first column of out suggests that there
> >> may be a deeper problem here.
> >>
> >> Here are some ideas to address this (they would require changes to
> >> data.table):
> >>
> >> - the default of X[Y,, match=NA] would be changed to a default of
> >> X[Y,,match=0] so that it corresponds to the defaults in R's merge and
> >> in SQL joins.
> >>
> >> - the column name of the first column in the example above would be
> >> changed to y if match=0 but be left at x if match=NA. In the case
> >> that match=0 (the proposed new default) x and y are equal so the
> >> first
> >> column can be validly labelled as x but in the case that match=NA
> >> they
> >> are not so y would be used as the column name.
> >>
> >> - the name match= does seem a bit misleading since R's match only
> >> matches one item in the target whereas in data.table match matches
> >> many if mult="all" and that is the default. Perhaps some thought
> >> should be given to a name change here?
> >>
> >> The above would seem to correspond more closely to R's merge and SQL
> >> join defaults. Any use cases or other comments?
> >>
> >> --
> >> Statistics & Software Consulting
> >> GKX Group, GKX Associates Inc.
> >> tel: 1-877-GKX-GROUP
> >> email: ggrothendieck at gmail.com
> >> _______________________________________________
> >> datatable-help mailing list
> >> datatable-help at lists.r-forge.r-project.org
> >>
> >>
> >>
> >>
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> >>
> >>
> >>
> >>
> >> --
> >> Statistics & Software Consulting
> >> GKX Group, GKX Associates Inc.
> >> tel: 1-877-GKX-GROUP
> >> email: ggrothendieck at gmail.com
> >>
> >>
> >>
> >>
> >> --
> >> Statistics & Software Consulting
> >> GKX Group, GKX Associates Inc.
> >> tel: 1-877-GKX-GROUP
> >> email: ggrothendieck at gmail.com
> >> _______________________________________________
> >> datatable-help mailing list
> >> datatable-help at lists.r-forge.r-project.org
> >>
> >>
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> >>
> >>
> >>
> >>
> >> --
> >> Statistics & Software Consulting
> >> GKX Group, GKX Associates Inc.
> >> tel: 1-877-GKX-GROUP
> >> email: ggrothendieck at gmail.com
> >>
> >>
> >>
> >>
> >> --
> >> Statistics & Software Consulting
> >> GKX Group, GKX Associates Inc.
> >> tel: 1-877-GKX-GROUP
> >> email: ggrothendieck at gmail.com
> >>
> >>
> >
>
>
>
> --
> Statistics & Software Consulting
> GKX Group, GKX Associates Inc.
> tel: 1-877-GKX-GROUP
> email: ggrothendieck at gmail.com
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130503/7a3179ec/attachment-0001.html>


More information about the datatable-help mailing list