[datatable-help] merge/join/match

Eduard Antonyan eduard.antonyan at gmail.com
Fri May 3 17:41:10 CEST 2013


Yeah, leveraging not using all columns in a full merge is a good idea - I
agree that feature would be nice to have.

I'm not using the match() function tbh, so current naming doesn't bug me,
but maybe others who do use match() can weigh in if that bothers them as
well.


On Fri, May 3, 2013 at 10:36 AM, Gabor Grothendieck <ggrothendieck at gmail.com
> wrote:

> Yes, except that is not really what happens since match() only matches
> one row whereas with mult="all", the default, all rows are matched
> which is not really matching in the sense of match().  The current
> naming confuses matching with joining and its really the latter that
> is being done.
>
> Regarding the existence of merge the advantage of [ is that it will
> automatically only take the columns needed so merge is not really
> equivalent to [ in all respects.  Furthermore having to use different
> constructs for different types of merge seems awkward.
>
>
> On Fri, May 3, 2013 at 11:27 AM, Eduard Antonyan
> <eduard.antonyan at gmail.com> wrote:
> > Btw the way I think about the "nomatch" name is as follows - normally
> X[Y]
> > tries to match rows of Y with rows of X, and then "nomatch" tells it
> what to
> > do when there is *no match*.
> >
> >
> > On Fri, May 3, 2013 at 10:23 AM, Eduard Antonyan <
> eduard.antonyan at gmail.com>
> > wrote:
> >>
> >> To clarify - that behavior is already implemented in merge (more
> >> specifically merge.data.table). I don't really have a view on having it
> in
> >> X[Y] as well - I don't like all.x and all.y as the names, since there
> are no
> >> params named 'x' and 'y' in [.data.table (as opposed to merge), but some
> >> param that would do a full outer join could certainly be added.
> >>
> >>
> >> On Fri, May 3, 2013 at 10:09 AM, Gabor Grothendieck
> >> <ggrothendieck at gmail.com> wrote:
> >>>
> >>> Yes, sorry.  Its nomatch= which presumably derives from the parameter
> >>> of the same name in the match() function.  If the idea of the nomatch=
> >>> name was to leverage off existing argument names in R then I would
> >>> prefer all.y= to be consistent with merge() in place of nomatch= since
> >>> we are really merging/joining rather than just matching. That would
> >>> also allow extension to all types of join by adding all.an x= argument
> >>> too.
> >>>
> >>> On Fri, May 3, 2013 at 10:59 AM, Eduard Antonyan
> >>> <eduard.antonyan at gmail.com> wrote:
> >>> > I would prefer nomatch=0 as a default though, simply because that's
> >>> > what I
> >>> > do most of the time :)
> >>> >
> >>> >
> >>> > On Fri, May 3, 2013 at 9:57 AM, Eduard Antonyan
> >>> > <eduard.antonyan at gmail.com>
> >>> > wrote:
> >>> >>
> >>> >> A correction - the param is called "nomatch", not "match".
> >>> >>
> >>> >> This use case seems like smth a user shouldn't really do - in an
> ideal
> >>> >> world you should have them both keyed by the same-name column.
> >>> >>
> >>> >> As is, my view on it is that data.table is correcting the user
> mistake
> >>> >> of
> >>> >> naming the column in Y - y, instead of x, and so the output makes
> >>> >> sense and
> >>> >> I don't see the need of complicating the behavior by adding more
> cases
> >>> >> one
> >>> >> has to go through to figure out what the output columns would be.
> >>> >> Similar to
> >>> >> asking for X[J(c("b", "c", "d"))] - you wouldn't want an anonymous
> >>> >> column
> >>> >> there, would you?
> >>> >>
> >>> >>
> >>> >>
> >>> >> On Fri, May 3, 2013 at 6:18 AM, Gabor Grothendieck
> >>> >> <ggrothendieck at gmail.com> wrote:
> >>> >>>
> >>> >>> I am moving this discussion which started with mdowle to the list.
> >>> >>>
> >>> >>> Consider this example slightly modified from the data.table FAQ:
> >>> >>>
> >>> >>> > X = data.table(x=c("a","a","b","b","b","c","c"), foo=1:7,
> key="x")
> >>> >>> > Y = data.table(y=c("b","c","d"), bar=c(4,2,3))
> >>> >>> > out <- X[Y]; out
> >>> >>>        x foo bar
> >>> >>> 1:     b   3   4
> >>> >>> 2:     b   4   4
> >>> >>> 3:     b   5   4
> >>> >>> 4:     c   6   2
> >>> >>> 5:     c   7   2
> >>> >>> 6:     d  NA   3
> >>> >>>
> >>> >>> Note that the first column of the output is labelled x even though
> >>> >>> the
> >>> >>> data to produce it comes from y, e.g. "d" in out$x is not in X$x
> but
> >>> >>> does appear in Y$y so clearly the data is coming from y as opposed
> to
> >>> >>> x .  In terms of SQL the above would be written:
> >>> >>>
> >>> >>>     select Y.y as x, ...
> >>> >>>
> >>> >>> and the need to renamne the first column of out suggests that there
> >>> >>> may be a deeper problem here.
> >>> >>>
> >>> >>> Here are some ideas to address this (they would require changes to
> >>> >>> data.table):
> >>> >>>
> >>> >>> - the default of X[Y,, match=NA] would be changed to a default of
> >>> >>> X[Y,,match=0] so that it corresponds to the defaults in R's merge
> and
> >>> >>> in SQL joins.
> >>> >>>
> >>> >>> - the column name of the first column in the example above would be
> >>> >>> changed to y if match=0 but be left at x if match=NA.  In the case
> >>> >>> that match=0 (the proposed new default) x and y are equal so the
> >>> >>> first
> >>> >>> column can be validly labelled as x but in the case that match=NA
> >>> >>> they
> >>> >>> are not so y would be used as the column name.
> >>> >>>
> >>> >>> - the name match= does seem a bit misleading since R's match only
> >>> >>> matches one item in the target whereas in data.table match matches
> >>> >>> many if mult="all" and that is the default.  Perhaps some thought
> >>> >>> should be given to a name change here?
> >>> >>>
> >>> >>> The above would seem to correspond more closely to R's merge and
> SQL
> >>> >>> join defaults.  Any use cases or other comments?
> >>> >>>
> >>> >>> --
> >>> >>> Statistics & Software Consulting
> >>> >>> GKX Group, GKX Associates Inc.
> >>> >>> tel: 1-877-GKX-GROUP
> >>> >>> email: ggrothendieck at gmail.com
> >>> >>> _______________________________________________
> >>> >>> datatable-help mailing list
> >>> >>> datatable-help at lists.r-forge.r-project.org
> >>> >>>
> >>> >>>
> >>> >>>
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> >>> >>
> >>> >>
> >>> >
> >>>
> >>>
> >>>
> >>> --
> >>> Statistics & Software Consulting
> >>> GKX Group, GKX Associates Inc.
> >>> tel: 1-877-GKX-GROUP
> >>> email: ggrothendieck at gmail.com
> >>
> >>
> >
>
>
>
> --
> Statistics & Software Consulting
> GKX Group, GKX Associates Inc.
> tel: 1-877-GKX-GROUP
> email: ggrothendieck at gmail.com
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130503/2716c6ba/attachment-0001.html>


More information about the datatable-help mailing list