[datatable-help] merge/join/match
Eduard Antonyan
eduard.antonyan at gmail.com
Fri May 3 17:23:02 CEST 2013
To clarify - that behavior is already implemented in merge (more
specifically merge.data.table). I don't really have a view on having it in
X[Y] as well - I don't like all.x and all.y as the names, since there are
no params named 'x' and 'y' in [.data.table (as opposed to merge), but some
param that would do a full outer join could certainly be added.
On Fri, May 3, 2013 at 10:09 AM, Gabor Grothendieck <ggrothendieck at gmail.com
> wrote:
> Yes, sorry. Its nomatch= which presumably derives from the parameter
> of the same name in the match() function. If the idea of the nomatch=
> name was to leverage off existing argument names in R then I would
> prefer all.y= to be consistent with merge() in place of nomatch= since
> we are really merging/joining rather than just matching. That would
> also allow extension to all types of join by adding all.an x= argument
> too.
>
> On Fri, May 3, 2013 at 10:59 AM, Eduard Antonyan
> <eduard.antonyan at gmail.com> wrote:
> > I would prefer nomatch=0 as a default though, simply because that's what
> I
> > do most of the time :)
> >
> >
> > On Fri, May 3, 2013 at 9:57 AM, Eduard Antonyan <
> eduard.antonyan at gmail.com>
> > wrote:
> >>
> >> A correction - the param is called "nomatch", not "match".
> >>
> >> This use case seems like smth a user shouldn't really do - in an ideal
> >> world you should have them both keyed by the same-name column.
> >>
> >> As is, my view on it is that data.table is correcting the user mistake
> of
> >> naming the column in Y - y, instead of x, and so the output makes sense
> and
> >> I don't see the need of complicating the behavior by adding more cases
> one
> >> has to go through to figure out what the output columns would be.
> Similar to
> >> asking for X[J(c("b", "c", "d"))] - you wouldn't want an anonymous
> column
> >> there, would you?
> >>
> >>
> >>
> >> On Fri, May 3, 2013 at 6:18 AM, Gabor Grothendieck
> >> <ggrothendieck at gmail.com> wrote:
> >>>
> >>> I am moving this discussion which started with mdowle to the list.
> >>>
> >>> Consider this example slightly modified from the data.table FAQ:
> >>>
> >>> > X = data.table(x=c("a","a","b","b","b","c","c"), foo=1:7, key="x")
> >>> > Y = data.table(y=c("b","c","d"), bar=c(4,2,3))
> >>> > out <- X[Y]; out
> >>> x foo bar
> >>> 1: b 3 4
> >>> 2: b 4 4
> >>> 3: b 5 4
> >>> 4: c 6 2
> >>> 5: c 7 2
> >>> 6: d NA 3
> >>>
> >>> Note that the first column of the output is labelled x even though the
> >>> data to produce it comes from y, e.g. "d" in out$x is not in X$x but
> >>> does appear in Y$y so clearly the data is coming from y as opposed to
> >>> x . In terms of SQL the above would be written:
> >>>
> >>> select Y.y as x, ...
> >>>
> >>> and the need to renamne the first column of out suggests that there
> >>> may be a deeper problem here.
> >>>
> >>> Here are some ideas to address this (they would require changes to
> >>> data.table):
> >>>
> >>> - the default of X[Y,, match=NA] would be changed to a default of
> >>> X[Y,,match=0] so that it corresponds to the defaults in R's merge and
> >>> in SQL joins.
> >>>
> >>> - the column name of the first column in the example above would be
> >>> changed to y if match=0 but be left at x if match=NA. In the case
> >>> that match=0 (the proposed new default) x and y are equal so the first
> >>> column can be validly labelled as x but in the case that match=NA they
> >>> are not so y would be used as the column name.
> >>>
> >>> - the name match= does seem a bit misleading since R's match only
> >>> matches one item in the target whereas in data.table match matches
> >>> many if mult="all" and that is the default. Perhaps some thought
> >>> should be given to a name change here?
> >>>
> >>> The above would seem to correspond more closely to R's merge and SQL
> >>> join defaults. Any use cases or other comments?
> >>>
> >>> --
> >>> Statistics & Software Consulting
> >>> GKX Group, GKX Associates Inc.
> >>> tel: 1-877-GKX-GROUP
> >>> email: ggrothendieck at gmail.com
> >>> _______________________________________________
> >>> datatable-help mailing list
> >>> datatable-help at lists.r-forge.r-project.org
> >>>
> >>>
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> >>
> >>
> >
>
>
>
> --
> Statistics & Software Consulting
> GKX Group, GKX Associates Inc.
> tel: 1-877-GKX-GROUP
> email: ggrothendieck at gmail.com
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130503/a77f7ad9/attachment-0001.html>
More information about the datatable-help
mailing list