[datatable-help] merge/join/match

Eduard Antonyan eduard.antonyan at gmail.com
Fri May 3 17:27:44 CEST 2013


Btw the way I think about the "nomatch" name is as follows - normally X[Y]
tries to match rows of Y with rows of X, and then "nomatch" tells it what
to do when there is *no match*.
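
A minimal sketch of that reading, reusing the X and Y tables from Gabor's
example quoted further down. It assumes the data.table package is installed;
the names `left` and `inner` are only illustrative, not anything from the
package.

```r
library(data.table)

# Tables copied from the example quoted later in this thread.
X <- data.table(x = c("a", "a", "b", "b", "b", "c", "c"), foo = 1:7, key = "x")
Y <- data.table(y = c("b", "c", "d"), bar = c(4, 2, 3))

# nomatch = NA (the default): a row of Y with no match in X is kept,
# and the columns that would have come from X are filled with NA.
left <- X[Y, nomatch = NA]

# nomatch = 0 (written here as the integer 0L): a row of Y with no
# match in X is dropped from the result.
inner <- X[Y, nomatch = 0L]

nrow(left)   # 6: includes the unmatched "d" row, with foo = NA
nrow(inner)  # 5: the "d" row is gone
```

So the argument only decides the fate of the rows of Y that found nothing in
X; the matched rows are the same either way.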


On Fri, May 3, 2013 at 10:23 AM, Eduard Antonyan
<eduard.antonyan at gmail.com> wrote:

> To clarify - that behavior is already implemented in merge (more
> specifically merge.data.table). I don't really have a view on having it in
> X[Y] as well - I don't like all.x and all.y as the names, since there are
> no params named 'x' and 'y' in [.data.table (as opposed to merge), but some
> param that would do a full outer join could certainly be added.
>
>
> On Fri, May 3, 2013 at 10:09 AM, Gabor Grothendieck
> <ggrothendieck at gmail.com> wrote:
>
>> Yes, sorry.  It's nomatch= which presumably derives from the parameter
>> of the same name in the match() function.  If the idea of the nomatch=
>> name was to leverage off existing argument names in R then I would
>> prefer all.y= to be consistent with merge() in place of nomatch= since
>> we are really merging/joining rather than just matching. That would
>> also allow extension to all types of join by adding an all.x= argument
>> too.
>>
>> On Fri, May 3, 2013 at 10:59 AM, Eduard Antonyan
>> <eduard.antonyan at gmail.com> wrote:
>> > I would prefer nomatch=0 as a default though, simply because that's
>> > what I do most of the time :)
>> >
>> >
>> > On Fri, May 3, 2013 at 9:57 AM, Eduard Antonyan
>> > <eduard.antonyan at gmail.com> wrote:
>> >>
>> >> A correction - the param is called "nomatch", not "match".
>> >>
>> >> This use case seems like something a user shouldn't really do - in an
>> >> ideal world you should have them both keyed by a column of the same name.
>> >>
>> >> As is, my view on it is that data.table is correcting the user's
>> >> mistake of naming the column in Y - y, instead of x, and so the output
>> >> makes sense and I don't see the need to complicate the behavior by
>> >> adding more cases one has to go through to figure out what the output
>> >> columns would be. Similar to asking for X[J(c("b", "c", "d"))] - you
>> >> wouldn't want an anonymous column there, would you?
>> >>
>> >>
>> >>
>> >> On Fri, May 3, 2013 at 6:18 AM, Gabor Grothendieck
>> >> <ggrothendieck at gmail.com> wrote:
>> >>>
>> >>> I am moving this discussion which started with mdowle to the list.
>> >>>
>> >>> Consider this example slightly modified from the data.table FAQ:
>> >>>
>> >>> > X = data.table(x=c("a","a","b","b","b","c","c"), foo=1:7, key="x")
>> >>> > Y = data.table(y=c("b","c","d"), bar=c(4,2,3))
>> >>> > out <- X[Y]; out
>> >>>        x foo bar
>> >>> 1:     b   3   4
>> >>> 2:     b   4   4
>> >>> 3:     b   5   4
>> >>> 4:     c   6   2
>> >>> 5:     c   7   2
>> >>> 6:     d  NA   3
>> >>>
>> >>> Note that the first column of the output is labelled x even though the
>> >>> data to produce it comes from y, e.g. "d" in out$x is not in X$x but
>> >>> does appear in Y$y so clearly the data is coming from y as opposed to
>> >>> x.  In terms of SQL the above would be written:
>> >>>
>> >>>     select Y.y as x, ...
>> >>>
>> >>> and the need to rename the first column of out suggests that there
>> >>> may be a deeper problem here.
>> >>>
>> >>> Here are some ideas to address this (they would require changes to
>> >>> data.table):
>> >>>
>> >>> - the default of X[Y,, match=NA] would be changed to a default of
>> >>> X[Y,,match=0] so that it corresponds to the defaults in R's merge and
>> >>> in SQL joins.
>> >>>
>> >>> - the column name of the first column in the example above would be
>> >>> changed to y if match=0 but be left at x if match=NA.  In the case
>> >>> that match=0 (the proposed new default) x and y are equal so the first
>> >>> column can be validly labelled as x but in the case that match=NA they
>> >>> are not so y would be used as the column name.
>> >>>
>> >>> - the name match= does seem a bit misleading since R's match only
>> >>> matches one item in the target whereas in data.table match matches
>> >>> many if mult="all" and that is the default.  Perhaps some thought
>> >>> should be given to a name change here?
>> >>>
>> >>> The above would seem to correspond more closely to R's merge and SQL
>> >>> join defaults.  Any use cases or other comments?
>> >>>
>> >>> --
>> >>> Statistics & Software Consulting
>> >>> GKX Group, GKX Associates Inc.
>> >>> tel: 1-877-GKX-GROUP
>> >>> email: ggrothendieck at gmail.com
>> >>
>> >>
>> >
>>
>>
>>
>>
>
>