[datatable-help] merge/join/match

Gabor Grothendieck ggrothendieck at gmail.com
Fri May 3 17:09:19 CEST 2013


Yes, sorry.  It's nomatch=, which presumably derives from the parameter
of the same name in the match() function.  If the idea of the nomatch=
name was to leverage existing argument names in R then I would
prefer all.y=, to be consistent with merge(), in place of nomatch=, since
we are really merging/joining rather than just matching. That would
also allow extension to all types of join by adding an all.x= argument
too.
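
To illustrate the merge() naming, here is a rough sketch in base R
only, using the FAQ data quoted below as plain data frames rather than
data.tables.  The variable names (inner, right, left, full) are just
for illustration, and the mapping to nomatch= in the comments is my
reading of the current behaviour, not a statement of how data.table
implements it.

```r
# Same data as the FAQ example, as plain data frames (base R only).
X <- data.frame(x = c("a", "a", "b", "b", "b", "c", "c"), foo = 1:7,
                stringsAsFactors = FALSE)
Y <- data.frame(y = c("b", "c", "d"), bar = c(4, 2, 3),
                stringsAsFactors = FALSE)

# Matched rows only (merge's default); roughly what nomatch=0 gives.
inner <- merge(X, Y, by.x = "x", by.y = "y")

# Keep every row of Y, filling foo with NA for "d"; roughly nomatch=NA.
right <- merge(X, Y, by.x = "x", by.y = "y", all.y = TRUE)

# Keep every row of X; the case an all.x= argument would cover.
left <- merge(X, Y, by.x = "x", by.y = "y", all.x = TRUE)

# Keep all rows from both sides (full outer join).
full <- merge(X, Y, by.x = "x", by.y = "y", all = TRUE)

# Row counts for the four join types
sapply(list(inner = inner, right = right, left = left, full = full), nrow)
```

With all.x= and all.y= both available, the four combinations of
TRUE/FALSE cover inner, left, right and full joins, which a single
nomatch= argument cannot express.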

On Fri, May 3, 2013 at 10:59 AM, Eduard Antonyan
<eduard.antonyan at gmail.com> wrote:
> I would prefer nomatch=0 as a default though, simply because that's what I
> do most of the time :)
>
>
> On Fri, May 3, 2013 at 9:57 AM, Eduard Antonyan <eduard.antonyan at gmail.com>
> wrote:
>>
>> A correction - the param is called "nomatch", not "match".
>>
>> This use case seems like something a user shouldn't really do - in an ideal
>> world you should have them both keyed by the same-name column.
>>
>> As is, my view on it is that data.table is correcting the user mistake of
>> naming the column in Y - y, instead of x, and so the output makes sense and
>> I don't see the need of complicating the behavior by adding more cases one
>> has to go through to figure out what the output columns would be. Similar to
>> asking for X[J(c("b", "c", "d"))] - you wouldn't want an anonymous column
>> there, would you?
>>
>>
>>
>> On Fri, May 3, 2013 at 6:18 AM, Gabor Grothendieck
>> <ggrothendieck at gmail.com> wrote:
>>>
>>> I am moving this discussion which started with mdowle to the list.
>>>
>>> Consider this example slightly modified from the data.table FAQ:
>>>
>>> > X = data.table(x=c("a","a","b","b","b","c","c"), foo=1:7, key="x")
>>> > Y = data.table(y=c("b","c","d"), bar=c(4,2,3))
>>> > out <- X[Y]; out
>>>        x foo bar
>>> 1:     b   3   4
>>> 2:     b   4   4
>>> 3:     b   5   4
>>> 4:     c   6   2
>>> 5:     c   7   2
>>> 6:     d  NA   3
>>>
>>> Note that the first column of the output is labelled x even though the
>>> data to produce it comes from y, e.g. "d" in out$x is not in X$x but
>>> does appear in Y$y so clearly the data is coming from y as opposed to
>>> x .  In terms of SQL the above would be written:
>>>
>>>     select Y.y as x, ...
>>>
>>> and the need to rename the first column of out suggests that there
>>> may be a deeper problem here.
>>>
>>> Here are some ideas to address this (they would require changes to
>>> data.table):
>>>
>>> - the default of X[Y,, match=NA] would be changed to a default of
>>> X[Y,,match=0] so that it corresponds to the defaults in R's merge and
>>> in SQL joins.
>>>
>>> - the column name of the first column in the example above would be
>>> changed to y if match=0 but be left at x if match=NA.  In the case
>>> that match=0 (the proposed new default) x and y are equal so the first
>>> column can be validly labelled as x but in the case that match=NA they
>>> are not so y would be used as the column name.
>>>
>>> - the name match= does seem a bit misleading since R's match only
>>> matches one item in the target whereas in data.table match matches
>>> many if mult="all" and that is the default.  Perhaps some thought
>>> should be given to a name change here?
>>>
>>> The above would seem to correspond more closely to R's merge and SQL
>>> join defaults.  Any use cases or other comments?
>>>
>>> --
>>> Statistics & Software Consulting
>>> GKX Group, GKX Associates Inc.
>>> tel: 1-877-GKX-GROUP
>>> email: ggrothendieck at gmail.com
>>> _______________________________________________
>>> datatable-help mailing list
>>> datatable-help at lists.r-forge.r-project.org
>>>
>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>
>>
>


