[datatable-help] merge/join/match

Gabor Grothendieck ggrothendieck at gmail.com
Sat May 4 11:47:40 CEST 2013


Where it says any.y= it should have read all.y=.

On Sat, May 4, 2013 at 5:46 AM, Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:
> One further comment on nomatch=0 weirdness.  It seems that the value
> of nomatch= is the row index of the row of X to return if a row in Y
> matches no row in X here: X[Y,,nomatch=?]   In ordinary R indexing
> using an index value of 0 means drop the corresponding component and
> NA means return an NA.  nomatch=1 would presumably return the first
> row of X for non-matching rows of Y but, in fact, nomatch= seems to be
> restricted to 0 and NA as any other value generates an error message
> to this effect. Likely it was decided that values other than 0 and NA
> would be too bizarre and most likely represent user error.   If any.y=
> were used then it would naturally be logical and this artificial
> distinction (i.e .between 0/NA on one hand and everything else on the
> other hand) would not have to be made.
>
> On Fri, May 3, 2013 at 6:41 PM, Gabor Grothendieck
> <ggrothendieck at gmail.com> wrote:
>> In thinking about this a bit more I can see the argument for leaving
>> the default at nomatch=NA. Consider these examples of indexing:
>>
>>> letters[27]
>> [1] NA
>>> BOD[7,]
>>    Time demand
>> NA   NA     NA
>>
>> nomatch=NA seems more compatible with these examples than nomatch=0.
>>
>> (At the same time this does not mean we could not also change the
>> argument name from nomatch= to all.y= and add the other merge
>> arguments (all.x=, by.x=, by.y=, by=) as well since it remains the
>> case that R's merge() seems closer than R's match() to this
>> functionality regardless of the default.)
>>
>>
>> On Fri, May 3, 2013 at 4:42 PM, Gabor Grothendieck
>> <ggrothendieck at gmail.com> wrote:
>>> One can view data.table's generalization of indexing as the
>>> realization that all indexing can conceptually be viewed as merging
>>> where indexing with numeric values corresponds to merging with the
>>> data.table's row numbers and indexing with logical values, L, is
>>> equivalent to merging with which(L) so there are really not two types:
>>> indexing and merging but just one type: merging that covers them all.
>>>
>>>
>>> On Fri, May 3, 2013 at 1:01 PM, Arunkumar Srinivasan
>>> <aragorn168b at gmail.com> wrote:
>>>> I am wondering if performing X[Y] as a "merge" in correspondence with R's
>>>> base "merge", if the basic idea of "i" becomes confusing. That is, when "i"
>>>> is not a data.table in X[i] it indexes by rows. When `i` is a data.table,
>>>> instead of the current definition which is in par with the subletting
>>>> operation that use `i` (here data.table) as an index to subset X and then
>>>> JOIN both X and Y, we say, here X and Y are data.tables and we perform a
>>>> merge. I think this becomes confusing regarding the purpose of `i`.
>>>>
>>>> Remember that the main purpose of having the X[Y] is to have the flexibility
>>>> of using `j` to to filter/subset only the desired columns. So, for example
>>>> if you want to get 1 column of Y out of 100 columns when joining, you do:
>>>> X[Y, list(cols_of_x, one_col_of_y)] and here, it doesn't go with the
>>>> traditional definition of merge.
>>>>
>>>> As much as I like the idea of having consistent syntax, I also love the
>>>> feature of X[Y, j]. So I'm confused as to how to deal with this.
>>>>
>>>> Arun
>>>>
>>>> On Friday, May 3, 2013 at 6:54 PM, Gabor Grothendieck wrote:
>>>>
>>>> I think that from the viewpoint of compatibility and convenience it
>>>> would be best to implement all.x and all.y and not rely on swapping X
>>>> and Y. SQLite did something like this (they implemented left join but
>>>> not right join based on the idea that all you have to do is swap join
>>>> arguments) but the problem with it is that it adds a layer of mental
>>>> specification effort if the actual problem is better stated in the
>>>> unsupported orientation.
>>>>
>>>> On Fri, May 3, 2013 at 12:49 PM, Eduard Antonyan
>>>> <eduard.antonyan at gmail.com> wrote:
>>>>
>>>> Arun, it only needs the addition of smth like X[Y, keep.all = TRUE], all of
>>>> the other merge options already exist as either X[Y] or Y[X] with or without
>>>> nomatch = 0/NA.
>>>>
>>>>
>>>> On Fri, May 3, 2013 at 11:45 AM, Arunkumar Srinivasan
>>>> <aragorn168b at gmail.com> wrote:
>>>>
>>>>
>>>> Gabor,
>>>>
>>>> Very true. I suppose your request is that the x[i] where `i` is a
>>>> data.table should have the same set of options like R's base `merge`
>>>> function, like, by.y=TRUE, by.x=TRUE or all=TRUE. I like the idea by itself.
>>>> However, I am not able to think of a way to do this. I mean, I find the
>>>> syntax X[Y, by.x=TRUE] weird / not making sense. That is, to me even though
>>>>
>>>> X[Y] is equal to Y[X, by.y=TRUE] (or) X[Y, by.x=TRUE] (ignoring the
>>>> reordered columns) the latter 2 don't seem to make sense/is redundant (maybe
>>>> it's because I am used to this syntax).
>>>>
>>>> Arun
>>>>
>>>> On Friday, May 3, 2013 at 5:57 PM, Gabor Grothendieck wrote:
>>>>
>>>> In my last post it should have read:
>>>>
>>>> That X[Y] is not the same as Y[X] is analogous to the fact that
>>>> merge(X, Y, all.y=TRUE) is not the same as merge(Y, X, all.y=TRUE)
>>>>
>>>> On Fri, May 3, 2013 at 11:55 AM, Gabor Grothendieck
>>>> <ggrothendieck at gmail.com> wrote:
>>>>
>>>> Assuming same-named keys, then these are all the same except possibly
>>>> for row and column order:
>>>>
>>>> X[Y,,nomatch=0]
>>>> Y[X,,nomatch=0]
>>>> merge(X, Y)
>>>> merge(Y, X)
>>>>
>>>> That X[Y] is not the same as Y[X] is analogous to the fact that
>>>> merge(X, Y, all.x=TRUE) is not the same as merge(Y, X, all.x=TRUE)
>>>>
>>>> On Fri, May 3, 2013 at 11:46 AM, Arunkumar Srinivasan
>>>> <aragorn168b at gmail.com> wrote:
>>>>
>>>> Gabor,
>>>>
>>>> X[Y] and Y[X] are not necessarily the same operations (meaning, they don't
>>>> *have* to give the same output). However, merge(X,Y) and merge(Y,X) *have*
>>>> to provide the same output (except for the column order and names). In
>>>> that
>>>> sense, a join is a bit different from a merge, no?
>>>>
>>>> Arun
>>>>
>>>> On Friday, May 3, 2013 at 5:36 PM, Gabor Grothendieck wrote:
>>>>
>>>> Yes, except that is not really what happens since match() only matches
>>>> one row whereas with mult="all", the default, all rows are matched
>>>> which is not really matching in the sense of match(). The current
>>>> naming confuses matching with joining and its really the latter that
>>>> is being done.
>>>>
>>>> Regarding the existence of merge the advantage of [ is that it will
>>>> automatically only take the columns needed so merge is not really
>>>> equivalent to [ in all respects. Furthermore having to use different
>>>> constructs for different types of merge seems awkward.
>>>>
>>>>
>>>> On Fri, May 3, 2013 at 11:27 AM, Eduard Antonyan
>>>> <eduard.antonyan at gmail.com> wrote:
>>>>
>>>> Btw the way I think about the "nomatch" name is as follows - normally X[Y]
>>>> tries to match rows of Y with rows of X, and then "nomatch" tells it what
>>>> to
>>>> do when there is *no match*.
>>>>
>>>>
>>>> On Fri, May 3, 2013 at 10:23 AM, Eduard Antonyan
>>>> <eduard.antonyan at gmail.com>
>>>> wrote:
>>>>
>>>>
>>>> To clarify - that behavior is already implemented in merge (more
>>>> specifically merge.data.table). I don't really have a view on having it in
>>>> X[Y] as well - I don't like all.x and all.y as the names, since there are
>>>> no
>>>> params named 'x' and 'y' in [.data.table (as opposed to merge), but some
>>>> param that would do a full outer join could certainly be added.
>>>>
>>>>
>>>> On Fri, May 3, 2013 at 10:09 AM, Gabor Grothendieck
>>>> <ggrothendieck at gmail.com> wrote:
>>>>
>>>>
>>>> Yes, sorry. Its nomatch= which presumably derives from the parameter
>>>> of the same name in the match() function. If the idea of the nomatch=
>>>> name was to leverage off existing argument names in R then I would
>>>> prefer all.y= to be consistent with merge() in place of nomatch= since
>>>> we are really merging/joining rather than just matching. That would
>>>> also allow extension to all types of join by adding all.an x= argument
>>>> too.
>>>>
>>>> On Fri, May 3, 2013 at 10:59 AM, Eduard Antonyan
>>>> <eduard.antonyan at gmail.com> wrote:
>>>>
>>>> I would prefer nomatch=0 as a default though, simply because that's
>>>> what I
>>>> do most of the time :)
>>>>
>>>>
>>>> On Fri, May 3, 2013 at 9:57 AM, Eduard Antonyan
>>>> <eduard.antonyan at gmail.com>
>>>> wrote:
>>>>
>>>>
>>>> A correction - the param is called "nomatch", not "match".
>>>>
>>>> This use case seems like smth a user shouldn't really do - in an ideal
>>>> world you should have them both keyed by the same-name column.
>>>>
>>>> As is, my view on it is that data.table is correcting the user mistake
>>>> of
>>>> naming the column in Y - y, instead of x, and so the output makes
>>>> sense and
>>>> I don't see the need of complicating the behavior by adding more cases
>>>> one
>>>> has to go through to figure out what the output columns would be.
>>>> Similar to
>>>> asking for X[J(c("b", "c", "d"))] - you wouldn't want an anonymous
>>>> column
>>>> there, would you?
>>>>
>>>>
>>>>
>>>> On Fri, May 3, 2013 at 6:18 AM, Gabor Grothendieck
>>>> <ggrothendieck at gmail.com> wrote:
>>>>
>>>>
>>>> I am moving this discussion which started with mdowle to the list.
>>>>
>>>> Consider this example slightly modified from the data.table FAQ:
>>>>
>>>> X = data.table(x=c("a","a","b","b","b","c","c"), foo=1:7, key="x")
>>>> Y = data.table(y=c("b","c","d"), bar=c(4,2,3))
>>>> out <- X[Y]; out
>>>>
>>>> x foo bar
>>>> 1: b 3 4
>>>> 2: b 4 4
>>>> 3: b 5 4
>>>> 4: c 6 2
>>>> 5: c 7 2
>>>> 6: d NA 3
>>>>
>>>> Note that the first column of the output is labelled x even though
>>>> the
>>>> data to produce it comes from y, e.g. "d" in out$x is not in X$x but
>>>> does appear in Y$y so clearly the data is coming from y as opposed to
>>>> x . In terms of SQL the above would be written:
>>>>
>>>> select Y.y as x, ...
>>>>
>>>> and the need to renamne the first column of out suggests that there
>>>> may be a deeper problem here.
>>>>
>>>> Here are some ideas to address this (they would require changes to
>>>> data.table):
>>>>
>>>> - the default of X[Y,, match=NA] would be changed to a default of
>>>> X[Y,,match=0] so that it corresponds to the defaults in R's merge and
>>>> in SQL joins.
>>>>
>>>> - the column name of the first column in the example above would be
>>>> changed to y if match=0 but be left at x if match=NA. In the case
>>>> that match=0 (the proposed new default) x and y are equal so the
>>>> first
>>>> column can be validly labelled as x but in the case that match=NA
>>>> they
>>>> are not so y would be used as the column name.
>>>>
>>>> - the name match= does seem a bit misleading since R's match only
>>>> matches one item in the target whereas in data.table match matches
>>>> many if mult="all" and that is the default. Perhaps some thought
>>>> should be given to a name change here?
>>>>
>>>> The above would seem to correspond more closely to R's merge and SQL
>>>> join defaults. Any use cases or other comments?
>>>>
>>>> --
>>>> Statistics & Software Consulting
>>>> GKX Group, GKX Associates Inc.
>>>> tel: 1-877-GKX-GROUP
>>>> email: ggrothendieck at gmail.com
>>>> _______________________________________________
>>>> datatable-help mailing list
>>>> datatable-help at lists.r-forge.r-project.org
>>>>
>>>>
>>>>
>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Statistics & Software Consulting
>>>> GKX Group, GKX Associates Inc.
>>>> tel: 1-877-GKX-GROUP
>>>> email: ggrothendieck at gmail.com
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Statistics & Software Consulting
>>>> GKX Group, GKX Associates Inc.
>>>> tel: 1-877-GKX-GROUP
>>>> email: ggrothendieck at gmail.com
>>>> _______________________________________________
>>>> datatable-help mailing list
>>>> datatable-help at lists.r-forge.r-project.org
>>>>
>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Statistics & Software Consulting
>>>> GKX Group, GKX Associates Inc.
>>>> tel: 1-877-GKX-GROUP
>>>> email: ggrothendieck at gmail.com
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Statistics & Software Consulting
>>>> GKX Group, GKX Associates Inc.
>>>> tel: 1-877-GKX-GROUP
>>>> email: ggrothendieck at gmail.com
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Statistics & Software Consulting
>>>> GKX Group, GKX Associates Inc.
>>>> tel: 1-877-GKX-GROUP
>>>> email: ggrothendieck at gmail.com
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Statistics & Software Consulting
>>> GKX Group, GKX Associates Inc.
>>> tel: 1-877-GKX-GROUP
>>> email: ggrothendieck at gmail.com
>>
>>
>>
>> --
>> Statistics & Software Consulting
>> GKX Group, GKX Associates Inc.
>> tel: 1-877-GKX-GROUP
>> email: ggrothendieck at gmail.com
>
>
>
> --
> Statistics & Software Consulting
> GKX Group, GKX Associates Inc.
> tel: 1-877-GKX-GROUP
> email: ggrothendieck at gmail.com



-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com


More information about the datatable-help mailing list