[datatable-help] merge/join/match

Gabor Grothendieck ggrothendieck at gmail.com
Fri May 3 22:42:06 CEST 2013


One can view data.table's generalization of indexing as the
realization that all indexing can conceptually be viewed as merging
where indexing with numeric values corresponds to merging with the
data.table's row numbers and indexing with logical values, L, is
equivalent to merging with which(L) so there are really not two types:
indexing and merging but just one type: merging that covers them all.


On Fri, May 3, 2013 at 1:01 PM, Arunkumar Srinivasan
<aragorn168b at gmail.com> wrote:
> I am wondering if performing X[Y] as a "merge" in correspondence with R's
> base "merge", if the basic idea of "i" becomes confusing. That is, when "i"
> is not a data.table in X[i] it indexes by rows. When `i` is a data.table,
> instead of the current definition which is in par with the subletting
> operation that use `i` (here data.table) as an index to subset X and then
> JOIN both X and Y, we say, here X and Y are data.tables and we perform a
> merge. I think this becomes confusing regarding the purpose of `i`.
>
> Remember that the main purpose of having the X[Y] is to have the flexibility
> of using `j` to to filter/subset only the desired columns. So, for example
> if you want to get 1 column of Y out of 100 columns when joining, you do:
> X[Y, list(cols_of_x, one_col_of_y)] and here, it doesn't go with the
> traditional definition of merge.
>
> As much as I like the idea of having consistent syntax, I also love the
> feature of X[Y, j]. So I'm confused as to how to deal with this.
>
> Arun
>
> On Friday, May 3, 2013 at 6:54 PM, Gabor Grothendieck wrote:
>
> I think that from the viewpoint of compatibility and convenience it
> would be best to implement all.x and all.y and not rely on swapping X
> and Y. SQLite did something like this (they implemented left join but
> not right join based on the idea that all you have to do is swap join
> arguments) but the problem with it is that it adds a layer of mental
> specification effort if the actual problem is better stated in the
> unsupported orientation.
>
> On Fri, May 3, 2013 at 12:49 PM, Eduard Antonyan
> <eduard.antonyan at gmail.com> wrote:
>
> Arun, it only needs the addition of smth like X[Y, keep.all = TRUE], all of
> the other merge options already exist as either X[Y] or Y[X] with or without
> nomatch = 0/NA.
>
>
> On Fri, May 3, 2013 at 11:45 AM, Arunkumar Srinivasan
> <aragorn168b at gmail.com> wrote:
>
>
> Gabor,
>
> Very true. I suppose your request is that the x[i] where `i` is a
> data.table should have the same set of options like R's base `merge`
> function, like, by.y=TRUE, by.x=TRUE or all=TRUE. I like the idea by itself.
> However, I am not able to think of a way to do this. I mean, I find the
> syntax X[Y, by.x=TRUE] weird / not making sense. That is, to me even though
>
> X[Y] is equal to Y[X, by.y=TRUE] (or) X[Y, by.x=TRUE] (ignoring the
> reordered columns) the latter 2 don't seem to make sense/is redundant (maybe
> it's because I am used to this syntax).
>
> Arun
>
> On Friday, May 3, 2013 at 5:57 PM, Gabor Grothendieck wrote:
>
> In my last post it should have read:
>
> That X[Y] is not the same as Y[X] is analogous to the fact that
> merge(X, Y, all.y=TRUE) is not the same as merge(Y, X, all.y=TRUE)
>
> On Fri, May 3, 2013 at 11:55 AM, Gabor Grothendieck
> <ggrothendieck at gmail.com> wrote:
>
> Assuming same-named keys, then these are all the same except possibly
> for row and column order:
>
> X[Y,,nomatch=0]
> Y[X,,nomatch=0]
> merge(X, Y)
> merge(Y, X)
>
> That X[Y] is not the same as Y[X] is analogous to the fact that
> merge(X, Y, all.x=TRUE) is not the same as merge(Y, X, all.x=TRUE)
>
> On Fri, May 3, 2013 at 11:46 AM, Arunkumar Srinivasan
> <aragorn168b at gmail.com> wrote:
>
> Gabor,
>
> X[Y] and Y[X] are not necessarily the same operations (meaning, they don't
> *have* to give the same output). However, merge(X,Y) and merge(Y,X) *have*
> to provide the same output (except for the column order and names). In
> that
> sense, a join is a bit different from a merge, no?
>
> Arun
>
> On Friday, May 3, 2013 at 5:36 PM, Gabor Grothendieck wrote:
>
> Yes, except that is not really what happens since match() only matches
> one row whereas with mult="all", the default, all rows are matched
> which is not really matching in the sense of match(). The current
> naming confuses matching with joining and its really the latter that
> is being done.
>
> Regarding the existence of merge the advantage of [ is that it will
> automatically only take the columns needed so merge is not really
> equivalent to [ in all respects. Furthermore having to use different
> constructs for different types of merge seems awkward.
>
>
> On Fri, May 3, 2013 at 11:27 AM, Eduard Antonyan
> <eduard.antonyan at gmail.com> wrote:
>
> Btw the way I think about the "nomatch" name is as follows - normally X[Y]
> tries to match rows of Y with rows of X, and then "nomatch" tells it what
> to
> do when there is *no match*.
>
>
> On Fri, May 3, 2013 at 10:23 AM, Eduard Antonyan
> <eduard.antonyan at gmail.com>
> wrote:
>
>
> To clarify - that behavior is already implemented in merge (more
> specifically merge.data.table). I don't really have a view on having it in
> X[Y] as well - I don't like all.x and all.y as the names, since there are
> no
> params named 'x' and 'y' in [.data.table (as opposed to merge), but some
> param that would do a full outer join could certainly be added.
>
>
> On Fri, May 3, 2013 at 10:09 AM, Gabor Grothendieck
> <ggrothendieck at gmail.com> wrote:
>
>
> Yes, sorry. Its nomatch= which presumably derives from the parameter
> of the same name in the match() function. If the idea of the nomatch=
> name was to leverage off existing argument names in R then I would
> prefer all.y= to be consistent with merge() in place of nomatch= since
> we are really merging/joining rather than just matching. That would
> also allow extension to all types of join by adding all.an x= argument
> too.
>
> On Fri, May 3, 2013 at 10:59 AM, Eduard Antonyan
> <eduard.antonyan at gmail.com> wrote:
>
> I would prefer nomatch=0 as a default though, simply because that's
> what I
> do most of the time :)
>
>
> On Fri, May 3, 2013 at 9:57 AM, Eduard Antonyan
> <eduard.antonyan at gmail.com>
> wrote:
>
>
> A correction - the param is called "nomatch", not "match".
>
> This use case seems like smth a user shouldn't really do - in an ideal
> world you should have them both keyed by the same-name column.
>
> As is, my view on it is that data.table is correcting the user mistake
> of
> naming the column in Y - y, instead of x, and so the output makes
> sense and
> I don't see the need of complicating the behavior by adding more cases
> one
> has to go through to figure out what the output columns would be.
> Similar to
> asking for X[J(c("b", "c", "d"))] - you wouldn't want an anonymous
> column
> there, would you?
>
>
>
> On Fri, May 3, 2013 at 6:18 AM, Gabor Grothendieck
> <ggrothendieck at gmail.com> wrote:
>
>
> I am moving this discussion which started with mdowle to the list.
>
> Consider this example slightly modified from the data.table FAQ:
>
> X = data.table(x=c("a","a","b","b","b","c","c"), foo=1:7, key="x")
> Y = data.table(y=c("b","c","d"), bar=c(4,2,3))
> out <- X[Y]; out
>
> x foo bar
> 1: b 3 4
> 2: b 4 4
> 3: b 5 4
> 4: c 6 2
> 5: c 7 2
> 6: d NA 3
>
> Note that the first column of the output is labelled x even though
> the
> data to produce it comes from y, e.g. "d" in out$x is not in X$x but
> does appear in Y$y so clearly the data is coming from y as opposed to
> x . In terms of SQL the above would be written:
>
> select Y.y as x, ...
>
> and the need to renamne the first column of out suggests that there
> may be a deeper problem here.
>
> Here are some ideas to address this (they would require changes to
> data.table):
>
> - the default of X[Y,, match=NA] would be changed to a default of
> X[Y,,match=0] so that it corresponds to the defaults in R's merge and
> in SQL joins.
>
> - the column name of the first column in the example above would be
> changed to y if match=0 but be left at x if match=NA. In the case
> that match=0 (the proposed new default) x and y are equal so the
> first
> column can be validly labelled as x but in the case that match=NA
> they
> are not so y would be used as the column name.
>
> - the name match= does seem a bit misleading since R's match only
> matches one item in the target whereas in data.table match matches
> many if mult="all" and that is the default. Perhaps some thought
> should be given to a name change here?
>
> The above would seem to correspond more closely to R's merge and SQL
> join defaults. Any use cases or other comments?
>
> --
> Statistics & Software Consulting
> GKX Group, GKX Associates Inc.
> tel: 1-877-GKX-GROUP
> email: ggrothendieck at gmail.com
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
>
>
>
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
>
>
>
> --
> Statistics & Software Consulting
> GKX Group, GKX Associates Inc.
> tel: 1-877-GKX-GROUP
> email: ggrothendieck at gmail.com
>
>
>
>
> --
> Statistics & Software Consulting
> GKX Group, GKX Associates Inc.
> tel: 1-877-GKX-GROUP
> email: ggrothendieck at gmail.com
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
>
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
>
>
>
> --
> Statistics & Software Consulting
> GKX Group, GKX Associates Inc.
> tel: 1-877-GKX-GROUP
> email: ggrothendieck at gmail.com
>
>
>
>
> --
> Statistics & Software Consulting
> GKX Group, GKX Associates Inc.
> tel: 1-877-GKX-GROUP
> email: ggrothendieck at gmail.com
>
>
>
>
> --
> Statistics & Software Consulting
> GKX Group, GKX Associates Inc.
> tel: 1-877-GKX-GROUP
> email: ggrothendieck at gmail.com
>
>



--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com


More information about the datatable-help mailing list