[datatable-help] merge/join/match

Fri May 3 19:14:03 CEST 2013

Gabor,  

I agree partially with your post in that, since X[Y] *is* a join (/merge), it could also give a "full" join. So,

X[Y] <~~~ current usage. equivalent to merge(X, Y, by.y=TRUE)
X[Y, all=TRUE] <~~~~ equivalent to merge(X, Y, all=TRUE)

Similarly,

Y[X] <~~~ current usage. equivalent to merge(Y, X, by.x=TRUE)
Y[X, all=TRUE] <~~~ equivalent to merge(Y, X, all=TRUE)

But, X[Y, all.x=TRUE] and Y[X, all.y=TRUE] doesn't make sense to me as the operation is clear that you use Y as an index. What do you think?  

Arun

On Friday, May 3, 2013 at 7:09 PM, Arunkumar Srinivasan wrote:

> The confusion, maybe as well be very well due to the fact that X[Y] is not just a subset of X based on X and Y's key columns, rather a `join` (both X and Y's columns are "visible" and joined). But then that was by itself due to a feature request FR #746.  
> 
> Arun
> 
> 
> On Friday, May 3, 2013 at 7:03 PM, Arunkumar Srinivasan wrote:
> 
> > Where I say "main purpose", it should be "one of the main advantages of having" 
> > 
> > Arun
> > 
> > 
> > On Friday, May 3, 2013 at 7:01 PM, Arunkumar Srinivasan wrote:
> > 
> > > I am wondering if performing X[Y] as a "merge" in correspondence with R's base "merge", if the basic idea of "i" becomes confusing. That is, when "i" is not a data.table in X[i] it indexes by rows. When `i` is a data.table, instead of the current definition which is in par with the subletting operation that use `i` (here data.table) as an index to subset X and then JOIN both X and Y, we say, here X and Y are data.tables and we perform a merge. I think this becomes confusing regarding the purpose of `i`.
> > > 
> > > Remember that the main purpose of having the X[Y] is to have the flexibility of using `j` to to filter/subset only the desired columns. So, for example if you want to get 1 column of Y out of 100 columns when joining, you do: X[Y, list(cols_of_x, one_col_of_y)] and here, it doesn't go with the traditional definition of merge.
> > > 
> > > As much as I like the idea of having consistent syntax, I also love the feature of X[Y, j]. So I'm confused as to how to deal with this.
> > > 
> > > Arun
> > > 
> > > 
> > > On Friday, May 3, 2013 at 6:54 PM, Gabor Grothendieck wrote:
> > > 
> > > > I think that from the viewpoint of compatibility and convenience it
> > > > would be best to implement all.x and all.y and not rely on swapping X
> > > > and Y. SQLite did something like this (they implemented left join but
> > > > not right join based on the idea that all you have to do is swap join
> > > > arguments) but the problem with it is that it adds a layer of mental
> > > > specification effort if the actual problem is better stated in the
> > > > unsupported orientation.
> > > > 
> > > > On Fri, May 3, 2013 at 12:49 PM, Eduard Antonyan
> > > > <eduard.antonyan at gmail.com (mailto:eduard.antonyan at gmail.com)> wrote:
> > > > > Arun, it only needs the addition of smth like X[Y, keep.all = TRUE], all of
> > > > > the other merge options already exist as either X[Y] or Y[X] with or without
> > > > > nomatch = 0/NA.
> > > > > 
> > > > > 
> > > > > On Fri, May 3, 2013 at 11:45 AM, Arunkumar Srinivasan
> > > > > <aragorn168b at gmail.com (mailto:aragorn168b at gmail.com)> wrote:
> > > > > > 
> > > > > > Gabor,
> > > > > > 
> > > > > > Very true. I suppose your request is that the x[i] where `i` is a
> > > > > > data.table should have the same set of options like R's base `merge`
> > > > > > function, like, by.y=TRUE, by.x=TRUE or all=TRUE. I like the idea by itself.
> > > > > > However, I am not able to think of a way to do this. I mean, I find the
> > > > > > syntax X[Y, by.x=TRUE] weird / not making sense. That is, to me even though
> > > > > > 
> > > > > > X[Y] is equal to Y[X, by.y=TRUE] (or) X[Y, by.x=TRUE] (ignoring the
> > > > > > reordered columns) the latter 2 don't seem to make sense/is redundant (maybe
> > > > > > it's because I am used to this syntax).
> > > > > > 
> > > > > > Arun
> > > > > > 
> > > > > > On Friday, May 3, 2013 at 5:57 PM, Gabor Grothendieck wrote:
> > > > > > 
> > > > > > In my last post it should have read:
> > > > > > 
> > > > > > That X[Y] is not the same as Y[X] is analogous to the fact that
> > > > > > merge(X, Y, all.y=TRUE) is not the same as merge(Y, X, all.y=TRUE)
> > > > > > 
> > > > > > On Fri, May 3, 2013 at 11:55 AM, Gabor Grothendieck
> > > > > > <ggrothendieck at gmail.com (mailto:ggrothendieck at gmail.com)> wrote:
> > > > > > 
> > > > > > Assuming same-named keys, then these are all the same except possibly
> > > > > > for row and column order:
> > > > > > 
> > > > > > X[Y,,nomatch=0]
> > > > > > Y[X,,nomatch=0]
> > > > > > merge(X, Y)
> > > > > > merge(Y, X)
> > > > > > 
> > > > > > That X[Y] is not the same as Y[X] is analogous to the fact that
> > > > > > merge(X, Y, all.x=TRUE) is not the same as merge(Y, X, all.x=TRUE)
> > > > > > 
> > > > > > On Fri, May 3, 2013 at 11:46 AM, Arunkumar Srinivasan
> > > > > > <aragorn168b at gmail.com (mailto:aragorn168b at gmail.com)> wrote:
> > > > > > 
> > > > > > Gabor,
> > > > > > 
> > > > > > X[Y] and Y[X] are not necessarily the same operations (meaning, they don't
> > > > > > *have* to give the same output). However, merge(X,Y) and merge(Y,X) *have*
> > > > > > to provide the same output (except for the column order and names). In
> > > > > > that
> > > > > > sense, a join is a bit different from a merge, no?
> > > > > > 
> > > > > > Arun
> > > > > > 
> > > > > > On Friday, May 3, 2013 at 5:36 PM, Gabor Grothendieck wrote:
> > > > > > 
> > > > > > Yes, except that is not really what happens since match() only matches
> > > > > > one row whereas with mult="all", the default, all rows are matched
> > > > > > which is not really matching in the sense of match(). The current
> > > > > > naming confuses matching with joining and its really the latter that
> > > > > > is being done.
> > > > > > 
> > > > > > Regarding the existence of merge the advantage of [ is that it will
> > > > > > automatically only take the columns needed so merge is not really
> > > > > > equivalent to [ in all respects. Furthermore having to use different
> > > > > > constructs for different types of merge seems awkward.
> > > > > > 
> > > > > > 
> > > > > > On Fri, May 3, 2013 at 11:27 AM, Eduard Antonyan
> > > > > > <eduard.antonyan at gmail.com (mailto:eduard.antonyan at gmail.com)> wrote:
> > > > > > 
> > > > > > Btw the way I think about the "nomatch" name is as follows - normally X[Y]
> > > > > > tries to match rows of Y with rows of X, and then "nomatch" tells it what
> > > > > > to
> > > > > > do when there is *no match*.
> > > > > > 
> > > > > > 
> > > > > > On Fri, May 3, 2013 at 10:23 AM, Eduard Antonyan
> > > > > > <eduard.antonyan at gmail.com (mailto:eduard.antonyan at gmail.com)>
> > > > > > wrote:
> > > > > > 
> > > > > > 
> > > > > > To clarify - that behavior is already implemented in merge (more
> > > > > > specifically merge.data.table). I don't really have a view on having it in
> > > > > > X[Y] as well - I don't like all.x and all.y as the names, since there are
> > > > > > no
> > > > > > params named 'x' and 'y' in [.data.table (as opposed to merge), but some
> > > > > > param that would do a full outer join could certainly be added.
> > > > > > 
> > > > > > 
> > > > > > On Fri, May 3, 2013 at 10:09 AM, Gabor Grothendieck
> > > > > > <ggrothendieck at gmail.com (mailto:ggrothendieck at gmail.com)> wrote:
> > > > > > 
> > > > > > 
> > > > > > Yes, sorry. Its nomatch= which presumably derives from the parameter
> > > > > > of the same name in the match() function. If the idea of the nomatch=
> > > > > > name was to leverage off existing argument names in R then I would
> > > > > > prefer all.y= to be consistent with merge() in place of nomatch= since
> > > > > > we are really merging/joining rather than just matching. That would
> > > > > > also allow extension to all types of join by adding all.an x= argument
> > > > > > too.
> > > > > > 
> > > > > > On Fri, May 3, 2013 at 10:59 AM, Eduard Antonyan
> > > > > > <eduard.antonyan at gmail.com (mailto:eduard.antonyan at gmail.com)> wrote:
> > > > > > 
> > > > > > I would prefer nomatch=0 as a default though, simply because that's
> > > > > > what I
> > > > > > do most of the time :)
> > > > > > 
> > > > > > 
> > > > > > On Fri, May 3, 2013 at 9:57 AM, Eduard Antonyan
> > > > > > <eduard.antonyan at gmail.com (mailto:eduard.antonyan at gmail.com)>
> > > > > > wrote:
> > > > > > 
> > > > > > 
> > > > > > A correction - the param is called "nomatch", not "match".
> > > > > > 
> > > > > > This use case seems like smth a user shouldn't really do - in an ideal
> > > > > > world you should have them both keyed by the same-name column.
> > > > > > 
> > > > > > As is, my view on it is that data.table is correcting the user mistake
> > > > > > of
> > > > > > naming the column in Y - y, instead of x, and so the output makes
> > > > > > sense and
> > > > > > I don't see the need of complicating the behavior by adding more cases
> > > > > > one
> > > > > > has to go through to figure out what the output columns would be.
> > > > > > Similar to
> > > > > > asking for X[J(c("b", "c", "d"))] - you wouldn't want an anonymous
> > > > > > column
> > > > > > there, would you?
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > On Fri, May 3, 2013 at 6:18 AM, Gabor Grothendieck
> > > > > > <ggrothendieck at gmail.com (mailto:ggrothendieck at gmail.com)> wrote:
> > > > > > 
> > > > > > 
> > > > > > I am moving this discussion which started with mdowle to the list.
> > > > > > 
> > > > > > Consider this example slightly modified from the data.table FAQ:
> > > > > > 
> > > > > > X = data.table(x=c("a","a","b","b","b","c","c"), foo=1:7, key="x")
> > > > > > Y = data.table(y=c("b","c","d"), bar=c(4,2,3))
> > > > > > out <- X[Y]; out
> > > > > > 
> > > > > > x foo bar
> > > > > > 1: b 3 4
> > > > > > 2: b 4 4
> > > > > > 3: b 5 4
> > > > > > 4: c 6 2
> > > > > > 5: c 7 2
> > > > > > 6: d NA 3
> > > > > > 
> > > > > > Note that the first column of the output is labelled x even though
> > > > > > the
> > > > > > data to produce it comes from y, e.g. "d" in out$x is not in X$x but
> > > > > > does appear in Y$y so clearly the data is coming from y as opposed to
> > > > > > x . In terms of SQL the above would be written:
> > > > > > 
> > > > > > select Y.y as x, ...
> > > > > > 
> > > > > > and the need to renamne the first column of out suggests that there
> > > > > > may be a deeper problem here.
> > > > > > 
> > > > > > Here are some ideas to address this (they would require changes to
> > > > > > data.table):
> > > > > > 
> > > > > > - the default of X[Y,, match=NA] would be changed to a default of
> > > > > > X[Y,,match=0] so that it corresponds to the defaults in R's merge and
> > > > > > in SQL joins.
> > > > > > 
> > > > > > - the column name of the first column in the example above would be
> > > > > > changed to y if match=0 but be left at x if match=NA. In the case
> > > > > > that match=0 (the proposed new default) x and y are equal so the
> > > > > > first
> > > > > > column can be validly labelled as x but in the case that match=NA
> > > > > > they
> > > > > > are not so y would be used as the column name.
> > > > > > 
> > > > > > - the name match= does seem a bit misleading since R's match only
> > > > > > matches one item in the target whereas in data.table match matches
> > > > > > many if mult="all" and that is the default. Perhaps some thought
> > > > > > should be given to a name change here?
> > > > > > 
> > > > > > The above would seem to correspond more closely to R's merge and SQL
> > > > > > join defaults. Any use cases or other comments?
> > > > > > 
> > > > > > --
> > > > > > Statistics & Software Consulting
> > > > > > GKX Group, GKX Associates Inc.
> > > > > > tel: 1-877-GKX-GROUP
> > > > > > email: ggrothendieck at gmail.com (http://gmail.com)
> > > > > > _______________________________________________
> > > > > > datatable-help mailing list
> > > > > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org)
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > --
> > > > > > Statistics & Software Consulting
> > > > > > GKX Group, GKX Associates Inc.
> > > > > > tel: 1-877-GKX-GROUP
> > > > > > email: ggrothendieck at gmail.com (http://gmail.com)
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > --
> > > > > > Statistics & Software Consulting
> > > > > > GKX Group, GKX Associates Inc.
> > > > > > tel: 1-877-GKX-GROUP
> > > > > > email: ggrothendieck at gmail.com (http://gmail.com)
> > > > > > _______________________________________________
> > > > > > datatable-help mailing list
> > > > > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org)
> > > > > > 
> > > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > --
> > > > > > Statistics & Software Consulting
> > > > > > GKX Group, GKX Associates Inc.
> > > > > > tel: 1-877-GKX-GROUP
> > > > > > email: ggrothendieck at gmail.com (http://gmail.com)
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > --
> > > > > > Statistics & Software Consulting
> > > > > > GKX Group, GKX Associates Inc.
> > > > > > tel: 1-877-GKX-GROUP
> > > > > > email: ggrothendieck at gmail.com (http://gmail.com)
> > > > > > 
> > > > > 
> > > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > -- 
> > > > Statistics & Software Consulting
> > > > GKX Group, GKX Associates Inc.
> > > > tel: 1-877-GKX-GROUP
> > > > email: ggrothendieck at gmail.com (http://gmail.com)
> > > > 
> > > > 
> > > > 
> > > 
> > > 
> > 
> 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130503/7f87b842/attachment-0001.html>