[datatable-help] keys that dont match

Sat May 7 22:18:21 CEST 2011

Read again and realised the 'not join' (although correct) solves a
different problem; i.e., finding the rows in dt2 that no row in dt1
matches to, not the rows in dt1 that don't match to any row in dt2. So,
second attempt ...

Create 2 tables Steve L created :

> dt1 = data.table(a=1:10,b=letters[1:10],key="a,b")
> dt2 = data.table(a=c(1,3,5,10),b=letters[c(1,3,5,10)],key="a,b")

then (without dummy columns needed, so faster and more convenient) :

> which(is.na(dt2[dt1,which=TRUE]))
[1] 2 4 6 7 8 9

Which are the rows in dt1 that aren't in dt2.

This works in this case because when a row in dt1 matches to a row in
(this) dt2, it only matches to one row in dt2 since this dt2's key is
unique, and, this join is using all the columns in dt2's key.

To robustly cope with multiple matches :

> which(is.na(dt2[dt1,which=TRUE,mult="first"]))
[1] 2 4 6 7 8 9

Testing that by creating a dup in dt2 :

> dt2 = data.table(a=c(1,3,5,5,10),b=letters[c(1,3,5,5,10)],key="a,b")
> dt2
      a b
[1,]  1 a
[2,]  3 c
[3,]  5 e
[4,]  5 e
[5,] 10 j
> which(is.na(dt2[dt1,which=TRUE]))  
[1]  2  4  7  8  9 10    # unintended result
> which(is.na(dt2[dt1,which=TRUE,mult="first"]))
[1] 2 4 6 7 8 9          # intended result (same as before)
> 

Admittedly, that's ugly, but it should be fast and scale. Perhaps there
is scope for a new 'whichna' argument then that does that directly :

> dt2[dt1,whichna=TRUE]
[1] 2 4 6 7 8 9
>

and maybe a new 'not' argument :

> dt2 =
data.table(a=c(1,3,5,5,10,11),b=letters[c(1,3,5,5,10,11)],key="a,b")
> dt2
      a b
[1,]  1 a
[2,]  3 c
[3,]  5 e
[4,]  5 e
[5,] 10 j
[6,] 11 k   # extra row not in dt1
> dt2[-dt2[dt1,which=TRUE,nomatch=0]]    # not join
      a b
[1,] 11 k
> dt2[dt1,not=TRUE]   # more user friendly
      a b
[1,] 11 k

> seq(1,nrow(dt2))[-dt2[dt1,which=TRUE,nomatch=0]]  # which not
[1] 6
> dt2[dt1,not=TRUE,which=TRUE]   # more user friendly
[1] 6

Setting both not=TRUE and whichna=TRUE wouldn't make sense and would be
prevented.

Thoughts anyone? 

Matthew

On Sat, 2011-05-07 at 08:03 +0100, Matthew Dowle wrote:
> The original post from Santosh came through as a BCC. I guess
> GoogleGroups did the BCC. Will need to do more investigation.
> 
> > Which are the rows in dt1 that aren't in dt2
> Another option may be a 'not join'; e.g.,
>   X[-X[Y,which=TRUE]]
> or
>   seq(1,nrow(X))[-X[Y,which=TRUE]]
> 
> Will add something to docs/wiki re 'not joins'.
> 
> Matthew
> 
> 
> On Wed, 2011-05-04 at 13:00 -0400, Steve Lianoglou wrote:
> > Hi,
> > 
> > On Wed, May 4, 2011 at 12:23 PM, Santosh Srinivas
> > <santosh.srinivas at gmail.com> wrote:
> > > Hi Steve,
> > >
> > > Sorry ... strange problem .. Dont know why that happened.
> > >
> > > http://groups.google.com/group/datatable/browse_thread/thread/51a0387e95d37feb
> > 
> > It looks like your first email was sent to the @googlegroups.com
> > address (I didn't even know we had that setup), and the second one
> > came through the @lists.r-forge.r-project.
> > 
> > So (I guess) the first didn't come through because it was sent to the
> > wrong(?) list -- anyway, in the future you should send to the
> > @lists.r-forge... one.
> > 
> > > I had the question and my attempt to answer before someone says go read the
> > > manual :)
> > 
> > It looks like the answer you offered is reasonable, though.
> > 
> > In short -- the question was "How can I quickly tell which (keyed)
> > rows are in one data.table vs. another)".
> > 
> > As you mentioned, you can do this by joining using `[` -- in order to
> > do this easily, you could ensure that each data.table has a column
> > that isn't in the other.
> > 
> > For example, if you have data like so:
> > 
> > 
> > R> dt1 <- data.table(a=1:10, b=letters[1:10], key="a,b")
> > R> dt2 <- data.table(a=c(1, 3, 5, 10), b=letters[c(1, 3, 5, 10)], key="a,b")
> > 
> > Doing either `dt1[dt2]` or `dt2[dt1]` doesn't get you anywhere too
> > fast (especially if one is just a subset of the other (like dt2 is to
> > dt1):
> > 
> > R> dt1[dt2]
> >       a b
> > [1,]  1 a
> > [2,]  3 c
> > [3,]  5 e
> > [4,] 10 j
> > 
> > R> dt2[dt1]
> >        a b
> >  [1,]  1 a
> >  [2,]  2 b
> >  [3,]  3 c
> >  [4,]  4 d
> >  [5,]  5 e
> >  [6,]  6 f
> >  [7,]  7 g
> >  [8,]  8 h
> >  [9,]  9 i
> > [10,] 10 j
> > 
> > Adding some 'dummy' columns may help:
> > 
> > R> dt1$in.1 <- TRUE
> > R> dt2$in.2 <- TRUE
> > 
> > Then you can (easily) ask which rows are in dt1 that aren't in dt2:
> > 
> > R> dt2[dt1] ## nomatch=NA is the default
> >        a b in.2 in.1
> >  [1,]  1 a TRUE TRUE
> >  [2,]  2 b   NA TRUE
> >  [3,]  3 c TRUE TRUE
> >  [4,]  4 d   NA TRUE
> >  [5,]  5 e TRUE TRUE
> >  [6,]  6 f   NA TRUE
> >  [7,]  7 g   NA TRUE
> >  [8,]  8 h   NA TRUE
> >  [9,]  9 i   NA TRUE
> > [10,] 10 j TRUE TRUE
> > 
> > ## or more email friendly format:
> > R> which(is.na(dt2[dt1]$in.2))
> > [1] 2 4 6 7 8 9
> > 
> > Which are the rows in dt1 that aren't in dt2
> > 
> > HTH,
> > -steve
> > 
> 
> 
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help