[datatable-help] keys that dont match

Sat May 7 09:03:45 CEST 2011

The original post from Santosh came through as a BCC. I guess
GoogleGroups did the BCC. Will need to do more investigation.

> Which are the rows in dt1 that aren't in dt2
Another option may be a 'not join'; e.g.,
  X[-X[Y,which=TRUE]]
or
  seq(1,nrow(X))[-X[Y,which=TRUE]]
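
As a rough sketch only, here is how that might look on Steve's dt1 and
dt2 from the quoted message below. The integer keys on both tables, the
nomatch=0 argument and the empty-match guard are my assumptions, not part
of the one-liners above; hit and miss are made-up names for illustration.

```r
library(data.table)

# Toy data mirroring the quoted example; integer key columns on both
# sides are an assumption made here to keep the join types identical.
dt1 <- data.table(a = 1:10, b = letters[1:10], key = c("a", "b"))
idx <- c(1L, 3L, 5L, 10L)
dt2 <- data.table(a = idx, b = letters[idx], key = c("a", "b"))

# Row numbers of dt1 matched by the rows of dt2.
# nomatch = 0 drops dt2 rows with no partner in dt1, so no NA can
# end up in the negative index used below.
hit <- dt1[dt2, which = TRUE, nomatch = 0]

# Guard the no-match case: x[-integer(0)] selects nothing, not everything.
miss <- if (length(hit)) seq_len(nrow(dt1))[-hit] else seq_len(nrow(dt1))

miss       # row numbers of dt1 that have no match in dt2
dt1[miss]  # the unmatched rows themselves
```

On this data miss should agree with the which(is.na(...)) result in the
quoted message. The guard matters because a negative index of length
zero returns no rows at all.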

Will add something to docs/wiki re 'not joins'.

Matthew

On Wed, 2011-05-04 at 13:00 -0400, Steve Lianoglou wrote:
> Hi,
> 
> On Wed, May 4, 2011 at 12:23 PM, Santosh Srinivas
> <santosh.srinivas at gmail.com> wrote:
> > Hi Steve,
> >
> > Sorry ... strange problem .. Don't know why that happened.
> >
> > http://groups.google.com/group/datatable/browse_thread/thread/51a0387e95d37feb
> 
> It looks like your first email was sent to the @googlegroups.com
> address (I didn't even know we had that set up), and the second one
> came through the @lists.r-forge.r-project.
> 
> So (I guess) the first didn't come through because it was sent to the
> wrong(?) list -- anyway, in the future you should send to the
> @lists.r-forge... one.
> 
> > I had the question and my attempt to answer before someone says go read the
> > manual :)
> 
> It looks like the answer you offered is reasonable, though.
> 
> In short -- the question was "How can I quickly tell which (keyed)
> rows are in one data.table vs. another?"
> 
> As you mentioned, you can do this by joining using `[` -- in order to
> do this easily, you could ensure that each data.table has a column
> that isn't in the other.
> 
> For example, if you have data like so:
> 
> 
> R> dt1 <- data.table(a=1:10, b=letters[1:10], key="a,b")
> R> dt2 <- data.table(a=c(1, 3, 5, 10), b=letters[c(1, 3, 5, 10)], key="a,b")
> 
> Doing either `dt1[dt2]` or `dt2[dt1]` doesn't get you anywhere too
> fast, especially if one is just a subset of the other (as dt2 is of
> dt1):
> 
> R> dt1[dt2]
>       a b
> [1,]  1 a
> [2,]  3 c
> [3,]  5 e
> [4,] 10 j
> 
> R> dt2[dt1]
>        a b
>  [1,]  1 a
>  [2,]  2 b
>  [3,]  3 c
>  [4,]  4 d
>  [5,]  5 e
>  [6,]  6 f
>  [7,]  7 g
>  [8,]  8 h
>  [9,]  9 i
> [10,] 10 j
> 
> Adding some 'dummy' columns may help:
> 
> R> dt1$in.1 <- TRUE
> R> dt2$in.2 <- TRUE
> 
> Then you can (easily) ask which rows are in dt1 that aren't in dt2:
> 
> R> dt2[dt1] ## nomatch=NA is the default
>        a b in.2 in.1
>  [1,]  1 a TRUE TRUE
>  [2,]  2 b   NA TRUE
>  [3,]  3 c TRUE TRUE
>  [4,]  4 d   NA TRUE
>  [5,]  5 e TRUE TRUE
>  [6,]  6 f   NA TRUE
>  [7,]  7 g   NA TRUE
>  [8,]  8 h   NA TRUE
>  [9,]  9 i   NA TRUE
> [10,] 10 j TRUE TRUE
> 
> ## or, in a more email-friendly format:
> R> which(is.na(dt2[dt1]$in.2))
> [1] 2 4 6 7 8 9
> 
> Which are the rows in dt1 that aren't in dt2
> 
> HTH,
> -steve
>