[datatable-help] Joins with or-condition

Thu Jun 6 15:58:31 CEST 2013

On Thu, Jun 6, 2013 at 8:30 AM, Wolfgang Steitz <wsteitz at gmail.com> wrote:
> Sorry for not sending code. I guess I was expecting a simple answer
> how this sql query is translated. Like for an "and" in the "on", using
> 'setkey' and then doing a join works. So here is reproducible code:
>
> require(sqldf)
>
> table_a <- data.table(y=c("l","l","s","a","a","b","c","d","d"),
>                       z=c("u","s","u","b","a","b","c","d","a"),
>
> abc=c(123.2,13.5,228.4,427.2,123.1,180.2,153.8,113.2,13.2),
>                       xyz=c("a","b","f","b","a","b","f","a","f"))
>
> table_b <-  data.table(y=c("l","l","s","a","a","b","c","d","d","k","k","d","c","u"),
>
> z=c("u","s","u","b","a","b","c","d","a","a","b","d","d","a"),
>
> abc=c(123.2,13.5,228.4,427.2,123.1,180.2,153.8,113.2,13.2,123.1,180.2,153.8,113.2,13.2))
>
> sqldf("SELECT a.xyz, sum(a.abc * b.abc)
> FROM table_a as a
> LEFT JOIN table_b as b
> ON (a.y=b.y or a.y=b.z or a.z=b.y or a.z=b.z)
> Group by a.xyz", drv="SQLite")

This will cut down the cartesian product between `a` and `b` decomposing it
into cartesian products between a subset of `a` (namely those rows with a
particular `xyz` value) and all of `b` for each unique value of `xyz`
(which is basically part of what the query optimizer in SQL would do for
you automatically):

a <- data.table(one = 1, table_a)
b <- data.table(one = 1, table_b)
setkey(a, one, xyz)
setkey(b, one)
f <- function(lev) {
   a. <- a[J(1, lev)]
   b[a., allow.cartesian = TRUE][y == y.1 | y == z.1 | z == y.1 | z ==
z.1, list(sum(abc * abc.1))]
}
rbindlist(lapply(unique(a$xyz), f))

(Although its not a data.table solution you could alternately try
adding indexes
to your SQL query.)