[datatable-help] Stackoverflow thread comparing merge times

Tue Dec 7 20:43:50 CET 2010

On Tue, Dec 7, 2010 at 2:36 PM, Steve Lianoglou
<mailinglist.honeypot at gmail.com> wrote:
> Hi,
>
> On Tue, Dec 7, 2010 at 2:07 PM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>>
>> Does anyone have time to see if this post uses data.table correctly :
>>
>> http://stackoverflow.com/questions/4322219/whats-the-fastest-way-to-merge-join-data-frames-in-r
>>
>> The dt[, colMeans(cbind(x, y)), by="g1,g2"] bit looks wrong to me. Is
>> that why it takes 131 seconds vs 2.73 for sqldf? Shouldn't it be
>> dt[,list(mean(x),mean(y)),by="g1,g2"]?
>>
>> And also the y2= bit of dt1[dt2,list(x,y1,y2=dt2$y2)] looks odd.
>
> Don't know what's wrong with me today, but running this part of the
> given example in "the obvious way" is causing data.table to error and
> I'm not sure what I'm (obviously(?)) doing wrong:
>
> set.seed(123)
> N <- 1e5
> d1 <- data.frame(x=sample(N,N), y1=rnorm(N))
> d2 <- data.frame(x=sample(N,N), y2=rnorm(N))
>
> d1 <- data.table(d1, key="x")
> d2 <- data.table(d2, key="x")
> merge(d1, d2, by="x")
>
> Error in x[, key, with = FALSE] : incorrect number of dimensions
>
> What am I missing?

The problem is the column name "x". I thought we had gotten rid of
these naming issues a while ago. Renaming the column to "xx" seems to
work:

set.seed(123)
N <- 1e5
d1 <- data.frame(xx=sample(N,N), y1=rnorm(N))
d2 <- data.frame(xx=sample(N,N), y2=rnorm(N))

d1 <- data.table(d1, key="xx")
d2 <- data.table(d2, key="xx")
merge(d1, d2)
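
As a sanity check on the renamed version, here is a minimal sketch that
compares the data.table merge against base merge() on the same data. The
names df1, df2, dt1, dt2, res_dt and res_df are only for illustration, and
it assumes a data.table version in which merge() on two tables sharing a
key joins on that key.

```r
library(data.table)

set.seed(123)
N <- 1e5
df1 <- data.frame(xx = sample(N, N), y1 = rnorm(N))
df2 <- data.frame(xx = sample(N, N), y2 = rnorm(N))

# Keyed data.tables built from the same data frames
dt1 <- data.table(df1, key = "xx")
dt2 <- data.table(df2, key = "xx")

res_dt <- merge(dt1, dt2)             # data.table method, joins on the shared key
res_df <- merge(df1, df2, by = "xx")  # base data.frame method, sorted by xx

# Both results should hold the same rows in the same order
stopifnot(
  nrow(res_dt) == nrow(res_df),
  identical(res_dt$xx, res_df$xx),
  isTRUE(all.equal(res_dt$y1, res_df$y1)),
  isTRUE(all.equal(res_dt$y2, res_df$y2))
)
```

If the stopifnot() call returns silently, the two merges agree on this
data.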

Right now, I don't have time to dig further.
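
On the first point in the quoted message, a small sketch of the two
aggregation forms may help whoever picks this up. It uses made-up data
with columns g1, g2, x and y (the table from the Stackoverflow post is not
reproduced here), and it only shows the shape of the results, not the
timings.

```r
library(data.table)

set.seed(1)
n <- 1e4
# Hypothetical data: two grouping columns and two numeric columns
dt <- data.table(
  g1 = sample(1:10, n, replace = TRUE),
  g2 = sample(1:10, n, replace = TRUE),
  x  = rnorm(n),
  y  = rnorm(n)
)

# Form used in the post: j returns a length-2 vector per group,
# so the result has two rows per group in a single value column
v <- dt[, colMeans(cbind(x, y)), by = "g1,g2"]

# Suggested form: j returns a list, giving one row per group
# with one column per mean
b <- dt[, list(x = mean(x), y = mean(y)), by = "g1,g2"]

# Same numbers, different layout
stopifnot(nrow(v) == 2 * nrow(b))
```

Whether this difference explains the 131 seconds is not something the
sketch settles; it would need timing on the original data.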

- Tom