[datatable-help] Stackoverflow thread comparing merge times

Matthew Dowle mdowle at mdowle.plus.com
Wed Dec 8 10:07:43 CET 2010


Bug raised for this one :
https://r-forge.r-project.org/tracker/index.php?func=detail&aid=1229&group_id=240&atid=975

The "x" issue was fixed in [.data.table. As far as I remember, it wasn't a
specific fix; it went away when the internal scoping was tidied up and made
more robust. Maybe this is a new one in merge.data.table when it calls
[.data.table.

I don't use merge() btw; I prefer the d1[d2] syntax, which may explain
why this got missed.
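
For anyone comparing the two, here is a minimal sketch of the d1[d2] form
next to merge(), reusing the data from the example quoted below. It assumes a
data.table version in which the bug above no longer triggers; the names
joined and merged are only for illustration.

```r
library(data.table)

set.seed(123)
N <- 1e5
# Both tables are keyed on "x"; sample(N, N) is a permutation, so keys are unique
d1 <- data.table(x = sample(N, N), y1 = rnorm(N), key = "x")
d2 <- data.table(x = sample(N, N), y2 = rnorm(N), key = "x")

# Keyed join: look up each row of d2 in d1 via the shared key "x"
joined <- d1[d2]

# The merge() form of the same inner join (assumes the merge bug is fixed)
merged <- merge(d1, d2, by = "x")
```

Since every key value appears exactly once in each table, both forms should
return one row per key with columns x, y1 and y2.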

Matthew

"Tom Short" <tshort.rlists at gmail.com> wrote in message 
news:AANLkTint_CMfpQC_jxDP51zNN8zBvdKqTs=wcrK1G6Xk at mail.gmail.com...
On Tue, Dec 7, 2010 at 2:36 PM, Steve Lianoglou
<mailinglist.honeypot at gmail.com> wrote:
> Hi,
>
> On Tue, Dec 7, 2010 at 2:07 PM, Matthew Dowle <mdowle at mdowle.plus.com> 
> wrote:
>>
>> Does anyone have time to see if this post uses data.table correctly :
>>
>> http://stackoverflow.com/questions/4322219/whats-the-fastest-way-to-merge-join-data-frames-in-r
>>
>> The dt[, colMeans(cbind(x, y)), by="g1,g2"] bit looks wrong to me. Is
>> that why it takes 131 seconds vs 2.73 for sqldf ? Shouldn't it be
>> dt[,list(mean(x),mean(y)),by="g1,g2"] ?
>>
>> And also the y2= bit of dt1[dt2,list(x,y1,y2=dt2$y2)] looks odd.
>
> Don't know what's wrong with me today, but running this part of the
> given example in "the obvious way" is causing data.table to error and
> I'm not sure what I'm (obviously(?)) doing wrong:
>
> set.seed(123)
> N <- 1e5
> d1 <- data.frame(x=sample(N,N), y1=rnorm(N))
> d2 <- data.frame(x=sample(N,N), y2=rnorm(N))
>
> d1 <- data.table(d1, key="x")
> d2 <- data.table(d2, key="x")
> merge(d1, d2, by="x")
>
> Error in x[, key, with = FALSE] : incorrect number of dimensions
>
> What am I missing?

It's a problem with the column name "x". I thought we got rid of the
naming issues a while ago. The following seems to work:

set.seed(123)
N <- 1e5
d1 <- data.frame(xx=sample(N,N), y1=rnorm(N))
d2 <- data.frame(xx=sample(N,N), y2=rnorm(N))

d1 <- data.table(d1, key="xx")
d2 <- data.table(d2, key="xx")
merge(d1, d2)

Right now, I don't have time to dig further.

- Tom 




