[datatable-help] Slow execution: Extracting last value in each group

Fri Aug 16 06:52:48 CEST 2013

Hi,
This is a follow-up to a post on the R-help mailing list (http://r.789695.n4.nabble.com/How-to-extract-last-value-in-each-group-td4673787.html).

In short, I tried the code below using data.table, but found it to be slower than some of the other methods. Steve Lianoglou ran the same code and it was much faster for him (system.time() elapsed ~0.070 s vs. ~40 s on my machine).

###data

dat1<- structure(list(Date = c("06/01/2010", "06/01/2010", "06/01/2010", 
"06/01/2010", "06/02/2010", "06/02/2010", "06/02/2010", "06/02/2010", 
"06/02/2010", "06/02/2010", "06/02/2010"), Time = c(1358L, 1359L, 
1400L, 1700L, 331L, 332L, 334L, 335L, 336L, 337L, 338L), O = c(136.4, 
136.4, 136.45, 136.55, 136.55, 136.7, 136.75, 136.8, 136.8, 136.75, 
136.8), H = c(136.4, 136.5, 136.55, 136.55, 136.7, 136.7, 136.75, 
136.8, 136.8, 136.8, 136.8), L = c(136.35, 136.35, 136.35, 136.55, 
136.5, 136.65, 136.75, 136.8, 136.8, 136.75, 136.8), C = c(136.35, 
136.5, 136.4, 136.55, 136.7, 136.65, 136.75, 136.8, 136.8, 136.8, 
136.8), U = c(2L, 9L, 8L, 1L, 36L, 3L, 1L, 4L, 8L, 1L, 3L), D = c(12L, 
6L, 7L, 0L, 6L, 1L, 0L, 0L, 0L, 2L, 0L)), .Names = c("Date", 
"Time", "O", "H", "L", "C", "U", "D"), class = "data.frame", row.names = c(NA, 
-11L))

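# Replicate the 11 rows 1e5 times (1.1 million rows in total)
indx <- rep(1:nrow(dat1), 1e5)
dat2 <- dat1[indx, ]
# Give every block of 11 rows after the first its own date
dat2[-c(1:11), 1] <- format(rep(seq(as.Date("1080-01-01"), by = 1, length.out = 99999), each = 11), "%m/%d/%Y")
# Sort by Date (as a character string), then by Time
dat2 <- dat2[order(dat2[, 1], dat2[, 2]), ]
row.names(dat2) <- 1:nrow(dat2)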

# Some speed comparisons (more in the link):
system.time(res1 <- dat2[c(diff(as.numeric(as.factor(dat2$Date))), 1) > 0, ])
#   user  system elapsed
#  0.528   0.012   0.540
system.time(res7 <- dat2[cumsum(rle(dat2[, 1])$lengths), ])
#   user  system elapsed
#  0.156   0.000   0.155
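
For anyone reading along, here is a minimal base R sketch of why the rle() approach picks the last row of each group. The toy data frame and the names toy, last_idx and last_dup are made up for illustration; the duplicated() variant is just another base R idiom I am adding for comparison, not one of the methods timed above.

```r
# Toy data: three sorted groups of sizes 2, 3 and 1 (hypothetical values).
toy <- data.frame(Date = c("a", "a", "b", "b", "b", "c"),
                  Time = c(1L, 2L, 1L, 2L, 3L, 1L),
                  stringsAsFactors = FALSE)

# rle() returns the lengths of runs of consecutive equal values; the
# cumulative sum of those lengths is the row index where each run ends.
# This relies on the data being sorted (or at least grouped) by Date.
last_idx <- cumsum(rle(toy$Date)$lengths)

# Alternative: a row is the last of its group when no later row has the
# same Date. This does not require the groups to be contiguous.
last_dup <- which(!duplicated(toy$Date, fromLast = TRUE))

# Last row of each group
toy_last <- toy[last_idx, ]
```

Both index vectors select the same rows here because the toy data are already sorted by Date, as dat2 is above.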

library(data.table)
system.time({
  dt1 <- data.table(dat2, key = c('Date', 'Time'))
  ans <- dt1[, .SD[.N], by = 'Date']
})
#   user  system elapsed
# 39.860   0.020  39.952   # slower than many of the other methods

ans1 <- as.data.frame(ans)
row.names(ans1) <- row.names(res7)
attr(ans1, "row.names") <- attr(res7, "row.names")
identical(ans1, res7)
# [1] TRUE
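
Something that may be worth timing alongside .SD[.N]: two other data.table idioms that avoid subsetting .SD inside every group. This is only a sketch on made-up toy data (toy_dt, via_I and via_mult are hypothetical names), and my guess that per-group .SD subsetting is where the time goes is an assumption, not something confirmed in this thread.

```r
library(data.table)

# Toy keyed table: three groups of sizes 2, 3 and 1 (hypothetical values).
toy_dt <- data.table(Date = c("a", "a", "b", "b", "b", "c"),
                     Time = c(1L, 2L, 1L, 2L, 3L, 1L),
                     C    = c(10, 11, 20, 21, 22, 30),
                     key  = c("Date", "Time"))

# Idiom 1: collect the row number of the last row in each group with .I,
# then subset the table once with those row numbers.
via_I <- toy_dt[toy_dt[, .I[.N], by = Date]$V1]

# Idiom 2: join on the first key column and keep only the last matching
# row per Date. This relies on the table being keyed by Date, then Time.
via_mult <- toy_dt[J(unique(toy_dt$Date)), mult = "last"]

# Reference result using the approach from this post.
via_SD <- toy_dt[, .SD[.N], by = Date]
```

On the real data the same calls would be made on dt1 instead of toy_dt, each wrapped in system.time() for comparison.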

Steve Lianoglou's reply is below:
############################

Amazing. This is what I get on my MacBook Pro, i7 @ 3GHz (very close
specs to your machine):

R> dt1 <- data.table(dat2, key=c('Date', 'Time'))
R> system.time(ans <- dt1[, .SD[.N], by='Date'])
   user  system elapsed
  0.064   0.009   0.073  ###########################

R> system.time(res7<- dat2[cumsum(rle(dat2[,1])$lengths),])
   user  system elapsed
  0.148   0.016   0.165

On one of our compute servers, running who knows what processor on some
version of Linux (which shouldn't really matter, as we're comparing the
timings relative to each other here):

R> system.time(ans <- dt1[, .SD[.N], by='Date'])
   user  system elapsed
  0.160   0.012   0.170

R> system.time(res7<- dat2[cumsum(rle(dat2[,1])$lengths),])
   user  system elapsed
  0.292   0.004   0.294
##############################################

My sessionInfo#######
sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-unknown-linux-gnu (64-bit)  (linux mint 15)

locale:
 [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8    
 [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8   
 [7] LC_PAPER=C                 LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.8.8 stringr_0.6.2    reshape2_1.2.2  

loaded via a namespace (and not attached):
[1] plyr_1.8    tools_3.0.1

CPU ####################
I use a Dell XPS L502X:
 * Processor 2nd Gen Core i7 Intel i7-2630QM / 2 GHz ( 2.9 GHz ) ( Quad-Core ) 
 * Memory 6 GB / 8 GB (max) 
 * Hard Drive 640 GB - Serial ATA-300 - 7200 rpm  

Any help will be appreciated.
Thanks.
A.K.