[datatable-help] Slow execution: Extracting last value in each group

Fri Aug 16 17:02:16 CEST 2013

Hi Arun, Frank & Steve,
Thanks for responding to my post.

I did the 'microbenchmark' and 'debugonce(....)' using the same dataset 'dat2'.
f1 <- function (dataFrame) {
      dataFrame[unlist(with(dataFrame, tapply(Time, list(Date), FUN = function(x) x == max(x)))), ]
  }
  f2 <- function (dataFrame) {
      dataFrame[cumsum(with(dataFrame, tapply(Time, list(Date), FUN = which.max))), ]
  }

isLastInRun <- function(x) c(x[-1] != x[-length(x)], TRUE)
  f3 <- function(dataFrame) {
      dataFrame[ isLastInRun(dataFrame$Date), ]
  }

f4<- function(dataFrame){
    dataFrame[as.logical(with(dataFrame,ave(Time,Date,FUN=function(x) x==max(x)))),]
}

f5<- function(dataFrame){
   dataFrame[cumsum(rle(dataFrame[,1])$lengths),]
    }

library(data.table)
dt1 <- data.table(dat2, key=c('Date', 'Time'))
f6<- function(dataTable){
  dataTable[, .SD[.N], by='Date']}

f7<- function(dataTable){
  dataTable[dataTable[, .I[.N], by='Date']$V1]
   }
f8<- function(dataTable){
dataTable[J(unique(Date)),,mult='last']
  }     

f9<- function(dataTable){
 dataTable[dataTable[, .I[.N], by='Date']$V1] 
}

library(microbenchmark)
microbenchmark(f1(dat2),
           f2(dat2),
           f3(dat2),
           f4(dat2),
           f5(dat2),
           f6(dt1),
           f7(dt1),
           f8(dt1),
           f9(dt1),
           times=100)                   
#Unit: milliseconds
#     expr         min          lq      median          uq        max neval
# f1(dat2)  2046.59313  2318.57397  2414.21020  2533.28214  2842.9609   100
# f2(dat2)   940.97742  1000.56395  1027.53096  1100.67961  1705.4570   100
# f3(dat2)   315.06253   325.02696   341.21953   364.85656   533.9347   100
# f4(dat2)   804.89703   858.14888   899.55182   964.39989  1129.9311   100
# f5(dat2)   149.55682   153.67846   167.23934   176.56643   292.3134   100
#  f6(dt1) 46665.61046 48234.78637 48818.88141 49366.46810 51112.7930   100  ###############################slowest
#  f7(dt1)    71.02789    76.97008    85.09989    97.82982   387.3801   100
#  f8(dt1)    77.74961    78.94773    80.00620    89.00892   205.2492   100
#  f9(dt1)    71.76817    76.40184    79.89194   100.57348   282.8359   100

#Comparing the fastest among data.table with f5()
 system.time(res8<- f8(dt1))
#   user  system elapsed 
 #  0.08    0.00    0.08 
system.time(res5<- f5(dat2))
#   user  system elapsed 
 # 0.156   0.000   0.153 
res8New<- as.data.frame(res8)
 row.names(res8New)<- row.names(res5)
attr(res8New,"row.names")<- attr(res5,"row.names")
 identical(res8New,res5)
#[1] TRUE

#During debugging: the step that took long time to execute is: (Same as Frank reported)

debugonce(data.table:::`[.data.table`)
 dt1[, .SD[.N], by='Date']

debug: ans = .Call(Cdogroups, x, xcols, groups, grpcols, jiscols, grporder, 
    o__, f__, len__, jsub, SDenv, cols, newnames, verbose)

#I use Linux mint 15.

sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8    
 [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8   
 [7] LC_PAPER=C                 LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.8.8     microbenchmark_1.3-0 stringr_0.6.2       
[4] reshape2_1.2.2      

loaded via a namespace (and not attached):
[1] plyr_1.8    tcltk_3.0.1 tools_3.0.1

A.K.

Steve,
Thank you.

arun, 
Could you run it with `microbenchmark` instead of system.time (with times = 100 or so) and paste the results here?

Also, maybe you could use debugonce(data.table:::`[.data.table`) and then run 

    x[, .SD[.N], by='Date']

to go step by step to find out the line that causes the lag, perhaps? 

Arun

________________________________
From: Arunkumar Srinivasan <aragorn168b at gmail.com>
To: arun <smartpink111 at yahoo.com> 
Sent: Friday, August 16, 2013 2:27 AM
Subject: Re: [datatable-help] Slow execution: Extracting last value in each group

Sorry, but I'm not sure what your question is here. There seems to be different timings between you and Steve. You want to get it verified as to which one is true? On my system, Steve's takes 0.003 seconds. 

However, a *faster* version than Steve's solution (on bigger data) would be: 

    x[x[, .I[.N], by='Date']$V1]

Arun

On Friday, August 16, 2013 at 6:52 AM, arun wrote:
HI,
>This is a follow up from a post in R-help mailing list. (http://r.789695.n4.nabble.com/How-to-extract-last-value-in-each-group-td4673787.html).  
>
>
>
>
>In short, I tried the below using data.table(), but found to be slower than some of the other methods.  Steve Lianoglou also tried the same and got it much faster (system.time()~ 0.070  vs. ~40 ).
>
>
>###data
>
>
>dat1<- structure(list(Date = c("06/01/2010", "06/01/2010", "06/01/2010", 
>"06/01/2010", "06/02/2010", "06/02/2010", "06/02/2010", "06/02/2010", 
>"06/02/2010", "06/02/2010", "06/02/2010"), Time = c(1358L, 1359L, 
>1400L, 1700L, 331L, 332L, 334L, 335L, 336L, 337L, 338L), O = c(136.4, 
>136.4, 136.45, 136.55, 136.55, 136.7, 136.75, 136.8, 136.8, 136.75, 
>136.8), H = c(136.4, 136.5, 136.55, 136.55, 136.7, 136.7, 136.75, 
>136.8, 136.8, 136.8, 136.8), L = c(136.35, 136.35, 136.35, 136.55, 
>136.5, 136.65, 136.75, 136.8, 136.8, 136.75, 136.8), C = c(136.35, 
>136.5, 136.4, 136.55, 136.7, 136.65, 136.75, 136.8, 136.8, 136.8, 
>136.8), U = c(2L, 9L, 8L, 1L, 36L, 3L, 1L, 4L, 8L, 1L, 3L), D = c(12L, 
>6L, 7L, 0L, 6L, 1L, 0L, 0L, 0L, 2L, 0L)), .Names = c("Date", 
>"Time", "O", "H", "L", "C", "U", "D"), class = "data.frame", row.names = c(NA, 
>-11L))
>
>
>
>
>indx<- rep(1:nrow(dat1),1e5)
>dat2<- dat1[indx,]
>dat2[-c(1:11),1]<-format(rep(seq(as.Date("1080-01-01"),by=1,length.out=99999),each=11),"%m/%d/%Y")
> dat2<- dat2[order(dat2[,1],dat2[,2]),]
>row.names(dat2)<-1:nrow(dat2)
>
>
>
>
>
>
>#Some speed comparisons (more in the link):
>system.time(res1<-dat2[c(diff(as.numeric(as.factor(dat2$Date))),1)>0,])
>#   user  system elapsed 
> # 0.528   0.012   0.540 
> system.time(res7<- dat2[cumsum(rle(dat2[,1])$lengths),])
>#   user  system elapsed 
> # 0.156   0.000   0.155 
>
>
>
>
>library(data.table)
>system.time({
>dt1 <- data.table(dat2, key=c('Date', 'Time'))
> ans <- dt1[, .SD[.N], by='Date']})
>
>
> # user  system elapsed 
> #39.860   0.020  39.952   #############slower than many other methods
>ans1<- as.data.frame(ans)
> row.names(ans1)<- row.names(res7)
> attr(ans1,"row.names")<- attr(res7,"row.names")
> identical(ans1,res7)
>#[1] TRUE
>
>
>
>
>
>
>
>
>Steve Lianoglou reply is below:
>############################
>
>
>
>
>Amazing. This is what I get on my MacBook Pro, i7 @ 3GHz (very close
>specs to your machine):
>
>
>R> dt1 <- data.table(dat2, key=c('Date', 'Time'))
>R> system.time(ans <- dt1[, .SD[.N], by='Date'])
>   user  system elapsed
>  0.064   0.009   0.073  ###########################
>
>
>R> system.time(res7<- dat2[cumsum(rle(dat2[,1])$lengths),])
>   user  system elapsed
>  0.148   0.016   0.165
>
>
>On one of our compute server running who knows what processor on some
>version of linux, but shouldn't really matter as we're talking
>relative time to each other here:
>
>
>R> system.time(ans <- dt1[, .SD[.N], by='Date'])
>   user  system elapsed
>  0.160   0.012   0.170
>
>
>R> system.time(res7<- dat2[cumsum(rle(dat2[,1])$lengths),])
>   user  system elapsed
>  0.292   0.004   0.294
>##############################################
>
>
>My sessionInfo#######
>sessionInfo()
>R version 3.0.1 (2013-05-16)
>Platform: x86_64-unknown-linux-gnu (64-bit)  (linux mint 15)
>
>
>locale:
> [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C              
> [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8    
> [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8   
> [7] LC_PAPER=C                 LC_NAME=C                 
> [9] LC_ADDRESS=C               LC_TELEPHONE=C            
>[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C       
>
>
>attached base packages:
>[1] stats     graphics  grDevices utils     datasets  methods   base     
>
>
>other attached packages:
>[1] data.table_1.8.8 stringr_0.6.2    reshape2_1.2.2  
>
>
>loaded via a namespace (and not attached):
>[1] plyr_1.8    tools_3.0.1
>
>
>CPU ####################
>I use Dell XPS L502X
> * Processor 2nd Gen Core i7 Intel i7-2630QM / 2 GHz ( 2.9 GHz ) ( Quad-Core ) 
> * Memory 6 GB / 8 GB (max) 
> * Hard Drive 640 GB - Serial ATA-300 - 7200 rpm  
>
>
>Any help will be appreciated.
>Thanks.
>A.K.
>_______________________________________________
>datatable-help mailing list
>datatable-help at lists.r-forge.r-project.org
>https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help