[datatable-help] Slow execution: Extracting last value in each group

Arunkumar Srinivasan aragorn168b at gmail.com
Fri Aug 16 08:27:34 CEST 2013


Sorry, but I'm not sure what your question is here. You and Steve seem to get very different timings for the same code. Do you want someone to verify which one is right? On my system, Steve's version takes 0.003 seconds.

However, a *faster* version than Steve's solution (on bigger data) would be:

    x[x[, .I[.N], by='Date']$V1] 
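
To show what that line does, here is a minimal sketch on a small made-up table (the data and the names idx, last_by_index and last_by_sd are only for illustration, they are not from the thread). As far as I understand it, .I[.N] only collects one row number per group and the table is subset once at the end, whereas .SD[.N] subsets .SD separately for every group, which is probably where the extra time goes when there are many groups.

```r
library(data.table)

# Toy data (hypothetical): three dates with 4, 3 and 2 rows each
x <- data.table(
  Date = rep(c("06/01/2010", "06/02/2010", "06/03/2010"), times = c(4, 3, 2)),
  Time = c(1358L, 1359L, 1400L, 1700L, 331L, 332L, 334L, 900L, 905L),
  C    = c(136.35, 136.50, 136.40, 136.55, 136.70, 136.65, 136.75, 137.00, 137.10)
)
setkey(x, Date, Time)

# .I holds the row numbers of the current group within x,
# so .I[.N] is the row number of the last row of each group
idx <- x[, .I[.N], by = "Date"]$V1

# A single subset of the whole table using those row numbers
last_by_index <- x[idx]

# The per-group version from the original post, for comparison
last_by_sd <- x[, .SD[.N], by = "Date"]
```

Both results should contain the same rows; only the amount of work done per group differs.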

Arun


On Friday, August 16, 2013 at 6:52 AM, arun wrote:

> Hi,
> This is a follow-up to a post on the R-help mailing list (http://r.789695.n4.nabble.com/How-to-extract-last-value-in-each-group-td4673787.html).
> 
> 
> In short, I tried the code below using data.table, but found it to be slower than some of the other methods. Steve Lianoglou ran the same code and got a much faster result (system.time() elapsed of ~0.070 s vs. ~40 s on my machine).
> 
> ###data
> 
> dat1<- structure(list(Date = c("06/01/2010", "06/01/2010", "06/01/2010", 
> "06/01/2010", "06/02/2010", "06/02/2010", "06/02/2010", "06/02/2010", 
> "06/02/2010", "06/02/2010", "06/02/2010"), Time = c(1358L, 1359L, 
> 1400L, 1700L, 331L, 332L, 334L, 335L, 336L, 337L, 338L), O = c(136.4, 
> 136.4, 136.45, 136.55, 136.55, 136.7, 136.75, 136.8, 136.8, 136.75, 
> 136.8), H = c(136.4, 136.5, 136.55, 136.55, 136.7, 136.7, 136.75, 
> 136.8, 136.8, 136.8, 136.8), L = c(136.35, 136.35, 136.35, 136.55, 
> 136.5, 136.65, 136.75, 136.8, 136.8, 136.75, 136.8), C = c(136.35, 
> 136.5, 136.4, 136.55, 136.7, 136.65, 136.75, 136.8, 136.8, 136.8, 
> 136.8), U = c(2L, 9L, 8L, 1L, 36L, 3L, 1L, 4L, 8L, 1L, 3L), D = c(12L, 
> 6L, 7L, 0L, 6L, 1L, 0L, 0L, 0L, 2L, 0L)), .Names = c("Date", 
> "Time", "O", "H", "L", "C", "U", "D"), class = "data.frame", row.names = c(NA, 
> -11L))
> 
> 
> indx <- rep(1:nrow(dat1), 1e5)
> dat2 <- dat1[indx, ]
> dat2[-c(1:11), 1] <- format(rep(seq(as.Date("1080-01-01"), by = 1, length.out = 99999), each = 11), "%m/%d/%Y")
> dat2 <- dat2[order(dat2[, 1], dat2[, 2]), ]
> row.names(dat2) <- 1:nrow(dat2)
> 
> 
> 
> # Some speed comparisons (more in the link):
> system.time(res1 <- dat2[c(diff(as.numeric(as.factor(dat2$Date))), 1) > 0, ])
> #   user  system elapsed
> #  0.528   0.012   0.540
> system.time(res7 <- dat2[cumsum(rle(dat2[, 1])$lengths), ])
> #   user  system elapsed
> #  0.156   0.000   0.155
> 
> 
> library(data.table)
> system.time({
>   dt1 <- data.table(dat2, key = c('Date', 'Time'))
>   ans <- dt1[, .SD[.N], by = 'Date']
> })
> #   user  system elapsed
> # 39.860   0.020  39.952   # slower than many other methods
> ans1 <- as.data.frame(ans)
> row.names(ans1) <- row.names(res7)
> attr(ans1, "row.names") <- attr(res7, "row.names")
> identical(ans1, res7)
> # [1] TRUE
> 
> 
> 
> 
> Steve Lianoglou's reply is below:
> ############################
> 
> 
> Amazing. This is what I get on my MacBook Pro, i7 @ 3GHz (very close
> specs to your machine):
> 
> R> dt1 <- data.table(dat2, key=c('Date', 'Time'))
> R> system.time(ans <- dt1[, .SD[.N], by='Date'])
>   user  system elapsed
>   0.064  0.009  0.073  ###########################
> 
> R> system.time(res7<- dat2[cumsum(rle(dat2[,1])$lengths),])
>   user  system elapsed
>   0.148  0.016  0.165
> 
> On one of our compute servers, running who knows what processor on some
> version of Linux (which shouldn't really matter, as we're talking about
> times relative to each other here):
> 
> R> system.time(ans <- dt1[, .SD[.N], by='Date'])
>   user  system elapsed
>   0.160  0.012  0.170
> 
> R> system.time(res7<- dat2[cumsum(rle(dat2[,1])$lengths),])
>   user  system elapsed
>   0.292  0.004  0.294
> ##############################################
> 
> My sessionInfo#######
> sessionInfo()
> R version 3.0.1 (2013-05-16)
> Platform: x86_64-unknown-linux-gnu (64-bit)  (linux mint 15)
> 
> locale:
>  [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C              
>  [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8    
>  [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8   
>  [7] LC_PAPER=C                 LC_NAME=C                 
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
> [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C       
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base     
> 
> other attached packages:
> [1] data.table_1.8.8 stringr_0.6.2    reshape2_1.2.2  
> 
> loaded via a namespace (and not attached):
> [1] plyr_1.8    tools_3.0.1
> 
> CPU ####################
> I use Dell XPS L502X
>  * Processor 2nd Gen Core i7 Intel i7-2630QM / 2 GHz ( 2.9 GHz ) ( Quad-Core ) 
>  * Memory 6 GB / 8 GB (max) 
>  * Hard Drive 640 GB - Serial ATA-300 - 7200 rpm  
> 
> Any help will be appreciated.
> Thanks.
> A.K.

