[datatable-help] Slow execution: Extracting last value in each group
arun
smartpink111 at yahoo.com
Fri Aug 16 17:02:16 CEST 2013
Hi Arun, Frank & Steve,
Thanks for responding to my post.
I did the 'microbenchmark' and 'debugonce(....)' using the same dataset 'dat2'.
f1 <- function (dataFrame) {
dataFrame[unlist(with(dataFrame, tapply(Time, list(Date), FUN = function(x) x == max(x)))), ]
}
f2 <- function (dataFrame) {
dataFrame[cumsum(with(dataFrame, tapply(Time, list(Date), FUN = which.max))), ]
}
isLastInRun <- function(x) c(x[-1] != x[-length(x)], TRUE)
f3 <- function(dataFrame) {
dataFrame[ isLastInRun(dataFrame$Date), ]
}
f4<- function(dataFrame){
dataFrame[as.logical(with(dataFrame,ave(Time,Date,FUN=function(x) x==max(x)))),]
}
f5<- function(dataFrame){
dataFrame[cumsum(rle(dataFrame[,1])$lengths),]
}
library(data.table)
dt1 <- data.table(dat2, key=c('Date', 'Time'))
f6<- function(dataTable){
dataTable[, .SD[.N], by='Date']}
f7<- function(dataTable){
dataTable[dataTable[, .I[.N], by='Date']$V1]
}
f8<- function(dataTable){
dataTable[J(unique(Date)),,mult='last']
}
f9<- function(dataTable){
dataTable[dataTable[, .I[.N], by='Date']$V1]
}
library(microbenchmark)
microbenchmark(f1(dat2),
f2(dat2),
f3(dat2),
f4(dat2),
f5(dat2),
f6(dt1),
f7(dt1),
f8(dt1),
f9(dt1),
times=100)
#Unit: milliseconds
# expr min lq median uq max neval
# f1(dat2) 2046.59313 2318.57397 2414.21020 2533.28214 2842.9609 100
# f2(dat2) 940.97742 1000.56395 1027.53096 1100.67961 1705.4570 100
# f3(dat2) 315.06253 325.02696 341.21953 364.85656 533.9347 100
# f4(dat2) 804.89703 858.14888 899.55182 964.39989 1129.9311 100
# f5(dat2) 149.55682 153.67846 167.23934 176.56643 292.3134 100
# f6(dt1) 46665.61046 48234.78637 48818.88141 49366.46810 51112.7930 100 ###############################slowest
# f7(dt1) 71.02789 76.97008 85.09989 97.82982 387.3801 100
# f8(dt1) 77.74961 78.94773 80.00620 89.00892 205.2492 100
# f9(dt1) 71.76817 76.40184 79.89194 100.57348 282.8359 100
#Comparing the fastest among data.table with f5()
system.time(res8<- f8(dt1))
# user system elapsed
# 0.08 0.00 0.08
system.time(res5<- f5(dat2))
# user system elapsed
# 0.156 0.000 0.153
res8New<- as.data.frame(res8)
row.names(res8New)<- row.names(res5)
attr(res8New,"row.names")<- attr(res5,"row.names")
identical(res8New,res5)
#[1] TRUE
#During debugging: the step that took long time to execute is: (Same as Frank reported)
debugonce(data.table:::`[.data.table`)
dt1[, .SD[.N], by='Date']
debug: ans = .Call(Cdogroups, x, xcols, groups, grpcols, jiscols, grporder,
o__, f__, len__, jsub, SDenv, cols, newnames, verbose)
#I use Linux mint 15.
sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8
[5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.8.8 microbenchmark_1.3-0 stringr_0.6.2
[4] reshape2_1.2.2
loaded via a namespace (and not attached):
[1] plyr_1.8 tcltk_3.0.1 tools_3.0.1
A.K.
Steve,
Thank you.
arun,
Could you run it with `microbenchmark` instead of system.time (with times = 100 or so) and paste the results here?
Also, maybe you could use debugonce(data.table:::`[.data.table`) and then run
x[, .SD[.N], by='Date']
to go step by step to find out the line that causes the lag, perhaps?
Arun
________________________________
From: Arunkumar Srinivasan <aragorn168b at gmail.com>
To: arun <smartpink111 at yahoo.com>
Sent: Friday, August 16, 2013 2:27 AM
Subject: Re: [datatable-help] Slow execution: Extracting last value in each group
Sorry, but I'm not sure what your question is here. There seems to be different timings between you and Steve. You want to get it verified as to which one is true? On my system, Steve's takes 0.003 seconds.
However, a *faster* version than Steve's solution (on bigger data) would be:
x[x[, .I[.N], by='Date']$V1]
Arun
On Friday, August 16, 2013 at 6:52 AM, arun wrote:
HI,
>This is a follow up from a post in R-help mailing list. (http://r.789695.n4.nabble.com/How-to-extract-last-value-in-each-group-td4673787.html).
>
>
>
>
>In short, I tried the below using data.table(), but found to be slower than some of the other methods. Steve Lianoglou also tried the same and got it much faster (system.time()~ 0.070 vs. ~40 ).
>
>
>###data
>
>
>dat1<- structure(list(Date = c("06/01/2010", "06/01/2010", "06/01/2010",
>"06/01/2010", "06/02/2010", "06/02/2010", "06/02/2010", "06/02/2010",
>"06/02/2010", "06/02/2010", "06/02/2010"), Time = c(1358L, 1359L,
>1400L, 1700L, 331L, 332L, 334L, 335L, 336L, 337L, 338L), O = c(136.4,
>136.4, 136.45, 136.55, 136.55, 136.7, 136.75, 136.8, 136.8, 136.75,
>136.8), H = c(136.4, 136.5, 136.55, 136.55, 136.7, 136.7, 136.75,
>136.8, 136.8, 136.8, 136.8), L = c(136.35, 136.35, 136.35, 136.55,
>136.5, 136.65, 136.75, 136.8, 136.8, 136.75, 136.8), C = c(136.35,
>136.5, 136.4, 136.55, 136.7, 136.65, 136.75, 136.8, 136.8, 136.8,
>136.8), U = c(2L, 9L, 8L, 1L, 36L, 3L, 1L, 4L, 8L, 1L, 3L), D = c(12L,
>6L, 7L, 0L, 6L, 1L, 0L, 0L, 0L, 2L, 0L)), .Names = c("Date",
>"Time", "O", "H", "L", "C", "U", "D"), class = "data.frame", row.names = c(NA,
>-11L))
>
>
>
>
>indx<- rep(1:nrow(dat1),1e5)
>dat2<- dat1[indx,]
>dat2[-c(1:11),1]<-format(rep(seq(as.Date("1080-01-01"),by=1,length.out=99999),each=11),"%m/%d/%Y")
> dat2<- dat2[order(dat2[,1],dat2[,2]),]
>row.names(dat2)<-1:nrow(dat2)
>
>
>
>
>
>
>#Some speed comparisons (more in the link):
>system.time(res1<-dat2[c(diff(as.numeric(as.factor(dat2$Date))),1)>0,])
># user system elapsed
> # 0.528 0.012 0.540
> system.time(res7<- dat2[cumsum(rle(dat2[,1])$lengths),])
># user system elapsed
> # 0.156 0.000 0.155
>
>
>
>
>library(data.table)
>system.time({
>dt1 <- data.table(dat2, key=c('Date', 'Time'))
> ans <- dt1[, .SD[.N], by='Date']})
>
>
> # user system elapsed
> #39.860 0.020 39.952 #############slower than many other methods
>ans1<- as.data.frame(ans)
> row.names(ans1)<- row.names(res7)
> attr(ans1,"row.names")<- attr(res7,"row.names")
> identical(ans1,res7)
>#[1] TRUE
>
>
>
>
>
>
>
>
>Steve Lianoglou reply is below:
>############################
>
>
>
>
>Amazing. This is what I get on my MacBook Pro, i7 @ 3GHz (very close
>specs to your machine):
>
>
>R> dt1 <- data.table(dat2, key=c('Date', 'Time'))
>R> system.time(ans <- dt1[, .SD[.N], by='Date'])
> user system elapsed
> 0.064 0.009 0.073 ###########################
>
>
>R> system.time(res7<- dat2[cumsum(rle(dat2[,1])$lengths),])
> user system elapsed
> 0.148 0.016 0.165
>
>
>On one of our compute server running who knows what processor on some
>version of linux, but shouldn't really matter as we're talking
>relative time to each other here:
>
>
>R> system.time(ans <- dt1[, .SD[.N], by='Date'])
> user system elapsed
> 0.160 0.012 0.170
>
>
>R> system.time(res7<- dat2[cumsum(rle(dat2[,1])$lengths),])
> user system elapsed
> 0.292 0.004 0.294
>##############################################
>
>
>My sessionInfo#######
>sessionInfo()
>R version 3.0.1 (2013-05-16)
>Platform: x86_64-unknown-linux-gnu (64-bit) (linux mint 15)
>
>
>locale:
> [1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8
> [5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
>[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
>
>
>attached base packages:
>[1] stats graphics grDevices utils datasets methods base
>
>
>other attached packages:
>[1] data.table_1.8.8 stringr_0.6.2 reshape2_1.2.2
>
>
>loaded via a namespace (and not attached):
>[1] plyr_1.8 tools_3.0.1
>
>
>CPU ####################
>I use Dell XPS L502X
> * Processor 2nd Gen Core i7 Intel i7-2630QM / 2 GHz ( 2.9 GHz ) ( Quad-Core )
> * Memory 6 GB / 8 GB (max)
> * Hard Drive 640 GB - Serial ATA-300 - 7200 rpm
>
>
>Any help will be appreciated.
>Thanks.
>A.K.
>_______________________________________________
>datatable-help mailing list
>datatable-help at lists.r-forge.r-project.org
>https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
More information about the datatable-help
mailing list