[datatable-help] Slow execution: Extracting last value in each group
arun
smartpink111 at yahoo.com
Fri Aug 16 06:52:48 CEST 2013
Hi,
This is a follow-up to a post on the R-help mailing list (http://r.789695.n4.nabble.com/How-to-extract-last-value-in-each-group-td4673787.html).
In short, I tried the code below using data.table, but found it to be much slower than some of the other methods. Steve Lianoglou ran the same code and got a much faster result (system.time() elapsed ~0.070 vs. ~40 seconds on my machine).
###data
dat1<- structure(list(Date = c("06/01/2010", "06/01/2010", "06/01/2010",
"06/01/2010", "06/02/2010", "06/02/2010", "06/02/2010", "06/02/2010",
"06/02/2010", "06/02/2010", "06/02/2010"), Time = c(1358L, 1359L,
1400L, 1700L, 331L, 332L, 334L, 335L, 336L, 337L, 338L), O = c(136.4,
136.4, 136.45, 136.55, 136.55, 136.7, 136.75, 136.8, 136.8, 136.75,
136.8), H = c(136.4, 136.5, 136.55, 136.55, 136.7, 136.7, 136.75,
136.8, 136.8, 136.8, 136.8), L = c(136.35, 136.35, 136.35, 136.55,
136.5, 136.65, 136.75, 136.8, 136.8, 136.75, 136.8), C = c(136.35,
136.5, 136.4, 136.55, 136.7, 136.65, 136.75, 136.8, 136.8, 136.8,
136.8), U = c(2L, 9L, 8L, 1L, 36L, 3L, 1L, 4L, 8L, 1L, 3L), D = c(12L,
6L, 7L, 0L, 6L, 1L, 0L, 0L, 0L, 2L, 0L)), .Names = c("Date",
"Time", "O", "H", "L", "C", "U", "D"), class = "data.frame", row.names = c(NA,
-11L))
indx<- rep(1:nrow(dat1),1e5)
dat2<- dat1[indx,]
dat2[-c(1:11),1]<-format(rep(seq(as.Date("1080-01-01"),by=1,length.out=99999),each=11),"%m/%d/%Y")
dat2<- dat2[order(dat2[,1],dat2[,2]),]
row.names(dat2)<-1:nrow(dat2)
#Some speed comparisons (more in the link):
system.time(res1<-dat2[c(diff(as.numeric(as.factor(dat2$Date))),1)>0,])
# user system elapsed
# 0.528 0.012 0.540
system.time(res7<- dat2[cumsum(rle(dat2[,1])$lengths),])
# user system elapsed
# 0.156 0.000 0.155
library(data.table)
system.time({
dt1 <- data.table(dat2, key=c('Date', 'Time'))
ans <- dt1[, .SD[.N], by='Date']})
# user system elapsed
#39.860 0.020 39.952 #############slower than many other methods
ans1<- as.data.frame(ans)
row.names(ans1)<- row.names(res7)
attr(ans1,"row.names")<- attr(res7,"row.names")
identical(ans1,res7)
#[1] TRUE
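For readers unfamiliar with the rle() idiom used for res7 above, here is a minimal base-R sketch of why it picks the last row of each group. The vector `dates` is a made-up toy example, not the data from this post, and the `duplicated()` variant is an alternative that was not timed here.

```r
# Toy grouping column (hypothetical), already sorted so each group is contiguous.
dates <- c("06/01/2010", "06/01/2010", "06/01/2010", "06/02/2010", "06/02/2010")

# rle() reports the length of each run of identical values.
runs <- rle(dates)

# The cumulative sum of the run lengths is the index of the last element of each run.
last_rows <- cumsum(runs$lengths)

# Alternative: keep the positions that are not duplicated when scanning from the end.
# Unlike the rle() idiom, this does not require the groups to be contiguous.
last_dup <- which(!duplicated(dates, fromLast = TRUE))

print(last_rows)
print(last_dup)
```

Both index vectors can be used to subset the rows of a data frame, as in dat2[last_rows, ]. The rle() version relies on the data being ordered by the grouping column, which holds for dat2 after the order() call above.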
Steve Lianoglou's reply is below:
############################
Amazing. This is what I get on my MacBook Pro, i7 @ 3GHz (very close
specs to your machine):
R> dt1 <- data.table(dat2, key=c('Date', 'Time'))
R> system.time(ans <- dt1[, .SD[.N], by='Date'])
user system elapsed
0.064 0.009 0.073 ###########################
R> system.time(res7<- dat2[cumsum(rle(dat2[,1])$lengths),])
user system elapsed
0.148 0.016 0.165
On one of our compute servers, running who knows what processor on some
version of Linux (which shouldn't really matter, as we're comparing the
timings relative to each other here):
R> system.time(ans <- dt1[, .SD[.N], by='Date'])
user system elapsed
0.160 0.012 0.170
R> system.time(res7<- dat2[cumsum(rle(dat2[,1])$lengths),])
user system elapsed
0.292 0.004 0.294
##############################################
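One formulation that might be worth timing against .SD[.N] is to collect the row number of each group's last row with .I and then subset once. The sketch below uses a small made-up table (`toy` is hypothetical, not dat2) and assumes a data.table version that provides .I. It only checks that the two forms agree on the toy data. It has not been timed, and it is not offered as an explanation of the difference between the two machines.

```r
library(data.table)

# Toy data (hypothetical): two dates, already in Date/Time order.
toy <- data.frame(
  Date = c("06/01/2010", "06/01/2010", "06/02/2010", "06/02/2010", "06/02/2010"),
  Time = c(1358L, 1700L, 331L, 332L, 338L),
  C    = c(136.35, 136.55, 136.70, 136.65, 136.80),
  stringsAsFactors = FALSE
)
dt <- data.table(toy, key = c("Date", "Time"))

# .I holds row numbers of dt, so .I[.N] is the row number of each group's last row.
last_idx <- dt[, .I[.N], by = Date]$V1

# Subset once with the collected row numbers instead of indexing .SD in every group.
ans_I <- dt[last_idx]

# The formulation used in this post, for comparison.
ans_SD <- dt[, .SD[.N], by = Date]

print(ans_I)
print(ans_SD)
```

On dat2 the same pattern would be dt1[dt1[, .I[.N], by = Date]$V1], which can be wrapped in system.time() like the other methods.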
My sessionInfo#######
sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-unknown-linux-gnu (64-bit) (linux mint 15)
locale:
[1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8
[5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.8.8 stringr_0.6.2 reshape2_1.2.2
loaded via a namespace (and not attached):
[1] plyr_1.8 tools_3.0.1
CPU ####################
I use a Dell XPS L502X:
* Processor 2nd Gen Core i7 Intel i7-2630QM / 2 GHz ( 2.9 GHz ) ( Quad-Core )
* Memory 6 GB / 8 GB (max)
* Hard Drive 640 GB - Serial ATA-300 - 7200 rpm
Any help will be appreciated.
Thanks.
A.K.