[datatable-help] Slow execution: Extracting last value in each group
Arunkumar Srinivasan
aragorn168b at gmail.com
Fri Aug 16 08:27:34 CEST 2013
Sorry, but I'm not sure what your question is here. There seem to be different timings between you and Steve. Do you want to verify which one is correct? On my system, Steve's version takes 0.003 seconds.
However, a *faster* version than Steve's solution (on bigger data) would be:
x[x[, .I[.N], by='Date']$V1]
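To spell the idea out, here is a minimal sketch on a small made-up table. The names `dt`, `idx`, `last_rows` and `last_sd` and the data are illustrative only; `x` above is assumed to be your keyed data.table. `.I` holds the row numbers of the original table, so `.I[.N]` is the row number of the last row in each group. As far as I understand it, subsetting once by those row numbers avoids a separate `.SD[...]` subset call for every group, which is where the time tends to go when there are many groups.

```r
library(data.table)

# Small illustrative table (made-up values, not the full dat2 from the post)
dt <- data.table(
  Date = c("06/01/2010", "06/01/2010", "06/02/2010", "06/02/2010", "06/02/2010"),
  Time = c(1358L, 1700L, 331L, 332L, 338L),
  C    = c(136.35, 136.55, 136.70, 136.65, 136.80)
)
setkey(dt, Date, Time)  # sort by Date, then Time, as in the original post

# Step 1: for each Date, take the row number (.I) of the last row (.N).
# The unnamed result column is called V1 by default.
idx <- dt[, .I[.N], by = "Date"]$V1

# Step 2: a single subset of the table by those row numbers.
last_rows <- dt[idx]

# The .SD[.N] form gives the same rows, but subsets .SD once per group.
last_sd <- dt[, .SD[.N], by = "Date"]

print(last_rows)
```

To compare the two forms on your own machine, wrap each in system.time() on the full-size table; the gap should grow with the number of groups.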
Arun
On Friday, August 16, 2013 at 6:52 AM, arun wrote:
> Hi,
> This is a follow-up to a post on the R-help mailing list (http://r.789695.n4.nabble.com/How-to-extract-last-value-in-each-group-td4673787.html).
>
>
> In short, I tried the code below using data.table, but found it to be slower than some of the other methods. Steve Lianoglou tried the same code and got much faster timings (system.time() elapsed ~0.070 vs. ~40 seconds).
>
> ###data
>
> dat1<- structure(list(Date = c("06/01/2010", "06/01/2010", "06/01/2010",
> "06/01/2010", "06/02/2010", "06/02/2010", "06/02/2010", "06/02/2010",
> "06/02/2010", "06/02/2010", "06/02/2010"), Time = c(1358L, 1359L,
> 1400L, 1700L, 331L, 332L, 334L, 335L, 336L, 337L, 338L), O = c(136.4,
> 136.4, 136.45, 136.55, 136.55, 136.7, 136.75, 136.8, 136.8, 136.75,
> 136.8), H = c(136.4, 136.5, 136.55, 136.55, 136.7, 136.7, 136.75,
> 136.8, 136.8, 136.8, 136.8), L = c(136.35, 136.35, 136.35, 136.55,
> 136.5, 136.65, 136.75, 136.8, 136.8, 136.75, 136.8), C = c(136.35,
> 136.5, 136.4, 136.55, 136.7, 136.65, 136.75, 136.8, 136.8, 136.8,
> 136.8), U = c(2L, 9L, 8L, 1L, 36L, 3L, 1L, 4L, 8L, 1L, 3L), D = c(12L,
> 6L, 7L, 0L, 6L, 1L, 0L, 0L, 0L, 2L, 0L)), .Names = c("Date",
> "Time", "O", "H", "L", "C", "U", "D"), class = "data.frame", row.names = c(NA,
> -11L))
>
>
> indx<- rep(1:nrow(dat1),1e5)
> dat2<- dat1[indx,]
> dat2[-c(1:11),1]<-format(rep(seq(as.Date("1080-01-01"),by=1,length.out=99999),each=11),"%m/%d/%Y")
> dat2<- dat2[order(dat2[,1],dat2[,2]),]
> row.names(dat2)<-1:nrow(dat2)
>
>
>
> #Some speed comparisons (more in the link):
> system.time(res1<-dat2[c(diff(as.numeric(as.factor(dat2$Date))),1)>0,])
> # user system elapsed
> # 0.528 0.012 0.540
> system.time(res7<- dat2[cumsum(rle(dat2[,1])$lengths),])
> # user system elapsed
> # 0.156 0.000 0.155
>
>
> library(data.table)
> system.time({
> dt1 <- data.table(dat2, key=c('Date', 'Time'))
> ans <- dt1[, .SD[.N], by='Date']})
>
> # user system elapsed
> #39.860 0.020 39.952 #############slower than many other methods
> ans1<- as.data.frame(ans)
> row.names(ans1)<- row.names(res7)
> attr(ans1,"row.names")<- attr(res7,"row.names")
> identical(ans1,res7)
> #[1] TRUE
>
>
>
>
> Steve Lianoglou's reply is below:
> ############################
>
>
> Amazing. This is what I get on my MacBook Pro, i7 @ 3GHz (very close
> specs to your machine):
>
> R> dt1 <- data.table(dat2, key=c('Date', 'Time'))
> R> system.time(ans <- dt1[, .SD[.N], by='Date'])
> user system elapsed
> 0.064 0.009 0.073 ###########################
>
> R> system.time(res7<- dat2[cumsum(rle(dat2[,1])$lengths),])
> user system elapsed
> 0.148 0.016 0.165
>
> On one of our compute servers running who knows what processor on some
> version of Linux, but that shouldn't really matter as we're talking about
> timings relative to each other here:
>
> R> system.time(ans <- dt1[, .SD[.N], by='Date'])
> user system elapsed
> 0.160 0.012 0.170
>
> R> system.time(res7<- dat2[cumsum(rle(dat2[,1])$lengths),])
> user system elapsed
> 0.292 0.004 0.294
> ##############################################
>
> My sessionInfo#######
> sessionInfo()
> R version 3.0.1 (2013-05-16)
> Platform: x86_64-unknown-linux-gnu (64-bit) (linux mint 15)
>
> locale:
> [1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8
> [5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] data.table_1.8.8 stringr_0.6.2 reshape2_1.2.2
>
> loaded via a namespace (and not attached):
> [1] plyr_1.8 tools_3.0.1
>
> CPU ####################
> I use Dell XPS L502X
> * Processor 2nd Gen Core i7 Intel i7-2630QM / 2 GHz ( 2.9 GHz ) ( Quad-Core )
> * Memory 6 GB / 8 GB (max)
> * Hard Drive 640 GB - Serial ATA-300 - 7200 rpm
>
> Any help will be appreciated.
> Thanks.
> A.K.
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org)
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
>