[datatable-help] Bigger table, smaller access time - how is this possible?

Matthew Dowle mdowle at mdowle.plus.com
Tue Nov 29 09:12:45 CET 2011


I don't follow. The elapsed time is 0.005 seconds in all cases. The
times are extremely small anyway (5ms); the differences seem to be just
noise.

We're used to seeing examples like the one in the examples section of
help(":=") where 591s is reduced to 1.1s, roughly a 500-fold speedup. More
importantly, the wall clock time there (10 minutes) is meaningful and worth
saving, and (hopefully) readers understand that the saving scales; i.e., a
10-minute saving can easily become hours with larger data.

We can talk on the 5ms scale, too, but you'll need to be much more
precise and read up on the subject first, please.
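
For timings on the 5ms scale, one option is to repeat the keyed lookup many
times inside a single system.time() call and take the minimum over a few
repeated runs, as suggested earlier in this thread. The following is only a
minimal sketch under those assumptions: time_lookup, reps and runs are
hypothetical names introduced here for illustration and are not part of
data.table. The hex strings are built once, outside the timed region, so the
string cache effect is kept out of the measurement.

```r
library(data.table)

# Hypothetical helper (not part of data.table): time `reps` keyed lookups
# inside one system.time() call, repeat that `runs` times, and return the
# minimum elapsed seconds per single lookup.
time_lookup <- function(dt, item, reps = 200L, runs = 3L) {
  elapsed <- vapply(seq_len(runs), function(r) {
    system.time(for (k in seq_len(reps)) dt[J(item)])[["elapsed"]]
  }, numeric(1))
  min(elapsed) / reps
}

set.seed(1)
per_lookup <- sapply(0:1, function(i) {
  n  <- 6e5 * 10^i
  n1 <- n / 5000
  # Build the series names once, before any timing starts
  keys <- as.character(as.hexmode(1:n1))
  dt <- data.table(index = 1:n,
                   seriesname = rep(keys, each = 5000),
                   value = rnorm(n))
  setkey(dt, seriesname)
  time_lookup(dt, keys[n1])
})

# Seconds per lookup for the smaller and the larger table
per_lookup
```

Because each measurement covers many lookups, the total elapsed time is well
above the timer resolution, so any remaining difference between the two
tables is easier to tell apart from noise.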


On Tue, 2011-11-29 at 10:56 +0530, Ashim Kapoor wrote:
> Dear Matthew,
> 
> Many thanks for your email.
> 
> Following your advice I split out the as.character(as.hexmode( )) and
> ran it many times. The results swing both ways.
> 
> 
> 
> > library(xtable)
> > library(data.table)
> > start.size<-6e+5
> > 
> > time.data.table<-list()
> > 
> > for (i in 0:1){
> + n<-start.size*10^i
> + n1<-n/5000
> + my.data.table<-data.table(index=1:n,seriesname=rep(as.character(as.hexmode(1:n1)),each=5000),value=rnorm(n))
> + setkey(my.data.table,"seriesname")
> + searchitem<-as.character(as.hexmode(n1))
> + time.data.table[[i+1]]<-system.time(my.data.table[J(searchitem)])
> + }
> > 
> > rbind(time.data.table[[1]],time.data.table[[2]])
>      user.self sys.self elapsed user.child sys.child
> [1,]     0.008        0   0.005          0         0
> [2,]     0.008        0   0.005          0         0
> 
> > rbind(time.data.table[[1]],time.data.table[[2]])
>      user.self sys.self elapsed user.child sys.child
> [1,]     0.008        0   0.005          0         0
> [2,]     0.004        0   0.005          0         0
> 
> > rbind(time.data.table[[1]],time.data.table[[2]])
>      user.self sys.self elapsed user.child sys.child
> [1,]     0.004        0   0.005          0         0
> [2,]     0.004        0   0.005          0         0
> 
> > rbind(time.data.table[[1]],time.data.table[[2]])
>      user.self sys.self elapsed user.child sys.child
> [1,]     0.008        0   0.005          0         0
> [2,]     0.008        0   0.005          0         0
> 
> > rbind(time.data.table[[1]],time.data.table[[2]])
>      user.self sys.self elapsed user.child sys.child
> [1,]     0.004    0.004   0.005          0         0
> [2,]     0.009    0.000   0.005          0         0
> 
> Thank you,
> Ashim
> 
> 
> On Mon, Nov 28, 2011 at 4:53 PM, Matthew Dowle
> <mdowle at mdowle.plus.com> wrote:
>         
>         Hi,
>         Welcome to the list. Quick first response..
>         
>         Comparing differences of 4ms between single runs is not usually
>         very robust, due to overhead and cache effects. We usually prefer
>         differences of many seconds or minutes, and even then take the
>         minimum of 3 repeated runs, using a package such as rbenchmark
>         or microbenchmark.
>         
>         as.character(as.hexmode()) will install those strings in R's
>         global string cache. The 2nd time will be faster as all those
>         strings are already cached. Whether that explains this case I
>         don't know; it seems plausible as it's only 4ms. That part could
>         be split out, repeated and timed separately.
>         
>         I think a simpler example would be possible, too. I missed the
>         reason why it's in a loop through 0:1, and at 4ms something like
>         that might be making a tiny difference.
>         
>         HTH, Matthew
>         
>         > Dear all,
>         >
>         > Please see my reproducible example below. My question is: why
>         > does the 2nd table, which is bigger, have a smaller access time?
>         >
>         >> library(xtable)
>         >> library(data.table)
>         > data.table 1.7.2  For help type: help("data.table")
>         >> start.size<-6e+5
>         >>
>         >> time.data.table<-list()
>         >>
>         >> for (i in 0:1){
>         > + n<-start.size*10^i
>         > + n1<-n/5000
>         > + my.data.table<-data.table(index=1:n,seriesname=rep(as.character(as.hexmode(1:n1)),each=5000),value=rnorm(n))
>         > + setkey(my.data.table,"seriesname")
>         > + time.data.table[[i+1]]<-system.time(my.data.table[J(as.character(as.hexmode(n1/4))),])
>         > + }
>         >
>         >>
>         >> rbind(time.data.table[[1]],time.data.table[[2]])
>         >      user.self sys.self elapsed user.child sys.child
>         > [1,]     0.008        0   0.008          0         0
>         > [2,]     0.004        0   0.004          0         0
>         >> time.data.table[[1]]
>         >    user  system elapsed
>         >   0.008   0.000   0.008
>         >> time.data.table[[2]]
>         >    user  system elapsed
>         >   0.004   0.000   0.004
>         >>
>         >
>         > Many thanks,
>         > Ashim
>         
>         > _______________________________________________
>         > datatable-help mailing list
>         > datatable-help at lists.r-forge.r-project.org
>         >
>         https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>         
>         
> 



