[datatable-help] data.table vs matrix speed

Matt Dowle mdowle at mdowle.plus.com
Wed Jul 9 18:52:36 CEST 2014


Nice example.  Yes this is the way to use it and I agree more 
readable.   But I fear it isn't actually working as you expected. Each 
component of `:=` doesn't see previous results, yet (not yet 
implemented).  Easier to see that in a simple example :

 > DT = data.table(a=1:3,b=1:6)
 > DT
    a b
1: 1 1
2: 2 2
3: 3 3
4: 1 4
5: 2 5
6: 3 6
 > DT[,`:=`(b=1L, d=sum(b)), by=a]
 > DT
    a b d
1: 1 1 5   # all the RHS got evaluated first, before starting to assign 
the results.
2: 2 1 7
3: 3 1 9
4: 1 1 5
5: 2 1 7
6: 3 1 9
 >

To get the result you want, you currently have to add an extra `<-`.  
Like this :

 > DT = data.table(a=1:3,b=1:6)   # start fresh
 > DT
    a b
1: 1 1
2: 2 2
3: 3 3
4: 1 4
5: 2 5
6: 3 6
 > DT[,`:=`(b=b<-1L, d=sum(b)), by=a]   # extra b<-
 > DT
    a b d
1: 1 1 1
2: 2 1 1
3: 3 1 1
4: 1 1 1
5: 2 1 1
6: 3 1 1
 >

Clearly in your example, since you're using earlier columns in later 
ones, that becomes onerous and bug prone due to typos, but shouldn't 
slow it down :

pre.coupleDT <- function(serostates, sexually.active) {
     serostates[sexually.active , `:=`(
         s..   = s..   <- s.. * (1-p.m.bef) * (1-p.f.bef),
         mb.a1 = mb.a1 <- s.. * p.m.bef * (1-p.f.bef),
         mb.a2 = mb.a2 <- mb.a1 * (1 - p.f.bef),
         mb.   = mb.   <- mb.a2 * (1 - p.f.bef) + mb. * (1 - p.f.bef),
         f.ba1 = f.bal <- s.. * p.f.bef * (1-p.m.bef),
         f.ba2 = f.ba2 <- f.ba1 * (1 - p.m.bef),
         f.b   = f.b   <- f.ba2 * (1 - p.m.bef) + f.b * (1 - p.m.bef),
         hb1b2 = hb1b2 <- hb1b2 + .5  *  s.. * p.m.bef * p.f.bef + (mb.a1 + mb.a2 + mb.)  *  p.f.bef,
         hb2b1 =          hb2b1 + .5  *  s.. * p.m.bef * p.f.bef + (f.ba1 + f.ba2 + f.b)  *  p.m.bef)
            ]
     return(serostates)
}


It's on the list to change it to the way you expected,  and we all want 
that.  It involves a change quite deep down in the C code so isn't done 
yet,  although there's nothing particularly hard about it.

In terms of why data.table is faster here, consider the repeated :

     temp[,'s..']

The `[` there is a function call; is.function(`[`)==TRUE. And each time 
the 's..' string appears, it looks up which column number corresponds to 
that name. There are 28 calls in your matrix version. It isn't so much 
matrix vs data.table, more the access method. In the data.table version, 
once you're inside scope, it's just symbol lookup (the 28 calls to `[` 
are gone, as are the 28 lookups of 'colname').

There may be some copies going on as well; e.g. 
serostates[sexually.active,] <- temp.   Run both through Rprof() and it 
might reveal more.

I can't think of a better way to use data.table. But note that the 
benchmark is pretty meaningless. It's being looped 100 times presumably 
because one run is so quick. This is quite a bug bear when we see this 
done online. The only way to scale up, is to increase the data size, 
perhaps by 100 times in this example. Then a single run takes a 
measurable amount of time (say 10 seconds or more) and the industry rule 
of thumb is to report the minimum of three consecutive runs. The 
inferences are usually very different than when you repeat a tiny test 
many times. The data has to be much much bigger than L2/L3 cache 
(typically 8MB but varies widely), e.g. 1GB or more.  This matrix is 
just 6MB and likely fits entirely in cache, depending on how big your 
cache is (see output of lscpu on unix/mac, or system info on Windows).  
Unless of course the nature of the task is to iterate,  in which case 
the overhead of the `[` call can become significant, and is why we added 
set() as a loopable `:=`.

HTH
Matt


On 09/07/14 16:30, Steve Bellan wrote:
> I'm trying to optimize the speed of a script that iteratively updates state variables for several thousands of individuals through time though only some individuals are active at each point in time. I had been doing this with matrices but was wondering how it compared with data.table since the latter seems to be more readable. I'm finding that my data.table implementation is about 2-3 times faster, which seems surprising since I thought matrices should be faster. It makes me wonder if there are ways to speed up either implementation. Any help is much appreciated! Here's an example of the code:
>
>
> n <- 10^5
> k <- 9
> serostates <- matrix(0,n,k)
> serostates <- as.data.table(serostates)
> setnames(serostates, 1:k, c('s..', 'mb.a1', 'mb.a2', 'mb.', 'f.ba1', 'f.ba2', 'f.b', 'hb1b2', 'hb2b1'))
> serostates[, `:=`(s.. = 1)]
> serostates
> serostatesMat <- as.matrix(serostates)
>
> pre.coupleDT <- function(serostates, sexually.active) {
>      serostates[sexually.active , `:=`(
>          s..   = s.. * (1-p.m.bef) * (1-p.f.bef),
>          mb.a1 = s.. * p.m.bef * (1-p.f.bef),
>          mb.a2 = mb.a1 * (1 - p.f.bef),
>          mb.   = mb.a2 * (1 - p.f.bef) + mb. * (1 - p.f.bef),
>          f.ba1 = s.. * p.f.bef * (1-p.m.bef),
>          f.ba2 = f.ba1 * (1 - p.m.bef),
>          f.b   = f.ba2 * (1 - p.m.bef) + f.b * (1 - p.m.bef),
>          hb1b2 = hb1b2 + .5  *  s.. * p.m.bef * p.f.bef + (mb.a1 + mb.a2 + mb.)  *  p.f.bef,
>          hb2b1 = hb2b1 + .5  *  s.. * p.m.bef * p.f.bef + (f.ba1 + f.ba2 + f.b)  *  p.m.bef)
>             ]
>      return(serostates)
> }
>
>
> pre.coupleMat <- function(serostates, sexually.active) {
>      temp <- serostates[sexually.active,]
>      temp[,'s..']   = temp[,'s..'] * (1-p.m.bef) * (1-p.f.bef)
>      temp[,'mb.a1'] = temp[,'s..'] * p.m.bef * (1-p.f.bef)
>      temp[,'mb.a2'] = temp[,'mb.a1'] * (1 - p.f.bef)
>      temp[,'mb.'] = temp[,'mb.a2'] * (1 - p.f.bef) + temp[,'mb.'] * (1 - p.f.bef)
>      temp[,'f.ba1'] = temp[,'s..'] * p.f.bef * (1-p.m.bef)
>      temp[,'f.ba2'] = temp[,'f.ba1'] * (1 - p.m.bef)
>      temp[,'f.b'] = temp[,'f.ba2'] * (1 - p.m.bef) + temp[,'f.b'] * (1 - p.m.bef)
>      temp[,'hb1b2'] = temp[,'hb1b2'] + .5  *  temp[,'s..'] * p.m.bef * p.f.bef + (temp[,'mb.a1'] + temp[,'mb.a2'] + temp[,'mb.'])  *  p.f.bef
>      temp[,'hb2b1'] = temp[,'hb2b1'] + .5  *  temp[,'s..'] * p.m.bef * p.f.bef + (temp[,'f.ba1'] + temp[,'f.ba2'] + temp[,'f.b'])  *  p.m.bef
> serostates[sexually.active,] <- temp
> return(serostates)
> }
>
> sexually.active <- rbinom(n, 1,.5)==1
> p.m.bef <- .5
> p.f.bef <- .8
>
> system.time(
>      for(ii in 1:100) {
>          serostates <- pre.couple(serostates, sexually.active)
>      }
>      ) ## about 2.25 seconds
>
>
> system.time(
>      for(ii in 1:100) {
>          serostatesMat <- pre.coupleMat(serostatesMat, sexually.active)
>      }
>      ) ## about 6 seconds
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>



More information about the datatable-help mailing list