[datatable-help] data.table vs matrix speed

Matt Dowle mdowle at mdowle.plus.com
Wed Jul 9 18:59:59 CEST 2014


Oops, that highlighted that adding <- isn't quite the same when 
recycling comes into it.  In your case, each RHS returns a vector as 
long as the input, so adding <- should be ok.  But in my example, the 
first RHS was a single 1L which was assigned to the symbol b (before 
recycling) that sum(b) then saw and returned 1 not 2.
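
A minimal base R sketch of that effect, assuming j behaves like a list() call evaluated once per group (j_env, group_rows and d_after_recycling are names made up for illustration, not data.table internals):

```r
# One group from the example: a == 1 holds two rows, with b = 1 and 4.
group_rows <- 2L
j_env <- list2env(list(b = c(1L, 4L)))

# list() evaluates its arguments left to right; `b <- 1L` rebinds b in j_env
# to a length-1 vector before sum(b) is evaluated.
res <- eval(quote(list(b = b <- 1L, d = sum(b))), j_env)
res$d   # sum() saw the length-1 b, not the recycled column

# What an iterative RHS would presumably see instead: b recycled to the group.
d_after_recycling <- sum(rep(res$b, length.out = group_rows))
```

So the extra `<-` only matches the iterative behaviour when the RHS is already as long as the group.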

Ok, iterative RHS more pressing than I thought then.  Thanks for 
highlighting.

Matt

On 09/07/14 17:52, Matt Dowle wrote:
>
> Nice example.  Yes, this is the way to use it, and I agree it's more 
> readable.   But I fear it isn't actually working as you expected. Each 
> component of `:=` doesn't yet see previous results (not yet 
> implemented).  That's easier to see in a simple example :
>
> > DT = data.table(a=1:3,b=1:6)
> > DT
>    a b
> 1: 1 1
> 2: 2 2
> 3: 3 3
> 4: 1 4
> 5: 2 5
> 6: 3 6
> > DT[,`:=`(b=1L, d=sum(b)), by=a]
> > DT
>    a b d
> 1: 1 1 5   # all the RHS got evaluated first, before starting to assign the results.
> 2: 2 1 7
> 3: 3 1 9
> 4: 1 1 5
> 5: 2 1 7
> 6: 3 1 9
> >
>
> To get the result you want, you currently have to add an extra `<-`.  
> Like this :
>
> > DT = data.table(a=1:3,b=1:6)   # start fresh
> > DT
>    a b
> 1: 1 1
> 2: 2 2
> 3: 3 3
> 4: 1 4
> 5: 2 5
> 6: 3 6
> > DT[,`:=`(b=b<-1L, d=sum(b)), by=a]   # extra b<-
> > DT
>    a b d
> 1: 1 1 1
> 2: 2 1 1
> 3: 3 1 1
> 4: 1 1 1
> 5: 2 1 1
> 6: 3 1 1
> >
>
> Clearly in your example, since you're using earlier columns in later 
> ones, that becomes onerous and bug prone due to typos, but shouldn't 
> slow it down :
>
> pre.coupleDT <- function(serostates, sexually.active) {
>     serostates[sexually.active , `:=`(
>         s..   = s..   <- s.. * (1-p.m.bef) * (1-p.f.bef),
>         mb.a1 = mb.a1 <- s.. * p.m.bef * (1-p.f.bef),
>         mb.a2 = mb.a2 <- mb.a1 * (1 - p.f.bef),
>         mb.   = mb.   <- mb.a2 * (1 - p.f.bef) + mb. * (1 - p.f.bef),
>         f.ba1 = f.ba1 <- s.. * p.f.bef * (1-p.m.bef),
>         f.ba2 = f.ba2 <- f.ba1 * (1 - p.m.bef),
>         f.b   = f.b   <- f.ba2 * (1 - p.m.bef) + f.b * (1 - p.m.bef),
>         hb1b2 = hb1b2 <- hb1b2 + .5  *  s.. * p.m.bef * p.f.bef +
>                          (mb.a1 + mb.a2 + mb.)  *  p.f.bef,
>         hb2b1 =          hb2b1 + .5  *  s.. * p.m.bef * p.f.bef +
>                          (f.ba1 + f.ba2 + f.b)  *  p.m.bef)
>            ]
>     return(serostates)
> }
>
>
> It's on the list to change it to the way you expected,  and we all 
> want that.  It involves a change quite deep down in the C code so 
> isn't done yet,  although there's nothing particularly hard about it.
>
> In terms of why data.table is faster here, consider the repeated :
>
>     temp[,'s..']
>
> The `[` there is a function call; is.function(`[`)==TRUE. And each 
> time the 's..' string appears, it looks up which column number 
> corresponds to that name. There are 28 calls in your matrix version. 
> It isn't so much matrix vs data.table, more the access method. In the 
> data.table version, once you're inside scope, it's just symbol lookup 
> (the 28 calls to `[` are gone, as are the 28 lookups of 'colname').
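
To illustrate the two access patterns, here is a small base R sketch. The matrix values are made up, and with() on a data.frame is only a stand-in for "once you're inside scope"; it is not how data.table implements it.

```r
# Tiny stand-in for the matrix in the thread; values are made up.
m <- matrix((1:6) / 10, nrow = 3, dimnames = list(NULL, c("s..", "mb.")))

is_fun <- is.function(`[`)   # `[` is an ordinary function call

# Pattern 1: every m[, "name"] calls `[` and matches the column name again.
by_name <- m[, "s.."] * 0.5 + m[, "mb."] * m[, "s.."]

# Pattern 2: resolve the columns once, then it's plain symbol lookup.
by_symbol <- with(as.data.frame(m), s.. * 0.5 + mb. * s..)
```

Both give the same numbers; only the number of `[` calls and name lookups differs.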
>
> There may be some copies going on as well; e.g. 
> serostates[sexually.active,] <- temp.   Run both through Rprof() and 
> it might reveal more.
>
> I can't think of a better way to use data.table. But note that the 
> benchmark is pretty meaningless. It's being looped 100 times 
> presumably because one run is so quick. This is quite a bugbear when 
> we see this done online. The only way to scale up is to increase the 
> data size, perhaps by 100 times in this example. Then a single run 
> takes a measurable amount of time (say 10 seconds or more) and the 
> industry rule of thumb is to report the minimum of three consecutive 
> runs. The inferences are usually very different than when you repeat a 
> tiny test many times. The data has to be much much bigger than L2/L3 
> cache (typically 8MB but varies widely), e.g. 1GB or more.  This 
> matrix is just 6MB and likely fits entirely in cache, depending on how 
> big your cache is (see output of lscpu on Linux, sysctl hw on Mac, or 
> system info on Windows).  Unless of course the nature of the task is to 
> iterate,  in 
> which case the overhead of the `[` call can become significant, and is 
> why we added set() as a loopable `:=`.
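
A rough sketch of the "minimum of three consecutive runs" approach in base R. min_of_three is a hypothetical helper name, and n is kept small here so the sketch runs quickly; for a real benchmark the data would need to be far bigger than cache, as described above.

```r
# Hypothetical helper: time f() `runs` times in a row, report the minimum.
min_of_three <- function(f, runs = 3L) {
  times <- vapply(
    seq_len(runs),
    function(i) system.time(f())[["elapsed"]],
    numeric(1)
  )
  min(times)
}

n <- 1e6          # illustrative only; scale up until one run is measurable
x <- runif(n)
elapsed <- min_of_three(function() sum(sqrt(x)))
```

The point is one large run repeated three times, rather than a tiny run looped 100 times.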
>
> HTH
> Matt
>
>
> On 09/07/14 16:30, Steve Bellan wrote:
>> I'm trying to optimize the speed of a script that iteratively updates 
>> state variables for several thousands of individuals through time 
>> though only some individuals are active at each point in time. I had 
>> been doing this with matrices but was wondering how it compared with 
>> data.table since the latter seems to be more readable. I'm finding 
>> that my data.table implementation is about 2-3 times faster, which 
>> seems surprising since I thought matrices should be faster. It makes 
>> me wonder if there are ways to speed up either implementation. Any 
>> help is much appreciated! Here's an example of the code:
>>
>>
>> n <- 10^5
>> k <- 9
>> serostates <- matrix(0,n,k)
>> serostates <- as.data.table(serostates)
>> setnames(serostates, 1:k, c('s..', 'mb.a1', 'mb.a2', 'mb.', 'f.ba1', 
>> 'f.ba2', 'f.b', 'hb1b2', 'hb2b1'))
>> serostates[, `:=`(s.. = 1)]
>> serostates
>> serostatesMat <- as.matrix(serostates)
>>
>> pre.coupleDT <- function(serostates, sexually.active) {
>>      serostates[sexually.active , `:=`(
>>          s..   = s.. * (1-p.m.bef) * (1-p.f.bef),
>>          mb.a1 = s.. * p.m.bef * (1-p.f.bef),
>>          mb.a2 = mb.a1 * (1 - p.f.bef),
>>          mb.   = mb.a2 * (1 - p.f.bef) + mb. * (1 - p.f.bef),
>>          f.ba1 = s.. * p.f.bef * (1-p.m.bef),
>>          f.ba2 = f.ba1 * (1 - p.m.bef),
>>          f.b   = f.ba2 * (1 - p.m.bef) + f.b * (1 - p.m.bef),
>>          hb1b2 = hb1b2 + .5  *  s.. * p.m.bef * p.f.bef + (mb.a1 + 
>> mb.a2 + mb.)  *  p.f.bef,
>>          hb2b1 = hb2b1 + .5  *  s.. * p.m.bef * p.f.bef + (f.ba1 + 
>> f.ba2 + f.b)  *  p.m.bef)
>>             ]
>>      return(serostates)
>> }
>>
>>
>> pre.coupleMat <- function(serostates, sexually.active) {
>>      temp <- serostates[sexually.active,]
>>      temp[,'s..']   = temp[,'s..'] * (1-p.m.bef) * (1-p.f.bef)
>>      temp[,'mb.a1'] = temp[,'s..'] * p.m.bef * (1-p.f.bef)
>>      temp[,'mb.a2'] = temp[,'mb.a1'] * (1 - p.f.bef)
>>      temp[,'mb.'] = temp[,'mb.a2'] * (1 - p.f.bef) + temp[,'mb.'] * 
>> (1 - p.f.bef)
>>      temp[,'f.ba1'] = temp[,'s..'] * p.f.bef * (1-p.m.bef)
>>      temp[,'f.ba2'] = temp[,'f.ba1'] * (1 - p.m.bef)
>>      temp[,'f.b'] = temp[,'f.ba2'] * (1 - p.m.bef) + temp[,'f.b'] * 
>> (1 - p.m.bef)
>>      temp[,'hb1b2'] = temp[,'hb1b2'] + .5  *  temp[,'s..'] * p.m.bef * p.f.bef +
>>          (temp[,'mb.a1'] + temp[,'mb.a2'] + temp[,'mb.'])  *  p.f.bef
>>      temp[,'hb2b1'] = temp[,'hb2b1'] + .5  *  temp[,'s..'] * p.m.bef * p.f.bef +
>>          (temp[,'f.ba1'] + temp[,'f.ba2'] + temp[,'f.b'])  *  p.m.bef
>>      serostates[sexually.active,] <- temp
>>      return(serostates)
>> }
>>
>> sexually.active <- rbinom(n, 1,.5)==1
>> p.m.bef <- .5
>> p.f.bef <- .8
>>
>> system.time(
>>      for(ii in 1:100) {
>>          serostates <- pre.coupleDT(serostates, sexually.active)
>>      }
>>      ) ## about 2.25 seconds
>>
>>
>> system.time(
>>      for(ii in 1:100) {
>>          serostatesMat <- pre.coupleMat(serostatesMat, sexually.active)
>>      }
>>      ) ## about 6 seconds
>>
>> _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help 
>>
>>
>


