[datatable-help] data.table vs matrix speed
Matt Dowle
mdowle at mdowle.plus.com
Wed Jul 9 18:59:59 CEST 2014
Oops, that highlighted that adding <- isn't quite the same when
recycling comes into it. In your case, each RHS returns a vector as
long as the input, so adding <- should be ok. But in my example, the
first RHS was a single 1L which was assigned to the symbol b (before
recycling) that sum(b) then saw and returned 1 not 2.
Ok, iterative RHS more pressing that I thought then. Thanks for
highlighting.
Matt
On 09/07/14 17:52, Matt Dowle wrote:
>
> Nice example. Yes this is the way to use it and I agree more
> readable. But I fear it isn't actually working as you expected. Each
> component of `:=` doesn't see previous results, yet (not yet
> implemented). Easier to see that in a simple example :
>
> > DT = data.table(a=1:3,b=1:6)
> > DT
> a b
> 1: 1 1
> 2: 2 2
> 3: 3 3
> 4: 1 4
> 5: 2 5
> 6: 3 6
> > DT[,`:=`(b=1L, d=sum(b)), by=a]
> > DT
> a b d
> 1: 1 1 5 # all the RHS got evaluated first, before starting to
> assign the results.
> 2: 2 1 7
> 3: 3 1 9
> 4: 1 1 5
> 5: 2 1 7
> 6: 3 1 9
> >
>
> To get the result you want, you currently have to add an extra `<-`.
> Like this :
>
> > DT = data.table(a=1:3,b=1:6) # start fresh
> > DT
> a b
> 1: 1 1
> 2: 2 2
> 3: 3 3
> 4: 1 4
> 5: 2 5
> 6: 3 6
> > DT[,`:=`(b=b<-1L, d=sum(b)), by=a] # extra b<-
> > DT
> a b d
> 1: 1 1 1
> 2: 2 1 1
> 3: 3 1 1
> 4: 1 1 1
> 5: 2 1 1
> 6: 3 1 1
> >
>
> Clearly in your example, since you're using earlier columns in later
> ones, that becomes onerous and bug prone due to typos, but shouldn't
> slow it down :
>
> pre.coupleDT <- function(serostates, sexually.active) {
> serostates[sexually.active , `:=`(
> s.. = s.. <- s.. * (1-p.m.bef) * (1-p.f.bef),
> mb.a1 = mb.a1 <- s.. * p.m.bef * (1-p.f.bef),
> mb.a2 = mb.a2 <- mb.a1 * (1 - p.f.bef),
> mb. = mb. <- mb.a2 * (1 - p.f.bef) + mb. * (1 - p.f.bef),
> f.ba1 = f.bal <- s.. * p.f.bef * (1-p.m.bef),
> f.ba2 = f.ba2 <- f.ba1 * (1 - p.m.bef),
> f.b = f.b <- f.ba2 * (1 - p.m.bef) + f.b * (1 - p.m.bef),
> hb1b2 = hb1b2 <- hb1b2 + .5 * s.. * p.m.bef * p.f.bef +
> (mb.a1 + mb.a2 + mb.) * p.f.bef,
> hb2b1 = hb2b1 + .5 * s.. * p.m.bef * p.f.bef +
> (f.ba1 + f.ba2 + f.b) * p.m.bef)
> ]
> return(serostates)
> }
>
>
> It's on the list to change it to the way you expected, and we all
> want that. It involves a change quite deep down in the C code so
> isn't done yet, although there's nothing particularly hard about it.
>
> In terms of why data.table is faster here, consider the repeated :
>
> temp[,'s..']
>
> The `[` there is a function call; is.function(`[`)==TRUE. And each
> time the 's..' string appears, it looks up which column number
> corresponds to that name. There are 28 calls in your matrix version.
> It isn't so much matrix vs data.table, more the access method. In the
> data.table version, once you're inside scope, it's just symbol lookup
> (the 28 calls to `[` are gone, as are the 28 lookups of 'colname').
>
> There may be some copies going on as well; e.g.
> serostates[sexually.active,] <- temp. Run both through Rprof() and
> it might reveal more.
>
> I can't think of a better way to use data.table. But note that the
> benchmark is pretty meaningless. It's being looped 100 times
> presumably because one run is so quick. This is quite a bug bear when
> we see this done online. The only way to scale up, is to increase the
> data size, perhaps by 100 times in this example. Then a single run
> takes a measurable amount of time (say 10 seconds or more) and the
> industry rule of thumb is to report the minimum of three consecutive
> runs. The inferences are usually very different than when you repeat a
> tiny test many times. The data has to be much much bigger than L2/L3
> cache (typically 8MB but varies widely), e.g. 1GB or more. This
> matrix is just 6MB and likely fits entirely in cache, depending on how
> big your cache is (see output of lscpu on unix/mac, or system info on
> Windows). Unless of course the nature of the task is to iterate, in
> which case the overhead of the `[` call can become significant, and is
> why we added set() as a loopable `:=`.
>
> HTH
> Matt
>
>
> On 09/07/14 16:30, Steve Bellan wrote:
>> I'm trying to optimize the speed of a script that iteratively updates
>> state variables for several thousands of individuals through time
>> though only some individuals are active at each point in time. I had
>> been doing this with matrices but was wondering how it compared with
>> data.table since the latter seems to be more readable. I'm finding
>> that my data.table implementation is about 2-3 times faster, which
>> seems surprising since I thought matrices should be faster. It makes
>> me wonder if there are ways to speed up either implementation. Any
>> help is much appreciated! Here's an example of the code:
>>
>>
>> n <- 10^5
>> k <- 9
>> serostates <- matrix(0,n,k)
>> serostates <- as.data.table(serostates)
>> setnames(serostates, 1:k, c('s..', 'mb.a1', 'mb.a2', 'mb.', 'f.ba1',
>> 'f.ba2', 'f.b', 'hb1b2', 'hb2b1'))
>> serostates[, `:=`(s.. = 1)]
>> serostates
>> serostatesMat <- as.matrix(serostates)
>>
>> pre.coupleDT <- function(serostates, sexually.active) {
>> serostates[sexually.active , `:=`(
>> s.. = s.. * (1-p.m.bef) * (1-p.f.bef),
>> mb.a1 = s.. * p.m.bef * (1-p.f.bef),
>> mb.a2 = mb.a1 * (1 - p.f.bef),
>> mb. = mb.a2 * (1 - p.f.bef) + mb. * (1 - p.f.bef),
>> f.ba1 = s.. * p.f.bef * (1-p.m.bef),
>> f.ba2 = f.ba1 * (1 - p.m.bef),
>> f.b = f.ba2 * (1 - p.m.bef) + f.b * (1 - p.m.bef),
>> hb1b2 = hb1b2 + .5 * s.. * p.m.bef * p.f.bef + (mb.a1 +
>> mb.a2 + mb.) * p.f.bef,
>> hb2b1 = hb2b1 + .5 * s.. * p.m.bef * p.f.bef + (f.ba1 +
>> f.ba2 + f.b) * p.m.bef)
>> ]
>> return(serostates)
>> }
>>
>>
>> pre.coupleMat <- function(serostates, sexually.active) {
>> temp <- serostates[sexually.active,]
>> temp[,'s..'] = temp[,'s..'] * (1-p.m.bef) * (1-p.f.bef)
>> temp[,'mb.a1'] = temp[,'s..'] * p.m.bef * (1-p.f.bef)
>> temp[,'mb.a2'] = temp[,'mb.a1'] * (1 - p.f.bef)
>> temp[,'mb.'] = temp[,'mb.a2'] * (1 - p.f.bef) + temp[,'mb.'] *
>> (1 - p.f.bef)
>> temp[,'f.ba1'] = temp[,'s..'] * p.f.bef * (1-p.m.bef)
>> temp[,'f.ba2'] = temp[,'f.ba1'] * (1 - p.m.bef)
>> temp[,'f.b'] = temp[,'f.ba2'] * (1 - p.m.bef) + temp[,'f.b'] *
>> (1 - p.m.bef)
>> temp[,'hb1b2'] = temp[,'hb1b2'] + .5 * temp[,'s..'] * p.m.bef
>> * p.f.bef + (temp[,'mb.a1'] + temp[,'mb.a2'] + temp[,'mb.']) * p.f.bef
>> temp[,'hb2b1'] = temp[,'hb2b1'] + .5 * temp[,'s..'] * p.m.bef
>> * p.f.bef + (temp[,'f.ba1'] + temp[,'f.ba2'] + temp[,'f.b']) * p.m.bef
>> serostates[sexually.active,] <- temp
>> return(serostates)
>> }
>>
>> sexually.active <- rbinom(n, 1,.5)==1
>> p.m.bef <- .5
>> p.f.bef <- .8
>>
>> system.time(
>> for(ii in 1:100) {
>> serostates <- pre.couple(serostates, sexually.active)
>> }
>> ) ## about 2.25 seconds
>>
>>
>> system.time(
>> for(ii in 1:100) {
>> serostatesMat <- pre.coupleMat(serostatesMat, sexually.active)
>> }
>> ) ## about 6 seconds
>>
>> _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>
>>
>
More information about the datatable-help
mailing list