[datatable-help] data.table vs matrix speed
Matt Dowle
mdowle at mdowle.plus.com
Wed Jul 9 18:52:36 CEST 2014
Nice example. Yes this is the way to use it and I agree more
readable. But I fear it isn't actually working as you expected. Each
component of `:=` doesn't see previous results, yet (not yet
implemented). Easier to see that in a simple example :
> DT = data.table(a=1:3,b=1:6)
> DT
a b
1: 1 1
2: 2 2
3: 3 3
4: 1 4
5: 2 5
6: 3 6
> DT[,`:=`(b=1L, d=sum(b)), by=a]
> DT
a b d
1: 1 1 5 # all the RHS got evaluated first, before starting to assign
the results.
2: 2 1 7
3: 3 1 9
4: 1 1 5
5: 2 1 7
6: 3 1 9
>
To get the result you want, you currently have to add an extra `<-`.
Like this :
> DT = data.table(a=1:3,b=1:6) # start fresh
> DT
a b
1: 1 1
2: 2 2
3: 3 3
4: 1 4
5: 2 5
6: 3 6
> DT[,`:=`(b=b<-1L, d=sum(b)), by=a] # extra b<-
> DT
a b d
1: 1 1 1
2: 2 1 1
3: 3 1 1
4: 1 1 1
5: 2 1 1
6: 3 1 1
>
Clearly in your example, since you're using earlier columns in later
ones, that becomes onerous and bug prone due to typos, but shouldn't
slow it down :
pre.coupleDT <- function(serostates, sexually.active) {
serostates[sexually.active , `:=`(
s.. = s.. <- s.. * (1-p.m.bef) * (1-p.f.bef),
mb.a1 = mb.a1 <- s.. * p.m.bef * (1-p.f.bef),
mb.a2 = mb.a2 <- mb.a1 * (1 - p.f.bef),
mb. = mb. <- mb.a2 * (1 - p.f.bef) + mb. * (1 - p.f.bef),
f.ba1 = f.bal <- s.. * p.f.bef * (1-p.m.bef),
f.ba2 = f.ba2 <- f.ba1 * (1 - p.m.bef),
f.b = f.b <- f.ba2 * (1 - p.m.bef) + f.b * (1 - p.m.bef),
hb1b2 = hb1b2 <- hb1b2 + .5 * s.. * p.m.bef * p.f.bef + (mb.a1 + mb.a2 + mb.) * p.f.bef,
hb2b1 = hb2b1 + .5 * s.. * p.m.bef * p.f.bef + (f.ba1 + f.ba2 + f.b) * p.m.bef)
]
return(serostates)
}
It's on the list to change it to the way you expected, and we all want
that. It involves a change quite deep down in the C code so isn't done
yet, although there's nothing particularly hard about it.
In terms of why data.table is faster here, consider the repeated :
temp[,'s..']
The `[` there is a function call; is.function(`[`)==TRUE. And each time
the 's..' string appears, it looks up which column number corresponds to
that name. There are 28 calls in your matrix version. It isn't so much
matrix vs data.table, more the access method. In the data.table version,
once you're inside scope, it's just symbol lookup (the 28 calls to `[`
are gone, as are the 28 lookups of 'colname').
There may be some copies going on as well; e.g.
serostates[sexually.active,] <- temp. Run both through Rprof() and it
might reveal more.
I can't think of a better way to use data.table. But note that the
benchmark is pretty meaningless. It's being looped 100 times presumably
because one run is so quick. This is quite a bug bear when we see this
done online. The only way to scale up, is to increase the data size,
perhaps by 100 times in this example. Then a single run takes a
measurable amount of time (say 10 seconds or more) and the industry rule
of thumb is to report the minimum of three consecutive runs. The
inferences are usually very different than when you repeat a tiny test
many times. The data has to be much much bigger than L2/L3 cache
(typically 8MB but varies widely), e.g. 1GB or more. This matrix is
just 6MB and likely fits entirely in cache, depending on how big your
cache is (see output of lscpu on unix/mac, or system info on Windows).
Unless of course the nature of the task is to iterate, in which case
the overhead of the `[` call can become significant, and is why we added
set() as a loopable `:=`.
HTH
Matt
On 09/07/14 16:30, Steve Bellan wrote:
> I'm trying to optimize the speed of a script that iteratively updates state variables for several thousands of individuals through time though only some individuals are active at each point in time. I had been doing this with matrices but was wondering how it compared with data.table since the latter seems to be more readable. I'm finding that my data.table implementation is about 2-3 times faster, which seems surprising since I thought matrices should be faster. It makes me wonder if there are ways to speed up either implementation. Any help is much appreciated! Here's an example of the code:
>
>
> n <- 10^5
> k <- 9
> serostates <- matrix(0,n,k)
> serostates <- as.data.table(serostates)
> setnames(serostates, 1:k, c('s..', 'mb.a1', 'mb.a2', 'mb.', 'f.ba1', 'f.ba2', 'f.b', 'hb1b2', 'hb2b1'))
> serostates[, `:=`(s.. = 1)]
> serostates
> serostatesMat <- as.matrix(serostates)
>
> pre.coupleDT <- function(serostates, sexually.active) {
> serostates[sexually.active , `:=`(
> s.. = s.. * (1-p.m.bef) * (1-p.f.bef),
> mb.a1 = s.. * p.m.bef * (1-p.f.bef),
> mb.a2 = mb.a1 * (1 - p.f.bef),
> mb. = mb.a2 * (1 - p.f.bef) + mb. * (1 - p.f.bef),
> f.ba1 = s.. * p.f.bef * (1-p.m.bef),
> f.ba2 = f.ba1 * (1 - p.m.bef),
> f.b = f.ba2 * (1 - p.m.bef) + f.b * (1 - p.m.bef),
> hb1b2 = hb1b2 + .5 * s.. * p.m.bef * p.f.bef + (mb.a1 + mb.a2 + mb.) * p.f.bef,
> hb2b1 = hb2b1 + .5 * s.. * p.m.bef * p.f.bef + (f.ba1 + f.ba2 + f.b) * p.m.bef)
> ]
> return(serostates)
> }
>
>
> pre.coupleMat <- function(serostates, sexually.active) {
> temp <- serostates[sexually.active,]
> temp[,'s..'] = temp[,'s..'] * (1-p.m.bef) * (1-p.f.bef)
> temp[,'mb.a1'] = temp[,'s..'] * p.m.bef * (1-p.f.bef)
> temp[,'mb.a2'] = temp[,'mb.a1'] * (1 - p.f.bef)
> temp[,'mb.'] = temp[,'mb.a2'] * (1 - p.f.bef) + temp[,'mb.'] * (1 - p.f.bef)
> temp[,'f.ba1'] = temp[,'s..'] * p.f.bef * (1-p.m.bef)
> temp[,'f.ba2'] = temp[,'f.ba1'] * (1 - p.m.bef)
> temp[,'f.b'] = temp[,'f.ba2'] * (1 - p.m.bef) + temp[,'f.b'] * (1 - p.m.bef)
> temp[,'hb1b2'] = temp[,'hb1b2'] + .5 * temp[,'s..'] * p.m.bef * p.f.bef + (temp[,'mb.a1'] + temp[,'mb.a2'] + temp[,'mb.']) * p.f.bef
> temp[,'hb2b1'] = temp[,'hb2b1'] + .5 * temp[,'s..'] * p.m.bef * p.f.bef + (temp[,'f.ba1'] + temp[,'f.ba2'] + temp[,'f.b']) * p.m.bef
> serostates[sexually.active,] <- temp
> return(serostates)
> }
>
> sexually.active <- rbinom(n, 1,.5)==1
> p.m.bef <- .5
> p.f.bef <- .8
>
> system.time(
> for(ii in 1:100) {
> serostates <- pre.couple(serostates, sexually.active)
> }
> ) ## about 2.25 seconds
>
>
> system.time(
> for(ii in 1:100) {
> serostatesMat <- pre.coupleMat(serostatesMat, sexually.active)
> }
> ) ## about 6 seconds
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
More information about the datatable-help
mailing list