[Rcpp-devel] performance profile from Rcpp or RcppArmadillo code? Matrix subviews, etc

Yan Zhou zhouyan at me.com
Mon Jan 7 00:09:22 CET 2013


On Jan 6, 2013, at 2:50 AM, Paul Johnson <pauljohn32 at gmail.com> wrote:

> I received a big pile of Rcpp code from some folks and am using it as
> a way to learn about Armadillo and Rcpp.
> 
> From gazing at the code, and guessing about style and usage, I see some
> ways to make a program run faster.  But it is all by guessing and
> testing.
> 
> I wish I could get a profile (via gcc) to find out where the Rcpp part
> is spending its time.
> 
> Has anybody made a list of "dos and don'ts" for making faster
> RcppArmadillo code?
> 

Mostly the usual C++ dos and don'ts.

> So far, I'm absolutely sure that RcppArmadillo follows many of the
> usual rules about fast code. Don't allocate huge matrices over and
> over for no good reason, for example. Armadillo has the delayed
> processing feature that automatically accelerates some calculations.
> That's quite amazing to me.

I think you are talking about the expression template feature of Armadillo. It does not accelerate the arithmetic itself; it just avoids unnecessary duplication of calculations (and the temporaries that would otherwise hold them).
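As a toy sketch of the idea (this is not Armadillo's actual implementation, just the principle): an expression template records the operation instead of performing it, so a sum like `a + b + c` is evaluated in a single fused loop with no intermediate vectors:

```cpp
#include <cstddef>
#include <vector>

struct Vec;

// Sum<L, R> records "l + r" lazily; nothing is computed until assignment.
template <class L, class R>
struct Sum {
    const L& l;
    const R& r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
    std::size_t size() const { return l.size(); }
};

struct Vec {
    std::vector<double> data;
    explicit Vec(std::size_t n, double v = 0.0) : data(n, v) {}
    double operator[](std::size_t i) const { return data[i]; }
    std::size_t size() const { return data.size(); }
    // Assignment from an expression: one pass over memory, no temporaries.
    template <class L, class R>
    Vec& operator=(const Sum<L, R>& e) {
        for (std::size_t i = 0; i < data.size(); ++i) data[i] = e[i];
        return *this;
    }
};

inline Sum<Vec, Vec> operator+(const Vec& a, const Vec& b) { return {a, b}; }
template <class L, class R>
Sum<Sum<L, R>, Vec> operator+(const Sum<L, R>& a, const Vec& b) { return {a, b}; }
```

Armadillo's templates are far more general, but the principle is the same: the loops are fused at the assignment site, so no temporary ever holds `a + b`.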
> 
> But some of the practical details of writing code are uncertain. In
> particular, I wonder about the slowdown in making calculations on
> submatrices and the relative penalty you pay for re-calculating things
> versus allocating space to store them. (I once saw a demonstration
> that it is faster to recalculate multiplication problems than it is to
> store the results in some cases....)
It obviously depends on how often you need to access the result and how long it takes to calculate. Modern CPUs can compute one multiplication and one addition per cycle, and in linear algebra code four floats or two doubles can be processed with a single SIMD instruction. In contrast, accessing a stored result can take up to hundreds of cycles if a cache miss happens. So if you access the calculated result only a few times, it is not allocated on the stack, and the calculation is relatively fast, then recalculating can be faster than storing the result and accessing it later.

In contrast, if your calculation results in a scalar or a small fixed-size matrix, then storing it in automatic memory (on the stack) and accessing it later is almost surely faster than calculating it again, provided the later access happens close to the site of storage.
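The scalar case can be sketched without Armadillo; the function names here are made up for illustration:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Recomputes the loop-invariant norm of w on every iteration.
double dot_normalized_slow(const std::vector<double>& v,
                           const std::vector<double>& w) {
    double s = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i) {
        double nrm = 0.0;                       // recomputed each time
        for (double x : w) nrm += x * x;
        s += v[i] * w[i] / std::sqrt(nrm);
    }
    return s;
}

// Computes the scalar once, keeps it in automatic storage (a stack
// local), and reuses it inside the loop.
double dot_normalized_fast(const std::vector<double>& v,
                           const std::vector<double>& w) {
    double nrm = 0.0;
    for (double x : w) nrm += x * x;
    const double inv = 1.0 / std::sqrt(nrm);    // hoisted out of the loop
    double s = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i) s += v[i] * w[i] * inv;
    return s;
}
```

Both functions return the same value (up to rounding); the second just keeps the loop-invariant scalar in a register or on the stack instead of recomputing it every pass. A good optimizer may do this hoisting for you, but only when it can prove the value is invariant.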

For more advice, I would suggest Agner Fog's optimization manuals (http://www.agner.org/optimize/); just the first section is relevant to most practitioners.
> 
> Here are some of the things I would like to read about. I have noticed
> in R that accessing subsections of a matrix is slow, compared to
> analyzing the whole thing. For example. It is SLOWER to repeatedly
> work on a subsection of a matrix x than it is to extract that piece
> one time and work on the copy. I'm pasting in the R code to
> demonstrate this at the end of this message.
> 
> Does Armadillo follow that same logic, so that repeatedly accessing a
> subview of a matrix is slower than creating a new copy of the
> submatrix and then working on it?
It really depends on how the subview is accessed. The two main factors that affect performance are whether the matrix is accessed linearly and whether accessing the subview creates a temporary copy. The latter is even worse than the former, as it leads to free-store allocation. A matrix is really just a one-dimensional array in disguise, and a dynamic array is really a pointer. Armadillo is mostly smart about avoiding temporary copies, thanks to expression templates, but when it cannot determine that avoiding the copy is safe, for example because of possible pointer aliasing, it does make one.

Non-contiguous access is slow because it uses the cache inefficiently and should be avoided. There should be no performance difference between accessing a contiguous block (whole columns) of a matrix and accessing the whole matrix: they are exactly the same thing, memory accessed linearly through a pointer. When access is non-linear, the worst case is cache contention, which will be very slow.

In your example below, what you want to access are non-contiguous sets of rows and columns, g(k1, k2), etc. Things cannot get much worse than that if they need to be accessed repeatedly, even when no temporary copy is created. If you need to access a block row by row, consider using row-major storage.
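The column-major point can be illustrated without Armadillo (illustrative code, not Armadillo's internals; Armadillo and R both store matrices column-major):

```cpp
#include <cstddef>
#include <vector>

// A matrix is a 1-D array in disguise, column-major here:
// elem(i, j) lives at data[i + j * n_rows].

// Outer loop over columns: the inner loop walks memory linearly,
// so every loaded cache line is fully used.
double sum_by_columns(const std::vector<double>& data,
                      std::size_t nr, std::size_t nc) {
    double s = 0.0;
    for (std::size_t j = 0; j < nc; ++j)
        for (std::size_t i = 0; i < nr; ++i)
            s += data[i + j * nr];
    return s;
}

// Outer loop over rows: consecutive inner-loop accesses are nr doubles
// apart, so each one may touch a different cache line.
double sum_by_rows(const std::vector<double>& data,
                   std::size_t nr, std::size_t nc) {
    double s = 0.0;
    for (std::size_t i = 0; i < nr; ++i)
        for (std::size_t j = 0; j < nc; ++j)
            s += data[i + j * nr];
    return s;
}
```

Both functions compute the same sum, but the first is the cache-friendly traversal; once the matrix is larger than the cache, the strided version pays for a cache line per element instead of per eight doubles.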

Yan
> 
> Along those lines, has the performance of the Armadillo non-contiguous
> matrix access been assessed? Are we well advised to just allocate new
> matrices and access the non-contiguous subviews one time, rather than
> access them over and over?
> 
> Given a matrix g and vectors k1 and k2 that contain row numbers, we
> want to do math on these sub matrices.
> 
> g(k1, k1)
> g(k1, k2)
> g(k2, k2)
> 
> I am *guessing* that repeated access to a non-contiguous matrix
> subview is slower than accessing a contiguous block out of the matrix.
> And accessing a whole matrix would be faster still. So maybe a person
> ought to allocate.
> 
> arma::mat g11 = g(k1, k1);
> 
> rather than grabbing g(k1, k1) over and over. Know what I mean?
> 
> Oh, well. here's the example about R submatrix access
> 
> ## fasterR-matrix-1.R
> ## Paul Johnson
> ## 2012-12-23
> 
> nReps <- 5000
> nVars <- 100
> nRows <- 200
> 
> 
> ## fn1 with elements is the fastest case scenario
> ##
> fn1 <- function(x){
>    solve(x)
> }
> 
> 
> ## extract 2:(nVars+1) rows and columns, returns "anonymously"
> fn2 <- function(x){
>    y <- x[2:(nVars+1), 2:(nVars+1)]
>    solve(y)
> }
> 
> 
> ## names new matrix to return
> fn3 <- function(x){
>    y <- x[2:(nVars+1), 2:(nVars+1)]
>    yinv <- solve(y)
> }
> 
> ## manufacture new matrix, copy results to it
> fn4 <- function(x){
>    y <- x[2:(nVars+1), 2:(nVars+1)]
>    yinv <- solve(y)
>    z <- matrix(0, nVars, nVars)
>    z <- yinv
> }
> 
> ## manufacture bigger matrix, copy results into part of it
> fn5 <- function(x){
>    y <- x[2:(nVars+1), 2:(nVars+1)]
>    yinv <- solve(y)
>    z <- matrix(0, (nVars+1), (nVars+1))
>    z[2:(nVars+1), 2:(nVars+1)] <- yinv
> }
> 
> 
> ## Now make up test data.
> ## Manufacture symmetric matrices
> Xlist1 <- vector("list", nReps)
> for(i in 1:nReps){
>    x <- matrix(rnorm(nRows * nVars), nRows, nVars)
>    Xlist1[[i]] <- crossprod(x)
> }
> 
> ## create padded theta-like matrix
> Xlist2 <- vector("list", nReps)
> for(i in 1:nReps){
>    y <- Xlist1[[i]]
>    y <- rbind(c(rep(1, nVars)), y)
>    y <- cbind(c(-1, rep(1,nVars)), y)
>    Xlist2[[i]] <- y
> }
> 
> 
> system.time(
>    fn1l <- lapply(Xlist1, fn1)
>    )
> 
> system.time(
>    fn2l <- lapply(Xlist2, fn2)
>    )
> 
> system.time(
>    fn3l <- lapply(Xlist2, fn3)
>    )
> 
> 
> system.time(
>    fn4l <- lapply(Xlist2, fn4)
>    )
> 
> 
> system.time(
>    fn5l <- lapply(Xlist2, fn5)
>    )
> 
> ## Conclusion. It IS slower to repeatedly extract rows & columns 2:(nVars+1)
> ## than to just work with a whole matrix of nVars x nVars values.
> 
> 
> -- 
> Paul E. Johnson
> Professor, Political Science      Assoc. Director
> 1541 Lilac Lane, Room 504      Center for Research Methods
> University of Kansas                 University of Kansas
> http://pj.freefaculty.org               http://quant.ku.edu
> _______________________________________________
> Rcpp-devel mailing list
> Rcpp-devel at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/rcpp-devel


