[Rcpp-devel] [Rd] Performance of .C and .Call functions vs. native R code

Tue Jul 19 17:09:11 CEST 2011

I just saw that I left a syntax error in the .R and the first
_Rout.txt files.  Notice that in the second _Rout.txt file the order
of the arguments in the constructors for the MMatrixXd and the
MVectorXd are in a different order than in the .R and the first
_Rout.txt files.  The correct order has the pointer first, then the
dimensions.  For the first _Rout.txt file this part of the code is not
used.

On Tue, Jul 19, 2011 at 10:00 AM, Douglas Bates <bates at stat.wisc.edu> wrote:
> On Thu, Jul 14, 2011 at 10:21 AM, Alireza Mahani
> <alireza.s.mahani at gmail.com> wrote:
>> (I am using a LINUX machine)
>>
>> Jeff,
>>
>> In creating reproducible results, I 'partially' answered my question. I have
>> attached two scripts, 'mvMultiply.r' and 'mvMultiply.cc'. Please copy both
>> files into your chosen directory, then run 'Rscript mvMultiply.r' in that
>> directory while changing the two boolean parameters 'INCLUDE_DATAPREP' and
>> 'ROWMAJOR' to all four permutations. (The variable 'diffVec' is there to
>> verify that the two methods produce the same output vector.)
>>
>> Below are the results that I get, along with discussion (tR and tCall are in
>> sec):
>>
>> INCLUDE_DATAPREP,ROWMAJOR,tR,tCall
>> F,F,13.536,13.875
>> F,T,13.824,14.299
>> T,F,13.688,18.167
>> T,T,13.982,30.730
>>
>> Interpretation: The execution time for the .Call line is nearly identical to
>> the call to R operator '%*%'. Two data preparation lines for the .Call
>> method create the overhead:
>>
>> A <- t(A) (~12sec, or 12usec per call)
>> dim(A) <- dim(A)[1] * dim(A)[2] (~4sec, or 4usec per call)
>>
>> While the first line can be avoided by providing options in c++ function (as
>> is done in the BLAS API), I wonder if the second line can be avoided, aside
>> from the obvious option of rewriting the R scripts to use vectors instead of
>> matrices. But this defies one of the primary advantages of using R, which is
>> succinct, high-level coding. In particular, if one has several matrices as
>> input into a .Call function, then the overhead from matrix-to-vector
>> transformations can add up. To summarize, my questions are:
>>
>> 1- Do the above results seem reasonable to you? Is there a similar penalty
>> in R's '%*%' operator for transforming matrices to vectors before calling
>> BLAS functions?
>> 2- Are there techniques for reducing the above overhead for developers
>> looking to augment their R code with compiled code?
>>
>> Regards,
>> Alireza
>>
>> ---------------------------------------
>> # mvMultiply.r
>> # comparing performance of matrix multiplication in R (using '%*%' operator)
>> vs. calling compiled code (using .Call function)
>> # y [m x 1] = A [m x n] %*% x [n x 1]
>>
>> rm(list = ls())
>> system("R CMD SHLIB mvMultiply.cc")
>> dyn.load("mvMultiply.so")
>>
>> INCLUDE_DATAPREP <- F
>> ROWMAJOR <- F #indicates whether the c++ function treats A in a row-major or
>> column-major fashion
>>
>> m <- 100
>> n <- 10
>> N <- 1000000
>>
>> diffVec <- array(0, dim = N)
>> tR <- 0.0
>> tCall <- 0.0
>> for (i in 1:N) {
>>
>>        A <- runif(m*n); dim(A) <- c(m,n)
>>        x <- runif(n)
>>
>>        t1 <- proc.time()[3]
>>        y1 <- A %*% x
>>        tR <- tR + proc.time()[3] - t1
>>
>>        if (INCLUDE_DATAPREP) {
>>                t1 <- proc.time()[3]
>>        }
>>        if (ROWMAJOR) { #since R will convert matrix to vector in a column-major
>> fashion, if the c++ function expects a row-major format, we need to
>> transpose A before converting it to a vector
>>                A <- t(A)
>>        }
>>        dim(A) <- dim(A)[1] * dim(A)[2]
>>        if (!INCLUDE_DATAPREP) {
>>                t1 <- proc.time()[3]
>>        }
>>        y2 <- .Call("matvecMultiply", as.double(A), as.double(x),
>> as.logical(c(ROWMAJOR)))
>>        tCall <- tCall + proc.time()[3] - t1
>>
>>        diffVec[i] <- max(abs(y2 - y1))
>> }
>> cat("Data prep time for '.Call' included: ", INCLUDE_DATAPREP, "\n")
>> cat("C++ function expects row-major matrix: ", ROWMAJOR, "\n")
>> cat("Time - Using '%*%' operator in R: ", tR, "sec\n")
>> cat("Time - Using '.Call' function: ", tCall, "sec\n")
>> cat("Maximum difference between methods: ", max(diffVec), "\n")
>>
>> dyn.unload("mvMultiply.so")
>> ---------------------------------------
>> # mvMultiply.cc
>> #include <Rinternals.h>
>> #include <R.h>
>>
>> extern "C" {
>>
>> SEXP matvecMultiply(SEXP A, SEXP x, SEXP rowmajor) {
>>        double *rA = REAL(A), *rx = REAL(x), *ry;
>>        int *rrm = LOGICAL(rowmajor);
>>        int n = length(x);
>>        int m = length(A) / n;
>>        SEXP y;
>>        PROTECT(y = allocVector(REALSXP, m));
>>        ry = REAL(y);
>>        for (int i = 0; i < m; i++) {
>>                ry[i] = 0.0;
>>                for (int j = 0; j < n; j++) {
>>                        if (rrm[0] == 1) {
>>                                ry[i] += rA[i * n + j] * rx[j];
>>                        } else {
>>                                ry[i] += rA[j * m + i] * rx[j];
>>                        }
>>                }
>>        }
>>        UNPROTECT(1);
>>        return(y);
>> }
>>
>> }
>>
>
> I realize that you are just beginning to use the .C and .Call
> interfaces and your example is therefore a simple one.  However, if
> you plan to continue with such development it is worthwhile learning
> of some of the tools available.  I think one of the most important is
> the "inline" package that can take a C or C++ code segment as a text
> string and go through all the steps of creating and loading a
> .Call'able compiled function.
>
> Second, if you are going to use C++ (the code you show could be C code
> as it doesn't use any C++ extensions) then you should look at the Rcpp
> package written by Dirk Eddelbuettel and Romain Francois which allows
> for comparatively painless interfacing of R objects and C++ objects.
> The Rcpp-devel list, which I have copied on this reply, is for
> questions related to that system.  The inline package allows for
> various "plugin" constructions to wrap your code in the appropriate
> headers and point the compiler to the locations of header files and
> libraries.  There are two extensions to Rcpp for numerical linear
> algebra in C++, RcppArmadillo and RcppEigen.  I show the use of
> RcppEigen here.
>
> Third there are several packages in R that do the busy work of
> benchmarking expressions and neatly formulating the results.  I use
> the rbenchmark package.
>
> Putting all these together yields the enclosed script and results.
>
> In Eigen, a MatrixXd object is the equivalent of R's numeric matrix
> (similarly MatrixXi for integer and MatrixXcd for complex) and a
> VectorXd object is the equivalent of a numeric vector.  A "mapped"
> matrix or vector is one that uses the storage allocated by R, thereby
> avoiding a copy operation (similar to your accessing elements of the
> arrays through the pointer returned by REAL()).  To adhere to R's
> functional programming semantics it is a good idea to declare such
> objects as const.  The 'as' and 'wrap' functions are provided by Rcpp
> with extensions in RcppEigen to the Eigen classes in the development
> version.  In the released versions of Rcpp and RcppEigen we use
> intermediate Rcpp objects. These functions have the advantage of
> checking the types of R objects being passed.  The Eigen code for
> matrix multiplication will check the consistency of dimensions in the
> operation.
>
> Given the size of the matrix you are working with it is not surprising
> that interpretation overhead and checking will be a large part of the
> elapsed time, hence the relative differences between different methods
> of doing the numerical calculation will be small.  The operation of
> multiplying a 100 x 10 matrix by a 10-vector involves "only" 1000
> floating point operations.  Furthermore, each element of the matrix is
> used only once so sophisticated methods of manipulating cache contents
> won't buy you much.  These benchmark results are from a system that
> uses Atlas BLAS (basic linear algebra subroutines); other systems will
> provide different results.  Interestingly, I found on some systems
> using R's BLAS, which are not accelerated, the R code is closer in
> speed to the code using Eigen.  An example is given in the second
> version of the output.
>