[Rcpp-devel] Differences between RcppEigen and RcppArmadillo

Sat Jun 16 20:21:21 CEST 2012

On Thu, Jun 14, 2012 at 2:36 PM, Dirk Eddelbuettel <edd at debian.org> wrote:
>
> On 15 June 2012 at 02:56, c s wrote:
> | Simply installing ATLAS (which provides speed-ups for several Lapack
> | functions) on Debian/Ubuntu systems can already make a big difference.
> |  (Debian & Ubuntu use a trick to redirect Lapack and Blas calls to
> | ATLAS).  Under Mac OS X, the Accelerate framework provides fast
> | implementations of Lapack and Blas functions (eg. using
> | multi-threading).
>
> I found OpenBLAS to be faster than Atlas (with both coming from the
> distribution), and tend to install that now. It uses a multithreaded approach
> by default, but I have to double check one thing about cpu affinity which,
> when used from R, has been seen to interfere.  That is possible reason for
> the single-threaded performance here. I'll report back.
>
> | I've taken the modular approach to Armadillo (ie. using Lapack rather
> | than reimplementing decompositions), as it specifically allows other
> | specialist parties (such as Intel) to provide Lapack that is highly
> | optimised for particular architectures.  I myself would not be able to
> | keep up with the specific optimisations required for each CPU.  This
> | also "future-proofs" Armadillo for each new CPU generation.
> |
> | More importantly, numerically stable implementation of computational
> | decompositions/factorisations is notoriously difficult to get right.
> | The core algorithms in Lapack have been evolving for the past 20+
> | years, being exposed to a bazillion corner-cases.  Lapack itself is
> | related to Linpack and Eispack, which are even older.  I've been
> | exposed to software development long enough to know that in the end
> | only time can shake out all the bugs.  As such, using Lapack is far
> | less risky than reimplementing decompositions from scratch.  A
> | "home-made" matrix decomposition might be a bit faster on a particular
> | CPU, but you have far less knowledge as to when it's going to blow up
> | in your face.
> |
> | High-performance variants of Lapack, such as MKL, take an existing
> | proven implementation of a decomposition algorithm and recode parts of
> | it in assembly, and/or parallelise other parts.
>
> All excellent points which nobody disputes.  It just so happens that Eigen
> still looks better in some benchmarks but as you say we have to ensure we
> really compare apples to apples.

While I realize that Conrad has strong opinions on these issues and is
unlikely to be moved from those opinions, I would make a few points in
favor of Eigen.

- using BLAS/Lapack is effective because that is mature, well-tested
code.  It is ineffective because the reference implementations of
BLAS/Lapack are in Fortran 77 and suffer from all the idiosyncrasies
of that ancient language.  These include call-by-reference semantics,
even for scalars, lack of structures beyond pointers to arrays and the
inability to allocate storage.  Hence, the caller frequently must pass
an array 'work' and it length 'lwork' and, occasionally, 'rwork' and
'iwork'. There is a helpful convention for getting the length of the
work array correct which involves a preliminary call with lwork = -1
that does nothing but write the desired work array length into the
first element of the work array.  The caller must then allocate the
work array, change the value of lwork and call the routine a second
time.  Because the only structure available is a pointer to a
one-dimensional array, you must pass a general matrix as four
arguments M, N, A, LDA.  There are no machine-readable header files.
Jeff Bezanson recently joked about writing a parser for the upper-case
Fortran comments to create headers.  Does any of this sound like 21st
century software design?

- MKL blas is very good on Intel hardware.  Naturally Intel doesn't go
to great lengths to support AMD processors to the same extent.  MKL is
proprietary, closed-source software and plugging it into an
open-source software systems like R and Aramadillo feels like a
compromise.  There are open alternatives like Atlas and OpenBlas.
However, these all need to conform to the Fortran calling sequence
specifications (Lapacke and CBLAS do provide C calling sequences but
it is not a good idea to count on their availability at present).  And
notice that this approach contradicts the "Lapack code is complex but
widely tested so it should be left untouched" argument.  The way to
get good performance in OpenBlas and MKL is to rewrite the BLAS and
heavily-used parts of Lapack, like dgetrf, in a combination of C and
Assembler.  I would also point out that one of the reasons that Lapack
code is complex and difficult to understand is because it is written
in the archaic language Fortran 77.

- the approach in Eigen was to use templated C++ classes for dense and
sparse matrices and for the decompositions and solvers.  I don't think
readers of this group need to be reminded of the advantages of
templated object-oriented C++ code.  Eigen has a bit of a learning
curve but it also has a great range of capabilities.

These comments may provoke a heated response from Conrad but, if so, I
don't plan to respond further.  Eigen and Armadillo are different
approaches, each with their own advantages and disadvantages.