[Rcpp-devel] Differences between RcppEigen and RcppArmadillo

Thu Jun 14 18:56:15 CEST 2012

On Thu, Jun 14, 2012 at 4:43 AM, Dirk Eddelbuettel <edd at debian.org> wrote:
> And you should find Eigen to be a little faster. Andreas Alfons went as far
> as building 'robustHD' using RcppArmadillo with a drop-in for RcppEigen
> (in package 'sparseLTSEigen'; both package names from memory and
> I may have mistyped).  He reported a performance gain of around 25% for
> his problem sets.  On the 'fastLm' benchmark, we find the fast Eigen-based
> decompositions to be much faster than Armadillo.

This is a misconception that needs to be addressed.  For equivalent
functionality, Armadillo is not necessarily any slower than Eigen,
provided that suitable Lapack and/or Blas libraries are used (such as
Intel's MKL or AMD's ACML, or even the open-source ATLAS or OpenBLAS in
many cases).  Standard Lapack and Blas are just that: a "better than
nothing" baseline implementation in terms of performance.

Armadillo doesn't reimplement Lapack and it doesn't reimplement any
decompositions -- it uses Lapack.  (**This is a very important point,
which I elaborate on below**).  As such, the speed of Armadillo for
matrix decompositions is directly dependent on the particular
implementation of Lapack that's installed on the user's machine.

I've seen some ridiculous speed differences between standard Lapack
and MKL.  The latter not only has CPU-specific optimisations (e.g.
using the latest AVX extensions), but can also do multi-threading.

Simply installing ATLAS (which provides speed-ups for several Lapack
functions) on Debian/Ubuntu systems can already make a big difference.
(Debian & Ubuntu use a trick to redirect Lapack and Blas calls to
ATLAS).  Under Mac OS X, the Accelerate framework provides fast
implementations of Lapack and Blas functions (e.g. using
multi-threading).

I've taken the modular approach to Armadillo (i.e. using Lapack rather
than reimplementing decompositions), as it specifically allows other
specialist parties (such as Intel) to provide Lapack that is highly
optimised for particular architectures.  I myself would not be able to
keep up with the specific optimisations required for each CPU.  This
also "future-proofs" Armadillo for each new CPU generation.
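
To make the modular idea concrete, here is a minimal C++ sketch.  It is
not Armadillo's actual code: the names MatVecBackend, reference_matvec
and multiply are hypothetical, and a real library binds to Blas/Lapack
at link time rather than through a function pointer.  The point is only
that the library owns the interface and the checks, while the numerical
kernel can be swapped for a faster one without touching user code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical backend signature: y = A * x for a dense, row-major
// n-by-n matrix. In a real library this role is played by a Blas routine.
using MatVecBackend = void (*)(const double* A, const double* x,
                               double* y, std::size_t n);

// Plain "better than nothing" reference kernel.
void reference_matvec(const double* A, const double* x,
                      double* y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        double sum = 0.0;
        for (std::size_t j = 0; j < n; ++j) {
            sum += A[i * n + j] * x[j];
        }
        y[i] = sum;
    }
}

// Library-facing function: validates sizes, then delegates the actual
// arithmetic to whichever backend was supplied.
std::vector<double> multiply(const std::vector<double>& A,
                             const std::vector<double>& x,
                             MatVecBackend backend = reference_matvec) {
    const std::size_t n = x.size();
    assert(A.size() == n * n);
    std::vector<double> y(n);
    backend(A.data(), x.data(), y.data(), n);
    return y;
}
```

Calling multiply(A, x) uses the reference kernel; passing another
function with the same signature as the third argument switches the
kernel, which is roughly what linking against MKL or ATLAS instead of
standard Blas achieves.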

More importantly, a numerically stable implementation of matrix
decompositions/factorisations is notoriously difficult to get right.
The core algorithms in Lapack have been evolving for the past 20+
years, being exposed to a bazillion corner-cases.  Lapack itself is
related to Linpack and Eispack, which are even older.  I've been
exposed to software development long enough to know that in the end
only time can shake out all the bugs.  As such, using Lapack is far
less risky than reimplementing decompositions from scratch.  A
"home-made" matrix decomposition might be a bit faster on a particular
CPU, but you have far less knowledge as to when it's going to blow up
in your face.
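
A small example of such a corner case, as a hedged sketch: Gaussian
elimination on a 2x2 system with and without partial pivoting.  The
function names are made up for illustration, and Lapack's real routines
are far more elaborate.  For A = [[1e-20, 1], [1, 1]] and b = [1, 2] the
true solution is very close to x = [1, 1]; in double precision the
unpivoted version returns x[0] = 0, while the pivoted version returns
x[0] = 1.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <utility>

using Vec2 = std::array<double, 2>;
using Mat2 = std::array<Vec2, 2>;

// "Home-made" elimination: no pivoting, always divides by A[0][0].
// A tiny A[0][0] produces a huge multiplier that swamps the other entries.
Vec2 solve_no_pivot(Mat2 A, Vec2 b) {
    assert(A[0][0] != 0.0);
    const double m = A[1][0] / A[0][0];   // elimination multiplier
    A[1][1] -= m * A[0][1];
    b[1] -= m * b[0];
    Vec2 x{};
    x[1] = b[1] / A[1][1];                    // back substitution
    x[0] = (b[0] - A[0][1] * x[1]) / A[0][0];
    return x;
}

// Same elimination, but the row with the larger leading entry is moved
// to the top first (partial pivoting), keeping the multiplier small.
Vec2 solve_partial_pivot(Mat2 A, Vec2 b) {
    if (std::fabs(A[1][0]) > std::fabs(A[0][0])) {
        std::swap(A[0], A[1]);
        std::swap(b[0], b[1]);
    }
    return solve_no_pivot(A, b);
}
```

Both functions are "correct" on paper and agree on well-behaved inputs;
only the corner case separates them, which is exactly the kind of bug
that takes time and exposure to find.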

High-performance variants of Lapack, such as MKL, take an existing
proven implementation of a decomposition algorithm and recode parts of
it in assembly, and/or parallelise other parts.
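
The idea can be sketched in C++ as follows, with the caveat that this is
only an analogy: the "tuning" here is a cache-friendlier loop order,
standing in for the assembly and multi-threading that a vendor library
would use, and both function names are hypothetical.  What matters is
that the tuned kernel keeps the reference algorithm and computes the
same product.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference kernel: textbook i-j-k loop order, row-major n-by-n matrices.
std::vector<double> matmul_reference(const std::vector<double>& A,
                                     const std::vector<double>& B,
                                     std::size_t n) {
    assert(A.size() == n * n && B.size() == n * n);
    std::vector<double> C(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            for (std::size_t k = 0; k < n; ++k) {
                C[i * n + j] += A[i * n + k] * B[k * n + j];
            }
        }
    }
    return C;
}

// Tuned variant: i-k-j order, so the inner loop walks B and C along
// contiguous rows. Each C entry still accumulates over k in the same
// order as in the reference kernel.
std::vector<double> matmul_tuned(const std::vector<double>& A,
                                 const std::vector<double>& B,
                                 std::size_t n) {
    assert(A.size() == n * n && B.size() == n * n);
    std::vector<double> C(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t k = 0; k < n; ++k) {
            const double a = A[i * n + k];
            for (std::size_t j = 0; j < n; ++j) {
                C[i * n + j] += a * B[k * n + j];
            }
        }
    }
    return C;
}
```

Checking the tuned kernel against the reference one on the same inputs
is the cheap way to confirm that a faster recoding has not changed the
result.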