[Rcpp-devel] Results on a linear models benchmark

Thu Nov 10 21:53:50 CET 2011

I have added another method to the fastLm function in the RcppEigen
package and Dirk and I have written an introductory vignette.  We have
not yet released a new version so these are currently only available
through the SVN archive at R-forge.

If you install the version on R-forge you can run the benchmark of
linear model fitting methods (calculating both the coefficient
estimates and their standard errors)  by starting R and entering

source(system.file(package="RcppEigen", "examples", "lmBenchmark.R"))

I have tried this with both a single-threaded BLAS (atlas) and a
multi-threaded BLAS (a locally compiled version of the Ubuntu
libopenblas-base package, which is based on Goto's BLAS).  The
processor is an AMD Athlon || X4 635 with 4 cores.

The results with multi-threaded BLAS are

lm benchmark for n = 100000 and p = 40: nrep = 20
      test   relative elapsed user.self sys.self
3     LDLt   1.000000   1.176     1.172    0.000
8      LLt   1.010204   1.188     1.172    0.000
6  SymmEig   2.762755   3.249     2.704    0.516
7       QR   6.350340   7.468     6.932    0.528
9     arma   6.601190   7.763    25.686    4.473
2    PivQR   7.154762   8.414     7.777    0.608
1   lm.fit  11.683673  13.740    21.561   16.769
4    GESDD  12.576531  14.790    44.011   10.960
5      SVD  44.475340  52.303    51.379    0.804
10     GSL 150.456633 176.937   210.517  149.857

Only the arma, lm.fit, GESDD and GSL methods use the multithreading.
You can see that in each of these methods the user time exceeds the
elapsed time.  Notice, though, how large the system time is.  arma
achieves a reasonable trade-off between system time and computation
time, GESDD is sort-of okay but lm.fit and GSL are not - probably
because they use only level-1 BLAS.

However, in terms of elapsed time, which is what really counts, the
single-threaded methods LDLt, LLt, and SymmEig all using Eigen methods
based on the crossproduct matrix (i.e. X'X) are still the fastest.
The QR, arma and PivQR methods are comparable.  Of these only the
PivQR handles rank-deficient cases properly.  lm.fit and GESDD are
comparable.  GESDD, SVD and GSL are all based on a singular value
decomposition.  GSL from the GNU Scientific Library uses an older
algorithm and is out of contention.  The SVD method uses the JacobiSVD
class from Eigen and the GESDD method uses Eigen classes but calls the
Lapack subroutine dgesdd to calculate the SVD itself.

 Timings with a single-threaded BLAS on the same machine are

lm benchmark for n = 100000 and p = 40: nrep = 20
     test   relative elapsed user.self sys.self
8      LLt   1.000000   1.199     1.196    0.000
3     LDLt   1.053378   1.263     1.261    0.000
6  SymmEig   2.795663   3.352     2.760    0.584
7       QR   6.708924   8.044     7.393    0.640
9     arma   7.525438   9.023     8.992    0.004
2    PivQR   8.038365   9.638     9.004    0.624
1   lm.fit  14.638866  17.552    16.701    0.828
4    GESDD  15.276063  18.316    17.681    0.620
5      SVD  45.872394  55.001    54.091    0.876
10     GSL 149.428691 179.165   178.391    0.668

Given the variation in the timings of the methods that don't use BLAS
(LDLt, LLt, SymmEig, QR, PivQR and SVD), the gains from the
multi-threaded BLAS are not impressive.  I was allowing 4 threads on a
machine with 4 cores so perhaps it would be better to use fewer
threads.