[Rcpp-devel] Avoiding memory allocations when using Armadillo, BLAS, and LAPACK

Nathan Kurz nate at verse.com
Wed Feb 18 22:31:11 CET 2015


On Wed, Feb 18, 2015 at 8:00 AM, Dirk Eddelbuettel <edd at debian.org> wrote:
> It is a little challenging to keep up with your ability to ask this question
> here, on StackOverflow and again on r-devel. As I first saw it here, I'll
> answer here.

Sorry about that.  I had planned to post it first only on StackOverflow
so the code samples would be better formatted, and then saw your
frequent exhortations to use Rcpp-devel instead.  The r-devel post is
actually a separate issue that happens to share the same example.  In
that one, I'm exploring a patch along the lines of Radford Neal's pqR
work that would require changing the way R-core handles all of its
allocations.

> R is a dynamically-typed interpreted language with many bells and whistles,
> but also opaque memory management.

From the Rcpp point of view, it probably should be considered opaque.
From the R-core side, I think it would be useful if there were more
people exploring ways to improve it.  From the measurements I've made,
I think improving R's memory management might be the lowest-hanging
fruit for improving R's overall performance.

> My recommendation always is, "if in
> doubt and when working with large objects", to maybe just step aside and do
> something different outside of R.

Sidestepping R is definitely a clear path to high performance, but for
this particular project I'm trying to write code that interoperates
with existing R code and can be modified by R programmers unfamiliar
with C++.  I'm hoping there is a subset of R "design patterns" that
produces acceptably high performance.  In order of preference, I'd
like to:

1) Find a way to write high-performance code in straight R.
2) Failing that, write an Rcpp extension that calls BLAS/LAPACK efficiently.
3) Failing that, write a C/C++ library and an Rcpp interface to it.

The r-devel thread is concentrating on the first, how to improve the
performance of core R code by reducing memory churn.  The thread here
is concentrating on the second, how to write Rcpp extensions that use
BLAS/LAPACK functions more efficiently.  You'll be happy to know that
I've not yet started a thread on the third approach!
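
To make the second approach concrete, the pattern I have in mind looks
something like the sketch below (dgemm_into is a name I just made up):
call BLAS directly so the product lands in a preallocated result
matrix rather than a fresh temporary.

    #include <Rcpp.h>
    #include <R_ext/BLAS.h>
    #ifndef FCONE      // newer R headers define this when USE_FC_LEN_T is set
    # define FCONE
    #endif

    // C := A %*% B, written straight into C's preallocated memory.
    // [[Rcpp::export]]
    void dgemm_into(Rcpp::NumericMatrix A, Rcpp::NumericMatrix B,
                    Rcpp::NumericMatrix C) {
        int m = A.nrow(), k = A.ncol(), n = B.ncol();
        if (B.nrow() != k || C.nrow() != m || C.ncol() != n)
            Rcpp::stop("dimension mismatch");
        double one = 1.0, zero = 0.0;
        F77_CALL(dgemm)("N", "N", &m, &n, &k, &one,
                        A.begin(), &m, B.begin(), &k,
                        &zero, C.begin(), &m FCONE FCONE);
    }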

> You mentioned logistic regression.

I added a more complete code sample to the StackOverflow question.
For the actual work, I'm planning to implement Komarek's LR-TRIRLS,
since the algorithmic advantage is probably going to be greater than
the implementation difference.
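
For reference, here is a sketch of one plain IRLS step in
RcppArmadillo (not Komarek's trust-region variant, and irls_step is an
illustrative name), mostly to show where the "w * Q" style elementwise
products turn up:

    #include <RcppArmadillo.h>
    // [[Rcpp::depends(RcppArmadillo)]]

    // One IRLS step: beta_new = (X'WX)^{-1} X'Wz, with weights
    // W = diag(mu * (1 - mu)) and working response z.
    // [[Rcpp::export]]
    arma::vec irls_step(const arma::mat& X, const arma::vec& y,
                        const arma::vec& beta) {
        arma::vec eta = X * beta;                      // linear predictor
        arma::vec mu  = 1.0 / (1.0 + arma::exp(-eta)); // logistic mean
        arma::vec w   = mu % (1.0 - mu);               // IRLS weights
        arma::vec z   = eta + (y - mu) / w;            // working response
        arma::mat wX  = X.each_col() % w;              // the "w * Q" pattern
        return arma::solve(X.t() * wX, X.t() * (w % z));
    }

Every statement in that step allocates a fresh vector or matrix, which
is exactly the churn I'd like to keep out of the inner loop.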

> And we did something similar at work: use
> Rcpp as well as bigmemory via external pointers (ie Rcpp::XPtr) so that *you*
> can allocate one chunk of memory *you* control, and keep at arm's length of
> R.

The bigmemory/bigalgebra combination comes quite close to what I'm
trying to do, but I'm scared to rely on it.  The namespace is a
hodgepodge, I don't need the larger-than-memory aspects, it's not
actively maintained, and some parts seem broken or missing.

XPtr is interesting, but conceptually it seems like there should be a
way to work with R's memory management rather than trying to sidestep
it.  Perhaps I'm being naive.
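
For concreteness, my understanding of the XPtr approach is roughly the
sketch below (alloc_mat and fill_mat are made-up names, not your
actual code): allocate one matrix outside R's heap and hand R only an
external pointer to it.

    #include <RcppArmadillo.h>
    // [[Rcpp::depends(RcppArmadillo)]]

    // Allocate one arma::mat that *we* control; R sees only an external
    // pointer, and the memory is freed when R garbage-collects the XPtr.
    // [[Rcpp::export]]
    Rcpp::XPtr<arma::mat> alloc_mat(int n, int k) {
        return Rcpp::XPtr<arma::mat>(new arma::mat(n, k), true);
    }

    // Mutate the matrix in place through the pointer: no R allocation.
    // [[Rcpp::export]]
    void fill_mat(Rcpp::XPtr<arma::mat> p, double value) {
        p->fill(value);
    }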

> Implementing a simple glm-alike wrapper over that to fit a logistic
> regression is then not so hard.

Mostly yes, with the possible exception of the question I'm focusing on :)

How do I enable the R syntax "wQ = w * Q" to reuse the preallocated
space for wQ rather than allocating a new temporary?  While doing a
system-level malloc() is cheaper than letting R handle the allocation,
it's still much less efficient than reuse.
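
The closest I've found on the Rcpp side is Armadillo's advanced
constructors, which wrap preallocated R memory without copying.  A
sketch (scale_rows_inplace is a made-up name, and whether Armadillo
avoids an internal temporary still depends on how it evaluates the
expression):

    #include <RcppArmadillo.h>
    // [[Rcpp::depends(RcppArmadillo)]]

    // Compute wQ = w * Q (R's recycling semantics, w of length nrow(Q))
    // into the caller-preallocated wQ_, with no fresh R-level allocation.
    // [[Rcpp::export]]
    void scale_rows_inplace(Rcpp::NumericVector w_, Rcpp::NumericMatrix Q_,
                            Rcpp::NumericMatrix wQ_) {
        // copy_aux_mem = false, strict = true: reuse R's buffers directly
        arma::vec w(w_.begin(), w_.size(), false, true);
        arma::mat Q(Q_.begin(), Q_.nrow(), Q_.ncol(), false, true);
        arma::mat wQ(wQ_.begin(), wQ_.nrow(), wQ_.ncol(), false, true);
        wQ = Q.each_col() % w;   // writes into the existing buffer
    }

From R, wQ would be allocated once outside the loop and refilled on
every call, which at least hoists the allocation, even if it isn't the
plain "wQ = w * Q" syntax I was hoping for.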

> We even did it in a multicore context to get
> extra parallelism still using only that one (large!!) chunk of memory.

Yes, fork() plus copy-on-write seems like a great performance
combination.  But as long as the memory isn't modified, I think this
should work just as well for native R variables (or their
Rcpp::NumericVector wrappers) as for external pointers.

--nate

