[Rcpp-devel] Very Large Matrices in RcppArmillao

Tue Jul 17 16:56:41 CEST 2012

On Tue, Jul 17, 2012 at 8:14 AM, Dirk Eddelbuettel <edd at debian.org> wrote:
>
> On 16 July 2012 at 23:30, French, Joshua wrote:
> | I am doing some linear algebra on large matrices in R and receiving the
> | following error:  "allocMatrix: too many elements specified".  From what I
> | understand, the error is caused by the fact that R uses 32-bit ints and not
> | 64-bit ints for matrix indices, so R doesn't have a way to represent all the
> | elements in the very large matrix.
> |
> | My two questions:
> |
> | 1.  Armadillo (and presumably RcppArmadillo) will not have this issue since
> | Armadillo provided support for 64-bit indices as of version 2.4.0.  Is there a
> | way to easily utilize this functionality from within RcppArmadillo?
>
> I need to double check but this may have been a compile-time option you need
> to enable. In any event ... R indices are still limited so you may not be
> able to pass these back and forth.
>
> | 2.  I have found in the past that some of the speeds gains from RcppArmadillo
> | in comparison to pure R are lost when passing large matrices as arguments.
> |  There will always be overhead when passing arguments (especially large matrix
> | arguments) to pretty much any function.  Are there any tricks to minimize the
> | overhead when passing a non-sparse matrix argument of say 1,000,000 by 500 from
> | R to Armadillo?
>
> I defer all question concerning sparse matrices to Doug and other users of
> sparse matrix code.  I live mostly in a small-to-medium size dense matrix world.

Actually the question was about non-sparse matrices.  It looks as if
it is the number of rows in the matrix that will be problematic.  An
upper bound on the number of rows is the maximum integer value divided
by the number of columns.

> .Machine$integer.max / 500
[1] 4294967

I would try not to exceed about 1/2 to 1/4 of that bound.

A simple way of handling data sets that are much larger than that is
to work with a sample of the rows.  If that is not feasible then I
would create a list of matrices each of size 1,000,000 by 500 or so
and vertically concatenate them in the  C++ code.  Of course, this
means a copying operation.  Also, when you are finished if you need to
pass results back to R then you face a similar problem getting a large
matrix in C++ back into R storage.

You can create a read-only matrix in C++ using the storage from R as
described by Christian for RcppArmadillo or using the Eigen::Map class
in RcppEigen.

What are you (Joshua) doing with these large matrices?  If the main
calculations involve X'X-type calculations you can carry out the
calculations on horizontal chunks and then assemble the results.