[Rcpp-devel] Very Large Matrices in RcppArmadillo
Douglas Bates
bates at stat.wisc.edu
Tue Jul 17 18:08:01 CEST 2012
On Tue, Jul 17, 2012 at 10:44 AM, French, Joshua
<JOSHUA.FRENCH at ucdenver.edu> wrote:
> Thank you all for the responses.
>
> Christian, I didn't know about the copy_aux_mem option. I will have to
> take a look at that.
>
> Dirk, thanks for looking into the 64-bit matrix indices.
>
> Doug, the place in my code where I get the error is when I multiply
> matrices. I might have matrices X and Y, where X is 300000x500 and Y is
> 500x300000 and I want Z = X * Y.
You will need to come up with another algorithm. The size of Z is
3,000,000 by 3,000,000 and storing this as a dense matrix will require
about 67 terabytes of memory
> (3000000 * 3000000 * 8) / 2^30
[1] 67055.23
I don't think even Google has that much memory available. You are
going to have to work with a decomposition. For example a QR
decomposition can represent the matrix X as a (virtual) orthogonal
matrix Q of size 3,000,000 by 3,000,000 and R an upper triangular 500
by 500 matrix but stored in essentially the same amount of space as
the original matrix X.
> I could break up the matrices into
> smaller chunks and do the multiplication, but Z is later used in several
> other multi-step calculations (with addition and multiplication mostly) so
> I think that would be a last resort. If I can get the 64-bit matrix
> indices working in RcppArmadillo, I think that will solve much of the
> problem, because I will only need to return very long vectors and not big
> matrices.
>
> Joshua
> --
> Joshua French, Ph.D.
> Assistant Professor
> Department of Mathematical and Statistical Sciences
> University of Colorado Denver
> Joshua.French at ucdenver.edu
> http://math.ucdenver.edu/~jfrench/
> Ph: 303-556-6265 Fax: 303-556-8550
>
>
>
>
>
>
>
> On 7/17/12 8:56 AM, "Douglas Bates" <bates at stat.wisc.edu> wrote:
>
>>On Tue, Jul 17, 2012 at 8:14 AM, Dirk Eddelbuettel <edd at debian.org> wrote:
>>>
>>> On 16 July 2012 at 23:30, French, Joshua wrote:
>>> | I am doing some linear algebra on large matrices in R and receiving
>>>the
>>> | following error: "allocMatrix: too many elements specified". From
>>>what I
>>> | understand, the error is caused by the fact that R uses 32-bit ints
>>>and not
>>> | 64-bit ints for matrix indices, so R doesn't have a way to represent
>>>all the
>>> | elements in the very large matrix.
>>> |
>>> | My two questions:
>>> |
>>> | 1. Armadillo (and presumably RcppArmadillo) will not have this issue
>>>since
>>> | Armadillo provided support for 64-bit indices as of version 2.4.0.
>>>Is there a
>>> | way to easily utilize this functionality from within RcppArmadillo?
>>>
>>> I need to double check but this may have been a compile-time option you
>>>need
>>> to enable. In any event ... R indices are still limited so you may not
>>>be
>>> able to pass these back and forth.
>>>
>>> | 2. I have found in the past that some of the speeds gains from
>>>RcppArmadillo
>>> | in comparison to pure R are lost when passing large matrices as
>>>arguments.
>>> | There will always be overhead when passing arguments (especially
>>>large matrix
>>> | arguments) to pretty much any function. Are there any tricks to
>>>minimize the
>>> | overhead when passing a non-sparse matrix argument of say 1,000,000
>>>by 500 from
>>> | R to Armadillo?
>>>
>>> I defer all question concerning sparse matrices to Doug and other users
>>>of
>>> sparse matrix code. I live mostly in a small-to-medium size dense
>>>matrix world.
>>
>>Actually the question was about non-sparse matrices. It looks as if
>>it is the number of rows in the matrix that will be problematic. An
>>upper bound on the number of rows is the maximum integer value divided
>>by the number of columns.
>>
>>> .Machine$integer.max / 500
>>[1] 4294967
>>
>>I would try not to exceed about 1/2 to 1/4 of that bound.
>>
>>A simple way of handling data sets that are much larger than that is
>>to work with a sample of the rows. If that is not feasible then I
>>would create a list of matrices each of size 1,000,000 by 500 or so
>>and vertically concatenate them in the C++ code. Of course, this
>>means a copying operation. Also, when you are finished if you need to
>>pass results back to R then you face a similar problem getting a large
>>matrix in C++ back into R storage.
>>
>>You can create a read-only matrix in C++ using the storage from R as
>>described by Christian for RcppArmadillo or using the Eigen::Map class
>>in RcppEigen.
>>
>>What are you (Joshua) doing with these large matrices? If the main
>>calculations involve X'X-type calculations you can carry out the
>>calculations on horizontal chunks and then assemble the results.
>
More information about the Rcpp-devel
mailing list