[Rcpp-devel] examples of using cula matrix multiplication in Rcpp

Charles Determan cdetermanjr at gmail.com
Mon May 18 15:16:44 CEST 2015


I am actually working on a general-purpose GPU library for R using Rcpp and
RcppArmadillo, but it is still under heavy development.  During these very
early stages I have had an 'older' card (AMD Radeon HD 5700 Series), so I
have been working primarily with OpenCL and the clBLAS library (which must
be installed separately).  The idea has been to create the easiest possible
(at least from my perspective) interface for leveraging GPUs.  I will be
receiving a newer NVIDIA card (GeForce GTX 970), so I will begin adding
CUDA support as well (I fully intend to have a hybrid system similar to the
arrayfire library).  You can view it on my github as well:
https://github.com/cdeterman/gpuR

As I said, keep in mind that it is still under heavy development: many more
functions need to be added, and it is only available for Linux at the
moment.  It is designed to provide a new class structure similar to the
'Matrix' package.  You can see an example of vector addition on the github
page, but an example of matrix multiplication would be:

library(gpuR)

# A and B are ordinary R matrices
A <- matrix(rnorm(16), nrow = 4)
B <- matrix(rnorm(16), nrow = 4)

# convert matrices to gpuMatrix objects
gpuA <- gpuMatrix(A)
gpuB <- gpuMatrix(B)

# matrix multiplication, carried out on the GPU
gpuC <- gpuA %*% gpuB
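
The idea is that %*% on gpuMatrix objects dispatches to the GPU behind
the scenes, so code written against ordinary R matrices should need
minimal changes.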


Also, if a user is looking into GPGPU, they are likely dealing with 'big
data', so this package is also intended to be used in concert with the
'bigmemory' package via the 'gpuBigMatrix' function, where the idea is to
provide a full interface even when matrices exceed local GPU memory size
(obviously slower, but useful for those without access to expensive
hardware).  There is also support for 'integer', 'float', and 'double' data
types if the default R 'double' precision is not required (for an
additional speed-up).
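
For example, a lower-precision multiplication might look like the
following (a minimal sketch; the 'type' argument reflects the current
development API and may change):

# hypothetical single-precision objects; 'float' trades precision for
# speed and halves the volume of data transferred to the device
gpuA32 <- gpuMatrix(A, type = "float")
gpuB32 <- gpuMatrix(B, type = "float")
gpuC32 <- gpuA32 %*% gpuB32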

With my older card, and using code that could likely be optimized further,
Dirk is correct that the data transfer time is very significant.  My
initial benchmarks can exceed R's native BLAS (not much to celebrate) but
are clearly bested by just using OpenBLAS.  Also, as Dirk mentions, the
performance gap shrinks as the size of the matrix increases, until the GPU
wins out.  Again, my initial benchmarks show that gpuMatrix multiplication
does eventually beat OpenBLAS consistently once matrices approach sizes of
2048x2048.  I am optimistic, however, about the newer card and the CUDA
support to come.  Once I have more functions and the CUDA interface set up
I intend to submit the package to CRAN.  I am always open to comments,
concerns, and/or contributions :)
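
For anyone wanting to run a similar comparison, the shape of my timing
code is roughly the following (a simplified sketch, not my actual
benchmark code; system.time is coarse, so a package like
'microbenchmark' would give better resolution):

library(gpuR)

n <- 2048
A <- matrix(rnorm(n * n), nrow = n)
B <- matrix(rnorm(n * n), nrow = n)

# CPU timing, using whatever BLAS R is linked against
cpu_time <- system.time(C_cpu <- A %*% B)

# GPU timing; the gpuMatrix conversions are timed as well, since the
# host-to-device transfer is a real cost that dominates at small sizes
gpu_time <- system.time({
  gpuA <- gpuMatrix(A)
  gpuB <- gpuMatrix(B)
  gpuC <- gpuA %*% gpuB
})

cpu_time
gpu_time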

Regards,
Charles

On Sat, May 16, 2015 at 2:58 PM, Colin Rundel <rundel at gmail.com> wrote:

> I’ve been playing around with Rcpp and CUDA (CUBLAS and Magma in
> particular) for quite a while now and definitely find it useful for
> improving performance. My interest is mostly in spatial models and Gaussian
> processes, where the rate-limiting step is usually an O(n^3) matrix
> decomposition with n between 1000 and 5000.
>
> For these types of tasks I routinely see ~2x improvements over
> RcppArmadillo & OpenBLAS using a $100 consumer grade card, which isn’t huge
> but makes a big difference when the overall runtime is around 80 hours per
> model.
>
> If anyone is interested in looking at some code I have the early stages of
> a package up on github: https://github.com/rundel/RcppGP. In particular
> the gpu_mat class has a reasonably mature interface for moving data between
> armadillo and cuBLAS.
>
> -Colin
>
> -----
>
> Colin Rundel
> Assistant Professor of the Practice
> Duke University, Department of Statistical Science
> www.stat.duke.edu/~cr173/
>
> On May 16, 2015, at 12:24 PM, Yue Li <gorillayue at gmail.com> wrote:
>
> Thanks for the quick insightful replies! I will look into the solutions
> and keep the list posted on any progress on this end.
>
> Yue
>
> On May 16, 2015, at 12:10 PM, Dirk Eddelbuettel <edd at debian.org> wrote:
>
>
> On 16 May 2015 at 17:05, Sean O'Riordain wrote:
> | Some students I have been working with managed to get Rcpp to work
> | with CUDA for a simple use case - calculating a big log-likelihood
> | for MCMC - and they got a bit of a speedup compared with Rcpp - but
> | it needs more work.  They promised they would write up a note for
> | the gallery once their exams are over in a couple of weeks.
>
> That is splendid news!
>
> I better make sure I can compile with CUDA then or else building the
> article may be tricky.
>
> Dirk
>
> --
> http://dirk.eddelbuettel.com | @eddelbuettel | edd at debian.org