The comment of Doug Bates on this thread motivated me to better examine opportunities for algorithm improvements in my R package c++ code written under Rcpp and RcppArmadillo. I performed a few changes: firstly, i removed all general matrix inverse computations, replacing many with triangular solving from cholesky decompositions (e.g. to conduct posterior sampling from a multivariate gaussian where the precision matrix is readily available). secondly, instead of repeatedly performing high-dimensional matrix computations composing sampled parameter and design objects during a sequential sampling scan, I made the computation once per iteration and then operated on the resulting large object by repeatedly extracting portions, pre-sampling, and inserting back the post-sampled result during a sequential scan. thirdly, while linear algebra compositions of parameters and data objects must be rendered repeatedly across posterior sampling iterations, compositions of design/data objects (with each other) may be performed only once and re-used. I had been sloppy and was repeatedly performing computations purely involving data objects inside the sampling loops. I replaced this approach by slicing and dicing the design objects and performing the matrix operations, once, and then repeatedly inserting the objects during sampling. While the first two steps - particularly the second - produced some runtime speed improvements, I was surprised that the third step produced only a very small improvement where I had expected a large reduction in runtime. I used RcppArmadillo and placed these sliced and diced matrix objects into fields from which I then extracted elements during sampling. So I replaced matrix computations with extractions from field containers. Is it possible that extraction speeds from fields and multiplying a chain of moderate-dimensioned matrices (e.g. 5 x 600) are of equivalent computational intensity? Was the compiler compensating for my hackery?<div class="gmail_extra">
<br><br><div class="gmail_quote">On Tue, Dec 11, 2012 at 7:07 AM, Honglang Wang <span dir="ltr"><<a href="mailto:wanghonglang2008@gmail.com" target="_blank">wanghonglang2008@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
I tried out that if the dim of the matrix less than or equal to 4, inv() and solve() have the same speed.<div class="im HOEnZb"><br><br clear="all"><div><div>Best wishes!</div><div> </div><div>Honglang Wang</div><div> </div>
<div>Office C402 Wells Hall</div>
<div>Department of Statistics and Probability</div><div>Michigan State University</div><div>1579 I Spartan Village, East Lansing, MI 48823</div><div><a href="mailto:wangho16@msu.edu" target="_blank">wangho16@msu.edu</a></div>
</div><br>
<br><br></div><div class="HOEnZb"><div class="h5"><div class="gmail_quote">On Mon, Dec 10, 2012 at 8:05 AM, c s <span dir="ltr"><<a href="mailto:conradsand.arma@gmail.com" target="_blank">conradsand.arma@gmail.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Hi Honglang,<br><br>I recommend looking into using the solve() function in Armadillo, instead of inv(). Using solve() will be considerably faster. <br><br>More details here:<br><a href="http://arma.sourceforge.net/docs.html#solve" target="_blank">http://arma.sourceforge.net/docs.html#solve</a><div>
<div><br>
<br><br>On Thursday, December 6, 2012, Honglang Wang <<a href="mailto:wanghonglang2008@gmail.com" target="_blank">wanghonglang2008@gmail.com</a>> wrote:<br>> The following is a full example although I don't know whether it's minimal or not:<br>
><br>> library(Rcpp)<br>> library(RcppArmadillo)<br>> sourceCpp("betahat_mod.cpp")<br>><br>> #The following is data generation.<br>> n=200<br>> m=20<br>> p=2<br>> t=runif(m*n,min=0, max=1)<br>
> X1=rnorm(m*n,0,1)<br>> X1=as.matrix(1+2*exp(t)+X1)<br>> X2=rnorm(m*n,0,1)<br>> X2=as.matrix(3-4*t^2+X2)<br>> X=cbind(X1,X2)<br>> beta1=0.5*sin(t)<br>> beta2=beta1<br>> rho=0.2<br>
> sig2=1<br>> eps=unlist(lapply(1:n,function(x){as.matrix(arima.sim(list(ar=rho),m,sd=sqrt(sig2*(1-rho^2))))}))<br>> y=as.matrix(X1*beta1+X2*beta2+eps)<br>><br>> #A simple kernel function.<br>> ker=function(x,h)<br>
> {<br>> ans=x<br>> lo=(abs(x)<h)<br>> ans[lo]=(3/4)*(1-(x[lo]/h)^2)/h<br>> ans[!lo]=0<br>> return(ans)<br>> }<br>><br>><br>> h=0.3<br>><br>> #assess the time for evaluating bethat of 100 t's:<br>
> system.time((betahatt=t(apply(as.matrix(t[1:100]),1,function(x) betahat(ker,x,X,y,t,h,m)$betahat))))<br>><br>> And the .cpp file is the following:<br>><br>> // [[Rcpp::depends(RcppArmadillo)]]<br>><br>
> #include <RcppArmadillo.h><br>><br>> using namespace Rcpp;<br>><br>> // [[Rcpp::export]]<br>> List betahat(Function ker, double t0, NumericMatrix Xr, NumericMatrix yr, NumericVector tr, double h, int m) {<br>
> int n = Xr.nrow(), p = Xr.ncol();<br>> arma::mat X(Xr.begin(), n, p, false);<br>> arma::mat y(yr.begin(), n, 1, false);<br>> arma::colvec t(tr.begin(), tr.size(), false);<br>> arma::mat T = X;<br>
> T.each_col() %= (t-t0)/h;<br>> arma::mat D = arma::join_rows(X,T);<br>> arma::vec kk =as<arma::vec>(ker(tr-t0,h));<br>> arma::mat W = (arma::diagmat(kk))/m;<br>> arma::mat Inv = arma::trans(D)*W*D;<br>
> arma::vec betahat = arma::inv(Inv)*arma::trans(D)*W*y;<br>> arma::colvec betahat0(betahat.begin(),betahat.size()/2,false);<br>> return List::create(Named("betahat") = betahat0);<br>> }<br>><br>
> Best wishes!<br>> <br>> Honglang Wang<br>> <br>> Office C402 Wells Hall<br>> Department of Statistics and Probability<br>> Michigan State University<br>> 1579 I Spartan Village, East Lansing, MI 48823<br>
> <a href="mailto:wangho16@msu.edu" target="_blank">wangho16@msu.edu</a><br>><br>><br>> On Wed, Dec 5, 2012 at 4:07 AM, Christian Gunning <<a href="mailto:xian@unm.edu" target="_blank">xian@unm.edu</a>> wrote:<br>
>><br>>> Can you post a minimal full example?<br>
>> -Christian<br>>><br>>> On Tue, Dec 4, 2012 at 8:39 PM, Honglang Wang<br>>> <<a href="mailto:wanghonglang2008@gmail.com" target="_blank">wanghonglang2008@gmail.com</a>> wrote:<br>>> > Yes, the main issue for my coding is the allocation of memory. And I have<br>
>> > fixed one of the biggest memory allocation issue: 4000 by 4000 diagonal<br>>> > matrix. And since I am not familiar with Rcpp and RcppArmadillo, I have no<br>>> > idea how to reuse the memory. I hope I can have some materials to learn<br>
>> > this. Thanks.<br>>> ><br>>> ><br>>> >><br>>> >> What exactly do these timings show? A single call you your function?<br>>> >> How many calls?<br>>> >><br>
>> > Here I called my function for 100 times.<br>>> ><br>>> >><br>>> >> Building on Romain's point: -- a portion of your function's runtime is<br>>> >> in memory allocation<br>
>> >> (and you have a lot of allocations here).<br>>> >> If you're calling your function thousands or millions of times, then<br>>> >> it might pay to closely<br>>> >> examine your memory allocation strategies and figure out what's<br>
>> >> temporary, for example.<br>>> >> It looks like you're already using copy_aux_mem = false in a number<br>>> >> of places, but you're<br>>> >> allocating a lot of objects -- of approx what size?<br>
>> >><br>>> >> For example, wouldn't this work just as well with one less allocation?<br>>> >> arma::vec kk = t;<br>>> >> arma::uvec q1 = arma::find(arma::abs(tp)<h);<br>
>> >> kk.elem(q1) = ((1-arma::pow(tp.elem(q1)/h,2))/h)*0.75;<br>>> >> // done with q1. let's reuse it.<br>>> >> q1 = arma::find(arma::abs(tp)>=h);<br>>> >> // was q2<br>
>> >> kk.elem(q1).zeros();<br>>> >><br>>> >> You could potentially allocate memory for temporary working space in<br>>> >> R, grab it with copy_aux_mem = false, write your temp results there,<br>
>> >> and reuse these objects in subsequent function calls. It doesn't make<br>>> >> sense to go to this trouble, though, if your core algorithm consumes<br>>> >> the bulk of runtime.<br>
>> >><br>>> >> Have you looked on the armadillo notes r.e. inv? Matrix inversion has<br>>> >> O(>n^2). You may be aided by pencil-and-paper math here.<br>>> >> <a href="http://arma.sourceforge.net/docs.html#inv" target="_blank">http://arma.sourceforge.net/docs.html#inv</a><br>
>> >><br>>> > Here my matrix for inverse is only 4 by 4, so I think it's ok.<br>>> ><br>>> >><br>>> >> best,<br>>> >> Christian<br>>> >><br>
>> >> > Dear All,<br>>> >> > I have tried out the first example by using RcppArmadillo, but I am not<br>>> >> > sure whether the code is efficient or not. And I did the comparison of<br>
>> >> > the<br>>> >> > computation time.<br>>> >> ><br>>> >> > 1) R code using for loop in R: 87.22s<br>>> >> > 2) R code using apply: 77.86s<br>
>> >> > 3) RcppArmadillo by using for loop in C++: 53.102s<br>
>> >> > 4) RcppArmadillo together with apply in R: 47.310s<br>>> >> ><br>>> >> > It is kind of not so big increase. I am wondering whether I used an<br>>> >> > inefficient way for the C++ coding:<br>
>> >><br>>> >><br>>> >><br>>> >> --<br>>> >> A man, a plan, a cat, a ham, a yak, a yam, a hat, a canal – Panama!<br>>> ><br>>> ><br>>><br>
>><br>>><br>>> --<br>>> A man, a plan, a cat, a ham, a yak, a yam, a hat, a canal – Panama!<br>><br>>
</div></div></blockquote></div><br>
</div></div><br>_______________________________________________<br>
Rcpp-devel mailing list<br>
<a href="mailto:Rcpp-devel@lists.r-forge.r-project.org">Rcpp-devel@lists.r-forge.r-project.org</a><br>
<a href="https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/rcpp-devel" target="_blank">https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/rcpp-devel</a><br></blockquote></div><br><br clear="all"><div>
<br></div>-- <br>Thank you, Terrance Savitsky<br>
</div>