<div dir="ltr">Hi there,<div><br></div><div>I am using gradient descent to factorize a large user-item matrix. I am trying to use all 40 available cores, but unfortunately performance is no better than when I was using just one. I am new to OpenMP and RcppArmadillo, so please pardon my ignorance.<br></div><div><br></div><div>The main loop is:<br></div><div><table class="" style="border-collapse:collapse;border-spacing:0px;color:rgb(51,51,51);font-family:Helvetica,arial,nimbussansl,liberationsans,freesans,clean,sans-serif,'Segoe UI Emoji','Segoe UI Symbol';font-size:13px;line-height:14.5600004196167px"><tbody><tr><td id="LC27" class="" style="padding:0px 10px;vertical-align:top;font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:12px;white-space:pre;overflow:visible;word-wrap:normal">#<span class="" style="color:rgb(167,29,93)">pragma</span> omp parallel for</td></tr><tr><td id="L28" class="" style="padding:0px 10px;width:50.4000015258789px;min-width:50px;white-space:nowrap;font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:12px;line-height:18px;color:rgba(0,0,0,0.298039);vertical-align:top;text-align:right;border-style:solid;border-color:rgb(238,238,238);border-width:0px 1px 0px 0px"></td><td id="LC28" class="" style="padding:0px 10px;vertical-align:top;font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:12px;white-space:pre;overflow:visible;word-wrap:normal">  <span class="" style="color:rgb(167,29,93)">for</span> (<span class="" style="color:rgb(167,29,93)">int</span> u = <span class="" style="color:rgb(0,134,179)">0</span>; u &lt; C.<span class="">n_rows</span>; u++) {</td></tr><tr><td id="L29" class="" style="padding:0px 10px;width:50.4000015258789px;min-width:50px;white-space:nowrap;font-family:Consolas,'Liberation 
Mono',Menlo,Courier,monospace;font-size:12px;line-height:18px;color:rgba(0,0,0,0.298039);vertical-align:top;text-align:right;border-style:solid;border-color:rgb(238,238,238);border-width:0px 1px 0px 0px"></td><td id="LC29" class="" style="padding:0px 10px;vertical-align:top;font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:12px;white-space:pre;overflow:visible;word-wrap:normal">    arma::mat Cu = <span class="" style="color:rgb(0,134,179)">diagmat</span>(C.<span class="" style="color:rgb(0,134,179)">row</span>(u));</td></tr><tr><td id="L30" class="" style="padding:0px 10px;width:50.4000015258789px;min-width:50px;white-space:nowrap;font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:12px;line-height:18px;color:rgba(0,0,0,0.298039);vertical-align:top;text-align:right;border-style:solid;border-color:rgb(238,238,238);border-width:0px 1px 0px 0px"></td><td id="LC30" class="" style="padding:0px 10px;vertical-align:top;font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:12px;white-space:pre;overflow:visible;word-wrap:normal">    arma::mat YTCuIY = Y.<span class="" style="color:rgb(0,134,179)">t</span>() * (Cu) * Y;</td></tr><tr><td id="L31" class="" style="padding:0px 10px;width:50.4000015258789px;min-width:50px;white-space:nowrap;font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:12px;line-height:18px;color:rgba(0,0,0,0.298039);vertical-align:top;text-align:right;border-style:solid;border-color:rgb(238,238,238);border-width:0px 1px 0px 0px"></td><td id="LC31" class="" style="padding:0px 10px;vertical-align:top;font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:12px;white-space:pre;overflow:visible;word-wrap:normal">    arma::mat YTCupu = Y.<span class="" style="color:rgb(0,134,179)">t</span>() * (Cu + fact_eye) * P.<span class="" style="color:rgb(0,134,179)">row</span>(u).<span class="" style="color:rgb(0,134,179)">t</span>();</td></tr><tr><td id="L32" class="" 
style="padding:0px 10px;width:50.4000015258789px;min-width:50px;white-space:nowrap;font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:12px;line-height:18px;color:rgba(0,0,0,0.298039);vertical-align:top;text-align:right;border-style:solid;border-color:rgb(238,238,238);border-width:0px 1px 0px 0px"></td><td id="LC32" class="" style="padding:0px 10px;vertical-align:top;font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:12px;white-space:pre;overflow:visible;word-wrap:normal">    arma::mat WuT = YTY + YTCuIY + lambda_eye;</td></tr><tr><td id="L33" class="" style="padding:0px 10px;width:50.4000015258789px;min-width:50px;white-space:nowrap;font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:12px;line-height:18px;color:rgba(0,0,0,0.298039);vertical-align:top;text-align:right;border-style:solid;border-color:rgb(238,238,238);border-width:0px 1px 0px 0px"></td><td id="LC33" class="" style="padding:0px 10px;vertical-align:top;font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:12px;white-space:pre;overflow:visible;word-wrap:normal">    arma::mat xu = <span class="" style="color:rgb(0,134,179)">solve</span>(WuT, YTCupu);</td></tr><tr><td id="L34" class="" style="padding:0px 10px;width:50.4000015258789px;min-width:50px;white-space:nowrap;font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:12px;line-height:18px;color:rgba(0,0,0,0.298039);vertical-align:top;text-align:right;border-style:solid;border-color:rgb(238,238,238);border-width:0px 1px 0px 0px"></td><td id="LC34" class="" style="padding:0px 10px;vertical-align:top;font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:12px;white-space:pre;overflow:visible;word-wrap:normal">
</td></tr><tr><td id="L35" class="" style="padding:0px 10px;width:50.4000015258789px;min-width:50px;white-space:nowrap;font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:12px;line-height:18px;color:rgba(0,0,0,0.298039);vertical-align:top;text-align:right;border-style:solid;border-color:rgb(238,238,238);border-width:0px 1px 0px 0px"></td><td id="LC35" class="" style="padding:0px 10px;vertical-align:top;font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:12px;white-space:pre;overflow:visible;word-wrap:normal">    <span class="" style="color:rgb(150,152,150)">// Update gradient -- maybe a slow operation in parallel?</span></td></tr><tr><td id="L36" class="" style="padding:0px 10px;width:50.4000015258789px;min-width:50px;white-space:nowrap;font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:12px;line-height:18px;color:rgba(0,0,0,0.298039);vertical-align:top;text-align:right;border-style:solid;border-color:rgb(238,238,238);border-width:0px 1px 0px 0px"></td><td id="LC36" class="" style="padding:0px 10px;vertical-align:top;font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:12px;white-space:pre;overflow:visible;word-wrap:normal">    X.<span class="" style="color:rgb(0,134,179)">row</span>(u) = xu.<span class="" style="color:rgb(0,134,179)">t</span>();</td></tr><tr><td id="L37" class="" style="padding:0px 10px;width:50.4000015258789px;min-width:50px;white-space:nowrap;font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:12px;line-height:18px;color:rgba(0,0,0,0.298039);vertical-align:top;text-align:right;border-style:solid;border-color:rgb(238,238,238);border-width:0px 1px 0px 0px"></td><td id="LC37" class="" style="padding:0px 10px;vertical-align:top;font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:12px;white-space:pre;overflow:visible;word-wrap:normal">  }</td></tr></tbody></table></div><div><br></div><div><br></div><div>full code - <a 
href="https://github.com/sanealytics/recommenderlabrats/blob/master/src/implicit.cpp">https://github.com/sanealytics/recommenderlabrats/blob/master/src/implicit.cpp</a></div><div><br></div><div>(implementing this paper - <a href="http://www.researchgate.net/profile/Yifan_Hu/publication/220765111_Collaborative_Filtering_for_Implicit_Feedback_Datasets/links/0912f509c579ddd954000000.pdf">http://www.researchgate.net/profile/Yifan_Hu/publication/220765111_Collaborative_Filtering_for_Implicit_Feedback_Datasets/links/0912f509c579ddd954000000.pdf</a>)</div><div><br></div><div>Matrices C, Y and P are large. Matrix X can be assumed to be small.</div><div><br></div><div>I have the following questions -</div><div>1) I have replaced my BLAS with an OpenMP-enabled BLAS and am also using the "<span style="color:rgb(51,51,51);font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:12px;line-height:13.4399995803833px;white-space:pre">#</span><span class="" style="color:rgb(167,29,93);font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:12px;line-height:13.4399995803833px;white-space:pre">pragma</span><span style="color:rgb(51,51,51);font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:12px;line-height:13.4399995803833px;white-space:pre"> omp parallel for</span>" directive. Will the two step on each other, or are they complementary? I ask because my understanding is that the for loop will distribute the users across threads, and the BLAS will then split each matrix multiplication across all threads again. Is that right? And if so, is that what we want to do?</div><div><br></div><div>2) Since the threads are running in parallel and I just need the resulting values as output, I would ideally like a reduce() that yields each row in sequence, so that I can construct the new X from those rows. I am not sure how to go about doing that with Rcpp. 
I also want to avoid copying data as much as possible.<br></div><div><br></div><div>Looking forward to your input,</div><div>Saurabh</div></div>