[Rcpp-devel] OpenMP and Parallel BLAS
Saurabh B
saurabh.writes at gmail.com
Wed May 27 17:16:47 CEST 2015
Thank you both. I'm trying out these excellent suggestions and reading
through the material. It really helps my understanding of how these two
things work together.
I will update with my findings for everyone's benefit.
To clarify my second question, James: can multiple threads concurrently
modify different segments of an arma::mat object? Or is the entire matrix X
locked whenever a section is modified?
If it is the latter, one way to get around that is to borrow this
functional paradigm -
users -> map -> myFunction() -> reduce()
where myFunction() outputs the new X(u) for the given user. reduce() then
combines all such values to form the new value of X (old is garbage
collected).
Since it emits a new matrix each time, the locking contention is not an
issue.
But if arma::mat can be modified concurrently, the above is not needed. I
think that's what you allude to above.
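On the locking question, here is a minimal sketch, assuming (as James's reply suggests) that each iteration writes only its own row. A plain std::vector stands in for arma::mat so the example compiles without Armadillo. The names solve_user and update_all_users are hypothetical, and the arithmetic is only a placeholder for the real solve(). If the compiler has no OpenMP support, the pragma is ignored and the loop runs serially with the same result.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the per-user solve: fills one output row.
static void solve_user(std::size_t u, double* row, std::size_t n_factors) {
    for (std::size_t k = 0; k < n_factors; ++k) {
        row[k] = static_cast<double>(u) + 0.5 * static_cast<double>(k);
    }
}

// Fills an n_users x n_factors row-major buffer, one row per iteration.
// Assumes cores >= 1.
std::vector<double> update_all_users(int n_users, std::size_t n_factors, int cores) {
    std::vector<double> X(static_cast<std::size_t>(n_users) * n_factors, 0.0);
    double* x = X.data();
    // Iteration u writes only elements [u * n_factors, (u + 1) * n_factors),
    // so no two threads ever write the same element and no lock is needed.
    #pragma omp parallel for num_threads(cores)
    for (int u = 0; u < n_users; ++u) {
        const std::size_t row = static_cast<std::size_t>(u);
        solve_user(row, x + row * n_factors, n_factors);
    }
    return X;
}
```

Nothing here takes a lock, because the rows are disjoint. Whether this beats the map/reduce variant would have to be measured: in a column-major arma::mat, elements of adjacent rows sit next to each other in memory, so threads may still contend on cache lines even though they never write the same element.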
Thanks again,
On Tue, May 26, 2015 at 8:41 PM, Balamuta, James Joseph <
balamut2 at illinois.edu> wrote:
> Greetings and Salutations,
> I would suggest the following modifications:
> 1. Use the Rcpp omp plugin
> // [[Rcpp::plugins(openmp)]]
> Instead of setting the compiler flags manually (assuming you are on Rcpp >= 0.10.5).
> 2. Modify the function parameters to include: int cores
> This allows you to specify cores during run time vs. compile time.
> 3. Specify pragma directive such that it is:
> #pragma omp parallel for num_threads(cores)
> Or use:
> omp_set_num_threads(cores);
> The first fails more gracefully if the system does not support OpenMP
> (the pragma is simply ignored), and it overrides any thread count set elsewhere.
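The two ways of passing cores at run time can be sketched side by side. This is an illustration rather than the code from the thread: square_clause and square_runtime are hypothetical names, the loop body is a placeholder, and cores is assumed to be at least 1. The include of omp.h is guarded so the file still compiles where OpenMP is unavailable.

```cpp
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Variant A: the num_threads clause applies to this one parallel region
// and takes precedence over omp_set_num_threads() and OMP_NUM_THREADS.
std::vector<double> square_clause(const std::vector<double>& x, int cores) {
    std::vector<double> out(x.size());
    const int n = static_cast<int>(x.size());
    #pragma omp parallel for num_threads(cores)
    for (int i = 0; i < n; ++i) {
        out[i] = x[i] * x[i];
    }
    return out;
}

// Variant B: the runtime call changes the default for later parallel
// regions on this thread, and it needs omp.h, hence the guard.
std::vector<double> square_runtime(const std::vector<double>& x, int cores) {
#ifdef _OPENMP
    omp_set_num_threads(cores);
#else
    (void)cores;  // no OpenMP: run serially
#endif
    std::vector<double> out(x.size());
    const int n = static_cast<int>(x.size());
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        out[i] = x[i] * x[i];
    }
    return out;
}
```

Variant A needs no header and no guard, which is the "graceful fail" mentioned above.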
> Regarding your questions:
> 1. OpenMP will open up the requested number of threads. If you have
> a Parallel BLAS it will open up more OpenMP threads. This is problematic.
> Consider:
> A machine with 8 cores.
> Default to using 4 threads for the OpenMP loop.
> Assume that the Parallel BLAS is using 2 threads per call.
> Then, 4*2 = 8 threads are allocated, which already saturates all 8 cores.
> So, depending on your allocation, the two will probably “step over” each other.
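The arithmetic in that example can be written down as a small helper. The function names are hypothetical, and the model is deliberately simple: it assumes every loop thread is inside a BLAS call at the same time. How the BLAS thread count is actually set depends on the BLAS in use and is not shown.

```cpp
#include <algorithm>

// Threads that can be live at once when each of `omp_threads` loop workers
// calls into a BLAS that itself uses `blas_threads` threads per call.
int total_threads(int omp_threads, int blas_threads) {
    return omp_threads * blas_threads;
}

// Largest per-call BLAS thread count that keeps the product within `cores`.
// Assumes omp_threads >= 1; never returns less than 1.
int blas_budget(int cores, int omp_threads) {
    return std::max(1, cores / omp_threads);
}
```

With the numbers above, total_threads(4, 2) is 8 and blas_budget(8, 4) is 2. On the 40-core machine from the original question, blas_budget(40, 40) is 1. If the outer loop already uses every core, this model suggests restricting the BLAS to a single thread per call.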
> 2. Reductions in OpenMP are generally only possible if you have:
> var = var op expr (e.g. sum += x(i); )
> var is a scalar (e.g. sum, the summed value)
> op is the operator to apply (e.g. +, plus)
> expr is a scalar that does not reference var (e.g. x(i), new value)
> I’m confused as to whether you are referring to your final output e.g.
> Y.row(i) = yu.t(); as the reduction.
> If this is the case, the object, Y, is being updated in shared memory.
> Since only one row is updated, this is fine.
> Everything else within the for loop is private to each thread,
> since it is declared inside the parallel region.
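For contrast with the row update, here is a short sketch of the scalar form that the OpenMP reduction clause is meant for. parallel_sum is a hypothetical name and cores is assumed to be at least 1. Each thread accumulates into its own private copy of sum, and the copies are combined with + when the loop ends.

```cpp
#include <vector>

// Sums a vector using an OpenMP reduction over a scalar accumulator.
double parallel_sum(const std::vector<double>& x, int cores) {
    double sum = 0.0;
    const int n = static_cast<int>(x.size());
    #pragma omp parallel for reduction(+:sum) num_threads(cores)
    for (int i = 0; i < n; ++i) {
        sum += x[i];  // var = var op expr, with var a scalar
    }
    return sum;
}
```

Assigning one row per iteration, as in Y.row(i) = yu.t(), does not fit this pattern and does not need it, because no value is combined across iterations.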
> With your journey into OpenMP, these might help:
> Slides regarding OpenMP and RcppArmadillo:
> http://www.thecoatlessprofessor.com/wp-content/uploads/2014/09/hpc_parallel.pdf
> Demo code for using OpenMP with Armadillo & Eigen using the tapering idea
> in spatial statistics:
> https://github.com/coatless/pims_bigdata
> Sincerely,
> *From:* rcpp-devel-bounces at lists.r-forge.r-project.org [mailto:
> rcpp-devel-bounces at lists.r-forge.r-project.org] *On Behalf Of *Saurabh B
> *Sent:* Tuesday, May 26, 2015 4:53 PM
> *To:* rcpp-devel at lists.r-forge.r-project.org
> *Subject:* [Rcpp-devel] OpenMP and Parallel BLAS
> Hi there,
> I am using gradient descent to reduce a large matrix of users and items.
> For this I am trying to use all 40 available cores but unfortunately my
> performance is no better than when I was using just one. I am new to OpenMP
> and RcppArmadillo so pardon my ignorance.
> The main loop is -
> #pragma omp parallel for
> for (int u = 0; u < C.n_rows; u++) {
>   arma::mat Cu = diagmat(C.row(u));
>   arma::mat YTCuIY = Y.t() * (Cu) * Y;
>   arma::mat YTCupu = Y.t() * (Cu + fact_eye) * P.row(u).t();
>   arma::mat WuT = YTY + YTCuIY + lambda_eye;
>   arma::mat xu = solve(WuT, YTCupu);
>   // Update gradient -- maybe a slow operation in parallel?
>   X.row(u) = xu.t();
> }
> full code -
> https://github.com/sanealytics/recommenderlabrats/blob/master/src/implicit.cpp
> (implementing this paper -
> http://www.researchgate.net/profile/Yifan_Hu/publication/220765111_Collaborative_Filtering_for_Implicit_Feedback_Datasets/links/0912f509c579ddd954000000.pdf
> )
> Matrices C, Y and P are large. Matrix X can be assumed to be small.
> I have the following questions -
> 1) I have replaced my BLAS with OpenMP BLAS and am also using the "#pragma
> omp parallel for" clause. Will they step over each other or are they
> complementary? I ask because my understanding is that the for loop will
> split each user across threads, then the BLAS will redistribute the
> matrices to multiply across all threads again. Is that right? And if so, is
> that what we want to do?
> 2) Since the threads are running in parallel and I just need the resulting
> value as output, I would ideally like a reduce() that gives each row in
> sequence and I can construct the new X from it. I am not sure how to go
> about doing that with Rcpp. I also want to avoid copying data as much as
> possible.
> Looking forward to your input,
> Saurabh