[Rcpp-devel] Apply Defined Rcpp Functions In Parallel

Dillon Hammill Dillon.Hammill at anu.edu.au
Tue Aug 17 10:58:48 CEST 2021


Hi everyone,

Apologies if this is the wrong place to ask for help with Rcpp/RcppParallel.

I have recently began using Rcpp and RcppParallel to speed up an algorithm that I have written in R ( I still have a lot to learn!). I have had no issues converting my R code into Rcpp code, but I am struggling to get things working in parallel using RcppParallel. I was hoping that someone may be able help improve my understanding of how to translate Rcpp code into code that works with RcppParallel.

To make things easier to understand, I have created a simplified example that will hopefully illustrate what I hope to achieve.

Firstly, we compute a distance matrix based on xy co-ordinates using the dist() function in R:
x <- matrix(1:100 + 0.1, ncol = 2, dimnames = list(NULL, c("x","y")))
x_dist <- dist(x, method = "euclidean")

Secondly, we apply a custom algorithm that uses the raw data and distance matrix to compute labels for each point:
algo <- function(x,
                             dist = NULL) {

  # algorithm code to compute a label for each row in x
  # code is complex and requires both x and distance matrix
  # but simply assigns an integer to each row

  labels <- 1:nrow(x)
  return(labels)

}
x_labels <- algo(x, dist = x_dist)

Now let's say we wanted run the algorithm for all distance methods and store the labels (single thread R code) in a matrix. This is the code I would like to run in parallel as it can take some time to compute the labels using the alogrithm.
dist_methods <- c("euclidean", "maximum", "manhattan", "canberra", "binary", "minkowski")
x_labels <- lapply(dist_methods, function(dist_method) {
  x_dist <- dist(x, method = dist_method)
  return(algo(x, x_dist))
})
x_labels <- do.call("cbind", x_labels)

To speed up the algorithm we can convert these R functions into Rcpp equivalents. I have removed code to make it clearer and simply returned an example matrix/vector. I have also converted the dist_methods to an integer for simplicity (i.e. 1 - euclidean, 2 - maximum etc.).
#include <Rcpp.h>
using namespace Rcpp;

//[[Rcpp::export]]
NumericMatrix rcpp_dist(NumericMatrix x, int dist_method = 1) {

  // code to compute distance matrix depends on dist_method
  NumericMatrix x_dist(x.nrow(), x.nrow());
  return x_dist;

}

//[[Rcpp::export]]
IntegerVector rcpp_algo(NumericMatrix x, NumericMatrix dist) {

   // code to compute a label for each row in x using distance matrix
   IntegerVector labels(x.nrow());
   return labels;

}

Finally, we attempt to run the code on multiple threads for each distance method.
#include <RcppParallel.h>
using namespace RcppParallel;

struct RcppParAlgo : public Worker {

  // source thread safe
  RMatrix<double> x;
  RVector<int> dist_method;

  // output thread safe
  RMatrix <int> labels;

  // initialise
  RcppParAlgo(NumericMatrix x, int dist_method)
    : x(x), dist_method(dist_method), labels(labels) {}

  // compute distance matrix based on method and run algo to get labels
  void operator() (std::size_t begin, std::size_t end) {

    // call rcpp_dist on x with dist_method
    // I know this code is wrong but hopefully shows what I am trying to do (this is where I need some help)
    for(std::size_t i = begin; i < end; i++) {
      RVector<int> dist_type = dist_method[i];
      RMatrix<double> x_dist = rcpp_dist(x, dist_type);
      RVector<int> x_labels = rcpp_algo(x, x_dist);
      // push lables into output matrix
      for(int j = 0; j <= x_labels.size(); j++) {
        labels(j, i) = x_labels[j];
      }
    }

  }

};

//[[Rcpp::export]]
IntegerMatrix rcppParAlgo(NumericMatrix x) {

  // distance methods
  IntegerVector dist_methods = {1,2,3,4,5};

  // allocate output matrix to store labels
  IntegerMatrix labels(x.nrow(), dist_methods.size());

  // functor
  RcppParAlgo RcppParAlgo(x, dist_methods);

  // run in parallel
  parallelFor(0, dist_methods.length(), RcppParAlgo);

  return labels;

}

The main questions I would like to address are:

  1.  As the code is currently written, rcpp_dist and rcpp_algo do not accept RMatrix/RVector classes which causes problems when I try call these functions in parallel. Is there a way to write generic functions that will work in both cases? I still want to export the definitions of rcpp_dist() and rcpp_algo() so they can be called from R.
  2.  Is it safe to create new objects on each thread? For example, in my case I need to create the temporary distance matrix which is passed to rcpp_algo() along with x to compute the labels. I haven't seen any examples of this, are there constructors for RMatrix/RVector similar to NumericMatrix/NumericVector?

Any help or suggestions would be greatly appreciated. I have placed the Rcpp/RcppParallel code in package hosted on GitHub if you would like to fiddle with the code:
https://github.com/DillonHammill/RcppHelp
[https://opengraph.githubassets.com/c783348bc98fe97d064ef1da1b860542a1422388bfb389ba366bc8827df916cf/DillonHammill/RcppHelp]<https://github.com/DillonHammill/RcppHelp>
GitHub - DillonHammill/RcppHelp: Example RcppParallel Code<https://github.com/DillonHammill/RcppHelp>
Example RcppParallel Code. Contribute to DillonHammill/RcppHelp development by creating an account on GitHub.
github.com

Cheers,
Dillon
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/rcpp-devel/attachments/20210817/918f893b/attachment-0001.html>


More information about the Rcpp-devel mailing list