[Rcpp-devel] Is Rcpp:: not a good choice for iterators?

FHRB Toledo fernandohtoledo at gmail.com
Tue Aug 11 17:13:07 CEST 2015


Dear All,

It is my first post here at Rcpp-devel List. I'm passionate with the
awesome work and with all perspectives that it can opens for speed up codes
and embedding of C++ in R package seamless. Thus, I wish to thank all
developers.

After, some reading over the Book, online vignettes and so on. I translated
one of my R scripts using two approaches: taking all advantages of Rcpp;
and using just the "seamless" integration, i.e., casting SEXP objects,
working only with std:: objects and wrapping the return.

So far so good, however, by my surprise I found huge difference regarding
the performance between both implementations. If it's not enough, even the
raw R code shown better performance compared to the "full Rcpp".

I made a short example to show you the point. First of all let me explain
the process for who do not have any idea about meiosis: We take a
hypothetical genome (that gives us the SNPs positions, vector Map); then
based on this length we draw a Poisson value (where length is lambda), with
this value (e.g., n) we draw n uniform within the genome length. With this,
after some manipulations (findInterval % 2) we know where Crossing Over
(recombination) happened. Then we can take this prototype and generate a
gamete of some entry genotype (DNA tapes A and B).

The codes are:

Raw R script:

## Meiosis process through Poisson/Uniform
MeiosisR <- function( Map, TapeA, TapeB )
    {
        ## Number of Crossing Overs **Poisson Draw**
        nCrossOver <- rpois(n = 1, lambda = ceiling(max(Map)))
        ## Assigning Crossing Overs on the Genome **Uniform Draw**
        posCrossOver <- sort(runif(n = nCrossOver, min = 0, max = max(Map)))
        ## Building the "Prototype" **mixture between the 2 haplotypes**
        Prototype <- findInterval(x = Map, vec = posCrossOver) %% 2
        ## Raffle for the reference haplotype
        if ( as.logical( rbinom(n = 1, size = 1, prob = .5)  ) )
            {
                ## Building the gamete
                Gamete <- ifelse(test = as.logical(Prototype),
                                 yes = TapeA,
                                 no = TapeB)
            } else {
                ## idem with other "guide"
                Gamete <- ifelse(test = as.logical(Prototype),
                                 yes = TapeB,
                                 no = TapeA)
            }
        return( Gamete )
    }

STL cpp version:

// [[Rcpp::plugins(cpp11)]]
#include <algorithm>
#include <Rcpp.h>

using namespace Rcpp;

// [[Rcpp::export]]
std::vector<int> MeiosisSTL( std::vector<double> Map,
                             std::vector<int> TapeA,
                             std::vector<int> TapeB ) {

  // Obtaining Chromossome's parameter: length, number & positions of
SNPs
  double L = *std::max_element(Map.begin(), Map.end());
  int Size = Map.size();

  // RNG process: draw the number & positions of Crossing Overs
  double CrossingNumber = as<double>( rpois(1, L) );
  std::vector<double> CrossingPositions = as<std::vector<double> >( runif(
CrossingNumber, 0.0, L ) );
  std::sort(CrossingPositions.begin(), CrossingPositions.end());

  // Building a Gamete "Prototype"
  std::vector<bool> Prototype( Size );

  // "findInterval" adapted from Wickham's Rcpp Chapter (
http://adv-r.had.co.nz/Rcpp.html)
  std::vector<double>::iterator Iter, Positions;
  std::vector<bool>::iterator ProtoIter;
  std::vector<double>::iterator MapBegin = Map.begin();
  std::vector<double>::iterator MapEnd = Map.end();
  std::vector<bool>::iterator ProtoBegin = Prototype.begin();
  std::vector<double>::iterator CrossPosBegin = CrossingPositions.begin();
  std::vector<double>::iterator CrossPosEnd = CrossingPositions.end();

  for(Iter = MapBegin, ProtoIter = ProtoBegin; Iter != MapEnd; ++Iter,
++ProtoIter) {
    Positions = std::upper_bound(CrossPosBegin, CrossPosEnd, *Iter);
    *ProtoIter = std::distance(CrossPosBegin, Positions) % 2; // odd or even
  }

  std::vector<int> Gamete;
  Gamete.reserve( Size );
  // raffle: which tape shall be the "guide"?
  bool v01 = as<bool>( rbinom(1, 1, .5) );

  if ( v01 ) {
    for ( int i = 0; i < Size; i++ ) {
      // Build the Gamete from the Prototype
      Gamete.push_back( ((Prototype.at( i )) ? TapeA.at( i ) : TapeB.at( i
)) );
    }
  } else {
    for ( int i = 0; i < Size; i++ ) {
      Gamete.push_back( ((Prototype.at( i )) ? TapeB.at( i ) : TapeA.at( i
)) );
    }
  }

  return Gamete ;
}

Full Rcpp version:

// [[Rcpp::plugins(cpp11)]]
#include <algorithm>
#include <Rcpp.h>

using namespace Rcpp;

// [[Rcpp::export]]
IntegerVector MeiosisRcpp( NumericVector Map,
                           IntegerVector TapeA,
                           IntegerVector TapeB ) {

  double L = max( Map ); // sugar :)
  int Size = Map.size();

  double CrossingNumber = as<double>( rpois(1, L) );
  NumericVector CrossingPositions = runif( CrossingNumber, 0.0, L );
  std::sort(CrossingPositions.begin(), CrossingPositions.end());

  LogicalVector Prototype( Size );

  NumericVector::iterator Iter, Positions;
  LogicalVector::iterator ProtoIter;
  NumericVector::iterator MapBegin = Map.begin();
  NumericVector::iterator MapEnd = Map.end();
  LogicalVector::iterator ProtoBegin = Prototype.begin();
  NumericVector::iterator CrossPosBegin = CrossingPositions.begin();
  NumericVector::iterator CrossPosEnd = CrossingPositions.end();

  for(Iter = MapBegin, ProtoIter = ProtoBegin; Iter != MapEnd; ++Iter,
++ProtoIter) {
    Positions = std::upper_bound(CrossPosBegin, CrossPosEnd, *Iter);
    *ProtoIter = std::distance(CrossPosBegin, Positions) % 2; // odd or even
  }

  IntegerVector Gamete;
  bool v01 = as<bool>( rbinom(1, 1, .5) );

  if ( v01 ) {
    for ( int i = 0; i < Size; i++ ) {
      Gamete.push_back( ((Prototype[ i ]) ? TapeA[ i ] : TapeB[ i ]) );

    }
  } else {
    for ( int i = 0; i < Size; i++ ) {
      Gamete.push_back( ((Prototype[ i ]) ? TapeB[ i ] : TapeA[ i ]) );
    }
  }

  return Gamete ;
}

If we take the entry as:

## Genome Setup
N <- 1E4 # number of gametes
nM <- 400 # number of Markers (SNPs)
L <-  4 # genome length (in Morgans)

## building the genome & full heterozigous genotype (diploid, 2 DNA tapes)
## map <- seq(from = .1, to = L, length.out = nM) # SNPs positions
**regular**
map <- sort(runif(nM, 0, L)) # idem **random**
tapeA <- rep(1, nM)
tapeB <- rep(2, nM)

You will found results like that:

evaluated expressions:

RawR = replicate(n = N, expr = MeiosisR(Map = map, TapeA = tapeA, TapeB =
tapeB))
cppSTL = replicate(n = N, expr = MeiosisSTL(Map = map, TapeA = tapeA, TapeB
= tapeB))
Rcpp = replicate(n = N, expr = MeiosisRcpp(Map = map, TapeA = tapeA, TapeB
= tapeB))

- microbenchmark:

Unit: seconds
   expr       min        lq      mean   median        uq       max neval cld
   RawR 1.2169414 1.2655129 1.2978896 1.286561 1.3055376 1.5020700   100  b
 cppSTL 0.1721323 0.1773904 0.1867463 0.180910 0.1863087 0.2835556   100 a
   Rcpp 1.9190921 2.0101288 2.0633424 2.046132 2.1081319 2.2832565   100   c

- rbenchmark:

    test replications elapsed user.self sys.self
2 cppSTL          100  18.048    18.020    0.020
1   RawR          100 126.264   124.888    1.324
3   Rcpp          100 208.647   208.548    0.012

Just to report, my sessionInfo() is as follow:

sessionInfo() # Useful Infos **Reproducible Research**
R version 3.2.1 (2015-06-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 8 (jessie)

locale:
 [1] LC_CTYPE=en_US.utf8       LC_NUMERIC=C
 [3] LC_TIME=en_US.utf8        LC_COLLATE=en_US.utf8
 [5] LC_MONETARY=en_US.utf8    LC_MESSAGES=en_US.utf8
 [7] LC_PAPER=en_US.utf8       LC_NAME=C
 [9] LC_ADDRESS=C              LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] rbenchmark_1.0.0     microbenchmark_1.4-2 Rcpp_0.11.5

loaded via a namespace (and not attached):
 [1] MASS_7.3-40      colorspace_1.2-6 scales_0.2.4     compiler_3.2.1
 [5] plyr_1.8.2       tools_3.2.1      gtable_0.1.2     reshape2_1.4.1
 [9] ggplot2_1.0.1    grid_3.2.1       stringr_0.6.2    digest_0.6.8
[13] proto_0.3-10     munsell_0.4.2

After that, I wish to request help to evaluate those results. Is "full
Rcpp" not a good choice when using iterators? Let me know if I am not
taking care about something that may change the things. You may notice that
I left all parameters as fixed as possible, then, they should not influence
on that.

I am open to share the files or give further information.

Thank you very much indeed regarding any insight on that.

Cheers,
FH
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/rcpp-devel/attachments/20150811/d8a9f444/attachment-0001.html>


More information about the Rcpp-devel mailing list