[Rcpp-devel] Is Rcpp:: not a good choice for iterators?
FHRB Toledo
fernandohtoledo at gmail.com
Tue Aug 11 17:13:07 CEST 2015
Dear All,
It is my first post here at Rcpp-devel List. I'm passionate with the
awesome work and with all perspectives that it can opens for speed up codes
and embedding of C++ in R package seamless. Thus, I wish to thank all
developers.
After, some reading over the Book, online vignettes and so on. I translated
one of my R scripts using two approaches: taking all advantages of Rcpp;
and using just the "seamless" integration, i.e., casting SEXP objects,
working only with std:: objects and wrapping the return.
So far so good, however, by my surprise I found huge difference regarding
the performance between both implementations. If it's not enough, even the
raw R code shown better performance compared to the "full Rcpp".
I made a short example to show you the point. First of all let me explain
the process for who do not have any idea about meiosis: We take a
hypothetical genome (that gives us the SNPs positions, vector Map); then
based on this length we draw a Poisson value (where length is lambda), with
this value (e.g., n) we draw n uniform within the genome length. With this,
after some manipulations (findInterval % 2) we know where Crossing Over
(recombination) happened. Then we can take this prototype and generate a
gamete of some entry genotype (DNA tapes A and B).
The codes are:
Raw R script:
## Meiosis process through Poisson/Uniform
MeiosisR <- function( Map, TapeA, TapeB )
{
## Number of Crossing Overs **Poisson Draw**
nCrossOver <- rpois(n = 1, lambda = ceiling(max(Map)))
## Assigning Crossing Overs on the Genome **Uniform Draw**
posCrossOver <- sort(runif(n = nCrossOver, min = 0, max = max(Map)))
## Building the "Prototype" **mixture between the 2 haplotypes**
Prototype <- findInterval(x = Map, vec = posCrossOver) %% 2
## Raffle for the reference haplotype
if ( as.logical( rbinom(n = 1, size = 1, prob = .5) ) )
{
## Building the gamete
Gamete <- ifelse(test = as.logical(Prototype),
yes = TapeA,
no = TapeB)
} else {
## idem with other "guide"
Gamete <- ifelse(test = as.logical(Prototype),
yes = TapeB,
no = TapeA)
}
return( Gamete )
}
STL cpp version:
// [[Rcpp::plugins(cpp11)]]
#include <algorithm>
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
std::vector<int> MeiosisSTL( std::vector<double> Map,
std::vector<int> TapeA,
std::vector<int> TapeB ) {
// Obtaining Chromossome's parameter: length, number & positions of
SNPs
double L = *std::max_element(Map.begin(), Map.end());
int Size = Map.size();
// RNG process: draw the number & positions of Crossing Overs
double CrossingNumber = as<double>( rpois(1, L) );
std::vector<double> CrossingPositions = as<std::vector<double> >( runif(
CrossingNumber, 0.0, L ) );
std::sort(CrossingPositions.begin(), CrossingPositions.end());
// Building a Gamete "Prototype"
std::vector<bool> Prototype( Size );
// "findInterval" adapted from Wickham's Rcpp Chapter (
http://adv-r.had.co.nz/Rcpp.html)
std::vector<double>::iterator Iter, Positions;
std::vector<bool>::iterator ProtoIter;
std::vector<double>::iterator MapBegin = Map.begin();
std::vector<double>::iterator MapEnd = Map.end();
std::vector<bool>::iterator ProtoBegin = Prototype.begin();
std::vector<double>::iterator CrossPosBegin = CrossingPositions.begin();
std::vector<double>::iterator CrossPosEnd = CrossingPositions.end();
for(Iter = MapBegin, ProtoIter = ProtoBegin; Iter != MapEnd; ++Iter,
++ProtoIter) {
Positions = std::upper_bound(CrossPosBegin, CrossPosEnd, *Iter);
*ProtoIter = std::distance(CrossPosBegin, Positions) % 2; // odd or even
}
std::vector<int> Gamete;
Gamete.reserve( Size );
// raffle: which tape shall be the "guide"?
bool v01 = as<bool>( rbinom(1, 1, .5) );
if ( v01 ) {
for ( int i = 0; i < Size; i++ ) {
// Build the Gamete from the Prototype
Gamete.push_back( ((Prototype.at( i )) ? TapeA.at( i ) : TapeB.at( i
)) );
}
} else {
for ( int i = 0; i < Size; i++ ) {
Gamete.push_back( ((Prototype.at( i )) ? TapeB.at( i ) : TapeA.at( i
)) );
}
}
return Gamete ;
}
Full Rcpp version:
// [[Rcpp::plugins(cpp11)]]
#include <algorithm>
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector MeiosisRcpp( NumericVector Map,
IntegerVector TapeA,
IntegerVector TapeB ) {
double L = max( Map ); // sugar :)
int Size = Map.size();
double CrossingNumber = as<double>( rpois(1, L) );
NumericVector CrossingPositions = runif( CrossingNumber, 0.0, L );
std::sort(CrossingPositions.begin(), CrossingPositions.end());
LogicalVector Prototype( Size );
NumericVector::iterator Iter, Positions;
LogicalVector::iterator ProtoIter;
NumericVector::iterator MapBegin = Map.begin();
NumericVector::iterator MapEnd = Map.end();
LogicalVector::iterator ProtoBegin = Prototype.begin();
NumericVector::iterator CrossPosBegin = CrossingPositions.begin();
NumericVector::iterator CrossPosEnd = CrossingPositions.end();
for(Iter = MapBegin, ProtoIter = ProtoBegin; Iter != MapEnd; ++Iter,
++ProtoIter) {
Positions = std::upper_bound(CrossPosBegin, CrossPosEnd, *Iter);
*ProtoIter = std::distance(CrossPosBegin, Positions) % 2; // odd or even
}
IntegerVector Gamete;
bool v01 = as<bool>( rbinom(1, 1, .5) );
if ( v01 ) {
for ( int i = 0; i < Size; i++ ) {
Gamete.push_back( ((Prototype[ i ]) ? TapeA[ i ] : TapeB[ i ]) );
}
} else {
for ( int i = 0; i < Size; i++ ) {
Gamete.push_back( ((Prototype[ i ]) ? TapeB[ i ] : TapeA[ i ]) );
}
}
return Gamete ;
}
If we take the entry as:
## Genome Setup
N <- 1E4 # number of gametes
nM <- 400 # number of Markers (SNPs)
L <- 4 # genome length (in Morgans)
## building the genome & full heterozigous genotype (diploid, 2 DNA tapes)
## map <- seq(from = .1, to = L, length.out = nM) # SNPs positions
**regular**
map <- sort(runif(nM, 0, L)) # idem **random**
tapeA <- rep(1, nM)
tapeB <- rep(2, nM)
You will found results like that:
evaluated expressions:
RawR = replicate(n = N, expr = MeiosisR(Map = map, TapeA = tapeA, TapeB =
tapeB))
cppSTL = replicate(n = N, expr = MeiosisSTL(Map = map, TapeA = tapeA, TapeB
= tapeB))
Rcpp = replicate(n = N, expr = MeiosisRcpp(Map = map, TapeA = tapeA, TapeB
= tapeB))
- microbenchmark:
Unit: seconds
expr min lq mean median uq max neval cld
RawR 1.2169414 1.2655129 1.2978896 1.286561 1.3055376 1.5020700 100 b
cppSTL 0.1721323 0.1773904 0.1867463 0.180910 0.1863087 0.2835556 100 a
Rcpp 1.9190921 2.0101288 2.0633424 2.046132 2.1081319 2.2832565 100 c
- rbenchmark:
test replications elapsed user.self sys.self
2 cppSTL 100 18.048 18.020 0.020
1 RawR 100 126.264 124.888 1.324
3 Rcpp 100 208.647 208.548 0.012
Just to report, my sessionInfo() is as follow:
sessionInfo() # Useful Infos **Reproducible Research**
R version 3.2.1 (2015-06-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 8 (jessie)
locale:
[1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C
[3] LC_TIME=en_US.utf8 LC_COLLATE=en_US.utf8
[5] LC_MONETARY=en_US.utf8 LC_MESSAGES=en_US.utf8
[7] LC_PAPER=en_US.utf8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] rbenchmark_1.0.0 microbenchmark_1.4-2 Rcpp_0.11.5
loaded via a namespace (and not attached):
[1] MASS_7.3-40 colorspace_1.2-6 scales_0.2.4 compiler_3.2.1
[5] plyr_1.8.2 tools_3.2.1 gtable_0.1.2 reshape2_1.4.1
[9] ggplot2_1.0.1 grid_3.2.1 stringr_0.6.2 digest_0.6.8
[13] proto_0.3-10 munsell_0.4.2
After that, I wish to request help to evaluate those results. Is "full
Rcpp" not a good choice when using iterators? Let me know if I am not
taking care about something that may change the things. You may notice that
I left all parameters as fixed as possible, then, they should not influence
on that.
I am open to share the files or give further information.
Thank you very much indeed regarding any insight on that.
Cheers,
FH
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/rcpp-devel/attachments/20150811/d8a9f444/attachment-0001.html>
More information about the Rcpp-devel
mailing list