[Rcpp-devel] Fwd: Is Rcpp:: not a good choice for iterators?

FHRB Toledo fernandohtoledo at gmail.com
Tue Aug 11 20:47:09 CEST 2015


---------- Forwarded message ----------
From: FHRB Toledo <fernandohtoledo at gmail.com>
Date: 11 August 2015 at 11:42
Subject: Re: [Rcpp-devel] Is Rcpp:: not a good choice for iterators?
To: Hao Ye <hye at ucsd.edu>


Dear Hao Ye,

Thanks regarding your reply.

I have read something related to resize Rcpp::?Vectors on StackOverFLow (
http://stackoverflow.com/questions/13782943/how-to-resize-a-numericvector)!

Comparing the threads, your contribution seems a good point. Maybe I can
initialize the Gamete vector with no_int( Size ) and then perform the loop
over it filling with the mix between the tapes.

I will also try the Eddelbuette suggestion, making gamete as a std::vector
and then wrapping it on the return statement.

I will implement it  and shall report the updated benchmarks!

Cheers,
FH

On 11 August 2015 at 11:16, Hao Ye <hye at ucsd.edu> wrote:

> It also looks like you are not initializing the size of Gamete in the
> "full Rcpp" version, so it is resizing on every loop iteration...
>
> Best,
> --
> Hao Ye
> hye at ucsd.edu
>
> On Tue, Aug 11, 2015 at 12:06 PM, Tim Keitt <tkeitt at utexas.edu> wrote:
>
>> 1) Figure out where you are making full copies of your data structures
>> (casting NumericVector to std::vector<double>)
>> 2) Use eg R::rbinom instead of rbinom if you do not want the return as an
>> Rcpp Vector type.
>> 3) Move the outer loop (your replicate call) into C++
>>
>> The copying of data is what is slowing things down most likely.
>>
>> T.
>>
>> http://www.keittlab.org/
>>
>> On Tue, Aug 11, 2015 at 10:13 AM, FHRB Toledo <fernandohtoledo at gmail.com>
>> wrote:
>>
>>> Dear All,
>>>
>>> It is my first post here at Rcpp-devel List. I'm passionate with the
>>> awesome work and with all perspectives that it can opens for speed up codes
>>> and embedding of C++ in R package seamless. Thus, I wish to thank all
>>> developers.
>>>
>>> After, some reading over the Book, online vignettes and so on. I
>>> translated one of my R scripts using two approaches: taking all advantages
>>> of Rcpp; and using just the "seamless" integration, i.e., casting SEXP
>>> objects, working only with std:: objects and wrapping the return.
>>>
>>> So far so good, however, by my surprise I found huge difference
>>> regarding the performance between both implementations. If it's not enough,
>>> even the raw R code shown better performance compared to the "full Rcpp".
>>>
>>> I made a short example to show you the point. First of all let me
>>> explain the process for who do not have any idea about meiosis: We take a
>>> hypothetical genome (that gives us the SNPs positions, vector Map); then
>>> based on this length we draw a Poisson value (where length is lambda), with
>>> this value (e.g., n) we draw n uniform within the genome length. With this,
>>> after some manipulations (findInterval % 2) we know where Crossing Over
>>> (recombination) happened. Then we can take this prototype and generate a
>>> gamete of some entry genotype (DNA tapes A and B).
>>>
>>> The codes are:
>>>
>>> Raw R script:
>>>
>>> ## Meiosis process through Poisson/Uniform
>>> MeiosisR <- function( Map, TapeA, TapeB )
>>>     {
>>>         ## Number of Crossing Overs **Poisson Draw**
>>>         nCrossOver <- rpois(n = 1, lambda = ceiling(max(Map)))
>>>         ## Assigning Crossing Overs on the Genome **Uniform Draw**
>>>         posCrossOver <- sort(runif(n = nCrossOver, min = 0, max =
>>> max(Map)))
>>>         ## Building the "Prototype" **mixture between the 2 haplotypes**
>>>         Prototype <- findInterval(x = Map, vec = posCrossOver) %% 2
>>>         ## Raffle for the reference haplotype
>>>         if ( as.logical( rbinom(n = 1, size = 1, prob = .5)  ) )
>>>             {
>>>                 ## Building the gamete
>>>                 Gamete <- ifelse(test = as.logical(Prototype),
>>>                                  yes = TapeA,
>>>                                  no = TapeB)
>>>             } else {
>>>                 ## idem with other "guide"
>>>                 Gamete <- ifelse(test = as.logical(Prototype),
>>>                                  yes = TapeB,
>>>                                  no = TapeA)
>>>             }
>>>         return( Gamete )
>>>     }
>>>
>>> STL cpp version:
>>>
>>> // [[Rcpp::plugins(cpp11)]]
>>> #include <algorithm>
>>> #include <Rcpp.h>
>>>
>>> using namespace Rcpp;
>>>
>>> // [[Rcpp::export]]
>>> std::vector<int> MeiosisSTL( std::vector<double> Map,
>>>                              std::vector<int> TapeA,
>>>                              std::vector<int> TapeB ) {
>>>
>>>   // Obtaining Chromossome's parameter: length, number & positions of
>>> SNPs
>>>   double L = *std::max_element(Map.begin(), Map.end());
>>>   int Size = Map.size();
>>>
>>>   // RNG process: draw the number & positions of Crossing Overs
>>>   double CrossingNumber = as<double>( rpois(1, L) );
>>>   std::vector<double> CrossingPositions = as<std::vector<double> >(
>>> runif( CrossingNumber, 0.0, L ) );
>>>   std::sort(CrossingPositions.begin(), CrossingPositions.end());
>>>
>>>   // Building a Gamete "Prototype"
>>>   std::vector<bool> Prototype( Size );
>>>
>>>   // "findInterval" adapted from Wickham's Rcpp Chapter (
>>> http://adv-r.had.co.nz/Rcpp.html)
>>>   std::vector<double>::iterator Iter, Positions;
>>>   std::vector<bool>::iterator ProtoIter;
>>>   std::vector<double>::iterator MapBegin = Map.begin();
>>>   std::vector<double>::iterator MapEnd = Map.end();
>>>   std::vector<bool>::iterator ProtoBegin = Prototype.begin();
>>>   std::vector<double>::iterator CrossPosBegin =
>>> CrossingPositions.begin();
>>>   std::vector<double>::iterator CrossPosEnd = CrossingPositions.end();
>>>
>>>   for(Iter = MapBegin, ProtoIter = ProtoBegin; Iter != MapEnd; ++Iter,
>>> ++ProtoIter) {
>>>     Positions = std::upper_bound(CrossPosBegin, CrossPosEnd, *Iter);
>>>     *ProtoIter = std::distance(CrossPosBegin, Positions) % 2; // odd or
>>> even
>>>   }
>>>
>>>   std::vector<int> Gamete;
>>>   Gamete.reserve( Size );
>>>   // raffle: which tape shall be the "guide"?
>>>   bool v01 = as<bool>( rbinom(1, 1, .5) );
>>>
>>>   if ( v01 ) {
>>>     for ( int i = 0; i < Size; i++ ) {
>>>       // Build the Gamete from the Prototype
>>>       Gamete.push_back( ((Prototype.at( i )) ? TapeA.at( i ) : TapeB.at(
>>> i )) );
>>>     }
>>>   } else {
>>>     for ( int i = 0; i < Size; i++ ) {
>>>       Gamete.push_back( ((Prototype.at( i )) ? TapeB.at( i ) : TapeA.at(
>>> i )) );
>>>     }
>>>   }
>>>
>>>   return Gamete ;
>>> }
>>>
>>> Full Rcpp version:
>>>
>>> // [[Rcpp::plugins(cpp11)]]
>>> #include <algorithm>
>>> #include <Rcpp.h>
>>>
>>> using namespace Rcpp;
>>>
>>> // [[Rcpp::export]]
>>> IntegerVector MeiosisRcpp( NumericVector Map,
>>>                            IntegerVector TapeA,
>>>                            IntegerVector TapeB ) {
>>>
>>>   double L = max( Map ); // sugar :)
>>>   int Size = Map.size();
>>>
>>>   double CrossingNumber = as<double>( rpois(1, L) );
>>>   NumericVector CrossingPositions = runif( CrossingNumber, 0.0, L );
>>>   std::sort(CrossingPositions.begin(), CrossingPositions.end());
>>>
>>>   LogicalVector Prototype( Size );
>>>
>>>   NumericVector::iterator Iter, Positions;
>>>   LogicalVector::iterator ProtoIter;
>>>   NumericVector::iterator MapBegin = Map.begin();
>>>   NumericVector::iterator MapEnd = Map.end();
>>>   LogicalVector::iterator ProtoBegin = Prototype.begin();
>>>   NumericVector::iterator CrossPosBegin = CrossingPositions.begin();
>>>   NumericVector::iterator CrossPosEnd = CrossingPositions.end();
>>>
>>>   for(Iter = MapBegin, ProtoIter = ProtoBegin; Iter != MapEnd; ++Iter,
>>> ++ProtoIter) {
>>>     Positions = std::upper_bound(CrossPosBegin, CrossPosEnd, *Iter);
>>>     *ProtoIter = std::distance(CrossPosBegin, Positions) % 2; // odd or
>>> even
>>>   }
>>>
>>>   IntegerVector Gamete;
>>>   bool v01 = as<bool>( rbinom(1, 1, .5) );
>>>
>>>   if ( v01 ) {
>>>     for ( int i = 0; i < Size; i++ ) {
>>>       Gamete.push_back( ((Prototype[ i ]) ? TapeA[ i ] : TapeB[ i ]) );
>>>
>>>     }
>>>   } else {
>>>     for ( int i = 0; i < Size; i++ ) {
>>>       Gamete.push_back( ((Prototype[ i ]) ? TapeB[ i ] : TapeA[ i ]) );
>>>     }
>>>   }
>>>
>>>   return Gamete ;
>>> }
>>>
>>> If we take the entry as:
>>>
>>> ## Genome Setup
>>> N <- 1E4 # number of gametes
>>> nM <- 400 # number of Markers (SNPs)
>>> L <-  4 # genome length (in Morgans)
>>>
>>> ## building the genome & full heterozigous genotype (diploid, 2 DNA
>>> tapes)
>>> ## map <- seq(from = .1, to = L, length.out = nM) # SNPs positions
>>> **regular**
>>> map <- sort(runif(nM, 0, L)) # idem **random**
>>> tapeA <- rep(1, nM)
>>> tapeB <- rep(2, nM)
>>>
>>> You will found results like that:
>>>
>>> evaluated expressions:
>>>
>>> RawR = replicate(n = N, expr = MeiosisR(Map = map, TapeA = tapeA, TapeB
>>> = tapeB))
>>> cppSTL = replicate(n = N, expr = MeiosisSTL(Map = map, TapeA = tapeA,
>>> TapeB = tapeB))
>>> Rcpp = replicate(n = N, expr = MeiosisRcpp(Map = map, TapeA = tapeA,
>>> TapeB = tapeB))
>>>
>>> - microbenchmark:
>>>
>>> Unit: seconds
>>>    expr       min        lq      mean   median        uq       max neval
>>> cld
>>>    RawR 1.2169414 1.2655129 1.2978896 1.286561 1.3055376 1.5020700
>>> 100  b
>>>  cppSTL 0.1721323 0.1773904 0.1867463 0.180910 0.1863087 0.2835556   100
>>> a
>>>    Rcpp 1.9190921 2.0101288 2.0633424 2.046132 2.1081319 2.2832565
>>> 100   c
>>>
>>> - rbenchmark:
>>>
>>>     test replications elapsed user.self sys.self
>>> 2 cppSTL          100  18.048    18.020    0.020
>>> 1   RawR          100 126.264   124.888    1.324
>>> 3   Rcpp          100 208.647   208.548    0.012
>>>
>>> Just to report, my sessionInfo() is as follow:
>>>
>>> sessionInfo() # Useful Infos **Reproducible Research**
>>> R version 3.2.1 (2015-06-18)
>>> Platform: x86_64-pc-linux-gnu (64-bit)
>>> Running under: Debian GNU/Linux 8 (jessie)
>>>
>>> locale:
>>>  [1] LC_CTYPE=en_US.utf8       LC_NUMERIC=C
>>>  [3] LC_TIME=en_US.utf8        LC_COLLATE=en_US.utf8
>>>  [5] LC_MONETARY=en_US.utf8    LC_MESSAGES=en_US.utf8
>>>  [7] LC_PAPER=en_US.utf8       LC_NAME=C
>>>  [9] LC_ADDRESS=C              LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>
>>> other attached packages:
>>> [1] rbenchmark_1.0.0     microbenchmark_1.4-2 Rcpp_0.11.5
>>>
>>> loaded via a namespace (and not attached):
>>>  [1] MASS_7.3-40      colorspace_1.2-6 scales_0.2.4     compiler_3.2.1
>>>  [5] plyr_1.8.2       tools_3.2.1      gtable_0.1.2     reshape2_1.4.1
>>>  [9] ggplot2_1.0.1    grid_3.2.1       stringr_0.6.2    digest_0.6.8
>>> [13] proto_0.3-10     munsell_0.4.2
>>>
>>> After that, I wish to request help to evaluate those results. Is "full
>>> Rcpp" not a good choice when using iterators? Let me know if I am not
>>> taking care about something that may change the things. You may notice that
>>> I left all parameters as fixed as possible, then, they should not influence
>>> on that.
>>>
>>> I am open to share the files or give further information.
>>>
>>> Thank you very much indeed regarding any insight on that.
>>>
>>> Cheers,
>>> FH
>>>
>>> _______________________________________________
>>> Rcpp-devel mailing list
>>> Rcpp-devel at lists.r-forge.r-project.org
>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/rcpp-devel
>>>
>>
>>
>> _______________________________________________
>> Rcpp-devel mailing list
>> Rcpp-devel at lists.r-forge.r-project.org
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/rcpp-devel
>>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/rcpp-devel/attachments/20150811/0980877e/attachment-0001.html>


More information about the Rcpp-devel mailing list