[Rcpp-devel] [R] Find number of elements less than some number: Elegant/fastsolution needed

Douglas Bates bates at stat.wisc.edu
Sun Apr 17 20:00:39 CEST 2011


I enclose some comparisons, admittedly on only one size of example (10
million in the reference distribution and 100,000 in the test sample).

The versions using Rcpp and the std algorithms came out about 3 times
as fast as the version using R's findInterval.  What surprises me is
that the sequential comparisons in C++ (version k) is slightly faster
than the binary search versions (h and j).  The unassisted binary
search requires 30 probes for each element of ans and the sequential
comparisons show take an average of 100 comparisons.  I suppose the
difference is sequential access versus random access to the elements
of y.

On Fri, Apr 15, 2011 at 7:34 AM, Douglas Bates <bates at stat.wisc.edu> wrote:
> On Thu, Apr 14, 2011 at 11:47 PM, Christian Gunning <xian at unm.edu> wrote:
>> On Thu, Apr 14, 2011 at 7:02 PM,
>> <rcpp-devel-request at r-forge.wu-wien.ac.at> wrote:
>>
>>> I was able to write a very short C++ function using the Rcpp package
>>> that provided about a 1000-fold increase in speed relative to the best
>>> I could do in R.  I don't have the script on this computer so I will
>>> post it tomorrow when I am back on the computer at the office.
>>>
>>> Apologies for cross-posting to the Rcpp-devel list but I am doing so
>>> because this might make a good example of the usefulness of Rcpp and
>>> inline.
>>
>> And RcppArmadillo, as the case may be.
>>
>> This is a cool little problem.  In the examples given, I'd caution
>> people against comparing apples and durian.  The sort(x) is a cost
>> that should be considered *within* each implementation. I used
>> Armadillo to sort (src, f4), and get another 100% worth of speedup
>> that I can't reproducing using R's sort (src1, f1-f3).  If i modify
>> SEXP in-place (and this always confuses me, so I tend to avoid it),
>> I'm seeing an additional ~5-10% speed gain (src2, f5) -- the advantage
>> of this last seems to be primarily in memory-constrained applications.
>>
>> On to the code!
>>
>> src = '
>> NumericVector xx_(clone(x)), yy_(clone(y));
>> int nxx = xx_.size();
>> int nyy = yy_.size();
>> arma::vec xx(xx_), yy(yy_);
>> yy = sort(yy);
>> xx = sort(xx);
>> //
>> //
>> int j = 0; //gt index for yy
>> for (int i=0; i < nxx; i++) {
>>    while ((j < nyy) && ( xx(i) > yy(j) ) ) {
>>        j++;
>>    }
>>    xx_(i) = j;
>>   }
>> return (xx_);
>> '
>>
>> src1 = '
>> NumericVector xx_(clone(x)), yy_(clone(y));
>> // assumes x & y are already sorted
>> arma::vec xx(xx_), yy(yy_);
>> int nxx = xx.n_elem;
>> int nyy = yy.n_elem;
>> int j = 0; //gt index for yy
>> for (int i=0; i < nxx; i++) {
>>    while ((j < nyy) && ( xx(i) > yy(j) ) ) {
>>        j++;
>>    }
>>    xx_(i) = j;
>>  }
>> return (xx_);
>> '
>>
>> src2 = '
>> NumericVector xx_(x), yy_(y);  //kinda scary
>> int nxx = xx_.size();
>> int nyy = yy_.size();
>> arma::vec xx(xx_.begin(), nxx, false), yy(yy_.begin(), nyy, false);
>> //really kinda scary
>> yy = sort(yy);
>> xx = sort(xx);
>> //
>> //
>> int j = 0; //gt index for yy
>> for (int i=0; i < nxx; i++) {
>>    while ((j < nyy) && ( xx(i) > yy(j) ) ) {
>>        j++;
>>    }
>>    xx_(i) = j;
>> }
>> return (xx_);
>> '
>>
>> require(inline)
>> require(RcppArmadillo)
>> f1 <- function(x, y) { sort(length(y) - findInterval(-x, rev(-sort(y))));}
>> f2 <- function(x, y) {x = sort(x); length(y) - findInterval(-x, rev(-sort(y)))}
>> f3.a <- cxxfunction(signature(x="numeric", y="numeric"), src1,
>> plugin='RcppArmadillo')
>> f3 <- function(x,y) {
>>        x <- sort(x)
>>        y <- sort(y)
>>        return(f3.a(x,y))
>> }
>> f4 <- cxxfunction(signature(x="numeric", y="numeric"), src,
>> plugin='RcppArmadillo')
>> ##  danger -- violates R semantics
>> f5 <- cxxfunction(signature(x="numeric", y="numeric"), src2,
>> plugin='RcppArmadillo')
>>
>>
>> ## this is a really ugly test. ygwypf, i suppose :)
>>
>> for (i in 1:5) {
>>  x1 <- x <- rnorm(5e6)
>>  y1 <- y <- rnorm(5e6)
>>  print( cbind(
>>    r1=system.time(r1 <- f1(x,y)),
>>    r2=system.time(r2 <- f2(x,y)), r3=system.time(r3 <- f3(x1,y1)),
>>    r4 = system.time(r4 <- f4(x,y)), r5 = system.time(r5 <- f5(x,y))
>>  ))
>> }
>> print(all.equal(r1, r2))
>> print(all.equal(r1, r3))
>> print(all.equal(r1, r4))
>> print(all.equal(r1, r5))
>
> I agree that the cost of sorting should be taken into account but I
> don't think you need to go to the RcppArmadillo package to get a sort
> function.  Why not use std::sort?
>
> Also, I did sequential comparisons as shown in your code but after
> reading Bill Dunlap's response and looking at the documentation for
> the findInterval  function in R I smacked myself on the forehead and
> thought "Duh - binary search, of course".
>
> I haven't looked at the C code underlying the findInterval function
> yet so I don't know if Martin has clever tricks for sorted x and y.
> However the documentation for the std::upper_bound template at
> cplusplus.com shows how to use that for the case here.  The best I can
> think of for sorted x and y is to pass the upper bound from x[i] as
> the first argument in the call to std::upper_bound for x[i+1].
>
> Unfortunately I am staring at a series of deadlines today so
> implementations and comparisons may need to wait until tomorrow.
>
> P.S. to Christian:  Check the archives for several of Dirk's posts to
> the rcpp-devel list where he has used the rbenchmark package to
> produce clean output from comparisons of  implementations of
> algorithms.
>
-------------- next part --------------

R version 2.13.0 (2011-04-13)
Copyright (C) 2011 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> require("inline")
Loading required package: inline
> require("Rcpp")
Loading required package: Rcpp
> require("rbenchmark")
Loading required package: rbenchmark
> 
> ## I'm changing the rules a bit here because Sunduz wanted the
> ## counts in the original order of x
> 
> f <- function(x, y) length(y) - findInterval(-x, rev(-sort(y)))
> 
> ## findInterval can take advantage of x being ordered
> 
> g <- function(x, y) {
+     ord <- order(x)
+     ans <- integer(length(x))
+     ans[ord] <- length(y) - findInterval(-x[ord], rev(-sort(y)))
+     ans
+ }
> 
> ## version using std::upper_bound for unordered x
> h <- cxxfunction(signature(x_="numeric", y_="numeric"), '
+ {
+     Rcpp::NumericVector x(x_),
+         y = clone(Rcpp::NumericVector(y_));
+     int n = x.size();
+     Rcpp::IntegerVector ans(n);
+     const Rcpp::NumericVector::iterator bb = y.begin(), ee = y.end();
+                // sort (a copy of) the y values for binary search
+     std::sort(bb, ee);
+     
+     for (int i = 0; i < n; i++) {
+ 	ans[i] = std::upper_bound(bb, ee, x[i]) - bb;
+     }
+     return ans;
+ }
+ ', plugin="Rcpp")
> 
> ## version using std::upper_bound when x is non-decreasing
> j1 <- cxxfunction(signature(x_="numeric", y_="numeric"), '
+ {
+     Rcpp::NumericVector x(x_),
+         y = clone(Rcpp::NumericVector(y_));
+     int n = x.size();
+     Rcpp::IntegerVector ans(n);
+     const Rcpp::NumericVector::iterator bb = y.begin(), ee = y.end();
+     Rcpp::NumericVector::iterator mm = y.begin();
+                // sort (a copy of) the y values for binary search
+     std::sort(bb, ee);
+     
+     for (int i = 0; i < n; i++) {
+         mm = std::upper_bound(mm, ee, x[i]);
+         ans[i] = mm - bb;
+     }
+     return ans;
+ }
+ ', plugin="Rcpp")
> 
> j <- function(x, y) {
+     ord <- order(x)
+     ans <- integer(length(x))
+     ans[ord] <- j1(x[ord], y)
+     ans
+ }
> 
> ## version using sequential search
> k1 <- cxxfunction(signature(xs_="numeric", y_="numeric", ord_="integer"), '
+ {
+     Rcpp::NumericVector xs(xs_),
+         y = clone(Rcpp::NumericVector(y_));
+     int n = xs.size();
+     Rcpp::IntegerVector ord(ord_), ans(n);
+     const Rcpp::NumericVector::iterator bb = y.begin(), ee = y.end();
+     Rcpp::NumericVector::iterator yy = y.begin();
+     double *xp = xs.begin();
+                // sort (a copy of) the y values for binary search
+     std::sort(bb, ee);
+     
+     for (int i = 0; i < n && yy < ee; i++, xp++) {
+ 	while (*xp >= *yy && yy < ee) yy++;
+ 	ans[ord[i] - 1] = yy - bb;
+     }
+     return ans;
+ }
+ ', plugin="Rcpp")
> 
> k <- function(x, y) {
+     ord <- order(x)
+     k1(x[ord], y, ord)
+ }
>     
> x <- rnorm(1e5)
> yy <- y <- rnorm(1e7)
> yy[1] <- y[1]                           # ensure y and yy are distinct
> 
> aa <- f(x, y)
> all.equal(aa, g(x, y))
[1] TRUE
> all.equal(aa, h(x, y))
[1] TRUE
> all.equal(aa, j(x, y))
[1] TRUE
> all.equal(aa, k(x, y))
[1] TRUE
> all.equal(y, yy)                        # check that nothing has changed in y
[1] TRUE
> 
> benchmark(f(x,y), g(x,y), h(x,y), j(x,y), k(x,y),
+           order="relative", replications=10,
+           columns=c("test", "replications", "elapsed", "relative"))
     test replications elapsed relative
5 k(x, y)           10  19.477 1.000000
4 j(x, y)           10  19.984 1.026031
3 h(x, y)           10  20.107 1.032346
2 g(x, y)           10  57.691 2.962006
1 f(x, y)           10  60.563 3.109462
> 
> 
> proc.time()
   user  system elapsed 
213.060  15.880 233.067 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: bb.R
Type: application/octet-stream
Size: 2865 bytes
Desc: not available
URL: <http://lists.r-forge.r-project.org/pipermail/rcpp-devel/attachments/20110417/cc479b0f/attachment.obj>


More information about the Rcpp-devel mailing list