[Rcpp-commits] r3008 - papers/BatesEddelbuettel

Sat Apr 23 00:32:11 CEST 2011

Author: edd
Date: 2011-04-23 00:32:11 +0200 (Sat, 23 Apr 2011)
New Revision: 3008

Added:
   papers/BatesEddelbuettel/bb.r
   papers/BatesEddelbuettel/christian_20110414.txt
   papers/BatesEddelbuettel/doug_20110414.txt
   papers/BatesEddelbuettel/kevin_20110414.txt
Modified:
   papers/BatesEddelbuettel/UmmelKeles.Rnw
Log:
more emails about the question


Modified: papers/BatesEddelbuettel/UmmelKeles.Rnw
===================================================================

--- papers/BatesEddelbuettel/UmmelKeles.Rnw	2011-04-22 21:39:36 UTC (rev 3007)
+++ papers/BatesEddelbuettel/UmmelKeles.Rnw	2011-04-22 22:32:11 UTC (rev 3008)
@@ -19,7 +19,7 @@
 library(inline)
 library(Rcpp)
 library(rbenchmark)
-@ 
+@
 \begin{document}
 
 Recently a query by Kevin Ummel on the R-help mailing list prompted a
@@ -30,31 +30,31 @@
 
 There are various ways of doing this.  The original poster used
 <<f1>>=
-f1 <- function(x, y) 
+f1 <- function(x, y)
     sapply(x, function(i) length(which(y < i)))
-@ 
+@
 
 Richard Heiberger and Marc Swartz both suggested
 <<f2>>=
 f2 <- function(x, y)
     colSums(outer(y, x, '<'))
-@ 
+@
 
 Gustavo Carvalho suggested the equivalent of
 <<f3>>=
 f3 <- Vectorize(function(x, y) sum(y < x), "x")
-@ 
+@
 and Bill Dunlap, drawing on his encyclopedic knowledge of S-PLUS and R
 functions, noted that this operation was essential what is done in R's
 findInterval function which uses compiled code implementing a binary search.
 <<f4>>=
 f4 <- function(x, y) length(y) - findInterval(-x, rev(-sort(y)))
-@ 
+@
 
 For large vectors \code{x} and \code{y}, Bill's version is much faster
 than any of the other suggestions which involve comparing each element
 of \code{x} to each elements of \code{y}. Interestingly, the second
-version (\code{f2}), which was suggested by two experience R users,
+version (\code{f2}), which was suggested by two experienced R users,
 can become deadly slow on moderate sized vectors, because of the way
 that the \code{outer} function is implemented.
 
@@ -74,14 +74,14 @@
           columns=c("test","elapsed","relative"),
           order="relative", replications=10L)
 
-@ 
+@
 
 We will eliminate all but Bill Dunlap's method on this evidence and
 change the rules a bit.  The question posed by Sunduz Keles regarded
 p-values from a test sample relative to a larger reference sample
 
-(From here you can continue with the description on the R-help and
-Rcpp-Devel postings in which I included benchmark timings of various
-versions.)
+%(From here you can continue with the description on the R-help and
+%Rcpp-Devel postings in which I included benchmark timings of various
+%versions.)
 
 \end{document}

Added: papers/BatesEddelbuettel/bb.r
===================================================================
--- papers/BatesEddelbuettel/bb.r	                        (rev 0)
+++ papers/BatesEddelbuettel/bb.r	2011-04-22 22:32:11 UTC (rev 3008)
@@ -0,0 +1,144 @@
+
+## From: Douglas Bates <bates at stat.wisc.edu>
+## To: xian at unm.edu
+## Cc: William Dunlap <wdunlap at tibco.com>, rcpp-devel at r-forge.wu-wien.ac.at,
+##         Kevin Ummel <kevinummel at gmail.com>
+## Subject: Re: [Rcpp-devel] [R] Find number of elements less than some number:
+##  Elegant/fastsolution needed
+## Date: Sun, 17 Apr 2011 13:00:39 -0500
+
+## I enclose some comparisons, admittedly on only one size of example (10
+## million in the reference distribution and 100,000 in the test sample).
+
+## The versions using Rcpp and the std algorithms came out about 3 times
+## as fast as the version using R's findInterval.  What surprises me is
+## that the sequential comparisons in C++ (version k) is slightly faster
+## than the binary search versions (h and j).  The unassisted binary
+## search requires 30 probes for each element of ans and the sequential
+## comparisons show take an average of 100 comparisons.  I suppose the
+## difference is sequential access versus random access to the elements
+## of y.
+
+## [ ... lots of lines from email deleted, final results below ... ]
+
+## > benchmark(f(x,y), g(x,y), h(x,y), j(x,y), k(x,y),
+## +           order="relative", replications=10,
+## +           columns=c("test", "replications", "elapsed", "relative"))
+##      test replications elapsed relative
+## 5 k(x, y)           10  19.477 1.000000
+## 4 j(x, y)           10  19.984 1.026031
+## 3 h(x, y)           10  20.107 1.032346
+## 2 g(x, y)           10  57.691 2.962006
+## 1 f(x, y)           10  60.563 3.109462
+## >
+## >
+## > proc.time()
+##    user  system elapsed
+## 213.060  15.880 233.067
+
+## Code slighly edited
+
+suppressMessage(require(inline))
+suppressMessage(require(Rcpp))
+suppressMessage(require(rbenchmark))
+
+## I'm changing the rules a bit here because Sunduz wanted the
+## counts in the original order of x
+
+f <- function(x, y) length(y) - findInterval(-x, rev(-sort(y)))
+
+## findInterval can take advantage of x being ordered
+
+g <- function(x, y) {
+    ord <- order(x)
+    ans <- integer(length(x))
+    ans[ord] <- length(y) - findInterval(-x[ord], rev(-sort(y)))
+    ans
+}
+
+## version using std::upper_bound for unordered x
+h <- cxxfunction(signature(x_="numeric", y_="numeric"), '
+{
+    Rcpp::NumericVector x(x_),
+        y = clone(Rcpp::NumericVector(y_));
+    int n = x.size();
+    Rcpp::IntegerVector ans(n);
+    const Rcpp::NumericVector::iterator bb = y.begin(), ee = y.end();
+               // sort (a copy of) the y values for binary search
+    std::sort(bb, ee);
+
+    for (int i = 0; i < n; i++) {
+	ans[i] = std::upper_bound(bb, ee, x[i]) - bb;
+    }
+    return ans;
+}
+', plugin="Rcpp")
+
+## version using std::upper_bound when x is non-decreasing
+j1 <- cxxfunction(signature(x_="numeric", y_="numeric"), '
+{
+    Rcpp::NumericVector x(x_),
+        y = clone(Rcpp::NumericVector(y_));
+    int n = x.size();
+    Rcpp::IntegerVector ans(n);
+    const Rcpp::NumericVector::iterator bb = y.begin(), ee = y.end();
+    Rcpp::NumericVector::iterator mm = y.begin();
+               // sort (a copy of) the y values for binary search
+    std::sort(bb, ee);
+
+    for (int i = 0; i < n; i++) {
+        mm = std::upper_bound(mm, ee, x[i]);
+        ans[i] = mm - bb;
+    }
+    return ans;
+}
+', plugin="Rcpp")
+
+j <- function(x, y) {
+    ord <- order(x)
+    ans <- integer(length(x))
+    ans[ord] <- j1(x[ord], y)
+    ans
+}
+
+## version using sequential search
+k1 <- cxxfunction(signature(xs_="numeric", y_="numeric", ord_="integer"), '
+{
+    Rcpp::NumericVector xs(xs_),
+        y = clone(Rcpp::NumericVector(y_));
+    int n = xs.size();
+    Rcpp::IntegerVector ord(ord_), ans(n);
+    const Rcpp::NumericVector::iterator bb = y.begin(), ee = y.end();
+    Rcpp::NumericVector::iterator yy = y.begin();
+    double *xp = xs.begin();
+               // sort (a copy of) the y values for binary search
+    std::sort(bb, ee);
+
+    for (int i = 0; i < n && yy < ee; i++, xp++) {
+	while (*xp >= *yy && yy < ee) yy++;
+	ans[ord[i] - 1] = yy - bb;
+    }
+    return ans;
+}
+', plugin="Rcpp")
+
+k <- function(x, y) {
+    ord <- order(x)
+    k1(x[ord], y, ord)
+}
+
+x <- rnorm(1e5)
+yy <- y <- rnorm(1e7)
+yy[1] <- y[1]                           # ensure y and yy are distinct
+
+aa <- f(x, y)
+all.equal(aa, g(x, y))
+all.equal(aa, h(x, y))
+all.equal(aa, j(x, y))
+all.equal(aa, k(x, y))
+all.equal(y, yy)                        # check that nothing has changed in y
+
+res <- benchmark(f(x,y), g(x,y), h(x,y), j(x,y), k(x,y),
+                 order="relative", replications=10,
+                 columns=c("test", "replications", "elapsed", "relative"))
+print(res)

Added: papers/BatesEddelbuettel/christian_20110414.txt
===================================================================
--- papers/BatesEddelbuettel/christian_20110414.txt	                        (rev 0)
+++ papers/BatesEddelbuettel/christian_20110414.txt	2011-04-22 22:32:11 UTC (rev 3008)
@@ -0,0 +1,123 @@
+## From: Christian Gunning <xian at unm.edu>
+## To: rcpp-devel at r-forge.wu-wien.ac.at
+## Cc: rvaradhan at jhmi.edu, Kevin Ummel <kevinummel at gmail.com>,
+##         r-help at r-project.org, Sunduz Keles <keles at stat.wisc.edu>,
+##         William Dunlap <wdunlap at tibco.com>,
+##         "W. Duncan Wadsworth" <w.d.wadsworth at gmail.com>
+## Subject: Re: [Rcpp-devel] [R] Find number of elements less than some number:
+##  Elegant/fastsolution needed
+## Date: Thu, 14 Apr 2011 21:47:22 -0700
+
+## On Thu, Apr 14, 2011 at 7:02 PM,
+## <rcpp-devel-request at r-forge.wu-wien.ac.at> wrote:
+
+## > I was able to write a very short C++ function using the Rcpp package
+## > that provided about a 1000-fold increase in speed relative to the best
+## > I could do in R.  I don't have the script on this computer so I will
+## > post it tomorrow when I am back on the computer at the office.
+## >
+## > Apologies for cross-posting to the Rcpp-devel list but I am doing so
+## > because this might make a good example of the usefulness of Rcpp and
+## > inline.
+
+## And RcppArmadillo, as the case may be.
+
+## This is a cool little problem.  In the examples given, I'd caution
+## people against comparing apples and durian.  The sort(x) is a cost
+## that should be considered *within* each implementation. I used
+## Armadillo to sort (src, f4), and get another 100% worth of speedup
+## that I can't reproducing using R's sort (src1, f1-f3).  If i modify
+## SEXP in-place (and this always confuses me, so I tend to avoid it),
+## I'm seeing an additional ~5-10% speed gain (src2, f5) -- the advantage
+## of this last seems to be primarily in memory-constrained applications.
+
+## On to the code!
+
+src = '
+NumericVector xx_(clone(x)), yy_(clone(y));
+int nxx = xx_.size();
+int nyy = yy_.size();
+arma::vec xx(xx_), yy(yy_);
+yy = sort(yy);
+xx = sort(xx);
+//
+//
+int j = 0; //gt index for yy
+for (int i=0; i < nxx; i++) {
+    while ((j < nyy) && ( xx(i) > yy(j) ) ) {
+        j++;
+    }
+    xx_(i) = j;
+   }
+return (xx_);
+'
+
+src1 = '
+NumericVector xx_(clone(x)), yy_(clone(y));
+// assumes x & y are already sorted
+arma::vec xx(xx_), yy(yy_);
+int nxx = xx.n_elem;
+int nyy = yy.n_elem;
+int j = 0; //gt index for yy
+for (int i=0; i < nxx; i++) {
+    while ((j < nyy) && ( xx(i) > yy(j) ) ) {
+        j++;
+    }
+    xx_(i) = j;
+  }
+return (xx_);
+'
+
+src2 = '
+NumericVector xx_(x), yy_(y);  //kinda scary
+int nxx = xx_.size();
+int nyy = yy_.size();
+arma::vec xx(xx_.begin(), nxx, false), yy(yy_.begin(), nyy, false);
+//really kinda scary
+yy = sort(yy);
+xx = sort(xx);
+//
+//
+int j = 0; //gt index for yy
+for (int i=0; i < nxx; i++) {
+    while ((j < nyy) && ( xx(i) > yy(j) ) ) {
+        j++;
+    }
+    xx_(i) = j;
+}
+return (xx_);
+'
+
+require(inline)
+require(RcppArmadillo)
+f1 <- function(x, y) { sort(length(y) - findInterval(-x, rev(-sort(y))));}
+f2 <- function(x, y) {x = sort(x); length(y) - findInterval(-x, rev(-sort(y)))}
+f3.a <- cxxfunction(signature(x="numeric", y="numeric"), src1,
+plugin='RcppArmadillo')
+f3 <- function(x,y) {
+        x <- sort(x)
+        y <- sort(y)
+        return(f3.a(x,y))
+}
+f4 <- cxxfunction(signature(x="numeric", y="numeric"), src,
+plugin='RcppArmadillo')
+##  danger -- violates R semantics
+f5 <- cxxfunction(signature(x="numeric", y="numeric"), src2,
+plugin='RcppArmadillo')
+
+
+## this is a really ugly test. ygwypf, i suppose :)
+
+for (i in 1:5) {
+  x1 <- x <- rnorm(5e6)
+  y1 <- y <- rnorm(5e6)
+  print( cbind(
+    r1=system.time(r1 <- f1(x,y)),
+    r2=system.time(r2 <- f2(x,y)), r3=system.time(r3 <- f3(x1,y1)),
+    r4 = system.time(r4 <- f4(x,y)), r5 = system.time(r5 <- f5(x,y))
+  ))
+}
+print(all.equal(r1, r2))
+print(all.equal(r1, r3))
+print(all.equal(r1, r4))
+print(all.equal(r1, r5))

Added: papers/BatesEddelbuettel/doug_20110414.txt
===================================================================
--- papers/BatesEddelbuettel/doug_20110414.txt	                        (rev 0)
+++ papers/BatesEddelbuettel/doug_20110414.txt	2011-04-22 22:32:11 UTC (rev 3008)
@@ -0,0 +1,31 @@
+From: Douglas Bates <bates at stat.wisc.edu>
+To: William Dunlap <wdunlap at tibco.com>
+Cc: r-help at r-project.org, Sunduz Keles <keles at stat.wisc.edu>,
+        rcpp-devel <Rcpp-devel at r-forge.wu-wien.ac.at>,
+        Kevin Ummel <kevinummel at gmail.com>
+Subject: Re: [Rcpp-devel] [R] Find number of elements less than some number:
+ Elegant/fastsolution needed
+Date: Thu, 14 Apr 2011 17:22:53 -0500
+
+My colleague Sunduz Keles once mentioned a similar problem to me.  She
+had a large sample from a reference distribution and a test sample
+(both real-valued in her case) and she wanted, for each element of the
+test sample, the proportion of the reference sample that was less than
+the element.  It's a type of empirical p-value calculation.
+
+I forget the exact sizes of the samples (do you remember, Sunduz?) but
+they were in the tens of thousands or larger.  Solutions in R tended
+to involve comparing every element in the test sample to every element
+in the reference sample but, of course, that is unnecessary.  If you
+sort both samples then you can start the comparisons for a particular
+element of the test sample at the element of the reference sample
+where the last comparison failed.
+
+I was able to write a very short C++ function using the Rcpp package
+that provided about a 1000-fold increase in speed relative to the best
+I could do in R.  I don't have the script on this computer so I will
+post it tomorrow when I am back on the computer at the office.
+
+Apologies for cross-posting to the Rcpp-devel list but I am doing so
+because this might make a good example of the usefulness of Rcpp and
+inline.

Added: papers/BatesEddelbuettel/kevin_20110414.txt
===================================================================
--- papers/BatesEddelbuettel/kevin_20110414.txt	                        (rev 0)
+++ papers/BatesEddelbuettel/kevin_20110414.txt	2011-04-22 22:32:11 UTC (rev 3008)
@@ -0,0 +1,62 @@
+From: Kevin Ummel <kevinummel at gmail.com>
+To: r-help at r-project.org
+Subject: [R] Find number of elements less than some number: Elegant/fast
+	solution needed
+Date: Thu, 14 Apr 2011 14:34:44 -0500
+
+Take vector x and a subset y:
+
+x=1:10
+
+y=c(4,5,7,9)
+
+For each value in 'x', I want to know how many elements in 'y' are less than 'x'.
+
+An example would be:
+
+sapply(x,FUN=function(i) {length(which(y<i))})
+ [1] 0 0 0 0 1 2 2 3 3 4
+
+But this solution is far too slow when x and y have lengths in the millions.
+
+I'm certain an elegant (and computationally efficient) solution exists, but I'm in the weeds at this point.
+
+Any help is much appreciated.
+
+Kevin
+
+University of Manchester
+
+
+
+
+
+
+
+
+Take two vectors x and y, where y is a subset of x:
+
+x=1:10
+
+y=c(2,5,6,9)
+
+If y is removed from x, the original x values now have a new placement (index) in the resulting vector (new): 
+
+new=x[-y]
+
+index=1:length(new)
+
+The challenge is: How can I *quickly* and *efficiently* deduce the new 'index' value directly from the original 'x' value -- using only 'y' as an input?
+
+In practice, I have very large matrices containing the 'x' values, and I need to convert them to the corresponding 'index' if the 'y' values are removed.
+
+
+
+
+	[[alternative HTML version deleted]]
+
+______________________________________________
+R-help at r-project.org mailing list
+https://stat.ethz.ch/mailman/listinfo/r-help
+PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
+and provide commented, minimal, self-contained, reproducible code.