[Rcpp-commits] r3008 - papers/BatesEddelbuettel
noreply at r-forge.r-project.org
noreply at r-forge.r-project.org
Sat Apr 23 00:32:11 CEST 2011
Author: edd
Date: 2011-04-23 00:32:11 +0200 (Sat, 23 Apr 2011)
New Revision: 3008
Added:
papers/BatesEddelbuettel/bb.r
papers/BatesEddelbuettel/christian_20110414.txt
papers/BatesEddelbuettel/doug_20110414.txt
papers/BatesEddelbuettel/kevin_20110414.txt
Modified:
papers/BatesEddelbuettel/UmmelKeles.Rnw
Log:
more emails about the question
Modified: papers/BatesEddelbuettel/UmmelKeles.Rnw
===================================================================
--- papers/BatesEddelbuettel/UmmelKeles.Rnw 2011-04-22 21:39:36 UTC (rev 3007)
+++ papers/BatesEddelbuettel/UmmelKeles.Rnw 2011-04-22 22:32:11 UTC (rev 3008)
@@ -19,7 +19,7 @@
library(inline)
library(Rcpp)
library(rbenchmark)
-@
+@
\begin{document}
Recently a query by Kevin Ummel on the R-help mailing list prompted a
@@ -30,31 +30,31 @@
There are various ways of doing this. The original poster used
<<f1>>=
-f1 <- function(x, y)
+f1 <- function(x, y)
sapply(x, function(i) length(which(y < i)))
-@
+@
Richard Heiberger and Marc Swartz both suggested
<<f2>>=
f2 <- function(x, y)
colSums(outer(y, x, '<'))
-@
+@
Gustavo Carvalho suggested the equivalent of
<<f3>>=
f3 <- Vectorize(function(x, y) sum(y < x), "x")
-@
+@
and Bill Dunlap, drawing on his encyclopedic knowledge of S-PLUS and R
functions, noted that this operation was essential what is done in R's
findInterval function which uses compiled code implementing a binary search.
<<f4>>=
f4 <- function(x, y) length(y) - findInterval(-x, rev(-sort(y)))
-@
+@
For large vectors \code{x} and \code{y}, Bill's version is much faster
than any of the other suggestions which involve comparing each element
of \code{x} to each elements of \code{y}. Interestingly, the second
-version (\code{f2}), which was suggested by two experience R users,
+version (\code{f2}), which was suggested by two experienced R users,
can become deadly slow on moderate sized vectors, because of the way
that the \code{outer} function is implemented.
@@ -74,14 +74,14 @@
columns=c("test","elapsed","relative"),
order="relative", replications=10L)
-@
+@
We will eliminate all but Bill Dunlap's method on this evidence and
change the rules a bit. The question posed by Sunduz Keles regarded
p-values from a test sample relative to a larger reference sample
-(From here you can continue with the description on the R-help and
-Rcpp-Devel postings in which I included benchmark timings of various
-versions.)
+%(From here you can continue with the description on the R-help and
+%Rcpp-Devel postings in which I included benchmark timings of various
+%versions.)
\end{document}
Added: papers/BatesEddelbuettel/bb.r
===================================================================
--- papers/BatesEddelbuettel/bb.r (rev 0)
+++ papers/BatesEddelbuettel/bb.r 2011-04-22 22:32:11 UTC (rev 3008)
@@ -0,0 +1,144 @@
+
+## From: Douglas Bates <bates at stat.wisc.edu>
+## To: xian at unm.edu
+## Cc: William Dunlap <wdunlap at tibco.com>, rcpp-devel at r-forge.wu-wien.ac.at,
+## Kevin Ummel <kevinummel at gmail.com>
+## Subject: Re: [Rcpp-devel] [R] Find number of elements less than some number:
+## Elegant/fastsolution needed
+## Date: Sun, 17 Apr 2011 13:00:39 -0500
+
+## I enclose some comparisons, admittedly on only one size of example (10
+## million in the reference distribution and 100,000 in the test sample).
+
+## The versions using Rcpp and the std algorithms came out about 3 times
+## as fast as the version using R's findInterval. What surprises me is
+## that the sequential comparisons in C++ (version k) is slightly faster
+## than the binary search versions (h and j). The unassisted binary
+## search requires 30 probes for each element of ans and the sequential
+## comparisons show take an average of 100 comparisons. I suppose the
+## difference is sequential access versus random access to the elements
+## of y.
+
+## [ ... lots of lines from email deleted, final results below ... ]
+
+## > benchmark(f(x,y), g(x,y), h(x,y), j(x,y), k(x,y),
+## + order="relative", replications=10,
+## + columns=c("test", "replications", "elapsed", "relative"))
+## test replications elapsed relative
+## 5 k(x, y) 10 19.477 1.000000
+## 4 j(x, y) 10 19.984 1.026031
+## 3 h(x, y) 10 20.107 1.032346
+## 2 g(x, y) 10 57.691 2.962006
+## 1 f(x, y) 10 60.563 3.109462
+## >
+## >
+## > proc.time()
+## user system elapsed
+## 213.060 15.880 233.067
+
+## Code slighly edited
+
+suppressMessage(require(inline))
+suppressMessage(require(Rcpp))
+suppressMessage(require(rbenchmark))
+
+## I'm changing the rules a bit here because Sunduz wanted the
+## counts in the original order of x
+
+f <- function(x, y) length(y) - findInterval(-x, rev(-sort(y)))
+
+## findInterval can take advantage of x being ordered
+
+g <- function(x, y) {
+ ord <- order(x)
+ ans <- integer(length(x))
+ ans[ord] <- length(y) - findInterval(-x[ord], rev(-sort(y)))
+ ans
+}
+
+## version using std::upper_bound for unordered x
+h <- cxxfunction(signature(x_="numeric", y_="numeric"), '
+{
+ Rcpp::NumericVector x(x_),
+ y = clone(Rcpp::NumericVector(y_));
+ int n = x.size();
+ Rcpp::IntegerVector ans(n);
+ const Rcpp::NumericVector::iterator bb = y.begin(), ee = y.end();
+ // sort (a copy of) the y values for binary search
+ std::sort(bb, ee);
+
+ for (int i = 0; i < n; i++) {
+ ans[i] = std::upper_bound(bb, ee, x[i]) - bb;
+ }
+ return ans;
+}
+', plugin="Rcpp")
+
+## version using std::upper_bound when x is non-decreasing
+j1 <- cxxfunction(signature(x_="numeric", y_="numeric"), '
+{
+ Rcpp::NumericVector x(x_),
+ y = clone(Rcpp::NumericVector(y_));
+ int n = x.size();
+ Rcpp::IntegerVector ans(n);
+ const Rcpp::NumericVector::iterator bb = y.begin(), ee = y.end();
+ Rcpp::NumericVector::iterator mm = y.begin();
+ // sort (a copy of) the y values for binary search
+ std::sort(bb, ee);
+
+ for (int i = 0; i < n; i++) {
+ mm = std::upper_bound(mm, ee, x[i]);
+ ans[i] = mm - bb;
+ }
+ return ans;
+}
+', plugin="Rcpp")
+
+j <- function(x, y) {
+ ord <- order(x)
+ ans <- integer(length(x))
+ ans[ord] <- j1(x[ord], y)
+ ans
+}
+
+## version using sequential search
+k1 <- cxxfunction(signature(xs_="numeric", y_="numeric", ord_="integer"), '
+{
+ Rcpp::NumericVector xs(xs_),
+ y = clone(Rcpp::NumericVector(y_));
+ int n = xs.size();
+ Rcpp::IntegerVector ord(ord_), ans(n);
+ const Rcpp::NumericVector::iterator bb = y.begin(), ee = y.end();
+ Rcpp::NumericVector::iterator yy = y.begin();
+ double *xp = xs.begin();
+ // sort (a copy of) the y values for binary search
+ std::sort(bb, ee);
+
+ for (int i = 0; i < n && yy < ee; i++, xp++) {
+ while (*xp >= *yy && yy < ee) yy++;
+ ans[ord[i] - 1] = yy - bb;
+ }
+ return ans;
+}
+', plugin="Rcpp")
+
+k <- function(x, y) {
+ ord <- order(x)
+ k1(x[ord], y, ord)
+}
+
+x <- rnorm(1e5)
+yy <- y <- rnorm(1e7)
+yy[1] <- y[1] # ensure y and yy are distinct
+
+aa <- f(x, y)
+all.equal(aa, g(x, y))
+all.equal(aa, h(x, y))
+all.equal(aa, j(x, y))
+all.equal(aa, k(x, y))
+all.equal(y, yy) # check that nothing has changed in y
+
+res <- benchmark(f(x,y), g(x,y), h(x,y), j(x,y), k(x,y),
+ order="relative", replications=10,
+ columns=c("test", "replications", "elapsed", "relative"))
+print(res)
Added: papers/BatesEddelbuettel/christian_20110414.txt
===================================================================
--- papers/BatesEddelbuettel/christian_20110414.txt (rev 0)
+++ papers/BatesEddelbuettel/christian_20110414.txt 2011-04-22 22:32:11 UTC (rev 3008)
@@ -0,0 +1,123 @@
+## From: Christian Gunning <xian at unm.edu>
+## To: rcpp-devel at r-forge.wu-wien.ac.at
+## Cc: rvaradhan at jhmi.edu, Kevin Ummel <kevinummel at gmail.com>,
+## r-help at r-project.org, Sunduz Keles <keles at stat.wisc.edu>,
+## William Dunlap <wdunlap at tibco.com>,
+## "W. Duncan Wadsworth" <w.d.wadsworth at gmail.com>
+## Subject: Re: [Rcpp-devel] [R] Find number of elements less than some number:
+## Elegant/fastsolution needed
+## Date: Thu, 14 Apr 2011 21:47:22 -0700
+
+## On Thu, Apr 14, 2011 at 7:02 PM,
+## <rcpp-devel-request at r-forge.wu-wien.ac.at> wrote:
+
+## > I was able to write a very short C++ function using the Rcpp package
+## > that provided about a 1000-fold increase in speed relative to the best
+## > I could do in R. I don't have the script on this computer so I will
+## > post it tomorrow when I am back on the computer at the office.
+## >
+## > Apologies for cross-posting to the Rcpp-devel list but I am doing so
+## > because this might make a good example of the usefulness of Rcpp and
+## > inline.
+
+## And RcppArmadillo, as the case may be.
+
+## This is a cool little problem. In the examples given, I'd caution
+## people against comparing apples and durian. The sort(x) is a cost
+## that should be considered *within* each implementation. I used
+## Armadillo to sort (src, f4), and get another 100% worth of speedup
+## that I can't reproducing using R's sort (src1, f1-f3). If i modify
+## SEXP in-place (and this always confuses me, so I tend to avoid it),
+## I'm seeing an additional ~5-10% speed gain (src2, f5) -- the advantage
+## of this last seems to be primarily in memory-constrained applications.
+
+## On to the code!
+
+src = '
+NumericVector xx_(clone(x)), yy_(clone(y));
+int nxx = xx_.size();
+int nyy = yy_.size();
+arma::vec xx(xx_), yy(yy_);
+yy = sort(yy);
+xx = sort(xx);
+//
+//
+int j = 0; //gt index for yy
+for (int i=0; i < nxx; i++) {
+ while ((j < nyy) && ( xx(i) > yy(j) ) ) {
+ j++;
+ }
+ xx_(i) = j;
+ }
+return (xx_);
+'
+
+src1 = '
+NumericVector xx_(clone(x)), yy_(clone(y));
+// assumes x & y are already sorted
+arma::vec xx(xx_), yy(yy_);
+int nxx = xx.n_elem;
+int nyy = yy.n_elem;
+int j = 0; //gt index for yy
+for (int i=0; i < nxx; i++) {
+ while ((j < nyy) && ( xx(i) > yy(j) ) ) {
+ j++;
+ }
+ xx_(i) = j;
+ }
+return (xx_);
+'
+
+src2 = '
+NumericVector xx_(x), yy_(y); //kinda scary
+int nxx = xx_.size();
+int nyy = yy_.size();
+arma::vec xx(xx_.begin(), nxx, false), yy(yy_.begin(), nyy, false);
+//really kinda scary
+yy = sort(yy);
+xx = sort(xx);
+//
+//
+int j = 0; //gt index for yy
+for (int i=0; i < nxx; i++) {
+ while ((j < nyy) && ( xx(i) > yy(j) ) ) {
+ j++;
+ }
+ xx_(i) = j;
+}
+return (xx_);
+'
+
+require(inline)
+require(RcppArmadillo)
+f1 <- function(x, y) { sort(length(y) - findInterval(-x, rev(-sort(y))));}
+f2 <- function(x, y) {x = sort(x); length(y) - findInterval(-x, rev(-sort(y)))}
+f3.a <- cxxfunction(signature(x="numeric", y="numeric"), src1,
+plugin='RcppArmadillo')
+f3 <- function(x,y) {
+ x <- sort(x)
+ y <- sort(y)
+ return(f3.a(x,y))
+}
+f4 <- cxxfunction(signature(x="numeric", y="numeric"), src,
+plugin='RcppArmadillo')
+## danger -- violates R semantics
+f5 <- cxxfunction(signature(x="numeric", y="numeric"), src2,
+plugin='RcppArmadillo')
+
+
+## this is a really ugly test. ygwypf, i suppose :)
+
+for (i in 1:5) {
+ x1 <- x <- rnorm(5e6)
+ y1 <- y <- rnorm(5e6)
+ print( cbind(
+ r1=system.time(r1 <- f1(x,y)),
+ r2=system.time(r2 <- f2(x,y)), r3=system.time(r3 <- f3(x1,y1)),
+ r4 = system.time(r4 <- f4(x,y)), r5 = system.time(r5 <- f5(x,y))
+ ))
+}
+print(all.equal(r1, r2))
+print(all.equal(r1, r3))
+print(all.equal(r1, r4))
+print(all.equal(r1, r5))
Added: papers/BatesEddelbuettel/doug_20110414.txt
===================================================================
--- papers/BatesEddelbuettel/doug_20110414.txt (rev 0)
+++ papers/BatesEddelbuettel/doug_20110414.txt 2011-04-22 22:32:11 UTC (rev 3008)
@@ -0,0 +1,31 @@
+From: Douglas Bates <bates at stat.wisc.edu>
+To: William Dunlap <wdunlap at tibco.com>
+Cc: r-help at r-project.org, Sunduz Keles <keles at stat.wisc.edu>,
+ rcpp-devel <Rcpp-devel at r-forge.wu-wien.ac.at>,
+ Kevin Ummel <kevinummel at gmail.com>
+Subject: Re: [Rcpp-devel] [R] Find number of elements less than some number:
+ Elegant/fastsolution needed
+Date: Thu, 14 Apr 2011 17:22:53 -0500
+
+My colleague Sunduz Keles once mentioned a similar problem to me. She
+had a large sample from a reference distribution and a test sample
+(both real-valued in her case) and she wanted, for each element of the
+test sample, the proportion of the reference sample that was less than
+the element. It's a type of empirical p-value calculation.
+
+I forget the exact sizes of the samples (do you remember, Sunduz?) but
+they were in the tens of thousands or larger. Solutions in R tended
+to involve comparing every element in the test sample to every element
+in the reference sample but, of course, that is unnecessary. If you
+sort both samples then you can start the comparisons for a particular
+element of the test sample at the element of the reference sample
+where the last comparison failed.
+
+I was able to write a very short C++ function using the Rcpp package
+that provided about a 1000-fold increase in speed relative to the best
+I could do in R. I don't have the script on this computer so I will
+post it tomorrow when I am back on the computer at the office.
+
+Apologies for cross-posting to the Rcpp-devel list but I am doing so
+because this might make a good example of the usefulness of Rcpp and
+inline.
Added: papers/BatesEddelbuettel/kevin_20110414.txt
===================================================================
--- papers/BatesEddelbuettel/kevin_20110414.txt (rev 0)
+++ papers/BatesEddelbuettel/kevin_20110414.txt 2011-04-22 22:32:11 UTC (rev 3008)
@@ -0,0 +1,62 @@
+From: Kevin Ummel <kevinummel at gmail.com>
+To: r-help at r-project.org
+Subject: [R] Find number of elements less than some number: Elegant/fast
+ solution needed
+Date: Thu, 14 Apr 2011 14:34:44 -0500
+
+Take vector x and a subset y:
+
+x=1:10
+
+y=c(4,5,7,9)
+
+For each value in 'x', I want to know how many elements in 'y' are less than 'x'.
+
+An example would be:
+
+sapply(x,FUN=function(i) {length(which(y<i))})
+ [1] 0 0 0 0 1 2 2 3 3 4
+
+But this solution is far too slow when x and y have lengths in the millions.
+
+I'm certain an elegant (and computationally efficient) solution exists, but I'm in the weeds at this point.
+
+Any help is much appreciated.
+
+Kevin
+
+University of Manchester
+
+
+
+
+
+
+
+
+Take two vectors x and y, where y is a subset of x:
+
+x=1:10
+
+y=c(2,5,6,9)
+
+If y is removed from x, the original x values now have a new placement (index) in the resulting vector (new):
+
+new=x[-y]
+
+index=1:length(new)
+
+The challenge is: How can I *quickly* and *efficiently* deduce the new 'index' value directly from the original 'x' value -- using only 'y' as an input?
+
+In practice, I have very large matrices containing the 'x' values, and I need to convert them to the corresponding 'index' if the 'y' values are removed.
+
+
+
+
+ [[alternative HTML version deleted]]
+
+______________________________________________
+R-help at r-project.org mailing list
+https://stat.ethz.ch/mailman/listinfo/r-help
+PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
+and provide commented, minimal, self-contained, reproducible code.
More information about the Rcpp-commits
mailing list