[Rcpp-devel] Numeric comparison of two long vectors [was: quick question]

Thu Apr 29 22:40:26 CEST 2010

Hi Sunduz,

Thanks for the question.  As I mentioned when we spoke, this will make
a good demonstration of the Rcpp interface to compiled code in C++
because it is the type of calculation that is going to be very slow in
native R code but can be done quite quickly in C++.

I am cc:'ing the Rcpp-devel list on this so that others can comment
and perhaps compare methods.

The gist of the question is that you have long vector, tobs, of
observed values and a reference distribution represented by tperm.
For each element, tobs[i], you want to know the proportion of the
values in tperm less than tobs[i].

In the current form the problem is that you are doing N * B
comparisons plus a lot of sums and all that.  The first thing to do to
speed it up is to sort tperm so you only need to compare to the first
j satisfying tperm[j] > tobs[i].  If you also sort tobs then you can
start at the point where the last comparison failed.  For tobs you
should use order, rather than sort which will allow you to associate
the pvalues with the original ordering of tobs.

The output I enclose is before assigning the pvalues back to the
correct observations (and may have a few "infelicities", as Bill
Venables calls them, still lurking around).

On Thu, Apr 29, 2010 at 1:40 PM, Sunduz Keles <keles at stat.wisc.edu> wrote:
>
> Hi Doug,
>
> we are trying to implement a simple permutation approach for getting a null
> dist. Generating observations from the permutation null distribution seems
> pretty fast but when it comes to getting the actual p-value which is just a
> simple sum of number of permuted test stat which are smaller than or equal
> to what we actually observe, things get pretty slow becaue we have a large
> number of observed test stats.
>
> Here is a piece of code from what we have
> B = 6000000
> N = 3000000
> set.seed(1)
> tperm = rnorm(B, 0, 1) #we get these from somewhere else but for simplicity
> let's assume that we generate them like this.
> tobs = rnorm(N, 0, 1)
> tobs[sample(1:N, 10000, replace = F)]  = rnorm(1000, 2, 1)
>
> #This is super slow.
> pvals = apply(as.matrix(tobs), 1, function(x, y, B){sum(x>=y)/B}, tperm, B)
>
> Maybe there is really no way around this but I was wondering whether you had
> any suggestions for replacing sum(x>=y). If we have for example pnorm
> instead of sum(x>=y) that runs much much faster.
>
>
> Thanks much in advance!
> Sunduz
>
>
> ------------------------------------------------------------------------------------
> Sunduz Keles
>  http://www.stat.wisc.edu/~keles
> Associate Professor
> Department of Statistics                            tel: 608-263-4533 (1245B
>  MSC)
> Department of Biostatistics and Medical Informatics
> University of Wisconsin-Madison
> -----------------------------------------------------------------------------------
>
-------------- next part --------------

R version 2.10.1 (2009-12-14)
Copyright (C) 2009 The R Foundation for Statistical Computing
ISBN 3-900051-07-0

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> nperm <- 6000000
> nobs <- 3000000
> 
> set.seed(1)
> tperm <- sort(rnorm(nperm))                 # these can be sorted immediately
> tobs <- rnorm(nobs)
> 
> ## contaminate the sample
> nbig <- 10000
> tobs[sample(1:nobs, nbig, replace = FALSE)]  <- rnorm(nbig, mean = 2)
> 
> ord <- order(tobs)
> tord <- tobs[ord]
> 
> library(inline)  # allows you to compile and load C++ code
> cpp <- '
+   Rcpp::NumericVector tord(tordP), tperm(tpermP);
+   Rcpp::NumericVector ans(tord.size());  // result vector
+   double *to = tord.begin(), *tp = tperm.begin(), *aa = ans.begin();
+   int nobs = tord.size(), nperm = tperm.size();
+   for (int i = 0, j = 0; i < nobs; i++) {
+     while (tp[j] < to[i] && j < nperm) j++;
+     aa[i] = ((double)j)/((double)nperm);
+   }
+   return ans;
+ '
> ff <- cfunction(signature(tordP = "numeric", tpermP = "numeric"), cpp, Rcpp = TRUE, )
Loading required package: Rcpp
> 
> str(ans <- ff(tord, tperm))
 num [1:3000000] 1.67e-07 1.67e-07 8.33e-07 1.00e-06 1.67e-06 ...
> 
> proc.time()
   user  system elapsed 
 13.140   1.310  14.069