[Rcpp-devel] Performance/memory management question

Toki Loo tokiloo1 at yahoo.fr
Thu Mar 28 00:54:07 CET 2013


I have an initial 10^5 x 20 matrix containing observations for a set of individuals (the first column of the matrix holds the individual ID).
I want to sample the unique individuals with replacement and get the observations of every sampled individual (with possible repetitions).
Simplified example, with m of dimension (5, 2).
Given m:
Ind  Obs
1   3.4
1   3.6
2   5
3   6
4   7

resample(m) may give
1 3.4
1 3.6
2 5
1 3.4
1 3.6
1 3.4
1 3.6
if 1 2 1 1 were sampled from the individuals 1 2 3 4.
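
The resampling described above can also be sketched in plain C++, without any Rcpp types. This is only a sketch under assumptions: the helper names group_rows, rows_for_draws and draw_individuals are made up for illustration, std::mt19937 from the C++11 <random> header stands in for the tr1 generator, and the functions return row indices (0-based) rather than copies of the rows.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <random>
#include <vector>

// Hypothetical helper: map each individual id to the indices of its rows.
std::map<int, std::vector<std::size_t>>
group_rows(const std::vector<int>& ind) {
    std::map<int, std::vector<std::size_t>> rows;
    for (std::size_t i = 0; i < ind.size(); ++i) rows[ind[i]].push_back(i);
    return rows;
}

// Hypothetical helper: expand a sequence of drawn individuals into the
// row indices of all their observations, in draw order.
std::vector<std::size_t>
rows_for_draws(const std::vector<int>& ind, const std::vector<int>& draws) {
    const auto rows = group_rows(ind);
    std::vector<std::size_t> out;
    for (std::size_t k = 0; k < draws.size(); ++k) {
        auto it = rows.find(draws[k]);
        assert(it != rows.end());  // every draw must be a known individual
        out.insert(out.end(), it->second.begin(), it->second.end());
    }
    return out;
}

// Hypothetical helper: draw, with replacement, as many individuals as
// there are unique ids (assumption: C++11 <random> instead of tr1).
std::vector<int> draw_individuals(const std::vector<int>& ind, unsigned seed) {
    const auto rows = group_rows(ind);
    std::vector<int> ids;
    for (const auto& kv : rows) ids.push_back(kv.first);
    assert(!ids.empty());
    std::mt19937 eng(seed);
    std::uniform_int_distribution<std::size_t> unif(0, ids.size() - 1);
    std::vector<int> draws(ids.size());
    for (auto& d : draws) d = ids[unif(eng)];
    return draws;
}
```

With ind = {1, 1, 2, 3, 4} and the draws 1 2 1 1, rows_for_draws returns the row indices 0 1 2 0 1 0 1, which correspond to the seven rows listed above. Working with indices means no per-row vector has to be stored in a container.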

I'm trying to do it via Rcpp, and here is my code so far:

// [[Rcpp::export]]
void resample(NumericMatrix mat) {

    int nrow = mat.nrow();
    IntegerVector d1(nrow);
    for (int i = 0; i < nrow; i++) {
        d1[i] = mat(i, 0);
    }
    std::cout << "Number of elements in mat:  " << d1.length() << std::endl;
    std::multimap<int, NumericVector> m;
    for (int i = 0; i < nrow; i++) {
        NumericVector d = mat.row(i);
        m.insert(std::pair<int, NumericVector>(d1[i], d));
    }

    // Create vector of deduplicated entries: 
    std::set<int> keys_dedup;
    for (int i = 0; i < nrow; ++i) keys_dedup.insert(d1[i]);
    std::cout << "Number of elements in set :  " << keys_dedup.size() << std::endl;
    std::vector<int> vec;
    vec.assign(keys_dedup.begin(), keys_dedup.end());
    std::cout << "Number of elements in vec :  " << vec.size() << std::endl;

    //sampling among the unique keys
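    // Engine is a random-number engine type defined elsewhere (not shown here);
    // presumably a typedef such as std::tr1::mt19937.
    Engine eng;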
    eng.seed((unsigned int) 123);
    std::tr1::uniform_int<int> unif(0, vec.size() - 1);
    std::list<NumericVector> samples;
    for (int i = 0; i < vec.size(); ++i) {
        int u = unif(eng);
        std::cout << u << " : " << vec[u] << std::endl;

        std::pair<std::multimap<int, NumericVector>::iterator,
                std::multimap<int, NumericVector>::iterator> ret =
                m.equal_range(vec[u]);
        for (std::multimap<int, NumericVector>::iterator it = ret.first;
                it != ret.second; ++it) {
            samples.push_back(it->second);
        }
    }
    std::cout << "Number of elements in samples :  " << samples.size() << std::endl;

    //    NumericMatrix matR(samples.size(), mat.ncol());
    //        for (int i = 0; i < samples.size(); ++i) {
    //            matR.row(i) = Rcpp::as(samples[i]);
    //        }
//    return matR; 
}

I have a performance-related question.
m is a 10^5 x 20 matrix.
If I run: system.time(m <- resample(m))

I see: 
Number of elements in mat:  100000
Number of elements in set :  939
Number of elements in vec :  939
Number of elements in samples :  99008   (in the console it takes less than 1 second to get here)
   user  system elapsed 
 38.531   0.004  38.631 


I would like to know, if possible, how to reduce the 38 seconds between the last std::cout (in the C++ code) and the end of the execution in R.
Could this be due to memory management or garbage collection, given that the last cout appears in the R console in less than 1 second?
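
Since nothing but the destruction of the local containers happens after the last cout, one way to check this would be to time construction and destruction separately. The following is a minimal sketch of that measurement pattern only: the function name time_build_and_destroy and the struct PhaseTimes are made up for illustration, and std::vector<double> stands in for NumericVector, so the numbers it produces will not reproduce the Rcpp behaviour.

```cpp
#include <chrono>
#include <cstddef>
#include <list>
#include <vector>

// Hypothetical result type: seconds spent filling the container,
// seconds spent destroying it, and the number of stored rows.
struct PhaseTimes {
    double build_seconds;
    double destroy_seconds;
    std::size_t n;
};

// Fill a list with n_rows vectors of n_cols doubles inside an inner scope,
// so that the destructors run at a known point that can be timed.
PhaseTimes time_build_and_destroy(std::size_t n_rows, std::size_t n_cols) {
    typedef std::chrono::steady_clock Clock;
    PhaseTimes t;
    Clock::time_point t0 = Clock::now();
    Clock::time_point t1;
    {
        std::list<std::vector<double>> samples;
        for (std::size_t i = 0; i < n_rows; ++i) {
            samples.push_back(std::vector<double>(n_cols, 0.0));
        }
        t.n = samples.size();
        t1 = Clock::now();
    }  // samples and all stored vectors are destroyed here
    Clock::time_point t2 = Clock::now();
    t.build_seconds = std::chrono::duration<double>(t1 - t0).count();
    t.destroy_seconds = std::chrono::duration<double>(t2 - t1).count();
    return t;
}
```

Applying the same inner-scope pattern to the multimap and the list in resample would show whether the 38 seconds are spent while those containers go out of scope.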

Please advise
Toki