[Rcpp-devel] Performance/memory management question
Toki Loo
tokiloo1 at yahoo.fr
Thu Mar 28 00:54:07 CET 2013
I have an initial 10^5 x 20 matrix containing observations for a set of individuals (the individual ID is one column of the matrix).
I want to sample the list of unique individuals with replacement and get the corresponding list of observations (possibly with repetitions).
Simplified example: m is a 5 x 2 matrix.
Given m:
Ind Obs
1 3.4
1 3.6
2 5
3 6
4 7
resample(m) may give
1 3.4
1 3.6
2 5
1 3.4
1 3.6
1 3.4
1 3.6
if 1, 2, 1, 1 were sampled from individuals 1, 2, 3, 4.
I'm trying to do it via Rcpp, and here is some code:
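The resampling described above can be sketched in plain standard C++, independent of Rcpp. This is only an illustration under stated assumptions: the function names (group_rows, rows_for_draws, resample_rows) are hypothetical, the individual IDs are passed as a std::vector<int>, and C++11 <random> is used in place of std::tr1. Instead of copying rows, the sketch collects row indices, which can then be used to build the resampled matrix in one pass.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <random>
#include <vector>

// Hypothetical helper: map each individual ID to the indices of its rows.
// ids[i] is the individual ID found in row i of the matrix.
std::map<int, std::vector<std::size_t> > group_rows(const std::vector<int>& ids) {
    std::map<int, std::vector<std::size_t> > groups;
    for (std::size_t i = 0; i < ids.size(); ++i) {
        groups[ids[i]].push_back(i);
    }
    return groups;
}

// Hypothetical helper: for each drawn ID, append all row indices of that ID.
std::vector<std::size_t> rows_for_draws(
        const std::map<int, std::vector<std::size_t> >& groups,
        const std::vector<int>& draws) {
    std::vector<std::size_t> out;
    for (std::size_t k = 0; k < draws.size(); ++k) {
        std::map<int, std::vector<std::size_t> >::const_iterator it =
            groups.find(draws[k]);
        assert(it != groups.end());  // every drawn ID must exist in the data
        out.insert(out.end(), it->second.begin(), it->second.end());
    }
    return out;
}

// Draw as many individuals as there are unique IDs, with replacement,
// and return the row indices of all their observations.
std::vector<std::size_t> resample_rows(const std::vector<int>& ids, unsigned seed) {
    const std::map<int, std::vector<std::size_t> > groups = group_rows(ids);
    std::vector<int> keys;
    for (std::map<int, std::vector<std::size_t> >::const_iterator it = groups.begin();
         it != groups.end(); ++it) {
        keys.push_back(it->first);
    }
    if (keys.empty()) return std::vector<std::size_t>();

    std::mt19937 eng(seed);
    std::uniform_int_distribution<std::size_t> unif(0, keys.size() - 1);
    std::vector<int> draws;
    for (std::size_t k = 0; k < keys.size(); ++k) {
        draws.push_back(keys[unif(eng)]);
    }
    return rows_for_draws(groups, draws);
}
```

With the IDs 1, 1, 2, 3, 4 from the example and the draws 1, 2, 1, 1, rows_for_draws returns the zero-based row indices 0, 1, 2, 0, 1, 0, 1, which correspond to the seven rows listed above.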
// [[Rcpp::export]]
void resample(NumericMatrix mat) {
    int nrow = mat.nrow();
    // Individual IDs are stored in the first column.
    IntegerVector d1(nrow);
    for (int i = 0; i < nrow; i++) {
        d1[i] = mat(i, 0);
    }
    std::cout << "Number of elements in mat: " << d1.length() << std::endl;
    // Map each individual ID to a copy of each of its rows.
    std::multimap<int, NumericVector> m;
    for (int i = 0; i < nrow; i++) {
        NumericVector d = mat.row(i);
        m.insert(std::pair<int, NumericVector>(d1[i], d));
    }
    // Create vector of deduplicated entries:
    std::set<int> keys_dedup;
    for (int i = 0; i < nrow; ++i) keys_dedup.insert(d1[i]);
    std::cout << "Number of elements in set : " << keys_dedup.size() << std::endl;
    std::vector<int> vec;
    vec.assign(keys_dedup.begin(), keys_dedup.end());
    std::cout << "Number of elements in vec : " << vec.size() << std::endl;
    // Sampling among the unique keys.
    // Engine is not defined in this excerpt; presumably it is a typedef
    // for a std::tr1 random engine declared elsewhere.
    Engine eng;
    eng.seed((unsigned int) 123);
    std::tr1::uniform_int<int> unif(0, vec.size() - 1);
    std::list<NumericVector> samples;
    for (int i = 0; i < vec.size(); ++i) {
        int u = unif(eng);
        std::cout << u << " : " << vec[u] << std::endl;
        // Collect every row belonging to the sampled individual.
        std::pair<std::multimap<int, NumericVector>::iterator,
                  std::multimap<int, NumericVector>::iterator> ret =
            m.equal_range(vec[u]);
        for (std::multimap<int, NumericVector>::iterator it = ret.first;
             it != ret.second; ++it) {
            samples.push_back(it->second);
        }
    }
    std::cout << "Number of elements in samples : " << samples.size() << std::endl;
    // NumericMatrix matR(samples.size(), mat.ncol());
    // for (int i = 0; i < samples.size(); ++i) {
    //     matR.row(i) = Rcpp::as(samples[i]);
    // }
    // return matR;
}
I have a performance-related question.
m is a 10^5 x 20 matrix.
If I submit system.time(m <- resample(m)),
I see:
Number of elements in mat: 100000
Number of elements in set : 939
Number of elements in vec : 939
Number of elements in samples : 99008
(In the console, it takes less than 1 second to get to this point.)
user system elapsed
38.531 0.004 38.631
I would like to know, if possible, how to reduce the 38 seconds that elapse between the last std::cout (in the C++ code) and the end of the execution in R.
Could this be due to memory management or garbage collection, given that the output of the last std::cout appears in the R console in less than 1 second?
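One way to check whether the delay comes from tearing down the containers rather than from filling them is to time the two phases separately. The sketch below is only an illustration of that measurement: the names PhaseTimes and time_build_and_destroy are hypothetical, and plain std::vector<double> rows stand in for NumericVector. It does not involve R's memory management, so its timings will not reproduce the Rcpp case; the same pattern (an inner scope with a timestamp just before and just after its closing brace) could be applied inside the real function.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <list>
#include <vector>

// Hypothetical result type: seconds spent building and destroying a container.
struct PhaseTimes {
    double build_seconds;
    double destroy_seconds;
};

// Build a list of n_rows rows (each with n_cols doubles), then let it go out
// of scope, timing construction and destruction separately.
PhaseTimes time_build_and_destroy(std::size_t n_rows, std::size_t n_cols) {
    typedef std::chrono::steady_clock Clock;
    const Clock::time_point t0 = Clock::now();
    Clock::time_point t1;
    {
        std::list<std::vector<double> > samples;
        for (std::size_t i = 0; i < n_rows; ++i) {
            samples.push_back(std::vector<double>(n_cols, 0.0));
        }
        t1 = Clock::now();  // container fully built
    }                       // samples is destroyed here
    const Clock::time_point t2 = Clock::now();
    assert(t0 <= t1 && t1 <= t2);  // steady_clock never goes backwards

    PhaseTimes result;
    result.build_seconds = std::chrono::duration<double>(t1 - t0).count();
    result.destroy_seconds = std::chrono::duration<double>(t2 - t1).count();
    return result;
}
```

If the destroy phase dominates in the real function, the time is being spent when the containers are released at the end of the call rather than in the sampling loop.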
Please advise
Toki