[Rcpp-devel] How to speed up selection of unique unordered pairs
Asis Hallab
asis.hallab at gmail.com
Mon Dec 30 19:01:18 CET 2013
Dear Rcpp experts,
I need to select all unique un-ordered pairs from a two column table.
Here each row represents a pair and
two rows, un-ordered pairs p and q are identical if
either first and second elements of p and q are identical, respectively,
or if first element of p is identical to second element of q and vice versa.
Hence:
( "A", "B" )
( "B", "A" )
( "A", "B" )
are all identical pairs.
Currently I have this generation of a mathematical set of unordered
pairs implemented as follows - basically using a custom comparator for
a standard set of standard vectors:
"
struct Comparator {
bool operator()(const std::vector<std::string> & a, const
std::vector<std::string> & b) {
const bool swapA = a[0] < a[1];
const std::string & al = swapA ? a[0] : a[1];
const std::string & ar = swapA ? a[1] : a[0];
const bool swapB = b[0] < b[1];
const std::string & bl = swapB ? b[0] : b[1];
const std::string & br = swapB ? b[1] : b[0];
return al < bl || (al == bl && ar < br);
}
};
SEXP extractProteinPairs( SEXP proteinAccessionMatrix, SEXP
pairFirstMemberColIndex, SEXP pairSecondMemberColIndex ) {
BEGIN_RCPP
StringMatrix tbl( proteinAccessionMatrix );
int colA( NumericVector( pairFirstMemberColIndex )( 0 ) );
int colB( NumericVector( pairSecondMemberColIndex )( 0 ) );
std::set<std::vector<std::string>, Comparator> prs;
for ( int i=0; i<tbl.nrow(); ++i ) {
CharacterVector r( tbl( i, _ ) );
std::vector< std::string > p;
p.push_back( std::string( r( colA ) ) );
p.push_back( std::string( r( colB ) ) );
prs.insert( p );
}
return( wrap( prs ) );
END_RCPP
}
"
I am wondering if this could not be sped up using the following
algorithm sketched in "pseudo R":
# input:
pairsTable
# a matrix with two columns and n rows, each representing an un-ordered pair
# output:
m <- matrix( ncol=2 )
# a matrix with two columns, subset of input and initialized as given
for_each row p in argument pairsTable {
indices <- find all rows of m where EITHER column is identical with
either p[[1]] or p[[2]]
identityCandidates <- m[ indices, ]
if( ! any( candidate in identiyCandidates identical with p ) ) {
append p to m
}
}
Would the above algorithm be faster than my implementation, because it
does just two "find_all" for each column in the result matrix and a
subsequent identity comparison with the current pair p?
And if the algorithm is faster how would I best implement this in Rcpp?
- The currently used Rcpp implementation given above takes
approximately 29 secs on an input table with 3,018,671 rows, returning
a subset with 2,597,797 unique un-ordered pairs (rows).
- If I managed to explain my problem properly and did not confuse you too much:
Any comments and help will be much appreciated!
A happy new year to everyone!
Cheers!
More information about the Rcpp-devel
mailing list