[Rcpp-devel] How to speed up selection of unique unordered pairs

Mon Dec 30 19:38:55 CET 2013

Asis,

On 30 December 2013 at 19:01, Asis Hallab wrote:
| I need to select all unique un-ordered pairs from a two column table.
| Here each row represents a pair and
| two rows, un-ordered pairs p and q are identical if
| either first and second elements of p and q are identical, respectively,
| or if first element of p is identical to second element of q and vice versa.
| 
| Hence:
| ( "A", "B" )
| ( "B", "A" )
| ( "A", "B" )
| are all identical pairs.
| 
| Currently I have this generation of a mathematical set of unordered
| pairs implemented as follows - basically using a custom comparator for
| a standard set of standard vectors:
| "
| struct Comparator {
|   bool operator()(const std::vector<std::string> & a, const
| std::vector<std::string> & b) {
|     const bool swapA = a[0] < a[1];
|     const std::string & al = swapA ? a[0] : a[1];
|     const std::string & ar = swapA ? a[1] : a[0];
|     const bool swapB = b[0] < b[1];
|     const std::string & bl = swapB ? b[0] : b[1];
|     const std::string & br = swapB ? b[1] : b[0];
|     return al < bl || (al == bl && ar < br);
|   }
| };
| 
| SEXP extractProteinPairs( SEXP proteinAccessionMatrix, SEXP
|     pairFirstMemberColIndex, SEXP pairSecondMemberColIndex ) {
|   BEGIN_RCPP
| 
|     StringMatrix tbl( proteinAccessionMatrix );
|     int colA( NumericVector( pairFirstMemberColIndex )( 0 ) );
|     int colB( NumericVector( pairSecondMemberColIndex )( 0 ) );
|     std::set<std::vector<std::string>, Comparator> prs;
| 
|     for ( int i=0; i<tbl.nrow(); ++i ) {
|       CharacterVector r( tbl( i, _ ) );
|       std::vector< std::string > p;
|       p.push_back( std::string( r( colA ) ) );
|       p.push_back( std::string( r( colB ) ) );
|       prs.insert( p );
|     }
| 
|     return( wrap( prs ) );
| 
|   END_RCPP
| }
| "
| 
| I am wondering if this could not be sped up using the following
| algorithm sketched in "pseudo R":
| 
| # input:
| pairsTable
| # a matrix with two columns and n rows, each representing an un-ordered pair
| 
| # output:
| m <- matrix( ncol=2 )
| # a matrix with two columns, subset of input and initialized as given
| 
| for_each row p in argument pairsTable {
|    indices <- find all rows of m where EITHER column is identical with
| either p[[1]] or p[[2]]
|    identityCandidates <- m[ indices, ]
|    if( ! any( candidate in identiyCandidates identical with p ) ) {
|       append p to m
|    }
| }
| 
| Would the above algorithm be faster than my implementation, because it
| does just two "find_all" for each column in the result matrix and a
| subsequent identity comparison with the current pair p?

Not sure. 

| And if the algorithm is faster how would I best implement this in Rcpp?

Not sure either.

You have something that works.  Maybe just profile it to see if there are any
bottlenecks.  

| - The currently used Rcpp implementation given above takes
| approximately 29 secs on an input table with 3,018,671 rows, returning
| a subset with 2,597,797 unique un-ordered pairs (rows).
| 
| - If I managed to explain my problem properly and did not confuse you too much:
| Any comments and help will be much appreciated!

With that many objects it starts to matter how you store and access the
data. Maybe std::vector<std::string> is not the best -- I don't know.  

But as you have something reasonably fast now -- is it worth spending days on
getting 29 secs down to 19 secs ?

Dirk

| 
| A happy new year to everyone!
| Cheers!
| _______________________________________________
| Rcpp-devel mailing list
| Rcpp-devel at lists.r-forge.r-project.org
| https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/rcpp-devel

-- 
Dirk Eddelbuettel | edd at debian.org | http://dirk.eddelbuettel.com