[Rcpp-devel] How to speed up selection of unique unordered pairs

Mon Dec 30 20:07:59 CET 2013

On Dec 30, 2013, at 1:38 PM, Dirk Eddelbuettel <edd at debian.org> wrote:

> 
> Asis,
> 
> On 30 December 2013 at 19:01, Asis Hallab wrote:
> | I need to select all unique un-ordered pairs from a two column table.
> | Here each row represents a pair and
> | two rows, un-ordered pairs p and q are identical if
> | either first and second elements of p and q are identical, respectively,
> | or if first element of p is identical to second element of q and vice versa.
> | 
> | Hence:
> | ( "A", "B" )
> | ( "B", "A" )
> | ( "A", "B" )
> | are all identical pairs.
> | 
> | Currently I have this generation of a mathematical set of unordered
> | pairs implemented as follows - basically using a custom comparator for
> | a standard set of standard vectors:
> | "
> | struct Comparator {
> |   bool operator()(const std::vector<std::string> & a, const
> | std::vector<std::string> & b) {
> |     const bool swapA = a[0] < a[1];
> |     const std::string & al = swapA ? a[0] : a[1];
> |     const std::string & ar = swapA ? a[1] : a[0];
> |     const bool swapB = b[0] < b[1];
> |     const std::string & bl = swapB ? b[0] : b[1];
> |     const std::string & br = swapB ? b[1] : b[0];
> |     return al < bl || (al == bl && ar < br);
> |   }
> | };
> | 
> | SEXP extractProteinPairs( SEXP proteinAccessionMatrix, SEXP
> |     pairFirstMemberColIndex, SEXP pairSecondMemberColIndex ) {
> |   BEGIN_RCPP
> | 
> |     StringMatrix tbl( proteinAccessionMatrix );
> |     int colA( NumericVector( pairFirstMemberColIndex )( 0 ) );
> |     int colB( NumericVector( pairSecondMemberColIndex )( 0 ) );
> |     std::set<std::vector<std::string>, Comparator> prs;
> | 
> |     for ( int i=0; i<tbl.nrow(); ++i ) {
> |       CharacterVector r( tbl( i, _ ) );
> |       std::vector< std::string > p;
> |       p.push_back( std::string( r( colA ) ) );
> |       p.push_back( std::string( r( colB ) ) );
> |       prs.insert( p );
> |     }
> | 
> |     return( wrap( prs ) );
> | 
> |   END_RCPP
> | }
> | "
> | 
> | I am wondering if this could not be sped up using the following
> | algorithm sketched in "pseudo R":
> | 
> | # input:
> | pairsTable
> | # a matrix with two columns and n rows, each representing an un-ordered pair
> | 
> | # output:
> | m <- matrix( ncol=2 )
> | # a matrix with two columns, subset of input and initialized as given
> | 
> | for_each row p in argument pairsTable {
> |    indices <- find all rows of m where EITHER column is identical with
> | either p[[1]] or p[[2]]
> |    identityCandidates <- m[ indices, ]
> |    if( ! any( candidate in identiyCandidates identical with p ) ) {
> |       append p to m
> |    }
> | }
> | 
> | Would the above algorithm be faster than my implementation, because it
> | does just two "find_all" for each column in the result matrix and a
> | subsequent identity comparison with the current pair p?
> 
> Not sure. 

Since m is unordered, "find all" would need to do O(n) comparisons, whereas the std::set implementation should use a balancing binary search tree that needs only O(log(n)) comparisons, so my intuition is that this wouldn't be faster than your previous implementation.

> | And if the algorithm is faster how would I best implement this in Rcpp?
> 
> Not sure either.
> 
> You have something that works.  Maybe just profile it to see if there are any
> bottlenecks.  
> 
> | - The currently used Rcpp implementation given above takes
> | approximately 29 secs on an input table with 3,018,671 rows, returning
> | a subset with 2,597,797 unique un-ordered pairs (rows).
> | 
> | - If I managed to explain my problem properly and did not confuse you too much:
> | Any comments and help will be much appreciated!
> 
> With that many objects it starts to matter how you store and access the
> data. Maybe std::vector<std::string> is not the best -- I don't know.  

You know that the objects are pairs, so you should probably use std::pair instead of std::vector.
And if you do an initial pass to identify all unique strings, you should be able to replace strings with an integer index, which would be faster for comparison. (You could also reorder the pairs so that first element is always <= second element, then sort the pairs, and remove duplicates by making another pass to check for identical consecutive rows.)

> But as you have something reasonably fast now -- is it worth spending days on
> getting 29 secs down to 19 secs ?
> 
> Dirk
> 
> | 
> | A happy new year to everyone!
> | Cheers!
> | _______________________________________________
> | Rcpp-devel mailing list
> | Rcpp-devel at lists.r-forge.r-project.org
> | https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/rcpp-devel
> 
> -- 
> Dirk Eddelbuettel | edd at debian.org | http://dirk.eddelbuettel.com
> _______________________________________________
> Rcpp-devel mailing list
> Rcpp-devel at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/rcpp-devel