[Rcpp-devel] How to speed up selection of unique unordered pairs

Romain Francois romain at r-enthusiasts.com
Mon Dec 30 21:04:56 CET 2013


Hello, 

This is a perfect example for using hash sets (std::unordered_set in C++11) with custom implementations of hashing and comparison operators. 

See this gist: https://gist.github.com/romainfrancois/8187193

I have only tested it with Rcpp11 (and therefore using C++11), but I get down to 2 or 3 seconds. 

> microbenchmark( unordered_pairs( m, 1, 2 ), times=10 )
Unit: seconds
                     expr      min       lq   median       uq      max neval
 unordered_pairs(m, 1, 2) 2.534141 2.536889 2.556607 2.714883 2.804312    10

I have used an input matrix of the same size, but probably not the same as yours. I would still expect this code to perform better to your 29 seconds. 

Romain

Le 30 déc. 2013 à 19:01, Asis Hallab <asis.hallab at gmail.com> a écrit :

> Dear Rcpp experts,
> 
> I need to select all unique un-ordered pairs from a two column table.
> Here each row represents a pair and
> two rows, un-ordered pairs p and q are identical if
> either first and second elements of p and q are identical, respectively,
> or if first element of p is identical to second element of q and vice versa.
> 
> Hence:
> ( "A", "B" )
> ( "B", "A" )
> ( "A", "B" )
> are all identical pairs.
> 
> Currently I have this generation of a mathematical set of unordered
> pairs implemented as follows - basically using a custom comparator for
> a standard set of standard vectors:
> "
> struct Comparator {
>  bool operator()(const std::vector<std::string> & a, const
> std::vector<std::string> & b) {
>    const bool swapA = a[0] < a[1];
>    const std::string & al = swapA ? a[0] : a[1];
>    const std::string & ar = swapA ? a[1] : a[0];
>    const bool swapB = b[0] < b[1];
>    const std::string & bl = swapB ? b[0] : b[1];
>    const std::string & br = swapB ? b[1] : b[0];
>    return al < bl || (al == bl && ar < br);
>  }
> };
> 
> SEXP extractProteinPairs( SEXP proteinAccessionMatrix, SEXP
>    pairFirstMemberColIndex, SEXP pairSecondMemberColIndex ) {
>  BEGIN_RCPP
> 
>    StringMatrix tbl( proteinAccessionMatrix );
>    int colA( NumericVector( pairFirstMemberColIndex )( 0 ) );
>    int colB( NumericVector( pairSecondMemberColIndex )( 0 ) );
>    std::set<std::vector<std::string>, Comparator> prs;
> 
>    for ( int i=0; i<tbl.nrow(); ++i ) {
>      CharacterVector r( tbl( i, _ ) );
>      std::vector< std::string > p;
>      p.push_back( std::string( r( colA ) ) );
>      p.push_back( std::string( r( colB ) ) );
>      prs.insert( p );
>    }
> 
>    return( wrap( prs ) );
> 
>  END_RCPP
> }
> "
> 
> I am wondering if this could not be sped up using the following
> algorithm sketched in "pseudo R":
> 
> # input:
> pairsTable
> # a matrix with two columns and n rows, each representing an un-ordered pair
> 
> # output:
> m <- matrix( ncol=2 )
> # a matrix with two columns, subset of input and initialized as given
> 
> for_each row p in argument pairsTable {
>   indices <- find all rows of m where EITHER column is identical with
> either p[[1]] or p[[2]]
>   identityCandidates <- m[ indices, ]
>   if( ! any( candidate in identiyCandidates identical with p ) ) {
>      append p to m
>   }
> }
> 
> Would the above algorithm be faster than my implementation, because it
> does just two "find_all" for each column in the result matrix and a
> subsequent identity comparison with the current pair p?
> And if the algorithm is faster how would I best implement this in Rcpp?
> - The currently used Rcpp implementation given above takes
> approximately 29 secs on an input table with 3,018,671 rows, returning
> a subset with 2,597,797 unique un-ordered pairs (rows).
> 
> - If I managed to explain my problem properly and did not confuse you too much:
> Any comments and help will be much appreciated!
> 
> A happy new year to everyone!
> Cheers!
> _______________________________________________
> Rcpp-devel mailing list
> Rcpp-devel at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/rcpp-devel



More information about the Rcpp-devel mailing list