[Rcpp-devel] Rcpp "version" of R's match function

Romain Francois romain at r-enthusiasts.com
Thu Nov 15 14:32:50 CET 2012


Commited in revision 3973. I added a few abstraction that let me 
eventually write the code in terms of stl algorithms.

See the implementation in match.h or the curious.

With that:

require(Rcpp)

sourceCpp( code = '
#include <Rcpp.h>
using namespace Rcpp ;

// [[Rcpp::export]]
IntegerVector match__( CharacterVector x, CharacterVector table){
     return match( x, table ) ;
}

' )

I get even better performance than yesterday:

xx <- sample( letters, 10000000, replace = TRUE )
writeLines( "R" )
system.time( match( xx, letters ) )

writeLines( "R (second time)" )
system.time( match( xx, letters ) )

writeLines( "match_ manual Rcpp" )
system.time( match_( xx, letters ) )

writeLines( "match__ internal Rcpp" )
system.time( match__( xx, letters ) )

which gives:


R
utilisateur     système      écoulé
       0.338       0.024       0.362
R (second time)
utilisateur     système      écoulé
       0.262       0.000       0.262
match_ manual Rcpp
utilisateur     système      écoulé
       0.144       0.000       0.144
match__ internal Rcpp
utilisateur     système      écoulé
       0.134       0.000       0.133

Worth noting that the R version gets better the second time. I guess R 
keeps its hash table somewhere.

Romain

Le 15/11/12 08:36, Søren Højsgaard a écrit :
> Thanks a lot; that seems elegant. Could one imagine that this goes into the next version of Rcpp? That would be nice.
>
> Regards
> Søren
>
> -----Original Message-----
> From: Romain Francois [mailto:romain at r-enthusiasts.com]
> Sent: 15. november 2012 00:50
> To: Søren Højsgaard
> Cc: rcpp-devel at lists.r-forge.r-project.org
> Subject: Re: [Rcpp-devel] Rcpp "version" of R's match function
>
> Ah. Things are particularly interesting if you want to deal with strings.
>
> The code below implements a c++ version of match for character vectors.
> beware it uses trickery related to tricking the write barrier (but not an issue as we are not "writing").
>
> require(Rcpp)
>
> sourceCpp( code = '
> #include <Rcpp.h>
> using namespace Rcpp ;
>
> // [[Rcpp::export]]
> IntegerVector match_( const CharacterVector& x, const CharacterVector& table){
>       typedef std::tr1::unordered_map<SEXP,int> MAP ;
>       typedef MAP::value_type VALUE ;
>
>       // populate the hash
>       MAP hash ;
>       int n = table.size() ;
>       SEXP* ptr = get_string_ptr(table) ;
>       for( int i=0 ; i<n; i++){
>           hash.insert( std::make_pair<SEXP,int>(ptr[i], i + 1) ) ;
>       }
>
>       n = x.size() ;
>       IntegerVector result(n) ;
>       ptr = get_string_ptr(x) ;
>       MAP::const_iterator end=hash.end() ;
>       for( int i=0; i<n; i++){
>           MAP::const_iterator it = hash.find(ptr[i]) ;
>
>           if( it == end ){
>               // no match
>               result[i] = NA_INTEGER ;
>           } else {
>               result[i] = it->second ;
>           }
>       }
>       return result ;
> }
> ' )
>
> The key of the trick is to use the pointer of the CHARSXP SEXP as the key of the hash map. (strings are cached in R, see the CHARSXP cache in R internals:
> http://cran.r-project.org/doc/manuals/R-ints.html#The-CHARSXP-cache )
>
>
> Then:
>
>   > match_( c( "b", "k") , letters )
> [1]  2 11
>
>
> But also (fasten your seatbelt):
>
>   > xx <- sample( letters, 1000000, replace = TRUE )  > system.time( match( xx, letters ) )
> utilisateur     système      écoulé
>         0.063       0.002       0.065
>   > system.time( match_( xx, letters ) )
> utilisateur     système      écoulé
>         0.015       0.000       0.015
>
>
> Romain
>
> Le 15/11/12 00:08, Søren Højsgaard a écrit :
>> Thanks! What I need to do is this
>>
>>> match(c("b","k"), letters)
>> [1]  2 11
>>
>> Regards
>> Søren
>>
>>
>> -----Original Message-----
>> From: rcpp-devel-bounces at lists.r-forge.r-project.org
>> [mailto:rcpp-devel-bounces at lists.r-forge.r-project.org] On Behalf Of
>> Romain Francois
>> Sent: 15. november 2012 00:01
>> To: rcpp-devel at lists.r-forge.r-project.org
>> Subject: Re: [Rcpp-devel] Rcpp "version" of R's match function
>>
>> Le 14/11/12 23:48, Søren Højsgaard a écrit :
>>> Dear all,
>>>
>>> I need to call R's match function from a c++ program so I can do
>>>     Rcpp::Function R_match("match");
>>>     Rcpp::NumericVector out = R_match(x_, table_);
>>>
>>> However, I wonder if there is such a "match" function in C++ that I can call (to avoid the extra overhead)?
>>>
>>> Any ideas would be welcome
>>>
>>> Regards
>>> Søren
>>
>> We don't have one in Rcpp. It would not be too hard I guess to use some algorithms from the STL, e.g. std::find.
>> http://www.cplusplus.com/reference/algorithm/find/
>>
>> Not easy to say more withou seeing exactly what you are trying to do.
>> For example, do you need to store "out" or is it enough to work one value at a time ...
>>
>>
>> Also, R's match returns an integer vector, so you probably want:
>>
>> Rcpp::IntegerVector out = R_match(x_, table_);
>>
>> otherwise you also pay for making the coercion to numeric on top of
>> the price you pay for calling back into R.
>>
>>
>
>
> --
> Romain Francois
> Professional R Enthusiast
> +33(0) 6 28 91 30 30
>
> R Graph Gallery: http://gallery.r-enthusiasts.com
> `- http://bit.ly/SweN1Z : SuperStorm Sandy
>
> blog:            http://romainfrancois.blog.free.fr
> |- http://bit.ly/RE6sYH : OOP with Rcpp modules
> `- http://bit.ly/Thw7IK : Rcpp modules more flexible
>
>


-- 
Romain Francois
Professional R Enthusiast
+33(0) 6 28 91 30 30

R Graph Gallery: http://gallery.r-enthusiasts.com
`- http://bit.ly/SweN1Z : SuperStorm Sandy

blog:            http://romainfrancois.blog.free.fr
|- http://bit.ly/RE6sYH : OOP with Rcpp modules
`- http://bit.ly/Thw7IK : Rcpp modules more flexible



More information about the Rcpp-devel mailing list