[Rcpp-devel] Testing for existence of named components

Dirk Eddelbuettel edd at debian.org
Wed Mar 28 17:07:56 CEST 2012


On 28 March 2012 at 16:36, Jonas Rauch wrote:
| If I remember correctly, environments use hash tables to lookup names.
| 
| In pure R to check if an element L[["key"]] exists I would usually use
| > "key" %in% names(L)
| which seems to be reasonably fast. `%in%` works in vectors, too
| > keys %in% L
| returning exactly what Dirks suggestion was supposed to do (you missed an "if",
| or alternatively an "or")
| 
| `%in%` uses .Internal(match(...)), which computes a temporary hash table as far
| as I can see in the source (src/main/unique.c). I am not sure about the
| implementation of the Hash. In particular I don't know if some kind of caching
| is done, because the lookup is much faster than Dirks function (see below) and
| computing a hash table for one lookup seems inefficient. I think it would be
| good to use the internal `match` function of R to do what you want. In
| Rinternals.h match is exposed as
|   SEXP Rf_match(SEXP itable, SEXP ix, int nmatch)
| where nmatch is the value to be returned if there is no match.

Excellent suggestion!

I think we can even return just the index vector, with NA or another filler
for non-matches:

R> g <- cxxfunction(signature(ls="list", ts="character"), plugin="Rcpp", body='
+   Rcpp::List lst(ls);
+   Rcpp::CharacterVector nam = lst.names();
+ 
+   Rcpp::IntegerVector ind(Rf_match(nam,ts,R_NaInt));
+   return ind; 
+ ')
R> g(ll, c("aa", "b", "c"))
[1] NA  2 NA
R> 

Dirk
 
| Below are the results of my little benchmark (see below). First parameter of
| `lookup` is the number elements in L, second is the length of keys, third one
| specifies the probability that a key is in present in the list.
| 
| Regards,
| Jonas
| 
| ##Benchmark results
| > lookup(10000, 1000, 0.1)
|                 test elapsed relative
| 2         f(L, keys)   0.641   160.25
| 3         g(L, keys)   0.005     1.25
| 1 keys %in% names(L)   0.004     1.00
| > lookup(10000, 1000, 1.0)
|                 test elapsed relative
| 2         f(L, keys)   0.643   160.75
| 3         g(L, keys)   0.004     1.00
| 1 keys %in% names(L)   0.004     1.00
| > lookup(100000, 100, 0.1)
|                 test elapsed relative
| 2         f(L, keys)   0.656      164
| 3         g(L, keys)   0.004        1
| 1 keys %in% names(L)   0.004        1
| > lookup(100000, 100, 1.0)
|                 test elapsed relative
| 2         f(L, keys)   0.654    163.5
| 3         g(L, keys)   0.004      1.0
| 1 keys %in% names(L)   0.004      1.0
| > lookup(100000, 1, 0.1)
|                 test elapsed relative
| 2         f(L, keys)   0.647   161.75
| 3         g(L, keys)   0.004     1.00
| 1 keys %in% names(L)   0.005     1.25
| > lookup(100000, 1, 1.0)
|                 test elapsed relative
| 2         f(L, keys)   0.647   161.75
| 3         g(L, keys)   0.004     1.00
| 1 keys %in% names(L)   0.005     1.25
| 
| 
| ## Code for benchmarks
| library(inline)
| library(rbenchmark)
| f <- cxxfunction(signature(ls="list", ts="character"), plugin="Rcpp", body='
|     Rcpp::List lst(ls);
|     Rcpp::CharacterVector nam = lst.names();
|     std::vector<std::string> nm = Rcpp::as< std::vector< std::string > >
| (nam);
|     int m = nam.size();
| 
|     Rcpp::CharacterVector tok(ts);
|     std::vector<std::string> tk = Rcpp::as< std::vector< std::string > >
| (tok);
|     int n = tok.size();
| 
|     Rcpp::LogicalVector   log(n);
| 
|     for (int i=0; i<n; i++) {  // look at all tokens
|        log[i] = false;                   // assume
| false, but compare to all names
|        for (int j=0; j<m; j++) {
|                     if(tk[i] == nm[j])
|               log[i] = 1;     // and note equality if we find
| it
|        }
|     }
|     return log;
|  ')
| 
| g <- cxxfunction(signature(ls="list", ts="character"), plugin="Rcpp", body='
|     Rcpp::List lst(ls);
|     Rcpp::CharacterVector nam = lst.names();
| 
|         Rcpp::IntegerVector ind(Rf_match(nam,ts,0));
|     Rcpp::LogicalVector   log(ind);
|         return log;
|  ')
| 
| 
| lookup<-function(nList, nKeys, P) {
|     nList<-10000
|     nKeys<-1000
|     P<-0.1
|     keys<-as.character(floor(runif(nKeys, 1, nList/P+1)))
|     L<-as.list(1:nList)
|     names(L)<-as.character(1:nList)
|     benchmark(keys %in% names(L), f(L, keys), g(L, keys), replications=10,
| columns=c("test", "elapsed", "relative"))
| }
| 
| lookup(10000, 1000, 0.1)
| lookup(10000, 1000, 1.0)
| lookup(100000, 100, 0.1)
| lookup(100000, 100, 1.0)
| lookup(100000, 1, 0.1)
| lookup(100000, 1, 1.0)
| 
| 
| On Wed, Mar 28, 2012 at 3:34 PM, Dirk Eddelbuettel <edd at debian.org> wrote:
| 
| 
|     On 28 March 2012 at 15:17, Ulrich Bodenhofer wrote:
|     | Thanks for your swift reply, Dirk! Wow, to be frank, that is not what I
|     | was expecting. In the meantime, I read Section 5.9.6 of "Writing R
|     | extensions" and I was stunned to see a solution there that is similar to
|     | the one you propose. I do not know R internals very well, but I cannot
|     | believe that accesses by names are implemented by tediously searching
|     | through all names sequentially. There must be some hash table behind,
|     | right? Otherwise, list accesses would be terribly slow. If all my
| 
|     It's a trade-off: Inserting into a hash table has cost, you would enforce
|     that on every (!!) vector use, whether names are looked up or not.
| 
|     Also, vectors are a natural representation as R's SEXP are just linear
|     collections.
| 
|     | assumptions are right, I wonder why the R API does not make this
|     | mechanism available.
|     |
|     | May I add one more question: where can I find the definition of the List
|     | class and the implementation of its methods in the Rcpp package source
|     | code? I was looking for this, but got lost. Are you using the method
| 
|     The only answer I can give is to ... keep looking.  It's messy, because
|     there
|     is so much code---but that is because Rcpp does so much stuff.
| 
|     But some header files are more important than others -- and the Vector.h
|     file
|     as well as the Vector/ subdirectory are important.
| 
|     As mentioned, Rcpp::List() is a vector too.
| 
|     | described in Section 5.9.6 of "Writing R extensions" (which I doubt) or
|     | something else. I would actually be quite curious to learn more about
|     | how lists are implemented in Rcpp.
| 
|     They are just vectors, hence the linear traversal.
|    
|     Dirk
|    
|     | Thanks and best regards,
|     | Ulrich
|     |
|     |
|     | If so, I wonder why this mechanism has not been
|     |
|     | On 03/28/2012 02:56 PM, Dirk Eddelbuettel wrote:
|     | > On 28 March 2012 at 13:56, Ulrich Bodenhofer wrote:
|     | > | My question is the following: is there any way of checking in whether
|     a
|     | > | component of an Rcpp list (or vector) with a given name exists in
|     this list. If
|     | > | I simply try accessing a non-existing component, I get an "index out
|     of bounds"
|     | > | error. Trying to catch a possible exception did not work either. I
|     also browsed
|     | > | the Rcpp package source code, but unfortunately I got lost. Sorry if
|     this has
|     | > | been addressed on this list before. At least I googled in many
|     different ways,
|     | > | but could not find anything. Any ideas? Any help is gratefully
|     appreciated!
|     | >
|     | > Good question, and I have the suspicion that we answered that years ago
|     on
|     | > the list before.
|     | >
|     | > Here is a super-pedestrian version. It takes a list, extracts its names
|     () --
|     | > and should probably test whether names is empty ? -- and then compares
|     these
|     | > against a vector of tokens, returning a bool for each token:
|     | >
|     | > R>
|     | > R>  suppressMessages(library(inline))
|     | > R>
|     | > R>  f<- cxxfunction(signature(ls="list", ts="character"), plugin=
|     "Rcpp", body='
|     | > +    Rcpp::List lst(ls);
|     | > +    Rcpp::CharacterVector nam = lst.names();
|     | > +    std::vector<std::string>  nm = Rcpp::as<  std::vector<
|      std::string>  >(nam);
|     | > +    int m = nam.size();
|     | > +
|     | > +    Rcpp::CharacterVector tok(ts);
|     | > +    std::vector<std::string>  tk = Rcpp::as<  std::vector<
|      std::string>  >(tok);
|     | > +    int n = tok.size();
|     | > +
|     | > +    Rcpp::LogicalVector   log(n);
|     | > +
|     | > +    for (int i=0; i<n; i++) {  // look at all tokens
|     | > +       log[i] = false;                   // assume false,
|     but compare to all names
|     | > +       for (int j=0; j<m; j++) {
|     | > +          log[i] = (tk[i] == nm[j]);     // and note equality
|     if we find it
|     | > +       }
|     | > +    }
|     | > +    return log;
|     | > + ')
|     | > R>  ll<- list(aa=1,b="b")       ## list with names 'aa' and 'b'
|     | > R>  f(ll, c("aa", "b", "c"))         ## does it contain 'a' or 'b'
|     or 'c' ?
|     | > [1] FALSE  TRUE FALSE
|     | > R>  f(ll, "c")                       ## works with scalars
|     too
|     | > [1] FALSE
|     | > R>
|     | >
|     | > I am sure there is a much cleverer solution one could write...
|     | >
|     | > Dirk
|     | >
|     |
| 
|     --
|     R/Finance 2012 Conference on May 11 and 12, 2012 at UIC in Chicago, IL
|     See agenda, registration details and more at http://www.RinFinance.com
|     _______________________________________________
|     Rcpp-devel mailing list
|     Rcpp-devel at lists.r-forge.r-project.org
|     https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/rcpp-devel
| 
| 
| 
| ----------------------------------------------------------------------
| _______________________________________________
| Rcpp-devel mailing list
| Rcpp-devel at lists.r-forge.r-project.org
| https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/rcpp-devel
-- 
R/Finance 2012 Conference on May 11 and 12, 2012 at UIC in Chicago, IL
See agenda, registration details and more at http://www.RinFinance.com


More information about the Rcpp-devel mailing list