[Rcpp-devel] How to properly check if a String has only ASCII characters in Rcpp?

Thu Feb 18 20:17:49 CET 2021

On Thu, Feb 18, 2021 at 1:36 AM Andrade Solomon
<andradesolomon2011 at gmail.com> wrote:
>
> Dear list,
>
> I see here that Rcpp strings have both get_encoding() and set_encoding() member functions, which respectively return and accept a cetype_t enum defined in Rinternals.h with options:
>     CE_NATIVE = 0,
>     CE_UTF8   = 1,
>     CE_LATIN1 = 2,
>     CE_BYTES  = 3,
>     CE_SYMBOL = 5,
>     CE_ANY = 99
>
> This means that if the String is marked as UTF-8, Latin-1 or bytes, the String's get_encoding() member function will return 1, 2 or 3, respectively. Experimentally, I see that when I try it with string objects containing only ASCII characters (codes 0 to 127) and no manually set encoding, get_encoding() returns 0, which means CE_NATIVE in the aforementioned enum. At first, I just assumed that, in Rcpp's eyes, ASCII would be considered CE_NATIVE. This would even agree with what is described in the help entry for R's Encoding() function, i.e. that "character strings in R can be declared to be encoded in 'latin1' or 'UTF-8' or as 'bytes' (...) ASCII strings will never be marked with a declared encoding, since their representation is the same in all supported encodings". However, I later realized that was not the case: if one creates an object that stores a UTF-8 or Latin-1 string (e.g. x <- "á") and then manually drops the encoding (i.e. Encoding(x) <- ""), passing that object (i.e. x) to Rcpp still makes get_encoding() return 0 (which suggests that CE_NATIVE corresponds to the "unknown" label returned by Encoding()).
>
> Note that, in R's official documentation, nothing is said about CE_NATIVE and, conversely, it is explicitly said that "Value CE_ANY is used to indicate a character string that will not need re-encoding – this is used for character strings known to be in ASCII, and can also be used as an input parameter where the intention is that the string is treated as a series of bytes". With this last bit of information in mind, I would have expected that strings containing only ASCII characters (codes 0 to 127) and no manually set encoding, when passed to Rcpp code, would have get_encoding() return 99 instead of 0 - hence making it easy to check within Rcpp whether a string is ASCII only. That not being the case, my question becomes twofold:
>
> 1) why does Rcpp's get_encoding() apparently return 0 instead of 99 for ASCII only text?

I think this is because this is what R does; e.g.

    Encoding("ascii")  =>  "unknown"

As far as I can tell, CE_ANY is used only sparsely by R itself
internally, and isn't really surfaced as a "public" encoding to be
used.

> 2) is there an established way to properly check within Rcpp whether a Rcpp String is ASCII only (besides obviously looping over each character to check if it's <128) just like it is done in R's C API with the IS_ASCII macro?

I think looping over the characters is the most reasonable way forward.
If you need something more complicated or specific, I would honestly
just recommend rolling your own class with the behaviors you need.
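
For what it's worth, here is a minimal sketch of such a loop, written as
plain C++ so that it doesn't depend on anything in Rcpp itself. The names
is_ascii and string_is_ascii are made up for illustration, and the
commented-out wrapper assumes Rcpp::String's get_cstring() accessor; treat
it as a starting point rather than a vetted implementation.

```cpp
#include <cstddef>
#include <string>

// Returns true when every byte in [s, s + n) is 7-bit ASCII (0 to 127).
inline bool is_ascii(const char* s, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        // Cast to unsigned char first: plain char may be signed, in which
        // case bytes >= 0x80 would compare as negative values.
        if (static_cast<unsigned char>(s[i]) > 127) return false;
    }
    return true;
}

// Convenience overload for std::string.
inline bool is_ascii(const std::string& s) {
    return is_ascii(s.data(), s.size());
}

// Hypothetical Rcpp wrapper (assumes Rcpp::String::get_cstring()):
//
//   // [[Rcpp::export]]
//   bool string_is_ascii(Rcpp::String x) {
//       return is_ascii(std::string(x.get_cstring()));
//   }
```

Because the check looks only at the raw bytes, it gives the same answer
regardless of what get_encoding() reports for the string.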

If you think there's a way to make this happen with Rcpp's own String
class, then a pull request would be welcomed.

> Thanks,
>
> Andrade Solomon