[Rcpp-devel] How to properly check if a String has only ASCII characters in Rcpp?
andradesolomon2011 at gmail.com
Thu Feb 18 10:35:39 CET 2021
I see here
that Rcpp strings have both get_encoding() and set_encoding() member
functions, which respectively return and accept a cetype_t enum defined in
CE_NATIVE = 0,
CE_UTF8 = 1,
CE_LATIN1 = 2,
CE_BYTES = 3,
CE_SYMBOL = 5,
CE_ANY = 99
This means that if the String is UTF-8, Latin-1 or bytes, the String's
get_encoding() member function will return 1, 2 or 3, respectively.
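
To make the mapping concrete, here is a minimal standalone sketch (plain C++, no Rcpp needed). The enum encoding_code and the helpers encoding_label() and has_declared_encoding() are hypothetical names used only for illustration; only the numeric values are taken from the list above.

```cpp
#include <string>

// Hypothetical stand-in for the cetype_t values listed above. In real Rcpp
// code the enum comes from R's headers and should not be redefined.
enum encoding_code {
    ENC_NATIVE = 0,
    ENC_UTF8   = 1,
    ENC_LATIN1 = 2,
    ENC_BYTES  = 3,
    ENC_SYMBOL = 5,
    ENC_ANY    = 99
};

// Map an integer code, such as the one returned by get_encoding(),
// to the name it has in the list above.
inline std::string encoding_label(int code) {
    switch (code) {
        case ENC_NATIVE: return "CE_NATIVE";
        case ENC_UTF8:   return "CE_UTF8";
        case ENC_LATIN1: return "CE_LATIN1";
        case ENC_BYTES:  return "CE_BYTES";
        case ENC_SYMBOL: return "CE_SYMBOL";
        case ENC_ANY:    return "CE_ANY";
        default:         return "unrecognized";
    }
}

// True only for the three codes discussed above (UTF-8, Latin-1, bytes).
constexpr bool has_declared_encoding(int code) {
    return code == ENC_UTF8 || code == ENC_LATIN1 || code == ENC_BYTES;
}
```
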
Experimentally, I see that when I try it with string objects containing
only 0 to 127 ASCII characters (with no manually set encoding), the
get_encoding() member function returns 0, which means CE_NATIVE in the
aforementioned enum. At first, I just assumed that, then, in Rcpp's eyes,
ASCII would be considered CE_NATIVE. This could even make sense with what
is described in R's Encoding() command help entry
i.e. that "character strings in R can be declared to be encoded in 'latin1'
or 'UTF-8' or as 'bytes' (...) ASCII strings will never be marked with a
declared encoding, since their representation is the same in all supported
encodings". However, I later realized that was not the case: if one creates
an object that stores a UTF-8 or Latin-1 string (e.g. x <- "á"), then
manually drops the encoding (i.e. Encoding(x) <- "") and passes that object
(i.e. x) to Rcpp, its get_encoding() still returns 0 (which
suggests that CE_NATIVE corresponds to the "unknown" label returned by the
Note that, in R's official documentation
nothing is said about CE_NATIVE and, conversely, it is explicitly said that
"Value CE_ANY is used to indicate a character string that will not need
re-encoding – this is used for character strings known to be in ASCII, and
can also be used as an input parameter where the intention is that the
string is treated as a series of bytes". With this last bit of information
in mind, I would then have expected that strings containing only 0-127
ASCII characters and no manually set encoding, when passed to Rcpp code,
would have their get_encoding() member function return 99 instead of 0,
hence making it easy to check within Rcpp whether a string is ASCII
only. That not being the case, my question becomes twofold:
1) why does Rcpp's get_encoding() apparently return 0 instead of 99 for
ASCII only text?
2) is there an established way to properly check within Rcpp whether an Rcpp
String is ASCII only (besides obviously looping over each character to
check if it's <128) just like it is done in R's C API with the IS_ASCII
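
For reference, the explicit loop mentioned in 2) can be sketched in plain C++ as follows. This is only the fallback approach, not an established Rcpp facility; the helper name is_ascii_only() is hypothetical, and the comment about get_cstring() assumes that Rcpp::String exposes its underlying bytes that way.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: returns true when every byte of the NUL-terminated
// string is in the 0-127 range, i.e. the string is ASCII only.
inline bool is_ascii_only(const char* s) {
    assert(s != nullptr);  // precondition: a valid C string
    for (std::size_t i = 0; s[i] != '\0'; ++i) {
        // Cast to unsigned char so that bytes >= 128 are not seen as negative.
        if (static_cast<unsigned char>(s[i]) > 127) {
            return false;
        }
    }
    return true;
}

// Assumed usage inside an Rcpp function, given an Rcpp::String named str:
//     bool ascii = is_ascii_only(str.get_cstring());
```

The check works on raw bytes, so it does not depend on whatever encoding mark the string carries: any non-ASCII character in UTF-8 or Latin-1 contains at least one byte above 127.
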