[Rcpp-devel] How to properly check if a String has only ASCII characters in Rcpp?

Thu Feb 18 10:35:39 CET 2021

Dear list,

I see here
<https://dirk.eddelbuettel.com/code/rcpp/html/classRcpp_1_1String.html>
that Rcpp Strings have both get_encoding() and set_encoding() member
functions, which respectively return and accept a cetype_t enum defined in
Rinternals.h
<https://github.com/wch/r-source/blob/bf0a0a9d12f2ce5d66673dc32cd253524f3270bf/src/include/Rinternals.h#L928-L935>
with the following options:
    CE_NATIVE = 0,
    CE_UTF8   = 1,
    CE_LATIN1 = 2,
    CE_BYTES  = 3,
    CE_SYMBOL = 5,
    CE_ANY = 99

This means that if the String is marked as UTF-8, Latin-1 or bytes, the
String's get_encoding() member function will return 1, 2 or 3, respectively.
Experimentally, I see that when I try it with string objects containing
only 0 to 127 ASCII characters (with no manually set encoding), the
get_encoding() member function returns 0, which means CE_NATIVE in the
aforementioned enum. At first, I simply assumed that, in Rcpp's eyes,
ASCII would be considered CE_NATIVE. This would even be consistent with what
is described in R's Encoding() command help entry
<https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/Encoding>,
i.e. that "character strings in R can be declared to be encoded in 'latin1'
or 'UTF-8' or as 'bytes' (...) ASCII strings will never be marked with a
declared encoding, since their representation is the same in all supported
encodings". However, I later realized that was not the case: if one creates
an object that stores a UTF-8 or Latin-1 string (e.g. x <- "á"), then
manually drops the encoding (i.e. Encoding(x) <- "") and passes that object
to Rcpp, its get_encoding() still returns 0. This suggests that CE_NATIVE
corresponds to the "unknown" label returned by the Encoding() command,
rather than to ASCII specifically.

Note that, in R's official documentation
<https://cran.r-project.org/doc/manuals/r-devel/R-exts.html#Character-encoding-issues>,
nothing is said about CE_NATIVE and, conversely, it is explicitly said that
"Value CE_ANY is used to indicate a character string that will not need
re-encoding – this is used for character strings known to be in ASCII, and
can also be used as an input parameter where the intention is that the
string is treated as a series of bytes". With this last bit of information
in mind, I would then have expected that strings containing simple 0-127
ASCII characters and no manually set encoding, when passed to Rcpp code,
would have their get_encoding() member function return 99 instead of 0,
hence making it easy to check within Rcpp whether a string is ASCII-only.
That not being the case, my question actually becomes twofold:

1) why does Rcpp's get_encoding() apparently return 0 instead of 99 for
ASCII-only text?

2) is there an established way to properly check within Rcpp whether an Rcpp
String is ASCII-only, the way the IS_ASCII macro does in R's C API (besides
obviously looping over each character to check if it's <128)?
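
For reference, the per-character loop mentioned in question 2 would look
roughly like the sketch below. It is plain C++ with no Rcpp headers, and
is_ascii_only is a hypothetical helper name, not something Rcpp or R
provides. I assume that inside Rcpp one would feed it the String's
underlying C string, but I have not verified that this is the recommended
route.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of Rcpp or R's C API): returns true when
// every byte of s lies in the 0..127 ASCII range.
bool is_ascii_only(const std::string& s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        // Cast before comparing: plain char may be signed, in which case
        // bytes >= 128 would otherwise show up as negative values.
        if (static_cast<unsigned char>(s[i]) > 127) {
            return false;
        }
    }
    return true;
}
```

With this, "abc" passes, while "á" fails in either encoding, since it is
stored as the bytes 0xC3 0xA1 in UTF-8 and as 0xE1 in Latin-1.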

Thanks,

Andrade Solomon