[Rcpp-devel] How to properly check if a String has only ASCII characters in Rcpp?

Andrade Solomon andradesolomon2011 at gmail.com
Fri Feb 19 00:43:17 CET 2021


Thanks all for your replies.

> #define ASCII_MASK (1<<6)
> bool is_ascii_internal(Rcpp::String xi) {
>   return (LEVELS(xi.get_sexp()) & ASCII_MASK) != 0;
> }

This is almost exactly the solution I was attempting after reading the
src/include/Defn.h file, but was missing LEVELS().
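
As a rough illustration of why the flag test works while get_encoding()
still reports 0 for ASCII input: as far as the R sources suggest, the ASCII
marker is a separate bit from the UTF-8, Latin-1 and bytes bits that decide
the declared encoding. The sketch below mocks that layout in plain C++ so it
compiles without R. MockCE, encoding_from_gp and ascii_flag_set are
hypothetical names, and the mask values and the order of checks are
assumptions taken from R's internal headers, which may change between R
versions.

```cpp
#include <cstdint>

// Flag bits as they appear (to my understanding) in R's src/include/Defn.h.
// These are internal to R and are an assumption here, not a public API.
constexpr std::uint16_t BYTES_MASK  = 1u << 1;
constexpr std::uint16_t LATIN1_MASK = 1u << 2;
constexpr std::uint16_t UTF8_MASK   = 1u << 3;
constexpr std::uint16_t ASCII_MASK  = 1u << 6;

// Hypothetical stand-in for cetype_t, using the values quoted in this thread.
enum MockCE { MOCK_NATIVE = 0, MOCK_UTF8 = 1, MOCK_LATIN1 = 2, MOCK_BYTES = 3 };

// Maps a mock gp field to a declared encoding. Only the three encoding bits
// are consulted; the ASCII bit plays no role, so an ASCII-only string with
// no encoding bit set falls through to "native".
MockCE encoding_from_gp(std::uint16_t gp) {
    if (gp & UTF8_MASK)   return MOCK_UTF8;
    if (gp & LATIN1_MASK) return MOCK_LATIN1;
    if (gp & BYTES_MASK)  return MOCK_BYTES;
    return MOCK_NATIVE;
}

// Same test as is_ascii_internal(), applied to the mock gp field.
bool ascii_flag_set(std::uint16_t gp) {
    return (gp & ASCII_MASK) != 0;
}
```

With only the ASCII bit set, encoding_from_gp() yields MOCK_NATIVE while
ascii_flag_set() is true, which would match the behaviour described in the
original question.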

> Tomas wrote well about it:
>
> https://developer.r-project.org/Blog/public/2020/07/30/windows/utf-8-build-of-r-and-cran-packages/index.html

That's a great read, thanks for sharing.

> "One day" this may be easier. Until then the best we can do may be to borrow
> helper functions from R

Agreed with both sentences. Accessing some helper functions from R would
help a lot while also maintaining consistency for the time being.
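
For cases where relying on R's internal flag bits feels too fragile, the
loop-based check mentioned in the original question can be sketched in
standard C++ as below. is_ascii_bytes is a hypothetical name, and the idea
that one would feed it the bytes of an Rcpp::String (for instance via
get_cstring()) is an assumption about how it would be wired up, not
something tested here.

```cpp
#include <string>

// Hypothetical fallback: returns true if every byte is in the range 0..127.
// It scans the whole string, so it costs O(n) per call, unlike the flag
// test, which reads a bit that R has already computed.
bool is_ascii_bytes(const std::string& s) {
    for (unsigned char c : s) {
        if (c > 127) {
            return false;  // high bit set: not plain ASCII
        }
    }
    return true;  // also true for the empty string
}
```

The loop works on raw bytes, so it does not depend on the declared encoding:
any UTF-8 or Latin-1 character outside ASCII has at least one byte above 127.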



On Thu, Feb 18, 2021 at 2:37 PM Travers Ching <traversc at gmail.com> wrote:

> Hi Kevin and Andrade,
>
> I was also once looking for a way to test if strings are ASCII. Although
> looping over the characters would be fast enough in most cases, it isn't
> efficient or necessary.
>
> The IS_ASCII function isn't visible to users and it isn't obvious how one
> can re-implement the function unless you read R source code. I would agree
> with Andrade that a function in Rcpp would be helpful. Here's one
> re-implementation of the internal R function:
>
> #include <Rcpp.h>
>
> // Bit 6 of a CHARSXP's gp field marks it as ASCII (see R's Defn.h).
> #define ASCII_MASK (1<<6)
> bool is_ascii_internal(Rcpp::String xi) {
>   return (LEVELS(xi.get_sexp()) & ASCII_MASK) != 0;
> }
>
> // Checks only the first element of x.
> // [[Rcpp::export]]
> bool is_ascii(Rcpp::CharacterVector x) {
>   return is_ascii_internal(x[0]);
> }
>
> Travers
>
>
> On Thu, Feb 18, 2021 at 11:18 AM Kevin Ushey <kevinushey at gmail.com> wrote:
>
>> On Thu, Feb 18, 2021 at 1:36 AM Andrade Solomon
>> <andradesolomon2011 at gmail.com> wrote:
>> >
>> > Dear list,
>> >
>> > I see here that Rcpp strings have both get_encoding() and
>> set_encoding() member functions, which respectively return and accept a
>> cetype_t enum defined in Rinternals.h with options:
>> >     CE_NATIVE = 0,
>> >     CE_UTF8   = 1,
>> >     CE_LATIN1 = 2,
>> >     CE_BYTES  = 3,
>> >     CE_SYMBOL = 5,
>> >     CE_ANY = 99
>> >
>> > This means that if the String is marked as UTF-8, Latin-1 or bytes, the
>> String's get_encoding() member function will return 1, 2 or 3,
>> respectively. Experimentally, I see that when I try it with string objects
>> containing only 0 to 127 ASCII characters (with no manually set encoding),
>> the get_encoding() member function returns 0, which means CE_NATIVE in the
>> aforementioned enum. At first, I just assumed that, then, in Rcpp's eyes,
>> ASCII would be considered CE_NATIVE. This could even make sense with what
>> is described in R's Encoding() command help entry, i.e. that "character
>> strings in R can be declared to be encoded in 'latin1' or 'UTF-8' or as
>> 'bytes' (...) ASCII strings will never be marked with a declared encoding,
>> since their representation is the same in all supported encodings".
>> However, I later realized that was not the case: if one creates an object
>> that stores a UTF-8 or Latin-1 string (e.g. x <- "á") then manually drops
>> the encoding (i.e. Encoding(x) <- ""), if that object (i.e. x) was passed
>> to Rcpp its get_encoding() would still return 0 (which suggests that
>> CE_NATIVE corresponds to the "unknown" label returned by the Encoding()
>> command).
>> >
>> > Note that, in R's official documentation, nothing is said about
>> CE_NATIVE and, conversely, it is explicitly said that "Value CE_ANY is used
>> to indicate a character string that will not need re-encoding – this is
>> used for character strings known to be in ASCII, and can also be used as an
>> input parameter where the intention is that the string is treated as a
>> series of bytes". With this last bit of information in mind, I would then
>> have expected that strings containing simple 0-127 ASCII characters and no
>> manually set encoding, when passed to Rcpp code, would then have their
>> get_encoding() member function return 99 instead of 0 - hence making it
>> easy to check within Rcpp whether a string was ASCII only. That not being
>> the case, my question actually becomes twofold:
>> >
>> > 1) why does Rcpp's get_encoding() apparently return 0 instead of 99 for
>> ASCII only text?
>>
>> I think this is because this is what R does; e.g.
>>
>>     Encoding("ascii")  => "unknown"
>>
>> As far as I can tell, CE_ANY is used only sparsely by R itself
>> internally, and isn't really surfaced as a "public" encoding to be
>> used.
>>
>> > 2) is there an established way to properly check within Rcpp whether an
>> Rcpp String is ASCII only (besides obviously looping over each character to
>> check if it's <128) just like it is done in R's C API with the IS_ASCII
>> macro?
>>
>> I think this is the most reasonable way forward. If you need something
>> more complicated or specific, I would honestly just recommend rolling
>> your own class with the behaviors you need.
>>
>> If you think there's a way to make this happen with Rcpp's own String
>> class, then a pull request would be welcomed.
>>
>> > Thanks,
>> >
>> > Andrade Solomon
>> >
>> >
>> > _______________________________________________
>> > Rcpp-devel mailing list
>> > Rcpp-devel at lists.r-forge.r-project.org
>> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/rcpp-devel
>
>