[Rcpp-devel] How to properly check if a String has only ASCII characters in Rcpp?
Travers Ching
traversc at gmail.com
Thu Feb 18 20:36:38 CET 2021
Hi Kevin and Andrade,
I was also once looking for a way to test if strings are ASCII. Although
looping over the characters would be fast enough in most cases, it isn't
efficient or necessary.
The IS_ASCII function isn't visible to users and it isn't obvious how one
can re-implement the function unless you read R source code. I would agree
with Andrade that a function in Rcpp would be helpful. Here's one
re-implementation of the internal R function:
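#include <Rcpp.h>

// Bit 6 of a CHARSXP's "gp" field is the flag R sets for ASCII-only strings;
// this mirrors the mask behind R's internal IS_ASCII macro.
#define ASCII_MASK (1<<6)

// Read the cached flag instead of scanning the bytes.
bool is_ascii_internal(Rcpp::String xi) {
    return (LEVELS(xi.get_sexp()) & ASCII_MASK) != 0;
}

// [[Rcpp::export]]
bool is_ascii(Rcpp::CharacterVector x) {
    // Only the first element of x is checked.
    return is_ascii_internal(x[0]);
}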
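
For comparison, the loop-based check mentioned in this thread can be sketched in plain C++ with no R headers at all. This is only an illustration: is_ascii_bytes is a hypothetical helper name, not part of Rcpp or R, and it scans every byte rather than reading the flag R caches.

```cpp
#include <string>

// Hypothetical helper (not part of Rcpp or R): returns true when every
// byte of s lies in the 7-bit ASCII range 0..127.
bool is_ascii_bytes(const std::string& s) {
    for (unsigned char c : s) {
        // Any byte with the high bit set is outside ASCII.
        if (c > 127) return false;
    }
    return true;
}
```

Inside an Rcpp function, the bytes of one element could presumably be passed in via Rcpp::as<std::string>(x[0]). The cost is linear in the string length, whereas the flag check above is constant time.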
Travers
On Thu, Feb 18, 2021 at 11:18 AM Kevin Ushey <kevinushey at gmail.com> wrote:
> On Thu, Feb 18, 2021 at 1:36 AM Andrade Solomon
> <andradesolomon2011 at gmail.com> wrote:
> >
> > Dear list,
> >
> > I see here that Rcpp strings have both get_encoding() and
> set_encoding() member functions, which respectively return and accept a
> cetype_t enum defined in Rinternals.h with options:
> > CE_NATIVE = 0,
> > CE_UTF8 = 1,
> > CE_LATIN1 = 2,
> > CE_BYTES = 3,
> > CE_SYMBOL = 5,
> > CE_ANY = 99
> >
> > This means that if the String is UTF-8, Latin-1 or bytes, the String's
> get_encoding() member function will return 1, 2 or 3, respectively.
> Experimentally, I see that when I try it with string objects containing
> only 0 to 127 ASCII characters (with no manually set encoding), the
> get_encoding() member function returns 0, which means CE_NATIVE in the
> aforementioned enum. At first, I just assumed that, then, in Rcpp's eyes,
> ASCII would be considered CE_NATIVE. This could even make sense with what
> is described in R's Encoding() command help entry, i.e. that "character
> strings in R can be declared to be encoded in 'latin1' or 'UTF-8' or as
> 'bytes' (...) ASCII strings will never be marked with a declared encoding,
> since their representation is the same in all supported encodings".
> However, I later realized that was not the case: if one creates an object
> that stores a UTF-8 or Latin-1 string (e.g. x <- "á") then manually drops
> the encoding (i.e. Encoding(x) <- ""), if that object (i.e. x) was passed
> to Rcpp its get_encoding() would still return 0 (which suggests that
> CE_NATIVE corresponds to the "unknown" label returned by the Encoding()
> command).
> >
> > Note that, in R's official documentation, nothing is said about
> CE_NATIVE and, conversely, it is explicitly said that "Value CE_ANY is used
> to indicate a character string that will not need re-encoding – this is
> used for character strings known to be in ASCII, and can also be used as an
> input parameter where the intention is that the string is treated as a
> series of bytes". With this last bit of information in mind, I would then
> have expected that strings containing simple 0-127 ASCII characters and no
> manually set encoding, when passed to Rcpp code, would then have their
> get_encoding() member function return 99 instead of 0 - hence making it
> easy to check within Rcpp whether a string was ASCII only. That not being
> the case, my question actually becomes twofold:
> >
> > 1) why does Rcpp's get_encoding() apparently return 0 instead of 99 for
> ASCII only text?
>
> I think this is because this is what R does; e.g.
>
> Encoding("ascii") => "unknown"
>
> As far as I can tell, CE_ANY is used only sparsely by R itself
> internally, and isn't really surfaced as a "public" encoding to be
> used.
>
> > 2) is there an established way to properly check within Rcpp whether a
> Rcpp String is ASCII only (besides obviously looping over each character to
> check if it's <128) just like it is done in R's C API with the IS_ASCII
> macro?
>
> I think this is the most reasonable way forward. If you need something
> more complicated or specific, I would honestly just recommend rolling
> your own class with the behaviors you need.
>
> If you think there's a way to make this happen with Rcpp's own String
> class, then a pull request would be welcomed.
>
> > Thanks,
> >
> > Andrade Solomon
> >
> >
> > _______________________________________________
> > Rcpp-devel mailing list
> > Rcpp-devel at lists.r-forge.r-project.org
> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/rcpp-devel