[Rcpp-devel] String encoding (UTF-8 conversion)

Jeroen Ooms jeroen.ooms at stat.ucla.edu
Tue Dec 16 06:00:38 CET 2014


On Thu, Dec 11, 2014 at 12:24 PM, Jeroen Ooms <jeroen.ooms at stat.ucla.edu> wrote:
> I'm interfacing a c++ library which assumes strings are UTF-8. However
> strings from R can have various encodings. It's not clear to me how I
> need to account for that in Rcpp.

Follow-up on this: from what I have found, there is currently no
string type that is unambiguous across platforms and locales (other
than the actual STRSXP). If the native locale uses UTF-8 then all is
fine, but we cannot assume that in R. Here is a little script that
illustrates the various combinations I tried and the results on
Windows: https://gist.github.com/jeroenooms/9edf97f873f17a4ce5d3.
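
To make the ambiguity concrete, here is a minimal standalone sketch (no R
involved) of why native bytes cannot be handed to a UTF-8 library as-is. It
assumes, purely for illustration, that the native encoding is Latin-1; the
helper name latin1_to_utf8 is hypothetical and is not part of R or Rcpp.

```cpp
#include <string>

// Illustration only: transcode Latin-1 (ISO-8859-1) bytes to UTF-8.
// Each Latin-1 byte equals the Unicode code point of the same value, so
// bytes below 0x80 are copied and bytes from 0x80 up become two UTF-8 bytes.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size() * 2);
    for (unsigned char c : in) {
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));
        } else {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));    // lead byte
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));  // continuation byte
        }
    }
    return out;
}
```

The same text therefore has two different byte sequences: "é" is the single
byte 0xE9 in Latin-1 but the two bytes 0xC3 0xA9 in UTF-8, so a library that
assumes UTF-8 would see an invalid sequence if given the Latin-1 bytes.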

Assuming that each of these cases is intended behavior, perhaps we
could introduce an additional string type, e.g. Rcpp::UTF8String. The
mapping from STRSXP to Rcpp::UTF8String would use
translateCharUTF8(STRING_ELT(x, 0)) and the mapping from Rcpp::UTF8String
back to STRSXP would use SET_STRING_ELT(out, 0, mkCharCE(olds,
CE_UTF8)). That would allow for defining C++ functions operating on
UTF-8 strings which will work as expected across platforms and locales.
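
Until such a type exists, the two mappings can be sketched as plain helper
functions. This is only a rough sketch, meant to be compiled from R with
Rcpp::sourceCpp(): the names get_utf8, make_utf8 and roundtrip_utf8 are
hypothetical, and the R API calls carry the Rf_ prefix because Rcpp builds
with R_NO_REMAP defined.

```cpp
#include <Rcpp.h>
#include <string>

// Hypothetical helper: element i of an R character vector as UTF-8 bytes.
// Rf_translateCharUTF8 re-encodes from the element's marked (or native)
// encoding; the result is copied into a std::string right away.
static std::string get_utf8(SEXP x, R_xlen_t i) {
    return std::string(Rf_translateCharUTF8(STRING_ELT(x, i)));
}

// Hypothetical helper: a length-one character vector whose single element
// holds the given bytes and is marked as UTF-8.
static Rcpp::CharacterVector make_utf8(const std::string& s) {
    Rcpp::CharacterVector out(1);
    SET_STRING_ELT(out, 0, Rf_mkCharCE(s.c_str(), CE_UTF8));
    return out;
}

// Round trip: R string -> UTF-8 std::string -> R string marked as UTF-8.
// A call into the UTF-8-only C++ library would sit between the two steps.
// [[Rcpp::export]]
Rcpp::CharacterVector roundtrip_utf8(Rcpp::CharacterVector x) {
    if (x.size() < 1) Rcpp::stop("expected a non-empty character vector");
    std::string utf8 = get_utf8(x, 0);
    return make_utf8(utf8);
}
```

From R, Encoding(roundtrip_utf8(x)) can then be used to check the encoding
mark on the result; pure ASCII strings are never marked, so they will not
report "UTF-8".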

