[Rcpp-devel] Unicode on windows

Romain Francois romain at r-enthusiasts.com
Thu Aug 1 19:02:12 CEST 2013


Yes, encoding is something we have not dealt with yet.

This is not high on my priority list but there are ways to alter that 
list, e.g. find a way to sponsor the development of that particular 
feature through funding or even crowdfunding if enough people are 
interested in having the feature and willing to pay for it.

Otherwise this will have to wait until someone with the skills develops it.

Romain

On 01/08/13 18:57, Ned Harding wrote:
> Just to follow up in case anyone other than me is using Unicode in R:
> Rcpp does not support Unicode, or really any encoding other than 7-bit
> ASCII.  Internally, R marks every string with an encoding, typically
> UTF-8, Latin-1 or ASCII.  When using as<string>, Rcpp just copies the
> bytes over, ignoring the encoding.  This means that if you take a string
> that was UTF-8 and later wrap it again, the encoding information is lost
> and the characters get corrupted.  In particular, never use
> Rcpp::as<std::wstring>, because each byte gets widened without the
> string being converted to Unicode.
>
> If you want (or need) to support Unicode text in an R plugin, you need
> to use Rf_translateCharUTF8(…) to get a string.  Regardless of what
> encoding it was in originally, R will make sure it is encoded as UTF-8.
> To put a string into an R object, you have to use the corresponding
> Rf_mkCharLenCE(p, len, CE_UTF8) function, which tells R that the data
> you have is UTF-8.
>
> Ned.
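
[Editorial illustration, not part of the original messages.] The difference
between widening bytes and decoding UTF-8 that the follow-up describes can be
sketched in standalone C++. The function names widen_bytes and decode_utf8 are
hypothetical, and widen_bytes is only a simplified model of the reported
behaviour, not Rcpp's actual code. Inside an R extension the UTF-8 bytes would
presumably come from Rf_translateCharUTF8, as described above.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Simplified model of the reported behaviour (an assumption, not Rcpp's
// source): every input byte becomes one wide element, so the bytes of a
// multi-byte UTF-8 sequence end up as separate "characters".
std::u32string widen_bytes(const std::string& bytes) {
    std::u32string out;
    for (unsigned char b : bytes) out.push_back(b);
    return out;
}

// Minimal UTF-8 decoder: turns a UTF-8 byte string into code points.
// It checks lead and continuation bytes but, to stay short, does not
// reject overlong encodings or surrogate code points.
std::u32string decode_utf8(const std::string& s) {
    std::u32string out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (b0 < 0x80)                { cp = b0;        extra = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; extra = 1; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; extra = 2; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + extra >= s.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

For the test string "a\u02a5b" used in the original report, the UTF-8 bytes
are 61 CA A5 62. widen_bytes returns four elements (61, ca, a5, 62), while
decode_utf8 returns three (61, 2a5, 62), which is what a conversion to a wide
string would be expected to produce.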
>
> From: rcpp-devel-bounces at lists.r-forge.r-project.org
> [mailto:rcpp-devel-bounces at lists.r-forge.r-project.org] On Behalf Of
> Ned Harding
> Sent: Wednesday, June 26, 2013 11:54 AM
> To: rcpp-devel at lists.r-forge.r-project.org
> Subject: [Rcpp-devel] Unicode on windows
>
> I am having issues with the wide string conversion to and from Rcpp.
> When taking in a string from R that is encoded as UTF-8, I would expect
> as<wstring> to convert the UTF-8 to a wide string.  Instead, it just
> widens each byte and leaves the UTF-8 encoding in place.  I have no
> issue with UTF-8; my issue is that Rcpp doesn’t seem to be able to tell
> me what encoding the source is in, so I don’t know whether I should
> convert or not.
>
> Similarly, I would expect wrap<wstring> to produce a UTF-8 encoded
> SEXP, but instead the encoding in R comes back as “unknown” and the
> data can’t print.  See the C++ and R sources below, along with the output.
>
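> C++ function
> ----------------------------------------
> RcppExport SEXP TestWide(SEXP _strIn)
> {
>     std::wstring strIn = Rcpp::as<std::wstring>(_strIn);
>     // Print each wide character of the converted input in hex.
>     for (const wchar_t *p = strIn.c_str(); *p; ++p)
>         Rprintf("%x\n", *p);
>     // Note: a \x escape consumes every following hex digit, so this
>     // literal is 'a' followed by the single character 0x2a5c.
>     std::wstring str = L"a\x02a5c";
>     return Rcpp::wrap(str);
> }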
>
> R Script
> ----------------------------------------
> test <- "a\u02a5b"
> a <- .Call("TestWide", test, PACKAGE = "AlteryxRDataX")
> print(Encoding(a))
> print(a)
>
> R Output
> ----------------------------------------
> R version 3.0.0 (2013-04-03) - x86_64
> rgeos version: 0.2-16, (SVN revision 389)
> GEOS runtime version: 3.3.6-CAPI-1.7.6
> Polygon checking: TRUE
> 61
> ffca
> ffa5
> 62
> "unknown"
> "a?"
>
> Thanks,
>
> Ned Harding
> Alteryx
> CTO
> 3825 Iris Avenue, Suite 150
> Boulder, CO 80301
> Phone:  720-259-0541
> eMail: ned at alteryx.com


-- 
Romain Francois
Professional R Enthusiast
+33(0) 6 28 91 30 30

R Graph Gallery: http://gallery.r-enthusiasts.com

blog:            http://blog.r-enthusiasts.com
|- http://bit.ly/13SrjxO : highlight 0.4.2
`- http://bit.ly/10X94UM : Mobile version of the graph gallery


