[Rcpp-devel] Missing values
Romain Francois
romain at r-enthusiasts.com
Fri Nov 16 09:25:52 CET 2012
Thanks for exploring these issues. This looks very useful.
I get:
> str( first_log(NA) )
logi TRUE
> str( first_int(NA_integer_) )
int NA
> str( first_num(NA_real_) )
num NA
> str( first_char(NA_character_) )
chr "NA"
For first_log: a bool can only be true or false. In R, logical vectors
are stored as arrays of ints, and NA is just a special int value. When
we coerce to bool, we test whether the value is not 0, so NA comes out
as true. This works for everything except NA, so I guess conversion to
bool should be avoided whenever missing values are possible.
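To make the first_log case concrete, here is a minimal sketch in plain C++ (no Rcpp headers). It assumes NA_LOGICAL is stored as INT_MIN, as R does. The names kNaLogical, to_bool, Tri and to_tri are made up for this illustration and are not part of Rcpp.

```cpp
#include <cassert>
#include <climits>

// Stand-in for R's NA_LOGICAL (assumption: stored as INT_MIN, as in R).
const int kNaLogical = INT_MIN;

// Mimics the coercion described above: any nonzero int becomes true.
bool to_bool(int lgl) { return lgl != 0; }

// Hypothetical three-state result that keeps NA distinguishable.
enum class Tri { kFalse, kTrue, kNA };

Tri to_tri(int lgl) {
    if (lgl == kNaLogical) return Tri::kNA;
    return lgl != 0 ? Tri::kTrue : Tri::kFalse;
}

// Quick self-check of the behaviour discussed above.
void check_logical_na() {
    assert(to_bool(kNaLogical));              // NA silently turns into true
    assert(to_tri(kNaLogical) == Tri::kNA);   // the NA survives here
    assert(to_tri(1) == Tri::kTrue);
    assert(to_tri(0) == Tri::kFalse);
}
```

The point is only that a two-state type cannot carry the third state, so the check for NA has to happen before the value is narrowed to bool.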
We have the is_na template function that can help:
> evalCpp( 'traits::is_na<LGLSXP>( NA_LOGICAL )' )
[1] TRUE
> evalCpp( 'traits::is_na<REALSXP>( NA_REAL )' )
[1] TRUE
And from this I can see we don't have is_na<STRSXP>. I will fix this.
> str( evalCpp( 'traits::get_na<REALSXP>()' ) )
num NA
> str( evalCpp( 'traits::get_na<INTSXP>()' ) )
int NA
I guess we could come up with a nicer syntax for these, maybe static
functions in Vector<> so that we could do :
IntegerVector::is_na( )
NumericVector::get_na( )
...
More below:
On 15/11/12 23:36, Hadley Wickham wrote:
> Hi all,
>
> I'm working on a description of how missing values work in Rcpp
> (expanding on FAQ 3.4). I'd really appreciate any comments,
> corrections or suggestions on the text below.
>
> Thanks!
>
> Hadley
>
>
> # Missing values
>
> If you're working with missing values, you need to know two things:
>
> * what happens when you put missing values in scalars (e.g. `double`)
> * how to get and set missing values in vectors (e.g. `NumericVector`)
>
> ## Scalars
>
> The following code explores what happens when you coerce the first
> element of a vector into the corresponding scalar:
>
> cppFunction('int first_int(IntegerVector x) {
>   return(x[0]);
> }')
> cppFunction('double first_num(NumericVector x) {
>   return(x[0]);
> }')
> cppFunction('std::string first_char(CharacterVector x) {
>   return((std::string) x[0]);
> }')
> cppFunction('bool first_log(LogicalVector x) {
>   return(x[0]);
> }')
>
> first_log(NA)
> first_int(NA_integer_)
> first_num(NA_real_)
> first_char(NA_character_)
>
> So
>
> * `NumericVector` -> `double`: NAN
>
> * `IntegerVector` -> `int`: NAN (not sure how this works given that
> integer types don't usually have a missing value)
> str( evalCpp( 'std::numeric_limits<int>::min()' ) )
int NA
This is how NA_integer_ is represented: it is the smallest int, not a NaN.
> * `CharacterVector` -> `std::string`: the string "NA"
Ouch. We definitely need to fix this. Will do.
> * `LogicalVector` -> `bool`: TRUE
>
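On the integer case, a small sketch in plain C++ may help show why an int NA does not behave like NaN. It assumes NA_integer_ is the smallest int, as the evalCpp call above suggests. kNaInt, is_na_int and add_na are hypothetical names, not Rcpp functions.

```cpp
#include <cassert>
#include <limits>

// Stand-in for R's NA_INTEGER (assumption: the smallest representable int).
const int kNaInt = std::numeric_limits<int>::min();

// Hypothetical helper: an int is "missing" only by this convention.
bool is_na_int(int x) { return x == kNaInt; }

// Hypothetical NA-aware addition: C++ will not propagate NA for us.
int add_na(int a, int b) {
    if (is_na_int(a) || is_na_int(b)) return kNaInt;
    return a + b;  // no overflow check in this sketch
}

void check_int_na() {
    assert(is_na_int(kNaInt));
    assert(!is_na_int(kNaInt + 1));        // plain arithmetic loses the NA
    assert(is_na_int(add_na(kNaInt, 1)));  // an explicit check keeps it
    assert(add_na(2, 3) == 5);
}
```

So unlike the double case, nothing in the hardware propagates an integer NA. Any arithmetic on it just produces an ordinary int unless the code tests for it first.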
> If you're working with doubles, depending on your problem, you may be
> able to get away with ignoring missing values and working with NaNs.
> R's missing values are a special type of the IEEE 754 floating point
> number NaN (not a number). That means if you coerce them to `double`
> or `int` in your C++ code, they will behave like regular NaN's.
>
> In a logical context they always evaluate to FALSE:
>
> evalCpp("NAN == 1")
> evalCpp("NAN < 1")
> evalCpp("NAN > 1")
> evalCpp("NAN == NAN")
>
> But be careful when combining them with boolean values:
>
> evalCpp("NAN && TRUE")
> evalCpp("NAN || FALSE")
>
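The same behaviour can be reproduced without evalCpp. The sketch below is plain standard C++, and any_comparison_true and as_bool are names invented for the example. It shows that comparisons with NaN are all false, while converting NaN to bool gives true because NaN is nonzero.

```cpp
#include <cassert>
#include <cmath>

// Every equality or ordering comparison involving NaN is false.
bool any_comparison_true(double x) {
    return (x == 1) || (x < 1) || (x > 1) || (x == x);
}

// Converting a double to bool only asks "is it nonzero?", and NaN is nonzero.
bool as_bool(double x) { return x != 0; }

void check_nan_logic() {
    const double nan_value = std::nan("");
    assert(!any_comparison_true(nan_value));
    assert(as_bool(nan_value) && true);    // why NAN && TRUE is true
    assert(as_bool(nan_value) || false);   // why NAN || FALSE is true
    assert(std::isnan(nan_value));         // the reliable way to test
}
```

In other words, `if (x)` on a possibly missing double is a trap. std::isnan (or ISNA on the R side) is the test to reach for.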
> In numeric contexts, they propagate similarly to NA in R:
>
> evalCpp("NAN + 1")
> evalCpp("NAN - 1")
> evalCpp("NAN / 1")
> evalCpp("NAN * 1")
It is very useful to let people know about these issues.
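For readers who want to see the propagation in a loop rather than in a one-liner, here is a minimal sketch in plain C++. sum_all and sum_skip_nan are hypothetical helpers. The second one is only a rough analogue of na.rm = TRUE, and it cannot tell NA from NaN.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Plain sum: a single NaN poisons the result.
double sum_all(const std::vector<double>& x) {
    double total = 0;
    for (double v : x) total += v;
    return total;
}

// Hypothetical analogue of na.rm = TRUE: skip every NaN entry.
double sum_skip_nan(const std::vector<double>& x) {
    double total = 0;
    for (double v : x) {
        if (!std::isnan(v)) total += v;
    }
    return total;
}

void check_propagation() {
    const std::vector<double> x = {1.0, std::nan(""), 2.0};
    assert(std::isnan(sum_all(x)));   // NaN propagates through +
    assert(sum_skip_nan(x) == 3.0);   // skipped explicitly
}
```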
> ## Vectors
>
> To set a missing value in a vector, you need to use a missing value
> specific to the type of vector. Unfortunately these are not named
> terribly consistently:
>
> cppFunction('
> List missing_sampler() {
>
>   NumericVector num(1);
>   num[0] = NA_REAL;
>
>   IntegerVector intv(1);
>   intv[0] = NA_INTEGER;
>
>   LogicalVector lgl(1);
>   lgl[0] = NA_LOGICAL;
>
>   CharacterVector chr(1);
>   chr[0] = NA_STRING;
>
>   List out(4);
>   out[0] = num;
>   out[1] = intv;
>   out[2] = lgl;
>   out[3] = chr;
>   return(out);
> }
> ')
> str(missing_sampler())
>
> To check if a value in a vector is missing, use `ISNA`:
>
> cppFunction('
> LogicalVector is_na2(NumericVector x) {
>   LogicalVector out(x.size());
>
>   NumericVector::iterator x_it;
>   LogicalVector::iterator out_it;
>   for (x_it = x.begin(), out_it = out.begin(); x_it != x.end();
>        x_it++, out_it++) {
>     *out_it = ISNA(*x_it);
>   }
>   return(out);
> }
> ')
> is_na2(c(NA, 5.4, 3.2, NA))
>
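One point worth making next to the ISNA example: ISNA is true for NA but not for an ordinary NaN, whereas std::isnan is true for both. The sketch below imitates that distinction in plain C++. It assumes NA_real_ is a NaN whose low 32 bits hold the value 1954, which is how R encodes it as far as I know. kNaRealBits, make_na_real and is_r_na are names invented here, not R or Rcpp API.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Assumed bit pattern of NA_real_: a NaN with 1954 (0x7A2) in the low word.
const std::uint64_t kNaRealBits = 0x7FF00000000007A2ULL;

// Build a double carrying that bit pattern.
double make_na_real() {
    double x;
    std::memcpy(&x, &kNaRealBits, sizeof x);
    return x;
}

// ISNA-like test: must be a NaN and must carry the 1954 payload.
bool is_r_na(double x) {
    if (!std::isnan(x)) return false;
    std::uint64_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    return (bits & 0xFFFFFFFFULL) == 1954;
}

void check_na_vs_nan() {
    assert(std::isnan(make_na_real()));   // NA is a NaN
    assert(is_r_na(make_na_real()));      // and it carries the payload
    assert(!is_r_na(std::nan("")));       // a plain NaN is not NA
    assert(!is_r_na(5.4));
}
```

Whether the payload survives arithmetic is up to the hardware, so code that needs to tell NA from NaN should test the input values rather than the results.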
> Rcpp provides a helper function called `is_na` that works similarly to
> `is_na2` above, producing a logical vector that's true where the value
> in the vector was missing.
As said above, I'll add
...Vector::is_na
...Vector::get_na
to have something more consistent and not as cryptic as
traits::is_na<...>( ). People should not need to know what REALSXP,
INTSXP, LGLSXP, ... mean.
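To give an idea of the shape this could take, here is a rough sketch in plain C++ of static is_na / get_na functions keyed on the element type instead of on SEXP codes. This is not the Rcpp implementation and not a committed API. na_traits is a hypothetical name, and the double case deliberately simplifies by using a generic quiet NaN.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical trait keyed on the C++ element type.
template <typename T> struct na_traits;

template <> struct na_traits<int> {
    // Assumption: integer NA is the smallest int.
    static int get_na() { return std::numeric_limits<int>::min(); }
    static bool is_na(int x) { return x == get_na(); }
};

template <> struct na_traits<double> {
    // Simplification: a generic quiet NaN, so NA and NaN are not told apart.
    static double get_na() { return std::numeric_limits<double>::quiet_NaN(); }
    static bool is_na(double x) { return std::isnan(x); }
};

void check_na_traits() {
    assert(na_traits<int>::is_na(na_traits<int>::get_na()));
    assert(!na_traits<int>::is_na(0));
    assert(na_traits<double>::is_na(na_traits<double>::get_na()));
    assert(!na_traits<double>::is_na(1.5));
}
```

With something along these lines the caller writes the type they already have in hand and never has to spell REALSXP or INTSXP.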
--
Romain Francois
Professional R Enthusiast
+33(0) 6 28 91 30 30
R Graph Gallery: http://gallery.r-enthusiasts.com
`- http://bit.ly/SweN1Z : SuperStorm Sandy
blog: http://romainfrancois.blog.free.fr
|- http://bit.ly/RE6sYH : OOP with Rcpp modules
`- http://bit.ly/Thw7IK : Rcpp modules more flexible