[datatable-help] Follow-up on subsetting data.table with NAs

Sun Jun 9 23:47:44 CEST 2013

On 09.06.2013 22:08, Arunkumar Srinivasan wrote: 

> Matthew,
>
> Regarding your recent answer here:
> http://stackoverflow.com/a/17008872/559784 I had a few questions/thoughts
> and I thought it may be more appropriate to share here (even though I've
> already written 3 comments!).
>
> 1) First, you write that DT[ColA == ColB] is simpler than
> DF[!is.na(ColA) & !is.na(ColB) & ColA == ColB,]
>
> However, you can write this long expression as:
> DF[which(DF$ColA == DF$ColB), ]

Good point. But DT[ColA == ColB] still seems simpler than
DF[which(DF$ColA == DF$ColB), ] (in data.table, DT[which(ColA == ColB)]).
I worry about forgetting that I need which() and then hitting bugs when
NAs appear in the data at some point in the future that don't occur now
or in testing.
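
To make the worry above concrete, here is a minimal base-R sketch. The toy
data frame DF and its values are made up for illustration and are not taken
from the thread. A logical index that contains NA yields an all-NA row in a
data.frame, whereas wrapping the comparison in which() drops it:

```r
# Toy data, assumed for illustration only
DF <- data.frame(ColA = c(1, 2, NA, 4), ColB = c(1, 3, NA, 4))

# Plain logical index: the NA comparison comes back as a row of NAs
res_plain <- DF[DF$ColA == DF$ColB, ]

# which() keeps only the positions that are TRUE, so the NA is dropped
res_which <- DF[which(DF$ColA == DF$ColB), ]

# The explicit long form selects the same rows as which()
res_long <- DF[!is.na(DF$ColA) & !is.na(DF$ColB) & DF$ColA == DF$ColB, ]

print(nrow(res_plain))  # 3 rows, one of them all NA
print(nrow(res_which))  # 2 rows
print(nrow(res_long))   # 2 rows
```

Forgetting which() here would not raise an error; it would silently add the
extra NA row, which is the kind of bug that only shows up once NAs appear.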

> 2) Second, you mention that the motivation is not just convenience but
> speed. By checking:
>
> require(data.table)
> set.seed(45)
> df <- as.data.frame(matrix(sample(c(1,2,3,NA), 2e6, replace=TRUE), ncol=2))
> dt <- data.table(df)
> system.time(dt[V1 == V2])
> # 0.077 seconds
> system.time(df[!is.na(df$V1) & !is.na(df$V2) & df$V1 == df$V2, ])
> # 0.252 seconds
> system.time(df[which(df$V1 == df$V2), ])
> # 0.038 seconds
>
> We see that using `which` (in addition to removing NA) is also faster
> than `DT[V1 == V2]`. In fact, `DT[which(V1 == V2)]` is faster than
> `DT[V1 == V2]`. I suspect this is because of the snippet below in
> `[.data.table`:
>
> if (is.logical(i)) {
>     if (identical(i,NA)) i = NA_integer_ # see DT[NA] thread re recycling of NA logical
>     else i[is.na(i)] = FALSE # avoids DT[!is.na(ColA) & !is.na(ColB) & ColA==ColB], just DT[ColA==ColB]
> }
>
> But at the end `irows <- which(i)` is being done:
>
> if (is.logical(i)) {
>     if (length(i)==nrow(x)) irows=which(i) # e.g. DT[colA>3,which=TRUE]
>
> And this "irows" is what's used to index the corresponding rows. So, is
> the replacement of `NA` to FALSE really necessary? I may very well have
> overlooked the purpose of the NA replacement to FALSE for other
> scenarios, but just by looking at this case, it doesn't seem like it's
> necessary as you fetch index/row numbers later.

Interesting. Cool, so dt[V1 == V2] can and should be at least as
fast as the which() way. Will file an FR to improve that speed!
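
As a small check of the point just made: which() on its own already
discards NA, so replacing NA with FALSE first does not change which rows
are selected when which() is applied afterwards. This is a base-R sketch on
a made-up logical vector i, not the actual `[.data.table` code:

```r
# A hypothetical logical index with missing values
i <- c(TRUE, NA, FALSE, TRUE, NA)

# Path 1: replace NA by FALSE first, then take the TRUE positions
i_clean <- i
i_clean[is.na(i_clean)] <- FALSE
rows_replaced <- which(i_clean)

# Path 2: call which() directly; it ignores NA
rows_direct <- which(i)

print(rows_direct)
print(identical(rows_replaced, rows_direct))  # TRUE
```

Whether the replacement step matters elsewhere in `[.data.table` (for
example when i is not the same length as the table) is not something this
sketch can show.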

> 3) And finally, more of a philosophica [...] n using "which"),

Not sure that is agreed yet, but happy to be persuaded.

> then are there other reasons to change the default behaviour of R's
> philosophy of handling NAs as unknowns/missing observations? I find I
> can relate more to the native concept of handling NAs. For example:
>
> x <- c(1,2,3,NA)
> x != 3
> # TRUE TRUE FALSE NA
>
> makes more sense because `NA != 3` doesn't fall in either TRUE or FALSE,
> if NA is a missing observation/unknown data. The answer
> "unknown/missing" seems more appropriate, therefore.

True, but the context in which that result is used is all-important; i.e.

> The data.table philosophy is that DT[x==3] should exclude any rows in x
> that are NA, without needing to do anything special such as needing to
> know to call which() as well. That differs to data.frame, but is more
> consistent with SQL. In SQL "where x = 3" doesn't need anything else if
> x contains some NULL values.
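
The difference described above can be seen on the vector used earlier in
this thread. A minimal sketch, assuming the data.table package is
installed; the object names DF and DT are illustrative only:

```r
library(data.table)

DF <- data.frame(x = c(1, 2, 3, NA))
DT <- as.data.table(DF)

# data.frame: the NA in the logical index comes back as an all-NA row
df_rows <- DF[DF$x == 3, , drop = FALSE]

# data.table: rows where the condition is NA are excluded,
# much like SQL's "where x = 3" skips NULLs
dt_rows <- DT[x == 3]

print(nrow(df_rows))  # 2
print(nrow(dt_rows))  # 1
```

The comparison x == 3 itself still evaluates to NA for the missing value in
both cases; only the way the subsetting operator treats that NA differs.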

> I'd be interested in h [...] dition to Matthew's, others' thoughts and
> inputs as well.
>
> Best regards,
>
> Arun