[datatable-help] Follow-up on subsetting data.table with NAs

Mon Jun 10 10:05:15 CEST 2013

Hi Arun, 

Hm, good point. Is data.table consistent with SQL
already, for both == and !=, and so no change needed? And it was correct
for Frank to be mistaken. Maybe just some more documentation and
examples needed then. Are you happy that DT[!(x==.)] and DT[x!=.] do
treat NA inconsistently? :

http://stackoverflow.com/questions/16239153/dtx-and-dtx-treat-na-in-x-inconsistently

"na.rm = TRUE/FALSE" sounds good to me. I'd only considered nomatch
before in the context of joins, not logical subsets. 

Thanks, Matthew

On 10.06.2013 08:11, Arunkumar Srinivasan wrote: 

> Matthew, 
>
Regarding your suggestion of changes regarding Frank's post here:
http://stackoverflow.com/a/17008872/559784 [1] I find it a bit more
confusing and frankly not like sql. 
> You wrote: "If I haven't
understood correctly feel free to correct, otherwise the change will get
made eventually. It will need to be done in a way that considers
compound expressions; e.g., DT[colA=="foo" & colB!="bar"] should exclude
rows with NA in colA but include rows where colA is non-NA but colB is
NA. Similarly, DT[colA!=colB] should include rows where either colA or
colB is NA but not both. And perhaps DT[colA==colB] should include rows
where bothcolA and colB are NA (which it doesn't currently, I believe)."

> 
> Even though sql (ex: sqldf) has a different way of handling NAs
when compared to data.frame, it doesn't seem to find NA == NA. That is,

> df <- data.frame(x = c(1:3,NA), y = c(NA,4:5,NA)) 
> require(sqldf)

> sqldf("select * from df where x == y") 
> # returns empty data.frame

> sqldf("select * from df where x != y") 
> 
> x y 
> 1 2 4 
> 2 3 5 
>

> That is, at least in sqldf package, NA is not == NA and NA is not !=
NA which is very much in coherence with R's default NA == NA and NA !=
NA (both giving NA). But I don't think they it's considered FALSE here.
It just acts like the "subset" function where all entries that were
evaluated to NAs are simply dropped. But with data.table philosophy NA
!= NA should be evaluated to TRUE, which I don't think (from what I
meagrely understand from sql) is what sql does. Please correct me if
I've got it wrong. 
> I think it is clearer and simpler if "NAs are just
dropped" after evaluating logical expressions. It would be also easy to
document this and easier to grasp, imho. This would also explain Frank's
post for NA rows being removed. 
> And probably if there is more
consensus an option for "na.rm = TRUE/FALSE" could be added? 
> Arun

Links:
------
[1] http://stackoverflow.com/a/17008872/559784
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130610/03f8d8d0/attachment.html>