[datatable-help] Follow-up on subsetting data.table with NAs

Arunkumar Srinivasan aragorn168b at gmail.com
Mon Jun 10 10:28:46 CEST 2013


> Hm, good point.  Is data.table consistent with SQL already, for both == and !=, and so no change needed?  

Yes, I believe it's already consistent with SQL. However, the documentation's current description of NA as being treated as FALSE is unnecessary and, imho, inaccurate (please see below).
 
> And it was correct for Frank to be mistaken.  

Yes, it seems he was mistaken.

> Maybe just some more documentation and examples needed then.

It would be more appropriate for the documentation to say that subsetting in data.table mimics the "subset" function (in order to be in line with SQL) by dropping rows where the logical expression evaluates to NA. In the code I pasted a couple of posts ago, replacing the NAs with FALSE was not necessary: `irows <- which(i)` makes clear that `which` is used to get the indices before subsetting, and `which` already drops NAs. This fits perfectly well with the interpretation of NA in data.table.
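
To make the point about `which` concrete, here is a minimal sketch in base R only (it does not call data.table, and `df`, `x` and `y` are made-up example data). It shows that `which` already drops NA, so replacing NA with FALSE first changes nothing, and that the result matches `subset` rather than plain data.frame logical indexing:

```r
# Base R only; `df` is made-up example data, not taken from the thread.
df <- data.frame(x = c(1, NA, 3), y = c("a", "b", "c"))

i <- df$x == 3                     # FALSE, NA, TRUE

irows <- which(i)                  # which() returns only the TRUE positions
by_which <- df[irows, ]            # one row (x = 3)

# Replacing NA with FALSE beforehand gives the same indices.
i_false <- i
i_false[is.na(i_false)] <- FALSE
irows_false <- which(i_false)

by_subset  <- subset(df, x == 3)   # subset() also drops the NA row
by_logical <- df[i, ]              # logical indexing keeps an all-NA row

nrow(by_which)                     # 1
nrow(by_subset)                    # 1
nrow(by_logical)                   # 2
```
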
> Are you happy that DT[!(x==.)] and DT[x!=.] do treat NA inconsistently? :
> http://stackoverflow.com/questions/16239153/dtx-and-dtx-treat-na-in-x-inconsistently

Ha, I like the idea behind the use of () in evaluating expressions. It's another nice layer towards simplicity in data.table. But I still think equivalent logical operations should not give different results. If !(x == .) and x != . are indeed meant to differ, then I'd suggest replacing `!` with a more appropriate name, as it's very easy to get confused otherwise.

In essence, either !(x == .) must evaluate to (x != .) if the underlying meaning of the two is the same, or the `!` in `!(x == .)` must be replaced with something more appropriate for what it's supposed to do. Personally, I prefer the former. It would greatly tighten the structure and consistency.
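
For reference, this is the sense in which I call the two expressions equivalent. The sketch below is base R only, on a made-up vector `x`; it shows what plain R logical operators do and says nothing about what data.table does inside `i`:

```r
# Base R only; `x` is a made-up example vector.
x <- c(1, NA, 3)

a <- !(x == 3)    # TRUE, NA, FALSE
b <- x != 3       # TRUE, NA, FALSE

same_logical <- identical(a, b)                 # same vector, NA included
same_rows    <- identical(which(a), which(b))   # same indices once NA is dropped
```
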
> "na.rm = TRUE/FALSE" sounds good to me.  I'd only considered nomatch before in the context of joins, not logical subsets.

Yes, I think this option would give more control when evaluating expressions in `i`, by providing both the "subset" behaviour (na.rm = TRUE, the default) and the typical data.frame subsetting (na.rm = FALSE).
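
A rough sketch of how such a switch might behave, written in base R. The helper `subset_rows` and its arguments are hypothetical, purely for illustration; nothing here is an existing data.table function or argument:

```r
# Hypothetical helper (not part of data.table): illustrates an na.rm
# switch for logical row subsetting on a plain data.frame.
subset_rows <- function(df, i, na.rm = TRUE) {
  stopifnot(is.logical(i), length(i) == nrow(df))
  if (na.rm) {
    df[which(i), , drop = FALSE]   # "subset" behaviour: NA rows are dropped
  } else {
    df[i, , drop = FALSE]          # data.frame behaviour: NA yields an NA row
  }
}

df <- data.frame(x = c(1, NA, 3))  # made-up example data

kept    <- subset_rows(df, df$x != 3)                 # one row (x = 1)
with_na <- subset_rows(df, df$x != 3, na.rm = FALSE)  # x = 1 plus an NA row
```

With the default, only the row where the condition is TRUE comes back; with na.rm = FALSE, the NA in `x` shows up as an extra NA row, as in data.frame subsetting.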

Best regards,
 
Arun
