[datatable-help] Follow-up on subsetting data.table with NAs

Arunkumar Srinivasan aragorn168b at gmail.com
Mon Jun 10 00:43:25 CEST 2013


Matthew,

I personally don't think using "which" takes away from the simplicity of the syntax. However, since it's now clear (to me) that the philosophy of data.table leans more towards SQL, I don't see a reason for "which".

Even in the context of `[` in data.table/data.frame, though, one could argue that the handling of "missing/unknown" data relates back to R's philosophy.

library(data.table)
dt <- data.table(x = c(1, 3, 4, NA), y = 1:4)
dt[x <= 3]    # returns rows 1 and 2 only; the NA row is dropped

Here, one could argue that we don't know whether the missing value in the 4th row is <= 3 or not. So the question comes down to what action should be taken: do you give back the rows where no decision could be made, or not? But since, as you rightly pointed out, the idea behind data.table is to be SQL-like, the current output stands. Retaining the NA rows would then be the wrong choice as well.
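
For contrast, a minimal sketch of what base R does with the same subset (it keeps the undecidable row, as an all-NA row) and the which() workaround that drops it:

df <- as.data.frame(dt)
df[df$x <= 3, ]          # rows 1 and 2, plus an all-NA row for the 4th observation
df[which(df$x <= 3), ]   # which() drops the NA position, matching dt[x <= 3]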

Regarding FR4652, thanks for the speedy filing of this! I'm glad to have spotted it.

Best regards,
Arun.


On Sunday, June 9, 2013 at 11:47 PM, Matthew Dowle wrote:

> On 09.06.2013 22:08, Arunkumar Srinivasan wrote:
> > Matthew,
> > Regarding your recent answer here: http://stackoverflow.com/a/17008872/559784 I had a few questions/thoughts and thought it might be more appropriate to share them here (even though I've already written 3 comments!).
> > 1) First, you write that DT[ColA == ColB] is simpler than DF[!is.na(ColA) & !is.na(ColB) & ColA == ColB, ]
> > However, you can write this long expression as: DF[which(DF$ColA == DF$ColB), ]
> > 
> 
> Good point. But DT[ColA == ColB] still seems simpler than DF[which(DF$ColA == DF$ColB), ] (in data.table, DT[which(ColA == ColB)]). I worry about forgetting that I need which() and then having bugs appear when NAs turn up in the data at some point in the future, even though they don't occur now or in tests.
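> A minimal sketch of the contrast: which() silently drops NA positions, whereas plain logical indexing keeps them, which is where those bugs come from.
> 
> which(c(TRUE, FALSE, NA))        # 1
> (1:3)[c(TRUE, FALSE, NA)]        # 1 NA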
> > 2) Second, you mention that the motivation is not just convenience but speed. By checking:
> > require(data.table)
> > set.seed(45)
> > df <- as.data.frame(matrix(sample(c(1,2,3,NA), 2e6, replace=TRUE), ncol=2))
> > dt <- data.table(df)
> > system.time(dt[V1 == V2])
> > # 0.077 seconds
> > system.time(df[!is.na(df$V1) & !is.na(df$V2) & df$V1 == df$V2, ])
> > # 0.252 seconds
> > system.time(df[which(df$V1 == df$V2), ])
> > # 0.038 seconds
> > We see that using `which` (in addition to removing NA) is also faster than `DT[V1 == V2]`. In fact, `DT[which(V1 == V2)]` is faster than `DT[V1 == V2]`. I suspect this is because of the snippet below in `[.data.table`:
> >         if (is.logical(i)) {
> >             if (identical(i,NA)) i = NA_integer_  # see DT[NA] thread re recycling of NA logical
> >             else i[is.na(i)] = FALSE              # avoids DT[!is.na(ColA) & !is.na(ColB) & ColA==ColB], just DT[ColA==ColB]
> >         }
> > 
> > But at the end, `irows = which(i)` is done anyway:
> >             if (is.logical(i)) {
> >                 if (length(i)==nrow(x)) irows=which(i)   # e.g. DT[colA>3,which=TRUE]
> > 
> > And this "irows" is what's used to index the corresponding rows. So, is the replacement of `NA` to FALSE really necessary? I may very well have overlooked the purpose of the NA replacement to FALSE for other scenarios, but just by looking at this case, it doesn't seem like it's necessary as you fetch index/row numbers later.
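> > A small sketch of that point: the NA positions fall out of which() either way, so irows ends up the same.
> > 
> > i <- c(TRUE, NA, FALSE, TRUE)
> > which(i)                 # 1 4
> > i[is.na(i)] <- FALSE
> > which(i)                 # 1 4, the same irows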
> > 
> 
> Interesting.  Cool, so dt[V1 == V2] can and should be at least as fast as the which() way.  Will file a FR to improve that speed!
> > 3) And finally, more of a philosophical point. If we agree that subsetting can be done conveniently (using "which") and with no loss of speed (again using "which"),
> > 
> 
> Not sure that is agreed yet, but happy to be persuaded.
> > then are there other reasons to depart from R's default philosophy of treating NAs as unknowns/missing observations? I find I relate more to the native way of handling NAs. For example:
> > x <- c(1,2,3,NA)
> > x != 3
> > # TRUE TRUE FALSE NA
> > makes more sense, because if NA is a missing observation/unknown value, then `NA != 3` falls into neither TRUE nor FALSE. The answer "unknown/missing" therefore seems more appropriate.
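> > The same treatment carries through to base subsetting, which keeps the unknown rather than dropping it:
> > 
> > x[x != 3]
> > # 1 2 NA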
> > 
> 
> True, but the context where that result is used is all-important; i.e., in this case that's `i` of [.data.table or [.data.frame. It may be easier to consider == first. The data.table philosophy is that DT[x == 3] should exclude any rows where x is NA, without needing to do anything special such as knowing to call which() as well. That differs from data.frame, but is more consistent with SQL. In SQL, "where x = 3" doesn't need anything else if x contains some NULL values.
> > I'd be interested in hearing, in addition to Matthew's, other's thoughts and inputs as well.
> > Best regards,
> > Arun
> > 
> > 
> 

