[datatable-help] Follow-up on subsetting data.table with NAs

Mon Jun 10 16:35:57 CEST 2013

Hm, another good point.   We need ~ for formulae,  although I can't 
imagine a formula in i (only in j).  But in both i and j we might want 
to get(x).

I thought about ^ i.e. X[^Y] in the spirit of regular expression 
syntax,  but ^ doesn't parse with a RHS only. Needs to be parsable as a 
prefix.
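
A quick way to check which candidate characters parse as a prefix is to 
ask the parser directly.  This is only a sketch using base R's parse(); 
X and Y are placeholder names, the helper parses() is made up for the 
illustration, and nothing is evaluated:

```r
# Returns TRUE if the text is syntactically valid R, FALSE otherwise.
# Only parsing is tested; X and Y never need to exist.
parses <- function(txt) {
  !inherits(try(parse(text = txt), silent = TRUE), "try-error")
}

candidates <- c("X[!Y]", "X[-Y]", "X[+Y]", "X[~Y]", "X[^Y]")
ok <- vapply(candidates, parses, logical(1))
print(ok)  # only the ^ form fails to parse
```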

- maybe then?  Consistent with the meaning of - in R.  I don't think I 
actually had a specific use in mind to reserve - and + for, but at the 
time it just seemed a shame to use up one of -/+ without defining the 
other.  If - does a not-join, then might + be more like merge() (i.e. 
returning the union of the rows in x and i by join)?  I think I had 
something like that in mind, but hadn't thought it through.
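
To make the two semantics concrete, here is a rough base R sketch on 
plain data frames.  The tables X and Y and their columns are made up, 
and X[-Y] / X[+Y] are only proposals at this point, so the code just 
shows the results such prefixes might be expected to give:

```r
# Made-up example tables sharing the key column "id".
X <- data.frame(id = c(1, 2, 3), a = c("p", "q", "r"))
Y <- data.frame(id = c(2, 3, 4), b = c("u", "v", "w"))

# "Not join": rows of X whose key has no match in Y.
not_join <- X[!(X$id %in% Y$id), ]

# Union of the rows by join, as merge() gives with all = TRUE.
union_join <- merge(X, Y, by = "id", all = TRUE)

print(not_join)
print(union_join)
```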

Some might say it should be a new argument e.g. notjoin=TRUE,  but my 
thinking there is readability,  since we often have many lines in i, j 
and by in that order, and if the "notjoin=TRUE" followed afterwards it 
would be far away from the i argument to which it applies.  If we 
incorporate merge() into X[Y] using X[+Y] then it might avoid adding yet 
more parameters, too.
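
For anyone following the NA question from the quoted thread below, here 
is a minimal base R sketch (the data frame is made up) of the two 
behaviours under discussion: plain data.frame indexing keeps an all-NA 
row where the condition is NA, whereas which() and subset() drop it.  It 
illustrates base R only, not data.table internals:

```r
# Made-up data with one NA in x.
DF <- data.frame(x = c(0, 5, NA, 0), y = 1:4)

eq <- DF$x == 0               # TRUE FALSE NA TRUE

with_na <- DF[eq, ]           # the NA condition yields an all-NA row
dropped <- DF[which(eq), ]    # which() silently drops the NA
sub     <- subset(DF, x == 0) # subset() drops it as well

# In base R, !(x == 0) and x != 0 agree element by element, NA included.
same_negation <- identical(!(DF$x == 0), DF$x != 0)

print(nrow(with_na))
print(nrow(dropped))
print(same_negation)
```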

On 10.06.2013 15:02, Gabor Grothendieck wrote:
> The problem with ~ is that it is using up a special character (of
> which there are only a few) for a case that does not occur much.
>
> I can think of other things that ~ might be better used for.  For
> example, perhaps ~ x could mean get(x).  One aspect of data.table that
> tends to be difficult is when you don't know the variable name ahead
> of time and this would give a way to specify it concisely.
>
> On Mon, Jun 10, 2013 at 5:21 AM, Arunkumar Srinivasan
> <aragorn168b at gmail.com> wrote:
>> Matthew,
>>
>> How about ~ instead of ! ?  I ruled out - previously to leave + and -
>> available for future use.  NJ() may be possible too.
>>
>> Both "NJ()" and "~" are okay for me.
>>
>> That result makes perfect sense to me.  I don't think of !(x==.)
>> being the same as x!=.  ! is simply a prefix.  It's all the rows that
>> aren't returned if the ! prefix wasn't there.
>>
>> I understand that `DT[!(x)]` does what `data.table` is designed to do
>> currently. What I failed to mention was that if one were to consider
>> implementing `!(x==.)` as the same as `x != .` then this behaviour has
>> to be changed. Let's forget this point for a moment.
>>
>> That needs to be fixed.  But we're getting quite theoretical here and
>> far away from common use cases.  Why would we ever have row numbers of
>> the table, as a column of the table itself and want to select the rows
>> by number not mentioned in that column?
>>
>> Probably I did not choose a good example. Suppose that I've a
>> data.table and I want to get all rows where "x == 0". Let's say:
>>
>> library(data.table)
>> set.seed(45)
>> # x and y must have the same length, so draw 10 values for each
>> DT <- data.table(x = sample(c(0, 5, 10, 15), 10, replace = TRUE),
>>                  y = sample(15, 10))
>>
>> DF <- as.data.frame(DT)
>>
>> To get all rows where x == 0, it could be done with DT[x == 0]. But it
>> makes sense, at least in the context of data.frames, to do
>> equivalently,
>>
>> DF[!(DF$x), ] (or) DF[DF$x == 0, ]
>>
>> All I want to say is, I expect `DT[!(x)]` should give the same result
>> as `DT[x == 0]` (even though I fully understand it's not the intended
>> behaviour of data.table), as it's more intuitive and less confusing.
>>
>> So, changing `!` to `~` or `NJ` is one half of the issue for me. The
>> other is to replace the actual function of `!` in all contexts. I hope
>> I've conveyed what I wanted to say better this time.
>>
>> Best,
>>
>> Arun
>>
>>
>> On Monday, June 10, 2013 at 10:52 AM, Matthew Dowle wrote:
>>
>>
>>
>> Hi,
>>
>> How about ~ instead of ! ?  I ruled out - previously to leave + and -
>> available for future use.  NJ() may be possible too.
>>
>> Matthew
>>
>>
>>
>> On 10.06.2013 09:35, Arunkumar Srinivasan wrote:
>>
>> Hi Matthew,
>> My view (from the last reply) more or less reflects mnel's comments
>> here:
>> 
>> http://stackoverflow.com/questions/16239153/dtx-and-dtx-treat-na-in-x-inconsistently#comment23317096_16240143
>> Pasted here for convenience:
>> data.table is mimicking subset in its handling of NA values in logical
>> i arguments. -- the only issue is the ! prefix signifying a not-join,
>> not the way one might expect. Perhaps the not join prefix could have
>> been NJ not ! to avoid this confusion -- this might be another
>> discussion to have on the mailing list -- (I think it is a discussion
>> worth having)
>>
>> Arun
>>
>> On Monday, June 10, 2013 at 10:28 AM, Arunkumar Srinivasan wrote:
>>
>> Hm, good point.  Is data.table consistent with SQL already, for both
>> == and !=, and so no change needed?
>>
>> Yes, I believe it's already consistent with SQL. However, the current
>> interpretation of NA (documentation) being treated as FALSE is not
>> needed / untrue, imho (Please see below).
>>
>>
>> And it was correct for Frank to be mistaken.
>>
>> Yes, it seems like he was mistaken.
>>
>> Maybe just some more documentation and examples needed then.
>>
>> It'd be much more appropriate if the documentation reflected that
>> subsetting in data.table mimics the "subset" function (in order to be
>> in line with SQL) by dropping rows where the logical evaluates to NA.
>> A couple of posts back I pasted the code where NAs are replaced with
>> FALSE; that replacement is not necessary, as `irows <- which(i)` makes
>> clear that `which` is being used to get indices and then subset. This
>> fits perfectly well with the interpretation of NA in data.table.
>>
>> Are you happy that DT[!(x==.)] and DT[x!=.] do treat NA 
>> inconsistently? :
>>
>> 
>> http://stackoverflow.com/questions/16239153/dtx-and-dtx-treat-na-in-x-inconsistently
>>
>> Ha, I like the idea behind the use of () in evaluating expressions.
>> It's another nice layer towards simplicity in data.table. But I still
>> think equivalent logical operations should not give different
>> results. If !(x == .) and x != . are indeed different, then I'd
>> suggest replacing `!` with a more appropriate name, as it's much
>> easier to get confused otherwise.
>> In essence, either !(x == .) must evaluate to (x != .) if the
>> underlying meaning of these is the same, or the `!` in `!(x==.)` must
>> be replaced with something more appropriate for what it's supposed to
>> do. Personally, I prefer the former. It would greatly tighten the
>> structure and consistency.
>>
>> "na.rm = TRUE/FALSE" sounds good to me.  I'd only considered nomatch
>> before in the context of joins, not logical subsets.
>>
>> Yes, I find this option would give more control in evaluating
>> expressions with ease in `i`, by providing both "subset" (default) and
>> the typical data.frame subsetting (na.rm = FALSE).
>> Best regards,
>>
>> Arun
>>
>> _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org
>> 
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help