[datatable-help] Follow-up on subsetting data.table with NAs

Sun Jul 7 09:46:35 CEST 2013

Hello all,

I thought it might be useful to link a recent post on SO discussing more or less the same issue:
http://stackoverflow.com/questions/17508127/na-in-i-expression-of-data-table-possible-bug

Arun

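For anyone reading this thread from the archive, the NA behaviour under discussion can be reproduced in base R alone. The sketch below is a minimal illustration, not data.table's implementation: it assumes only base R, and the data frame `DF` and its values are made up for the example. It shows that plain logical indexing keeps NA positions, that `subset()` and `which()` drop them, and that `!(x == 0)` and `x != 0` agree element-wise when `!` is ordinary logical negation.

```r
# Hypothetical data for illustration: x contains one NA.
DF <- data.frame(x = c(0, 5, NA, 0, 10), y = 1:5)

# Plain logical indexing of a data.frame keeps NA positions as all-NA rows.
keep_na <- DF[DF$x != 0, ]
nrow(keep_na)    # 3 rows: x = 5, one all-NA row, x = 10

# subset() and which() both drop rows where the condition evaluates to NA.
drop_na_subset <- subset(DF, x != 0)
drop_na_which  <- DF[which(DF$x != 0), ]
nrow(drop_na_subset)    # 2 rows: x = 5 and x = 10

# As ordinary logical negation, !(x == 0) and x != 0 agree, NA included.
same_logic <- identical(!(DF$x == 0), DF$x != 0)
same_logic    # TRUE
```

The thread's argument is that data.table's logical `i` follows the `subset()`/`which()` convention, while a `!` prefix was being read as a not-join rather than as the ordinary negation on the last line.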
On Monday, June 10, 2013 at 7:01 PM, Arunkumar Srinivasan wrote:

> Hi Matthew, 
> Thanks for clarifying this. To me the "not join" operation is very similar to a "setdiff" operation, but for a data.frame/data.table. So DT[!J(.)] could be interpreted as setdiff(DT, DT[J(.)]).
> 
> No, I'm with you: it makes sense to extend it to logical vector operations as well. And so far, I think everyone who wrote back also agrees with the idea of:
> 
> 1) !(x == .) and x != . being identical
> 2) ~(.) (or) NJ(.) (or) -(.) being a NOT JOIN on data.table/list/vectors etc.
> 
> I'd love for these two to be on the feature list. I really don't mind the "~", "NJ" or "-". 
> 
> Thanks again,
> Arun
> 
> 
> On Monday, June 10, 2013 at 5:28 PM, Matthew Dowle wrote:
> 
> >  
> > Hi Arun,
> > Indeed.  ! was introduced for not-join i.e. X[!Y] where i is type data.table.  Extending it to vectors seemed to make sense at the time; e.g., X[!"foo"] and X[!3:6] (rather than the X[-3:6] mistake where X[-(3:6)] was intended) were in my mind.   I think of everything as a join really; e.g., "where rownumber = i".
> > But I think I'm fine with ! being not-join for data.table/list i only.  Or should it be turned off only for logical vector i, leaving ! as-is for character and integer vector i?
> > Matthew
> >  
> > On 10.06.2013 15:52, Arunkumar Srinivasan wrote:
> > > Matthew, 
> > > It just occurred to me. I'd be glad if you can clarify this. The operation is supposed to be "Not Join". Which means, I'd expect the "!" to be used with "J" as in:
> > > dt <- data.table(x=c(0,0,1,1,3), y=1:5)
> > > setkey(dt, "x")
> > > dt[J(c(1,3))] # join
> > >    x y
> > > 1: 1 3
> > > 2: 1 4
> > > 3: 3 5
> > > 
> > > dt[!J(c(1,3))]
> > >    x y
> > > 1: 0 1
> > > 2: 0 2
> > > 
> > > Here the concept of "Not Join" with the use of "!J(.)" makes total sense. However, extending it to not-join for logical vectors is what seems to be an issue. It's more logical indexing than a join (at least in my mind). So, if it is possible to distinguish between "!" and "!J" (by checking whether `i` is a data.table) to tell subsetting by a logical vector from subsetting by a data.table, and then decide what to do, would that resolve this issue? If not, what's the reason behind using "!" as a not-join during logical indexing? Is it still considered a not-join?
> > > Just a thought. I hope it makes at least a little sense.
> > > Best,
> > > Arun
> > > 
> > > 
> > > On Monday, June 10, 2013 at 4:35 PM, Matthew Dowle wrote:
> > > 
> > > > Hm, another good point. We need ~ for formulae, although I can't
> > > > imagine a formula in i (only in j). But in both i and j we might want
> > > > to get(x).
> > > > I thought about ^ i.e. X[^Y] in the spirit of regular expression
> > > > syntax, but ^ doesn't parse with a RHS only. Needs to be parsable as a
> > > > prefix.
> > > > - maybe then? Consistent with - meaning in R. I don't think I
> > > > actually had a specific use in mind for - and +, to reserve them for,
> > > > but at the time it just seemed a shame to use up one of -/+ without
> > > > defining the other. If - does a not join, then, might + be more like
> > > > > merge() (i.e. returning the union of the rows in x and i by join)? I
> > > > think I had something like that in mind, but hadn't thought it through.
> > > > Some might say it should be a new argument e.g. notjoin=TRUE, but my
> > > > thinking there is readability, since we often have many lines in i, j
> > > > and by in that order, and if the "notjoin=TRUE" followed afterwards it
> > > > would be far away from the i argument to which it applies. If we
> > > > incorporate merge() into X[Y] using X[+Y] then it might avoid adding yet
> > > > more parameters, too.
> > > > On 10.06.2013 15:02, Gabor Grothendieck wrote:
> > > > > The problem with ~ is that it is using up a special character (of
> > > > > which there are only a few) for a case that does not occur much.
> > > > > I can think of other things that ~ might be better used for. For
> > > > > > example, perhaps ~ x could mean get(x). One aspect of data.table
> > > > > > that tends to be difficult is when you don't know the variable
> > > > > > name ahead of time and this would give a way to specify it
> > > > > > concisely.
> > > > > On Mon, Jun 10, 2013 at 5:21 AM, Arunkumar Srinivasan
> > > > > > <aragorn168b at gmail.com> wrote:
> > > > > > > Matthew,
> > > > > > >
> > > > > > > How about ~ instead of ! ? I ruled out - previously to leave +
> > > > > > > and - available for future use. NJ() may be possible too.
> > > > > > >
> > > > > > > Both "NJ()" and "~" are okay for me.
> > > > > > >
> > > > > > > That result makes perfect sense to me. I don't think of !(x==.)
> > > > > > > being the same as x!=. ! is simply a prefix. It's all the rows
> > > > > > > that aren't returned if the ! prefix wasn't there.
> > > > > > >
> > > > > > > I understand that `DT[!(x)]` does what `data.table` is currently
> > > > > > > designed to do. What I failed to mention was that if one were to
> > > > > > > consider implementing `!(x==.)` as the same as `x != .` then this
> > > > > > > behaviour has to be changed. Let's forget this point for a moment.
> > > > > > >
> > > > > > > That needs to be fixed. But we're getting quite theoretical here
> > > > > > > and far away from common use cases. Why would we ever have row
> > > > > > > numbers of the table, as a column of the table itself and want to
> > > > > > > select the rows by number not mentioned in that column?
> > > > > > >
> > > > > > > Probably I did not choose a good example. Suppose that I have a
> > > > > > > data.table and I want to get all rows where "x == 0". Let's say:
> > > > > > >
> > > > > > > set.seed(45)
> > > > > > > DT <- data.table(x = sample(c(0,5,10,15), 10, replace=TRUE),
> > > > > > >                  y = sample(15))
> > > > > > > DF <- as.data.frame(DT)
> > > > > > >
> > > > > > > To get all rows where x == 0, it could be done with DT[x == 0].
> > > > > > > But it makes sense, at least in the context of data.frames, to do
> > > > > > > equivalently,
> > > > > > >
> > > > > > > DF[!(DF$x), ] (or) DF[DF$x == 0, ]
> > > > > > >
> > > > > > > All I want to say is, I expect `DT[!(x)]` should give the same
> > > > > > > result as `DT[x == 0]` (even though I fully understand it's not
> > > > > > > the intended behaviour of data.table), as it's more intuitive and
> > > > > > > less confusing.
> > > > > > >
> > > > > > > So, changing `!` to `~` or `NJ` is one half of the issue for me.
> > > > > > > The other is to replace the actual function of `!` in all
> > > > > > > contexts. I hope I got across what I wanted to say better this
> > > > > > > time.
> > > > > > >
> > > > > > > Best,
> > > > > > > Arun
> > > > > > On Monday, June 10, 2013 at 10:52 AM, Matthew Dowle wrote:
> > > > > > Hi,
> > > > > > > How about ~ instead of ! ? I ruled out - previously to leave +
> > > > > > > and - available for future use. NJ() may be possible too.
> > > > > > Matthew
> > > > > > On 10.06.2013 09:35, Arunkumar Srinivasan wrote:
> > > > > > Hi Matthew,
> > > > > > My view (from the last reply) more or less reflects mnel's comments
> > > > > > here:
> > > > > > http://stackoverflow.com/questions/16239153/dtx-and-dtx-treat-na-in-x-inconsistently#comment23317096_16240143
> > > > > > Pasted here for convenience:
> > > > > > > data.table is mimicking subset in its handling of NA values in
> > > > > > > logical i arguments. -- the only issue is the ! prefix signifying a
> > > > > > > not-join, not the way one might expect. Perhaps the not join prefix
> > > > > > > could have been NJ not ! to avoid this confusion -- this might be
> > > > > > > another discussion to have on the mailing list -- (I think it is a
> > > > > > > discussion worth having)
> > > > > > Arun
> > > > > > On Monday, June 10, 2013 at 10:28 AM, Arunkumar Srinivasan wrote:
> > > > > > > Hm, good point. Is data.table consistent with SQL already, for both
> > > > > > > == and !=, and so no change needed?
> > > > > > >
> > > > > > > Yes, I believe it's already consistent with SQL. However, the
> > > > > > > current interpretation of NA (documentation) being treated as FALSE
> > > > > > > is not needed / untrue, imho (Please see below).
> > > > > > >
> > > > > > > And it was correct for Frank to be mistaken.
> > > > > > >
> > > > > > > Yes, it seems like he was mistaken.
> > > > > > >
> > > > > > > Maybe just some more documentation and examples needed then.
> > > > > > >
> > > > > > > It'd be much more appropriate if the documentation reflected the
> > > > > > > role of subsetting in data.table as mimicking the "subset" function
> > > > > > > (in order to be in line with SQL) by dropping logicals that
> > > > > > > evaluate to NA. As for the code I pasted a couple of posts back,
> > > > > > > the replacement of NAs with FALSE is not necessary:
> > > > > > > `irows <- which(i)` makes clear that `which` is being used to get
> > > > > > > the indices and then subset, and this fits perfectly well with the
> > > > > > > interpretation of NA in data.table.
> > > > > > >
> > > > > > > Are you happy that DT[!(x==.)] and DT[x!=.] do treat NA
> > > > > > > inconsistently? :
> > > > > > > http://stackoverflow.com/questions/16239153/dtx-and-dtx-treat-na-in-x-inconsistently
> > > > > > >
> > > > > > > Ha, I like the idea behind the use of () in evaluating expressions.
> > > > > > > It's another nice layer towards simplicity in data.table. But I
> > > > > > > still think equivalent logical operations should not give different
> > > > > > > results. If !(x== .) and x != . are indeed different, then I'd
> > > > > > > suggest replacing `!` with a more appropriate name, as it's much
> > > > > > > too easy to get confused otherwise.
> > > > > > >
> > > > > > > In essence, either !(x == .) must evaluate to (x != .) if the
> > > > > > > underlying meaning of these is the same, or the `!` in `!(x==.)`
> > > > > > > must be replaced with something that's more appropriate for what
> > > > > > > it's supposed to be. Personally, I prefer the former. It would
> > > > > > > greatly tighten the structure and consistency.
> > > > > > >
> > > > > > > "na.rm = TRUE/FALSE" sounds good to me. I'd only considered nomatch
> > > > > > > before in the context of joins, not logical subsets.
> > > > > > >
> > > > > > > Yes, I find this option would give more control in evaluating
> > > > > > > expressions with ease in `i`, by providing both "subset" (default)
> > > > > > > and the typical data.frame subsetting (na.rm = FALSE).
> > > > > > >
> > > > > > Best regards,
> > > > > > Arun
> > > > > > _______________________________________________
> > > > > > datatable-help mailing list
> > > > > > > datatable-help at lists.r-forge.r-project.org
> > > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> > > > > > 
> > > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > 
> > > > 
> > > > 
> > > 
> > > 
> > 
> >  
> >  
> > 
> > 
> > 
> 
> 

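The "not-join as setdiff" reading and the `X[-3:6]` versus `X[-(3:6)]` slip mentioned in the thread can both be checked without data.table. The following is a rough sketch under stated assumptions: it uses base R only, with `%in%` standing in for a keyed join (it is not how data.table implements `J()` or `!J()`), and it reuses the `x`/`y` values of the `dt` example quoted above. The names `keys`, `joined`, `not_joined` and `v` are introduced here purely for illustration.

```r
# Same values as the dt example in the thread, as a plain data.frame.
DF <- data.frame(x = c(0, 0, 1, 1, 3), y = 1:5)
keys <- c(1, 3)

# Stand-in for a join on x: rows whose x matches one of the keys.
joined <- DF[DF$x %in% keys, ]
# Stand-in for a not-join: every row the "join" above did not return.
not_joined <- DF[!(DF$x %in% keys), ]

# The two results partition the rows, which is the setdiff-like reading.
partitions <- identical(sort(c(joined$y, not_joined$y)), DF$y)

# Unary minus binds tighter than ':', so -3:6 is the sequence -3, -2, ..., 6.
v <- 1:10
same_seq <- identical(-3:6, (-3):6)

# Mixing negative and positive subscripts is an error in base R.
mixed <- tryCatch(v[-3:6], error = function(e) "error")

# Dropping elements 3 to 6 needs the parentheses.
dropped <- v[-(3:6)]    # 1 2 7 8 9 10
```

Here `joined$y` is 3, 4, 5 and `not_joined$y` is 1, 2, matching the `dt[J(c(1,3))]` and `dt[!J(c(1,3))]` output quoted earlier in the thread.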