[datatable-help] Follow-up on subsetting data.table with NAs

Gabor Grothendieck ggrothendieck at gmail.com
Mon Jun 10 16:02:38 CEST 2013


The problem with ~ is that it is using up a special character (of
which there are only a few) for a case that does not occur much.

I can think of other things that ~ might be better used for.  For
example, perhaps ~ x could mean get(x).  One aspect of data.table that
tends to be difficult is when you don't know the variable name ahead
of time and this would give a way to specify it concisely.
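
As a point of reference, get() can already be called inside [.data.table,
so ~ x would only be a shorter spelling of something like the sketch below.
The table, its column names and the variable colname are made up for
illustration; the ~ shorthand itself is only a suggestion and is not used
here.

```r
library(data.table)

# Hypothetical data; the column names are only for illustration.
DT <- data.table(a = 1:3, b = c(10, 20, 30))

# The column name is only known at run time.
colname <- "b"

# get() resolves the name against the columns of DT.
total <- DT[, sum(get(colname))]   # use the column in j
big   <- DT[get(colname) > 15]     # use the column in i
```

The same pattern works for any column whose name arrives as a string, at
the cost of the extra get() call that a dedicated prefix would hide.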

On Mon, Jun 10, 2013 at 5:21 AM, Arunkumar Srinivasan
<aragorn168b at gmail.com> wrote:
> Matthew,
>
> How about ~ instead of ! ?      I ruled out - previously to leave + and -
> available for future use.  NJ() may be possible too.
>
> Both "NJ()" and "~" are okay for me.
>
> That result makes perfect sense to me.   I don't think of !(x==.) being the
> same as  x!=.    ! is simply a prefix.    It's all the rows that aren't
> returned if the ! prefix wasn't there.
>
> I understand that `DT[!(x)]` does what `data.table` is designed to do
> currently. What I failed to mention was that if one were to consider
> implementing `!(x==.)` as the same as `x != .` then this behaviour has to be
> changed. Let's forget this point for a moment.
>
> That needs to be fixed.  But we're getting quite theoretical here and far
> away from common use cases.  Why would we ever have row numbers of the
> table, as a column of the table itself and want to select the rows by number
> not mentioned in that column?
>
> Probably I did not choose a good example. Suppose that I've a data.table and
> I want to get all rows where "x == 0". Let's say:
>
> set.seed(45)
> DT <- data.table( x = sample(c(0,5,10,15), 10, replace=TRUE), y =
> sample(15))
>
> DF <- as.data.frame(DT)
>
> To get all rows where x == 0, it could be done with DT[x == 0]. But it makes
> sense, at least in the context of data.frames, to do equivalently,
>
> DF[!(DF$x), ] (or) DF[DF$x == 0, ]
>
> All I want to say is, I expect `DT[!(x)]` should give the same result as
> `DT[x == 0]` (even though I fully understand it's not the intended behaviour
> of data.table), as it's more intuitive and less confusing.
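
The data.frame equivalence mentioned above can be checked with a small
sketch. It uses base R only and a made-up, NA-free data.frame, so it
assumes nothing about how data.table treats the ! prefix.

```r
# Hypothetical data; base R only.
DF <- data.frame(x = c(0, 5, 0, 10, 15), y = 1:5)

# For a numeric vector, !x is TRUE exactly where x is 0.
by_not <- DF[!(DF$x), ]
by_eq  <- DF[DF$x == 0, ]

# Both forms select the same rows of the data.frame.
same_rows <- identical(by_not, by_eq)
```

Here both forms pick rows 1 and 3, which is the result being asked for
from `DT[!(x)]`.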
>
> So, changing `!` to `~` or `NJ` is one half of the issue for me. The other
> is to replace the actual function of `!` in all contexts. I hope I came
> across with what I wanted to say, better this time.
>
> Best,
>
> Arun
>
>
> On Monday, June 10, 2013 at 10:52 AM, Matthew Dowle wrote:
>
>
>
> Hi,
>
> How about ~ instead of ! ?      I ruled out - previously to leave + and -
> available for future use.  NJ() may be possible too.
>
> Matthew
>
>
>
> On 10.06.2013 09:35, Arunkumar Srinivasan wrote:
>
> Hi Matthew,
> My view (from the last reply) more or less reflects mnel's comments here:
> http://stackoverflow.com/questions/16239153/dtx-and-dtx-treat-na-in-x-inconsistently#comment23317096_16240143
> Pasted here for convenience:
> data.table is mimicking subset in its handling of NA values in logical i
> arguments. -- the only issue is the ! prefix signifying a not-join, not the
> way one might expect. Perhaps the not join prefix could have been NJ not !
> to avoid this confusion -- this might be another discussion to have on the
> mailing list -- (I think it is a discussion worth having)
>
> Arun
>
> On Monday, June 10, 2013 at 10:28 AM, Arunkumar Srinivasan wrote:
>
> Hm, good point.  Is data.table consistent with SQL already, for both == and
> !=, and so no change needed?
>
> Yes, I believe it's already consistent with SQL. However, the current
> interpretation of NA (documentation) being treated as FALSE is not needed /
> untrue, imho (Please see below).
>
>
> And it was correct for Frank to be mistaken.
>
> Yes, it seems like he was mistaken.
>
> Maybe just some more documentation and examples needed then.
>
> It'd be much more appropriate if the documentation described subsetting in
> data.table as mimicking the "subset" function (in order to be in line with
> SQL), which drops rows where the logical expression evaluates to NA. In the
> code I pasted a couple of posts ago, replacing NAs with FALSE was not
> necessary: `irows <- which(i)` makes clear that `which` is used to get the
> indices before subsetting, and this fits perfectly well with the
> interpretation of NA in data.table.
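
The difference between the two conventions can be seen in base R alone.
The sketch below uses a made-up data.frame with one missing value and
compares plain logical indexing with subset() and which(); it does not
call data.table, so it only illustrates the semantics being described.

```r
# Hypothetical data with a missing value; base R only.
DF <- data.frame(x = c(0, NA, 5, 0), y = 1:4)
i  <- DF$x == 0                      # TRUE NA FALSE TRUE

# Plain data.frame indexing keeps the NA position as a row of NAs.
kept <- DF[i, ]

# subset() and which() both drop positions where the condition is NA.
dropped_subset <- subset(DF, x == 0)
dropped_which  <- DF[which(i), ]
```

kept has three rows (one of them all NA), while the other two results have
only the two rows where x is really 0.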
>
> Are you happy that DT[!(x==.)] and DT[x!=.] do treat NA inconsistently? :
>
> http://stackoverflow.com/questions/16239153/dtx-and-dtx-treat-na-in-x-inconsistently
>
> Ha, I like the idea behind the use of () in evaluating expressions. It's
> another nice layer towards simplicity in data.table. But I still think
> equivalent logical operations should not give different results. If
> !(x == .) and x != . are indeed different, then I'd suggest replacing `!`
> with a more appropriate name, as it's much too easy to get confused
> otherwise.
> In essence, either !(x == .) must evaluate to (x != .) if the underlying
> meaning of these is the same, or the `!` in `!(x == .)` must be replaced
> with something more appropriate for what it's supposed to do. Personally,
> I prefer the former. It would greatly tighten the structure and consistency.
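
For comparison, in base R the two spellings are elementwise identical, NAs
included, which is the behaviour being argued for above. A small sketch on
a made-up vector (no data.table involved, so it says nothing about what
DT[!(x == .)] returns):

```r
# Hypothetical vector with a missing value; base R only.
x <- c(0, NA, 5, 0)

negated   <- !(x == 0)   # negate the comparison
not_equal <- x != 0      # use the inequality directly

# Both are FALSE, NA, TRUE, FALSE, so they select the same elements.
same <- identical(negated, not_equal)
```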
>
> "na.rm = TRUE/FALSE" sounds good to me.  I'd only considered nomatch before
> in the context of joins, not logical subsets.
>
> Yes, I find this option would give more control in evaluating expressions
> with ease in `i`, by providing both "subset" (default) and the typical
> data.frame subsetting (na.rm = FALSE).
> Best regards,
>
> Arun



-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

