[datatable-help] Follow-up on subsetting data.table with NAs

Mon Jun 10 17:06:53 CEST 2013

Btw, since we're on the topic of join/not-join syntax, does this break
others' expectations, or is it just me?

> dt = data.table(x = c(1,2,3))
> setkey(dt,x)
> dt[J(1)]
   x
1: 1
> dt[!J(1)]
   x
1: 2
2: 3
> dt[(!J(1))]
Error in eval(expr, envir, enclos) : could not find function "J"
> dt[(J(1))]
Error in eval(expr, envir, enclos) : could not find function "J"

I understand why this happens internally, because the function "(" is read
as the head of the expression tree, but it's still pretty weird.
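
To make the parsing point concrete, here is a minimal base-R sketch (no
data.table needed) that prints the head of each parsed expression. The
helper name head_of is made up for illustration. The sketch only shows
what the parser produces; it does not show how data.table itself
inspects i.

```r
# Base R only: look at the head of each parsed (unevaluated) expression.
# head_of() is a hypothetical helper, not part of data.table.
head_of <- function(expr) as.character(expr[[1]])

plain   <- quote(J(1))     # a call whose head is J
negated <- quote(!J(1))    # a call whose head is `!`, wrapping J(1)
wrapped <- quote((J(1)))   # a call whose head is `(`, wrapping J(1)

heads <- c(plain   = head_of(plain),
           negated = head_of(negated),
           wrapped = head_of(wrapped))
print(heads)
```

So with the extra parentheses the outermost call is no longer J or !,
which would be consistent with the expression being evaluated as
ordinary R code, where no function J is found.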

On Mon, Jun 10, 2013 at 9:55 AM, Frank Erickson <FErickson at psu.edu> wrote:

> I prefer ~ and/or NJ() over -. The not-join operation is different from
> the subsetting operation usually associated with -.
>
> I don't know what characters are available for this sort of thing, but @x,
> @(x,y) seems natural enough as syntax for a getter.
>
>
> On Mon, Jun 10, 2013 at 9:35 AM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>
>>
>> Hm, another good point.   We need ~ for formulae,  although I can't
>> imagine a formula in i (only in j).  But in both i and j we might want to
>> get(x).
>>
>> I thought about ^ i.e. X[^Y] in the spirit of regular expression syntax,
>>  but ^ doesn't parse with a RHS only. Needs to be parsable as a prefix.
>>
>> - maybe then?  Consistent with - meaning in R.  I don't think I actually
>> had a specific use in mind for - and +, to reserve them for,  but at the
>> time it just seemed a shame to use up one of -/+ without defining the
>> other.  If - does a not join, then, might + be more like merge() (i.e.
>> returning the union of the rows in x and i by join).  I think I had
>> something like that in mind, but hadn't thought it through.
>>
>> Some might say it should be a new argument e.g. notjoin=TRUE,  but my
>> thinking there is readability,  since we often have many lines in i, j and
>> by in that order, and if the "notjoin=TRUE" followed afterwards it would be
>> far away from the i argument to which it applies.  If we incorporate
>> merge() into X[Y] using X[+Y] then it might avoid adding yet more
>> parameters, too.
>>
>>
>>
>> On 10.06.2013 15:02, Gabor Grothendieck wrote:
>>
>>> The problem with ~ is that it is using up a special character (of
>>> which there are only a few) for a case that does not occur much.
>>>
>>> I can think of other things that ~ might be better used for.  For
>>> example, perhaps ~ x could mean get(x).  One aspect of data.table that
>>> tends to be difficult is when you don't know the variable name ahead
>>> of time and this would give a way to specify it concisely.
>>>
>>> On Mon, Jun 10, 2013 at 5:21 AM, Arunkumar Srinivasan
>>> <aragorn168b at gmail.com> wrote:
>>>
>>>> Matthew,
>>>>
>>>> How about ~ instead of ! ?      I ruled out - previously to leave + and -
>>>> available for future use.  NJ() may be possible too.
>>>>
>>>> Both "NJ()" and "~" are okay for me.
>>>>
>>>> That result makes perfect sense to me.   I don't think of !(x==.) being
>>>> the
>>>> same as  x!=.    ! is simply a prefix.    It's all the rows that aren't
>>>> returned if the ! prefix wasn't there.
>>>>
>>>> I understand that `DT[!(x)]` does what `data.table` is designed to do
>>>> currently. What I failed to mention was that if one were to consider
>>>> implementing `!(x==.)` as the same as `x != .` then this behaviour has
>>>> to be
>>>> changed. Let's forget this point for a moment.
>>>>
>>>> That needs to be fixed.  But we're getting quite theoretical here and
>>>> far
>>>> away from common use cases.  Why would we ever have row numbers of the
>>>> table, as a column of the table itself and want to select the rows by
>>>> number
>>>> not mentioned in that column?
>>>>
>>>> Probably I did not choose a good example. Suppose that I've a
>>>> data.table and
>>>> I want to get all rows where "x == 0". Let's say:
>>>>
>>>> set.seed(45)
>>>> DT <- data.table(x = sample(c(0,5,10,15), 10, replace=TRUE),
>>>>                  y = sample(15, 10))  # 10 values, to match the length of x
>>>>
>>>> DF <- as.data.frame(DT)
>>>>
>>>> To get all rows where x == 0, it could be done with DT[x == 0]. But it
>>>> makes
>>>> sense, at least in the context of data.frames, to do equivalently,
>>>>
>>>> DF[!(DF$x), ] (or) DF[DF$x == 0, ]
>>>>
>>>> All I want to say is, I expect `DT[!(x)]` should give the same result as
>>>> `DT[x == 0]` (even though I fully understand it's not the intended
>>>> behaviour
>>>> of data.table), as it's more intuitive and less confusing.
>>>>
>>>> So, changing `!` to `~` or `NJ` is one half of the issue for me. The
>>>> other
>>>> is to replace the actual function of `!` in all contexts. I hope I came
>>>> across with what I wanted to say, better this time.
>>>>
>>>> Best,
>>>>
>>>> Arun
>>>>
>>>>
>>>> On Monday, June 10, 2013 at 10:52 AM, Matthew Dowle wrote:
>>>>
>>>>
>>>>
>>>> Hi,
>>>>
>>>> How about ~ instead of ! ?      I ruled out - previously to leave + and -
>>>> available for future use.  NJ() may be possible too.
>>>>
>>>> Matthew
>>>>
>>>>
>>>>
>>>> On 10.06.2013 09:35, Arunkumar Srinivasan wrote:
>>>>
>>>> Hi Matthew,
>>>> My view (from the last reply) more or less reflects mnel's comments
>>>> here:
>>>>
>>>> http://stackoverflow.com/questions/16239153/dtx-and-dtx-treat-na-in-x-inconsistently#comment23317096_16240143
>>>> Pasted here for convenience:
>>>> data.table is mimicking subset in its handling of NA values in logical i
>>>> arguments. -- the only issue is the ! prefix signifying a not-join, not
>>>> the
>>>> way one might expect. Perhaps the not join prefix could have been NJ
>>>> not !
>>>> to avoid this confusion -- this might be another discussion to have on
>>>> the
>>>> mailing list -- (I think it is a discussion worth having)
>>>>
>>>> Arun
>>>>
>>>> On Monday, June 10, 2013 at 10:28 AM, Arunkumar Srinivasan wrote:
>>>>
>>>> Hm, good point.  Is data.table consistent with SQL already, for both ==
>>>> and
>>>> !=, and so no change needed?
>>>>
>>>> Yes, I believe it's already consistent with SQL. However, the current
>>>> interpretation of NA (documentation) being treated as FALSE is not
>>>> needed /
>>>> untrue, imho (Please see below).
>>>>
>>>>
>>>> And it was correct for Frank to be mistaken.
>>>>
>>>> Yes, it seems like he was mistaken.
>>>>
>>>> Maybe just some more documentation and examples needed then.
>>>>
>>>> It'd be much more appropriate if the documentation said that subsetting
>>>> in data.table mimics the "subset" function (in order to be in line with
>>>> SQL) by dropping rows where the logical evaluates to NA. In the code I
>>>> pasted a couple of posts back, replacing the NAs with FALSE was not
>>>> necessary: `irows <- which(i)` makes clear that `which` is used to get
>>>> the indices before subsetting, and that fits perfectly well with the
>>>> interpretation of NA in data.table.
>>>>
>>>> Are you happy that DT[!(x==.)] and DT[x!=.] do treat NA inconsistently? :
>>>>
>>>> http://stackoverflow.com/questions/16239153/dtx-and-dtx-treat-na-in-x-inconsistently
>>>>
>>>> Ha, I like the idea behind the use of () in evaluating expressions. It's
>>>> another nice layer towards simplicity in data.table. But I still think
>>>> equivalent logical operations should not give different results. If
>>>> !(x == .) and x != . are indeed different, then I'd suggest replacing `!`
>>>> with a more appropriate name, as it's much too easy to get confused
>>>> otherwise.
>>>> In essence, either !(x == .) must evaluate to (x != .) if the underlying
>>>> meaning of the two is the same, or the `!` in `!(x==.)` must be replaced
>>>> with something more appropriate for what it's supposed to do. Personally,
>>>> I prefer the former. It would greatly tighten the structure and
>>>> consistency.
>>>>
>>>> "na.rm = TRUE/FALSE" sounds good to me.  I'd only considered nomatch
>>>> before
>>>> in the context of joins, not logical subsets.
>>>>
>>>> Yes, I find this option would give more control in evaluating
>>>> expressions
>>>> with ease in `i`, by providing both "subset" (default) and the typical
>>>> data.frame subsetting (na.rm = FALSE).
>>>> Best regards,
>>>>
>>>> Arun
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> datatable-help mailing list
>>>> datatable-help at lists.r-forge.r-project.org
>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>>>
>>>