[datatable-help] Follow-up on subsetting data.table with NAs
Matthew Dowle
mdowle at mdowle.plus.com
Mon Jun 10 17:28:20 CEST 2013
Hi Arun,

Indeed. ! was introduced for not-join i.e. X[!Y] where i is type
data.table. Extending it to vectors seemed to make sense at the time;
e.g., X[!"foo"] and X[!3:6] (rather than the X[-3:6] mistake where
X[-(3:6)] was intended) were in my mind. I think of everything as a
join really; e.g., "where rownumber = i".

But I think I'm fine with ! being not-join for data.table/list i only.
Or is it just logical vector i to be turned off only, and could leave !
as-is for character and integer vector i?

Matthew
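
A note on the X[-3:6] mistake mentioned above: in base R, unary minus
binds more tightly than `:`, so -3:6 is the sequence -3, -2, ..., 6 and
not "drop 3 to 6". A minimal base R sketch of the difference follows; the
vector x is a made-up example, not taken from the thread.

```r
# Base R only. The vector x is a hypothetical example, not from the thread.
x <- letters[1:10]

# Unary minus binds tighter than `:`, so -3:6 means (-3):6.
idx <- -3:6
stopifnot(identical(idx, seq(-3L, 6L)))

# The parse trees show the difference: the top-level call of -3:6 is `:`,
# while the top-level call of !3:6 is `!` (i.e. it parses as !(3:6)).
top_minus <- quote(-3:6)[[1]]
top_bang  <- quote(!3:6)[[1]]

# Mixing negative and positive subscripts is an error in base R,
# so x[-3:6] fails instead of dropping elements 3 to 6.
res <- tryCatch(x[-3:6], error = function(e) "error")

# What was intended: drop elements 3 through 6.
kept <- x[-(3:6)]
```

Because ! has lower precedence than `:`, a prefix written as !3:6 already
applies to the whole range 3:6, which seems to be the property the
X[!3:6] form relies on.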
On 10.06.2013 15:52, Arunkumar Srinivasan wrote:
> Matthew,
> It just occurred to me. I'd be glad if you can clarify this. The
> operation is supposed to be "Not Join". Which means, I'd expect the
> "!" to be used with "J" as in:
> dt <- data.table(x=c(0,0,1,1,3), y=1:5)
> setkey(dt, "x")
> dt[J(c(1,3))] # join
>
>    x y
> 1: 1 3
> 2: 1 4
> 3: 3 5
> dt[!J(c(1,3))]
>
>    x y
> 1: 0 1
> 2: 0 2
> Here the concept of "Not Join" with the use of "!J(.)" makes total
> sense. However, extending it to not-join for logical vectors is what
> seems to be an issue. It's more of a logical indexing than a join (at
> least in my mind). So, if it is possible to distinguish between "!"
> and "!J" (by checking if `i` is a data.table or not) to tell if it's a
> subsetting by logical vector or subsetting by "data.table" and then
> deciding what to do, would that resolve this issue? If not, what's the
> reason behind using "!" as a not-join during logical indexing? Is it
> still considered as a not-join??
> Just a thought. I hope it makes at least a little sense.
>
> Best,
> Arun
>
> On Monday, June 10, 2013 at 4:35 PM, Matthew Dowle wrote:
>
>> Hm, another good point. We need ~ for formulae, although I can't
>> imagine a formula in i (only in j). But in both i and j we might want
>> to get(x).
>> I thought about ^ i.e. X[^Y] in the spirit of regular expression
>> syntax, but ^ doesn't parse with a RHS only. Needs to be parsable as a
>> prefix.
>> - maybe then? Consistent with - meaning in R. I don't think I actually
>> had a specific use in mind for - and +, to reserve them for, but at
>> the time it just seemed a shame to use up one of -/+ without defining
>> the other. If - does a not join, then, might + be more like merge()
>> (i.e. returning the union of the rows in x and i by join). I think I
>> had something like that in mind, but hadn't thought it through.
>> Some might say it should be a new argument e.g. notjoin=TRUE, but my
>> thinking there is readability, since we often have many lines in i, j
>> and by in that order, and if the "notjoin=TRUE" followed afterwards it
>> would be far away from the i argument to which it applies. If we
>> incorporate merge() into X[Y] using X[+Y] then it might avoid adding
>> yet more parameters, too.
>> On 10.06.2013 15:02, Gabor Grothendieck wrote:
>>
>>> The problem with ~ is that it is using up a special character (of
>>> which there are only a few) for a case that does not occur much.
>>> I can think of other things that ~ might be better used for. For
>>> example, perhaps ~ x could mean get(x). One aspect of data.table that
>>> tends to be difficult is when you don't know the variable name ahead
>>> of time and this would give a way to specify it concisely.
>>> On Mon, Jun 10, 2013 at 5:21 AM, Arunkumar Srinivasan
>>> <aragorn168b at gmail.com [5]> wrote:
>>>
>>>> Matthew,
>>>> How about ~ instead of ! ? I ruled out - previously to leave + and -
>>>> available for future use. NJ() may be possible too.
>>>> Both "NJ()" and "~" are okay for me.
>>>> That result makes perfect sense to me. I don't think of !(x==.) being
>>>> the same as x!=. ! is simply a prefix. It's all the rows that aren't
>>>> returned if the ! prefix wasn't there.
>>>> I understand that `DT[!(x)]` does what `data.table` is designed to do
>>>> currently. What I failed to mention was that if one were to consider
>>>> implementing `!(x==.)` as the same as `x != .` then this behaviour
>>>> has to be changed. Let's forget this point for a moment.
>>>> That needs to be fixed. But we're getting quite theoretical here and
>>>> far away from common use cases. Why would we ever have row numbers of
>>>> the table, as a column of the table itself and want to select the
>>>> rows by number not mentioned in that column?
>>>> Probably I did not choose a good example. Suppose that I've a
>>>> data.table and I want to get all rows where "x == 0". Let's say:
>>>> set.seed(45)
>>>> DT <- data.table(x = sample(c(0,5,10,15), 10, replace=TRUE),
>>>>                  y = sample(15))
>>>> DF <- as.data.frame(DT)
>>>> To get all rows where x == 0, it could be done with DT[x == 0]. But
>>>> it makes sense, at least in the context of data.frames, to do
>>>> equivalently,
>>>> DF[!(DF$x), ] (or) DF[DF$x == 0, ]
>>>> All I want to say is, I expect `DT[!(x)]` should give the same result
>>>> as `DT[x == 0]` (even though I fully understand it's not the intended
>>>> behaviour of data.table), as it's more intuitive and less confusing.
>>>> So, changing `!` to `~` or `NJ` is one half of the issue for me. The
>>>> other is to replace the actual function of `!` in all contexts. I
>>>> hope I came across with what I wanted to say, better this time.
>>>> Best,
>>>> Arun
>>>> On Monday, June 10, 2013 at 10:52 AM, Matthew Dowle wrote:
>>>> Hi,
>>>> How about ~ instead of ! ? I ruled out - previously to leave + and -
>>>> available for future use. NJ() may be possible too.
>>>> Matthew
>>>> On 10.06.2013 09:35, Arunkumar Srinivasan wrote:
>>>> Hi Matthew,
>>>> My view (from the last reply) more or less reflects mnel's comments
>>>> here:
>>>> http://stackoverflow.com/questions/16239153/dtx-and-dtx-treat-na-in-x-inconsistently#comment23317096_16240143 [1]
>>>> Pasted here for convenience:
>>>> data.table is mimicking subset in its handling of NA values in
>>>> logical i arguments. -- the only issue is the ! prefix signifying a
>>>> not-join, not the way one might expect. Perhaps the not join prefix
>>>> could have been NJ not ! to avoid this confusion -- this might be
>>>> another discussion to have on the mailing list -- (I think it is a
>>>> discussion worth having)
>>>> Arun
>>>> On Monday, June 10, 2013 at 10:28 AM, Arunkumar Srinivasan wrote:
>>>> Hm, good point. Is data.table consistent with SQL already, for both
>>>> == and !=, and so no change needed?
>>>> Yes, I believe it's already consistent with SQL. However, the current
>>>> interpretation of NA (documentation) being treated as FALSE is not
>>>> needed / untrue, imho (Please see below).
>>>> And it was correct for Frank to be mistaken.
>>>> Yes, it seems like he was mistaken.
>>>> Maybe just some more documentation and examples needed then.
>>>> It'd be much more appropriate if the documentation reflects the role
>>>> of subsetting in data.table mimicking "subset" function (in order to
>>>> be in line with SQL) by dropping NA evaluated logicals. In the code I
>>>> pasted a couple of posts before, the lines where NAs are replaced
>>>> with FALSE were not necessary, as `irows <- which(i)` makes clear
>>>> that `which` is being used to get indices and then subset; this fits
>>>> perfectly well with the interpretation of NA in data.table.
>>>> Are you happy that DT[!(x==.)] and DT[x!=.] do treat NA
>>>> inconsistently? :
>>>> http://stackoverflow.com/questions/16239153/dtx-and-dtx-treat-na-in-x-inconsistently [2]
>>>> Ha, I like the idea behind the use of () in evaluating expressions.
>>>> It's another nice layer towards simplicity in data.table. But I still
>>>> think there should not be an inconsistency in equivalent logical
>>>> operations to provide different results. If !(x== .) and x != . are
>>>> indeed different, then I'd suppose replacing `!` with a more
>>>> appropriate name as it's much easier to get confused otherwise.
>>>> In essence, either !(x == .) must evaluate to (x != .) if the
>>>> underlying meaning of these are the same, or the `!` in `!(x==.)`
>>>> must be replaced to something that's more appropriate for what it's
>>>> supposed to be. Personally, I prefer the former. It would greatly
>>>> tighten the structure and consistency.
>>>> "na.rm = TRUE/FALSE" sounds good to me. I'd only considered nomatch
>>>> before in the context of joins, not logical subsets.
>>>> Yes, I find this option would give more control in evaluating
>>>> expressions with ease in `i`, by providing both "subset" (default)
>>>> and the typical data.frame subsetting (na.rm = FALSE).
>>>> Best regards,
>>>> Arun
>>>> _______________________________________________
>>>> datatable-help mailing list
>>>> datatable-help at lists.r-forge.r-project.org [3]
>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [4]
Links:
------
[1] http://stackoverflow.com/questions/16239153/dtx-and-dtx-treat-na-in-x-inconsistently#comment23317096_16240143
[2] http://stackoverflow.com/questions/16239153/dtx-and-dtx-treat-na-in-x-inconsistently
[3] mailto:datatable-help at lists.r-forge.r-project.org
[4] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
[5] mailto:aragorn168b at gmail.com