[datatable-help] := unclarity and possible bug?

Thu Aug 4 16:33:02 CEST 2011

On 4 August 2011 10:18, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>
>> "Chris Neff" <caneff at gmail.com> wrote in message
>> news:CAAuY0RVUsa6A-XA8cZ64CPdKrbvAqsgZTDzGKP3UzFm+GGnSnw at mail.gmail.com...
>> I think I understand the difference between DT$y <- TRUE and DT$y <-
>> rnorm(10) now, and that is that the first is just an atomic element
>> while the second is a vector. If instead I had done
>> DT$y <- 0.5
> Clarification in other reply.
>
>> I get a coercion warning.  I suppose this is an okay feature if the
>> warning consistently shows up whenever coercion happens (even if it
>> coerces successfully with no loss of precision). It is just different
>> than data.frame and without the warning I didn't understand the logic.
>
> One coercion warning is missing. Will add.
>
>> I can see the speedup of not coercing, but from a user's point of view
>> I expect DT$y <- 0.5 and DT$y <- rep(0.5, nrow(DT)) to behave
>> identically thanks to the magic of vector recycling.
>
> Same clarification in other reply.
>
>> Similarly if I've decided to do DT$y[4] <- .2  and before y was an
>> integer, I've clearly changed my mind about that aspect and want y to
>> be a numeric.
>
> I don't think that's clear at all.  The most common case is DT$y[4] <- 1.
> In that case the user has forgotten his "L", and all of a sudden his
> carefully chosen integer column gets coerced to double (automatically,
> silently, and slowly). You shouldn't change your mind on large data. Get
> the types right up front and stick to them.  If you do change your mind,
> then it's harder (I have deliberately made it harder) for you to change
> the type, which is the correct emphasis I think; i.e. explicitly change
> your mind by creating a new large vector of the type you want and use :=
> to "replace" the whole column. That is clearer for the reader of your code
> (rather than a silent automatic column type change just because you
> forgot the L).
>
>> However, I'm able to work with the way it is as long as I'm warned
>> about it.  I can see this making terribly confusing bugs for people
>
> I'd say they're being hidden from what's actually happening at the moment
> in data.frame, and they need to get their types correct up front.
> Obviously data.frame could never be changed in this regard because too
> much code depends on those coercion choices. Happy to be wrong, but let's
> get the behaviour of := correct, now.  Which is why all your feedback has
> been so great so quickly!  I don't think I'm wrong, yet.

Okay, I think I do agree.  It does make sense, and I think I've just
grown accustomed to doing bad things in R that it lets me get away
with, without telling me how bad they are.  So as of now I agree it is
working as intended (with the warning added :) ).
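
For reference, here is a minimal base-R sketch of the coercions discussed
in this thread. It uses plain vectors only (no data.table), so it shows
what R itself does rather than what := does; the variable names are made
up for illustration.

```r
# Base R only: the silent type changes that this thread is about.

x <- 1:10                 # integer vector
x[4] <- 1                 # RHS is double (the "L" was forgotten)
class_x <- class(x)       # whole vector is silently promoted to double

y <- 1:10
y[4] <- 1L                # integer RHS: the vector stays integer
class_y <- class(y)

# A bare NA is logical, so a column initialised with NA is a logical column.
# NA_integer_ and NA_real_ are the typed alternatives.
na_classes <- c(class(NA), class(NA_integer_), class(NA_real_))

# Coercing 5 to logical gives TRUE, and base R raises no warning.
five_as_logical <- as.logical(5)

print(class_x)            # "numeric"
print(class_y)            # "integer"
print(na_classes)         # "logical" "integer" "numeric"
print(five_as_logical)    # TRUE
```

The last two lines are why a column created with DT$z <- NA ends up
logical and why assigning 5 into it shows TRUE.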

>> if they don't get a warning.
>
> Agreed. Yes the coercion to logical warning is missing. I'll make it like
> the coercion to integer warning.  Also some documentation would help,
> wouldn't it ;)
>
>> -Chris
>
> On 4 August 2011 09:09, Chris Neff <caneff at gmail.com> wrote:
>> I've run the following 3 different times in new sessions:
>>
>> install.packages("data.table",
>> repos="http://R-Forge.R-project.org",type="source")
>>
>> and still DT[,z:=5] does nothing. Is there something I can check to make
>> sure that the latest version is loaded?
>>
>>
>> As for the coercion stuff, I feel that it is somewhat inconsistent
>> right now. For instance:
>>
>> > DT <- data.table(x=1:10, y=1:10)
>> > DT$y <- TRUE
>> > sapply(DT, class)
>>         x         y
>> "integer" "integer"
>>
>> > DT$y <- rnorm(10)
>> > sapply(DT, class)
>>         x         y
>> "integer" "numeric"
>>
>> So in the first case y silently coerces the logical to an integer
>> without warning, but in the second case y happily turns into a numeric
>> when need be. Why the difference?
>>
>> When I do something like DT$y <- foo, I expect that y should turn into
>> foo regardless of what y was before. If there is some reason why DT[,
>> y:=foo] should be different than DT$y <- foo, that is a secondary
>> matter, but I get mightily confused when DT$y <- foo doesn't behave
>> like data.frame.
>>
>> On 4 August 2011 08:50, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>>> Still doesn't seem to be the latest version: DT[,z:=5] should add the
>>> column (and that's tested). Otherwise this is correct and intended
>>> behaviour (although an informative warning needs adding when 5 gets
>>> coerced to the type of the column, i.e. logical; thanks for spotting).
>>> Remember as.logical(5) is TRUE without warning. So, try creating the
>>> column with NA_integer_ or NA_real_ instead. Once the column type is
>>> set, that's it. Columns aren't coerced to match the type of the RHS,
>>> unlike data.frame [which if you think about it is a big hit if the data
>>> is large].
>>>
>>> "Chris Neff" <caneff at gmail.com> wrote in message
>>> news:CAAuY0RXT7q+cm91PJ8KGkMwDApwFxM_EALb-Yu=P6ndp+LEfXg at mail.gmail.com...
>>> Ignore this second one; restarting and refreshing my data.table
>>> install now gives the proper error message when I try that. Sorry, I'm
>>> not used to being on the bleeding edge of these things and I forget to
>>> update. However, the first question is still mainly relevant:
>>>
>>> > DT <- data.table(x=1:10, y=rep(1:2,5))
>>> > DT[,z:=5]
>>>        x y
>>>  [1,]  1 1
>>>  [2,]  2 2
>>>  [3,]  3 1
>>>  [4,]  4 2
>>>  [5,]  5 1
>>>  [6,]  6 2
>>>  [7,]  7 1
>>>  [8,]  8 2
>>>  [9,]  9 1
>>> [10,] 10 2
>>> > DT[1:nrow(DT),z:=5]
>>> Error in `[.data.table`(DT, 1:nrow(DT), `:=`(z, 5)) :
>>>   Attempt to add new column(s) and set subset of rows at the same
>>>   time. Create the new column(s) first, and then you'll be able to
>>>   assign to a subset. If i is set to 1:nrow(x) then please remove that
>>>   (no need, it's faster without).
>>> > DT$z <- NA
>>> > DT[, z:=5]
>>>        x y    z
>>>  [1,]  1 1 TRUE
>>>  [2,]  2 2 TRUE
>>>  [3,]  3 1 TRUE
>>>  [4,]  4 2 TRUE
>>>  [5,]  5 1 TRUE
>>>  [6,]  6 2 TRUE
>>>  [7,]  7 1 TRUE
>>>  [8,]  8 2 TRUE
>>>  [9,]  9 1 TRUE
>>> [10,] 10 2 TRUE
>>>
>>>
>>>
>>> The return on DT[,z:=5] when I haven't initialized DT$z yet is
>>> different, but still less informative than it is when I do
>>> DT[1:nrow(DT), z:=5]. And the DT$z <- NA issue is still there.
>>>
>>> Thanks!
>>>
>>>
>>> On 4 August 2011 08:18, Chris Neff <caneff at gmail.com> wrote:
>>>> A second question while I'm playing with it. It seems from the FRs
>>>> that it doesn't support multiple := in one select, but:
>>>>
>>>> DT <- data.table(x=1:10, y=rep(1:2,5))
>>>> DT$a = 0
>>>> DT$z = 0
>>>>
>>>> DT[, list(a := y/sum(y), z := 5)]
>>>>
>>>> works just fine for me. An error gets thrown but afterwards the
>>>> columns are modified as intended. Why the error?
>>>>
>>>> > DT[,list(z:=5,a:=y/sum(y))]
>>>> z
>>>> [1] 5
>>>> [1] TRUE
>>>> a
>>>> y/sum(y)
>>>> [1] TRUE
>>>> Error in data.table(`:=`(z, 5), `:=`(a, y/sum(y))) :
>>>>   column or argument 1 is NULL
>>>> > DT
>>>>        x y z          a
>>>>  [1,]  1 1 5 0.06666667
>>>>  [2,]  2 2 5 0.13333333
>>>>  [3,]  3 1 5 0.06666667
>>>>  [4,]  4 2 5 0.13333333
>>>>  [5,]  5 1 5 0.06666667
>>>>  [6,]  6 2 5 0.13333333
>>>>  [7,]  7 1 5 0.06666667
>>>>  [8,]  8 2 5 0.13333333
>>>>  [9,]  9 1 5 0.06666667
>>>> [10,] 10 2 5 0.13333333
>>>>
>>>> -Chris
>>>>
>>>> On 4 August 2011 08:12, Chris Neff <caneff at gmail.com> wrote:
>>>>> Hi all,
>>>>>
>>>>> If I do:
>>>>>
>>>>> DT <- data.table(x=1:10, y=rep(1:2,5))
>>>>>
>>>>> Then try the following
>>>>>
>>>>> DT[, z:=5]
>>>>>
>>>>> I get:
>>>>>
>>>>> > DT[, z:=5]
>>>>> z
>>>>> [1] 5
>>>>> [1] TRUE
>>>>> NULL
>>>>>
>>>>> and if I were to do DT <- DT[, z:=5], then DT gets set to NULL.
>>>>> Alternatively if I do
>>>>>
>>>>> DT[1:10, z:=5]
>>>>>
>>>>> I get
>>>>>
>>>>> > DT=DT[1:nrow(DT),z:=5]
>>>>> z
>>>>> [1] 5
>>>>>  [1]  1  2  3  4  5  6  7  8  9 10
>>>>> Error in `:=`(z, 5) :
>>>>>   Attempt to add new column(s) and set subset of rows at the same
>>>>>   time. Create the new column(s) first, and then you'll be able to
>>>>>   assign to a subset. If i is set to 1:nrow(x) then please remove that
>>>>>   (no need, it's faster without).
>>>>>
>>>>>
>>>>> Which is more informative. So I do as it instructs:
>>>>>
>>>>> DT$z <- NA
>>>>>
>>>>> DT[, z:=5]
>>>>>
>>>>> And as output I get:
>>>>>
>>>>> > DT
>>>>>        x y    z
>>>>>  [1,]  1 1 TRUE
>>>>>  [2,]  2 2 TRUE
>>>>>  [3,]  3 1 TRUE
>>>>>  [4,]  4 2 TRUE
>>>>>  [5,]  5 1 TRUE
>>>>>  [6,]  6 2 TRUE
>>>>>  [7,]  7 1 TRUE
>>>>>  [8,]  8 2 TRUE
>>>>>  [9,]  9 1 TRUE
>>>>> [10,] 10 2 TRUE
>>>>>
>>>>>
>>>>> Why isn't z 5 as assigned? I think it is because I assigned it as
>>>>> NA, and data.table didn't know to change it to integer (although why
>>>>> it changed it to logical is another puzzle). If I instead do
>>>>>
>>>>> DT$z <- 0
>>>>>
>>>>> DT[, z:=5]
>>>>>
>>>>> It works fine.
>>>>>
>>>>> So my two points are:
>>>>>
>>>>> A) Doing DT[,z:=5] should be as informative as doing DT[1:nrow(DT),
>>>>> z:=5] with the error message.
>>>>>
>>>>> B) What went wrong with the NA assignment I did?
>>>>>
>>>>> Thanks!
>>>>> Chris
>>>>>
>>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> datatable-help mailing list
>>> datatable-help at lists.r-forge.r-project.org
>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>>
>>
>