[datatable-help] BUG: droplevels mangles subsetted data.table

Steve Lianoglou mailinglist.honeypot at gmail.com
Wed Feb 22 03:24:33 CET 2012


Ahh, right ... the copying. Good point.

Regarding the logic you suggest as to when to copy or not, how do you
feel about going the explicit route instead of trying to take a best
guess when we should/shouldn't copy via `cedta` and doing the
'data.frame behavior' by default.

By that I mean: since the droplevels function has a `...` param, can
we do something like:

droplevels.data.table <- function(x, except=NULL, do.copy=TRUE, ...) {
  if (do.copy) {
    x <- copy(x)
  }
  oldkey = key(x)
  change.me <- names(x)
  if (!is.null(except)) {
    change.me <- setdiff(change.me, names(x)[except])
  }
  for (i in change.me)) {
       if (is.factor(x[[i]])) x[,i:=droplevels(x[[i]]),with=FALSE]
   }
  setkeyv( x, oldkey )
}

yay/nay?

-steve

On Tue, Feb 21, 2012 at 6:22 PM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
> Hi. Just because as it stands it doesn't copy, so
>
>    newDT = dropfactors(DT)
>
> would change DT by reference with newDT a new pointer to that same
> modified object, whereas base would leave DT unchanged with newDT a
> modified copy.
>
> Just adding dt=copy(dt) at the start of the function would make it
> consistent,  but then how would we (data.table-aware code) call the
> non-copying version if we wanted that (which is likely needed, given the
> motivation of dropping unused levels I guess). Could continue the set*
> theme and create setdropfactors()? but that doesn't roll off the tongue.
> Or the copy() could be switched in the usual way :
>
>     if (!cedta) dt = copy(dt)
>
> and then we data.table users would just know that droplevels worked by
> reference and we should copy() first if we want a copy, in the usual
> way. Whilst not upsetting non-data.table-aware packages, since they
> would still copy. Think I prefer the switched copy, carefully
> documented, which would save yet another new function. I'm thinking that
> users' expectations of dropfactors() would probably be that it worked by
> reference on data.tables anyway (or if not, would want it to after the
> initial surprise).
>
> Matthew
>
> On Tue, 2012-02-21 at 17:52 -0500, Steve Lianoglou wrote:
>> Hi,
>>
>> I guess I'm missing something, but ... why isn't your proposed
>> droplevels.data.table consistent with base? Because the ordering of
>> the rows might change (maybe(?))?
>>
>> -steve
>>
>> On Tue, Feb 21, 2012 at 4:42 PM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>> >
>> > Yes, could do. Building on that here's a quick stab at
>> > droplevels.data.table. This does it by reference, or it could take a
>> > copy(). If it takes a copy() it would be consistent with base (probably
>> > required), but then how best to make a non-copying version available?
>> >
>> > droplevels.data.table = function(dt) {
>> >    oldkey = key( dt )
>> >    for (i in names(dt)) {
>> >        if (is.factor(dt[[i]])) dt[,i:=droplevels(dt[[i]]),with=FALSE]
>> >    }
>> >    setkeyv( dt, oldkey )
>> >    dt
>> > }
>> >
>> > On Tue, 2012-02-21 at 15:38 -0500, Prasad Chalasani wrote:
>> >> Meanwhile as a work-around, I suppose one should do:
>> >>
>> >> keys <- key( dt ) # this could in general be a large set of keys
>> >> sub_d <- droplevels( as.data.frame( dt[ name != 'a' ] ) )
>> >> sub_dt <- data.table( sub_d )
>> >> setkeyv( sub_dt, keys )
>> >>
>> >>
>> >>
>> >> On Feb 21, 2012, at 1:59 PM, Matthew Dowle wrote:
>> >>
>> >> >
>> >> > I see the problem too but (just) adding droplevels.data.table might miss
>> >> > the root cause.
>> >> >
>> >> >> because the way the
>> >> >> droplevels.data.frame method works isn't compatible with data.table
>> >> >> indexing.
>> >> >
>> >> > But it's intended to be. I can see the switch at the top of [.data.table
>> >> > is detecting the caller isn't data.table aware, and it is then dispatching
>> >> > to `[.data.frame` but why it then isn't working I'm not sure. Something to
>> >> > do with the missing j or missing drop not being passed through correctly,
>> >> > perhaps.
>> >> >
>> >> > I have heard it said (once or twice) that data.table is "almost"
>> >> > compatible with non-data.table-aware packages, but never had an example
>> >> > before. I wonder if this is it!
>> >> >
>> >> > A (fast) droplevels.data.table using := would be good anyway, though.
>> >> >
>> >> > Matthew
>> >> >
>> >> >
>> >> >
>> >> >> Hi,
>> >> >>
>> >> >> I see what the problem is -- we need to provide a
>> >> >> droplevels.data.table S3 method, because the way the
>> >> >> droplevels.data.frame method works isn't compatible with data.table
>> >> >> indexing.
>> >> >>
>> >> >> Will fix:
>> >> >>
>> >> >> https://r-forge.r-project.org/tracker/index.php?func=detail&aid=1841&group_id=240&atid=975
>> >> >>
>> >> >> Thanks for raising the flag.
>> >> >>
>> >> >> Cheers,
>> >> >> -steve
>> >> >>
>> >> >> On Tue, Feb 21, 2012 at 12:38 PM, pchalasani <pchalasani at gmail.com> wrote:
>> >> >>>  Surprising that this wasn't noticed before, or perhaps I'm not
>> >> >>> following
>> >> >>> some recommended idiom to drop levels when using  data.table. The
>> >> >>> following
>> >> >>> code illustrates the bug clearly: The bug remains regardless of whether
>> >> >>> I
>> >> >>> use "subset" or simply use dt1 = dt[ name != 'a' ].
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>>    d <- data.table(name = c('a','b','c'), value = 1:3)
>> >> >>>    dt <- data.table(d)
>> >> >>>    setkey(dt,'name')
>> >> >>>    dt1 <- subset(dt,name != 'a')  # or dt1 <- dt[ name != 'a' ]
>> >> >>>    > dt1
>> >> >>>          name value
>> >> >>>     [1,]    b     2
>> >> >>>     [2,]    c     3
>> >> >>>
>> >> >>>    > droplevels(dt1)
>> >> >>>          name value
>> >> >>>     [1,]    b     1
>> >> >>>     [2,]    c     3
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>> --
>> >> >>> View this message in context:
>> >> >>> http://r.789695.n4.nabble.com/BUG-droplevels-mangles-subsetted-data-table-tp4407694p4407694.html
>> >> >>> Sent from the datatable-help mailing list archive at Nabble.com.
>> >> >>> _______________________________________________
>> >> >>> datatable-help mailing list
>> >> >>> datatable-help at lists.r-forge.r-project.org
>> >> >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Steve Lianoglou
>> >> >> Graduate Student: Computational Systems Biology
>> >> >>  | Memorial Sloan-Kettering Cancer Center
>> >> >>  | Weill Medical College of Cornell University
>> >> >> Contact Info: http://cbio.mskcc.org/~lianos/contact
>> >> >> _______________________________________________
>> >> >> datatable-help mailing list
>> >> >> datatable-help at lists.r-forge.r-project.org
>> >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>> >> >>
>> >> >
>> >> >
>> >>
>> >
>> >
>>
>>
>>
>
>



-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact


More information about the datatable-help mailing list