[datatable-help] BUG: droplevels mangles subsetted data.table

Matthew Dowle mdowle at mdowle.plus.com
Wed Feb 29 01:17:18 CET 2012


Root cause now fixed, in v1.8.0 on R-Forge :

o data.table unaware packages that use DF[i] and DF[i]<-value syntax
  were not compatible with data.table, fixed. Thanks to Prasad Chalasani
  for providing a reproducible example with base::droplevels(). Test
  added.


In terms of adding droplevels.data.table for speed, to save parameters,
and since copy() can appear in line, I think I prefer just :
   droplevels(DT)        # by reference when caller is dt-aware,
                         # otherwise copied.
   droplevels(copy(DT))  # return modified copy in dt-aware code 

i.e. by default do.copy=FALSE. I think that's the default most DT users
would want e.g. DT=droplevels(DT), not DT2=droplevels(DT). Rule of thumb
being that DTs aren't copied, but if a copy is needed then take a copy()
first.

Matthew

On Tue, 2012-02-21 at 21:24 -0500, Steve Lianoglou wrote:
> Ahh, right ... the copying. Good point.
> 
> Regarding the logic you suggest as to when to copy or not, how do you
> feel about going the explicit route instead of trying to take a best
> guess when we should/shouldn't copy via `cedta` and doing the
> 'data.frame behavior' by default.
> 
> By that I mean: since the droplevels function has a `...` param, can
> we do something like:
> 
> droplevels.data.table <- function(x, except=NULL, do.copy=TRUE, ...) {
>   if (do.copy) {
>     x <- copy(x)
>   }
>   oldkey = key(x)
>   change.me <- names(x)
>   if (!is.null(except)) {
>     change.me <- setdiff(change.me, names(x)[except])
>   }
>   for (i in change.me)) {
>        if (is.factor(x[[i]])) x[,i:=droplevels(x[[i]]),with=FALSE]
>    }
>   setkeyv( x, oldkey )
> }
> 
> yay/nay?
> 
> -steve
> 
> On Tue, Feb 21, 2012 at 6:22 PM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
> > Hi. Just because as it stands it doesn't copy, so
> >
> >    newDT = dropfactors(DT)
> >
> > would change DT by reference with newDT a new pointer to that same
> > modified object, whereas base would leave DT unchanged with newDT a
> > modified copy.
> >
> > Just adding dt=copy(dt) at the start of the function would make it
> > consistent,  but then how would we (data.table-aware code) call the
> > non-copying version if we wanted that (which is likely needed, given the
> > motivation of dropping unused levels I guess). Could continue the set*
> > theme and create setdropfactors()? but that doesn't roll off the tongue.
> > Or the copy() could be switched in the usual way :
> >
> >     if (!cedta) dt = copy(dt)
> >
> > and then we data.table users would just know that droplevels worked by
> > reference and we should copy() first if we want a copy, in the usual
> > way. Whilst not upsetting non-data.table-aware packages, since they
> > would still copy. Think I prefer the switched copy, carefully
> > documented, which would save yet another new function. I'm thinking that
> > users' expectations of dropfactors() would probably be that it worked by
> > reference on data.tables anyway (or if not, would want it to after the
> > initial surprise).
> >
> > Matthew
> >
> > On Tue, 2012-02-21 at 17:52 -0500, Steve Lianoglou wrote:
> >> Hi,
> >>
> >> I guess I'm missing something, but ... why isn't your proposed
> >> droplevels.data.table consistent with base? Because the ordering of
> >> the rows might change (maybe(?))?
> >>
> >> -steve
> >>
> >> On Tue, Feb 21, 2012 at 4:42 PM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
> >> >
> >> > Yes, could do. Building on that here's a quick stab at
> >> > droplevels.data.table. This does it by reference, or it could take a
> >> > copy(). If it takes a copy() it would be consistent with base (probably
> >> > required), but then how best to make a non-copying version available?
> >> >
> >> > droplevels.data.table = function(dt) {
> >> >    oldkey = key( dt )
> >> >    for (i in names(dt)) {
> >> >        if (is.factor(dt[[i]])) dt[,i:=droplevels(dt[[i]]),with=FALSE]
> >> >    }
> >> >    setkeyv( dt, oldkey )
> >> >    dt
> >> > }
> >> >
> >> > On Tue, 2012-02-21 at 15:38 -0500, Prasad Chalasani wrote:
> >> >> Meanwhile as a work-around, I suppose one should do:
> >> >>
> >> >> keys <- key( dt ) # this could in general be a large set of keys
> >> >> sub_d <- droplevels( as.data.frame( dt[ name != 'a' ] ) )
> >> >> sub_dt <- data.table( sub_d )
> >> >> setkeyv( sub_dt, keys )
> >> >>
> >> >>
> >> >>
> >> >> On Feb 21, 2012, at 1:59 PM, Matthew Dowle wrote:
> >> >>
> >> >> >
> >> >> > I see the problem too but (just) adding droplevels.data.table might miss
> >> >> > the root cause.
> >> >> >
> >> >> >> because the way the
> >> >> >> droplevels.data.frame method works isn't compatible with data.table
> >> >> >> indexing.
> >> >> >
> >> >> > But it's intended to be. I can see the switch at the top of [.data.table
> >> >> > is detecting the caller isn't data.table aware, and it is then dispatching
> >> >> > to `[.data.frame` but why it then isn't working I'm not sure. Something to
> >> >> > do with the missing j or missing drop not being passed through correctly,
> >> >> > perhaps.
> >> >> >
> >> >> > I have heard it said (once or twice) that data.table is "almost"
> >> >> > compatible with non-data.table-aware packages, but never had an example
> >> >> > before. I wonder if this is it!
> >> >> >
> >> >> > A (fast) droplevels.data.table using := would be good anyway, though.
> >> >> >
> >> >> > Matthew
> >> >> >
> >> >> >
> >> >> >
> >> >> >> Hi,
> >> >> >>
> >> >> >> I see what the problem is -- we need to provide a
> >> >> >> droplevels.data.table S3 method, because the way the
> >> >> >> droplevels.data.frame method works isn't compatible with data.table
> >> >> >> indexing.
> >> >> >>
> >> >> >> Will fix:
> >> >> >>
> >> >> >> https://r-forge.r-project.org/tracker/index.php?func=detail&aid=1841&group_id=240&atid=975
> >> >> >>
> >> >> >> Thanks for raising the flag.
> >> >> >>
> >> >> >> Cheers,
> >> >> >> -steve
> >> >> >>
> >> >> >> On Tue, Feb 21, 2012 at 12:38 PM, pchalasani <pchalasani at gmail.com> wrote:
> >> >> >>>  Surprising that this wasn't noticed before, or perhaps I'm not
> >> >> >>> following
> >> >> >>> some recommended idiom to drop levels when using  data.table. The
> >> >> >>> following
> >> >> >>> code illustrates the bug clearly: The bug remains regardless of whether
> >> >> >>> I
> >> >> >>> use "subset" or simply use dt1 = dt[ name != 'a' ].
> >> >> >>>
> >> >> >>>
> >> >> >>>
> >> >> >>>    d <- data.table(name = c('a','b','c'), value = 1:3)
> >> >> >>>    dt <- data.table(d)
> >> >> >>>    setkey(dt,'name')
> >> >> >>>    dt1 <- subset(dt,name != 'a')  # or dt1 <- dt[ name != 'a' ]
> >> >> >>>    > dt1
> >> >> >>>          name value
> >> >> >>>     [1,]    b     2
> >> >> >>>     [2,]    c     3
> >> >> >>>
> >> >> >>>    > droplevels(dt1)
> >> >> >>>          name value
> >> >> >>>     [1,]    b     1
> >> >> >>>     [2,]    c     3
> >> >> >>>
> >> >> >>>
> >> >> >>>
> >> >> >>> --
> >> >> >>> View this message in context:
> >> >> >>> http://r.789695.n4.nabble.com/BUG-droplevels-mangles-subsetted-data-table-tp4407694p4407694.html
> >> >> >>> Sent from the datatable-help mailing list archive at Nabble.com.
> >> >> >>> _______________________________________________
> >> >> >>> datatable-help mailing list
> >> >> >>> datatable-help at lists.r-forge.r-project.org
> >> >> >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> --
> >> >> >> Steve Lianoglou
> >> >> >> Graduate Student: Computational Systems Biology
> >> >> >>  | Memorial Sloan-Kettering Cancer Center
> >> >> >>  | Weill Medical College of Cornell University
> >> >> >> Contact Info: http://cbio.mskcc.org/~lianos/contact
> >> >> >> _______________________________________________
> >> >> >> datatable-help mailing list
> >> >> >> datatable-help at lists.r-forge.r-project.org
> >> >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> >> >> >>
> >> >> >
> >> >> >
> >> >>
> >> >
> >> >
> >>
> >>
> >>
> >
> >
> 
> 
> 




More information about the datatable-help mailing list