[datatable-help] BUG: droplevels mangles subsetted data.table

Matthew Dowle mdowle at mdowle.plus.com
Wed Feb 22 00:22:59 CET 2012


Hi. Just because as it stands it doesn't copy, so

    newDT = dropfactors(DT)

would change DT by reference with newDT a new pointer to that same
modified object, whereas base would leave DT unchanged with newDT a
modified copy.

Just adding dt=copy(dt) at the start of the function would make it
consistent,  but then how would we (data.table-aware code) call the
non-copying version if we wanted that (which is likely needed, given the
motivation of dropping unused levels I guess). Could continue the set*
theme and create setdropfactors()? but that doesn't roll off the tongue.
Or the copy() could be switched in the usual way :

     if (!cedta) dt = copy(dt)

and then we data.table users would just know that droplevels worked by
reference and we should copy() first if we want a copy, in the usual
way. Whilst not upsetting non-data.table-aware packages, since they
would still copy. Think I prefer the switched copy, carefully
documented, which would save yet another new function. I'm thinking that
users' expectations of dropfactors() would probably be that it worked by
reference on data.tables anyway (or if not, would want it to after the
initial surprise).

Matthew

On Tue, 2012-02-21 at 17:52 -0500, Steve Lianoglou wrote:
> Hi,
> 
> I guess I'm missing something, but ... why isn't your proposed
> droplevels.data.table consistent with base? Because the ordering of
> the rows might change (maybe(?))?
> 
> -steve
> 
> On Tue, Feb 21, 2012 at 4:42 PM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
> >
> > Yes, could do. Building on that here's a quick stab at
> > droplevels.data.table. This does it by reference, or it could take a
> > copy(). If it takes a copy() it would be consistent with base (probably
> > required), but then how best to make a non-copying version available?
> >
> > droplevels.data.table = function(dt) {
> >    oldkey = key( dt )
> >    for (i in names(dt)) {
> >        if (is.factor(dt[[i]])) dt[,i:=droplevels(dt[[i]]),with=FALSE]
> >    }
> >    setkeyv( dt, oldkey )
> >    dt
> > }
> >
> > On Tue, 2012-02-21 at 15:38 -0500, Prasad Chalasani wrote:
> >> Meanwhile as a work-around, I suppose one should do:
> >>
> >> keys <- key( dt ) # this could in general be a large set of keys
> >> sub_d <- droplevels( as.data.frame( dt[ name != 'a' ] ) )
> >> sub_dt <- data.table( sub_d )
> >> setkeyv( sub_dt, keys )
> >>
> >>
> >>
> >> On Feb 21, 2012, at 1:59 PM, Matthew Dowle wrote:
> >>
> >> >
> >> > I see the problem too but (just) adding droplevels.data.table might miss
> >> > the root cause.
> >> >
> >> >> because the way the
> >> >> droplevels.data.frame method works isn't compatible with data.table
> >> >> indexing.
> >> >
> >> > But it's intended to be. I can see the switch at the top of [.data.table
> >> > is detecting the caller isn't data.table aware, and it is then dispatching
> >> > to `[.data.frame` but why it then isn't working I'm not sure. Something to
> >> > do with the missing j or missing drop not being passed through correctly,
> >> > perhaps.
> >> >
> >> > I have heard it said (once or twice) that data.table is "almost"
> >> > compatible with non-data.table-aware packages, but never had an example
> >> > before. I wonder if this is it!
> >> >
> >> > A (fast) droplevels.data.table using := would be good anyway, though.
> >> >
> >> > Matthew
> >> >
> >> >
> >> >
> >> >> Hi,
> >> >>
> >> >> I see what the problem is -- we need to provide a
> >> >> droplevels.data.table S3 method, because the way the
> >> >> droplevels.data.frame method works isn't compatible with data.table
> >> >> indexing.
> >> >>
> >> >> Will fix:
> >> >>
> >> >> https://r-forge.r-project.org/tracker/index.php?func=detail&aid=1841&group_id=240&atid=975
> >> >>
> >> >> Thanks for raising the flag.
> >> >>
> >> >> Cheers,
> >> >> -steve
> >> >>
> >> >> On Tue, Feb 21, 2012 at 12:38 PM, pchalasani <pchalasani at gmail.com> wrote:
> >> >>>  Surprising that this wasn't noticed before, or perhaps I'm not
> >> >>> following
> >> >>> some recommended idiom to drop levels when using  data.table. The
> >> >>> following
> >> >>> code illustrates the bug clearly: The bug remains regardless of whether
> >> >>> I
> >> >>> use "subset" or simply use dt1 = dt[ name != 'a' ].
> >> >>>
> >> >>>
> >> >>>
> >> >>>    d <- data.table(name = c('a','b','c'), value = 1:3)
> >> >>>    dt <- data.table(d)
> >> >>>    setkey(dt,'name')
> >> >>>    dt1 <- subset(dt,name != 'a')  # or dt1 <- dt[ name != 'a' ]
> >> >>>    > dt1
> >> >>>          name value
> >> >>>     [1,]    b     2
> >> >>>     [2,]    c     3
> >> >>>
> >> >>>    > droplevels(dt1)
> >> >>>          name value
> >> >>>     [1,]    b     1
> >> >>>     [2,]    c     3
> >> >>>
> >> >>>
> >> >>>
> >> >>> --
> >> >>> View this message in context:
> >> >>> http://r.789695.n4.nabble.com/BUG-droplevels-mangles-subsetted-data-table-tp4407694p4407694.html
> >> >>> Sent from the datatable-help mailing list archive at Nabble.com.
> >> >>> _______________________________________________
> >> >>> datatable-help mailing list
> >> >>> datatable-help at lists.r-forge.r-project.org
> >> >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Steve Lianoglou
> >> >> Graduate Student: Computational Systems Biology
> >> >>  | Memorial Sloan-Kettering Cancer Center
> >> >>  | Weill Medical College of Cornell University
> >> >> Contact Info: http://cbio.mskcc.org/~lianos/contact
> >> >> _______________________________________________
> >> >> datatable-help mailing list
> >> >> datatable-help at lists.r-forge.r-project.org
> >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> >> >>
> >> >
> >> >
> >>
> >
> >
> 
> 
> 




More information about the datatable-help mailing list