[datatable-help] BUG: droplevels mangles subsetted data.table
Joseph Voelkel
jgvcqa at rit.edu
Wed Feb 22 03:59:05 CET 2012
This discussion is too arcane for me :) , but I can live with that.
In the meantime, I think my fix would be just to convert the factor to character and then back:
> d <- data.table(name = c('a','b','c'), value = 1:3)
> dt <- data.table(d)
> setkey(dt,'name')
> dt1 <- subset(dt,name != 'a') # or dt1 <- dt[ name != 'a' ]
> dt1
name value
[1,] b 2
[2,] c 3
> droplevels(dt1) # ouch
name value
[1,] b 1
[2,] c 3
>
> # A cheap fix
> dt1$name<-factor(as.character(dt1$name))
> setkey(dt1,name)
> dt1
name value
[1,] b 2
[2,] c 3
>
-----Original Message-----
From: datatable-help-bounces at r-forge.wu-wien.ac.at [mailto:datatable-help-bounces at r-forge.wu-wien.ac.at] On Behalf Of Steve Lianoglou
Sent: Tuesday, February 21, 2012 9:25 PM
To: mdowle at mdowle.plus.com
Cc: datatable-help at r-forge.wu-wien.ac.at
Subject: Re: [datatable-help] BUG: droplevels mangles subsetted data.table
Ahh, right ... the copying. Good point.
Regarding the logic you suggest as to when to copy or not, how do you feel about going the explicit route instead of trying to take a best guess when we should/shouldn't copy via `cedta` and doing the 'data.frame behavior' by default.
By that I mean: since the droplevels function has a `...` param, can we do something like:
droplevels.data.table <- function(x, except=NULL, do.copy=TRUE, ...) {
if (do.copy) {
x <- copy(x)
}
oldkey = key(x)
change.me <- names(x)
if (!is.null(except)) {
change.me <- setdiff(change.me, names(x)[except])
}
for (i in change.me)) {
if (is.factor(x[[i]])) x[,i:=droplevels(x[[i]]),with=FALSE]
}
setkeyv( x, oldkey )
}
yay/nay?
-steve
On Tue, Feb 21, 2012 at 6:22 PM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
> Hi. Just because as it stands it doesn't copy, so
>
> newDT = dropfactors(DT)
>
> would change DT by reference with newDT a new pointer to that same
> modified object, whereas base would leave DT unchanged with newDT a
> modified copy.
>
> Just adding dt=copy(dt) at the start of the function would make it
> consistent, but then how would we (data.table-aware code) call the
> non-copying version if we wanted that (which is likely needed, given
> the motivation of dropping unused levels I guess). Could continue the
> set* theme and create setdropfactors()? but that doesn't roll off the tongue.
> Or the copy() could be switched in the usual way :
>
> if (!cedta) dt = copy(dt)
>
> and then we data.table users would just know that droplevels worked by
> reference and we should copy() first if we want a copy, in the usual
> way. Whilst not upsetting non-data.table-aware packages, since they
> would still copy. Think I prefer the switched copy, carefully
> documented, which would save yet another new function. I'm thinking
> that users' expectations of dropfactors() would probably be that it
> worked by reference on data.tables anyway (or if not, would want it to
> after the initial surprise).
>
> Matthew
>
> On Tue, 2012-02-21 at 17:52 -0500, Steve Lianoglou wrote:
>> Hi,
>>
>> I guess I'm missing something, but ... why isn't your proposed
>> droplevels.data.table consistent with base? Because the ordering of
>> the rows might change (maybe(?))?
>>
>> -steve
>>
>> On Tue, Feb 21, 2012 at 4:42 PM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>> >
>> > Yes, could do. Building on that here's a quick stab at
>> > droplevels.data.table. This does it by reference, or it could take
>> > a copy(). If it takes a copy() it would be consistent with base
>> > (probably required), but then how best to make a non-copying version available?
>> >
>> > droplevels.data.table = function(dt) {
>> > oldkey = key( dt )
>> > for (i in names(dt)) {
>> > if (is.factor(dt[[i]]))
>> > dt[,i:=droplevels(dt[[i]]),with=FALSE]
>> > }
>> > setkeyv( dt, oldkey )
>> > dt
>> > }
>> >
>> > On Tue, 2012-02-21 at 15:38 -0500, Prasad Chalasani wrote:
>> >> Meanwhile as a work-around, I suppose one should do:
>> >>
>> >> keys <- key( dt ) # this could in general be a large set of keys
>> >> sub_d <- droplevels( as.data.frame( dt[ name != 'a' ] ) ) sub_dt
>> >> <- data.table( sub_d ) setkeyv( sub_dt, keys )
>> >>
>> >>
>> >>
>> >> On Feb 21, 2012, at 1:59 PM, Matthew Dowle wrote:
>> >>
>> >> >
>> >> > I see the problem too but (just) adding droplevels.data.table
>> >> > might miss the root cause.
>> >> >
>> >> >> because the way the
>> >> >> droplevels.data.frame method works isn't compatible with
>> >> >> data.table indexing.
>> >> >
>> >> > But it's intended to be. I can see the switch at the top of
>> >> > [.data.table is detecting the caller isn't data.table aware, and
>> >> > it is then dispatching to `[.data.frame` but why it then isn't
>> >> > working I'm not sure. Something to do with the missing j or
>> >> > missing drop not being passed through correctly, perhaps.
>> >> >
>> >> > I have heard it said (once or twice) that data.table is "almost"
>> >> > compatible with non-data.table-aware packages, but never had an
>> >> > example before. I wonder if this is it!
>> >> >
>> >> > A (fast) droplevels.data.table using := would be good anyway, though.
>> >> >
>> >> > Matthew
>> >> >
>> >> >
>> >> >
>> >> >> Hi,
>> >> >>
>> >> >> I see what the problem is -- we need to provide a
>> >> >> droplevels.data.table S3 method, because the way the
>> >> >> droplevels.data.frame method works isn't compatible with
>> >> >> data.table indexing.
>> >> >>
>> >> >> Will fix:
>> >> >>
>> >> >> https://r-forge.r-project.org/tracker/index.php?func=detail&aid
>> >> >> =1841&group_id=240&atid=975
>> >> >>
>> >> >> Thanks for raising the flag.
>> >> >>
>> >> >> Cheers,
>> >> >> -steve
>> >> >>
>> >> >> On Tue, Feb 21, 2012 at 12:38 PM, pchalasani <pchalasani at gmail.com> wrote:
>> >> >>> Surprising that this wasn't noticed before, or perhaps I'm
>> >> >>> not following some recommended idiom to drop levels when using
>> >> >>> data.table. The following code illustrates the bug clearly:
>> >> >>> The bug remains regardless of whether I use "subset" or simply
>> >> >>> use dt1 = dt[ name != 'a' ].
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>> d <- data.table(name = c('a','b','c'), value = 1:3)
>> >> >>> dt <- data.table(d)
>> >> >>> setkey(dt,'name')
>> >> >>> dt1 <- subset(dt,name != 'a') # or dt1 <- dt[ name != 'a'
>> >> >>> ]
>> >> >>> > dt1
>> >> >>> name value
>> >> >>> [1,] b 2
>> >> >>> [2,] c 3
>> >> >>>
>> >> >>> > droplevels(dt1)
>> >> >>> name value
>> >> >>> [1,] b 1
>> >> >>> [2,] c 3
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>> --
>> >> >>> View this message in context:
>> >> >>> http://r.789695.n4.nabble.com/BUG-droplevels-mangles-subsetted
>> >> >>> -data-table-tp4407694p4407694.html
>> >> >>> Sent from the datatable-help mailing list archive at Nabble.com.
>> >> >>> _______________________________________________
>> >> >>> datatable-help mailing list
>> >> >>> datatable-help at lists.r-forge.r-project.org
>> >> >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/d
>> >> >>> atatable-help
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Steve Lianoglou
>> >> >> Graduate Student: Computational Systems Biology
>> >> >> | Memorial Sloan-Kettering Cancer Center
>> >> >> | Weill Medical College of Cornell University Contact Info:
>> >> >> http://cbio.mskcc.org/~lianos/contact
>> >> >> _______________________________________________
>> >> >> datatable-help mailing list
>> >> >> datatable-help at lists.r-forge.r-project.org
>> >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/da
>> >> >> tatable-help
>> >> >>
>> >> >
>> >> >
>> >>
>> >
>> >
>>
>>
>>
>
>
--
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
_______________________________________________
datatable-help mailing list
datatable-help at lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
More information about the datatable-help
mailing list