[datatable-help] In 1.9.2, By with factor column do not work the same as in 1.8.10

Paul Johnson pauljohn32 at gmail.com
Mon Mar 31 02:03:34 CEST 2014


Hi
I see this problem too. I was not using data.table before 1.9, so I did not
realize it ever behaved differently.  In the examples I've tried, any
calculation that I expect to create a factor instead creates an integer
vector holding the factor's internal integer codes.
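
To make the distinction concrete, here is a minimal base-R sketch (no
data.table involved; the object name f is only for illustration). A factor
stores integer codes plus a levels attribute, so anything that drops the
levels leaves just the codes:

```r
# A factor is an integer vector of codes plus a "levels" attribute.
f <- factor(2006:2012)

codes  <- as.integer(f)    # internal codes: 1, 2, ..., 7
labels <- as.character(f)  # level labels: "2006", ..., "2012"

# Pasting the factor itself uses the labels ...
with_labels <- paste(f, 2, sep = " - ")
# ... while pasting the bare codes gives the kind of output reported for 1.9.2.
with_codes <- paste(codes, 2, sep = " - ")

print(with_labels[1])  # "2006 - 2"
print(with_codes[1])   # "1 - 2"
```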

When I noticed this, I thought maybe I needed to do more explicit casting to
make it come out as a factor. Here's my function to lag a factor, which beats
the point into the ground.

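lagFactor <- function(x, N){
    ## Work on integer codes so the same lag logic serves factors and plain vectors.
    if (is.factor(x)) {
        xlev <- levels(x)
        xnum <- as.integer(x)
    } else {
        xlev <- unique(x)
        xnum <- match(x, xlev)   # codes for non-factor input
    }
    ## Drop the last N codes and pad the front with N missing values.
    xlag <- c(rep(NA, N), xnum[-(length(xnum):(length(xnum)-N+1))])
    ## Rebuild a factor that carries the original levels.
    factor(xlev[xlag], levels = xlev)
}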

dat is a data.table with lots of rows; I can give you a copy if you want.
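
Since dat itself is not attached, here is a toy stand-in (made-up values, not
the real east2b column) showing what the lag is supposed to return outside of
data.table. This is a sketch of the same idea using head(), not the exact
function above:

```r
# Toy stand-in for dat$east2b; the values are invented for illustration.
x <- factor(c("Yes", "No", "No", "Yes", "No"), levels = c("Yes", "No"))
N <- 1

# Lag the integer codes: drop the last N, pad the front with N NAs.
codes        <- as.integer(x)
lagged_codes <- c(rep(NA_integer_, N), head(codes, -N))

# Map the codes back to labels and rebuild the factor with the same levels.
lagged <- factor(levels(x)[lagged_codes], levels = levels(x))

print(is.factor(lagged))  # TRUE
print(levels(lagged))     # "Yes" "No"
print(lagged)
```

The point is only that, in plain R, the result is still a factor with levels
"Yes" and "No".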

Now I'll show you that the result is different in and out of a data.table.

> xx <- lagFactor(dat$east2b, 1)
> table(xx)
xx
   Yes     No
130232 151885
> levels(xx)
[1] "Yes" "No"
> dat[ , xx := lagFactor(east2b, 1), by = c("sippid"), roll  = TRUE]
> table(dat$xx)

     1      2
114963 130095
> levels(dat$xx)
NULL
> table(xx, dat$xx)

xx         1      2
  Yes 114963      0
  No       0 130095


For my case, the only fix is an explicit re-factoring.
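
The explicit re-factoring could look something like the following. It is only
a sketch: refactor is a hypothetical helper name, and the codes vector is made
up to mimic the bare integers that the grouped call returned.

```r
# Hypothetical helper: rebuild a factor from integer codes and the known levels.
refactor <- function(codes, lev) {
    if (is.factor(codes)) return(codes)  # already a factor, nothing to repair
    factor(lev[as.integer(codes)], levels = lev)
}

# Made-up example: the original factor and the bare codes that came back.
east  <- factor(c("Yes", "No", "No", "Yes"), levels = c("Yes", "No"))
codes <- c(NA, 1L, 2L, 2L)

fixed <- refactor(codes, levels(east))
print(levels(fixed))  # "Yes" "No"
print(table(fixed, useNA = "ifany"))
```

Passing the levels of the source column explicitly keeps the level order
("Yes" before "No") the same as in the original factor.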

 pj


On Fri, Mar 28, 2014 at 5:29 AM, DERVIEUX Christophe <
christophe.dervieux at rte-france.com> wrote:

>  Hi,
>
> I have updated data.table package to 1.9.2 recently from 1.8.10 and I
> found errors on my previous code.
>
> See the reproducible example below:
>
> On 1.8.10:
> DT<-data.table(X=factor(2006:2012),Y=rep(1:7,2))
> DT[,Z:=paste(X,.N,sep=" - "),by=list(X)][]
>
>        X Y        Z
>  1: 2006 1 2006 - 2
>  2: 2007 2 2007 - 2
>  3: 2008 3 2008 - 2
>  4: 2009 4 2009 - 2
>  5: 2010 5 2010 - 2
>  6: 2011 6 2011 - 2
>  7: 2012 7 2012 - 2
>  8: 2006 1 2006 - 2
>  9: 2007 2 2007 - 2
> 10: 2008 3 2008 - 2
> 11: 2009 4 2009 - 2
> 12: 2010 5 2010 - 2
> 13: 2011 6 2011 - 2
> 14: 2012 7 2012 - 2
>
> In column Z, I get the level of the factor column X
> pasted with the count '.N', as expected.
>
> However, in 1.9.2, with the same code:
> DT<-data.table(X=factor(2006:2012),Y=rep(1:7,2))
> DT[,Z:=paste(X,.N,sep=" - "),by=list(X)][]
>
>        X Y     Z
>  1: 2006 1 1 - 2
>  2: 2007 2 2 - 2
>  3: 2008 3 3 - 2
>  4: 2009 4 4 - 2
>  5: 2010 5 5 - 2
>  6: 2011 6 6 - 2
>  7: 2012 7 7 - 2
>  8: 2006 1 1 - 2
>  9: 2007 2 2 - 2
> 10: 2008 3 3 - 2
> 11: 2009 4 4 - 2
> 12: 2010 5 5 - 2
> 13: 2011 6 6 - 2
> 14: 2012 7 7 - 2
>
> As a result, I do not get the levels of factor column X but the integer
> codes associated with the levels.
>
> Is this working normally? Why has it changed? Is that a bug?
>
> I use this kind of procedure to make labels for ggplot. None of my previous
> code works anymore, which is rather annoying.
>
> Thanks
>
> Christophe



-- 
Paul E. Johnson
Professor, Political Science      Assoc. Director
1541 Lilac Lane, Room 504      Center for Research Methods
University of Kansas                 University of Kansas
http://pj.freefaculty.org               http://quant.ku.edu