[datatable-help] Characters and Factor

Matthew Dowle mdowle at mdowle.plus.com
Fri Apr 27 10:24:33 CEST 2012


Damian Betebenner <dbetebenner <at> nciea.org> writes:
> All, 
>  
> Not sure how to characterize this (a new feature or a bug), but the behavior
> is causing problems in code I’ve written that previously worked as I expected.
> I have integers bigger than 2^32 that I have to encode as factors. After doing
> some data.table stuff (like below), it reorders the factors as characters and
> corrupts subsequent merges back to tables where these factors are ordered as
> “integers”.
>  
> Remedies?
>  
> tmp.dt1 <- data.table(X=as.factor(1:10), Y=rnorm(10), key="X")
> tmp.dt2 <- data.table(X=as.factor(101:110), Y=rnorm(10), key="X")
>  
> rbind(tmp.dt1, tmp.dt2)
>  
>        V1           Y
>  [1,]   1  0.47655333
>  [2,]   2 -0.43962704
>  [3,]   3 -0.78312270
>  [4,]   4  1.88935392
>  [5,]   5 -0.56413463
>  [6,]   6 -0.69177767
>  [7,]   7 -0.09942112
>  [8,]   8  0.21452552
>  [9,]   9 -0.86136222
> [10,]  10  0.55623427
> [11,] 101  0.02090036
> [12,] 102 -0.41816481
> [13,] 103  0.04798975
> [14,] 104  0.93709966
> [15,] 105 -0.95835181
> [16,] 106  0.82207890
> [17,] 107  0.85902512
> [18,] 108  1.33042023
> [19,] 109  0.22596849
> [20,] 110  0.99209054
>  
> data.table(rbind(tmp.dt1, tmp.dt2), key="X")
>         X           Y
>  [1,]   1 -0.16225884
>  [2,]  10  0.82979617
>  [3,] 101  0.22412653
>  [4,] 102 -0.24841475
>  [5,] 103 -0.09914182
>  [6,] 104 -1.47982574
>  [7,] 105 -1.79957210
>  [8,] 106 -2.01715940
>  [9,] 107 -0.81900855
> [10,] 108  0.26357249
> [11,] 109 -1.22742679
> [12,] 110  0.64773494
> [13,]   2 -0.98312948
> [14,]   3  0.99937771
> [15,]   4 -1.72355977
> [16,]   5 -2.02481542
> [17,]   6 -0.07222688
> [18,]   7  0.17921321
> [19,]   8 -0.92102526
> [20,]   9 -0.14129584  
>  
Hi,

I've had a quick look but can't quite grasp it. rbind.data.table calls 
data.table::c.factor() to concatenate factor columns, and that reorders the 
levels on the new combined factor. I guess it shouldn't now that unordered 
factor levels are allowed and supported in data.table. But if that's it, it 
wouldn't have worked before either and it's always been a problem. Also I don't 
see how a corruption could occur since joins between two factor columns with 
different levels (each possibly unordered) should work fine. Could you provide 
some more details before and after showing the change exactly?
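
For reference, the reordering in the quoted output can be reproduced in base R alone. This is only a sketch of the effect and not data.table's internal c.factor(); the variable names (f1, f2, naive, kept) are made up for illustration. It assumes the two factors are combined by going through character, which sorts the new levels as strings unless the levels are passed explicitly.

```r
# Base R sketch (assumption: not how data.table combines factors internally).
f1 <- factor(1:10)      # levels "1".."10", in numeric order
f2 <- factor(101:110)   # levels "101".."110", in numeric order

vals <- c(as.character(f1), as.character(f2))

# Rebuilding the factor from character sorts the levels as strings.
naive <- factor(vals)

# Passing the original levels explicitly keeps their existing order.
kept <- factor(vals, levels = unique(c(levels(f1), levels(f2))))

print(levels(naive))
print(levels(kept))
```

The naive version puts "10" and "101" ahead of "2", which matches the row order in the keyed table above; the second version keeps 1..10 followed by 101..110.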

I was going to say 'just use character', but had never considered ordered 
integers greater than 2^32 as the use case. Character sorts as strings ("10" 
before "2"), so character type wouldn't work for them. It's a new one on me, so 
either way some new tests are needed.
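
As a possible stopgap in base R, a factor whose levels follow numeric order can be built from ids held as strings. This is a sketch under assumptions: the ids below are invented, and the trick of ordering by string length first only works for non-negative integers written without leading zeros or signs. It avoids as.numeric(), which would lose precision for very large ids.

```r
# Hypothetical ids, held as strings because they exceed R's integer range.
ids <- c("4294967297", "99", "4294967296", "100")

# Plain string sorting does not give numeric order.
lexicographic <- sort(ids, method = "radix")

# Order by number of digits first, then by string within equal length.
# Assumes non-negative integers with no leading zeros.
u <- unique(ids)
numeric_order <- u[order(nchar(u), u, method = "radix")]

# Factor whose levels are in numeric order.
f <- factor(ids, levels = numeric_order)

print(lexicographic)
print(levels(f))
```

Whether data.table keeps this level order through rbind and setkey is the open question in this thread, so it would need checking against the version in use.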

Finally, there is this fix in v1.8.1 that might be involved somehow :

o Joining a factor column with unsorted and unused levels to a character
  column now matches properly, fixing #1922. Thanks to Christoph Jäckel for
  the reproducible example. Test added.

Matthew



