[datatable-help] Characters and Factor
Matthew Dowle
mdowle at mdowle.plus.com
Fri Apr 27 10:24:33 CEST 2012
Damian Betebenner <dbetebenner <at> nciea.org> writes:
> All,
>
> Not sure how to characterize this (a new feature or a bug) but the behavior
> is causing problems in code I’ve written that previously worked as I
> expected. I have integers that are bigger than 2^32 that I have to encode as
> factors. After doing some data.table stuff (like below), it reorders the
> factors as characters and corrupts subsequent merges back to tables where
> these factors are ordered as “integers”.
>
> Remedies?
>
> tmp.dt1 <- data.table(X=as.factor(1:10), Y=rnorm(10), key="X")
> tmp.dt2 <- data.table(X=as.factor(101:110), Y=rnorm(10), key="X")
>
> rbind(tmp.dt1, tmp.dt2)
>
>        V1           Y
>  [1,]   1  0.47655333
>  [2,]   2 -0.43962704
>  [3,]   3 -0.78312270
>  [4,]   4  1.88935392
>  [5,]   5 -0.56413463
>  [6,]   6 -0.69177767
>  [7,]   7 -0.09942112
>  [8,]   8  0.21452552
>  [9,]   9 -0.86136222
> [10,]  10  0.55623427
> [11,] 101  0.02090036
> [12,] 102 -0.41816481
> [13,] 103  0.04798975
> [14,] 104  0.93709966
> [15,] 105 -0.95835181
> [16,] 106  0.82207890
> [17,] 107  0.85902512
> [18,] 108  1.33042023
> [19,] 109  0.22596849
> [20,] 110  0.99209054
>
> data.table(rbind(tmp.dt1, tmp.dt2), key="X")
>         X           Y
>  [1,]   1 -0.16225884
>  [2,]  10  0.82979617
>  [3,] 101  0.22412653
>  [4,] 102 -0.24841475
>  [5,] 103 -0.09914182
>  [6,] 104 -1.47982574
>  [7,] 105 -1.79957210
>  [8,] 106 -2.01715940
>  [9,] 107 -0.81900855
> [10,] 108  0.26357249
> [11,] 109 -1.22742679
> [12,] 110  0.64773494
> [13,]   2 -0.98312948
> [14,]   3  0.99937771
> [15,]   4 -1.72355977
> [16,]   5 -2.02481542
> [17,]   6 -0.07222688
> [18,]   7  0.17921321
> [19,]   8 -0.92102526
> [20,]   9 -0.14129584
>
Hi,
I've had a quick look but can't quite grasp it. rbind.data.table calls
data.table::c.factor() to concatenate factor columns, and that reorders the
levels on the new combined factor. I guess it shouldn't now that unordered
factor levels are allowed and supported in data.table. But if that's it, it
wouldn't have worked before either and it's always been a problem. Also I don't
see how a corruption could occur since joins between two factor columns with
different levels (each possibly unordered) should work fine. Could you provide
some more details before and after showing the change exactly?
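
To make the point about joins concrete, here is a minimal base R sketch (not data.table's join code, and the vectors `a` and `b` are made up for illustration). It shows why matching two factors with different, unordered levels is safe when done on the labels and wrong when done on the underlying integer codes.

```r
# Two hypothetical factors with different, unsorted level sets.
a <- factor(c("10", "2", "101"), levels = c("101", "2", "10"))
b <- factor(c("2", "101", "7"), levels = c("7", "2", "101"))

# The integer codes depend on each factor's own level order,
# so the same code means different labels in a and b.
codes_a <- as.integer(a)
codes_b <- as.integer(b)

# Matching on codes pairs up unrelated labels (e.g. "10" with "101").
idx_by_code <- match(codes_a, codes_b)

# Matching on labels is independent of level order.
idx_by_label <- match(as.character(a), as.character(b))

print(idx_by_code)
print(idx_by_label)
```

Here `idx_by_label` has `NA` for "10", which does not occur in `b`, while `idx_by_code` reports a match for every element. A corruption of the kind described would need something to compare codes rather than labels.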
I was going to say 'just use character', but had never considered ordered
integers greater than 2^32 as the use case, so character type wouldn't work for
them. It's a new one on me, so either way some new tests are needed.
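
As a possible workaround in the meantime, the following base R sketch rebuilds a combined factor with its levels in numeric rather than string order. It assumes every label parses as a number and that the values stay below 2^53, so that `as.numeric()` is exact. The names `labels`, `by_string` and `by_number` are hypothetical, and this is not what `c.factor()` does internally.

```r
# Factors as in the example above.
f1 <- factor(1:10)
f2 <- factor(101:110)

# Combine through character so the result does not depend on how
# c() treats factors in a given R version.
labels <- c(as.character(f1), as.character(f2))

# Default behaviour: levels are the distinct labels sorted as strings.
by_string <- factor(labels)

# Order the distinct labels by numeric value, but keep the original
# strings as levels (as.character() on a large double could otherwise
# switch to scientific notation).
lv <- unique(labels)
lv <- lv[order(as.numeric(lv))]
by_number <- factor(labels, levels = lv)

print(levels(by_string))
print(levels(by_number))
```

The string-sorted levels start "1", "10", "101", which is the order seen in the keyed output above. The numerically ordered levels run "1" to "10" and then "101" to "110".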
Finally, there is this fix in v1.8.1 that might be involved somehow:

    o   Joining a factor column with unsorted and unused levels to a character
        column now matches properly, fixing #1922. Thanks to Christoph Jäckel for
        the reproducible example. Test added.

Matthew