[datatable-help] Merging problem when column in one DT is factor and in the other character
Christoph Jäckel
christoph.jaeckel at wi.tum.de
Fri Mar 30 17:28:17 CEST 2012
Hi together,
here is the problem I needed dput for: http://www.fileuploadx.de/287440 (sorry,
I know that this filehoster is annoying because you have to wait until you
can download the file; I hope you have a coffee machine close by ;-)
In this attachment, I basically load in two data.tables DT1 and DT2 that I
want to join, i.e. DT2[DT1], according to the keyed columns "Company_Code"
and "intDatum" in DT1 and "DSCD" and "intDatum" in DT2. However, while
"DSCD" is formatted as a character-column, "Company_Code" is formatted as a
factor-column. As you can see from the long structure-object, there are
plenty of levels here (the actual data.tables are very small).
Now, when I try to join those with DT2[DT1], I get:
DSCD intDatum MONTH MV SICClass
[1,] 997859 151 <NA> NA 44
[2,] 997859 152 <NA> NA 44
[3,] 998064 151 <NA> NA 15
[4,] 998064 152 <NA> NA 15
[5,] 142268 151 <NA> NA 53
[6,] 142268 152 <NA> NA 53
[7,] 142859 151 <NA> NA 56
[8,] 142859 152 <NA> NA 56
[9,] 143415 151 <NA> NA 63
[10,] 143415 152 <NA> NA 63
[11,] 307045 151 <NA> NA 15
[12,] 307045 152 <NA> NA 15
Basically, data.table finds no values for MV and MONTH for any DSCD
and intDatum combination. However, as DT2[DSCD=="142268"] clearly
shows, there are values for that DSCD:
DSCD MONTH MV intDatum
[1,] 142268 1997-08-28 1901.12 151
[2,] 142268 1997-09-28 1829.00 152
Those, however, only show up in the join after i get rid of all the
unused levels (equivalently, I can also transform the Company_Code to
a character column):
DT1[, Company_Code := factor(Company_Code)]
DT2[DT1]
DSCD intDatum MONTH MV SICClass
[1,] 997859 151 <NA> NA 44
[2,] 997859 152 <NA> NA 44
[3,] 998064 151 <NA> NA 15
[4,] 998064 152 <NA> NA 15
[5,] 142268 151 1997-08-28 1901.12 53
[6,] 142268 152 1997-09-28 1829.00 53
[7,] 142859 151 <NA> NA 56
[8,] 142859 152 <NA> NA 56
[9,] 143415 151 <NA> NA 63
[10,] 143415 152 <NA> NA 63
[11,] 307045 151 <NA> NA 15
[12,] 307045 152 <NA> NA 15
I'm pretty sure this behaviour occurred only with version 1.8.0,
probably because data.table coerced every key to factor before (see
the NEWS to 1.8.0). So my question is: Is what happens here intended
behavior? I'm honest with you: I'm working now for a while with R and
factors are one of those things that I never got. I just don't see
their use and every so often they cause me huge problems (as in this
case). So I'm probably making something stupid here. The nasty thing
about this issue here is that mostly, however, the joins just work as
expected (believe me, I tried to produce a simple example with one
column factor and one character that would reproduce this behavior,
but no matter what I did, the joins afterwards always worked as
expected). So in case this is expected behavior, should data.table
give at least a warning that says something like "You join two
data.tables whereby one keyed column is a factor, one is a character.
That is probably not your intention. Convert the factor column to
character or vice versa."?
Thanks for your help!
Christoph
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20120330/63e3d70b/attachment.html>
More information about the datatable-help
mailing list