[datatable-help] Merging problem when column in one DT is factor and in the other character

Christoph Jäckel christoph.jaeckel at wi.tum.de
Fri Mar 30 17:28:17 CEST 2012


Hi together,

here is the problem I needed dput for: http://www.fileuploadx.de/287440 (sorry,
I know that this filehoster is annoying because you have to wait until you
can download the file; I hope you have a coffee machine close by ;-)

In this attachment, I basically load in two data.tables DT1 and DT2 that I
want to join, i.e. DT2[DT1], according to the keyed columns "Company_Code"
and "intDatum" in DT1 and "DSCD" and "intDatum" in DT2. However, while
"DSCD" is formatted as a character-column, "Company_Code" is formatted as a
factor-column. As you can see from the long structure-object, there are
plenty of levels here (the actual data.tables are very small).

Now, when I try to join those with DT2[DT1], I get:


        DSCD intDatum MONTH MV SICClass
 [1,] 997859      151  <NA> NA       44
 [2,] 997859      152  <NA> NA       44
 [3,] 998064      151  <NA> NA       15
 [4,] 998064      152  <NA> NA       15
 [5,] 142268      151  <NA> NA       53
 [6,] 142268      152  <NA> NA       53
 [7,] 142859      151  <NA> NA       56
 [8,] 142859      152  <NA> NA       56
 [9,] 143415      151  <NA> NA       63
[10,] 143415      152  <NA> NA       63
[11,] 307045      151  <NA> NA       15
[12,] 307045      152  <NA> NA       15



Basically, data.table finds no values for MV and MONTH for any DSCD
and intDatum combination. However, as DT2[DSCD=="142268"] clearly
shows, there are values for that DSCD:



       DSCD      MONTH      MV intDatum
[1,] 142268 1997-08-28 1901.12      151
[2,] 142268 1997-09-28 1829.00      152



Those, however, only show up in the join after i get rid of all the
unused levels (equivalently, I can also transform the Company_Code to
a character column):


DT1[, Company_Code := factor(Company_Code)]
DT2[DT1]


        DSCD intDatum      MONTH      MV SICClass
 [1,] 997859      151       <NA>      NA       44
 [2,] 997859      152       <NA>      NA       44
 [3,] 998064      151       <NA>      NA       15
 [4,] 998064      152       <NA>      NA       15
 [5,] 142268      151 1997-08-28 1901.12       53
 [6,] 142268      152 1997-09-28 1829.00       53
 [7,] 142859      151       <NA>      NA       56
 [8,] 142859      152       <NA>      NA       56
 [9,] 143415      151       <NA>      NA       63
[10,] 143415      152       <NA>      NA       63
[11,] 307045      151       <NA>      NA       15
[12,] 307045      152       <NA>      NA       15



I'm pretty sure this behaviour occurred only with version 1.8.0,
probably because data.table coerced every key to factor before (see
the NEWS to 1.8.0). So my question is: Is what happens here intended
behavior? I'm honest with you: I'm working now for a while with R and
factors are one of those things that I never got. I just don't see
their use and every so often they cause me huge problems (as in this
case). So I'm probably making something stupid here. The nasty thing
about this issue here is that mostly, however, the joins just work as
expected (believe me, I tried to produce a simple example with one
column factor and one character that would reproduce this behavior,
but no matter what I did, the joins afterwards always worked as
expected). So in case this is expected behavior, should data.table
give at least a warning that says something like "You join two
data.tables whereby one keyed column is a factor, one is a character.
That is probably not your intention. Convert the factor column to
character or vice versa."?



Thanks for your help!



Christoph
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20120330/63e3d70b/attachment.html>


More information about the datatable-help mailing list