[datatable-help] Merging problem when column in one DT is factor and in the other character
Matthew Dowle
mdowle at mdowle.plus.com
Fri Mar 30 17:51:48 CEST 2012
Hi. A quick read suggests it's not intended and that's a bug. Just convert
the columns to character for now, and it should work. Character columns
are now preferred going forward, so I'd be suggesting that anyway even if
it worked.
> So in case this is expected behavior, should data.table
> give at least a warning that says something like "You join two
> data.tables whereby one keyed column is a factor, one is a character.
> That is probably not your intention. Convert the factor column to
> character or vice versa."?
Yes. It should be converting to character (with a warning) in this case.
Thought that's what I coded and tested. Will investigate...
> Hi together,
>
> here is the problem I needed dput for: http://www.fileuploadx.de/287440
> (sorry,
> I know that this filehoster is annoying because you have to wait until you
> can download the file; I hope you have a coffee machine close by ;-)
>
> In this attachment, I basically load in two data.tables DT1 and DT2 that I
> want to join, i.e. DT2[DT1], according to the keyed columns "Company_Code"
> and "intDatum" in DT1 and "DSCD" and "intDatum" in DT2. However, while
> "DSCD" is formatted as a character-column, "Company_Code" is formatted as
> a
> factor-column. As you can see from the long structure-object, there are
> plenty of levels here (the actual data.tables are very small).
>
> Now, when I try to join those with DT2[DT1], I get:
>
>
> DSCD intDatum MONTH MV SICClass
> [1,] 997859 151 <NA> NA 44
> [2,] 997859 152 <NA> NA 44
> [3,] 998064 151 <NA> NA 15
> [4,] 998064 152 <NA> NA 15
> [5,] 142268 151 <NA> NA 53
> [6,] 142268 152 <NA> NA 53
> [7,] 142859 151 <NA> NA 56
> [8,] 142859 152 <NA> NA 56
> [9,] 143415 151 <NA> NA 63
> [10,] 143415 152 <NA> NA 63
> [11,] 307045 151 <NA> NA 15
> [12,] 307045 152 <NA> NA 15
>
>
>
> Basically, data.table finds no values for MV and MONTH for any DSCD
> and intDatum combination. However, as DT2[DSCD=="142268"] clearly
> shows, there are values for that DSCD:
>
>
>
> DSCD MONTH MV intDatum
> [1,] 142268 1997-08-28 1901.12 151
> [2,] 142268 1997-09-28 1829.00 152
>
>
>
> Those, however, only show up in the join after i get rid of all the
> unused levels (equivalently, I can also transform the Company_Code to
> a character column):
>
>
> DT1[, Company_Code := factor(Company_Code)]
> DT2[DT1]
>
>
> DSCD intDatum MONTH MV SICClass
> [1,] 997859 151 <NA> NA 44
> [2,] 997859 152 <NA> NA 44
> [3,] 998064 151 <NA> NA 15
> [4,] 998064 152 <NA> NA 15
> [5,] 142268 151 1997-08-28 1901.12 53
> [6,] 142268 152 1997-09-28 1829.00 53
> [7,] 142859 151 <NA> NA 56
> [8,] 142859 152 <NA> NA 56
> [9,] 143415 151 <NA> NA 63
> [10,] 143415 152 <NA> NA 63
> [11,] 307045 151 <NA> NA 15
> [12,] 307045 152 <NA> NA 15
>
>
>
> I'm pretty sure this behaviour occurred only with version 1.8.0,
> probably because data.table coerced every key to factor before (see
> the NEWS to 1.8.0). So my question is: Is what happens here intended
> behavior? I'm honest with you: I'm working now for a while with R and
> factors are one of those things that I never got. I just don't see
> their use and every so often they cause me huge problems (as in this
> case). So I'm probably making something stupid here. The nasty thing
> about this issue here is that mostly, however, the joins just work as
> expected (believe me, I tried to produce a simple example with one
> column factor and one character that would reproduce this behavior,
> but no matter what I did, the joins afterwards always worked as
> expected). So in case this is expected behavior, should data.table
> give at least a warning that says something like "You join two
> data.tables whereby one keyed column is a factor, one is a character.
> That is probably not your intention. Convert the factor column to
> character or vice versa."?
>
>
>
> Thanks for your help!
>
>
>
> Christoph
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
More information about the datatable-help
mailing list