[datatable-help] Merging problem when column in one DT is factor and in the other character

Matthew Dowle mdowle at mdowle.plus.com
Mon Apr 2 02:17:21 CEST 2012


Thanks for example and data, very clear.

Yes, problem looks to be factors with unused levels, when joined to a
character column, as you suggested. Work arounds are to drop the unused
levels or convert to character, as you found. 

A fix is a bit more involved and won't have time for a while. Please
could you file a bug report so it doesn't get forgotten.

Thanks.

On Fri, 2012-03-30 at 16:51 +0100, Matthew Dowle wrote:
> Hi. A quick read suggests it's not intended and that's a bug. Just convert
> the columns to character for now, and it should work. Character columns
> are now preferred going forward, so I'd be suggesting that anyway even if
> it worked.
> 
> > So in case this is expected behavior, should data.table
> > give at least a warning that says something like "You join two
> > data.tables whereby one keyed column is a factor, one is a character.
> > That is probably not your intention. Convert the factor column to
> > character or vice versa."?
> 
> Yes. It should be converting to character (with a warning) in this case.
> Thought that's what I coded and tested. Will investigate...
> 
> > Hi together,
> >
> > here is the problem I needed dput for: http://www.fileuploadx.de/287440
> > (sorry,
> > I know that this filehoster is annoying because you have to wait until you
> > can download the file; I hope you have a coffee machine close by ;-)
> >
> > In this attachment, I basically load in two data.tables DT1 and DT2 that I
> > want to join, i.e. DT2[DT1], according to the keyed columns "Company_Code"
> > and "intDatum" in DT1 and "DSCD" and "intDatum" in DT2. However, while
> > "DSCD" is formatted as a character-column, "Company_Code" is formatted as
> > a
> > factor-column. As you can see from the long structure-object, there are
> > plenty of levels here (the actual data.tables are very small).
> >
> > Now, when I try to join those with DT2[DT1], I get:
> >
> >
> >         DSCD intDatum MONTH MV SICClass
> >  [1,] 997859      151  <NA> NA       44
> >  [2,] 997859      152  <NA> NA       44
> >  [3,] 998064      151  <NA> NA       15
> >  [4,] 998064      152  <NA> NA       15
> >  [5,] 142268      151  <NA> NA       53
> >  [6,] 142268      152  <NA> NA       53
> >  [7,] 142859      151  <NA> NA       56
> >  [8,] 142859      152  <NA> NA       56
> >  [9,] 143415      151  <NA> NA       63
> > [10,] 143415      152  <NA> NA       63
> > [11,] 307045      151  <NA> NA       15
> > [12,] 307045      152  <NA> NA       15
> >
> >
> >
> > Basically, data.table finds no values for MV and MONTH for any DSCD
> > and intDatum combination. However, as DT2[DSCD=="142268"] clearly
> > shows, there are values for that DSCD:
> >
> >
> >
> >        DSCD      MONTH      MV intDatum
> > [1,] 142268 1997-08-28 1901.12      151
> > [2,] 142268 1997-09-28 1829.00      152
> >
> >
> >
> > Those, however, only show up in the join after i get rid of all the
> > unused levels (equivalently, I can also transform the Company_Code to
> > a character column):
> >
> >
> > DT1[, Company_Code := factor(Company_Code)]
> > DT2[DT1]
> >
> >
> >         DSCD intDatum      MONTH      MV SICClass
> >  [1,] 997859      151       <NA>      NA       44
> >  [2,] 997859      152       <NA>      NA       44
> >  [3,] 998064      151       <NA>      NA       15
> >  [4,] 998064      152       <NA>      NA       15
> >  [5,] 142268      151 1997-08-28 1901.12       53
> >  [6,] 142268      152 1997-09-28 1829.00       53
> >  [7,] 142859      151       <NA>      NA       56
> >  [8,] 142859      152       <NA>      NA       56
> >  [9,] 143415      151       <NA>      NA       63
> > [10,] 143415      152       <NA>      NA       63
> > [11,] 307045      151       <NA>      NA       15
> > [12,] 307045      152       <NA>      NA       15
> >
> >
> >
> > I'm pretty sure this behaviour occurred only with version 1.8.0,
> > probably because data.table coerced every key to factor before (see
> > the NEWS to 1.8.0). So my question is: Is what happens here intended
> > behavior? I'm honest with you: I'm working now for a while with R and
> > factors are one of those things that I never got. I just don't see
> > their use and every so often they cause me huge problems (as in this
> > case). So I'm probably making something stupid here. The nasty thing
> > about this issue here is that mostly, however, the joins just work as
> > expected (believe me, I tried to produce a simple example with one
> > column factor and one character that would reproduce this behavior,
> > but no matter what I did, the joins afterwards always worked as
> > expected). So in case this is expected behavior, should data.table
> > give at least a warning that says something like "You join two
> > data.tables whereby one keyed column is a factor, one is a character.
> > That is probably not your intention. Convert the factor column to
> > character or vice versa."?
> >
> >
> >
> > Thanks for your help!
> >
> >
> >
> > Christoph
> > _______________________________________________
> > datatable-help mailing list
> > datatable-help at lists.r-forge.r-project.org
> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> 
> 
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help




More information about the datatable-help mailing list