[datatable-help] Merging problem when column in one DT is factor and in the other character

Matthew Dowle mdowle at mdowle.plus.com
Wed Apr 18 09:11:33 CEST 2012


Now fixed in v1.8.1.

On Mon, 2012-04-02 at 13:39 +0200, Christoph Jäckel wrote:
> I have raised a bug report under, which can be found under #1922.
> 
> On Mon, Apr 2, 2012 at 2:17 AM, Matthew Dowle <mdowle at mdowle.plus.com>
> wrote:
>         
>         Thanks for example and data, very clear.
>         
>         Yes, problem looks to be factors with unused levels, when
>         joined to a
>         character column, as you suggested. Work arounds are to drop
>         the unused
>         levels or convert to character, as you found.
>         
>         A fix is a bit more involved and won't have time for a while.
>         Please
>         could you file a bug report so it doesn't get forgotten.
>         
>         Thanks.
>         
>         On Fri, 2012-03-30 at 16:51 +0100, Matthew Dowle wrote:
>         > Hi. A quick read suggests it's not intended and that's a
>         bug. Just convert
>         > the columns to character for now, and it should work.
>         Character columns
>         > are now preferred going forward, so I'd be suggesting that
>         anyway even if
>         > it worked.
>         >
>         > > So in case this is expected behavior, should data.table
>         > > give at least a warning that says something like "You join
>         two
>         > > data.tables whereby one keyed column is a factor, one is a
>         character.
>         > > That is probably not your intention. Convert the factor
>         column to
>         > > character or vice versa."?
>         >
>         > Yes. It should be converting to character (with a warning)
>         in this case.
>         > Thought that's what I coded and tested. Will investigate...
>         >
>         > > Hi together,
>         > >
>         > > here is the problem I needed dput for:
>         http://www.fileuploadx.de/287440
>         > > (sorry,
>         > > I know that this filehoster is annoying because you have
>         to wait until you
>         > > can download the file; I hope you have a coffee machine
>         close by ;-)
>         > >
>         > > In this attachment, I basically load in two data.tables
>         DT1 and DT2 that I
>         > > want to join, i.e. DT2[DT1], according to the keyed
>         columns "Company_Code"
>         > > and "intDatum" in DT1 and "DSCD" and "intDatum" in DT2.
>         However, while
>         > > "DSCD" is formatted as a character-column, "Company_Code"
>         is formatted as
>         > > a
>         > > factor-column. As you can see from the long
>         structure-object, there are
>         > > plenty of levels here (the actual data.tables are very
>         small).
>         > >
>         > > Now, when I try to join those with DT2[DT1], I get:
>         > >
>         > >
>         > >         DSCD intDatum MONTH MV SICClass
>         > >  [1,] 997859      151  <NA> NA       44
>         > >  [2,] 997859      152  <NA> NA       44
>         > >  [3,] 998064      151  <NA> NA       15
>         > >  [4,] 998064      152  <NA> NA       15
>         > >  [5,] 142268      151  <NA> NA       53
>         > >  [6,] 142268      152  <NA> NA       53
>         > >  [7,] 142859      151  <NA> NA       56
>         > >  [8,] 142859      152  <NA> NA       56
>         > >  [9,] 143415      151  <NA> NA       63
>         > > [10,] 143415      152  <NA> NA       63
>         > > [11,] 307045      151  <NA> NA       15
>         > > [12,] 307045      152  <NA> NA       15
>         > >
>         > >
>         > >
>         > > Basically, data.table finds no values for MV and MONTH for
>         any DSCD
>         > > and intDatum combination. However, as DT2[DSCD=="142268"]
>         clearly
>         > > shows, there are values for that DSCD:
>         > >
>         > >
>         > >
>         > >        DSCD      MONTH      MV intDatum
>         > > [1,] 142268 1997-08-28 1901.12      151
>         > > [2,] 142268 1997-09-28 1829.00      152
>         > >
>         > >
>         > >
>         > > Those, however, only show up in the join after i get rid
>         of all the
>         > > unused levels (equivalently, I can also transform the
>         Company_Code to
>         > > a character column):
>         > >
>         > >
>         > > DT1[, Company_Code := factor(Company_Code)]
>         > > DT2[DT1]
>         > >
>         > >
>         > >         DSCD intDatum      MONTH      MV SICClass
>         > >  [1,] 997859      151       <NA>      NA       44
>         > >  [2,] 997859      152       <NA>      NA       44
>         > >  [3,] 998064      151       <NA>      NA       15
>         > >  [4,] 998064      152       <NA>      NA       15
>         > >  [5,] 142268      151 1997-08-28 1901.12       53
>         > >  [6,] 142268      152 1997-09-28 1829.00       53
>         > >  [7,] 142859      151       <NA>      NA       56
>         > >  [8,] 142859      152       <NA>      NA       56
>         > >  [9,] 143415      151       <NA>      NA       63
>         > > [10,] 143415      152       <NA>      NA       63
>         > > [11,] 307045      151       <NA>      NA       15
>         > > [12,] 307045      152       <NA>      NA       15
>         > >
>         > >
>         > >
>         > > I'm pretty sure this behaviour occurred only with version
>         1.8.0,
>         > > probably because data.table coerced every key to factor
>         before (see
>         > > the NEWS to 1.8.0). So my question is: Is what happens
>         here intended
>         > > behavior? I'm honest with you: I'm working now for a while
>         with R and
>         > > factors are one of those things that I never got. I just
>         don't see
>         > > their use and every so often they cause me huge problems
>         (as in this
>         > > case). So I'm probably making something stupid here. The
>         nasty thing
>         > > about this issue here is that mostly, however, the joins
>         just work as
>         > > expected (believe me, I tried to produce a simple example
>         with one
>         > > column factor and one character that would reproduce this
>         behavior,
>         > > but no matter what I did, the joins afterwards always
>         worked as
>         > > expected). So in case this is expected behavior, should
>         data.table
>         > > give at least a warning that says something like "You join
>         two
>         > > data.tables whereby one keyed column is a factor, one is a
>         character.
>         > > That is probably not your intention. Convert the factor
>         column to
>         > > character or vice versa."?
>         > >
>         > >
>         > >
>         > > Thanks for your help!
>         > >
>         > >
>         > >
>         > > Christoph
>         > > _______________________________________________
>         > > datatable-help mailing list
>         > > datatable-help at lists.r-forge.r-project.org
>         > >
>         https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>         >
>         >
>         > _______________________________________________
>         > datatable-help mailing list
>         > datatable-help at lists.r-forge.r-project.org
>         >
>         https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>         
>         
>         
> 
> 




More information about the datatable-help mailing list