[datatable-help] Merging problem when column in one DT is factor and in the other character
Matthew Dowle
mdowle at mdowle.plus.com
Wed Apr 18 09:11:33 CEST 2012
Now fixed in v1.8.1.
On Mon, 2012-04-02 at 13:39 +0200, Christoph Jäckel wrote:
> I have raised a bug report under, which can be found under #1922.
>
> On Mon, Apr 2, 2012 at 2:17 AM, Matthew Dowle <mdowle at mdowle.plus.com>
> wrote:
>
> Thanks for example and data, very clear.
>
> Yes, problem looks to be factors with unused levels, when
> joined to a
> character column, as you suggested. Work arounds are to drop
> the unused
> levels or convert to character, as you found.
>
> A fix is a bit more involved and won't have time for a while.
> Please
> could you file a bug report so it doesn't get forgotten.
>
> Thanks.
>
> On Fri, 2012-03-30 at 16:51 +0100, Matthew Dowle wrote:
> > Hi. A quick read suggests it's not intended and that's a
> bug. Just convert
> > the columns to character for now, and it should work.
> Character columns
> > are now preferred going forward, so I'd be suggesting that
> anyway even if
> > it worked.
> >
> > > So in case this is expected behavior, should data.table
> > > give at least a warning that says something like "You join
> two
> > > data.tables whereby one keyed column is a factor, one is a
> character.
> > > That is probably not your intention. Convert the factor
> column to
> > > character or vice versa."?
> >
> > Yes. It should be converting to character (with a warning)
> in this case.
> > Thought that's what I coded and tested. Will investigate...
> >
> > > Hi together,
> > >
> > > here is the problem I needed dput for:
> http://www.fileuploadx.de/287440
> > > (sorry,
> > > I know that this filehoster is annoying because you have
> to wait until you
> > > can download the file; I hope you have a coffee machine
> close by ;-)
> > >
> > > In this attachment, I basically load in two data.tables
> DT1 and DT2 that I
> > > want to join, i.e. DT2[DT1], according to the keyed
> columns "Company_Code"
> > > and "intDatum" in DT1 and "DSCD" and "intDatum" in DT2.
> However, while
> > > "DSCD" is formatted as a character-column, "Company_Code"
> is formatted as
> > > a
> > > factor-column. As you can see from the long
> structure-object, there are
> > > plenty of levels here (the actual data.tables are very
> small).
> > >
> > > Now, when I try to join those with DT2[DT1], I get:
> > >
> > >
> > > DSCD intDatum MONTH MV SICClass
> > > [1,] 997859 151 <NA> NA 44
> > > [2,] 997859 152 <NA> NA 44
> > > [3,] 998064 151 <NA> NA 15
> > > [4,] 998064 152 <NA> NA 15
> > > [5,] 142268 151 <NA> NA 53
> > > [6,] 142268 152 <NA> NA 53
> > > [7,] 142859 151 <NA> NA 56
> > > [8,] 142859 152 <NA> NA 56
> > > [9,] 143415 151 <NA> NA 63
> > > [10,] 143415 152 <NA> NA 63
> > > [11,] 307045 151 <NA> NA 15
> > > [12,] 307045 152 <NA> NA 15
> > >
> > >
> > >
> > > Basically, data.table finds no values for MV and MONTH for
> any DSCD
> > > and intDatum combination. However, as DT2[DSCD=="142268"]
> clearly
> > > shows, there are values for that DSCD:
> > >
> > >
> > >
> > > DSCD MONTH MV intDatum
> > > [1,] 142268 1997-08-28 1901.12 151
> > > [2,] 142268 1997-09-28 1829.00 152
> > >
> > >
> > >
> > > Those, however, only show up in the join after i get rid
> of all the
> > > unused levels (equivalently, I can also transform the
> Company_Code to
> > > a character column):
> > >
> > >
> > > DT1[, Company_Code := factor(Company_Code)]
> > > DT2[DT1]
> > >
> > >
> > > DSCD intDatum MONTH MV SICClass
> > > [1,] 997859 151 <NA> NA 44
> > > [2,] 997859 152 <NA> NA 44
> > > [3,] 998064 151 <NA> NA 15
> > > [4,] 998064 152 <NA> NA 15
> > > [5,] 142268 151 1997-08-28 1901.12 53
> > > [6,] 142268 152 1997-09-28 1829.00 53
> > > [7,] 142859 151 <NA> NA 56
> > > [8,] 142859 152 <NA> NA 56
> > > [9,] 143415 151 <NA> NA 63
> > > [10,] 143415 152 <NA> NA 63
> > > [11,] 307045 151 <NA> NA 15
> > > [12,] 307045 152 <NA> NA 15
> > >
> > >
> > >
> > > I'm pretty sure this behaviour occurred only with version
> 1.8.0,
> > > probably because data.table coerced every key to factor
> before (see
> > > the NEWS to 1.8.0). So my question is: Is what happens
> here intended
> > > behavior? I'm honest with you: I'm working now for a while
> with R and
> > > factors are one of those things that I never got. I just
> don't see
> > > their use and every so often they cause me huge problems
> (as in this
> > > case). So I'm probably making something stupid here. The
> nasty thing
> > > about this issue here is that mostly, however, the joins
> just work as
> > > expected (believe me, I tried to produce a simple example
> with one
> > > column factor and one character that would reproduce this
> behavior,
> > > but no matter what I did, the joins afterwards always
> worked as
> > > expected). So in case this is expected behavior, should
> data.table
> > > give at least a warning that says something like "You join
> two
> > > data.tables whereby one keyed column is a factor, one is a
> character.
> > > That is probably not your intention. Convert the factor
> column to
> > > character or vice versa."?
> > >
> > >
> > >
> > > Thanks for your help!
> > >
> > >
> > >
> > > Christoph
> > > _______________________________________________
> > > datatable-help mailing list
> > > datatable-help at lists.r-forge.r-project.org
> > >
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> >
> >
> > _______________________________________________
> > datatable-help mailing list
> > datatable-help at lists.r-forge.r-project.org
> >
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
>
>
>
>
More information about the datatable-help
mailing list