[datatable-help] Something seems funky. I think with character-to-factor conversion for keys (?)

Matthew Dowle mdowle at mdowle.plus.com
Mon Mar 7 20:31:15 CET 2011


An option to turn on a check like that might be good. Probably at the
start of [.data.table.
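
As a rough sketch of what such a check could look like at the R level:
the helper name unsorted_factor_cols is made up for illustration and is
not part of data.table, and the example table is invented. It also
assumes that "sorted" means the order is.unsorted() uses in the current
session, which may not be exactly what the binary search expects.

```r
# Hypothetical helper, not part of data.table: returns the names of the
# factor columns whose levels are not in increasing order.
unsorted_factor_cols <- function(x) {
  is_bad <- vapply(
    x,
    function(col) is.factor(col) && is.unsorted(levels(col)),
    logical(1)
  )
  names(x)[is_bad]
}

# Made-up example: levels given in numeric order, so "9" comes before "10"
DF <- data.frame(
  id = factor(c("9", "10", "12"), levels = c("9", "10", "12")),
  n  = c(27L, 347L, 5076L)
)
unsorted_factor_cols(DF)   # "id"
```

A real version would presumably only look at the key columns and raise
a warning rather than return names.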

When I've seen this issue before it's been when I have been constructing
data.tables 'manually'. As in other places in R, nothing stops you from
creating invalid objects directly.

For example (in base R) :

> DF = list(1:10,1:5)
> DF
[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10

[[2]]
[1] 1 2 3 4 5

> class(DF)="data.frame"
> sapply(DF,length)
[1] 10  5
> DF
NULL
<0 rows> (or 0-length row.names)
>
> attr(DF,"row.names")=letters[1:3]
> DF[6,]
   NA NA
NA  6 NA
> 
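
The same goes for factors. A small base R sketch (the values are made
up for illustration) of a factor whose levels are in the order given
rather than sorted, and of re-creating it so the levels are sorted
again:

```r
ids <- c("9", "10", "12")

# Passing levels explicitly keeps them in the order given, not sorted
f <- factor(ids, levels = ids)
is.unsorted(levels(f))                        # TRUE

# Going via character rebuilds the levels as sort(unique(x))
g <- factor(as.character(f))
is.unsorted(levels(g))                        # FALSE

# The values themselves are unchanged, only the level order differs
identical(as.character(f), as.character(g))   # TRUE
```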


On Mon, 2011-03-07 at 11:30 -0500, Steve Lianoglou wrote:
> Hi,
> 
> On Sat, Mar 5, 2011 at 4:06 PM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
> > Hi,
> > Seems consistent with out of order factor levels. The binary search
> > relies on levels being sorted. If that's it then please track down the
> > earlier point where the out-of-order factor levels were introduced and
> > maybe a fix is needed there. Everything else here is correct behaviour.
> 
> I know it sounds lame, but I'm having problems tracking down how my
> key/factor column arrived at having out of order levels.
> 
> While I try to smoke that out, do you think it would be a good idea to
> write a small utility at the C level to scan through the levels() of
> factor-keys to test for them being in order and
> breaking/short-circuiting as soon as it finds one level that's out of
> order? This way we can fire off a warning when this problem is
> detected so the user would be warned to expect "weird" behavior (and
> also know how to fix(?))
> 
> I'm not sure exactly where/when we would invoke that test -- maybe
> after calls to setkey ... and optionally under merge-like operations.
> 
> I can take a crack at doing that if it seems like a good idea.
> 
> -steve
> 
> > Matthew
> >
> > On Fri, 2011-03-04 at 21:43 -0500, Steve Lianoglou wrote:
> >> Hi Mel,
> >>
> >> On Fri, Mar 4, 2011 at 8:15 PM, Bacou, Melanie <mel at mbacou.com> wrote:
> >> > Steve,
> >> >
> >> > Try instead:
> >> >
> >> > R> m2[J(9)]
> >> >
> >> > It seems your original entrez.id key is integer not character
> >>
> >> It's actually a factor:
> >>
> >> R> is(m2$entrez.id)
> >> [1] "factor"   "integer"  "oldClass" "numeric"  "vector"
> >>
> >> and moreover:
> >>
> >> R> '9' %in% levels(m2$entrez.id)
> >> [1] TRUE
> >>
> >> and the integer J() maneuver is a no go:
> >>
> >> R> m2[J(9)]
> >> Error in `[.data.table`(m2, J(9)) :
> >>   x.entrez.id is a factor but joining to i.V1 which is not a factor.
> >> Factors must join to factors.
> >>
> >> > -- but to be honest I'm not sure why:
> >> >
> >> > R> m2[9]
> >> >
> >> > doesn't work either...
> >>
> >> That works, in that it does something, but it just gets the 9th row of
> >> m2, not the row whose key is '9'
> >>
> >> Seems like something strange is afoot here ...
> >>
> >> -steve
> >>
> >> > --Mel.
> >> >
> >> > -----Original Message-----
> >> > From: datatable-help-bounces at r-forge.wu-wien.ac.at
> >> > [mailto:datatable-help-bounces at r-forge.wu-wien.ac.at] On Behalf Of Steve
> >> > Lianoglou
> >> > Sent: Friday, March 04, 2011 5:46 PM
> >> > To: datatable-help at r-forge.wu-wien.ac.at
> >> > Subject: [datatable-help] Something seems funky. I think with
> >> > character-to-factor conversion for keys (?)
> >> >
> >> > I'll have to apologize in advance because I can't create a
> >> > reproducible example for this behavior, but I'll keep trying .. please
> >> > bear with me.
> >> >
> >> > Somehow I've ended up with a data.table `m2` that looks like this:
> >> >
> >> > R> m2
> >> >      entrez.id total.tags.liver cds.liver intron.liver utr.liver
> >> >  [1,]         9               27         0            0         0
> >> >  [2,]        10              347         0            0         0
> >> >  [3,]        12             5076         0           17         0
> >> >  [4,]        13             2445         0            0         0
> >> >  [5,]        18             2076         0            0         0
> >> >  [6,]        20               15         0            0         0
> >> >  [7,]        25               62         0            0         0
> >> >  [8,]        32              320         0            0         0
> >> >  [9,]        34             1377         0            0         0
> >> > [10,]        35              757         0            0         0
> >> > First 10 rows of 5236 printed.
> >> >
> >> > R> key(m2)
> >> > [1] "entrez.id"
> >> >
> >> > R> any(duplicated(m2$entrez.id))
> >> > [1] FALSE
> >> >
> >> > So far so good -- I stumbled on the following problem when `merge`-ing
> >> > two large data tables which was giving me a strange error. In the
> >> > process of trying to smoke out the problem, I notice this unexpected
> >> > behavior:
> >> >
> >> > ## This is expected
> >> > R> subset(m2, entrez.id == '9')
> >> >     entrez.id total.tags.liver cds.liver intron.liver utr.liver
> >> > [1,]         9               27         0            0         0
> >> >
> >> > ## This isn't
> >> > R> m2['9']
> >> >     entrez.id total.tags.liver cds.liver intron.liver utr.liver
> >> > [1,]         9               NA        NA           NA        NA
> >> >
> >> > Woops! Isn't that supposed to return the same as above?
> >> >
> >> > I can fix `m2` by manipulating the key column:
> >> >
> >> > R> key(m2) <- NULL ## probably not necessary
> >> > R> m2$entrez.id <- as.character(m2$entrez.id)
> >> > R> key(m2) <- 'entrez.id'
> >> > R> m2['9']
> >> >     entrez.id total.tags.liver cds.liver intron.liver utr.liver
> >> > [1,]         9               27         0            0         0
> >> >
> >> > (side note: the bug I mentioned when I try to `merge` this w/ another
> >> > data.table is gone after I did the above fix).
> >> >
> >> > So -- I guess my point is that I'm not exactly sure how I got `m2` to
> >> > have a funky key, but the fact that it got messed up like this somehow
> >> > I think is undesired behavior, no?
> >> >
> >> > Does this point to something (maybe obvious) that happened on the way
> >> > to building up `m2`?
> >> >
> >> > Thanks,
> >> > -steve
> >> >
> >> > --
> >> > Steve Lianoglou
> >> > Graduate Student: Computational Systems Biology
> >> >  | Memorial Sloan-Kettering Cancer Center
> >> >  | Weill Medical College of Cornell University
> >> > Contact Info: http://cbio.mskcc.org/~lianos/contact
> >> > _______________________________________________
> >> > datatable-help mailing list
> >> > datatable-help at lists.r-forge.r-project.org
> >> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> >> >
> >>
> >>
> >>
> >
> >
> >
> 
> 
> 