[datatable-help] Something seems funky. I think with character-to-factor conversion for keys (?)

Steve Lianoglou mailinglist.honeypot at gmail.com
Mon Mar 7 17:30:30 CET 2011


Hi,

On Sat, Mar 5, 2011 at 4:06 PM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
> Hi,
> Seems consistent with out of order factor levels. The binary search
> relies on levels being sorted. If that's it then please track down the
> earlier point where the out-of-order factor levels were introduced and
> maybe a fix is needed there. Everything else here is correct behaviour.

I know it sounds lame, but I'm having problems tracking down how my
key/factor column arrived at having out of order levels.

While I try to smoke that out, do you think it would be a good idea to
write a small utility at the C level that scans the levels() of
factor keys, checks that they are in order, and short-circuits as soon
as it finds one level that's out of order? That way we can fire off a
warning when the problem is detected, so the user knows to expect
"weird" behavior (and also how to fix it(?)).

I'm not sure exactly where/when we would invoke that test -- maybe
after calls to setkey ... and optionally under merge-like operations.
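
To make the idea concrete, here is a rough R-level sketch of the check
(the C version would do the same scan). This is not data.table code:
`levels_sorted` is a hypothetical helper name, and I'm assuming "in
order" means the order R's own sort() gives for character vectors,
which may not be exactly what the binary search expects.

```r
# Hypothetical helper, not part of data.table: TRUE if the levels of
# factor `f` are in strictly increasing order (R's string ordering).
levels_sorted <- function(f) {
  stopifnot(is.factor(f))
  !is.unsorted(levels(f), strictly = TRUE)
}

# factor() sorts the levels itself, so this one passes the check
ok <- factor(c("9", "10", "12"))

# Levels supplied by hand in "numeric-looking" order: as strings,
# "9" sorts after "10", so these levels are out of order
bad <- factor(c("9", "10", "12"), levels = c("9", "10", "12"))

levels_sorted(ok)   # TRUE
levels_sorted(bad)  # FALSE

# Rebuilding the factor from its character values re-sorts the levels
fixed <- factor(as.character(bad))
levels_sorted(fixed)  # TRUE
```

If the check fails, the warning could point the user at that last
step (rebuild the factor, then reset the key).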

I can take a crack at doing that if it seems like a good idea.

-steve

> Matthew
>
> On Fri, 2011-03-04 at 21:43 -0500, Steve Lianoglou wrote:
>> Hi Mel,
>>
>> On Fri, Mar 4, 2011 at 8:15 PM, Bacou, Melanie <mel at mbacou.com> wrote:
>> > Steve,
>> >
>> > Try instead:
>> >
>> > R> m2[J(9)]
>> >
>> > It seems your original entrez.id key is integer not character
>>
>> It's actually a factor:
>>
>> R> is(m2$entrez.id)
>> [1] "factor"   "integer"  "oldClass" "numeric"  "vector"
>>
>> and moreover:
>>
>> R> '9' %in% levels(m2$entrez.id)
>> [1] TRUE
>>
>> and the integer J() maneuver is a no go:
>>
>> R> m2[J(9)]
>> Error in `[.data.table`(m2, J(9)) :
>>   x.entrez.id is a factor but joining to i.V1 which is not a factor.
>> Factors must join to factors.
>>
>> > -- but to be honest I'm not sure why:
>> >
>> > R> m2[9]
>> >
>> > doesn't work either...
>>
>> That works, in that it does something, but it just gets the 9th row of
>> m2, not the row whose key is '9'
>>
>> Seems like something strange is afoot here ...
>>
>> -steve
>>
>> > --Mel.
>> >
>> > -----Original Message-----
>> > From: datatable-help-bounces at r-forge.wu-wien.ac.at
>> > [mailto:datatable-help-bounces at r-forge.wu-wien.ac.at] On Behalf Of Steve
>> > Lianoglou
>> > Sent: Friday, March 04, 2011 5:46 PM
>> > To: datatable-help at r-forge.wu-wien.ac.at
>> > Subject: [datatable-help] Something seems funky. I think with
>> > character-to-factor conversion for keys (?)
>> >
>> > I'll have to apologize in advance because I can't create a
>> > reproducible example for this behavior, but I'll keep trying .. please
>> > bear with me.
>> >
>> > Somehow I've ended up with a data.table `m2` that looks like this:
>> >
>> > R> m2
>> >      entrez.id total.tags.liver cds.liver intron.liver utr.liver
>> >  [1,]         9               27         0            0         0
>> >  [2,]        10              347         0            0         0
>> >  [3,]        12             5076         0           17         0
>> >  [4,]        13             2445         0            0         0
>> >  [5,]        18             2076         0            0         0
>> >  [6,]        20               15         0            0         0
>> >  [7,]        25               62         0            0         0
>> >  [8,]        32              320         0            0         0
>> >  [9,]        34             1377         0            0         0
>> > [10,]        35              757         0            0         0
>> > First 10 rows of 5236 printed.
>> >
>> > R> key(m2)
>> > [1] "entrez.id"
>> >
>> > R> any(duplicated(m2$entrez.id))
>> > [1] FALSE
>> >
>> > So far so good -- I stumbled on the following problem when `merge`-ing
>> > two large data tables, which was giving me a strange error. In the
>> > process of trying to smoke out the problem, I noticed this unexpected
>> > behavior:
>> >
>> > ## This is expected
>> > R> subset(m2, entrez.id == '9')
>> >     entrez.id total.tags.liver cds.liver intron.liver utr.liver
>> > [1,]         9               27         0            0         0
>> >
>> > ## This isn't
>> > R> m2['9']
>> >     entrez.id total.tags.liver cds.liver intron.liver utr.liver
>> > [1,]         9               NA        NA           NA        NA
>> >
>> > Woops! Isn't that supposed to return the same as above?
>> >
>> > I can fix `m2` by manipulating the key column:
>> >
>> > R> key(m2) <- NULL ## probably not necessary
>> > R> m2$entrez.id <- as.character(m2$entrez.id)
>> > R> key(m2) <- 'entrez.id'
>> > R> m2['9']
>> >     entrez.id total.tags.liver cds.liver intron.liver utr.liver
>> > [1,]         9               27         0            0         0
>> >
>> > (side note: the bug I mentioned when I try to `merge` this w/ another
>> > data.table is gone after I did the above fix).
>> >
>> > So -- I guess my point is that I'm not exactly sure how I got `m2` to
>> > have a funky key, but the fact that it got messed up like this somehow
>> > I think is undesired behavior, no?
>> >
>> > Does this point to something (maybe obvious) that happened on the way
>> > to building up `m2`?
>> >
>> > Thanks,
>> > -steve
>> >
>> > --
>> > Steve Lianoglou
>> > Graduate Student: Computational Systems Biology
>> >  | Memorial Sloan-Kettering Cancer Center
>> >  | Weill Medical College of Cornell University
>> > Contact Info: http://cbio.mskcc.org/~lianos/contact
>> > _______________________________________________
>> > datatable-help mailing list
>> > datatable-help at lists.r-forge.r-project.org
>> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>> >
>>
>>
>>
>
>
>



-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

