[datatable-help] Something seems funky. I think with character-to-factor conversion for keys (?)
Steve Lianoglou
mailinglist.honeypot at gmail.com
Mon Mar 7 17:30:30 CET 2011
Hi,
On Sat, Mar 5, 2011 at 4:06 PM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
> Hi,
> Seems consistent with out of order factor levels. The binary search
> relies on levels being sorted. If that's it then please track down the
> earlier point where the out-of-order factor levels were introduced and
> maybe a fix is needed there. Everything else here is correct behaviour.
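To make the point about binary search concrete, here is a toy version in plain R. It is only an illustration: `binary_search` is a hypothetical helper written for this note, not data.table's actual C code, and the level vectors are made up to mimic numeric-looking IDs stored as strings.

```r
# Toy binary search over a character vector that is ASSUMED to be sorted.
# Returns the position of `target`, or NA if the search does not find it.
binary_search <- function(sorted, target) {
  lo <- 1L
  hi <- length(sorted)
  while (lo <= hi) {
    mid <- (lo + hi) %/% 2L
    if (sorted[mid] == target) return(mid)
    # String comparison: "10" < "9" is TRUE because "1" sorts before "9"
    if (sorted[mid] < target) lo <- mid + 1L else hi <- mid - 1L
  }
  NA_integer_
}

out_of_order <- c("9", "10", "12")   # numeric order, NOT string order
in_order     <- c("10", "12", "9")   # string order

binary_search(out_of_order, "9")  # NA: "9" is present but never reached
binary_search(in_order, "9")      # 3
```

The search on the out-of-order vector moves right after comparing against "10" and never revisits position 1, so a value that is present can still come back as not found.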
I know it sounds lame, but I'm having problems tracking down how my
key/factor column arrived at having out of order levels.
While I try to smoke that out, do you think it would be a good idea to
write a small utility at the C level to scan through the levels() of
factor-keys to test for them being in order and
breaking/short-circuiting as soon as it finds one level that's out of
order? This way we can fire off a warning when this problem is
detected so the user would be warned to expect "weird" behavior (and
also know how to fix it (?)).
I'm not sure exactly where/when we would invoke that test -- maybe
after calls to setkey ... and optionally under merge-like operations.
I can take a crack at doing that if it seems like a good idea.
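Before going to C, the check can be roughed out at the R level. This is a minimal sketch under some assumptions: `levels_in_order` is a hypothetical name (nothing like it exists in data.table as far as this thread shows), it leans on base R's `is.unsorted()`, and it uses R's own string ordering, which may not match whatever ordering data.table's key code expects.

```r
# Hypothetical helper (not part of data.table): TRUE when the levels of a
# factor are already in increasing order according to base R's ordering.
levels_in_order <- function(f) {
  stopifnot(is.factor(f))
  !is.unsorted(levels(f))
}

ids  <- c("9", "10", "12")
bad  <- factor(ids, levels = ids)  # levels forced to "9" "10" "12"
good <- factor(ids)                # levels sorted as strings: "10" "12" "9"

levels_in_order(bad)    # FALSE
levels_in_order(good)   # TRUE

# One possible repair: rebuild the factor so its levels get re-sorted.
fixed <- factor(as.character(bad))
levels_in_order(fixed)  # TRUE
```

A warning could then be as simple as `if (!levels_in_order(x)) warning(...)` wherever the test ends up being called.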
-steve
> Matthew
>
> On Fri, 2011-03-04 at 21:43 -0500, Steve Lianoglou wrote:
>> Hi Mel,
>>
>> On Fri, Mar 4, 2011 at 8:15 PM, Bacou, Melanie <mel at mbacou.com> wrote:
>> > Steve,
>> >
>> > Try instead:
>> >
>> > R> m2[J(9)]
>> >
>> > It seems your original entrez.id key is integer not character
>>
>> It's actually a factor:
>>
>> R> is(m2$entrez.id)
>> [1] "factor" "integer" "oldClass" "numeric" "vector"
>>
>> and moreover:
>>
>> R> '9' %in% levels(m2$entrez.id)
>> [1] TRUE
>>
>> and the integer J() maneuver is a no go:
>>
>> R> m2[J(9)]
>> Error in `[.data.table`(m2, J(9)) :
>> x.entrez.id is a factor but joining to i.V1 which is not a factor.
>> Factors must join to factors.
>>
>> > -- but to be honest I'm not sure why:
>> >
>> > R> m2[9]
>> >
>> > doesn't work either...
>>
>> That works, in that it does something, but it just gets the 9th row of
>> m2, not the row whose key is '9'
>>
>> Seems like something strange is afoot here ...
>>
>> -steve
>>
>> > --Mel.
>> >
>> > -----Original Message-----
>> > From: datatable-help-bounces at r-forge.wu-wien.ac.at
>> > [mailto:datatable-help-bounces at r-forge.wu-wien.ac.at] On Behalf Of Steve
>> > Lianoglou
>> > Sent: Friday, March 04, 2011 5:46 PM
>> > To: datatable-help at r-forge.wu-wien.ac.at
>> > Subject: [datatable-help] Something seems funky. I think with
>> > character-to-factor conversion for keys (?)
>> >
>> > I'll have to apologize in advance because I can't create a
>> > reproducible example for this behavior, but I'll keep trying .. please
>> > bear with me.
>> >
>> > Somehow I've ended up with a data.table `m2` that looks like this:
>> >
>> > R> m2
>> >       entrez.id total.tags.liver cds.liver intron.liver utr.liver
>> >  [1,]         9               27         0            0         0
>> >  [2,]        10              347         0            0         0
>> >  [3,]        12             5076         0           17         0
>> >  [4,]        13             2445         0            0         0
>> >  [5,]        18             2076         0            0         0
>> >  [6,]        20               15         0            0         0
>> >  [7,]        25               62         0            0         0
>> >  [8,]        32              320         0            0         0
>> >  [9,]        34             1377         0            0         0
>> > [10,]        35              757         0            0         0
>> > First 10 rows of 5236 printed.
>> >
>> > R> key(m2)
>> > [1] "entrez.id"
>> >
>> > R> any(duplicated(m2$entrez.id))
>> > [1] FALSE
>> >
>> > So far so good -- I stumbled on the following problem when `merge`-ing
>> > two large data tables, which was giving me a strange error. In the
>> > process of trying to smoke out the problem, I noticed this unexpected
>> > behavior:
>> >
>> > ## This is expected
>> > R> subset(m2, entrez.id == '9')
>> >      entrez.id total.tags.liver cds.liver intron.liver utr.liver
>> > [1,]         9               27         0            0         0
>> >
>> > ## This isn't
>> > R> m2['9']
>> >      entrez.id total.tags.liver cds.liver intron.liver utr.liver
>> > [1,]         9               NA        NA           NA        NA
>> >
>> > Woops! Isn't that supposed to return the same as above?
>> >
>> > I can fix `m2` by manipulating the key column:
>> >
>> > R> key(m2) <- NULL ## probably not necessary
>> > R> m2$entrez.id <- as.character(m2$entrez.id)
>> > R> key(m2) <- 'entrez.id'
>> > R> m2['9']
>> >      entrez.id total.tags.liver cds.liver intron.liver utr.liver
>> > [1,]         9               27         0            0         0
>> >
>> > (side note: the bug I mentioned when I try to `merge` this w/ another
>> > data.table is gone after I did the above fix).
>> >
>> > So -- I guess my point is that I'm not exactly sure how I got `m2` to
>> > have a funky key, but I think the fact that it got messed up like this
>> > at all is undesired behavior, no?
>> >
>> > Does this point to something (maybe obvious) that happened on the way
>> > to building up `m2`?
>> >
>> > Thanks,
>> > -steve
>> >
>> > --
>> > Steve Lianoglou
>> > Graduate Student: Computational Systems Biology
>> > | Memorial Sloan-Kettering Cancer Center
>> > | Weill Medical College of Cornell University
>> > Contact Info: http://cbio.mskcc.org/~lianos/contact
>> > _______________________________________________
>> > datatable-help mailing list
>> > datatable-help at lists.r-forge.r-project.org
>> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>> >
>>
>>
>>
>
>
>
--
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact