[datatable-help] Something seems funky. I think with character-to-factor conversion for keys (?)
Steve Lianoglou
mailinglist.honeypot at gmail.com
Tue Mar 8 05:05:51 CET 2011
On Mon, Mar 7, 2011 at 10:18 PM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
> Maybe. The slowdown would be fairly significant, perhaps. Although the
> levels vector is contiguous in memory, the global character hash (the
> memory where the character pointers point to) isn't. It's not the string
> cmp as such, it's the page fetches. Also, it might potentially do this
> check over and over again for the same levels vectors (very wasteful).
> Remember that [.data.table is recursive in places, although once only I
> think.
Good point.
Well -- there is a place I've identified in merge.data.table that
throws some esoteric error due to this problem. Perhaps I'll just
catch the error there and wrap it with a test to see if the levels of
the factor are sorted -- it'll only hit that once, and if it does
happen, maybe I (or whoever) will be able to remember how it got
messed up to begin with.
> Did you find out what created the out-of-order levels? This check won't
> help you find out where that occurred, or will it?
Unfortunately, I haven't investigated much further. I have a sneaking
suspicion that it has to do with the type of values that are in the
key'd column (entrez.id). It is originally of type character when the
data.table is constructed -- then after it is key'd, it turns into a
factor.
When I was making this particular data.table, I saved the table to a
text file, and reloaded it into another R session via read.table. The
thing with that entrez.id column is that it can successfully be parsed
as an integer, so it was likely read in as such. Somehow after it was
keyed it was turned into a factor -- not sure how.
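A minimal sketch of the read.table part of that suspicion, using base R only. The tab-separated text and the `symbol` column are invented stand-ins for the saved file, not the real data:

```r
# Hypothetical stand-in for the saved table (values invented for illustration).
txt <- "entrez.id\tsymbol\n9\tg1\n10\tg2\n100093630\tg3\n"

# With default settings, read.table guesses the type of each column,
# so an ID column made only of digits comes back as integer.
as_default <- read.table(text = txt, header = TRUE, sep = "\t")
class(as_default$entrez.id)

# Forcing the column classes keeps the IDs as character strings.
as_char <- read.table(text = txt, header = TRUE, sep = "\t",
                      colClasses = c("character", "character"))
class(as_char$entrez.id)
```

This only shows how the column could have become an integer on reload; it says nothing about how it later became a factor.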
The ordering of the "broken" levels is consistent with ordering an integer:
R> head(levels(m2$entrez.id))
[1] "9" "10" "12" "13" "18" "20"
And after fixing the data.table, the levels are reordered as a
character should be:
R> m2$entrez.id <- factor(as.character(m2$entrez.id))
R> head(levels(m2$entrez.id))
[1] "10" "10009" "100093630" "10010" "100113384" "100113407"
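The two orderings can be reproduced in base R without data.table. This is only a sketch: the `ids` vector is a made-up sample built from a few of the values shown above, and it does not explain how the real column ended up this way:

```r
# A few IDs stored as integers (sample values, not the full column).
ids <- c(9L, 10L, 12L, 10009L, 100093630L)

# factor() sorts the unique values first and only then converts them to
# character, so integer input gives levels in numeric order.
f_int <- factor(ids)
levels(f_int)               # "9" "10" "12" "10009" "100093630"
is.unsorted(levels(f_int))  # TRUE: as strings, "10" sorts before "9"

# Converting to character first gives levels in string order.
f_chr <- factor(as.character(ids))
levels(f_chr)
is.unsorted(levels(f_chr))  # FALSE
```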
I'm thinking all signs are pointing to me having done something
bone-headed ... I haven't had time to try many different things, but
the one obvious thing I tried (reconstructing my m2 table from my raw
text data file) isn't turning my entrez.id column into a factor column
with out-of-order levels.
Anyway -- I guess there isn't much to do just yet.
As I said, I'll just add a check for an error in the appropriate place
in merge.data.table and keep my eye out to see if it happens again.
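Roughly, the check could look like the sketch below. `check_sorted_levels` is a hypothetical helper name, not an existing data.table function, and where exactly it would be called inside merge.data.table is left open:

```r
# Hypothetical helper (not part of data.table): stop with a readable
# message when a factor's levels are not in sorted order.
check_sorted_levels <- function(x, name = "key column") {
  if (is.factor(x) && is.unsorted(levels(x))) {
    stop(sprintf("levels of factor '%s' are not sorted", name), call. = FALSE)
  }
  invisible(TRUE)
}

# Levels in string order pass the check.
ok <- factor(c("10", "9", "12"))
check_sorted_levels(ok, "entrez.id")

# Levels forced into integer-like order trigger the error.
bad <- factor(c("9", "10", "12"), levels = c("9", "10", "12"))
res <- tryCatch(check_sorted_levels(bad, "entrez.id"),
                error = function(e) conditionMessage(e))
res
```

Since is.unsorted() stops at the first out-of-order pair, the check is cheap on the common path where the levels are fine.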
-steve
>
>
> On Mon, 2011-03-07 at 21:39 -0500, Steve Lianoglou wrote:
>> On Mon, Mar 7, 2011 at 8:50 PM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>> > Btw :
>> >
>> >> a small utility at the C level to scan through the levels() of
>> >> factor-keys to test for them being in order and
>> >> breaking/short-circuiting as soon as it finds one level that's out of
>> >> order?
>> >
>> > That's base::is.unsorted(), which is done in C.
>>
>> Aww -- was looking forward to writing some C code ...
>>
>> It looks like you were right, though -- the problematic data.table has
>> a (factor) key where `is.unsorted(levels(the_key_column))` is TRUE.
>>
>> So I guess we're talking about having something like
>> options(datatable.check.factor.levels=TRUE) check at the top of the
>> [.data.table function that fires a warning() when the levels are
>> unsorted, yeah?
>>
>> -steve
>>
>
>
>
--
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact