[datatable-help] My real issue with numeric keys: two numeric keys don't seem to unique correctly.

Matthew Dowle mdowle at mdowle.plus.com
Tue May 22 19:49:55 CEST 2012


[ For new users watching, we're talking about the very new feature in dev
that numeric can be in keys, not on CRAN yet. ]

is.unsorted is just for atomic vectors I think and data.table should only
be using it for integer vectors. It does do numeric, but disregards
tolerance. It's fastorder and duplist that do the multi-column logic. I
think the remaining bug is something in the modified shell sort that still
isn't stable for ties within tolerance. There are radix algos for numeric
out there, but I was planning to stick to the shell sort that's in base R's
source (with the modification for stability of ties within tolerance),
then do the radix speedup another time.  But if anyone can plonk in one of
the radix orderers (not sorters, and for double not float, that works on
all endians), that would be great. Or is there a package that has radix
for floating point already?  I think the source assumes NAs sort last, and
I've somehow got my modification to put NAs first wrong.  I was
also trying an in-place modification of the ordering vector, rather than
reordering x for each column (base always takes 1:length input).
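
To illustrate the point about is.unsorted disregarding tolerance, here is a
minimal sketch in base R. The helper is_unsorted_tol and its default tol are
hypothetical, made up for this example; they are not part of base R or
data.table, and the tolerance data.table actually uses may differ.

```r
# Hypothetical helper (an assumption for illustration only): treat x as
# sorted if no element drops below its predecessor by more than tol.
is_unsorted_tol <- function(x, tol = sqrt(.Machine$double.eps)) {
  stopifnot(is.numeric(x), !anyNA(x))   # NAs are out of scope in this sketch
  any(diff(x) < -tol)
}

# Three values that are all equal within tolerance.
x <- c(1, 1 + 1e-12, 1)

is.unsorted(x)       # TRUE: the exact comparison sees the tiny decrease
is_unsorted_tol(x)   # FALSE: the decrease is far smaller than tol
```

So a vector that an ordering routine considers tied within tolerance can
still be reported as unsorted by the exact check.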

The other thing that needs to be done for speed is to cycle through the
columns to be ordered in 1:n order. Do 1st first, then recursively order
each group separately. Currently it orders the whole of every column in
reverse order n:1, which is nice but makes it non-natural. That'll have to
wait for a future version, but it should be a good speedup when there are 2
or more columns in the key: the more columns in the key, the larger the
improvement.
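
The two strategies described above can be sketched in base R roughly as
follows. This is only an illustration under stated assumptions, not
data.table's fastorder: the function names order_reverse and order_forward
are made up here, base order() stands in for the modified shell sort, ties
are compared exactly (no tolerance), and the columns are assumed to contain
no NAs.

```r
# Reverse strategy: stable-order the whole of every column, last to first.
# order() is stable, so earlier passes survive as tie-breakers.
order_reverse <- function(cols) {
  o <- seq_along(cols[[1L]])
  for (j in rev(seq_along(cols))) {
    o <- o[order(cols[[j]][o])]
  }
  o
}

# Forward strategy: order by the 1st column, then recursively order each
# group of equal values by the remaining columns.
order_forward <- function(cols, idx = seq_along(cols[[1L]])) {
  if (length(idx) <= 1L) return(idx)
  idx <- idx[order(cols[[1L]][idx])]
  if (length(cols) == 1L) return(idx)
  key <- cols[[1L]][idx]
  # Start and end positions of each run of equal key values.
  starts <- which(c(TRUE, key[-1L] != key[-length(key)]))
  ends <- c(starts[-1L] - 1L, length(idx))
  for (g in seq_along(starts)) {
    rng <- starts[g]:ends[g]
    if (length(rng) > 1L) {
      idx[rng] <- order_forward(cols[-1L], idx[rng])
    }
  }
  idx
}

set.seed(1)
x <- rep(c(2, 1), each = 5)
y <- rnorm(10)
cols <- list(x, y)

# Both strategies should agree with base R's multi-column order().
identical(order_reverse(cols), order(x, y))
identical(order_forward(cols), order(x, y))
```

The forward version only touches the later columns within groups that are
still tied, which is where the speedup with more key columns would come
from.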

Matthew

> I am saying
>
> is.unsorted(dt)
>
> returns FALSE.  Is that the expected result here? If so then I do not
> understand how is.unsorted works. I guess I thought it should work for
> data.frames and not just vectors. I see that in setkeyv it is only
> used on the vector out of fastorder though so maybe that is my
> confusion.
>
> Either way, fastorder does not return the rightly sorted output indices.
>
>
>
> On Tue, May 22, 2012 at 12:52 PM, Steve Lianoglou
> <mailinglist.honeypot at gmail.com> wrote:
>> Hi,
>>
>> On Tue, May 22, 2012 at 12:31 PM, Chris Neff <caneff at gmail.com> wrote:
>>> Okay, I tried the latest dev version that claimed to fix this issue,
>>> but it is still there in a different way.  This was one hell of an
>>> issue to nail down. An example:
>>>
>>>> dt=data.table(x=rep(c(1,2), each=10), y=rnorm(20))
>>>> setkeyv(dt,c("x","y"))
>>>
>>> dt is not properly sorted in the y column. This isn't just an issue
>>> with your code. If you try is.unsorted (which you use in setkeyv), it
>>> returns FALSE, so it thinks it is sorted.
>>
>> I may be lost, but `is.unsorted` is working as expected here.
>>
>> For instance:
>>
>> R> is.unsorted(dt$y[1:10])
>> [1] TRUE
>>
>> But you're saying that returns FALSE for you? I guess we should
>> technically set.seed to be sure, but I'm pretty sure we shouldn't have
>> to ...
>>
>> -steve
>>
>> --
>> Steve Lianoglou
>> Graduate Student: Computational Systems Biology
>>  | Memorial Sloan-Kettering Cancer Center
>>  | Weill Medical College of Cornell University
>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>



