[datatable-help] My real issue with numeric keys: two numeric keys don't seem to unique correctly.

Matthew Dowle mdowle at mdowle.plus.com
Wed May 23 00:38:51 CEST 2012


Ok, fixed in dev now, honest. Please retest.

It was the data.table code, nothing to do with is.unsorted().
But, Chris has a point, look at this:

> is.unsorted(data.frame(1:2))
[1] FALSE
> is.unsorted(data.frame(1:2,3:4))
[1] TRUE

I had to look at the C source to work it out. is.unsorted seems
definitely intended for atomic vectors only, but there's an earlier
switch to return FALSE if length==1. One of anything must be sorted,
even if that item is itself a vector. So that explains the FALSE: a
one-column data.frame has length 1. In the 2nd case, though, I don't
see why it dispatches to something called .gtn and returns TRUE.
There's an error message in the source, "only atomic vectors can be
tested to be sorted", but it doesn't seem to be reached in this case. I
think it should be in both cases, but the dispatch call seems to have
been put there deliberately.  Anyway ...
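
For anyone who wants to check whether a plain data.frame is sorted by
several columns without leaning on is.unsorted, here is a minimal sketch.
It relies only on base R's order() leaving ties in their original order.
The name is_sorted_by is made up for illustration; it is not part of base
R or data.table, and it assumes a data.frame (not a data.table) as input.

```r
# Hypothetical helper: TRUE if df is already in non-decreasing order by `cols`.
# order() keeps ties in their original order, so rows that are already
# sorted yield exactly 1, 2, ..., nrow(df).
is_sorted_by <- function(df, cols = names(df)) {
  # unname() so column names cannot be mistaken for order()'s own arguments
  ord <- do.call(order, unname(as.list(df[cols])))
  identical(ord, seq_len(nrow(df)))
}

df <- data.frame(x = rep(c(1, 2), each = 3),
                 y = c(0.1, 0.5, 0.9, 0.2, 0.3, 0.1))

is_sorted_by(df, "x")          # sorted on x alone
is_sorted_by(df, c("x", "y"))  # not sorted: y drops to 0.1 within x == 2
```

The first call returns TRUE and the second FALSE. Note this compares
doubles exactly, with no tolerance.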

Matthew


On Tue, 2012-05-22 at 18:49 +0100, Matthew Dowle wrote:
> [ For new users watching, we're talking about the very new feature in dev
> that numeric can be in keys, not on CRAN yet. ]
> 
> is.unsorted is just for atomic vectors I think and data.table should only
> be using it for integer vectors. It does do numeric, but disregards
> tolerance. It's fastorder and duplist that do the multi-column logic. I
> think what still isn't fixed is something in the modified shell sort that
> isn't stable for ties within tolerance. There are radix algos for numeric
> out there, but I was planning to stick to shell (with the modification for
> stability within ties (within tolerance) that's in base R's source),
> then do the radix speedup another time.  But if anyone can plonk in one of
> the radix orderers (not sorters, and for double not float, that works on
> all endians), that would be great. Or is there a package that has radix
> for floating point already?  I think the source assumes NAs sort last, and
> I somehow got it wrong when modifying that to put NAs first.  I was
> also trying an in-place modification of the ordering vector, rather than
> reordering x for each column (base always takes 1:length input).
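
To make "stable for ties within tolerance" concrete, here is a rough
sketch in plain R. It is not the modified shell sort discussed above and
not data.table's code; order_with_tolerance and its tol default are
assumptions chosen for illustration. It also chains neighbours: values
are put in one tie group whenever each sorted neighbour is within tol of
the next, which may be looser than a pairwise comparison. NAs are not
handled.

```r
# Hypothetical sketch: order a double vector, treating values within `tol`
# of their sorted neighbour as ties, and keep ties in original order.
order_with_tolerance <- function(x, tol = sqrt(.Machine$double.eps)) {
  stopifnot(is.numeric(x), !anyNA(x))
  if (length(x) == 0L) return(integer(0))
  o  <- order(x)                       # exact ordering first
  xs <- x[o]
  # start a new tie group whenever the gap to the previous value exceeds tol
  grp <- cumsum(c(TRUE, diff(xs) > tol))
  g <- integer(length(x))
  g[o] <- grp                          # group id for each original position
  order(g)                             # ties in g stay in original order
}

x <- c(2, 1 + 1e-12, 1, 2 - 1e-12)
order(x)                 # exact: 3 2 4 1
order_with_tolerance(x)  # within tolerance: 2 3 1 4
```

With exact comparison the near-equal pairs are reordered; with the
tolerance grouping they keep the order they had in x.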
> 
> The other thing that needs to be done for speed is cycle through the
> columns to be ordered in 1:n order. Do 1st first, then recursively order
> each group separately. Currently it orders the whole of every column in
> reverse order n:1, which is nice but makes it non-natural. That'll have to
> wait for a future version though, but should be a good speedup when there
> are 2 or more columns in the key; the more columns in the key, the larger
> the improvement.
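
The two strategies described above can be sketched side by side in plain
R. This is only an illustration built on base order(); both function
names are invented here, neither is fastorder, and NAs are not handled.
The first makes one stable pass over every full column in n:1 order. The
second orders the 1st column, then recurses into each run of equal values
with the next column, so later columns only ever sort within groups.

```r
# n:1 strategy: stable pass over the whole of each column, last column first
order_reverse_passes <- function(cols) {
  o <- seq_along(cols[[1]])
  for (j in rev(seq_along(cols))) {
    o <- o[order(cols[[j]][o])]        # stable, so earlier passes survive ties
  }
  o
}

# 1:n strategy: order column j, then order each group of ties by column j + 1
order_forward_recursive <- function(cols, idx = seq_along(cols[[1]]), j = 1L) {
  if (j > length(cols) || length(idx) < 2L) return(idx)
  idx <- idx[order(cols[[j]][idx])]
  v <- cols[[j]][idx]
  # consecutive equal values form one run; each run is ordered separately
  runs <- split(idx, cumsum(c(TRUE, v[-1L] != v[-length(v)])))
  unlist(lapply(runs, order_forward_recursive, cols = cols, j = j + 1L),
         use.names = FALSE)
}

cols <- list(c(2, 1, 2, 1, 1), c(0.3, 0.9, 0.1, 0.2, 0.9))
order_reverse_passes(cols)
order_forward_recursive(cols)
```

Both calls should agree with order(cols[[1]], cols[[2]]). The sketch says
nothing about speed; it only shows that the two traversal orders give the
same result.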
> 
> Matthew
> 
> > I am saying
> >
> > is.unsorted(dt)
> >
> > returns FALSE.  Is that the expected result here? If so then I do not
> > understand how is.unsorted works. I guess I thought it should work for
> > data.frames and not just vectors. I see that in setkeyv it is only
> > used on the vector out of fastorder though so maybe that is my
> > confusion.
> >
> > Either way, fastorder does not return correctly sorted output indices.
> >
> >
> >
> > On Tue, May 22, 2012 at 12:52 PM, Steve Lianoglou
> > <mailinglist.honeypot at gmail.com> wrote:
> >> Hi,
> >>
> >> On Tue, May 22, 2012 at 12:31 PM, Chris Neff <caneff at gmail.com> wrote:
> >>> Okay, I tried the latest dev version that claimed to fix this issue,
> >>> but it is still there in a different way.  This was one hell of an
> >>> issue to nail down. An example:
> >>>
> >>>> dt=data.table(x=rep(c(1,2), each=10), y=rnorm(20))
> >>>> setkeyv(dt,c("x","y"))
> >>>
> >>> dt is not properly sorted in the y column. This isn't just an issue
> >>> with your code. If you try is.unsorted (which you use in setkeyv), it
> >>> returns FALSE, so it thinks it is sorted.
> >>
> >> I may be lost, but `is.unsorted` is working as expected here.
> >>
> >> For instance:
> >>
> >> R> is.unsorted(dt$y[1:10])
> >> [1] TRUE
> >>
> >> But you're saying that returns FALSE for you? I guess we should
> >> technically set.seed to be sure, but I'm pretty sure we shouldn't have
> >> to ...
> >>
> >> -steve
> >>
> >> --
> >> Steve Lianoglou
> >> Graduate Student: Computational Systems Biology
> >>  | Memorial Sloan-Kettering Cancer Center
> >>  | Weill Medical College of Cornell University
> >> Contact Info: http://cbio.mskcc.org/~lianos/contact
> >
> 
> 
