[datatable-help] Setting key when resulting order of table is not unique

Matthew Dowle mdowle at mdowle.plus.com
Fri Jul 22 09:16:14 CEST 2011


On Thu, 2011-07-21 at 17:33 +0000, Alexander Peterhansl wrote:
> Thank you for your reply.
> 
> Yes, it's best to enforce a unique ordering with an additional key, as you said.  This works as expected, but seems to not be in line with what the help page says.
> 
> Example:
> > DT = data.table(index1=c(1,2,2),index2=c(1,2,3),values=c("a","b","c"))
> > key(DT) <- c("index1","index2")
> > DT
>      index1 index2 values
> [1,]      1      1      a
> [2,]      2      2      b
> [3,]      2      3      c
> 
> > DT[J(1:3),roll=TRUE]
>      index1 index2 values
> [1,]      1      1      a
> [2,]      2      2      b
> [3,]      2      3      c
> [4,]      3      3      c
> 
> The "rolling index" is index1 here.  Isn't index1 considered the first column of DT's key?  
> 
> In the help pages -- help(data.table) -- the following is said about the "roll" option:
> Applies to the last column of x's key, which is generally a date but can be any ordered variable, with gaps. When roll=TRUE if i's row matches to all but the last column of x's key, and the value of the last column falls in a gap (including after the last observation for that group), the prevailing value in x is rolled forward.
> 
> -Alex
> 
Thanks, yes good point. Improved that and committed :

roll  :  Applies to the last join column, generally a date but can be
any ordered variable, irregular and including gaps. If roll=TRUE and i's
row matches to all but the last x join column, and its value in the last
i join column falls in a gap (including after the last observation in x
for that group), then the prevailing value in x is rolled forward. This
operation is particularly fast using a modified binary search. The
operation is also known as last observation carried forward (LOCF).
Usually, there should be no duplicates in x's key, the last key column
is a date (or time, or datetime) and all the columns of x's key are
joined to. A common idiom is to select a contemporaneous regular time
series (dts) across a set of identifiers (ids):  
    DT[CJ(ids,dts),roll=TRUE]
where DT has a 2-column key (id,date) and CJ stands for cross join.
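
As a rough illustration of that idiom, here is a minimal sketch. It assumes a
current data.table release, where setkey() replaces the key(DT) <- form used
elsewhere in this thread. The table contents, ids and dts are made up for the
example.

```r
library(data.table)

# Hypothetical irregular observations for two identifiers
DT <- data.table(
  id    = c("a", "a", "b", "b"),
  date  = as.IDate(c("2011-07-01", "2011-07-05", "2011-07-02", "2011-07-06")),
  value = c(10, 11, 20, 21)
)
setkey(DT, id, date)   # 2-column key (id, date)

ids <- c("a", "b")
dts <- as.IDate(c("2011-07-03", "2011-07-07"))   # regular dates that fall in gaps

# For every (id, date) pair in the cross join, take the prevailing value in DT,
# i.e. the last observation on or before that date within the same id (LOCF).
res <- DT[CJ(ids, dts), roll = TRUE]
print(res)
```

Each row of the result carries the date asked for in dts, with value rolled
forward from the most recent earlier observation for that id.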



> 
> 
> -----Original Message-----
> From: Steve Lianoglou [mailto:mailinglist.honeypot at gmail.com] 
> Sent: Thursday, July 21, 2011 11:24 AM
> To: Alexander Peterhansl
> Cc: datatable-help at lists.r-forge.r-project.org
> Subject: Re: [datatable-help] Setting key when resulting order of table is not unique
> 
> Hi,
> 
> On Thu, Jul 21, 2011 at 11:02 AM, Alexander Peterhansl <APeterhansl at gaincapital.com> wrote:
> > Dear Data Table Help List,
> >
> > I am using data.table version 1.6 (with R version 2.12.2, 64-bit on 
> > Windows 7).  Suppose I have a table whose key does not give me a unique ordering.
> > Then the output of the "roll" option will be arbitrary (i.e., it will 
> > depend on what one does between the two executions).  Is this something noteworthy?
> >
> > Please see output of the following:
> >
> >> DT = data.table(A=c(1,2,2),B=c("b1","b3","b2"),key="A")
> >
> >> DT[J(1:3),roll=TRUE]  # output 1
> >
> >      A  B
> > [1,] 1 b1
> > [2,] 2 b3
> > [3,] 2 b2
> > [4,] 3 b2
> >
> >> key(DT)="B"           # change keys to do other stuff...
> >> key(DT)="A"           # get back to key A
> >> DT[J(1:3),roll=TRUE]  # output 2 does not match output 1
> >         A  B
> > [1,] 1 b1
> > [2,] 2 b2
> > [3,] 2 b3
> > [4,] 3 b3
> >
> > (Also, as an aside, I get identical output in the two executions of 
> > DT[J(1:3),roll=TRUE] when I start with the table DT =
> > data.table(A=c(1,2,2),B=c("b1","b2","b3"),key="A") instead.)
> >
> > I'm sure there must also be other reverberations, beyond the effect on 
> > the roll option.
> >
> > Any insight would be of interest.  Thank you.
> 
> I don't think it's all that surprising in this case.
> 
> The original "keying" on A does not take your B column into consideration here:
> 
> R> DT = data.table(A=c(1,2,2),B=c("b1","b3","b2"),key="A")
> 
> But then when you set the key on "B", of course "b2" will have to be rearranged to come before "b3".
> 
> After you set the key on your DT back to A, A itself is in order already (1,2,2) == (1,2,2) so no moving around happens. You should note that the reordering in data.table is "stable" (I'm 95% sure on that, Matthew can verify) so "ties" will appear in the same order as they did in the original input.
> 
> If it is important in your scenario that this doesn't change when you "roll", you can always set a compound key on DT prior to doing that
> calculation:
> 
> R> key(DT) <- c('A', 'B')
> 
> Anyway you shake it, if you run your code, then set the key to just "B", then again to c("A", "B") to "roll" again, your results will be the same.
> 
> --
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
>  | Memorial Sloan-Kettering Cancer Center
>  | Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
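
For reference, the tie-ordering behaviour discussed in the quoted thread can
be reproduced roughly as follows. This is a sketch assuming a current
data.table release, so it uses setkey() rather than key(DT) <-. The names
out1 to out4 are only for illustration.

```r
library(data.table)

DT <- data.table(A = c(1L, 2L, 2L), B = c("b1", "b3", "b2"))

setkey(DT, A)                        # ties on A keep their input order: b3, b2
out1 <- DT[J(1:3), roll = TRUE]$B    # A = 3 rolls from the last A = 2 row

setkey(DT, B)                        # rows reordered to b1, b2, b3
setkey(DT, A)                        # A is already sorted, so ties stay as b2, b3
out2 <- DT[J(1:3), roll = TRUE]$B    # differs from out1

# A compound key makes the row order unique, so the detour through B
# no longer changes the result.
setkey(DT, A, B)
out3 <- DT[J(1:3), roll = TRUE]$B
setkey(DT, B)
setkey(DT, A, B)
out4 <- DT[J(1:3), roll = TRUE]$B

print(list(out1 = out1, out2 = out2, same_with_compound_key = identical(out3, out4)))
```

With the key on A alone, the row that A = 3 rolls from depends on whichever
tied row happens to be last; with the key on (A, B) that row is fixed.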



