[datatable-help] Question about by statements and subsetting

Steve Lianoglou lianoglou.steve at gene.com
Fri Aug 2 19:44:51 CEST 2013


Hi John,

On Fri, Aug 2, 2013 at 10:26 AM, John Kerpel <john.kerpel2 at gmail.com> wrote:
> I'm a noob to data.table and I've got a couple of questions:
>
> 1).  Why do I get different answers in the following example:
>
>> DT =
>> data.table(a=c(4:13),y=c(1,1,2,2,2,3,3,3,4,4),x=1:10,z=c(1,1,1,1,2,2,2,2,3,3),zz=c(1,1,1,1,1,2,2,2,2,2))
>> setkeyv(DT,cols=c("a","x","y","z","zz"))
>> DT[,if(.N>=4) {list(predict(smooth.spline(x,y),c(4,5,6))$y)} ,by=z]
>    z        V1
> 1: 1 2.1000000
> 2: 1 2.5000000
> 3: 1 2.9000000
> 4: 2 0.9998959
> 5: 2 2.0453352
> 6: 2 2.9093247
>
> Versus:
>
>> DT[,if(.N>=4) {list(predict(smooth.spline(x,y),a[1:3])$y)} ,by=z]
>    z       V1
> 1: 1 2.100000
> 2: 1 2.500000
> 3: 1 2.900000
> 4: 2 2.999995
> 5: 2 2.954664
> 6: 2 2.909333

I'm not sure why you would expect those two calls to give the same result?

In the first case, the second argument to predict() is always
c(4,5,6). In the second case it is a[1:3], which is evaluated within
each group: when z=1 that is c(4,5,6) (so the first three rows of
your second result match the first), but when z=2 it becomes
c(8,9,10). Doesn't that explain the behavior you are seeing?
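
To see this directly, here is a small sketch that returns a[1:3] per
group instead of feeding it to predict(). It rebuilds your DT (I left
out the setkeyv() call since the rows are already in that order); the
column name `newx` is just something I made up for illustration:

```r
library(data.table)

DT <- data.table(a  = 4:13,
                 y  = c(1, 1, 2, 2, 2, 3, 3, 3, 4, 4),
                 x  = 1:10,
                 z  = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3),
                 zz = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2))

# Inside j, a column name refers only to the rows of the current group,
# so a[1:3] is re-evaluated for every value of z.
# Groups with fewer than 4 rows return NULL and are dropped (here z = 3).
pts <- DT[, if (.N >= 4) list(newx = a[1:3]), by = z]
print(pts)
```

For z=1 this gives 4,5,6 and for z=2 it gives 8,9,10, which are the
points your second call is predicting at.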

> Is some sort of recycling going on here?

Where?

You are asking to predict on 3 points (either 4,5,6 or a[1:3]) so you
get 3 values back per z group.

> 2).  How can I do some sort of nested "by" statement?
>
> Let's say I want to set by=zz, but run the spline statement within each z
> subset.  Do I use .SD somehow?

Not sure what you mean, but does this do it?

R> DT[, if (.N >= 4) list(predict(smooth.spline(x, y), a)$y), by=c('zz', 'z')]

or something?
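
One thing worth checking before grouping on two columns: smooth.spline()
needs at least four unique x values, so it helps to count the rows in
each (zz, z) pair first. A rough sketch, assuming the same DT as in
your example and keeping the .N >= 4 guard from your original call
(`sizes` and `fits` are names I picked for illustration):

```r
library(data.table)

DT <- data.table(a  = 4:13,
                 y  = c(1, 1, 2, 2, 2, 3, 3, 3, 4, 4),
                 x  = 1:10,
                 z  = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3),
                 zz = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2))

# Number of rows in each (zz, z) combination, in order of first appearance
sizes <- DT[, .N, by = c("zz", "z")]
print(sizes)

# Fit the spline only where a group has enough points;
# smaller groups return NULL and drop out of the result.
fits <- DT[, if (.N >= 4) list(V1 = predict(smooth.spline(x, y), a)$y),
           by = c("zz", "z")]
print(fits)
```

With this data only the zz=1, z=1 pair has four rows, so that is the
only group that gets a fit.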

HTH,
-steve

-- 
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech

