[datatable-help] assignment by reference in subset

Mon Nov 19 23:54:07 CET 2012

On 18.11.2012 20:03, Steve Lianoglou wrote:
> Hi,
>
> On Sun, Nov 18, 2012 at 11:19 AM, Philip de Witt Hamer
> <pcvdwh at gmail.com> wrote:
>> Dear all,
>>
>> data.table is great! thanks for this life(time)saving package.
>>
>> Now, I run into a difficult nut to crack using ':='.
>> I'd like to do a calculation using column information conditional on
>> another column.
>>
>> first some jumbo data:
>>
>> library(data.table)
>> DT <- data.table(
>>  1:50,
>>  rep(1:5,each=10),
>>  runif(50,0,1)
>> )
>> setnames(DT, 1:3, c("id","grp","p"))
>>
>> ids are unique
>> grp speaks for itself
>> think of p as e.g. p-values
>>
>> next, if I want to obtain the nr of p values at least as extreme as the p
>> of each row from the whole set, this seems to work well:
>>
>> DT[,c1 := sum(DT[,p] <= p), by=id]
>>
>> but then, when I try to get the nr of p values at least as extreme as the p
>> of each row within the subset with identical grp, I am having a hard time,
>> because these attempts fail:
>>
>> DT[,c2 := sum(DT[grp,p] <= p),by=id]
>> DT[,c3 := sum(DT[DT[,grp]==grp,p] <= p), by=id]
>
> You will want to group by "grp".
>
> This gets you pretty close -- it fails the "ties" criterion:
>
> DT[, cg := rank(p) - 1, by=grp]
>
> If you *really* want to keep the ties criterion, perhaps here's a way
> to do so by avoiding a for loop:
>
> DT[, cgo := rowSums(outer(p, p, '-') > 0), by=grp]
>
> The problem is that if your groups are very large, the `outer` call
> might chew lots of RAM, since you'll be creating a p x p matrix (per
> group).
>
> Does that get you where you need to be?
>
> -steve
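
To make concrete what the expressions quoted above count, here is a minimal base R sketch. The toy vector p is made up for illustration (it is not from the thread) and contains one tie.

```r
# Hypothetical toy vector with one tie (0.5 appears twice).
p <- c(0.2, 0.5, 0.5, 0.9)

# outer(p, p, "-")[i, j] is p[i] - p[j], so each row sum of "> 0" counts
# the values strictly smaller than p[i].
n_smaller <- rowSums(outer(p, p, "-") > 0)

# Variant with ">=": counts the values less than or equal to p[i],
# i.e. the same quantity as sum(p <= p[i]), including p[i] itself.
n_at_most <- rowSums(outer(p, p, ">="))

# Base R rank() with ties.method = "max" gives the same counts without
# building the full length(p) x length(p) matrix.
n_rank <- rank(p, ties.method = "max")

print(n_smaller)  # 0 1 1 3
print(n_at_most)  # 1 3 3 4
print(n_rank)     # 1 3 3 4
```

So the '> 0' form counts strictly smaller values, while the '>=' form (or rank with ties.method = "max") matches the sum(... <= p) definition from the original question, which includes the row itself.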

Grouping by grp feels right to me, too. How about :

    setkey(DT,grp,p)

and then using the ordered p within each group :

    DT[,c1:=seq_len(.N),by=grp]
    DT[,c1:=max(c1),by='grp,p']  # to deal with ties

NB: data.table grouping of numerics is machine tolerance aware. So
this ties treatment is more like sum(DT[,p] <= p+tol) which may or
may not be what you need. tol = .Machine$double.eps ^ 0.5.
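
A small check of the keyed approach above against the brute-force definition, as a sketch only. Assumptions not in the thread: the column is named c2 (so an existing c1 is left alone), the seed is arbitrary, and p is rounded to one decimal to force ties. The ties here are exact duplicates, so the tolerance question does not arise.

```r
library(data.table)

set.seed(1)  # arbitrary seed, only for reproducibility
DT <- data.table(id = 1:50,
                 grp = rep(1:5, each = 10),
                 p = round(runif(50), 1))  # rounding forces ties

# Sort by grp, then p, so the position within a group is a running count.
setkey(DT, grp, p)
DT[, c2 := seq_len(.N), by = grp]
# Tied p values within a group all get the largest position among them.
DT[, c2 := max(c2), by = c("grp", "p")]

# Brute force: for each row, count p values in the same grp that are <= its p.
brute <- vapply(seq_len(nrow(DT)), function(k) {
  sum(DT$p[DT$grp == DT$grp[k]] <= DT$p[k])
}, integer(1))

stopifnot(all(DT$c2 == brute))
```

Note that setkey() reorders DT by reference, so the brute-force vector is computed after the sort to keep the rows aligned.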

Or, staying with the self join approach, one trick for the scoping
issue you hit is :

    DT[,c3:={i=list(grp);sum(DT[i,p]<=p)},by=id]

Where the DT[i,...] part relies on the fact that a single name i is
evaluated in calling scope.
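
A variant of the same scoping trick, as a sketch under assumptions: it supposes a data.table version that supports the on= argument, so no key is needed, and this_grp is a hypothetical name chosen only because it is not a column of DT.

```r
library(data.table)

set.seed(1)  # arbitrary seed, only for reproducibility
DT <- data.table(id = 1:50, grp = rep(1:5, each = 10), p = runif(50))

# Inside j, copy the current row's grp into a name that is not a column of DT.
# The inner DT[...] cannot find this_grp among its columns, so it is looked up
# in the enclosing scope, and the join returns the p values of that grp.
# The inner p is the matched rows' column; the outer p is the current row's value.
DT[, c3 := {
  this_grp <- grp
  sum(DT[.(this_grp), p, on = "grp"] <= p)
}, by = id]

# Cross-check against a plain grouped rank (base R rank, ties.method = "max").
DT[, check := rank(p, ties.method = "max"), by = grp]
stopifnot(all(DT$c3 == DT$check))
```

This runs one inner join per row, so it is mainly useful for seeing how the names resolve rather than for speed.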

Or another way in one step is :

    DT[,c3:=sum(DT[eval(.(grp)),p]<=p),by=id]

which uses the feature that eval() is already like what ..() will do in
future.

But grouping by grp should be much faster and cleaner, if possible.

Matthew