[datatable-help] assignment by reference in subset

Matthew Dowle mdowle at mdowle.plus.com
Mon Nov 19 23:54:07 CET 2012


On 18.11.2012 20:03, Steve Lianoglou wrote:
> Hi,
>
> On Sun, Nov 18, 2012 at 11:19 AM, Philip de Witt Hamer
> <pcvdwh at gmail.com> wrote:
>> Dear all,
>>
>> data.table is great! thanks for this life(time)saving package.
>>
>> Now, I have run into a tough nut to crack using ':='.
>> I'd like to do a calculation using column information conditional on
>> another column.
>>
>> first some jumbo data:
>>
>> library(data.table)
>> DT <- data.table(
>>  1:50,
>>  rep(1:5,each=10),
>>  runif(50,0,1)
>> )
>> setnames(DT, 1:3, c("id","grp","p"))
>>
>> ids are unique
>> grp speaks for itself
>> think of the p's as e.g. p-values
>>
>> next, if I want to obtain the number of p values at least as extreme as
>> the p of each row from the whole set, this seems to work well:
>>
>> DT[,c1 := sum(DT[,p] <= p), by=id]
>>
>> but when I try to get the number of p values at least as extreme as the p
>> of each row within the subset with identical grp, I am having a hard time,
>> because these attempts fail:
>>
>> DT[,c2 := sum(DT[grp,p] <= p),by=id]
>> DT[,c3 := sum(DT[DT[,grp]==grp,p] <= p), by=id]
>
> You will want to group by "grp".
>
> This gets you pretty close -- it fails the "ties" criterion:
>
> DT[, cg := rank(p) - 1, by=grp]
>
> If you *really* want to keep the ties criterion, perhaps here's a way
> to do so by avoiding a for loop:
>
> DT[, cgo := rowSums(outer(p, p, '-') > 0), by=grp]
>
> The problem is that if your groups are very large, the `outer` call
> might chew lots of RAM, since you'll be creating a p x p matrix (per
> group).
>
> Does that get you where you need to be?
>
> -steve


Grouping by grp feels right to me, too. How about :

    setkey(DT,grp,p)

and then using the ordered p within each group :

    DT[,c1:=seq_len(.N),by=grp]
    DT[,c1:=max(c1),by='grp,p']  # to deal with ties

NB: data.table grouping of numerics is machine tolerance aware. So
this ties treatment is more like sum(DT[,p] <= p+tol) which may or
may not be what you need. tol = .Machine$double.eps ^ 0.5.
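
A minimal sketch of the keyed approach above, checked against a brute-force
count within each group. It assumes a current data.table release; the column
names cnt and ref, and the rounding of p to force some ties, are illustrative
and not from the thread.

```r
library(data.table)

set.seed(1)
# Rounded p values so that each group is likely to contain ties
DT <- data.table(id = 1:50, grp = rep(1:5, each = 10), p = round(runif(50), 1))

# Brute-force reference: per row, count the p values in the same group
# that are less than or equal to that row's p
DT[, ref := vapply(p, function(x) sum(p <= x), integer(1)), by = grp]

# Keyed approach: sort by grp then p, number the rows within each group,
# then give tied p values the largest position among the ties
setkey(DT, grp, p)
DT[, cnt := seq_len(.N), by = grp]
DT[, cnt := max(cnt), by = .(grp, p)]

stopifnot(identical(DT$cnt, DT$ref))
```

The check compares exact ties only; it says nothing about values that differ
by less than a tolerance.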

Or, staying with the self join approach, one trick for the scoping
issue you hit is :

    DT[,c3:={i=list(grp);sum(DT[i,p]<=p)},by=id]

Where the DT[i,...] part relies on the fact that when i is a single name,
it is evaluated in the calling scope.

Or another way in one step is :

    DT[,c3:=sum(DT[eval(.(grp)),p]<=p),by=id]

which uses the feature that eval() in i already behaves like what ..() will
do in future.
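
The scoping issue can be sketched as follows. In an attempt such as
DT[DT[,grp]==grp, p], the grp on the right-hand side resolves to the inner
table's column rather than to the current row's value. Copying the value to a
name that is not a column lets the nested lookup find it in the calling scope.
The local names g and x, the result name cnt, and keying on grp alone are
choices made for this sketch, assuming a current data.table release.

```r
library(data.table)

set.seed(1)
DT <- data.table(id = 1:50, grp = rep(1:5, each = 10), p = round(runif(50), 1))
setkey(DT, grp)   # key on grp so that DT[.(g)] is a join on grp

# Per-row self lookup: g and x are not columns of DT, so the nested
# DT[...] call picks them up from the scope of the outer j expression
res <- DT[, {
  g <- grp   # this row's group
  x <- p     # this row's p value
  .(cnt = sum(DT[.(g), p] <= x))
}, by = id]

# Reference computed by grouping on grp instead
ref <- DT[, .(id, cnt = vapply(p, function(v) sum(p <= v), integer(1))), by = grp]

stopifnot(identical(res$cnt, ref$cnt))
```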

But grouping by grp should be much faster and cleaner, if possible.
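
As a sketch of the grouped route in a single step: base R's rank() with
ties.method = "max" returns, for each value, the number of values that are less
than or equal to it, which matches the sum(p <= x) criterion with exact ties.
This is a swapped-in alternative rather than something proposed in this thread;
the column names c2 and ref are illustrative.

```r
library(data.table)

set.seed(1)
DT <- data.table(id = 1:50, grp = rep(1:5, each = 10), p = round(runif(50), 1))

# One grouped pass: the "max" rank of a value equals the count of values <= it
DT[, c2 := rank(p, ties.method = "max"), by = grp]

# Brute-force reference for comparison
DT[, ref := vapply(p, function(x) sum(p <= x), integer(1)), by = grp]

stopifnot(identical(DT$c2, DT$ref))
```

Unlike the outer() approach, this does not build a matrix per group, so memory
use stays linear in the group size.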

Matthew




