[datatable-help] assignment by reference in subset

Wed Nov 21 10:19:47 CET 2012

Hi Steve and Matthew,

Very helpful solutions indeed! Thanks a lot.

I played around a little with all of your suggestions.
To me it seems that the simplest one-step solution that handles ties the way I had hoped is:

DT[, cmx := rank(p,ties.method="max"), by=grp]
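
As a quick sanity check (a minimal sketch, not a benchmark), the one-liner can be compared against a brute-force count of the p values in the same grp that are <= each row's p. The rounding of p is my own addition to force ties, and the column name ref is only illustrative:

```r
library(data.table)

set.seed(1)
DT <- data.table(
  id  = 1:50,
  grp = rep(1:5, each = 10),
  p   = round(runif(50), 1)   # assumption: coarse rounding, only to force ties within groups
)

# One-step solution from above
DT[, cmx := rank(p, ties.method = "max"), by = grp]

# Brute-force reference: per row, count the p values in its group that are <= its own p
DT[, ref := vapply(p, function(x) sum(p <= x), integer(1)), by = grp]

all_match <- DT[, all(cmx == ref)]
```

If all_match is TRUE, the rank-based column and the explicit count agree on every row, ties included.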

--Philip

On Nov 19, 2012, at 11:54 PM, Matthew Dowle wrote:

> On 18.11.2012 20:03, Steve Lianoglou wrote:
>> Hi,
>> 
>> On Sun, Nov 18, 2012 at 11:19 AM, Philip de Witt Hamer
>> <pcvdwh at gmail.com> wrote:
>>> Dear all,
>>> 
>>> data.table is great! thanks for this life(time)saving package.
>>> 
>>> Now I have run into a tough nut to crack using ':='.
>>> I'd like to do a calculation on one column conditional on the value of
>>> another column.
>>> 
>>> first some jumbo data:
>>> 
>>> library(data.table)
>>> DT <- data.table(
>>> 1:50,
>>> rep(1:5,each=10),
>>> runif(50,0,1)
>>> )
>>> setnames(DT, 1:3, c("id","grp","p"))
>>> 
>>> id values are unique
>>> grp speaks for itself
>>> think of p as, e.g., p-values
>>> 
>>> next, if I want to obtain the number of p values at least as extreme as the p of
>>> each row from the whole set, this seems to work well:
>>> 
>>> DT[,c1 := sum(DT[,p] <= p), by=id]
>>> 
>>> but when I try to get the number of p values at least as extreme as the
>>> p of each row within the subset sharing the same grp, I am having a hard time,
>>> because these attempts fail:
>>> 
>>> DT[,c2 := sum(DT[grp,p] <= p),by=id]
>>> DT[,c3 := sum(DT[DT[,grp]==grp,p] <= p), by=id]
>> 
>> You will want to group by "grp".
>> 
>> This gets you pretty close -- it fails the "ties" criterion:
>> 
>> DT[, cg := rank(p) - 1, by=grp]
>> 
>> If you *really* want to keep the ties criterion, perhaps here's a way
>> to do so while avoiding a for loop:
>> 
>> DT[, cgo := rowSums(outer(p, p, '-') > 0), by=grp]
>> 
>> The problem is that if your groups are very large, the `outer` call
>> might chew lots of RAM, since you'll be creating an n x n matrix per
>> group, where n is the number of rows in that group.
>> 
>> Does that get you where you need to be?
>> 
>> -steve
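
To see what the outer() comparison counts, here is a small base-R sketch on a single hypothetical group (the vector p below is made up, with one tie). As far as I can tell, '> 0' counts the values strictly below each p, while '>= 0' counts the values at most equal to it, including ties and the row itself:

```r
p <- c(0.2, 0.5, 0.5, 0.9)   # hypothetical p-values for one group, with a tie

# As posted: for each row i, count the p[j] with p[i] - p[j] > 0 (strictly smaller values)
strictly_less <- rowSums(outer(p, p, "-") > 0)

# Variant: count the p[j] with p[i] - p[j] >= 0 (values <= p[i], ties and self included)
at_most <- rowSums(outer(p, p, "-") >= 0)

# The variant should agree with the rank-based count
same_as_rank <- all(at_most == rank(p, ties.method = "max"))
```

Note that outer() still builds a length(p) by length(p) matrix here, so the memory concern above applies to both forms.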
> 
> 
> Grouping by grp feels right to me, too. How about :
> 
>   setkey(DT,grp,p)
> 
> and then using the ordered p within each group :
> 
>   DT[,c1:=seq_len(.N),by=grp]
>   DT[,c1:=max(c1),by='grp,p']  # to deal with ties
> 
> NB: data.table grouping of numerics is machine tolerance aware. So
> this ties treatment is more like sum(DT[,p] <= p+tol) which may or
> may not be what you need. tol = .Machine$double.eps ^ 0.5.
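
To make the difference between an exact and a tolerance-aware comparison concrete, here is a base-R sketch of the arithmetic only. The vector p is hypothetical, and the code does not call data.table, so it says nothing about what a given data.table version actually does when grouping:

```r
tol <- .Machine$double.eps ^ 0.5   # the tolerance quoted above, roughly 1.5e-8

p <- c(0.3, 0.3 + 1e-10, 0.7)      # hypothetical values; the first two differ by less than tol

# Exact comparison: the first two values are treated as distinct
exact_count <- vapply(p, function(x) sum(p <= x), integer(1))

# Tolerance-aware comparison: the first two values behave like a tie
tol_count <- vapply(p, function(x) sum(p <= x + tol), integer(1))
```

With these inputs the two counts differ only for the first element, which is the kind of near-tie the note above is about.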
> 
> Or, staying with the self join approach, one trick for the scoping
> issue you hit is :
> 
>   DT[,c3:={i=list(grp);sum(DT[i,p]<=p)},by=id]
> 
> where the DT[i,...] part relies on the fact that a single name i is evaluated
> in the calling scope.
> 
> Or another way in one step is :
> 
>   DT[,c3:=sum(DT[eval(.(grp)),p]<=p),by=id]
> 
> which relies on eval() already behaving the way ..() will in the future.
> 
> But grouping by grp should be much faster and cleaner, if possible.
> 
> Matthew
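
For comparison, a minimal sketch of the keyed two-step approach above, checked against the rank-based count. The rounded p (to force exact ties) and the column names ck and cmx are my own choices for illustration:

```r
library(data.table)

set.seed(42)
DT <- data.table(
  id  = 1:50,
  grp = rep(1:5, each = 10),
  p   = round(runif(50), 1)   # assumption: rounded so that tied values are exactly equal
)

setkey(DT, grp, p)                             # sort by grp, then by p within grp
DT[, ck := seq_len(.N), by = grp]              # running position within each group
DT[, ck := max(ck), by = c("grp", "p")]        # tied p values share the highest position

# Rank-based count for comparison
DT[, cmx := rank(p, ties.method = "max"), by = grp]

agree <- DT[, all(ck == cmx)]
```

With exactly equal ties like these, both columns should hold the same counts; near-ties are a separate question, as noted above.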
> 
> 
> 
