[datatable-help] assignment by reference in subset

Steve Lianoglou mailinglist.honeypot at gmail.com
Sun Nov 18 21:03:48 CET 2012


Hi,

On Sun, Nov 18, 2012 at 11:19 AM, Philip de Witt Hamer <pcvdwh at gmail.com> wrote:
> Dear all,
>
> data.table is great! thanks for this life(time)saving package.
>
> Now, I run into a difficult nut to crack using ':='.
> I'd like to do a calculation using column information conditional on another
> column
>
> first some jumbo data:
>
> library(data.table)
> DT <- data.table(
>  1:50,
>  rep(1:5,each=10),
>  runif(50,0,1)
> )
> setnames(DT, 1:3, c("id","grp","p"))
>
> ids are unique
> grp speaks for itself
> think of p as e.g. p-values
>
> next, if I want to obtain the number of p values at least as extreme as the
> p of each row from the whole set, this seems to work well:
>
> DT[,c1 := sum(DT[,p] <= p), by=id]
>
> but when I try to get the number of p values at least as extreme as the
> p of each row within the subset with identical grp, I am having a hard
> time; these attempts fail:
>
> DT[,c2 := sum(DT[grp,p] <= p),by=id]
> DT[,c3 := sum(DT[DT[,grp]==grp,p] <= p), by=id]

You will want to group by "grp".

This gets you pretty close, but it fails the "ties" criterion, since
tied p values end up sharing an averaged rank:

DT[, cg := rank(p) - 1, by=grp]
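To make the ties caveat concrete, here is a small sketch on a made-up
vector (the values and the names by_rank / by_count are illustrative,
not from the thread). It also shows that rank(p) - 1 leaves the row
itself out of the count, whereas the `<=` in the original question
includes it:

```r
# Made-up values with one tie, purely for illustration
p <- c(0.2, 0.5, 0.5, 0.9)

# rank() uses ties.method = "average" by default,
# so the two tied values share a fractional rank
by_rank <- rank(p) - 1
print(by_rank)   # 0.0 1.5 1.5 3.0

# Direct count of values <= each p, the row itself included
by_count <- sapply(p, function(x) sum(p <= x))
print(by_count)  # 1 3 3 4
```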

If you *really* want to keep the ties criterion, here's a way to do
so that avoids a for loop:

DT[, cgo := rowSums(outer(p, p, '-') > 0), by=grp]

The problem is that if your groups are very large, the `outer` call
might chew lots of RAM, since you'll be creating an n x n matrix per
group (n being the number of rows in that group).
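If the matrix gets too big, one possible alternative (a sketch, not
something tested on your real data) is rank() with ties.method = "max",
which for each value returns the number of values less than or equal
to it, the value itself included. The column names c_max and c_outer,
the seed, and the rounding used to force ties are assumptions made for
this example only. Note that the `> 0` comparison counts strictly
smaller values; the sketch uses `>= 0` so the outer() version matches
the `<=` of the original question:

```r
library(data.table)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
DT <- data.table(
  id  = 1:50,
  grp = rep(1:5, each = 10),
  p   = round(runif(50, 0, 1), 1)  # rounding forces some ties
)

# Per group: number of p values <= this row's p (row itself included),
# without building an n x n matrix
DT[, c_max := rank(p, ties.method = "max"), by = grp]

# Same count via outer(), with >= 0 so ties and the row itself are counted
DT[, c_outer := rowSums(outer(p, p, "-") >= 0), by = grp]

# The two columns should agree on every row
stopifnot(all(DT$c_max == DT$c_outer))
```

Subtracting 1 from c_max drops the row itself from the count, if that
is the convention you prefer.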

Does that get you where you need to be?

-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

