[datatable-help] Skipping some Vi names

Steve Lianoglou mailinglist.honeypot at gmail.com
Fri Jul 15 16:47:06 CEST 2011


Hi,

On Fri, Jul 15, 2011 at 10:06 AM, Joseph Voelkel <jgvcqa at rit.edu> wrote:
> Thanks.
>
> 1. Where the quotes came from... No good reason. As I said, I hadn't used data.table in a while, and was a bit unsure of the syntax. I know the syntax is all rational, but for example, I first tried to use by= via something like by=names(oldDataFrame)[1:3] , which was, say c("x1","x2","x3"), but then found I needed the syntax "x1,x2,x3" (which to me does not seem very R-like, because I can't use a simple R function like names() in by= to transfer the information).

I think this might have been true in earlier versions of data.table,
but is not the case anymore.

R> dt <- data.table(a=sample(letters[1:3], 15, replace=TRUE),
b=sample(letters[1:3], 15, replace=TRUE), score=sample(1:100, 15))
R> dt[, list(total=sum(score)), by=c('a', 'b')]
     a b total
[1,] a a    33
[2,] a b    78
[3,] a c    86
[4,] b b   178
[5,] b c    73
[6,] c a    91
[7,] c b    67
[8,] c c    40
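
To connect this to the original question about reusing names() in by=, here is a minimal sketch, assuming a reasonably recent data.table. The data frame `oldDataFrame` and its columns are made up for illustration; only the idea of passing a character vector to by= comes from the example above.

```r
library(data.table)

# Hypothetical stand-in for the oldDataFrame mentioned in the question
oldDataFrame <- data.frame(
  x1 = c("a", "a", "b", "b"),
  x2 = c("u", "u", "v", "w"),
  x3 = c("p", "p", "q", "q"),
  score = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
)
dt <- as.data.table(oldDataFrame)

# Take the grouping columns straight from names(), as a character vector
grp <- names(oldDataFrame)[1:3]

# Sum score within each combination of the grouping columns
res <- dt[, list(total = sum(score)), by = grp]
print(res)
```

The variable name `grp` is chosen so that it does not clash with a column of dt; if it did, by= would group on that column instead.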

What *is* still true is that you can only use a length-1 character
vector when setting the key on a data.table during its construction, e.g.:

R> dt <- data.table(a=..., b=..., score=..., key='a,b')

instead of

R> dt <- data.table(a=..., b=..., score=..., key=c('a','b'))

And I do agree that that is a bit strange and worth tweaking.
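
One possible way around this, sketched under the assumption that setkeyv() is available in your version of data.table, is to build the table first and set the key afterwards from a character vector. The data and the name `key_cols` are made up for illustration.

```r
library(data.table)

# Small illustrative table, deliberately not sorted by a and b
dt <- data.table(
  a = c("c", "a", "b", "a"),
  b = c("y", "x", "y", "z"),
  score = 1:4
)

# Key columns held in an ordinary character vector
key_cols <- c("a", "b")

# setkeyv() takes the column names as a character vector and
# sorts dt by reference on those columns
setkeyv(dt, key_cols)

print(key(dt))
print(dt)
```

This keeps the key columns in a plain R vector, so they can come from names() or any other function that returns column names.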

> So I used a similar (?) syntax for the j term. No, your documentation has it done correctly. Sorry about that.
> 2. Thanks for mentioning .SD. I have wanted to use this several times, but in every case I have other variables in the data table as well. Is there a natural way to use something like dt[,lapply(.SD, sum), by="x,y"], if for example, dt's variables are x, y, A1, A2, A3, B1, B2, B3, B4, but I only want to sum over the Ai's? (Imagine a case where there are 40 Ai's and 40 Bi's, for example.)

As far as I know, you can do two things:

(1) For the terms of the summary/aggregation, you can split your
data.table into different "groupings" based on the columns you want to
calc over. In your case, you might split dt into 2 data.tables. One
with the A* cols and the other with the B* cols and process each
individually using the `lapply(.SD, ...)` trick if that's appropriate.
Then recombine the results later; or

(2) you can have a vector of names you want to process and use those
in your lapply, eg:

R> dt <- data.table(a=sample(letters[1:3], 15, replace=TRUE),
          b=sample(letters[1:3], 15, replace=TRUE),
          score=sample(1:100, 15),
          x=rnorm(15),
          y=sample(200:300, 15))

Say I only want to sum over `score` and `x`:

R> use <- c('score', 'x')
R> dt[, lapply(use, function(name) sum(.SD[[name]])), by=c('a','b')]
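
For the case in the question (many Ai and Bi columns, summing only the Ai's), a sketch along the same lines follows. It assumes a data.table version that has the `.SDcols` argument, which may not be true of the version discussed in this thread, and it uses made-up data with columns named as in the question.

```r
library(data.table)

# Illustrative table with grouping columns x, y and a few A*/B* columns
dt <- data.table(
  x  = c("g1", "g1", "g2", "g2"),
  y  = c("h1", "h1", "h1", "h2"),
  A1 = c(1, 2, 3, 4),
  A2 = c(10, 20, 30, 40),
  B1 = c(5, 6, 7, 8)
)

# Pick the columns to aggregate by name pattern rather than typing them out
a_cols <- grep("^A", names(dt), value = TRUE)

# .SDcols restricts .SD to the chosen columns, so lapply(.SD, sum)
# only touches the A* columns and the result keeps their names
res <- dt[, lapply(.SD, sum), by = c("x", "y"), .SDcols = a_cols]
print(res)
```

Compared with looping over a vector of names inside j, this keeps the original column names in the result instead of V1, V2, and so on.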

Perhaps others will come up with something more clever ...

-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

