[datatable-help] Skipping some Vi names

Sun Jul 17 12:41:32 CEST 2011

On Fri, 2011-07-15 at 10:47 -0400, Steve Lianoglou wrote:
> Hi,
> 
> On Fri, Jul 15, 2011 at 10:06 AM, Joseph Voelkel <jgvcqa at rit.edu> wrote:
> > Thanks.
> >
> > 1. Where the quotes came from... No good reason. As I said, I hadn't used data.table in a while, and was a bit unsure of the syntax. I know the syntax is all rational, but for example, I first tried to use by= via something like by=names(oldDataFrame)[1:3] , which was, say c("x1","x2","x3"), but then found I needed the syntax "x1,x2,x3" (which to me does not seem very R-like, because I can't use a simple R function like names() in by= to transfer the information).
> 
> I think this might have been true in earlier versions of data.table,
> but is not the case anymore.
> 
> R> dt <- data.table(a=sample(letters[1:3], 15, replace=TRUE),
> b=sample(letters[1:3], 15, replace=TRUE), score=sample(1:100, 15))
> R> dt[, list(total=sum(score)), by=c('a', 'b')]
>      a b total
> [1,] a a    33
> [2,] a b    78
> [3,] a c    86
> [4,] b b   178
> [5,] b c    73
> [6,] c a    91
> [7,] c b    67
> [8,] c c    40

Right. And any other R expressions too; e.g.,
R> dt[, list(total=sum(score)), by=names(dt)[1:2]]
     a b total
[1,] a a   100   # same (different random data)
[2,] a b    36
[3,] a c   192
[4,] b b    21
[5,] b c   113
[6,] c a    91
[7,] c b    80
[8,] c c    98

That was added in 1.5.3 :
o   'by' may now be a character vector of column names.
    This allows syntax such as DT[,sum(x),by=key(DT)].

There was a bug fix in 1.6 :
o    by=key(DT) now works when the number of rows is not
     divisible by the number of groups (#1298, an odd bug).

and another in 1.6.1 :
o    A 'by' character vector of column names now
     works when there are less rows than columns; e.g.,
        DT[,sum(x),by=key(DT)]  where nrow(DT)==1.
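
To make the three spellings concrete, here is a minimal sketch. It assumes a reasonably recent data.table; the toy table DT and its columns are made up for illustration and are not from the thread.

```r
library(data.table)

# Toy data; column names are hypothetical, chosen only for this sketch.
DT <- data.table(a = c("x", "x", "y", "y", "y"),
                 b = c("p", "q", "p", "p", "q"),
                 x = 1:5)
setkey(DT, a, b)

# 'by' as a character vector, produced by any R expression
r1 <- DT[, sum(x), by = c("a", "b")]
r2 <- DT[, sum(x), by = key(DT)]
r3 <- DT[, sum(x), by = names(DT)[1:2]]

# The single-row case mentioned in the 1.6.1 item
one <- DT[1]
r4 <- one[, sum(x), by = c("a", "b")]
```

All of r1, r2 and r3 should hold the same per-group sums, since each 'by' evaluates to c("a", "b").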

> 
> What *is* still true is that you can only use a length 1 character vector
> when setting the key on a data.table during its construction, eg:
> 
> R> dt <- data.table(a=..., b=..., score=..., key='a,b')
> 
> instead of
> 
> R> dt <- data.table(a=..., b=..., score=..., key=c('a','b'))
> 
> And I do agree that that is a bit strange and worth tweaking.

Duly tweaked and committed :

o   The key argument of data.table() now accepts a vector of
    column names in addition to a single comma separated string
    of column names. Thanks to Steve Lianoglou for highlighting.
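
A small sketch of what that change is meant to allow, assuming a data.table version that includes it. The toy columns are invented for illustration; both calls are expected to set the same two-column key.

```r
library(data.table)

# Single comma separated string (the original form)
dt1 <- data.table(a = c("y", "x", "x"), b = c("p", "q", "p"), score = 1:3,
                  key = "a,b")

# Character vector of column names (the newly accepted form)
dt2 <- data.table(a = c("y", "x", "x"), b = c("p", "q", "p"), score = 1:3,
                  key = c("a", "b"))

k1 <- key(dt1)
k2 <- key(dt2)
```

The vector form means the key can be built with ordinary R code, e.g. key=names(someDataFrame)[1:2], where someDataFrame is a hypothetical existing object.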

> 
> > So I used a similar (?) syntax for the j term. No, your documentation has it done correctly. Sorry about that.
> > 2. Thanks for mentioning .SD. I have wanted to use this several times, but in every case I have other variables in the data table as well. Is there a natural way to use something like dt[,lapply(.SD, sum), by="x,y"], if for example, dt's variables are x, y, A1, A2, A3, B1, B2, B3, B4, but I only want to sum over the Ai's? (Imagine a case where there are 40 Ai's and 40 Bi's, for example.)
> 
> As far as I know, you can do two things:
> 
> (1) For the terms of the summary/aggregation, you can split your
> data.table into different "groupings" based on the columns you want to
> calc over. In your case, you might split dt into 2 data.tables. One
> with the A* cols and the other with the B* cols and process each
> individually using the `lapply(.SD, ...)` trick if that's appropriate.
> Then recombine the results later; or
> 
> (2) you can have a vector of names you want to process and use those
> in your lapply, eg:
> 
> R> dt <- data.table(a=sample(letters[1:3], 15, replace=TRUE),
>           b=sample(letters[1:3], 15, replace=TRUE),
>           score=sample(1:100, 15),
>           x=rnorm(15),
>           y=sample(200:300, 15))
> 
> Say I only want to sum over `score` and `x`
> 
> R> use <- c('score', 'x')
> R> dt[, lapply(use, function(name) sum(.SD[[name]])), by=c('a','b')]
> 
> Perhaps others will come up with something more clever ...

We need something better, don't we? Will follow up in the other thread...
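
In the meantime, a sketch of Steve's option (2) that keeps the column names on the result. The set.seed() call is my addition so the toy data is reproducible. The second form relies on the .SDcols argument, which is only available in later data.table releases, so treat it as an assumption about the version you have installed.

```r
library(data.table)

set.seed(1)  # arbitrary seed, only to make the toy data reproducible
dt <- data.table(a = sample(letters[1:3], 15, replace = TRUE),
                 b = sample(letters[1:3], 15, replace = TRUE),
                 score = sample(1:100, 15),
                 x = rnorm(15),
                 y = sample(200:300, 15))

use <- c("score", "x")

# Option (2): loop over the wanted names inside j. sapply(simplify = FALSE)
# returns a list named after 'use', so the result columns are score and x.
res1 <- dt[, sapply(use, function(name) sum(.SD[[name]]), simplify = FALSE),
           by = c("a", "b")]

# Assumes a data.table version that has .SDcols: .SD is restricted to 'use'.
res2 <- dt[, lapply(.SD, sum), by = c("a", "b"), .SDcols = use]
```

Either way, y is left out of the sums and 'use' can be built programmatically, e.g. with grep("^A", names(dt), value=TRUE) for the Ai columns in the original question.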

> 
> -steve
>