[datatable-help] Skipping some Vi names
Matthew Dowle
mdowle at mdowle.plus.com
Sun Jul 17 12:41:32 CEST 2011
On Fri, 2011-07-15 at 10:47 -0400, Steve Lianoglou wrote:
> Hi,
>
> On Fri, Jul 15, 2011 at 10:06 AM, Joseph Voelkel <jgvcqa at rit.edu> wrote:
> > Thanks.
> >
> > 1. Where the quotes came from... No good reason. As I said, I hadn't used data.table in a while, and was a bit unsure of the syntax. I know the syntax is all rational, but for example, I first tried to use by= via something like by=names(oldDataFrame)[1:3], which was, say, c("x1","x2","x3"), but then found I needed the syntax "x1,x2,x3" (which to me does not seem very R-like, because I can't use a simple R function like names() in by= to transfer the information).
>
> I think this might have been true in earlier versions of data.table,
> but is not the case anymore.
>
> R> dt <- data.table(a=sample(letters[1:3], 15, replace=TRUE),
> b=sample(letters[1:3], 15, replace=TRUE), score=sample(1:100, 15))
> R> dt[, list(total=sum(score)), by=c('a', 'b')]
>      a b total
> [1,] a a    33
> [2,] a b    78
> [3,] a c    86
> [4,] b b   178
> [5,] b c    73
> [6,] c a    91
> [7,] c b    67
> [8,] c c    40
Right. And any other R expressions too; e.g.,
R> dt[, list(total=sum(score)), by=names(dt)[1:2]]
     a b total
[1,] a a   100   # same (different random data)
[2,] a b    36
[3,] a c   192
[4,] b b    21
[5,] b c   113
[6,] c a    91
[7,] c b    80
[8,] c c    98
That was added in 1.5.3 :
    o 'by' may now be a character vector of column names.
      This allows syntax such as DT[,sum(x),by=key(DT)].
There was a bug fix in 1.6 :
    o by=key(DT) now works when the number of rows is not
      divisible by the number of groups (#1298, an odd bug).
and another in 1.6.1 :
    o A 'by' character vector of column names now
      works when there are less rows than columns; e.g.,
      DT[,sum(x),by=key(DT)] where nrow(DT)==1.
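For anyone wanting to try the NEWS items above, here is a minimal sketch. The table and its column names (a, b, x) are made up for illustration, and it assumes a data.table version that includes the fixes listed:

```r
library(data.table)

# Hypothetical toy table; a and b are the grouping columns, x is the value.
DT <- data.table(a = c("x", "x", "y"),
                 b = c("p", "q", "q"),
                 x = 1:3,
                 key = c("a", "b"))

# 'by' given as a character vector of column names
res_chr <- DT[, sum(x), by = c("a", "b")]

# 'by' taken from the key columns
res_key <- DT[, sum(x), by = key(DT)]

# Single-row case: fewer rows than columns
one <- DT[1]
res_one <- one[, sum(x), by = c("a", "b")]
```

Both res_chr and res_key group on the same two columns, and the unnamed sum comes back in a column called V1.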
>
> What *is* still true is that you can only use a length-1 character vector
> when setting the key on a data.table during its construction, eg:
>
> R> dt <- data.table(a=..., b=..., score=..., key='a,b')
>
> instead of
>
> R> dt <- data.table(a=..., b=..., score=..., key=c('a','b'))
>
> And I do agree that that is a bit strange and worth tweaking.
Duly tweaked and committed :
    o The key argument of data.table() now accepts a vector of
      column names in addition to a single comma separated string
      of column names. Thanks to Steve Lianoglou for highlighting.
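A small sketch of the two forms side by side, assuming a data.table build that includes this change. The data values are invented for illustration:

```r
library(data.table)

# Key given as a single comma-separated string
dt1 <- data.table(a = c("y", "x"), b = c("q", "p"), score = c(2L, 1L),
                  key = "a,b")

# Key given as a character vector of column names
dt2 <- data.table(a = c("y", "x"), b = c("q", "p"), score = c(2L, 1L),
                  key = c("a", "b"))

# Both tables should report the same key columns
key(dt1)
key(dt2)
```

Either way the table is sorted by the key columns on construction, so the row with a == "x" comes first in both.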
>
> > So I used a similar (?) syntax for the j term. No, your documentation has it done correctly. Sorry about that.
> > 2. Thanks for mentioning .SD. I have wanted to use this several times, but in every case I have other variables in the data table as well. Is there a natural way to use something like dt[,lapply(.SD, sum), by="x,y"], if for example, dt's variables are x, y, A1, A2, A3, B1, B2, B3, B4, but I only want to sum over the Ai's? (Imagine a case where there are 40 Ai's and 40 Bi's, for example.)
>
> As far as I know, you can do two things:
>
> (1) For the terms of the summary/aggregation, you can split your
> data.table into different "groupings" based on the columns you want to
> calc over. In your case, you might split dt into 2 data.tables. One
> with the A* cols and the other with the B* cols and process each
> individually using the `lapply(.SD, ...)` trick if that's appropriate.
> Then recombine the results later; or
>
> (2) you can have a vector of names you want to process and use those
> in your lapply, eg:
>
> R> dt <- data.table(a=sample(letters[1:3], 15, replace=TRUE),
> b=sample(letters[1:3], 15, replace=TRUE),
> score=sample(1:100, 15),
> x=rnorm(15),
> y=sample(200:300, 15))
>
> Say I only want to sum over `score` and `x`
>
> R> use <- c('score', 'x')
> R> dt[, lapply(use, function(name) sum(.SD[[name]])), by=c('a','b')]
>
> Perhaps others will come up with something more clever ...
We need something better, don't we? Will follow up in the other thread...
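For reference, one way to restrict .SD to a subset of columns is the .SDcols argument. This is only a sketch and assumes a data.table version that provides .SDcols; the table, the column names and the "^A" pattern are hypothetical, chosen to mirror the x, y, A1..., B1... layout in the question:

```r
library(data.table)

# Hypothetical table: x and y are grouping columns, A* and B* are values.
dt <- data.table(x  = c("g1", "g1", "g2"),
                 y  = c("h", "h", "h"),
                 A1 = 1:3,
                 A2 = 4:6,
                 B1 = 7:9)

# Pick the A* columns by name pattern (ordinary R, nothing data.table-specific)
use <- grep("^A", names(dt), value = TRUE)

# Sum only those columns within each (x, y) group
res <- dt[, lapply(.SD, sum), by = c("x", "y"), .SDcols = use]
```

The result keeps the grouping columns plus one summed column per name in use, so the B* columns are left out of the computation.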
>
> -steve
>