[datatable-help] constructing expressions for the j argument from character vectors

Matthew Dowle mdowle at mdowle.plus.com
Wed Sep 28 17:16:40 CEST 2011


Hi,

Welcome.
Just to check: have you found .SD, [,lapply(.SD,sum),by=...], and .SDcols?
.SD consists of all columns other than the grouping columns, which seems
similar to what this line is doing:
> mean(summaryDT[,(ncols-length(sList) + 2):ncols, with = FALSE])
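
For illustration, here is a minimal sketch of how the .SD / .SDcols
suggestion might be applied to the uniqueSummary3 call asked about in the
quoted message. It is not code from the thread: the function body, the use
of as.data.table(), and the toy data are assumptions made for the example.

```r
library(data.table)

# Hypothetical helper (not from the thread): count the distinct values of
# each column named in `vars` within each `key` group, then average the counts.
uniqueSummary3 <- function(df, key, vars) {
  DT <- as.data.table(df)
  # .SD holds only the columns listed in .SDcols, one subset per group
  summaryDT <- DT[, lapply(.SD, function(x) length(unique(x))),
                  by = key, .SDcols = vars]
  # mean of the per-group distinct counts, one value per summarized column
  sapply(summaryDT[, vars, with = FALSE], mean)
}

# Small made-up data so the result can be checked by hand:
# id 1 has 2 distinct dx and 3 distinct rx; id 2 has 1 of each.
toy <- data.frame(id = c(1, 1, 1, 2, 2),
                  dx = c(5, 5, 6, 7, 7),
                  rx = c(1, 2, 3, 4, 4))
res <- uniqueSummary3(toy, key = "id", vars = c("dx", "rx"))
print(res)
```

Because the columns are selected by name through .SDcols, no expression has
to be constructed for j, and the column-position arithmetic is not needed.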

Matthew


"Erik Iverson" <erikriverson at gmail.com> wrote in message 
news:CAKzGw12zWpPt3pSqJCDH_SmDOQOLAjRUV7cV64UYWb8pXK13uQ at mail.gmail.com...
Hello,

Thank you for providing the data.table package, I think it will be
very useful to me going forward.  I have a question about passing
around expressions, and have come up with an example to show what I'm
after.

library(data.table)
## test data
N <- 500000
set.seed(100)
testData <- data.frame(id = c(sample(1:10000, N, replace = TRUE)),
                       clinic = c(sample(1:10, N, replace = TRUE)),
                       dx = c(sample(1:200, N, replace = TRUE)),
                       rx = c(sample(1:1000, N, replace = TRUE)))

## want to know mean number of dx per ID
mean(tapply(testData$dx, testData$id,
            function(x) length(unique(x)))) ## 44.2212

## in my real use case, I want to run this with different 'by'
## variables, so let's write a function and try to use data.table,
## call the function uniqueSummary1

uniqueSummary1 <- function(df, key) {
  DT <- data.table(df)
  key(DT) <- key

  summaryDT <- DT[, list(length(unique(dx)),
                         length(unique(rx))), by = key]

  mean(summaryDT[,list(V1, V2)])
}

## agrees with tapply
uniqueSummary1(df = testData, key = c("id"))

## The above works great, but isn't general, since in my real use
## case, I won't know dx and rx are the variables of interest. I want
## to be able to pass them in as arguments. This is exactly what FAQ
## 1.6 is, so let's use that solution to define uniqueSummary2

uniqueSummary2 <- function(df, key, vars) {
  DT <- data.table(df)
  key(DT) <- key

  sList <- substitute(vars)
  summaryDT <- DT[, eval(sList), by = key]
  ncols <- ncol(summaryDT)

  mean(summaryDT[,(ncols-length(sList) + 2):ncols, with = FALSE])
}

uniqueSummary2(df = testData, key = c("id"),
               vars = list(length(unique(dx)),
                           length(unique(rx)),
                           length(unique(clinic))))

## uniqueSummary2 is better, but relies on me repeating the
## "length(unique())" bit several times. Ideally, I'd just like to
## pass in a list of QUOTED vars to summarize, like the following
## hypothetical call to my yet-unwritten uniqueSummary3 function:

uniqueSummary3(df = testData, key = c("id"),
               vars = c("dx", "rx", "clinic"))

I assume I can somehow construct the expression for the j index inside
my function, based on the 'vars' character vector, but am stuck on
how.  Any ideas?

Thanks so much,
Erik 
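
As an illustration of the question above, one way the j expression could be
assembled from a character vector is sketched below using only base R
(bquote, as.name, as.call). The helper name buildUniqueExpr and the toy
data are hypothetical and do not come from the thread.

```r
# Hypothetical helper: turn c("dx", "rx") into the unevaluated call
#   list(dx = length(unique(dx)), rx = length(unique(rx)))
buildUniqueExpr <- function(vars) {
  # one length(unique(<column>)) call per name, with the name spliced in as a symbol
  calls <- lapply(vars, function(v) bquote(length(unique(.(as.name(v))))))
  names(calls) <- vars
  # prepend the symbol `list` and convert the whole thing into a call
  as.call(c(list(as.name("list")), calls))
}

jExpr <- buildUniqueExpr(c("dx", "rx", "clinic"))
print(jExpr)

# Evaluated against a plain data.frame here, to keep the sketch base-R only.
toy <- data.frame(dx = c(5, 5, 6), rx = c(1, 2, 3), clinic = c(9, 9, 9))
res <- eval(jExpr, toy)
print(res)
```

With a data.table, the constructed call could presumably be used the same
way as sList in uniqueSummary2, i.e. DT[, eval(jExpr), by = key].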




