[datatable-help] constructing expressions for the j argument from character vectors

Matthew Dowle mdowle at mdowle.plus.com
Wed Sep 28 18:58:26 CEST 2011


Items 1 and 5 on the wiki are relevant here, for speed comparisons:
http://rwiki.sciviews.org/doku.php?id=packages:cran:data.table

"Matthew Dowle" <mdowle at mdowle.plus.com> wrote in message 
news:j5vg13$9l$1 at dough.gmane.org...
>
> Something like this :
>
>> DT = as.data.table(testData)
>> f = function(x)length(unique(x))
>> vars = "dx"
>> mean(DT[,lapply(.SD,f),by="id",.SDcols=vars])[-1]
>     dx
> 44.2212
>> vars = c("dx","rx")
>> mean(DT[,lapply(.SD,f),by="id",.SDcols=vars])[-1]  # same again, just different vars
>     dx      rx
> 44.2212 48.7814
>> vars = c("dx","rx","clinic")
>> mean(DT[,lapply(.SD,f),by="id",.SDcols=vars])[-1]  # same again, just different vars
>     dx      rx  clinic
> 44.2212 48.7814  9.9331
>>
>
> Chris' suggestion of parse(text=paste(...)) is another way you could do it 
> (and may be more efficient).
>
> Matthew
>
>
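For reference, a minimal sketch of the parse(text=paste(...)) route mentioned above. The names jText, jExpr and toy, and the toy data itself, are made up for illustration and are not from the thread. The idea is to build the j expression as a string from the character vector, parse it, and eval() it inside j as in FAQ 1.6. No claim is made here about how its speed compares with .SDcols.

```r
library(data.table)

# Hypothetical toy data, small enough to check by hand
toy <- data.table(id = c(1, 1, 1, 2, 2),
                  dx = c(5, 5, 6, 7, 7),
                  rx = c(1, 2, 3, 4, 4))

vars <- c("dx", "rx")

# Build the text "list(dx=length(unique(dx)), rx=length(unique(rx)))"
jText <- paste0("list(",
                paste0(vars, "=length(unique(", vars, "))", collapse = ", "),
                ")")

# parse() returns an expression vector; take its first element, the call
jExpr <- parse(text = jText)[[1]]

# Evaluate the constructed call as j, once per group
counts <- toy[, eval(jExpr), by = "id"]
print(counts)
```

Since the expression is assembled from strings, this only makes sense when vars holds trusted column names.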
> "Matthew Dowle" <mdowle at mdowle.plus.com> wrote in message 
> news:j5vdoq$epb$1 at dough.gmane.org...
>> Hi,
>>
>> Welcome.
>> Just to check you've found .SD,  [,lapply(.SD,sum),by=...], and .SDcols?
>> .SD consists of all columns other than the grouping columns, which seems
>> similar to what this line is doing? :
>>> mean(summaryDT[,(ncols-length(sList) + 2):ncols, with = FALSE])
>>
>> Matthew
>>
>>
>> "Erik Iverson" <erikriverson at gmail.com> wrote in message 
>> news:CAKzGw12zWpPt3pSqJCDH_SmDOQOLAjRUV7cV64UYWb8pXK13uQ at mail.gmail.com...
>> Hello,
>>
>> Thank you for providing the data.table package, I think it will be
>> very useful to me going forward.  I have a question about passing
>> around expressions, and have come up with an example to show what I'm
>> after.
>>
>> library(data.table)
>> ## test data
>> N <- 500000
>> set.seed(100)
>> testData <- data.frame(id = c(sample(1:10000, N, replace = TRUE)),
>> clinic = c(sample(1:10, N, replace = TRUE)),
>> dx = c(sample(1:200, N, replace = TRUE)),
>> rx = c(sample(1:1000, N, replace = TRUE)))
>>
>> ## want to know mean number of dx per ID
>> mean(tapply(testData$dx, testData$id,
>> function(x) length(unique(x)))) ## 44.2212
>>
>> ## in my real use case, I want to run this with different 'by'
>> ## variables, so let's write a function and try to use data.table,
>> ## call the function uniqueSummary1
>>
>> uniqueSummary1 <- function(df, key) {
>> DT <- data.table(df)
>> key(DT) <- key
>>
>> summaryDT <- DT[, list(length(unique(dx)),
>> length(unique(rx))), by = key]
>>
>> mean(summaryDT[,list(V1, V2)])
>>
>> }
>>
>> ## agrees with tapply
>> uniqueSummary1(df = testData, key = c("id"))
>>
>> ## The above works great, but isn't general, since in my real use
>> ## case, I won't know dx and rx are the variables of interest. I want
>> ## to be able to pass them in as arguments. This is exactly what FAQ
>> ## 1.6 is, so let's use that solution to define uniqueSummary2
>>
>> uniqueSummary2 <- function(df, key, vars) {
>> DT <- data.table(df)
>> key(DT) <- key
>>
>> sList <- substitute(vars)
>> summaryDT <- DT[, eval(sList), by = key]
>> ncols <- ncol(summaryDT)
>>
>> mean(summaryDT[,(ncols-length(sList) + 2):ncols, with = FALSE])
>> }
>>
>> uniqueSummary2(df = testData, key = c("id"),
>> vars = list(length(unique(dx)),
>> length(unique(rx)),
>> length(unique(clinic))))
>>
>> ## uniqueSummary2 is better, but relies on me repeating the
>> ## "length(unique())" bit several times. Ideally, I'd just like to
>> ## pass in a list of QUOTED vars to summarize, like the following
>> ## hypothetical call to my yet-unwritten uniqueSummary3 function:
>>
>> uniqueSummary3(df = testData, key = c("id"),
>> vars = c("dx", "rx", "clinic"))
>>
>> I assume I can somehow construct the expression for the j index inside
>> my function, based on the 'vars' character vector, but am stuck on
>> how.  Any ideas?
>>
>> Thanks so much,
>> Erik 
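One possible shape for the uniqueSummary3 asked about here, as a sketch built on the .SD / .SDcols idiom from the replies. The function body and the toy data are assumptions for illustration, not code posted in the thread. It uses sapply(..., mean) for the final step because mean() on a whole data frame is not supported in current versions of R.

```r
library(data.table)

# Hypothetical uniqueSummary3: 'key' and 'vars' are character vectors
# of column names in 'df'
uniqueSummary3 <- function(df, key, vars) {
  DT <- as.data.table(df)
  # Per group, count the distinct values of each column named in 'vars'
  counts <- DT[, lapply(.SD, function(x) length(unique(x))),
               by = key, .SDcols = vars]
  # Average the per-group counts, leaving out the grouping columns
  sapply(counts[, vars, with = FALSE], mean)
}

# Made-up toy data, small enough to check by hand
toy <- data.frame(id = c(1, 1, 1, 2, 2),
                  dx = c(5, 5, 6, 7, 7),
                  rx = c(1, 2, 3, 4, 4))

res <- uniqueSummary3(df = toy, key = "id", vars = c("dx", "rx"))
print(res)
```

On the toy data, id 1 has 2 distinct dx and 3 distinct rx, and id 2 has 1 of each, so the means come out as 1.5 for dx and 2 for rx.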




