[datatable-help] constructing expressions for the j argument from character vectors

Erik Iverson erikriverson at gmail.com
Thu Sep 29 04:51:14 CEST 2011


Excellent everyone, thanks so much.  .SD was the feature I did not
know about that makes it easy to pass in strings like I wanted,
thanks.  I was also able to get a large speedup by implementing the
function f using an idea from the wiki, as:

f <- function(x) length(.Internal(unique(x, FALSE, FALSE)))

Thanks again; very useful package!
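
For anyone reading this thread later, here is a minimal sketch of how the uniqueSummary3 function from my original question might look with .SD and .SDcols. The function body and the small toy data set are my own illustration, not code from the package or the wiki, and I use sapply(..., mean) for the column means, since calling mean() directly on a data.frame or data.table may not work in newer versions of R.

```r
library(data.table)

# Hypothetical helper (illustration only): mean number of distinct values
# per group, for each column named in `vars`, grouped by the column(s)
# named in `key`.
uniqueSummary3 <- function(df, key, vars) {
  DT <- as.data.table(df)
  f <- function(x) length(unique(x))
  # .SDcols restricts .SD to the requested columns; lapply applies f to each
  summaryDT <- DT[, lapply(.SD, f), by = key, .SDcols = vars]
  # average the per-group counts, giving one mean per requested column
  sapply(summaryDT[, vars, with = FALSE], mean)
}

# Small made-up data set, only for illustration
set.seed(1)
toy <- data.frame(id     = sample(1:20, 500, replace = TRUE),
                  clinic = sample(1:5,  500, replace = TRUE),
                  dx     = sample(1:30, 500, replace = TRUE))

uniqueSummary3(df = toy, key = "id", vars = c("dx", "clinic"))
```

The result is a named numeric vector with one entry per element of vars, so the same call works for any set of column names passed in as strings.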

On Wed, Sep 28, 2011 at 11:58 AM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
> items 1 and 5 on the wiki are relevant here, for speed comparisons :
> http://rwiki.sciviews.org/doku.php?id=packages:cran:data.table
>
> "Matthew Dowle" <mdowle at mdowle.plus.com> wrote in message
> news:j5vg13$9l$1 at dough.gmane.org...
>>
>> Something like this :
>>
>>> DT = as.data.table(testData)
>>> f = function(x)length(unique(x))
>>> vars = "dx"
>>> mean(DT[,lapply(.SD,f),by="id",.SDcols=vars])[-1]
>>     dx
>> 44.2212
>>> vars = c("dx","rx")
>>> mean(DT[,lapply(.SD,f),by="id",.SDcols=vars])[-1]  # same again, just different vars
>>     dx      rx
>> 44.2212 48.7814
>>> vars = c("dx","rx","clinic")
>>> mean(DT[,lapply(.SD,f),by="id",.SDcols=vars])[-1]  # same again, just different vars
>>     dx      rx  clinic
>> 44.2212 48.7814  9.9331
>>>
>>
>> Chris' suggestion of parse(text=paste(...)) is another way you could do it
>> (and may be more efficient).
>>
>> Matthew
>>
>>
>> "Matthew Dowle" <mdowle at mdowle.plus.com> wrote in message
>> news:j5vdoq$epb$1 at dough.gmane.org...
>>> Hi,
>>>
>>> Welcome.
>>> Just to check you've found .SD,  [,lapply(.SD,sum),by=...], and .SDcols?
>>> .SD consists of all columns other than the grouping columns, which seems
>>> similar to what this line is doing? :
>>>> mean(summaryDT[,(ncols-length(sList) + 2):ncols, with = FALSE])
>>>
>>> Matthew
>>>
>>>
>>> "Erik Iverson" <erikriverson at gmail.com> wrote in message
>>> news:CAKzGw12zWpPt3pSqJCDH_SmDOQOLAjRUV7cV64UYWb8pXK13uQ at mail.gmail.com...
>>> Hello,
>>>
>>> Thank you for providing the data.table package, I think it will be
>>> very useful to me going forward.  I have a question about passing
>>> around expressions, and have come up with an example to show what I'm
>>> after.
>>>
>>> library(data.table)
>>> ## test data
>>> N <- 500000
>>> set.seed(100)
>>> testData <- data.frame(id = c(sample(1:10000, N, replace = TRUE)),
>>> clinic = c(sample(1:10, N, replace = TRUE)),
>>> dx = c(sample(1:200, N, replace = TRUE)),
>>> rx = c(sample(1:1000, N, replace = TRUE)))
>>>
>>> ## want to know mean number of dx per ID
>>> mean(tapply(testData$dx, testData$id,
>>> function(x) length(unique(x)))) ## 44.2212
>>>
>>> ## in my real use case, I want to run this with different 'by'
>>> ## variables, so let's write a function and try to use data.table,
>>> ## call the function uniqueSummary1
>>>
>>> uniqueSummary1 <- function(df, key) {
>>> DT <- data.table(df)
>>> key(DT) <- key
>>>
>>> summaryDT <- DT[, list(length(unique(dx)),
>>> length(unique(rx))), by = key]
>>>
>>> mean(summaryDT[,list(V1, V2)])
>>>
>>> }
>>>
>>> ## agrees with tapply
>>> uniqueSummary1(df = testData, key = c("id"))
>>>
>>> ## The above works great, but isn't general, since in my real use
>>> ## case, I won't know dx and rx are the variables of interest. I want
>>> ## to be able to pass them in as arguments. This is exactly what FAQ
>>> ## 1.6 is, so let's use that solution to define uniqueSummary2
>>>
>>> uniqueSummary2 <- function(df, key, vars) {
>>> DT <- data.table(df)
>>> key(DT) <- key
>>>
>>> sList <- substitute(vars)
>>> summaryDT <- DT[, eval(sList), by = key]
>>> ncols <- ncol(summaryDT)
>>>
>>> mean(summaryDT[,(ncols-length(sList) + 2):ncols, with = FALSE])
>>> }
>>>
>>> uniqueSummary2(df = testData, key = c("id"),
>>> vars = list(length(unique(dx)),
>>> length(unique(rx)),
>>> length(unique(clinic))))
>>>
>>> ## uniqueSummary2 is better, but relies on me repeating the
>>> ## "length(unique())" bit several times. Ideally, I'd just like to
>>> ## pass in a list of QUOTED vars to summarize, like the following
>>> ## hypothetical call to my yet-unwritten uniqueSummary3 function:
>>>
>>> uniqueSummary3(df = testData, key = c("id"),
>>> vars = c("dx", "rx", "clinic"))
>>>
>>> I assume I can somehow construct the expression for the j index inside
>>> my function, based on the 'vars' character vector, but am stuck on
>>> how.  Any ideas?
>>>
>>> Thanks so much,
>>> Erik
>
>
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
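
The parse(text=paste(...)) route mentioned in the quoted reply could be sketched roughly as follows. This is only an illustration under stated assumptions: buildJ is a hypothetical helper name, the toy data frame is made up, and the expression is evaluated against a plain data.frame in base R so the example does not depend on any package.

```r
# Hypothetical helper (illustration only): turn c("dx", "rx") into the call
# list(dx = length(unique(dx)), rx = length(unique(rx)))
buildJ <- function(vars) {
  terms <- paste0(vars, " = length(unique(", vars, "))")
  # parse() returns an expression vector; [[1]] extracts the single call
  parse(text = paste0("list(", paste(terms, collapse = ", "), ")"))[[1]]
}

jExpr <- buildJ(c("dx", "rx"))
print(jExpr)

# Evaluate the constructed call with the columns of a made-up data frame
# in scope (base R only, no grouping)
toy <- data.frame(dx = c(1, 1, 2, 3), rx = c(5, 5, 5, 6))
res <- eval(jExpr, toy)
str(res)
```

Inside a data.table call the same object would presumably be used as DT[, eval(jExpr), by = key], mirroring the eval(sList) pattern in uniqueSummary2 quoted above.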
