[datatable-help] variable column names

Fri Apr 26 18:45:53 CEST 2013

S.O. is probably better for this kind of question then.
But if you don't get an answer there, then come back to datatable-help.

On 26.04.2013 17:26, Sam Steingold wrote:
>> * Matthew Dowle <zqbjyr at zqbjyr.cyhf.pbz> [2013-04-26 17:00:27 +0100]:
>>
>>> dt[, sum(behavior) > 0, by=user]
>>    user    V1
>> 1:    3  TRUE
>> 2:    4 FALSE
>>> dt[, any(behavior), by=user]     # same
>>    user    V1
>> 1:    3  TRUE
>> 2:    4 FALSE
>>> dt[, list(behavior = any(behavior)), by=user]   # how to do the same without setnames afterwards
>>    user behavior
>> 1:    3     TRUE
>> 2:    4    FALSE
>>> fields <- c("country","language")
>>> dt[, list(behavior = any(behavior)), by=c("user",fields)]   # by may be character vector of column names
>>    user country language behavior
>> 1:    3       2        5     TRUE
>> 2:    3       2        6     TRUE
>> 3:    4       1        6    FALSE
>> 4:    4       2        6    FALSE
>
> oh no, this is _not_ what I want!
> user should be unique and fields should be summarized as described in
> the SO question (see the code below)
>
>
>>
>>
>> On 26.04.2013 16:45, Sam Steingold wrote:
>>> I am still missing something:
>>>
>>> --8<---------------cut here---------------start------------->8---
>>>> dt <- data.table(user=c(rep(4, 5),rep(3, 5)),
>>>                  behavior=c(rep(FALSE,5),rep(TRUE,5)),
>>>                  country=c(rep(1,4),rep(2,6)),
>>>                  language=c(rep(6,6),rep(5,4)),
>>>                  event=1:10, key=c("user","country","language"))
>>>> dt
>>>     user behavior country language event
>>>  1:    3     TRUE       2        5     7
>>>  2:    3     TRUE       2        5     8
>>>  3:    3     TRUE       2        5     9
>>>  4:    3     TRUE       2        5    10
>>>  5:    3     TRUE       2        6     6
>>>  6:    4    FALSE       1        6     1
>>>  7:    4    FALSE       1        6     2
>>>  8:    4    FALSE       1        6     3
>>>  9:    4    FALSE       1        6     4
>>> 10:    4    FALSE       2        6     5
>>>>   users <- dt[, sum(behavior) > 0, by=user]
>>> Finding groups (bysameorder=TRUE) ... done in 0secs. bysameorder=TRUE and o__ is length 0
>>> Detected that j uses these columns: behavior
>>> Optimization is on but j left unchanged as 'sum(behavior) > 0'
>>> Starting dogroups ... done dogroups in 0 secs
>>>> users
>>>    user    V1
>>> 1:    3  TRUE
>>> 2:    4 FALSE
>>>> setnames(users, "V1", "behavior")
>>> --8<---------------cut here---------------end--------------->8---
>>>
>>> Now I want to do the same thing as in
>>>
>>> 
>>> http://stackoverflow.com/questions/16200815/summarize-a-data-table-with-unreliable-data
>>> for both fields
>>>> fields <- c("country","language")
>>>
>>> here is what I tried so far:
>>>
>>> --8<---------------cut here---------------start------------->8---
>>> dt[, .N, .SDcols=fields, by=eval(list("user",fields))]
>>> Error in `[.data.table`(dt, , .N, .SDcols = fields, by = eval(list("user",  :
>>>   The items in the 'by' or 'keyby' list are length (1,2). Each must be same length as rows in x or number of rows returned by i (10).
>>> Calls: [ -> [.data.table
>>> --8<---------------cut here---------------end--------------->8---
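
The error quoted above reports list items of length (1,2), which matches what list("user", fields) contains: one character vector of length 1 and one of length 2. A minimal sketch of that, using the dt and fields from this thread; the name item_lengths is only illustrative, and .SDcols is left out because j is just .N:

```r
library(data.table)

dt <- data.table(user     = c(rep(4, 5), rep(3, 5)),
                 behavior = c(rep(FALSE, 5), rep(TRUE, 5)),
                 country  = c(rep(1, 4), rep(2, 6)),
                 language = c(rep(6, 6), rep(5, 4)),
                 event    = 1:10,
                 key      = c("user", "country", "language"))
fields <- c("country", "language")

# list("user", fields) holds two items of length 1 and 2,
# not one grouping vector per row of dt
item_lengths <- lengths(list("user", fields))

# a single character vector of column names is accepted by 'by'
counts <- dt[, .N, by = c("user", fields)]
```

This only counts rows per (user, country, language) combination; it does not yet reduce the result to one row per user.
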
>>>
>>> the idea is to do something like
>>>
>>> --8<---------------cut here---------------start------------->8---
>>>> dt.out <- dt[, .N, by=list(user,country)][,
>>>      list(country[which.max(N)], max(N)/sum(N)), by=user]
>>>> setnames(dt.out, c("V1", "V2"), paste0("country",c(".name", ".support")))
>>>> users <- users[dt.out]
>>>    user behavior country.name country.support
>>> 1:    3     TRUE            2             1.0
>>> 2:    4    FALSE            1             0.8
>>> --8<---------------cut here---------------end--------------->8---
>>>
>>> except that I do not want to have the literal "country" and "language"
>>> and that I am sure there is a way to avoid copying users in
>>>> users <- users[dt.out]
>>> by a ":=" trick.
>>>
>>> Thanks.
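
One possible way to get one row per user without writing "country" or "language" literally is to loop over fields, rename the counted column to a fixed temporary name, and add the two summary columns to users by reference with ":=". This is a sketch under assumptions, not necessarily the approach the thread settles on: the names cnt, best, value, name, support, idx and newcols are made up for illustration, and the rows are matched with base match() rather than a keyed join.

```r
library(data.table)

dt <- data.table(user     = c(rep(4, 5), rep(3, 5)),
                 behavior = c(rep(FALSE, 5), rep(TRUE, 5)),
                 country  = c(rep(1, 4), rep(2, 6)),
                 language = c(rep(6, 6), rep(5, 4)),
                 event    = 1:10,
                 key      = c("user", "country", "language"))
fields <- c("country", "language")

# one row per user, named directly so no setnames is needed afterwards
users <- dt[, list(behavior = any(behavior)), by = user]

for (f in fields) {
  # rows per (user, value of field f)
  cnt <- dt[, .N, by = c("user", f)]
  # temporary fixed name so the next step does not depend on f
  setnames(cnt, f, "value")
  # most frequent value per user and its share of that user's rows
  best <- cnt[, list(name = value[which.max(N)],
                     support = max(N) / sum(N)), by = user]
  # align best to the row order of users, then add columns by reference
  idx <- match(users$user, best$user)
  newcols <- paste0(f, c(".name", ".support"))
  users[, (newcols) := list(best$name[idx], best$support[idx])]
}
```

For the country field this should give the same country.name and country.support values as the users[dt.out] output quoted above, and users is modified in place rather than reassigned.
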
>>>
>>>> * Matthew Dowle <zqbjyr at zqbjyr.cyhf.pbz> [2013-04-24 21:54:17 +0100]:
>>>>
>>>> where ... is eval(myid)
>>>> iigc
>>>>> Or:
>>>>> DT[,lapply(.SD,sum),by=...,.SDcols=myvars]
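
The fragment above can be tried on the dt from this thread. A small sketch, assuming myid and myvars are character vectors of column names; the thread does not show their definitions, so the values chosen here are illustrative only, and the grouping is written as by = c(myid) rather than the eval(myid) form mentioned in the quote:

```r
library(data.table)

dt <- data.table(user     = c(rep(4, 5), rep(3, 5)),
                 behavior = c(rep(FALSE, 5), rep(TRUE, 5)),
                 country  = c(rep(1, 4), rep(2, 6)),
                 language = c(rep(6, 6), rep(5, 4)),
                 event    = 1:10,
                 key      = c("user", "country", "language"))

myid   <- "user"                  # assumed: grouping column name(s)
myvars <- c("behavior", "event")  # assumed: columns to summarize

# .SDcols restricts .SD to myvars, so sum is applied to each of them per group
sums <- dt[, lapply(.SD, sum), by = c(myid), .SDcols = myvars]
```

The result keeps the names in myvars for the summarized columns, so no renaming step is needed.
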