[datatable-help] variable column names
Matthew Dowle
mdowle at mdowle.plus.com
Fri Apr 26 18:45:53 CEST 2013
S.O. is probably better for this kind of question then.
But if you don't get an answer there, then come back to datatable-help.
On 26.04.2013 17:26, Sam Steingold wrote:
>> * Matthew Dowle <zqbjyr at zqbjyr.cyhf.pbz> [2013-04-26 17:00:27 +0100]:
>>
>>> dt[, sum(behavior) > 0, by=user]
>> user V1
>> 1: 3 TRUE
>> 2: 4 FALSE
>>> dt[, any(behavior), by=user] # same
>> user V1
>> 1: 3 TRUE
>> 2: 4 FALSE
>>> dt[, list(behavior = any(behavior)), by=user] # same result, without needing setnames afterwards
>> user behavior
>> 1: 3 TRUE
>> 2: 4 FALSE
>>> fields <- c("country","language")
>>> dt[, list(behavior = any(behavior)), by=c("user",fields)] # by may be character vector of column names
>> user country language behavior
>> 1: 3 2 5 TRUE
>> 2: 3 2 6 TRUE
>> 3: 4 1 6 FALSE
>> 4: 4 2 6 FALSE
>
> oh no, this is _not_ what I want!
> user should be unique and fields should be summarized as described in
> the SO question (see the code below)
>
>
>>
>>
>> On 26.04.2013 16:45, Sam Steingold wrote:
>>> I am still missing something:
>>>
>>> --8<---------------cut here---------------start------------->8---
>>>> dt <- data.table(user=c(rep(4, 5),rep(3, 5)),
>>>> behavior=c(rep(FALSE,5),rep(TRUE,5)),
>>> country=c(rep(1,4),rep(2,6)),
>>> language=c(rep(6,6),rep(5,4)),
>>> event=1:10, key=c("user","country","language"))
>>>> dt
>>> user behavior country language event
>>> 1: 3 TRUE 2 5 7
>>> 2: 3 TRUE 2 5 8
>>> 3: 3 TRUE 2 5 9
>>> 4: 3 TRUE 2 5 10
>>> 5: 3 TRUE 2 6 6
>>> 6: 4 FALSE 1 6 1
>>> 7: 4 FALSE 1 6 2
>>> 8: 4 FALSE 1 6 3
>>> 9: 4 FALSE 1 6 4
>>> 10: 4 FALSE 2 6 5
>>>> users <- dt[, sum(behavior) > 0, by=user]
>>> Finding groups (bysameorder=TRUE) ... done in 0secs.
>>> bysameorder=TRUE
>>> and o__ is length 0
>>> Detected that j uses these columns: behavior
>>> Optimization is on but j left unchanged as 'sum(behavior) > 0'
>>> Starting dogroups ... done dogroups in 0 secs
>>>> users
>>> user V1
>>> 1: 3 TRUE
>>> 2: 4 FALSE
>>>> setnames(users, "V1", "behavior")
>>> --8<---------------cut here---------------end--------------->8---
>>>
>>> Now I want to do the same thing as in
>>>
>>>
>>> http://stackoverflow.com/questions/16200815/summarize-a-data-table-with-unreliable-data
>>> for both fields
>>>> fields <- c("country","language")
>>>
>>> here is what I tried so far:
>>>
>>> --8<---------------cut here---------------start------------->8---
>>> dt[, .N, .SDcols=fields, by=eval(list("user",fields))]
>>> Error in `[.data.table`(dt, , .N, .SDcols = fields, by =
>>> eval(list("user", :
>>> The items in the 'by' or 'keyby' list are length (1,2). Each must
>>> be same length as rows in x or number of rows returned by i (10).
>>> Calls: [ -> [.data.table
>>> --8<---------------cut here---------------end--------------->8---
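
The error above reports list items of length 1 and 2: each item of a by list is expected to have one element per row of the table. A minimal sketch of the flat character-vector form instead (the same form as by=c("user",fields) in the reply quoted at the top of this message), rebuilding the dt from this thread. The name counts is only an illustration.

```r
library(data.table)

# Same example table as earlier in this thread.
dt <- data.table(user = c(rep(4, 5), rep(3, 5)),
                 behavior = c(rep(FALSE, 5), rep(TRUE, 5)),
                 country = c(rep(1, 4), rep(2, 6)),
                 language = c(rep(6, 6), rep(5, 4)),
                 event = 1:10,
                 key = c("user", "country", "language"))
fields <- c("country", "language")

# One flat character vector of column names, not a list of vectors.
counts <- dt[, .N, by = c("user", fields)]
print(counts)
```

Note that this groups by every combination of user, country and language, so user is not unique in the result.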
>>>
>>> the idea is to do something like
>>>
>>> --8<---------------cut here---------------start------------->8---
>>>> dt.out <- dt[, .N, by=list(user,country)][,
>>>> list(country[which.max(N)], max(N)/sum(N)), by=user]
>>>> setnames(dt.out, c("V1", "V2"), paste0("country",c(".name",
>>>> ".support")))
>>>> users <- users[dt.out]
>>> user behavior country.name country.support
>>> 1: 3 TRUE 2 1.0
>>> 2: 4 FALSE 1 0.8
>>> --8<---------------cut here---------------end--------------->8---
>>>
>>> except that I do not want to have the literal "country" and
>>> "language"
>>> and that I am sure there is a way to avoid copying users in
>>>> users <- users[dt.out]
>>> by a ":=" trick.
>>>
>>> Thanks.
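
One possible way to drop the literal "country" and "language" is to loop over fields and build the new column names with paste0. The sketch below is an assumption about what is wanted, not an answer confirmed in this thread. It uses get() to look up each field inside j, and an update join with on= (available in data.table releases newer than this thread) so that := adds the columns to users by reference rather than copying it. The names counts, best and newcols are illustrative.

```r
library(data.table)

# Same example table as earlier in this thread.
dt <- data.table(user = c(rep(4, 5), rep(3, 5)),
                 behavior = c(rep(FALSE, 5), rep(TRUE, 5)),
                 country = c(rep(1, 4), rep(2, 6)),
                 language = c(rep(6, 6), rep(5, 4)),
                 event = 1:10,
                 key = c("user", "country", "language"))
fields <- c("country", "language")

# One row per user; naming the j column avoids setnames afterwards.
users <- dt[, list(behavior = any(behavior)), by = user]

for (f in fields) {
  # Count rows per (user, value of field f).
  counts <- dt[, .N, by = c("user", f)]
  # Most frequent value per user and its share of that user's rows.
  best <- counts[, list(name = get(f)[which.max(N)],
                        support = max(N) / sum(N)),
                 by = user]
  # Update join: add the two new columns to users in place.
  newcols <- paste0(f, c(".name", ".support"))
  users[best, on = "user", (newcols) := list(i.name, i.support)]
}
print(users)
```

For the country field this gives the same name and support values as the users[dt.out] output shown above, and the language columns are added the same way.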
>>>
>>>> * Matthew Dowle <zqbjyr at zqbjyr.cyhf.pbz> [2013-04-24 21:54:17 +0100]:
>>>>
>>>> where ... is eval(myid)
>>>> iigc
>>>>> Or:
>>>>> DT[,lapply(.SD,sum),by=...,.SDcols=myvars]
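
A minimal sketch of the .SDcols form on a made-up table. DT, its columns, myid and myvars are hypothetical illustration values, not data from this thread. by is given eval(myid) as described in the message, with myid assumed to be a character vector of grouping column names.

```r
library(data.table)

# Hypothetical table: two grouping columns and two value columns.
DT <- data.table(g1 = c("a", "a", "b", "b"),
                 g2 = c(1, 1, 1, 2),
                 x = 1:4,
                 y = c(10, 20, 30, 40))
myid <- c("g1", "g2")   # grouping columns, held in a variable
myvars <- c("x", "y")   # columns to summarise

# .SD holds only the columns named in .SDcols; sum each one per group.
res <- DT[, lapply(.SD, sum), by = eval(myid), .SDcols = myvars]
print(res)
```

Neither the grouping columns nor the summed columns are written literally in the call, so the same line can be reused with different contents of myid and myvars.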