[datatable-help] variable column names
Sam Steingold
sds at gnu.org
Fri Apr 26 18:26:06 CEST 2013
> * Matthew Dowle <zqbjyr at zqbjyr.cyhf.pbz> [2013-04-26 17:00:27 +0100]:
>
>> dt[, sum(behavior) > 0, by=user]
>    user    V1
> 1:    3  TRUE
> 2:    4 FALSE
>> dt[, any(behavior), by=user] # same
>    user    V1
> 1:    3  TRUE
> 2:    4 FALSE
>> dt[, list(behavior = any(behavior)), by=user]  # how to do the same without setnames afterwards
>    user behavior
> 1:    3     TRUE
> 2:    4    FALSE
>> fields <- c("country","language")
>> dt[, list(behavior = any(behavior)), by=c("user",fields)]  # by may be a character vector of column names
>    user country language behavior
> 1:    3       2        5     TRUE
> 2:    3       2        6     TRUE
> 3:    4       1        6    FALSE
> 4:    4       2        6    FALSE
Oh no, this is _not_ what I want!
user should be unique, and each of the fields should be summarized as
described in the SO question (see the code below).
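
To make the target concrete, here is a minimal sketch of the per-field
summary with the field names kept in a variable. It assumes a hypothetical
helper summarize_field (not a data.table function) and the on= join
argument found in recent data.table versions; it is one possible approach,
not necessarily the idiomatic one.

```r
library(data.table)

dt <- data.table(user     = c(rep(4, 5), rep(3, 5)),
                 behavior = c(rep(FALSE, 5), rep(TRUE, 5)),
                 country  = c(rep(1, 4), rep(2, 6)),
                 language = c(rep(6, 6), rep(5, 4)),
                 event    = 1:10,
                 key      = c("user", "country", "language"))
fields <- c("country", "language")

# Hypothetical helper (an assumption, not part of data.table): for one field,
# return per user the most frequent value and the share of that user's rows
# which carry it.
summarize_field <- function(dt, field) {
  # by accepts a character vector, so the field name can stay in a variable
  counts <- dt[, .N, by = c("user", field)]
  # get(field) looks the column up by its name inside j
  out <- counts[, list(name    = get(field)[which.max(N)],
                       support = max(N) / sum(N)),
                by = user]
  setnames(out, c("name", "support"), paste0(field, c(".name", ".support")))
  out
}

# One row per user, then one join per field (this still copies users).
users <- dt[, list(behavior = any(behavior)), by = user]
for (f in fields) users <- users[summarize_field(dt, f), on = "user"]
print(users)
```

This keeps one row per user and adds a <field>.name and <field>.support
column for every entry of fields; it does not address the ":=" part of the
question.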
>
>
> On 26.04.2013 16:45, Sam Steingold wrote:
>> I am still missing something:
>>
>> --8<---------------cut here---------------start------------->8---
>>> dt <- data.table(user=c(rep(4, 5),rep(3, 5)),
>>                  behavior=c(rep(FALSE,5),rep(TRUE,5)),
>>                  country=c(rep(1,4),rep(2,6)),
>>                  language=c(rep(6,6),rep(5,4)),
>>                  event=1:10, key=c("user","country","language"))
>>> dt
>>     user behavior country language event
>>  1:    3     TRUE       2        5     7
>>  2:    3     TRUE       2        5     8
>>  3:    3     TRUE       2        5     9
>>  4:    3     TRUE       2        5    10
>>  5:    3     TRUE       2        6     6
>>  6:    4    FALSE       1        6     1
>>  7:    4    FALSE       1        6     2
>>  8:    4    FALSE       1        6     3
>>  9:    4    FALSE       1        6     4
>> 10:    4    FALSE       2        6     5
>>> users <- dt[, sum(behavior) > 0, by=user]
>> Finding groups (bysameorder=TRUE) ... done in 0secs. bysameorder=TRUE
>> and o__ is length 0
>> Detected that j uses these columns: behavior
>> Optimization is on but j left unchanged as 'sum(behavior) > 0'
>> Starting dogroups ... done dogroups in 0 secs
>>> users
>>    user    V1
>> 1:    3  TRUE
>> 2:    4 FALSE
>>> setnames(users, "V1", "behavior")
>> --8<---------------cut here---------------end--------------->8---
>>
>> Now I want to do the same thing as in
>>
>> http://stackoverflow.com/questions/16200815/summarize-a-data-table-with-unreliable-data
>> for both fields
>>> fields <- c("country","language")
>>
>> here is what I tried so far:
>>
>> --8<---------------cut here---------------start------------->8---
>> dt[, .N, .SDcols=fields, by=eval(list("user",fields))]
>> Error in `[.data.table`(dt, , .N, .SDcols = fields, by =
>> eval(list("user", :
>> The items in the 'by' or 'keyby' list are length (1,2). Each must
>> be same length as rows in x or number of rows returned by i (10).
>> Calls: [ -> [.data.table
>> --8<---------------cut here---------------end--------------->8---
>>
>> the idea is to do something like
>>
>> --8<---------------cut here---------------start------------->8---
>>> dt.out <- dt[, .N, by=list(user,country)][,
>>      list(country[which.max(N)], max(N)/sum(N)), by=user]
>>> setnames(dt.out, c("V1", "V2"),
>>      paste0("country",c(".name", ".support")))
>>> users <- users[dt.out]
>>    user behavior country.name country.support
>> 1:    3     TRUE            2             1.0
>> 2:    4    FALSE            1             0.8
>> --8<---------------cut here---------------end--------------->8---
>>
>> except that I do not want to hard-code the literal "country" and
>> "language", and I am sure there is a way to avoid copying users in
>>> users <- users[dt.out]
>> by a ":=" trick.
>>
>> Thanks.
>>
>>> * Matthew Dowle <zqbjyr at zqbjyr.cyhf.pbz> [2013-04-24 21:54:17 +0100]:
>>>
>>> where ... is eval(myid)
>>> iigc
>>>> Or:
>>>> DT[,lapply(.SD,sum),by=...,.SDcols=myvars]
--
Sam Steingold (http://sds.podval.org/) on Ubuntu 12.10 (quantal) X 11.0.11300000
http://www.childpsy.net/ http://ffii.org http://pmw.org.il
http://palestinefacts.org http://dhimmi.com http://thereligionofpeace.com
Perl: all stupidities of UNIX in one.