[datatable-help] Idea/feature request

Mon Jun 27 14:03:50 CEST 2011

For some reason I am not able to install the latest version, so I cannot 
test it right now. Anyway, it looks great. Thanks!
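
For reference, a minimal untested sketch of how the grouping example quoted below might be written with the new .BY symbol. It assumes, as the change note describes, that .BY is a list holding one length-1 element per 'by' variable; since sum() does not accept a list, the sketch flattens it with unlist() first. The names dt and groupCols are taken from the quoted example.

```r
library(data.table)

# Example table and grouping columns from the quoted mail
dt <- data.table(x = c(0, 1, 0, 1), y = c(1, 0, 1, 0))
groupCols <- c("x", "y")

# .BY is assumed to be a list with one length-1 element per grouping
# column, so it is flattened with unlist() before being summed
res <- dt[, sum(unlist(.BY)), by = groupCols]
print(res)
```

This avoids building the expression with parse() and sprintf(), because the grouping column names no longer need to appear inside j.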

Andreas

Matthew Dowle wrote:
> Andreas, Steve,
>
> Committed. Please test and confirm whether it satisfies all needs.
>
> o    A new symbol .BY is available to j, containing 1 row
>      of the current 'by' variables, type list. 'by' variables
>      may also be used by name and they are now length 1, too.
>      This implements FR#1313.
>      For example :
>           DT[,sum(x)*.BY[[1]],by=y]
>           DT[,sum(x)*.BY[[1]],by=eval(byexp)]
>           DT[,sapply(.SD,sum)*y,by=y]
>           DT[,sapply(.SD,sum)*.BY[[2]],by=list(y,z)]
>
> Matthew
>
>
>
> On Wed, 2011-05-11 at 10:24 +0200, Andreas Borg wrote:
>   
>> Hi Steve,
>>
>>     
>>> Now that you've brought this back up, what do you think you would
>>> prefer? For example, using my (admittedly contrived) original example:
>>>
>>> result <- some.big.data.table[, by=list(colA, colB), {
>>>  ## Sometimes I want to know what the current values of
>>>  ## colA and colB are in here to get some more info. Maybe
>>>  ## we can have .BY:
>>>
>>>  xref <- more.data[J(.BY[1], .BY[2]), mult='all'] ## or something
>>>  ## ...
>>> }]
>>>
>>> Should it be `J(.BY[1], .BY[2])` or is something like `J(colA, colB)`
>>> more natural, you think?
>>>
>>>   
>>>       
>> 'J(colA, colB)' is perfect if you know the column names in advance. This 
>> is not true in my case. I created a minimal example for a possible 
>> application for a '.BY' construct:
>>
>>  > dt <- data.table(x=c(0,1,0,1), y=c(1,0,1,0))
>>  > dt
>>      x y
>> [1,] 0 1
>> [2,] 1 0
>> [3,] 0 1
>> [4,] 1 0
>>
>>  From this table, I want the row sum for each group, i.e. "select x + y 
>> from dt group by x, y" in SQL. This would be:
>>
>>  > setkey(dt, x, y)
>>  > dt[,sum(x[1], y[1]), by=list(x,y)]
>>      x y V1
>> [1,] 0 1  1
>> [2,] 1 0  1
>>
>> But what if dt can have an arbitrary number of (grouping) columns with 
>> arbitrary names? If the grouping columns are given as
>>
>> groupCols <- c("x", "y")
>>
>> then the following is possible:
>>
>>  > expr <- parse(text = sprintf("sum(%s)", paste(groupCols, "[1]", 
>> sep="", collapse=", ")))
>>  > dt[,eval(expr), by=groupCols]
>>      x y V1
>> [1,] 0 1  1
>> [2,] 1 0  1
>>
>> Now, this is certainly uglier than
>>
>>  > dt[, sum(.BY), by = groupCols]
>>
>> My actual application is that I apply decision tree models (rpart) to a 
>> large number of binary patterns. In order to save computation time, I 
>> classify each distinct pattern only once. So what I basically do is to 
>> group by all attributes and apply the model once to each group.
>>
>> Andreas
>>
>>     
>
>
>
>   

-- 
Andreas Borg
Medizinische Informatik

UNIVERSITÄTSMEDIZIN
der Johannes Gutenberg-Universität
Institut für Medizinische Biometrie, Epidemiologie und Informatik
Obere Zahlbacher Straße 69, 55131 Mainz
www.imbei.uni-mainz.de

Phone +49 (0) 6131 175062
Email: borg at imbei.uni-mainz.de

This e-mail contains confidential and/or legally protected information. If you are not the
intended recipient or have received this e-mail in error, please notify the sender immediately
and delete this e-mail. Unauthorized copying and unauthorized distribution of this e-mail and
the information it contains are not permitted.