[datatable-help] Idea/feature request

Steve Lianoglou mailinglist.honeypot at gmail.com
Wed Jun 22 00:01:00 CEST 2011


Well done, Matthew!

Will try to test it soon ...
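
Roughly what I'd expect to check first is sketched below, reusing Andreas's toy table from further down. This is only a sketch under assumptions: it assumes a data.table build that includes the new .BY symbol, and since .BY is described as a list, I'm assuming it needs an unlist() before sum() will accept it.

```r
library(data.table)

# Andreas's toy table, with the grouping columns given by name
dt <- data.table(x = c(0, 1, 0, 1), y = c(1, 0, 1, 0))
groupCols <- c("x", "y")

# .BY holds the current group's 'by' values as a list, so flatten it
# before summing (assumption: sum() on a bare list is an error in R)
res <- dt[, sum(unlist(.BY)), by = groupCols]

# The 'by' variables used by name should be length 1 inside j
res2 <- dt[, x + y, by = list(x, y)]

print(res)
print(res2)
```

If both give one row per distinct (x, y) pair, that would cover the "arbitrary grouping columns" case as well as the by-name case.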

Thanks,
-steve

On Tue, Jun 21, 2011 at 3:40 PM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
> Andreas, Steve,
>
> Committed. Please test and confirm if it satisfies all needs ok?
>
> o    A new symbol .BY is available to j, containing 1 row
>     of the current 'by' variables, type list. 'by' variables
>     may also be used by name and they are now length 1, too.
>     This implements FR#1313.
>     For example :
>          DT[,sum(x)*.BY[[1]],by=y]
>          DT[,sum(x)*.BY[[1]],by=eval(byexp)]
>          DT[,sapply(.SD,sum)*y,by=y]
>          DT[,sapply(.SD,sum)*.BY[[2]],by=list(y,z)]
>
> Matthew
>
>
>
> On Wed, 2011-05-11 at 10:24 +0200, Andreas Borg wrote:
>> Hi Steve,
>>
>> > Now that you've brought this back up, what do you think you would
>> > prefer? For example, using my (admittedly contrived) original example:
>> >
>> > result <- some.big.data.table[, by=list(colA, colB), {
>> >  ## Sometimes I want to know what the current values of
>> >  ## colA and colB are in here to get some more info. Maybe
>> >  ## we can have .BY:
>> >
>> >  xref <- more.data[J(.BY[1], .BY[2]), mult='all'] ## or something
>> >  ## ...
>> > }]
>> >
>> > Should it be `J(.BY[1], .BY[2])` or is something like `J(colA, colB)`
>> > more natural, you think?
>> >
>> >
>> 'J(colA, colB)' is perfect if you know the column names in advance. This
>> is not true in my case. I created a minimal example for a possible
>> application for a '.BY' construct:
>>
>>  > dt <- data.table(x=c(0,1,0,1), y=c(1,0,1,0))
>>  > dt
>>      x y
>> [1,] 0 1
>> [2,] 1 0
>> [3,] 0 1
>> [4,] 1 0
>>
>>  From this table, I want the row sum for each group, i.e. "select x + y
>> from dt group by x, y" in SQL. This would be:
>>
>>  > setkey(dt, x, y)
>>  > dt[,sum(x[1], y[1]), by=list(x,y)]
>>      x y V1
>> [1,] 0 1  1
>> [2,] 1 0  1
>>
>> But what if dt can have an arbitrary number of (grouping) columns with
>> arbitrary names? If the grouping columns are given as
>>
>> groupCols <- c("x", "y")
>>
>> , the following is possible:
>>
>>  > expr <- parse(text = sprintf("sum(%s)", paste(groupCols, "[1]",
>> sep="", collapse=", ")))
>>  > dt[,eval(expr), by=groupCols]
>>      x y V1
>> [1,] 0 1  1
>> [2,] 1 0  1
>>
>> Now, this is certainly uglier than
>>
>>  > dt[, sum(.BY), by = groupCols]
>>
>> My actual application is that I apply decision tree models (rpart) to a
>> large number of binary patterns. In order to save computation time, I
>> classify each distinct pattern only once. So what I basically do is to
>> group by all attributes and apply the model once to each group.
>>
>> Andreas
>>
>
>
>



-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

