[datatable-help] select * and getting the full sub data.table/frame

Matthew Dowle mdowle at mdowle.plus.com
Thu Jan 17 18:07:48 CET 2013


Akhil,

Kind of, but defining:

my.func <- function (d) {
     sum(sqrt(d[["y"]]))
}

followed by

x.dt[ , my.func(.SD), by=x]

isn't very data.table'ish. In fact the
advice is to avoid .SD if possible, for speed.

We'd forget my.func, and just do:

x.dt[, sum(sqrt(y)), by=x]

That is how we recommend it be used, and it
allows data.table to optimize the query (which
use of .SD may prevent).
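
As a minimal sketch of the comparison above (assuming the same toy
data as in the quoted message below, rebuilt here so the snippet runs
on its own; the names ans.sd and ans.direct are only illustrative),
the two forms can be checked against each other:

```r
library(data.table)

# Toy data: two groups "a" and "b", y = 1..10 (assumed, mirrors the quoted example)
x.dt <- data.table(x = rep(c("a", "b"), times = 5), y = 1:10)

# Form 1: pass the whole per-group subset (.SD) to a function
my.func <- function(d) sum(sqrt(d[["y"]]))
ans.sd <- x.dt[, my.func(.SD), by = x]

# Form 2: write the column expression directly in j
ans.direct <- x.dt[, sum(sqrt(y)), by = x]

# Both forms should give the same grouped result
stopifnot(isTRUE(all.equal(ans.sd, ans.direct)))
print(ans.direct)
```

Both calls return one row per group with the result in column V1; only
the second names the column it needs directly in j.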

David - have you read the introduction vignette and have
you worked through example(data.table) at the prompt?

Matthew


On 17.01.2013 16:53, Akhil Behl wrote:
> If I am not wrong, you are looking for `.SD'. In fact you can put in
> the exact function you were throwing at ddply earlier. There are other
> special names like .SD that you can find in the data.table FAQs.
>
> Let's see:
> R> require(plyr)
> Loading required package: plyr
>
> R> require(data.table)
> Loading required package: data.table
> data.table 1.8.7  For help type: help("data.table")
>
> R> x.df <- data.frame(x=letters[1:2], y=1:10)
> R> x.dt <- data.table(x.df)
> R>
> R> my.func <- function (d) { # Define a function on the subset
> + sum(sqrt(d[["y"]]))
> + }
> R>
> R> # The plyr way:
> R> ddply(x.df, "x", my.func) -> ans.plyr
> R>
> R> # The data.table way:
> R> x.dt[ , my.func(.SD), by=x] -> ans.dt
> R>
> R> ans.plyr
>   x       V1
> 1 a 10.61387
> 2 b 11.85441
>
> R> ans.dt
>    x       V1
> 1: a 10.61387
> 2: b 11.85441
>
> For more help, try this on an R prompt:
>
> R> vignette('datatable-faq')
>
> --
> ASB.
>
> On Thu, Jan 17, 2013 at 9:49 PM, David Bellot <david.bellot at gmail.com> wrote:
>> Hi,
>>
>> I've been looking all around the web without a clear answer to this
>> trivial problem. I'm sure I'm not looking where I should:
>>
>> In fact, I want to replace my use of ddply from the plyr package with
>> data.table. One of my main uses is to group a big data.frame by a group
>> of variables and do something on each sub data.frame:
>>
>> ddply( my_df, my_grouping_var, function (d) { do something with d } )
>> ----> d is a data.frame again
>>
>> and it's slow on big data.frames.
>>
>>
>> However, I don't really understand how to redo the same thing with a
>> data.table. Basically, if "j" in a data.table is equivalent to the
>> select clause in SQL, then how do I do SELECT * FROM etc...
>>
>> I want to be able to pass a function like in ddply that will receive
>> not only a few columns but the full subset that is selected by the
>> "by" clause.
>>
>> Thanks...
>> Best,
>> David
>>
>> _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org
>> 
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

