[datatable-help] select * and getting the full sub data.table/frame
Akhil Behl
akhil at igidr.ac.in
Thu Jan 17 18:09:37 CET 2013
Well, yes, I agree. In fact, I had meant to mention the
alternative you suggested, but it slipped my mind.
I did point him to the datatable-faq. :)
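
For reference, here is a minimal sketch of the two forms discussed in the quoted thread below, on the same toy data. The `.SDcols` variant is not mentioned in the thread; it is included as an assumption about what may help when passing `.SD` really is needed, and it has not been benchmarked here.

```r
library(data.table)

# Same toy data as in the thread: x alternates "a","b" over y = 1..10
x.dt <- data.table(x = rep(letters[1:2], times = 5), y = 1:10)

# Recommended form: refer to the columns directly in j
ans.direct <- x.dt[, sum(sqrt(y)), by = x]

# When the function needs the whole per-group subset, pass .SD
# (a data.table of the group's rows, without the grouping column)
my.func <- function(d) sum(sqrt(d[["y"]]))
ans.sd <- x.dt[, my.func(.SD), by = x]

# Assumption: .SDcols limits .SD to the columns actually used
ans.sdcols <- x.dt[, my.func(.SD), by = x, .SDcols = "y"]

print(ans.direct)
```

All three calls should give the same two-row result; the direct form is the one that leaves data.table free to optimize the query.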
On Thu, Jan 17, 2013 at 10:37 PM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>
> Akhil,
>
> Kind of, but defining :
>
> my.func <- function (d) {
>   sum(sqrt(d[["y"]]))
> }
>
> followed by
>
> x.dt[ , my.func(.SD), by=x]
>
> isn't very data.table'ish. In fact the
> advice is to avoid .SD if possible, for speed.
>
> We'd forget my.func, and just do :
>
> x.dt[, sum(sqrt(y)), by=x]
>
> That is how we recommend it to be used, and
> allows data.table to optimize the query (which
> use of .SD may prevent).
>
> David - have you read the introduction vignette and have
> you worked through example(data.table) at the prompt?
>
> Matthew
>
>
>
> On 17.01.2013 16:53, Akhil Behl wrote:
>>
>> If I am not wrong, you are looking for `.SD'. In fact you can put in
>> the exact function you were throwing at ddply earlier. There are other
>> special names like .SD that you can find in the data.table FAQs.
>>
>> Let's see:
>> R> require(plyr)
>> Loading required package: plyr
>>
>> R> require(data.table)
>> Loading required package: data.table
>> data.table 1.8.7 For help type: help("data.table")
>>
>> R> x.df <- data.frame(x=letters[1:2], y=1:10)
>> R> x.dt <- data.table(x.df)
>> R>
>> R> my.func <- function (d) { # Define a function on the subset
>> +   sum(sqrt(d[["y"]]))
>> + }
>> R>
>> R> # The plyr way:
>> R> ddply(x.df, "x", my.func) -> ans.plyr
>> R>
>> R> # The data.table way:
>> R> x.dt[ , my.func(.SD), by=x] -> ans.dt
>> R>
>> R> ans.plyr
>> x V1
>> 1 a 10.61387
>> 2 b 11.85441
>>
>> R> ans.dt
>> x V1
>> 1: a 10.61387
>> 2: b 11.85441
>>
>> For more help, try this on an R prompt:
>>
>> R> vignette('datatable-faq')
>>
>> --
>> ASB.
>>
>> On Thu, Jan 17, 2013 at 9:49 PM, David Bellot <david.bellot at gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> I've been looking all around the web without a clear answer to this
>>> trivial problem. I'm sure I'm not looking where I should:
>>>
>>> in fact, I want to replace my use of ddply from the plyr package with
>>> data.table. One of my main uses is to group a big data.frame by a group of
>>> variables and do something on this sub data.frame:
>>>
>>> ddply( my_df, my_grouping_var, function (d) { do something with d } )
>>> ----> d is a data.frame again
>>>
>>> and it's slow on big data.frames.
>>>
>>>
>>> However, I don't really understand how to redo the same thing with a
>>> data.table. Basically if "j" in a data.table is equivalent to the select
>>> clause in SQL, then how do I do SELECT * FROM etc...
>>>
>>> I want to be able to pass a function like in ddply that will receive not
>>> only a few columns but the full subset that is selected by the "by"
>>> clause.
>>>
>>> Thanks...
>>> Best,
>>> David
>>>
>>> _______________________________________________
>>> datatable-help mailing list
>>> datatable-help at lists.r-forge.r-project.org
>>>
>>>
>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>