[datatable-help] select * and getting the full sub data.table/frame

Matthew Dowle mdowle at mdowle.plus.com
Thu Jan 17 19:31:34 CET 2013


 

Glad all clear. Given the follow-up head() examples, yes, .SD is there
for just that purpose. Something like this:

  DT[, head(.SD, 2), by=colA]

is idiomatic in data.table. That's like a "select top 2 * from" in SQL,
but by group.

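To make that concrete, here is a minimal sketch on a made-up table. The table DT and its columns colA and val are assumptions for illustration only, not data from this thread.

```r
library(data.table)

# Hypothetical toy table: 'colA' is the grouping column, 'val' is payload.
DT <- data.table(colA = c("a", "a", "a", "b", "b", "c"),
                 val  = 1:6)

# First two rows of every group; a group with fewer than two rows
# simply returns the rows it has.
top2 <- DT[, head(.SD, 2), by = colA]
print(top2)
```
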
Also things like:

  DT[, .SD[1:2], by=colA]   # similar, provided all groups have at least 2 rows
  DT[, .SD[-1], by=colA]    # all but the first
  DT[, someFunctionThatWantsADataFrame(..., data=.SD), by=colA]
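
The "at least 2 rows" caveat can be seen in a small sketch. The table and column names are again made up for the example, and group "b" deliberately has a single row.

```r
library(data.table)

# Hypothetical table: group "a" has three rows, group "b" has one.
DT <- data.table(colA = c("a", "a", "a", "b"),
                 val  = c(10, 20, 30, 40))

# head() returns the one row "b" has; .SD[1:2] asks for a second row
# that does not exist and gets an NA row for it.
by_head  <- DT[, head(.SD, 2), by = colA]
by_index <- DT[, .SD[1:2],     by = colA]

# Dropping the first row of each group leaves nothing for a one-row group.
all_but_first <- DT[, .SD[-1], by = colA]

print(by_head)
print(by_index)
print(all_but_first)
```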

It's when you don't use all the data in .SD that it's wasteful to use it
(since data.table needs to populate it for each group before running j).
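
As a rough illustration of that point, the sketch below computes the same per-group result twice, once through .SD and once by naming the column directly. The table x.dt and its columns x, y and z are assumptions for the example; z stands in for a column the computation never needs.

```r
library(data.table)

# Hypothetical table: 'x' groups, 'y' is used, 'z' is never touched.
x.dt <- data.table(x = rep(c("a", "b"), each = 3),
                   y = c(1, 4, 9, 16, 25, 36),
                   z = rnorm(6))

# Via .SD: all non-grouping columns (including the unused 'z') are
# part of .SD for each group.
via_sd <- x.dt[, sum(sqrt(.SD[["y"]])), by = x]

# Naming the column directly: j only refers to 'y'.
direct <- x.dt[, sum(sqrt(y)), by = x]

print(direct)
```

Both forms give the same answer here; the second just avoids asking for columns it never uses. Where .SD really is wanted, the .SDcols argument can be used to restrict which columns it holds.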

So in the subset-of-rows-of-.SD examples above, something like this can
be a lot faster:

  w = DT[, head(.I, 5), by=colA][[2]]   # top 5 row numbers of each group
  DT[w]                                 # select those rows

is the same but much faster than

  DT[, head(.SD, 5), by=colA]

especially if each of the groups has a lot more rows than 5.
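
A small sketch to check that the two forms return the same rows. The table is made up, and the grouping column is placed first so that both results also share the same column order. No timings are shown; wrapping each call in system.time() on a large table would be one way to compare them.

```r
library(data.table)

set.seed(1)
# Hypothetical table with three groups of a few hundred rows each.
DT <- data.table(colA = sample(c("a", "b", "c"), 1000, replace = TRUE),
                 val  = runif(1000))

# Row numbers of the first 5 rows of each group, then one subset by row.
w <- DT[, head(.I, 5), by = colA][[2]]
via_I <- DT[w]

# The .SD form, which materialises each group's data before head() runs.
via_SD <- DT[, head(.SD, 5), by = colA]

# Same rows in the same order.
stopifnot(identical(via_I$colA, via_SD$colA),
          identical(via_I$val,  via_SD$val))
```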

Hope that adds some colour.


On 17.01.2013 17:33, David Bellot wrote: 

> indeed, it makes sense now, as what is passed to the function is indeed a
> data.table and not a data.frame.
>
> Thanks guys for your help. Now I'm a convinced data.table user.
> Best,
> David
> 
> On Thu, Jan 17, 2013 at 5:25 PM, Akhil Behl <akhil at igidr.ac.in [8]> wrote:
> 
>> Hey David,
>> 
>> I thought your problem may have been a typo, but I realized that it is
>> in fact a subtle difference between the way data.table and data.frame
>> work.
>>
>> One must provide unquoted names in the `j' expression for a
>> data.table, i.e. one can say x.dt[ , y] but not x.dt[ , "y"] (which
>> will evaluate to just "y" and hence the error).
>>
>> There are tricks around it like using with=FALSE, or using the
>> data.frame notation x.dt[["y"]]. But once again, you will find such
>> examples and explanations of idiomatic data.table expressions in the
>> vignettes.
>>
>> --
>> ASB.
>> 
>> On Thu, Jan 17, 2013 at 10:42 PM, David Bellot <david.bellot at gmail.com [1]> wrote:
>> > Hi Matthew,
>> >
>> > I read indeed the introduction but I wasn't sure about the way to write it.
>> > Hence my question.
>> >
>> > In fact, I do agree if the function would sum(sqrt(y)), but in my case, I
>> > would like to do something like
>> >
>> > f >
>> > It's a small example for the sake of simplicity, just to illustrate that I
>> > really want to have access to the full sub data.frame (the d variable) and
>> > not just one column.
>> >
>> > Best,
>> > David
>> >
>> > On Thu, Jan 17, 2013 at 5:07 PM, Matthew Dowle <mdowle at mdowle.plus.com [2]>
>> > wrote:
>> >>
>> >> Akhil,
>> >>
>> >> Kind of, but defining :
>> >>
>> >> my.func <- function(d) {
>> >>     sum(sqrt(d[["y"]]))
>> >> }
>> >>
>> >> followed by
>> >>
>> >> x.dt[ , my.func(.SD), by=x]
>> >>
>> >> isn't very data.table'ish. In fact the
>> >> advice is to avoid .SD if possible, for speed.
>> >>
>> >> We'd forget my.func, and just do :
>> >>
>> >> x.dt[, sum(sqrt(y)), by=x]
>> >>
>> >> That is how we recommend it to be used, and
>> >> allows data.table to optimize the query (which
>> >> use of .SD may prevent).
>> >>
>> >> David - have you read the introduction vignette and have
>> >> you worked through example(data.table) at the prompt?
>> >>
>> >> Matthew
>> >>
>> >>
>> >>
>> >> On 17.01.2013 16:53, Akhil Behl wrote:
>> >>>
>> >>> If I am not wrong, you are looking for `.SD'. In fact you can put in
>> >>> the exact function you were throwing at ddply earlier. There are other
>> >>> special names like .SD that you can find in the data.table FAQs.
>> >>>
>> >>> Let's see:
>> >>> R> require(plyr)
>> >>> Loading required package: plyr
>> >>>
>> >>> R> require(data.table)
>> >>> Loading required package: data.table
>> >>> data.table 1.8.7 For help type: help("data.table")
>> >>>
>> >>> R> x.df
>> >>> R> x.dt
>> >>> R>
>> >>> R> my.func <- function(d) {
>> >>> +     sum(sqrt(d[["y"]]))
>> >>> + }
>> >>> R>
>> >>> R> # The plyr way:
>> >>> R> ddply(x.df, "x", my.func) -> ans.plyr
>> >>> R>
>> >>> R> # The data.table way:
>> >>> R> x.dt[ , my.func(.SD), by=x] -> ans.dt
>> >>> R>
>> >>> R> ans.plyr
>> >>>   x       V1
>> >>> 1 a 10.61387
>> >>> 2 b 11.85441
>> >>>
>> >>> R> ans.dt
>> >>>    x       V1
>> >>> 1: a 10.61387
>> >>> 2: b 11.85441
>> >>>
>> >>> For more help, try this on an R prompt:
>> >>>
>> >>> R> vignette('datatable-faq')
>> >>>
>> >>> --
>> >>> ASB.
>> >>>
>> >>> On Thu, Jan 17, 2013 at 9:49 PM, David Bellot <david.bellot at gmail.com [3]>
>> >>> wrote:
>> >>>>
>> >>>> Hi,
>> >>>>
>> >>>> I've been looking all around the web without a clear answer to this
>> >>>> trivial problem. I'm sure I'm not looking where I should:
>> >>>>
>> >>>> in fact, I want to replace my use of ddply from the plyr package by
>> >>>> data.table. One of my main use is to group a big data.frame by a group
>> >>>> of variable and do something on this sub data.frame:
>> >>>>
>> >>>> ddply( my_df, my_grouping_var, function (d) { do something with d } )
>> >>>> ----> d is a data.frame again
>> >>>>
>> >>>> and it's slow on big data.frame.
>> >>>>
>> >>>> However, I don't really understand how to redo the same thing with a
>> >>>> data.table. Basically if "j" in a data.table is equivalent to the select
>> >>>> clause in SQL, then how do I do SELECT * FROM etc...
>> >>>>
>> >>>> I want to be able to pass a function like in ddply that will receive not
>> >>>> only a few columns but the full subset that is selected by the "by"
>> >>>> clause.
>> >>>>
>> >>>> Thanks...
>> >>>> Best,
>> >>>> David
>> >>>>
>> >>>>
>> >>>> _______________________________________________
>> >>>> datatable-help mailing list
>> >>>> datatable-help at lists.r-forge.r-project.org [4]
>> >>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [5]
>> >
>> >

 

Links:
------
[1] mailto:david.bellot at gmail.com
[2] mailto:mdowle at mdowle.plus.com
[3] mailto:david.bellot at gmail.com
[4] mailto:datatable-help at lists.r-forge.r-project.org
[5] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
[6] mailto:datatable-help at lists.r-forge.r-project.org
[7] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
[8] mailto:akhil at igidr.ac.in