[datatable-help] select * and getting the full sub data.table/frame

Akhil Behl akhil at igidr.ac.in
Thu Jan 17 19:38:33 CET 2013


That .I example is quite interesting. May I ask:

Suppose I wanted to get the 5 row numbers for each subset (say 5 of
them) and save them in a list instead of a data.table (kind of like
dlply) to be able to use the lapply idiom later on. Is there a way to
do that?

Thanks.

--
ASB.

PS: Is this question hijacking the thread? Sorry, if it is.
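
A minimal sketch of one way this might be done, assuming a toy table DT with a grouping column colA and a value column val (the data and the names idx, row_list and top_list are hypothetical, chosen only to match the example quoted below). It collects the .I row numbers per group and then uses base R's split() to turn them into a list:

```r
library(data.table)

# Hypothetical toy data: colA is the grouping column, as in the quoted example.
DT <- data.table(colA = rep(c("a", "b", "c"), times = c(6, 3, 8)),
                 val  = 1:17)

# .I holds row numbers into DT; head(.I, 5) keeps at most 5 of them per group.
idx <- DT[, .(rows = head(.I, 5)), by = colA]

# split() turns the long result into a named list: one integer vector per group.
row_list <- split(idx$rows, idx$colA)

# The lapply idiom can then be used on the list, e.g. to pull each group's rows.
top_list <- lapply(row_list, function(w) DT[w])
```

Here row_list is an ordinary named list, so it can be passed to lapply() like the output of dlply; groups with fewer than 5 rows simply contribute fewer row numbers.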

On Fri, Jan 18, 2013 at 12:01 AM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>
>
> Glad all is clear.  Given the follow-up head() examples, yes, .SD is there
> for just that purpose. Something like this:
>
>     DT[, head(.SD,2), by=colA]
>
> is idiomatic in data.table.  That's like a "select top 2 * from" in SQL, but
> by group.
>
> Also things like :
>
>     DT[, .SD[1:2], by=colA]    # similar, provided all groups have at least 2 rows
>
>     DT[, .SD[-1], by=colA]    # all but the first
>
>     DT[, someFunctionThatWantsADataFrame(..., data=.SD), by=colA]
>
>
>
> It's when you don't use all the data in .SD that it's wasteful to use it
> (since data.table needs to populate it for each group before running j).
>
> So for the row-subsetting .SD examples above, something like this can
> be a lot faster:
>
>     w = DT[,head(.I,5),by=colA][[2]]     # top 5 row numbers of each group
>
>     DT[w]   # select those rows
>
> is the same but much faster than
>
>     DT[, head(.SD,5), by=colA]
>
> especially if each of the groups has a lot more rows than 5.
>
> Hope that adds some colour.
>
>
>
> On 17.01.2013 17:33, David Bellot wrote:
>
> indeed, it makes sense now, as what is passed to the function is indeed a
> data.table and not a data.frame.
>
> Thanks guys for your help. Now I'm a convinced data.table user.
> Best,
> David
>
> On Thu, Jan 17, 2013 at 5:25 PM, Akhil Behl <akhil at igidr.ac.in> wrote:
>>
>> Hey David,
>>
>> I thought your problem may have been a typo, but I realized that it is
>> in fact a subtle difference between the way data.table and data.frame
>> work.
>>
>> One must provide unquoted names in the `j' expression for a
>> data.table, i.e. one can say x.dt[ , y] but not x.dt[ , "y"] (which
>> will evaluate to just "y" and hence the error).
>>
>> There are tricks around it like using with=FALSE, or using the
>> data.frame notation x.dt[["y"]]. But once again, you will find such
>> examples and explanations of idiomatic data.table expressions in the
>> vignettes.
>>
>> --
>> ASB.
>>
>> On Thu, Jan 17, 2013 at 10:42 PM, David Bellot <david.bellot at gmail.com>
>> wrote:
>> > Hi Matthew,
>> >
>> > I did indeed read the introduction, but I wasn't sure about the way to
>> > write it. Hence my question.
>> >
>> > In fact, I would agree if the function were just sum(sqrt(y)), but in my
>> > case I would like to do something like
>> >
>> > f <- function(d) ...
>> >
>> > It's a small example for the sake of simplicity, just to illustrate that
>> > I really want to have access to the full sub data.frame (the d variable)
>> > and not just one column.
>> >
>> > Best,
>> > David
>> >
>> > On Thu, Jan 17, 2013 at 5:07 PM, Matthew Dowle <mdowle at mdowle.plus.com>
>> > wrote:
>> >>
>> >>
>> >> Akhil,
>> >>
>> >> Kind of, but defining :
>> >>
>> >> my.func <- function(d) {
>> >>     sum(sqrt(d[["y"]]))
>> >> }
>> >>
>> >> followed by
>> >>
>> >> x.dt[ , my.func(.SD), by=x]
>> >>
>> >> isn't very data.table'ish. In fact the
>> >> advice is to avoid .SD if possible, for speed.
>> >>
>> >> We'd forget my.func, and just do :
>> >>
>> >> x.dt[, sum(sqrt(y)), by=x]
>> >>
>> >> That is how we recommend it to be used, and
>> >> allows data.table to optimize the query (which
>> >> use of .SD may prevent).
>> >>
>> >> David - have you read the introduction vignette and have
>> >> you worked through example(data.table) at the prompt?
>> >>
>> >> Matthew
>> >>
>> >>
>> >>
>> >> On 17.01.2013 16:53, Akhil Behl wrote:
>> >>>
>> >>> If I am not wrong, you are looking for `.SD'. In fact you can put in
>> >>> the exact function you were throwing at ddply earlier. There are other
>> >>> special names like .SD that you can find in the data.table FAQs.
>> >>>
>> >>> Let's see:
>> >>> R> require(plyr)
>> >>> Loading required package: plyr
>> >>>
>> >>> R> require(data.table)
>> >>> Loading required package: data.table
>> >>> data.table 1.8.7  For help type: help("data.table")
>> >>>
>> >>> R> x.df <- ...
>> >>> R> x.dt <- ...
>> >>> R>
>> >>> R> my.func <- function(d) {
>> >>> + sum(sqrt(d[["y"]]))
>> >>> + }
>> >>> R>
>> >>> R> # The plyr way:
>> >>> R> ddply(x.df, "x", my.func) -> ans.plyr
>> >>> R>
>> >>> R> # The data.table way:
>> >>> R> x.dt[ , my.func(.SD), by=x] -> ans.dt
>> >>> R>
>> >>> R> ans.plyr
>> >>>   x       V1
>> >>> 1 a 10.61387
>> >>> 2 b 11.85441
>> >>>
>> >>> R> ans.dt
>> >>>    x       V1
>> >>> 1: a 10.61387
>> >>> 2: b 11.85441
>> >>>
>> >>> For more help, try this on an R prompt:
>> >>>
>> >>> R> vignette('datatable-faq')
>> >>>
>> >>> --
>> >>> ASB.
>> >>>
>> >>> On Thu, Jan 17, 2013 at 9:49 PM, David Bellot <david.bellot at gmail.com>
>> >>> wrote:
>> >>>>
>> >>>> Hi,
>> >>>>
>> >>>> I've been looking all around the web without a clear answer to this
>> >>>> trivial problem. I'm sure I'm not looking where I should:
>> >>>>
>> >>>> in fact, I want to replace my use of ddply from the plyr package by
>> >>>> data.table. One of my main uses is to group a big data.frame by a
>> >>>> group of variables and do something on this sub data.frame:
>> >>>>
>> >>>> ddply( my_df, my_grouping_var, function (d) { do something with d } )
>> >>>> ----> d is a data.frame again
>> >>>>
>> >>>> and it's slow on big data.frame.
>> >>>>
>> >>>>
>> >>>> However, I don't really understand how to redo the same thing with a
>> >>>> data.table. Basically, if "j" in a data.table is equivalent to the
>> >>>> select clause in SQL, then how do I do SELECT * FROM etc...
>> >>>>
>> >>>> I want to be able to pass a function, as in ddply, that will receive
>> >>>> not only a few columns but the full subset that is selected by the
>> >>>> "by" clause.
>> >>>>
>> >>>> Thanks...
>> >>>> Best,
>> >>>> David
>> >>>>
>> >>>> _______________________________________________
>> >>>> datatable-help mailing list
>> >>>> datatable-help at lists.r-forge.r-project.org
>> >>>>
>> >>>>
>> >>>>
>> >>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>> >>>
>> >
>> >
>
>
>
>

