[datatable-help] select * and getting the full sub data.table/frame

Matthew Dowle mdowle at mdowle.plus.com
Thu Jan 17 20:06:39 CET 2013


Yes use a list column, iiuc.

DT[,list( list(head(.I,5))), by=ColA]
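
To make that concrete, here is a minimal sketch on a made-up table (the data, the `val` column and the names `res`, `rows` and `sums` are assumptions for illustration, not from your data). It shows the list column and then one possible way to pull it out as a plain named list for lapply:

```r
library(data.table)

# Hypothetical data: three groups of different sizes
DT <- data.table(ColA = rep(c("a", "b", "c"), times = c(7, 3, 5)),
                 val  = 1:15)

# One row per group; V1 is a list column holding up to 5 row numbers each
res <- DT[, list(list(head(.I, 5))), by = ColA]

# Pull the list column out as a plain named list, ready for lapply()
rows <- setNames(res$V1, res$ColA)

# Example use: sum 'val' over the selected rows of each group
sums <- lapply(rows, function(i) sum(DT$val[i]))
```

Naming the list by group (setNames) is optional, but it keeps the dlply-like feel when you loop over it later.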

Perhaps more useful is returning the unique items of a column, by group,
where the length of the vector in each cell varies.
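
A small sketch of that idea, again on invented data (the column names `grp`, `item`, `items` and `n` are assumptions used only for illustration):

```r
library(data.table)

# Hypothetical data: 'item' repeats within some groups
DT <- data.table(grp  = c("a", "a", "a", "b", "b", "c"),
                 item = c("x", "y", "x", "z", "z", "w"))

# One row per group; each cell of 'items' holds that group's unique values
u <- DT[, list(items = list(unique(item))), by = grp]

# The vectors differ in length from cell to cell; count them per group
u[, n := sapply(items, length)]
```

Each cell of `items` is an ordinary character vector, so `u$items` can be handed straight to lapply as well.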

> That .I example is quite interesting. May I ask:
>
> Suppose I wanted to get the 5 row numbers for each subset (say 5 of
> them) and save them in a list instead of a data.table (kind of like
> dlply) to be able to use the lapply idiom later on. Is there a way to
> do that?
>
> Thanks.
>
> --
> ASB.
>
> PS: Is this question hijacking the thread? Sorry, if it is.
>
> On Fri, Jan 18, 2013 at 12:01 AM, Matthew Dowle <mdowle at mdowle.plus.com>
> wrote:
>>
>>
>> Glad all clear.  Given the follow up head() examples, yes, .SD is there
>> for just that purpose. Something like this :
>>
>>     DT[, head(.SD,2), by=colA]
>>
>> is idiomatic in data.table.  That's like a "select top 2 * from" in SQL,
>> but by group.
>>
>> Also things like :
>>
>>     DT[, .SD[1:2], by=colA]    # similar provided all groups have at least 2 rows
>>
>>     DT[, .SD[-1], by=colA]    # all but the first
>>
>>     DT[, someFunctionThatWantsADataFrame(..., data=.SD), by=colA]
>>
>>
>>
>> It's when you don't use all the data in .SD that it's wasteful to use it
>> (since data.table needs to populate it for each group before running j).
>>
>> So in the subset of rows of .SD examples above, something like this can
>> be a lot faster :
>>
>>     w = DT[,head(.I,5),by=colA][[2]]     # top 5 row numbers of each group
>>
>>     DT[w]   # select those rows
>>
>> is the same but much faster than
>>
>>     DT[, head(.SD,5), by=colA]
>>
>> especially if each of the groups has a lot more rows than 5.
>>
>> Hope that adds some colour.
>>
>>
>>
>> On 17.01.2013 17:33, David Bellot wrote:
>>
>> indeed, it makes sense now, as what is passed to the function is indeed a
>> data.table and not a data.frame.
>>
>> Thanks guys for your help. Now I'm a convinced data.table user.
>> Best,
>> David
>>
>> On Thu, Jan 17, 2013 at 5:25 PM, Akhil Behl <akhil at igidr.ac.in> wrote:
>>>
>>> Hey David,
>>>
>>> I thought your problem may have been a typo, but I realized that it is
>>> in fact a subtle difference between the way data.table and data.frame
>>> work.
>>>
>>> One must provide unquoted names in the `j' expression for a
>>> data.table, i.e. one can say x.dt[ , y] but not x.dt[ , "y"] (which
>>> will evaluate to just "y" and hence the error).
>>>
>>> There are tricks around it like using with=FALSE, or using the
>>> data.frame notation x.dt[["y"]]. But once again, you will find such
>>> examples and explanations of idiomatic data.table expressions in the
>>> vignettes.
>>>
>>> --
>>> ASB.
>>>
>>> On Thu, Jan 17, 2013 at 10:42 PM, David Bellot <david.bellot at gmail.com>
>>> wrote:
>>> > Hi Matthew,
>>> >
>>> > I read indeed the introduction but I wasn't sure about the way to write
>>> > it. Hence my question.
>>> >
>>> > In fact, I do agree if the function would sum(sqrt(y)), but in my case, I
>>> > would like to do something like
>>> >
>>> > f >
>>> > It's a small example for the sake of simplicity, just to illustrate that I
>>> > really want to have access to the full sub data.frame (the d variable) and
>>> > not just one column.
>>> >
>>> > Best,
>>> > David
>>> >
>>> > On Thu, Jan 17, 2013 at 5:07 PM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>>> >>
>>> >>
>>> >> Akhil,
>>> >>
>>> >> Kind of, but defining :
>>> >>
>>> >> my.func <- function(d) {
>>> >>     sum(sqrt(d[["y"]]))
>>> >> }
>>> >>
>>> >> followed by
>>> >>
>>> >> x.dt[ , my.func(.SD), by=x]
>>> >>
>>> >> isn't very data.table'ish. In fact the
>>> >> advice is to avoid .SD if possible, for speed.
>>> >>
>>> >> We'd forget my.func, and just do :
>>> >>
>>> >> x.dt[, sum(sqrt(y)), by=x]
>>> >>
>>> >> That is how we recommend it to be used, and
>>> >> allows data.table to optimize the query (which
>>> >> use of .SD may prevent).
>>> >>
>>> >> David - have you read the introduction vignette and have
>>> >> you worked through example(data.table) at the prompt?
>>> >>
>>> >> Matthew
>>> >>
>>> >>
>>> >>
>>> >> On 17.01.2013 16:53, Akhil Behl wrote:
>>> >>>
>>> >>> If I am not wrong, you are looking for `.SD'. In fact you can put in
>>> >>> the exact function you were throwing at ddply earlier. There are other
>>> >>> special names like .SD that you can find in the data.table FAQs.
>>> >>>
>>> >>> Let's see:
>>> >>> R> require(plyr)
>>> >>> Loading required package: plyr
>>> >>>
>>> >>> R> require(data.table)
>>> >>> Loading required package: data.table
>>> >>> data.table 1.8.7  For help type: help("data.table")
>>> >>>
>>> >>> R> x.df >>> R> x.dt >>> R>
>>> >>> R> my.func <- function(d) {
>>> >>> + sum(sqrt(d[["y"]]))
>>> >>> + }
>>> >>> R>
>>> >>> R> # The plyr way:
>>> >>> R> ddply(x.df, "x", my.func) -> ans.plyr
>>> >>> R>
>>> >>> R> # The data.table way:
>>> >>> R> x.dt[ , my.func(.SD), by=x] -> ans.dt
>>> >>> R>
>>> >>> R> ans.plyr
>>> >>>   x       V1
>>> >>> 1 a 10.61387
>>> >>> 2 b 11.85441
>>> >>>
>>> >>> R> ans.dt
>>> >>>    x       V1
>>> >>> 1: a 10.61387
>>> >>> 2: b 11.85441
>>> >>>
>>> >>> For more help, try this on an R prompt:
>>> >>>
>>> >>> R> vignette('datatable-faq')
>>> >>>
>>> >>> --
>>> >>> ASB.
>>> >>>
>>> >>> On Thu, Jan 17, 2013 at 9:49 PM, David Bellot <david.bellot at gmail.com> wrote:
>>> >>>>
>>> >>>> Hi,
>>> >>>>
>>> >>>> I've been looking all around the web without a clear answer to this
>>> >>>> trivial problem. I'm sure I'm not looking where I should:
>>> >>>>
>>> >>>> in fact, I want to replace my use of ddply from the plyr package by
>>> >>>> data.table. One of my main uses is to group a big data.frame by a group
>>> >>>> of variables and do something on this sub data.frame:
>>> >>>>
>>> >>>> ddply( my_df, my_grouping_var, function (d) { do something with d } )
>>> >>>> ----> d is a data.frame again
>>> >>>>
>>> >>>> and it's slow on big data.frame.
>>> >>>>
>>> >>>>
>>> >>>> However, I don't really understand how to redo the same thing with a
>>> >>>> data.table. Basically if "j" in a data.table is equivalent to the select
>>> >>>> clause in SQL, then how do I do SELECT * FROM etc...
>>> >>>>
>>> >>>> I want to be able to pass a function like in ddply that will receive not
>>> >>>> only a few columns but the full subset that is selected by the "by"
>>> >>>> clause.
>>> >>>>
>>> >>>> Thanks...
>>> >>>> Best,
>>> >>>> David
>>> >>>>
>>> >>>> _______________________________________________
>>> >>>> datatable-help mailing list
>>> >>>> datatable-help at lists.r-forge.r-project.org
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>> >>>



