[datatable-help] select * and getting the full sub data.table/frame

David Bellot david.bellot at gmail.com
Thu Jan 17 18:33:51 CET 2013


Indeed, it makes sense now: what is passed to the function is a
data.table and not a data.frame.
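
For the archive, here is a minimal sketch of the pattern on toy data like
the one in Akhil's example. The names first.rows and sd.is.dt are only
illustrative, and f is the small function from my earlier message; this is
one way to get at the full per-group subset, not necessarily the fastest.

```r
library(data.table)

# Toy data in the spirit of the thread: two groups, ten rows
x.dt <- data.table(x = rep(c("a", "b"), times = 5), y = 1:10)

# A function that needs the whole per-group subset, not a single column
f <- function(d) head(d, 1)

# .SD holds the rows of the current group (all columns except the "by" ones)
first.rows <- x.dt[, f(.SD), by = x]

# Check what the function actually receives in each group
sd.is.dt <- x.dt[, list(is.dt = is.data.table(.SD)), by = x]

print(first.rows)
print(sd.is.dt)
```

As Matthew points out further down, when only one or two columns are
needed it is better to name them directly in j and skip .SD.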

Thanks guys for your help. Now I'm a convinced data.table user.
Best,
David

On Thu, Jan 17, 2013 at 5:25 PM, Akhil Behl <akhil at igidr.ac.in> wrote:

> Hey David,
>
> I thought your problem may have been a typo, but I realized that it is
> in fact a subtle difference between the way data.table and data.frame
> work.
>
> One must provide unquoted names in the `j' expression for a
> data.table, i.e. one can say x.dt[ , y] but not x.dt[ , "y"] (which
> will evaluate to just "y" and hence the error).
>
> There are tricks around it like using with=FALSE, or using the
> data.frame notation x.dt[["y"]]. But once again, you will find such
> examples and explanations of idiomatic data.table expressions in the
> vignettes.
>
> --
> ASB.
>
> On Thu, Jan 17, 2013 at 10:42 PM, David Bellot <david.bellot at gmail.com>
> wrote:
> > Hi Matthew,
> >
> > I did read the introduction, but I wasn't sure about the way to write
> > it. Hence my question.
> >
> > I do agree when the function is just sum(sqrt(y)), but in my case I
> > would like to do something like
> >
> > f <- function(d)  head(d,1)
> >
> > It's a small example for the sake of simplicity, just to illustrate
> > that I really want to have access to the full sub data.frame (the d
> > variable) and not just one column.
> >
> > Best,
> > David
> >
> > On Thu, Jan 17, 2013 at 5:07 PM, Matthew Dowle <mdowle at mdowle.plus.com>
> > wrote:
> >>
> >>
> >> Akhil,
> >>
> >> Kind of, but defining :
> >>
> >> my.func <- function (d) {
> >>     sum(sqrt(d[["y"]]))
> >> }
> >>
> >> followed by
> >>
> >> x.dt[ , my.func(.SD), by=x]
> >>
> >> isn't very data.table'ish. In fact the
> >> advice is to avoid .SD if possible, for speed.
> >>
> >> We'd forget my.func, and just do :
> >>
> >> x.dt[, sum(sqrt(y)), by=x]
> >>
> >> That is how we recommend it to be used, and
> >> allows data.table to optimize the query (which
> >> use of .SD may prevent).
> >>
> >> David - have you read the introduction vignette and have
> >> you worked through example(data.table) at the prompt?
> >>
> >> Matthew
> >>
> >>
> >>
> >> On 17.01.2013 16:53, Akhil Behl wrote:
> >>>
> >>> If I am not wrong, you are looking for `.SD'. In fact you can put in
> >>> the exact function you were throwing at ddply earlier. There are other
> >>> special names like .SD that you can find in the data.table FAQs.
> >>>
> >>> Let's see:
> >>> R> require(plyr)
> >>> Loading required package: plyr
> >>>
> >>> R> require(data.table)
> >>> Loading required package: data.table
> >>> data.table 1.8.7  For help type: help("data.table")
> >>>
> >>> R> x.df <- data.frame(x=letters[1:2], y=1:10)
> >>> R> x.dt <- data.table(x.df)
> >>> R>
> >>> R> my.func <- function (d) { # Define a function on the subset
> >>> + sum(sqrt(d[["y"]]))
> >>> + }
> >>> R>
> >>> R> # The plyr way:
> >>> R> ddply(x.df, "x", my.func) -> ans.plyr
> >>> R>
> >>> R> # The data.table way:
> >>> R> x.dt[ , my.func(.SD), by=x] -> ans.dt
> >>> R>
> >>> R> ans.plyr
> >>>   x       V1
> >>> 1 a 10.61387
> >>> 2 b 11.85441
> >>>
> >>> R> ans.dt
> >>>    x       V1
> >>> 1: a 10.61387
> >>> 2: b 11.85441
> >>>
> >>> For more help, try this on an R prompt:
> >>>
> >>> R> vignette('datatable-faq')
> >>>
> >>> --
> >>> ASB.
> >>>
> >>> On Thu, Jan 17, 2013 at 9:49 PM, David Bellot <david.bellot at gmail.com>
> >>> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> I've been looking all around the web without finding a clear answer
> >>>> to this trivial problem. I'm sure I'm not looking where I should.
> >>>>
> >>>> I want to replace my use of ddply from the plyr package with
> >>>> data.table. One of my main uses is to group a big data.frame by a
> >>>> set of variables and do something with each sub data.frame:
> >>>>
> >>>> ddply( my_df, my_grouping_var, function (d)   { do something with d } )
> >>>> ----> d is a data.frame again
> >>>>
> >>>> and it's slow on a big data.frame.
> >>>>
> >>>>
> >>>> However, I don't really understand how to redo the same thing with
> >>>> a data.table. Basically, if "j" in a data.table is equivalent to the
> >>>> select clause in SQL, then how do I do SELECT * FROM etc...
> >>>>
> >>>> I want to be able to pass a function, like in ddply, that will
> >>>> receive not only a few columns but the full subset that is selected
> >>>> by the "by" clause.
> >>>>
> >>>> Thanks...
> >>>> Best,
> >>>> David
> >>>>
> >>>> _______________________________________________
> >>>> datatable-help mailing list
> >>>> datatable-help at lists.r-forge.r-project.org
> >>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> >>>
> >
> >
>