[datatable-help] Fast first row of each group

Matthew Dowle mdowle at mdowle.plus.com
Fri Mar 11 10:23:43 CET 2011


I thought there was a FAQ on that, but it appears not. Will add.

Yes, you're right: .SD is built only when j itself uses it, i.e. if
(".SD" %in% all.vars(j)).
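
A minimal base-R sketch of that check, for anyone who wants to try it outside data.table. The helper name uses_sd and the three quoted expressions are illustrative assumptions; data.table's internals may differ in detail.

```r
# Base R only: does a quoted j expression mention .SD by name?
uses_sd <- function(j) ".SD" %in% all.vars(j)   # hypothetical helper

j1 <- quote(sapply(.SD, sum))   # mentions .SD directly
j2 <- quote(f())                # any use of .SD is hidden inside f
j3 <- quote({.SD; f()})         # mentions .SD, then calls f

uses_sd(j1)
uses_sd(j2)
uses_sd(j3)
```

Only the text of j is inspected, so a function called from j that refers to .SD internally is not detected.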

To illustrate :

> DT = data.table(a=1:2,b=1:10)

> DT[,sapply(.SD,sum),a]     # j uses .SD
     a V1
[1,] 1 25
[2,] 2 30

> DT[,{sapply(.SD,sum)},a]    # j uses .SD
     a V1
[1,] 1 25
[2,] 2 30

> f = function()sapply(.SD,sum)
> DT[,f(),a]     # j doesn't use .SD (f does)
Error in is.vector(X) : object '.SD' not found

> DT[,{.SD;f()},a]     # attempt: yes, I really want to use .SD!
Error in is.vector(X) : object '.SD' not found
# remember that lexical scoping works from where the function f was
# *defined*, not from where it is used
# so you have to pass .SD in, in this case :

> f = function(.SD)sapply(.SD,sum)
> DT[,f(.SD),a]     # works, but *not* recommended, see below
     a V1
[1,] 1 25
[2,] 2 30
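
The scoping point can be reproduced in base R without data.table. This is a sketch under assumed names (local_df, caller, caller2 and f2 are all hypothetical), where local_df plays the role of .SD:

```r
# f is defined at top level, where no object called local_df exists.
f <- function() sapply(local_df, sum)

caller <- function() {
  local_df <- data.frame(b = 1:5)   # visible only inside caller()
  f()                               # f searches where it was defined, not here
}

# The call fails; capture the error message instead of stopping.
msg <- tryCatch(caller(), error = function(e) conditionMessage(e))
msg

# Passing the object in as an argument works.
f2 <- function(d) sapply(d, sum)
caller2 <- function() {
  local_df <- data.frame(b = 1:5)
  f2(local_df)
}
res <- caller2()
res
```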

BUT we don't like functions in data.table (because of the call overhead and
the copying). We prefer lambdas and "macros", i.e. eval(q), so we don't
really want to encourage functions. We also avoid .SD wherever possible.
So that last form is *not* recommended; it's just an illustration.
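
A hedged sketch of the two preferred forms, assuming a current data.table installation. The table is built with rep() so the example does not depend on column recycling, and reading "lambda" as an expression written inline in j is an interpretation, not something stated above.

```r
library(data.table)

# Same data as the transcript above, with the recycling written out.
DT <- data.table(a = rep(1:2, 5), b = 1:10)

# Inline expression in j: refers to column b directly, no .SD needed.
r1 <- DT[, sum(b), by = a]

# "Macro": quote the expression once, then eval() it inside j.
q <- quote(sum(b))
r2 <- DT[, eval(q), by = a]

r1
r2
```

Both forms name only the column they need, so neither mentions .SD in j.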

> Hopefully my example isn't too confusing w/o a real code sample --
> which I can provide if so.

Good approach.  If we need a reproducible example, we can always ask, but I 
understood without.

Matthew

"Steve Lianoglou" <mailinglist.honeypot at gmail.com> wrote in message 
news:AANLkTik5eWuEM28WwHYG1tL3afbys-699U0_g5LpyjpV at mail.gmail.com...
Hey Matthew,

On Mon, Mar 7, 2011 at 8:08 PM, Matthew Dowle <mdowle at mdowle.plus.com> 
wrote:
>
> Hi Steve,
>
> Have posted a follow up to your answer here :
>
> http://stats.stackexchange.com/questions/7884/fast-ways-in-r-to-get-the-first-row-of-a-data-frame-grouped-by-an-identifier/7985#7985
>
> Thought it might be of interest on list as it's such a large difference.
>
> I realise this probably isn't clear in the documentation or FAQs, so
> have added todo to make that clearer.
>
> Btw, could somebody vote me up please - I have 1 point and can't make
> comments!

As I commented on stats.stackexchange: Nicely done! (I upvoted then,
too.)

I was just curious about what determines when the `.SD` object is built.

I often do things like:

my.data.table[, {
  ## some block of code
  list(a=whatever, b=something)
}, by='some.key']

If I never reference `.SD` and only reference some subset of the
columns of `my.data.table` in my "block of code", can data.table still
avoid building `.SD` and only copy the columns I reference in my
"block of code," or does this magic only restrict my group summaries
to something I can evaluate within a list(...), like:

my.data.table[, list(a=whatever, b=something), by='some.key']

Hopefully my example isn't too confusing w/o a real code sample --
which I can provide if so.

Thanks,
-steve


-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact 




