[datatable-help] Curiosity with use of .SDcols

Steve Lianoglou mailinglist.honeypot at gmail.com
Fri Sep 23 17:59:11 CEST 2011


Hi,

Comments inline.

On Fri, Sep 23, 2011 at 11:01 AM, djmuseR <djmuser at gmail.com> wrote:
> Hi:
>
> I'm playing around with some baseball data and ran into an error whose cause
> I don't quite understand.
> A subset of the data is here, consisting of all season batting records of
> five players:

[cut out data]

> # Variables I want to sum over each player:
> vars <- c('G', 'AB', 'R', 'H', 'X2B', 'X3B',
>          'HR', 'RBI', 'SB', 'CS', 'BB', 'SO', 'IBB', 'HBP',
>          'SH', 'SF', 'GIDP', 'G_old')
>
> # library('data.table')
> DTtst <- data.table(tst, key = 'playerID')
>
> The following works as I want:
> DT1 <- DTtst[, list(beginYear = min(yearID), endYear = max(yearID),
>              nyears = sum(stint == 1L), nteams = length(unique(teamID))),
>         by = 'playerID']
> DT2 <- DTtst[, lapply(.SD, sum), by = playerID, .SDcols = vars]
> DT1[DT2]
>
> # Combining the two into one call doesn't:
>
> DTtst[, list( beginYear = min(yearID),
>                                    endYear = max(yearID),
>                                    nyears = sum(stint == 1L),
>                                    nteams = length(unique(teamID)),
>                                    lapply(.SD, sum)),
>                               by = playerID,
>                               .SDcols = vars]
> # Error in eval(expr, envir, enclos) : object 'yearID' not found
>
> What am I missing? Is it the lapply() call within list()?

Using .SDcols restricts the columns that are injected into the scope
of your j-expression (where your `list(...)` is) to the columns of
.SD.

yearID isn't in `vars`, and therefore isn't in .SD either. To convince
yourself, consider this:

R> DTtst[, {
  xx <- .SD
  browser()
}, by='playerID', .SDcols=vars]

Called from: eval(expr, envir, enclos)
Browse[1]> xx
      G AB R H X2B X3B HR RBI SB CS BB SO IBB HBP SH SF GIDP G_old
[1,] 11  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0    11
[2,] 45  2 0 0   0   0  0   0  0  0  0  0   0   0  1  0    0    45
[3,] 25  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0     2
[4,] 47  1 0 0   0   0  0   0  0  0  0  1   0   0  0  0    0     5
[5,] 73  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0    NA
[6,] 53  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0    NA

See? No yearID.

Just make sure every column you reference in your j-expression is
also listed in your .SDcols.
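
For what it's worth, here is a minimal sketch of one way to get
everything in a single call. It runs on made-up toy data: the `toy`
table, its values, and the `needed` vector are assumptions for
illustration, not your data. Two things change. The summary columns
are added to .SDcols so they are visible in j, and the `lapply()`
result is spliced in with `c(list(...), lapply(...))`, because an
`lapply()` placed inside `list()` becomes one nested list element
rather than one column per variable. The `.SD[, vars, with = FALSE]`
subset keeps yearID, stint and teamID out of the sums.

```r
library(data.table)

# Toy stand-in for the batting data (values are invented).
toy <- data.table(
  playerID = c("a", "a", "b", "b", "b"),
  yearID   = c(2001L, 2002L, 1999L, 2000L, 2000L),
  stint    = c(1L, 1L, 1L, 1L, 2L),
  teamID   = c("NYA", "NYA", "BOS", "BOS", "CHN"),
  G        = c(10L, 20L, 5L, 6L, 7L),
  AB       = c(30L, 40L, 1L, 2L, 3L),
  key = "playerID"
)

# Columns to sum, plus every other column that j refers to.
vars   <- c("G", "AB")
needed <- c("yearID", "stint", "teamID", vars)

res <- toy[, c(
  # Named per-player summaries.
  list(beginYear = min(yearID),
       endYear   = max(yearID),
       nyears    = sum(stint == 1L),
       nteams    = length(unique(teamID))),
  # One summed column per entry in vars, spliced in by c().
  lapply(.SD[, vars, with = FALSE], sum)
), by = playerID, .SDcols = needed]

print(res)
```

Subsetting .SD inside every group is not the fastest approach, so the
two-step `DT1[DT2]` join you already have may still be the better
choice on the full data.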


> Second question, more out of curiosity than anything else: is there an
> analogue in data.table to within() or plyr::mutate, where one can define new
> variables within a call and use them to create other variables? An example
> of what I have in mind is
>
> DT[, list(..., PA = AB + BB + HBP + SH + SF,
>                  OBP = ifelse(PA > 0,
>                                round((H + BB + HBP)/(PA - SH - SF), 3),
> NA)),
>       by = playerID]
>
> I have a fairly strong prior on the answer to this question, but I'll let
> others weigh in first.

Matthew is fixing `within` in the development version (SVN from
r-forge). There is also the recently introduced `:=`, but it adds the
new columns to the data.table you are iterating over, which doesn't
sound like what you want.
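
To make the `:=` behaviour concrete, here is a small sketch on
made-up data (the `toy` table and its values are assumptions, not your
data). Each `:=` call modifies `toy` in place, so PA exists as a real
column by the time the second call uses it.

```r
library(data.table)

# Toy batting data; the values are invented for illustration.
toy <- data.table(
  playerID = c("a", "a", "b"),
  AB  = c(10L, 0L, 4L),
  BB  = c(2L, 0L, 1L),
  HBP = c(0L, 0L, 0L),
  SH  = c(1L, 0L, 0L),
  SF  = c(1L, 0L, 0L),
  H   = c(3L, 0L, 1L)
)

# `:=` adds PA to toy itself, by reference; nothing is copied.
toy[, PA := AB + BB + HBP + SH + SF]

# PA is now an ordinary column, so a second `:=` can build on it.
toy[, OBP := ifelse(PA > 0,
                    round((H + BB + HBP) / (PA - SH - SF), 3),
                    NA)]

print(toy)
```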

Note that your j-expression isn't restricted to being a single
`list(...)` call. It can be a braced block that defines intermediate
variables and then returns a list. The browser() example above is one
such block; you can also do:

DTtst[, {
 PA <- AB + BB + HBP + SH + SF
 list(PA=PA, OBP=ifelse(PA > 0, round((H + BB + HBP)/(PA - SH - SF), 3), NA))
}, by='playerID']

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

