[datatable-help] Curiosity with use of .SDcols
Steve Lianoglou
mailinglist.honeypot at gmail.com
Fri Sep 23 17:59:11 CEST 2011
Hi,
Comments inline:
On Fri, Sep 23, 2011 at 11:01 AM, djmuseR <djmuser at gmail.com> wrote:
> Hi:
>
> I'm playing around with some baseball data and ran into an error whose cause
> I don't quite understand.
> A subset of the data is here, consisting of all season batting records of
> five players:
[cut out data]
> # Variables I want to sum over each player:
> vars <- c('G', 'AB', 'R', 'H', 'X2B', 'X3B',
> 'HR', 'RBI', 'SB', 'CS', 'BB', 'SO', 'IBB', 'HBP',
> 'SH', 'SF', 'GIDP', 'G_old')
>
> # library('data.table')
> DTtst <- data.table(tst, key = 'playerID')
>
> The following works as I want:
> DT1 <- DTtst[, list(beginYear = min(yearID), endYear = max(yearID),
> nyears = sum(stint == 1L), nteams = length(unique(teamID))),
> by = 'playerID']
> DT2 <- DTtst[, lapply(.SD, sum), by = playerID, .SDcols = vars]
> DT1[DT2]
>
> # Combining the two into one call doesn't:
>
> DTtst[, list(beginYear = min(yearID),
>              endYear = max(yearID),
>              nyears = sum(stint == 1L),
>              nteams = length(unique(teamID)),
>              lapply(.SD, sum)),
>       by = playerID,
>       .SDcols = vars]
> # Error in eval(expr, envir, enclos) : object 'yearID' not found
>
> What am I missing? Is it the lapply() call within list()?
Using .SDcols restricts the columns that are injected into the scope
of your j-expression (where your `list(...)` is): they are the same as
the columns of .SD.
yearID isn't in `vars`, and therefore isn't in .SD. To convince
yourself, consider this:
R> DTtst[, {
     xx <- .SD
     browser()
   }, by='playerID', .SDcols=vars]
Called from: eval(expr, envir, enclos)
Browse[1]> xx
      G AB R H X2B X3B HR RBI SB CS BB SO IBB HBP SH SF GIDP G_old
[1,] 11  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0    11
[2,] 45  2 0 0   0   0  0   0  0  0  0  0   0   0  1  0    0    45
[3,] 25  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0     2
[4,] 47  1 0 0   0   0  0   0  0  0  0  1   0   0  0  0    0     5
[5,] 73  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0    NA
[6,] 53  0 0 0   0   0  0   0  0  0  0  0   0   0  0  0    0    NA
See? No yearID.
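If you'd rather not drop into browser(), a quick non-interactive check is to collect names(.SD) per group. This is only a sketch: `toy` and `sum_vars` are made-up stand-ins for your `tst` data and `vars`, not the real batting records.

```r
library(data.table)

# Hypothetical toy table standing in for the batting data
toy <- data.table(
  playerID = c("a", "a", "b"),
  yearID   = c(2001L, 2002L, 2005L),
  G        = c(10L, 20L, 5L),
  AB       = c(1L, 2L, 7L)
)
sum_vars <- c("G", "AB")  # hypothetical analogue of `vars`

# Record which columns .SD carries inside each group
sd_cols <- toy[, list(cols = paste(names(.SD), collapse = ",")),
               by = playerID, .SDcols = sum_vars]
print(sd_cols)
```

Only the columns named in .SDcols should show up in `cols`; yearID is absent.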
Just make sure all the vars you reference in your j-expression are in
your .SDcols.
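If you don't want yearID summed along with everything else, one possible way to get it all in a single call is to leave .SDcols out and subset .SD inside j. Note too that lapply(.SD, sum) placed inside list() becomes a single nested element rather than one column per variable, so the sketch below splices the two lists together with c(). The `toy` table and `sum_vars` are hypothetical stand-ins for your data, and subsetting .SD per group like this is not the fastest route on large tables.

```r
library(data.table)

# Hypothetical toy table standing in for the batting data
toy <- data.table(
  playerID = c("a", "a", "b"),
  yearID   = c(2001L, 2002L, 2005L),
  G        = c(10L, 20L, 5L),
  AB       = c(1L, 2L, 7L)
)
sum_vars <- c("G", "AB")  # hypothetical analogue of `vars`

# No .SDcols here, so yearID stays visible in j; the columns to sum
# are picked out of .SD explicitly, and c() joins the two lists so
# each summed variable becomes its own column.
res <- toy[, c(
  list(beginYear = min(yearID), endYear = max(yearID)),
  lapply(.SD[, sum_vars, with = FALSE], sum)
), by = playerID]
print(res)
```

Your two-step version (DT1[DT2]) remains a perfectly reasonable alternative.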
> Second question, more out of curiosity than anything else: is there an
> analogue in data.table to within() or plyr::mutate, where one can define new
> variables within a call and use them to create other variables? An example
> of what I have in mind is
>
> DT[, list(..., PA = AB + BB + HBP + SH + SF,
> OBP = ifelse(PA > 0,
> round((H + BB + HBP)/(PA - SH - SF), 3),
> NA)),
> by = playerID]
>
> I have a fairly strong prior on the answer to this question, but I'll let
> others weigh in first.
Matthew is fixing `within` in the development version (SVN from
R-Forge). There is also the recently introduced `:=` operator, but it
adds the new columns to the data.table you are iterating over, which
doesn't sound like what you want.
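To see what I mean about `:=` changing the table itself, here is a small sketch on a made-up table (`toy` and its values are hypothetical, not your data):

```r
library(data.table)

# Hypothetical toy table
toy <- data.table(
  playerID = c("a", "b"),
  AB       = c(10L, 0L),
  BB       = c(2L, 0L)
)

# := adds PA to `toy` by reference; no new table is returned for you
# to capture, the original object itself gains the column.
toy[, PA := AB + BB]
print(names(toy))
```

After the call, PA is a permanent column of `toy`, which is fine if you want it there and a nuisance if you only needed it as an intermediate.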
Note that your j-expression isn't restricted to being a list. The
example I gave above is one case, but you can also do:
DTtst[, {
  PA <- AB + BB + HBP + SH + SF
  list(PA=PA,
       OBP=ifelse(PA > 0, round((H + BB + HBP)/(PA - SH - SF), 3), NA))
}, by='playerID']
--
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact