[datatable-help] datatable-help Digest, Vol 17, Issue 10
Timothée Carayol
timothee.carayol at gmail.com
Sun Jul 17 14:36:41 CEST 2011
Hi --
Just a quick word to say that I would personally find this .SDcols option
very useful. Currently, some (minor) contorsions (e.g. selecting columns
before the [,, by] step) are required to get this behaviour; .SDcols would
be a bit more elegant I think.
Timothee
On Sun, Jul 17, 2011 at 12:43 PM, Matthew Dowle <mdowle at mdowle.plus.com>wrote:
> Nice observations. All tests, observations and motivations were correct
> as far as I could see, although I didn't rerun myself.
>
> Two things going on :
>
> i) Whenever you use .SD in j, .SD will contain *all* the columns from
> the table, regardless of how .SD is used. That's because it's difficult
> for data.table to know which columns of .SD the j really uses. Where the
> subset appears directly in j it's pretty obvious but where the subset of
> columns are held in a variable, and that variable could be the same name
> as a column name, it all gets complicated. But, there is a simple
> solution (I think) : we could add a new argument to data.table called
> '.SDcols' and you could pass the subset of columns in there; e.g.,
>
> DT[,lapply(.SD,sum),by="x,y",.SDcols=names(DT)[40:50]]
>
> Would that be better?
>
> ii) lapply() is the base R lapply, which we know is slow. Recall that
> data.table is over 10 times faster than tapply because tapply calls
> lapply. Note also that lapply takes a function (closure) whereas
> data.table's j is just a body (lambda). The syntax changes for
> data.table weren't just for fun, you know ;) There's a FR on this :
>
> https://r-forge.r-project.org/tracker/index.php?func=detail&aid=1303&group_id=240&atid=978
>
>
> However, doing both (i) and (ii) just makes the syntax easier to access
> the speed which is already possible by automatically creating a (long) j
> expression. That's how data.table knows which columns are being used (by
> inspecting the expression using all.vars(), only subsetting those) and
> there isn't any call to lapply so that slow down goes away. Maybe making
> helper functions to make that easier is another way to go.
>
> Matthew
>
>
> On Fri, 2011-07-15 at 12:01 -0700, Dennis Murphy wrote:
> > And I just posted something that nicely used colwise() in conjunction
> > with a vector of variable names on R-help just a few minutes ago. A
> > look at the colwise() help page shows that there are several ways to
> > input a variable list for use with colwise(): a bquoted,
> > comma-separated, unquoted string of variables (e.g., .(A, B, C)), a
> > one-sided formula interface or a vector of (quoted) variable names,
> > which was my particular concern. I was ready to recant my assertion,
> > but you guys are too quick and sharp for me today :) So much good
> > stuff going on in the R-related lists the past two days to which I can
> > either contribute or learn from...
> >
> > Back to the point, an efficient variable selection mechanism in
> > conjunction with a processing function that could optionally take a
> > user-contributed (anonymous) function would be a welcome feature in
> > data.table.
> >
> > Cheers,
> > Dennis
> >
> > On Fri, Jul 15, 2011 at 11:29 AM, Chris Neff <caneff at gmail.com> wrote:
> > > Just chiming in to say something similar to colwise from plyr would be
> quite
> > > nice. You could just carry around a vector of variable names, then do
> > > something DT[ ,colwise( f, var_names), by=by_names ].
> > >
> > > On 15 July 2011 14:06, Dennis Murphy <djmuser at gmail.com> wrote:
> > >>
> > >> On Fri, Jul 15, 2011 at 8:23 AM, Steve Lianoglou
> > >> <mailinglist.honeypot at gmail.com> wrote:
> > >> > Hi Dennis,
> > >> >
> > >> > I didn't see your post before I sent my latest reply.
> > >> >
> > >> > Nice detective work!
> > >>
> > >> Thanks, Steve. I just followed my nose and the docs, which I have
> > >> conveniently kept in a small binder for such occasions :) Like JV, I
> > >> don't use data.table every day, so some of its idiosyncracies get
> > >> cobwebbed in the hard drive over time. The wiki entries helped a lot.
> > >>
> > >> >
> > >> > For what it's worth, from what I understand your
> > >> > "punchline"/kewpie-prize solution is so much faster because it
> avoids
> > >> > building the .SD data.table within each group.
> > >>
> > >> That was my deduction from having read the first entry in the wiki. I
> > >> still can't believe I got that thing to work :)
> > >>
> > >> >
> > >> > I'll let Matthew leave a more detailed comment, since he's
> (obviously)
> > >> > much more intimately familiar w/ the inner voodoo of data.table. But
> > >> > as a last comment -- if the speed differences are so drastic because
> > >> > of the cost of creating the .SD data.table, maybe we should think
> > >> > about taking some "inspiration" from plyr and define a similar
> > >> > `colwise` function -- which would operate across each "column" of
> > >> > supposedly-build .SD object applying a function to each of them w/o
> > >> > actually building an .SD object itself.
> > >>
> > >> Your clairvoyance skills are clearly operating today :) More
> > >> seriously, this is what I would consider an 'obvious' "big-data"
> > >> problem - I could easily see situations arising in finance and genomic
> > >> applications where a fairly large subset of variables of the same
> > >> type, but not necessarily all of them, need to be summarized in a
> > >> particular way. The colwise() functions would be problematic as well
> > >> in the scenario described in my eariler post, but I haven't tried
> > >> ddply() to verify that assertion so I could be mistaken.
> > >>
> > >> It would be *really* helpful to have a convenient, fast mechanism in
> > >> data.table that allows one to substitute a (possibly large) vector of
> > >> variable names into a function. Alas, I don't have any bright ideas
> > >> about how to program it. Fortunately, there are some nice functions in
> > >> R to select variable subsets efficiently in data frames (e.g., the
> > >> grep() family of functions, regular expressions, %in% and so on), but
> > >> I don't know how that would translate easily to data.table() since the
> > >> internals are so different.
> > >>
> > >> Looking forward to the team's take on this...
> > >>
> > >> Dennis
> > >>
> > >> >
> > >> > -steve
> > >> >
> > >> > On Fri, Jul 15, 2011 at 10:34 AM, Dennis Murphy <djmuser at gmail.com>
> > >> > wrote:
> > >> >> Hi:
> > >> >>
> > >> >> <A bunch snipped because I get the archives in digest form>
> > >> >>
> > >> >> Re Prof. Voelkel's recent posts:
> > >> >>
> > >> >> (1) Quoting does not work well in data.table; this is mentioned in
> > >> >> several of the FAQs. Apropos to this discussion, some of the
> relevant
> > >> >> ones include 1.2, 1.6 and 2.1; there may be others :)
> > >> >>
> > >> >> (2) Steve's response seems to be the right way to go (although see
> > >> >> below), but I thought I'd up the stakes a little and assume that
> Prof.
> > >> >> Voelkel has a large number of variables, only a subset of which he
> may
> > >> >> want summarized in a particular go. To that end, I created the
> > >> >> following toy data frame cum data.table; this is as much for my own
> > >> >> edification as anyone else's (which explains the eventual length of
> > >> >> this post...I got curious :)
> > >> >>
> > >> >> This goes against the advice given in the first example of the
> > >> >> data.table wiki, but if you have, say, 100 variables to select out
> of
> > >> >> a possible 1000, it doesn't make sense to list them individually as
> > >> >> recommended on the wiki. (But see below...)
> > >> >>
> > >> >> library('data.table')
> > >> >> set.seed(1043)
> > >> >> m <- matrix(rpois(240, 10), nrow = 6)
> > >> >> colnames(m) <- paste('A', 1:40, sep = '')
> > >> >> m <- as.data.frame(m)
> > >> >> dt2 <- data.table(x = rep(1:3, 2), y = rep(1:3, each = 2), m, key =
> > >> >> 'x')
> > >> >> dim(dt2)
> > >> >> # [1] 6 42 ...so far, so good
> > >> >>
> > >> >> # Subset of variables for which sums are desired
> > >> >> vars <- paste('A', c(1, 4, 10, 15, 31), sep = '')
> > >> >>
> > >> >> # One approach: use the select = argument of subset() to restrict
> > >> >> # the variables under consideration:
> > >> >> dt2[, lapply(subset(.SD, select = vars), sum), by = 'x']
> > >> >> x A1 A4 A10 A15 A31
> > >> >> [1,] 1 18 21 22 22 24
> > >> >> [2,] 2 20 13 27 23 21
> > >> >> [3,] 3 22 15 16 23 15
> > >> >>
> > >> >> # Use the with = FALSE construct of data.table to do the same:
> > >> >> dt2[, lapply(.SD[, vars, with = FALSE], sum), by = 'x, y']
> > >> >> x y A1 A4 A10 A15 A31
> > >> >> [1,] 1 1 11 13 12 11 16
> > >> >> [2,] 1 2 7 8 10 11 8
> > >> >> [3,] 2 1 10 4 16 7 11
> > >> >> [4,] 2 3 10 9 11 16 10
> > >> >> [5,] 3 2 11 8 7 11 7
> > >> >> [6,] 3 3 11 7 9 12 8
> > >> >>
> > >> >> # For this example, it is the same (apart from the key variables)
> as
> > >> >> dt2[, vars, with = FALSE]
> > >> >>
> > >> >> Not bad for this small example, but what happens in a much larger
> data
> > >> >> table?
> > >> >>
> > >> >> To find out, I created a 10000 x 1000 matrix that I converted into
> a
> > >> >> data table, added two grouping variables of 100 levels each and
> then
> > >> >> tried both approaches above again. Performance isn't bad when
> > >> >> summarizing over one variable, but there is a definite hit when two
> > >> >> variables are summarized. [It makes some sense since one is
> grouping
> > >> >> over 10000 level combinations rather than 100, but once again, keep
> > >> >> reading.] Curiously, it makes no difference if there is one key
> > >> >> variable or two, which made me wonder what the preferred approach
> is
> > >> >> in this circumstance.
> > >> >>
> > >> >> m <- matrix(rpois(10000000, 10), nrow = 10000)
> > >> >> m <- as.data.table(m)
> > >> >> m <- transform(m, x = rep(1:100, each = 100), y = rep(1:100, 100))
> > >> >> setkey(m, 'x')
> > >> >> dim(m)
> > >> >> # [1] 10000 1002
> > >> >>
> > >> >> # Randomly select 150 variables from the 1000
> > >> >> vars <- paste('A', sample(1:1000, 150, replace = FALSE), sep = '')
> > >> >> length(vars)
> > >> >> # [1] 150
> > >> >> key(m)
> > >> >> # [1] "x"
> > >> >>> system.time(m[, lapply(subset(.SD, select = vars), sum), by =
> 'x'])
> > >> >> user system elapsed
> > >> >> 0.75 0.00 0.75
> > >> >>> system.time(m[, lapply(.SD[, vars, with = FALSE], sum), by = 'x'])
> > >> >> user system elapsed
> > >> >> 0.64 0.00 0.64
> > >> >>> system.time(m[, lapply(subset(.SD, select = vars), sum), by = 'x,
> y'])
> > >> >> user system elapsed
> > >> >> 53.65 0.00 53.85
> > >> >>> system.time(m[, lapply(.SD[, vars, with = FALSE], sum), by = 'x,
> y'])
> > >> >> user system elapsed
> > >> >> 44.21 0.01 44.35
> > >> >>
> > >> >> m2 <- data.table(m, key = 'x, y')
> > >> >> rm(m)
> > >> >> key(m2)
> > >> >> # [1] "x" "y"
> > >> >>> system.time(m2[, lapply(subset(.SD, select = vars), sum), by = 'x,
> > >> >>> y'])
> > >> >> user system elapsed
> > >> >> 53.54 0.00 53.73
> > >> >>> system.time(m2[, lapply(.SD[, vars, with = FALSE], sum), by = 'x,
> y'])
> > >> >> user system elapsed
> > >> >> 44.30 0.04 44.60
> > >> >>
> > >> >> The first question in the wiki
> > >> >> (http://rwiki.sciviews.org/doku.php?id=packages:cran:data.table)
> says
> > >> >> to use the columns directly rather than to rely on .SD. I wanted to
> > >> >> know how to pass new names to the summaries instead of overwriting
> the
> > >> >> original variable names. For the fun of it, I tried the following:
> > >> >>
> > >> >> select <- sample(1:1000, 150, replace = FALSE)
> > >> >> vars <- paste('A', select, sep = '')
> > >> >> outvars <- paste('S', select, sep = '')
> > >> >>
> > >> >> # Create a long expression of the form 'list(..., Sn = sum(An),
> ...)',
> > >> >> # n a subscript from 1 to 150.
> > >> >> expr <- paste('list(', paste(outvars, paste('sum(', vars, ')', sep
> =
> > >> >> ''), sep = '=', collapse = ','),
> > >> >> ')', sep = '')
> > >> >> u <- m2[, eval(parse(text = expr)), by = 'x']
> > >> >>> dim(u)
> > >> >> # [1] 100 151 seems reasonable...
> > >> >>
> > >> >> This seemed to run rather fast, so I decided to time it:
> > >> >>
> > >> >>> system.time(m2[, eval(parse(text = expr)), by = 'x'])
> > >> >> user system elapsed
> > >> >> 0.03 0.00 0.03
> > >> >>> system.time(m2[, eval(parse(text = expr)), by = 'x, y'])
> > >> >> user system elapsed
> > >> >> 1.05 0.00 1.04
> > >> >>
> > >> >> I've got to admit, this is not the approach I would have taken
> > >> >> normally, is certainly not intuitively obvious to me and flouts the
> > >> >> usual advice to avoid the eval(parse(text = )) mantra, but the data
> > >> >> don't lie :) Please tell me there's a more code-efficient way to
> do
> > >> >> this (the new variable names included), because my 'solution' was a
> > >> >> complete kludge and accidental kewpie prize.
> > >> >>
> > >> >> Cheers,
> > >> >> Dennis
> > >> >>
> > >> >>> Message: 1
> > >> >>> Date: Thu, 14 Jul 2011 16:36:11 -0400
> > >> >>> From: Joseph Voelkel <jgvcqa at rit.edu>
> > >> >>> Subject: [datatable-help] Skipping some Vi names
> > >> >>> To: "datatable-help at lists.r-forge.r-project.org"
> > >> >>> <datatable-help at r-forge.wu-wien.ac.at>
> > >> >>> Message-ID:
> > >> >>>
> > >> >>> <
> 70EFCDD908F9264785FA08EC3A471320282158585C at ex02mail01.ad.rit.edu>
> > >> >>> Content-Type: text/plain; charset="us-ascii"
> > >> >>>
> > >> >>> I don't use data.table too much (though I probably should use it
> > >> >>> more...).
> > >> >>>
> > >> >>> I was surprised at the results below. It appears that the name V1
> gets
> > >> >>> assigned to the first result, but then the keys ("in the
> background") are
> > >> >>> assigned the next set of Vi names, creating a gap in the names
> depending on
> > >> >>> the number of keys. I would like to see the Vi names appear in
> their
> > >> >>> natural, sequential, order. Not a show stopper, but it's annoying.
> (I have
> > >> >>> over 40 Vi's and it'd be good to have them numbered more
> rationally.)
> > >> >>> Thanks.
> > >> >>>
> > >> >>>>
> > >> >>>>
> dt<-data.table(x=c(1,2,3,1,2,3),y=c(1,1,2,2,3,3),A1=1:6,A2=7:12,A3=13:18,key="x")
> > >> >>>> dt[,list("sum(A1),sum(A2),sum(A3)"),by="x"]
> > >> >>> x V1 V3 V4
> > >> >>> [1,] 1 5 17 29
> > >> >>> [2,] 2 7 19 31
> > >> >>> [3,] 3 9 21 33
> > >> >>>> key(dt)<-c("x","y")
> > >> >>>> dt[,list("sum(A1),sum(A2),sum(A3)"),by="x,y"]
> > >> >>> x y V1 V4 V5
> > >> >>> [1,] 1 1 1 7 13
> > >> >>> [2,] 1 2 4 10 16
> > >> >>> [3,] 2 1 2 8 14
> > >> >>> [4,] 2 3 5 11 17
> > >> >>> [5,] 3 2 3 9 15
> > >> >>> [6,] 3 3 6 12 18
> > >> >>>
> > >> >>>
> > >> >>>
> > >> >>> Joseph G. Voelkel, Ph.D.
> > >> >>> Professor, Center for Quality and Applied Statistics
> > >> >>> Kate Gleason College of Engineering
> > >> >>> Rochester Institute of Technology
> > >> >>> V 585-475-2231
> > >> >>> F 585-475-5959
> > >> >>> joseph.voelkel at rit.edu
> > >> >>>
> > >> >> _______________________________________________
> > >> >> datatable-help mailing list
> > >> >> datatable-help at lists.r-forge.r-project.org
> > >> >>
> > >> >>
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> > >> >>
> > >> >
> > >> >
> > >> >
> > >> > --
> > >> > Steve Lianoglou
> > >> > Graduate Student: Computational Systems Biology
> > >> > | Memorial Sloan-Kettering Cancer Center
> > >> > | Weill Medical College of Cornell University
> > >> > Contact Info: http://cbio.mskcc.org/~lianos/contact
> > >> >
> > >> _______________________________________________
> > >> datatable-help mailing list
> > >> datatable-help at lists.r-forge.r-project.org
> > >>
> > >>
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> > >
> > >
> > _______________________________________________
> > datatable-help mailing list
> > datatable-help at lists.r-forge.r-project.org
> >
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20110717/0dc2d04b/attachment-0001.htm>
More information about the datatable-help
mailing list