<font face="courier new,monospace">Hi --<br></font><div><font class="Apple-style-span" face="'courier new', monospace"><br></font></div><div><font class="Apple-style-span" face="'courier new', monospace">Just a quick word to say that I would personally find this .SDcols option very useful. Currently, some (minor) contorsions (e.g. selecting columns before the [,, by] step) </font><span class="Apple-style-span" style="font-family: 'courier new', monospace; ">are required to get this behaviour; .SDcols would be a bit more elegant I think.</span></div>
<div><span class="Apple-style-span" style="font-family: 'courier new', monospace; "><br></span></div><div><font class="Apple-style-span" face="'courier new', monospace">Timothee</font></div><div><span class="Apple-style-span" style="font-family: 'courier new', monospace; "><br>
</span></div><div><span class="Apple-style-span" style="font-family: 'courier new', monospace; "><br></span></div><div><span class="Apple-style-span" style="font-family: 'courier new', monospace; "><br></span></div>
<div><div class="gmail_quote">On Sun, Jul 17, 2011 at 12:43 PM, Matthew Dowle <span dir="ltr"><<a href="mailto:mdowle@mdowle.plus.com">mdowle@mdowle.plus.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
Nice observations. All tests, observations and motivations were correct<br>
as far as I could see, although I didn't rerun myself.<br>
<br>
Two things going on :<br>
<br>
i) Whenever you use .SD in j, .SD will contain *all* the columns from<br>
the table, regardless of how .SD is used. That's because it's difficult<br>
for data.table to know which columns of .SD the j really uses. Where the<br>
subset appears directly in j it's pretty obvious but where the subset of<br>
columns are held in a variable, and that variable could be the same name<br>
as a column name, it all gets complicated. But, there is a simple<br>
solution (I think) : we could add a new argument to data.table called<br>
'.SDcols' and you could pass the subset of columns in there; e.g.,<br>
<br>
DT[,lapply(.SD,sum),by="x,y",.SDcols=names(DT)[40:50]]<br>
<br>
Would that be better?<br>
<br>
ii) lapply() is the base R lapply, which we know is slow. Recall that<br>
data.table is over 10 times faster than tapply because tapply calls<br>
lapply. Note also that lapply takes a function (closure) whereas<br>
data.table's j is just a body (lambda). The syntax changes for<br>
data.table weren't just for fun, you know ;) There's a FR on this :<br>
<a href="https://r-forge.r-project.org/tracker/index.php?func=detail&aid=1303&group_id=240&atid=978" target="_blank">https://r-forge.r-project.org/tracker/index.php?func=detail&aid=1303&group_id=240&atid=978</a><br>
<br>
<br>
However, doing both (i) and (ii) just makes the syntax easier to access<br>
the speed which is already possible by automatically creating a (long) j<br>
expression. That's how data.table knows which columns are being used (by<br>
inspecting the expression using all.vars(), only subsetting those) and<br>
there isn't any call to lapply so that slow down goes away. Maybe making<br>
helper functions to make that easier is another way to go.<br>
<font color="#888888"><br>
Matthew<br>
</font><div><div></div><div class="h5"><br>
<br>
On Fri, 2011-07-15 at 12:01 -0700, Dennis Murphy wrote:<br>
> And I just posted something that nicely used colwise() in conjunction<br>
> with a vector of variable names on R-help just a few minutes ago. A<br>
> look at the colwise() help page shows that there are several ways to<br>
> input a variable list for use with colwise(): a bquoted,<br>
> comma-separated, unquoted string of variables (e.g., .(A, B, C)), a<br>
> one-sided formula interface or a vector of (quoted) variable names,<br>
> which was my particular concern. I was ready to recant my assertion,<br>
> but you guys are too quick and sharp for me today :) So much good<br>
> stuff going on in the R-related lists the past two days to which I can<br>
> either contribute or learn from...<br>
><br>
> Back to the point, an efficient variable selection mechanism in<br>
> conjunction with a processing function that could optionally take a<br>
> user-contributed (anonymous) function would be a welcome feature in<br>
> data.table.<br>
><br>
> Cheers,<br>
> Dennis<br>
><br>
> On Fri, Jul 15, 2011 at 11:29 AM, Chris Neff <<a href="mailto:caneff@gmail.com">caneff@gmail.com</a>> wrote:<br>
> > Just chiming in to say something similar to colwise from plyr would be quite<br>
> > nice. You could just carry around a vector of variable names, then do<br>
> > something DT[ ,colwise( f, var_names), by=by_names ].<br>
> ><br>
> > On 15 July 2011 14:06, Dennis Murphy <<a href="mailto:djmuser@gmail.com">djmuser@gmail.com</a>> wrote:<br>
> >><br>
> >> On Fri, Jul 15, 2011 at 8:23 AM, Steve Lianoglou<br>
> >> <<a href="mailto:mailinglist.honeypot@gmail.com">mailinglist.honeypot@gmail.com</a>> wrote:<br>
> >> > Hi Dennis,<br>
> >> ><br>
> >> > I didn't see your post before I sent my latest reply.<br>
> >> ><br>
> >> > Nice detective work!<br>
> >><br>
> >> Thanks, Steve. I just followed my nose and the docs, which I have<br>
> >> conveniently kept in a small binder for such occasions :) Like JV, I<br>
> >> don't use data.table every day, so some of its idiosyncracies get<br>
> >> cobwebbed in the hard drive over time. The wiki entries helped a lot.<br>
> >><br>
> >> ><br>
> >> > For what it's worth, from what I understand your<br>
> >> > "punchline"/kewpie-prize solution is so much faster because it avoids<br>
> >> > building the .SD data.table within each group.<br>
> >><br>
> >> That was my deduction from having read the first entry in the wiki. I<br>
> >> still can't believe I got that thing to work :)<br>
> >><br>
> >> ><br>
> >> > I'll let Matthew leave a more detailed comment, since he's (obviously)<br>
> >> > much more intimately familiar w/ the inner voodoo of data.table. But<br>
> >> > as a last comment -- if the speed differences are so drastic because<br>
> >> > of the cost of creating the .SD data.table, maybe we should think<br>
> >> > about taking some "inspiration" from plyr and define a similar<br>
> >> > `colwise` function -- which would operate across each "column" of<br>
> >> > supposedly-build .SD object applying a function to each of them w/o<br>
> >> > actually building an .SD object itself.<br>
> >><br>
> >> Your clairvoyance skills are clearly operating today :) More<br>
> >> seriously, this is what I would consider an 'obvious' "big-data"<br>
> >> problem - I could easily see situations arising in finance and genomic<br>
> >> applications where a fairly large subset of variables of the same<br>
> >> type, but not necessarily all of them, need to be summarized in a<br>
> >> particular way. The colwise() functions would be problematic as well<br>
> >> in the scenario described in my eariler post, but I haven't tried<br>
> >> ddply() to verify that assertion so I could be mistaken.<br>
> >><br>
> >> It would be *really* helpful to have a convenient, fast mechanism in<br>
> >> data.table that allows one to substitute a (possibly large) vector of<br>
> >> variable names into a function. Alas, I don't have any bright ideas<br>
> >> about how to program it. Fortunately, there are some nice functions in<br>
> >> R to select variable subsets efficiently in data frames (e.g., the<br>
> >> grep() family of functions, regular expressions, %in% and so on), but<br>
> >> I don't know how that would translate easily to data.table() since the<br>
> >> internals are so different.<br>
> >><br>
> >> Looking forward to the team's take on this...<br>
> >><br>
> >> Dennis<br>
> >><br>
> >> ><br>
> >> > -steve<br>
> >> ><br>
> >> > On Fri, Jul 15, 2011 at 10:34 AM, Dennis Murphy <<a href="mailto:djmuser@gmail.com">djmuser@gmail.com</a>><br>
> >> > wrote:<br>
> >> >> Hi:<br>
> >> >><br>
> >> >> <A bunch snipped because I get the archives in digest form><br>
> >> >><br>
> >> >> Re Prof. Voelkel's recent posts:<br>
> >> >><br>
> >> >> (1) Quoting does not work well in data.table; this is mentioned in<br>
> >> >> several of the FAQs. Apropos to this discussion, some of the relevant<br>
> >> >> ones include 1.2, 1.6 and 2.1; there may be others :)<br>
> >> >><br>
> >> >> (2) Steve's response seems to be the right way to go (although see<br>
> >> >> below), but I thought I'd up the stakes a little and assume that Prof.<br>
> >> >> Voelkel has a large number of variables, only a subset of which he may<br>
> >> >> want summarized in a particular go. To that end, I created the<br>
> >> >> following toy data frame cum data.table; this is as much for my own<br>
> >> >> edification as anyone else's (which explains the eventual length of<br>
> >> >> this post...I got curious :)<br>
> >> >><br>
> >> >> This goes against the advice given in the first example of the<br>
> >> >> data.table wiki, but if you have, say, 100 variables to select out of<br>
> >> >> a possible 1000, it doesn't make sense to list them individually as<br>
> >> >> recommended on the wiki. (But see below...)<br>
> >> >><br>
> >> >> library('data.table')<br>
> >> >> set.seed(1043)<br>
> >> >> m <- matrix(rpois(240, 10), nrow = 6)<br>
> >> >> colnames(m) <- paste('A', 1:40, sep = '')<br>
> >> >> m <- as.data.frame(m)<br>
> >> >> dt2 <- data.table(x = rep(1:3, 2), y = rep(1:3, each = 2), m, key =<br>
> >> >> 'x')<br>
> >> >> dim(dt2)<br>
> >> >> # [1] 6 42 ...so far, so good<br>
> >> >><br>
> >> >> # Subset of variables for which sums are desired<br>
> >> >> vars <- paste('A', c(1, 4, 10, 15, 31), sep = '')<br>
> >> >><br>
> >> >> # One approach: use the select = argument of subset() to restrict<br>
> >> >> # the variables under consideration:<br>
> >> >> dt2[, lapply(subset(.SD, select = vars), sum), by = 'x']<br>
> >> >> x A1 A4 A10 A15 A31<br>
> >> >> [1,] 1 18 21 22 22 24<br>
> >> >> [2,] 2 20 13 27 23 21<br>
> >> >> [3,] 3 22 15 16 23 15<br>
> >> >><br>
> >> >> # Use the with = FALSE construct of data.table to do the same:<br>
> >> >> dt2[, lapply(.SD[, vars, with = FALSE], sum), by = 'x, y']<br>
> >> >> x y A1 A4 A10 A15 A31<br>
> >> >> [1,] 1 1 11 13 12 11 16<br>
> >> >> [2,] 1 2 7 8 10 11 8<br>
> >> >> [3,] 2 1 10 4 16 7 11<br>
> >> >> [4,] 2 3 10 9 11 16 10<br>
> >> >> [5,] 3 2 11 8 7 11 7<br>
> >> >> [6,] 3 3 11 7 9 12 8<br>
> >> >><br>
> >> >> # For this example, it is the same (apart from the key variables) as<br>
> >> >> dt2[, vars, with = FALSE]<br>
> >> >><br>
> >> >> Not bad for this small example, but what happens in a much larger data<br>
> >> >> table?<br>
> >> >><br>
> >> >> To find out, I created a 10000 x 1000 matrix that I converted into a<br>
> >> >> data table, added two grouping variables of 100 levels each and then<br>
> >> >> tried both approaches above again. Performance isn't bad when<br>
> >> >> summarizing over one variable, but there is a definite hit when two<br>
> >> >> variables are summarized. [It makes some sense since one is grouping<br>
> >> >> over 10000 level combinations rather than 100, but once again, keep<br>
> >> >> reading.] Curiously, it makes no difference if there is one key<br>
> >> >> variable or two, which made me wonder what the preferred approach is<br>
> >> >> in this circumstance.<br>
> >> >><br>
> >> >> m <- matrix(rpois(10000000, 10), nrow = 10000)<br>
> >> >> m <- as.data.table(m)<br>
> >> >> m <- transform(m, x = rep(1:100, each = 100), y = rep(1:100, 100))<br>
> >> >> setkey(m, 'x')<br>
> >> >> dim(m)<br>
> >> >> # [1] 10000 1002<br>
> >> >><br>
> >> >> # Randomly select 150 variables from the 1000<br>
> >> >> vars <- paste('A', sample(1:1000, 150, replace = FALSE), sep = '')<br>
> >> >> length(vars)<br>
> >> >> # [1] 150<br>
> >> >> key(m)<br>
> >> >> # [1] "x"<br>
> >> >>> system.time(m[, lapply(subset(.SD, select = vars), sum), by = 'x'])<br>
> >> >> user system elapsed<br>
> >> >> 0.75 0.00 0.75<br>
> >> >>> system.time(m[, lapply(.SD[, vars, with = FALSE], sum), by = 'x'])<br>
> >> >> user system elapsed<br>
> >> >> 0.64 0.00 0.64<br>
> >> >>> system.time(m[, lapply(subset(.SD, select = vars), sum), by = 'x, y'])<br>
> >> >> user system elapsed<br>
> >> >> 53.65 0.00 53.85<br>
> >> >>> system.time(m[, lapply(.SD[, vars, with = FALSE], sum), by = 'x, y'])<br>
> >> >> user system elapsed<br>
> >> >> 44.21 0.01 44.35<br>
> >> >><br>
> >> >> m2 <- data.table(m, key = 'x, y')<br>
> >> >> rm(m)<br>
> >> >> key(m2)<br>
> >> >> # [1] "x" "y"<br>
> >> >>> system.time(m2[, lapply(subset(.SD, select = vars), sum), by = 'x,<br>
> >> >>> y'])<br>
> >> >> user system elapsed<br>
> >> >> 53.54 0.00 53.73<br>
> >> >>> system.time(m2[, lapply(.SD[, vars, with = FALSE], sum), by = 'x, y'])<br>
> >> >> user system elapsed<br>
> >> >> 44.30 0.04 44.60<br>
> >> >><br>
> >> >> The first question in the wiki<br>
> >> >> (<a href="http://rwiki.sciviews.org/doku.php?id=packages:cran:data.table" target="_blank">http://rwiki.sciviews.org/doku.php?id=packages:cran:data.table</a>) says<br>
> >> >> to use the columns directly rather than to rely on .SD. I wanted to<br>
> >> >> know how to pass new names to the summaries instead of overwriting the<br>
> >> >> original variable names. For the fun of it, I tried the following:<br>
> >> >><br>
> >> >> select <- sample(1:1000, 150, replace = FALSE)<br>
> >> >> vars <- paste('A', select, sep = '')<br>
> >> >> outvars <- paste('S', select, sep = '')<br>
> >> >><br>
> >> >> # Create a long expression of the form 'list(..., Sn = sum(An), ...)',<br>
> >> >> # n a subscript from 1 to 150.<br>
> >> >> expr <- paste('list(', paste(outvars, paste('sum(', vars, ')', sep =<br>
> >> >> ''), sep = '=', collapse = ','),<br>
> >> >> ')', sep = '')<br>
> >> >> u <- m2[, eval(parse(text = expr)), by = 'x']<br>
> >> >>> dim(u)<br>
> >> >> # [1] 100 151 seems reasonable...<br>
> >> >><br>
> >> >> This seemed to run rather fast, so I decided to time it:<br>
> >> >><br>
> >> >>> system.time(m2[, eval(parse(text = expr)), by = 'x'])<br>
> >> >> user system elapsed<br>
> >> >> 0.03 0.00 0.03<br>
> >> >>> system.time(m2[, eval(parse(text = expr)), by = 'x, y'])<br>
> >> >> user system elapsed<br>
> >> >> 1.05 0.00 1.04<br>
> >> >><br>
> >> >> I've got to admit, this is not the approach I would have taken<br>
> >> >> normally, is certainly not intuitively obvious to me and flouts the<br>
> >> >> usual advice to avoid the eval(parse(text = )) mantra, but the data<br>
> >> >> don't lie :) Please tell me there's a more code-efficient way to do<br>
> >> >> this (the new variable names included), because my 'solution' was a<br>
> >> >> complete kludge and accidental kewpie prize.<br>
> >> >><br>
> >> >> Cheers,<br>
> >> >> Dennis<br>
> >> >><br>
> >> >>> Message: 1<br>
> >> >>> Date: Thu, 14 Jul 2011 16:36:11 -0400<br>
> >> >>> From: Joseph Voelkel <<a href="mailto:jgvcqa@rit.edu">jgvcqa@rit.edu</a>><br>
> >> >>> Subject: [datatable-help] Skipping some Vi names<br>
> >> >>> To: "<a href="mailto:datatable-help@lists.r-forge.r-project.org">datatable-help@lists.r-forge.r-project.org</a>"<br>
> >> >>> <<a href="mailto:datatable-help@r-forge.wu-wien.ac.at">datatable-help@r-forge.wu-wien.ac.at</a>><br>
> >> >>> Message-ID:<br>
> >> >>><br>
> >> >>> <<a href="mailto:70EFCDD908F9264785FA08EC3A471320282158585C@ex02mail01.ad.rit.edu">70EFCDD908F9264785FA08EC3A471320282158585C@ex02mail01.ad.rit.edu</a>><br>
> >> >>> Content-Type: text/plain; charset="us-ascii"<br>
> >> >>><br>
> >> >>> I don't use data.table too much (though I probably should use it<br>
> >> >>> more...).<br>
> >> >>><br>
> >> >>> I was surprised at the results below. It appears that the name V1 gets<br>
> >> >>> assigned to the first result, but then the keys ("in the background") are<br>
> >> >>> assigned the next set of Vi names, creating a gap in the names depending on<br>
> >> >>> the number of keys. I would like to see the Vi names appear in their<br>
> >> >>> natural, sequential, order. Not a show stopper, but it's annoying. (I have<br>
> >> >>> over 40 Vi's and it'd be good to have them numbered more rationally.)<br>
> >> >>> Thanks.<br>
> >> >>><br>
> >> >>>><br>
> >> >>>> dt<-data.table(x=c(1,2,3,1,2,3),y=c(1,1,2,2,3,3),A1=1:6,A2=7:12,A3=13:18,key="x")<br>
> >> >>>> dt[,list("sum(A1),sum(A2),sum(A3)"),by="x"]<br>
> >> >>> x V1 V3 V4<br>
> >> >>> [1,] 1 5 17 29<br>
> >> >>> [2,] 2 7 19 31<br>
> >> >>> [3,] 3 9 21 33<br>
> >> >>>> key(dt)<-c("x","y")<br>
> >> >>>> dt[,list("sum(A1),sum(A2),sum(A3)"),by="x,y"]<br>
> >> >>> x y V1 V4 V5<br>
> >> >>> [1,] 1 1 1 7 13<br>
> >> >>> [2,] 1 2 4 10 16<br>
> >> >>> [3,] 2 1 2 8 14<br>
> >> >>> [4,] 2 3 5 11 17<br>
> >> >>> [5,] 3 2 3 9 15<br>
> >> >>> [6,] 3 3 6 12 18<br>
> >> >>><br>
> >> >>><br>
> >> >>><br>
> >> >>> Joseph G. Voelkel, Ph.D.<br>
> >> >>> Professor, Center for Quality and Applied Statistics<br>
> >> >>> Kate Gleason College of Engineering<br>
> >> >>> Rochester Institute of Technology<br>
> >> >>> V 585-475-2231<br>
> >> >>> F 585-475-5959<br>
> >> >>> <a href="mailto:joseph.voelkel@rit.edu">joseph.voelkel@rit.edu</a><br>
> >> >>><br>
> >> >> _______________________________________________<br>
> >> >> datatable-help mailing list<br>
> >> >> <a href="mailto:datatable-help@lists.r-forge.r-project.org">datatable-help@lists.r-forge.r-project.org</a><br>
> >> >><br>
> >> >> <a href="https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help" target="_blank">https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help</a><br>
> >> >><br>
> >> ><br>
> >> ><br>
> >> ><br>
> >> > --<br>
> >> > Steve Lianoglou<br>
> >> > Graduate Student: Computational Systems Biology<br>
> >> > | Memorial Sloan-Kettering Cancer Center<br>
> >> > | Weill Medical College of Cornell University<br>
> >> > Contact Info: <a href="http://cbio.mskcc.org/~lianos/contact" target="_blank">http://cbio.mskcc.org/~lianos/contact</a><br>
> >> ><br>
> >> _______________________________________________<br>
> >> datatable-help mailing list<br>
> >> <a href="mailto:datatable-help@lists.r-forge.r-project.org">datatable-help@lists.r-forge.r-project.org</a><br>
> >><br>
> >> <a href="https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help" target="_blank">https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help</a><br>
> ><br>
> ><br>
> _______________________________________________<br>
> datatable-help mailing list<br>
> <a href="mailto:datatable-help@lists.r-forge.r-project.org">datatable-help@lists.r-forge.r-project.org</a><br>
> <a href="https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help" target="_blank">https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help</a><br>
<br>
<br>
_______________________________________________<br>
datatable-help mailing list<br>
<a href="mailto:datatable-help@lists.r-forge.r-project.org">datatable-help@lists.r-forge.r-project.org</a><br>
<a href="https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help" target="_blank">https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help</a><br>
</div></div></blockquote></div><br></div>