I think that this is a really good idea in that<div><ol><li>it is consistent with the current implementation</li><li>it is quite flexible.</li></ol><div>One only has to decide what to do when the user gives an invalid integer, i.e. an integer larger than .N. Either one throws an error, or one gives a warning and takes .N. What do the others think of Matthew's proposal?</div>
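<div><br></div><div>To make the two options concrete, here is a minimal sketch that only uses constructs which already exist, since an integer mult= is just a proposal at this point. The table DT, its columns g and v, and the variable k are made up for illustration; k stands in for the proposed integer value of mult.</div>

```r
library(data.table)

# Toy data (assumed for illustration): group "a" has 3 rows, group "b" has 1.
DT <- data.table(g = c("a", "a", "a", "b"), v = 1:4)
k <- 2L  # stands in for the proposed integer value of mult

# Option 1: throw an error as soon as k is larger than .N in some group.
strict <- tryCatch(
  DT[, {
    if (k > .N) stop("k = ", k, " is larger than .N = ", .N)
    .I[k]
  }, by = g],
  error = function(e) conditionMessage(e)
)

# Option 2: give a warning and take row .N instead.
rows <- DT[, {
  if (k > .N) warning("k = ", k, " is larger than .N = ", .N, "; taking .N")
  .I[min(k, .N)]
}, by = g]$V1
lenient <- DT[rows]
```

<div>In this toy table group "b" has a single row, so the first variant stops with an error (caught here and stored as a message), while the second one warns and falls back to that single row.</div>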
<div><br></div><div>On a side note: </div><div><br></div><div>I just played around with the mult= option and discovered the following:</div><div><br></div><div><div>DT <- data.table(X = rnorm(200),</div><div> Y = rnorm(200),</div>
<div> C =rep(c(1, 2), each=100),</div><div> key="C")</div></div><div>all.equal(DT[C==1, mult="first"], DT[C==1, mult="all"])</div><div>all.equal(DT[1:100, mult="first"], DT[1:100, mult="all"])</div>
<div># [1] TRUE</div><div><br></div>
<div>For me, it is rather surprising that both mult="all" and mult="first" return all rows that match C==1, i.e. in the case when i is a logical or an integer vector. From the help page: "When <em>multiple</em> rows in <code>x</code> match to the row in <code>i</code>, <code>mult</code> controls which are returned: <code>"all"</code> (default), <code>"first"</code> or <code>"last"</code>."
Although one could argue that a "row" only applies to a data.table and not to a vector, as a user I would expect mult= to work in both cases, i.e. when i is a data.table, a logical or an integer. However, it only takes effect when i is a data.table, which I checked in the source code. So should I update the documentation to make it clearer that mult= only works if i is a data.table?</div>
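<div><br></div><div>To illustrate the difference, here is a small sketch that puts the two kinds of i side by side. It assumes a keyed table like the one above; the n_* variable names are only for illustration, and I have not checked every data.table version.</div>

```r
library(data.table)

# A table like the one above, with 100 rows per value of C.
DT <- data.table(X = rnorm(200), Y = rnorm(200),
                 C = rep(c(1, 2), each = 100), key = "C")

# i is a logical or an integer vector: plain subsetting, mult= is not used.
n_logical <- nrow(DT[C == 1, mult = "first"])
n_integer <- nrow(DT[1:100, mult = "first"])

# i is a data.table built with J(): a keyed join, so mult= applies.
n_join_first <- nrow(DT[J(1), mult = "first"])
n_join_all   <- nrow(DT[J(1), mult = "all"])
```

<div>Comparing n_logical and n_integer with n_join_first shows that mult= only makes a difference in the join case.</div>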
<br><div class="gmail_quote">On Sun, Jan 8, 2012 at 8:25 PM, Matthew Dowle <span dir="ltr"><<a href="mailto:mdowle@mdowle.plus.com">mdowle@mdowle.plus.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
How about allowing mult to be integer (or an expression that evaluates<br>
to integer) :<br>
<br>
DT[X, mult="first"]<br>
DT[X, mult=1L] # same<br>
<br>
DT[X, mult="last"]<br>
DT[X, mult=.N] # same<br>
<br>
DT[X, .SD[2]] # 2nd row of each group (inefficient due to .SD; there are other, longer alternatives)<br>
DT[X, mult=2L] # same, but efficient and simple<br>
<br>
DT[X, mult="random"]<br>
DT[X, mult=sample(.N,size=1)] # same, but more general<br>
<br>
DT[X, mult=-1L] # all but the first of each group<br>
<span class="HOEnZb"><font color="#888888"><br>
Matthew<br>
</font></span><div class="HOEnZb"><div class="h5"><br>
On Sat, 2012-01-07 at 19:07 -0800, Steven C. Bagley wrote:<br>
> The mult argument is becoming its own little programming language. I<br>
> worry that this is going to get complicated in an ad hoc way. What if<br>
> someone wants random, but with weighting? Each new value of mult is<br>
> really shorthand for an R language construct. Maybe there is a more<br>
> general way to express these ideas using existing R constructs? (I'm<br>
> not sure how to do this consistently. I'm merely making an<br>
> observation.)<br>
><br>
><br>
> --Steve<br>
><br>
><br>
> On Jan 6, 2012, at 5:58 AM, Christoph Jäckel wrote:<br>
><br>
> > Thanks for your feedback. @Chris: I guess Matthew's example and<br>
> > yours do not really match because he doesn't call sample(dt,...),<br>
> > but sample(dt[i, which=TRUE],... His option, though, returns all the<br>
> > rows that match between the keys of dt and i and takes a random<br>
> > sample of size 1 from that, so I guess it does what I expected.<br>
> > Nevertheless, I think an option mult="random" would still be useful.<br>
> > Here is why:<br>
> ><br>
> ><br>
> > I guess my first example was a little bit too simplistic, sorry for<br>
> > that! Here is an updated, more realistic example of what I do and<br>
> > some hints about my current implementation of mult="random":<br>
> ><br>
> ><br>
> > require(data.table)<br>
> > rawData <- data.table(fundID = 1:1e5,<br>
> > Year = rep(1:10, times=1e4),<br>
> > key = "Year")<br>
> > # Let's have 10000 runs; in each run we want to draw a fund with a year that is set dynamically<br>
> > intJoin <- J(sample(1:10, size=10000, replace=TRUE))<br>
> ><br>
> ><br>
> > # Best solution I have come up with so far using the current options in data.table.<br>
> > # Is there one that can beat mult="random" and is easy for the user to implement?<br>
> > foo1 <- function(n, intJoin, rawData) {<br>
> > x <- integer(n)<br>
> > for (r in seq_len(nrow(intJoin))) {<br>
> > x[r] <- sample(rawData[intJoin[r], which=TRUE], size=1)<br>
> > }<br>
> > return(rawData[x])<br>
> > }<br>
> > system.time(finalData <- foo1(10000, intJoin, rawData))<br>
> > # user system elapsed<br>
> > # 43.827 0.000 44.232<br>
> > # Check that it does what it should: match random entities to the exact year in intJoin<br>
> > cbind(finalData, intJoin)<br>
> > # fundID Year V1<br>
> > # [1,] 46556 6 6<br>
> > # [2,] 77642 2 2<br>
> > # [3,] 17325 5 5<br>
> > # [4,] 36617 7 7<br>
> > # [5,] 90697 7 7<br>
> > # [6,] 4536 6 6<br>
> > # [7,] 22273 3 3<br>
> > # [8,] 46825 5 5<br>
> > # [9,] 65788 8 8<br>
> > # [10,] 14153 3 3<br>
> ><br>
> ><br>
> > #My implementation of mult="random"<br>
> > system.time(finalData <- rawData[intJoin, mult="random"])<br>
> > # user system elapsed<br>
> > # 0.324 0.016 0.337<br>
> > #Pretty fast and easy to understand<br>
> > # Check that it does what it should: match random entities to the exact year in intJoin<br>
> > cbind(finalData, intJoin)<br>
> > # Year fundID V1<br>
> > # [1,] 6 39626 6<br>
> > # [2,] 2 98552 2<br>
> > # [3,] 5 85425 5<br>
> > # [4,] 7 24637 7<br>
> > # [5,] 7 74797 7<br>
> > # [6,] 6 87626 6<br>
> > # [7,] 3 88973 3<br>
> > # [8,] 5 60335 5<br>
> > # [9,] 8 62298 8<br>
> > # [10,] 3 23283 3<br>
> ><br>
> ><br>
> > If you want to try it out yourself: Just call<br>
> ><br>
> ><br>
> > fixInNamespace("[.data.table", pos="package:data.table")<br>
> ><br>
> ><br>
> > and change the following lines in the editor (this applies to<br>
> > data.table 1.7.7):<br>
> ><br>
> > OLD LINE: if (!mult %in% c("first", "last", "all")) stop("mult argument can only be 'first','last' or 'all'")<br>
> > NEW LINE: if (!mult %in% c("first","last","all", "random")) stop("mult argument can only be 'first','last', 'all', or 'random'")<br>
> ><br>
> ><br>
> > and<br>
> ><br>
> ><br>
> > OLD LINES: else {<br>
> > irows = if (mult == "first")<br>
> > idx.start<br>
> > else idx.end<br>
> > lengths = rep(1L, length(irows))<br>
> > }<br>
> ><br>
> ><br>
> > NEW LINES: } else if (mult=="first") {<br>
> > irows = idx.start<br>
> > lengths=rep(1L,length(irows))<br>
> > } else if (mult=="last") {<br>
> > irows = idx.end<br>
> > lengths=rep(1L,length(irows))<br>
> > } else {<br>
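> > # pick uniformly between idx.start and idx.end; sample.int avoids the<br>
> > # sample(x1:x2) pitfall of sampling from 1:x1 when x1 == x2<br>
> > irows = mapply(function(x1, x2) x1 + sample.int(x2 - x1 + 1L, size=1L) - 1L, idx.start, idx.end)<br>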
> > lengths = rep(1L,length(irows))<br>
> > }<br>
> ><br>
> > However, I don't know what's going on in the line<br>
> > .Call("binarysearch", i, x, as.integer(leftcols -<br>
> > 1), as.integer(rightcols - 1), haskey(i), roll,<br>
> > rolltolast, idx.start, idx.end, PACKAGE =<br>
> > "data.table")<br>
> ><br>
> ><br>
> > I figured out that idx.start and idx.end are changed by this<br>
> > function call, and I guess that at this point in the function<br>
> > idx.start and idx.end should always be of the same length and<br>
> > contain only integer values that represent rows of x, but here I'm<br>
> > not 100% sure. So maybe additional checks are needed in the else<br>
> > clause where the mapply function is called.<br>
> ><br>
> ><br>
> > So let me know what you think. I will join the project independent<br>
> > of that particular issue and try to help, but I guess I should start<br>
> > with simple things. So if any help is needed with documentation<br>
> > checking or stuff like that, just let me know and I'll try my best!<br>
> ><br>
> > Christoph<br>
> ><br>
> > On Fri, Jan 6, 2012 at 1:52 PM, Chris Neff <<a href="mailto:caneff@gmail.com">caneff@gmail.com</a>> wrote:<br>
> > That isn't doing quite what he does. I don't know what you expected<br>
> ><br>
> > sample(dt, size=1)<br>
> ><br>
> > to do but it seems to essentially do this:<br>
> ><br>
> > dt[sample(1:ncol(dt),size=1),]<br>
> ><br>
> > It picks a random column number and then returns that row instead.<br>
> > Try it for yourself:<br>
> ><br>
> > dt=data.table(x=1:10,y=1:10,z=1:10)<br>
> > sample(dt, size=1)<br>
> ><br>
> > The only rows you will get are 1,1,1 2,2,2 and 3,3,3. The usual caveat<br>
> > is that I'm on 1.7.1 until my crashing bug is fixed, so apologies if this<br>
> > works properly in later versions.<br>
> ><br>
> > Note that this diverges from what sample(df, size=1) does, which<br>
> > picks a random column and returns that whole column.<br>
> ><br>
> > What he really wants is to pick a random row from each subset (I<br>
> > think). None of your examples do that and I can't think of a simpler<br>
> > way than what he suggests.<br>
> ><br>
> > On 6 January 2012 03:34, Matthew Dowle<br>
> > <<a href="mailto:mdowle@mdowle.plus.com">mdowle@mdowle.plus.com</a>> wrote:<br>
> > > Very keen for direct contributions in that way, happy to help you with<br>
> > > svn etc, and you joining the project.<br>
> > ><br>
> > > In this particular example, how about :<br>
> > ><br>
> > > rawData[sample(rawData[J("eu"), which=TRUE],size=1)]<br>
> > ><br>
> > > This solves the inefficiency of the 1st step; i.e.,<br>
> > > intDT <- rawData[J("eu"), mult="all"]<br>
> > > which copies a subset of all the columns, whilst retaining flexibility<br>
> > > for the user, so the user can easily sample 2 rows or use any other R<br>
> > > method to select a random subset.<br>
> > ><br>
> > > Because of potential scoping conflicts (say a column was called<br>
> > > "rawData" i.e. the same name as the table), to be more robust :<br>
> > ><br>
> > > x = sample(rawData[J("eu"), which=TRUE],size=1)<br>
> > > rawData[x]<br>
> > ><br>
> > > This is slightly different because when i is a single name (x in this<br>
> > > case), data.table knows the caller must mean the x in calling scope, not<br>
> > > the column called "x" (if any). Is two steps like this ok? I'm<br>
> > > guessing it was really the inefficiency that was the motivation?<br>
> > ><br>
> > > Matthew<br>
> > ><br>
> > > On Fri, 2012-01-06 at 00:20 +0100, Christoph Jäckel wrote:<br>
> > >> Hi together,<br>
> > >><br>
> > >><br>
> > >> I run a Monte Carlo simulation on a data.table and do that currently<br>
> > >> with a loop: on every run, I choose a subset of rows subject to<br>
> > >> certain criteria and from those rows I take a random element.<br>
> > >> Currently, I do the following: Let's say I have funds from two regions<br>
> > >> ("eu" and "us") and I want to choose a random fund from "eu" (could be<br>
> > >> "us" in the next run and a different region in the third):<br>
> > >><br>
> > >><br>
> > >> library(data.table)<br>
> > >> rawData <- data.table(fundID = letters,<br>
> > >> compGeo = rep(c("us", "eu"), each=13))<br>
> > >> setkey(rawData, "compGeo")<br>
> > >> intDT <- rawData[J("eu"), mult="all"]<br>
> > >> intDT[sample.int(nrow(intDT), size=1)]<br>
> > >><br>
> > >><br>
> > >> So my idea is to just give the user the option mult="random", which<br>
> > >> does this in one step. What do you think about that feature request?<br>
> > >><br>
> > >><br>
> > >> With respect to the implementation: I changed a few lines in the<br>
> > >> function '[.data.table' and got this to run on my local data.table<br>
> > >> version, so I guess I could implement it (as far as I can see, one<br>
> > >> just needs to change some R code). However, I haven't done extensive<br>
> > >> testing and I'm not an expert on shared projects and subversion (never<br>
> > >> did that actually), so I guess I would need some help to start with<br>
> > >> and confirmation that I can't break anything ;-)<br>
> > >><br>
> > >><br>
> > >> Christoph<br>
> > >><br>
> > >><br>
> > >><br>
> > >><br>
> > >> _______________________________________________<br>
> > >> datatable-help mailing list<br>
> > >> <a href="mailto:datatable-help@lists.r-forge.r-project.org">datatable-help@lists.r-forge.r-project.org</a><br>
> > >><br>
> > <a href="https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help" target="_blank">https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help</a><br>
> > ><br>
> > ><br>
> ><br>
> ><br>
> ><br>
> ><br>
><br>
<br>
<br>
</div></div></blockquote></div><br><br clear="all"><div><br></div><br>
</div>