[datatable-help] What's your opinion on the feature request: add option mult="random"

Steven C. Bagley steven.bagley at gmail.com
Sun Jan 8 04:07:53 CET 2012


The mult argument is becoming its own little programming language. I worry that this is going to get complicated in an ad hoc way. What if someone wants random, but with weighting? Each new value of mult is really shorthand for an R language construct. Maybe there is a more general way to express these ideas using existing R constructs? (I'm not sure how to do this consistently. I'm merely making an observation.)

--Steve
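
One way the "existing R constructs" idea could look, as a minimal sketch rather than a proposal for the data.table API: compute the row indices with ordinary R code and pass them to the table afterwards. The helper name pick_one_per_group, the funds data frame, and the size column used as a weight are all made up for illustration. A plain data.frame is used to keep the sketch self-contained; the resulting row indices could in principle be used to subset a data.table in the same way.

```r
# Hypothetical helper (not part of data.table): return one row index per
# group, drawn uniformly or, if `weights` is given, proportional to weight.
pick_one_per_group <- function(group, weights = NULL) {
  rows_by_group <- split(seq_along(group), group)
  vapply(rows_by_group, function(rows) {
    p <- if (is.null(weights)) NULL else weights[rows]
    # sample.int() on the group size avoids the sample(x) surprise
    # that occurs when x is a single number
    rows[sample.int(length(rows), size = 1L, prob = p)]
  }, integer(1))
}

set.seed(1)  # only to make the illustration reproducible
funds <- data.frame(fundID = 1:20,
                    Year   = rep(1:4, times = 5),
                    size   = runif(20))   # made-up weights
rows   <- pick_one_per_group(funds$Year, weights = funds$size)
picked <- funds[rows, ]   # one weighted-random fund per Year
```

Because the weighting is just the prob argument of sample.int(), no new value of mult is needed for it; the trade-off is that the user writes a little more code.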

On Jan 6, 2012, at 5:58 AM, Christoph Jäckel wrote:

> Thanks for your feedback. @Chris: I guess Matthew's example and yours do not really match because he doesn't call sample(dt, ...), but sample(dt[i, which=TRUE], ...). His version, though, takes all the rows that match between the keys of dt and i and draws a random sample of size 1 from them, so I guess it does what I expected. Nevertheless, I think an option mult="random" would still be useful. Here is why:
> 
> I guess my first example was a little bit too simplistic, sorry for that! Here is an updated, more realistic example of what I do and some hints about my current implementation of mult="random":
> 
> require(data.table)
> rawData <- data.table(fundID = 1:1e5,
>                       Year   = rep(1:10, times=1e4),
>                       key    = "Year")
> #Let's have 10000 runs; in each run we want to draw a fund with a year that is 
> #set dynamically
> intJoin <- J(sample(1:10, size=10000, replace=TRUE))
> 
> #Best solution I have come up with so far using the current options in data.table
> #Is there one that can beat mult="random" and is easy for the user to implement?
> foo1 <- function(n, intJoin, rawData) {
>     x <- integer(n)
>     for (r in seq_len(nrow(intJoin))) {
>       x[r] <- sample(rawData[intJoin[r], which=TRUE], size=1)
>     }
>     return(rawData[x])
> }
> system.time(finalData <- foo1(10000, intJoin, rawData))
> #    user  system elapsed 
> #  43.827   0.000  44.232
> #Check that it does what it should: match random entities to the exact year in intJoin
> cbind(finalData, intJoin)
> #       fundID Year V1
> #  [1,]  46556    6  6
> #  [2,]  77642    2  2
> #  [3,]  17325    5  5
> #  [4,]  36617    7  7
> #  [5,]  90697    7  7
> #  [6,]   4536    6  6
> #  [7,]  22273    3  3
> #  [8,]  46825    5  5
> #  [9,]  65788    8  8
> # [10,]  14153    3  3
> 
> #My implementation of mult="random"
> system.time(finalData <- rawData[intJoin, mult="random"])  
> #   user  system elapsed 
> #  0.324   0.016   0.337
> #Pretty fast and easy to understand
> #Check that it does what it should: match random entities to the exact year in intJoin
> cbind(finalData, intJoin)
> #       Year fundID V1
> #  [1,]    6  39626  6
> #  [2,]    2  98552  2
> #  [3,]    5  85425  5
> #  [4,]    7  24637  7
> #  [5,]    7  74797  7
> #  [6,]    6  87626  6
> #  [7,]    3  88973  3
> #  [8,]    5  60335  5
> #  [9,]    8  62298  8
> # [10,]    3  23283  3
> 
> If you want to try it out yourself: Just call
> 
> fixInNamespace("[.data.table", pos="package:data.table")
> 
> and change the following lines in the editor (this applies to data.table 1.7.7):
>  
> OLD LINE:     if (!mult %in% c("first", "last", "all")) stop("mult argument can only be 'first','last' or 'all'")  
> NEW LINE:     if (!mult %in% c("first","last","all", "random")) stop("mult argument can only be 'first','last', 'all', or 'random'")
> 
> and 
> 
> OLD LINES: else {
>                 irows = if (mult == "first") 
>                   idx.start
>                 else idx.end
>                 lengths = rep(1L, length(irows))
>             }
> 
> NEW LINES:  } else if (mult=="first") { 
>               irows = idx.start
>               lengths=rep(1L,length(irows))
>             } else if (mult=="last") {
>               irows = idx.end
>               lengths=rep(1L,length(irows))
>             } else {
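>               # sample(x1:x2, size=1) would draw from 1:x1 whenever x1 == x2, so offset a sample.int() draw instead
>               irows = mapply(function(x1, x2) {x1 + sample.int(x2 - x1 + 1L, size=1L) - 1L}, idx.start, idx.end)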
>               lengths = rep(1L,length(irows))
>             }
>               
> However, I don't know what's going on in the line
> .Call("binarysearch", i, x, as.integer(leftcols - 
>                 1), as.integer(rightcols - 1), haskey(i), roll, 
>                 rolltolast, idx.start, idx.end, PACKAGE = "data.table")
> 
> I figured out that this function call modifies idx.start and idx.end, and I guess that at this point in the function idx.start and idx.end should always have the same length and contain only integer values that represent rows of x, but I'm not 100% sure. So additional checks may be needed in the else clause where mapply is called.
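
A minimal sketch of how the random pick between idx.start and idx.end could be done without mapply, under the assumption stated above that both vectors have the same length and hold valid row numbers with idx.start <= idx.end. The example vectors are invented stand-ins, not output of the binarysearch call. Index arithmetic is used because, as documented in ?sample, sample(a:b, size=1) draws from 1:a when a == b.

```r
# Made-up stand-ins for the vectors filled in by the binary search
idx.start <- c(1L, 5L, 9L)
idx.end   <- c(4L, 5L, 12L)

# The checks the paragraph above asks about: equal length, non-empty ranges
stopifnot(length(idx.start) == length(idx.end))
n_match <- idx.end - idx.start + 1L
stopifnot(all(n_match >= 1L))

set.seed(42)  # only to make the illustration reproducible
# floor(u * n) with u in (0, 1) is uniform on 0 .. n-1, so each result
# lies between idx.start and idx.end inclusive
irows <- idx.start + as.integer(floor(runif(length(n_match)) * n_match))
```

This draws all random numbers in one vectorised call, which may matter when i has many rows; how unmatched rows of i are represented in idx.start and idx.end is not covered here.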
> 
> So let me know what you think. I will join the project regardless of that particular issue and try to help, but I guess I should start with simple things. So if any help is needed with checking the documentation or things like that, just let me know and I will try my best!
>   
> Christoph
> 
> On Fri, Jan 6, 2012 at 1:52 PM, Chris Neff <caneff at gmail.com> wrote:
> That isn't doing quite what he does.  I don't know what you expected
> 
> sample(dt, size=1)
> 
> to do but it seems to essentially do this:
> 
> dt[sample(1:ncol(dt),size=1),]
> 
> It picks a random column number and then returns the row with that number instead.
> Try it for yourself:
> 
> dt=data.table(x=1:10,y=1:10,z=1:10)
> sample(dt, size=1)
> 
> The only rows you will get are 1,1,1 2,2,2 and 3,3,3.  Caveat as usual
> is I'm on 1.7.1 until my crashing bug is fixed so apologies if this
> works properly in later versions.
> 
> Note that this diverges from what sample(df, size=1) does, which is
> to pick a random column and return that whole column.
> 
> What he really wants is to pick a random row from each subset (I
> think). None of your examples do that and I can't think of a simpler
> way than what he suggests.
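
The data.frame behaviour described above can be checked in base R. This is a small illustration with an invented data frame; it only covers data.frame, not how any particular data.table version responds to sample().

```r
df <- data.frame(x = 1:10, y = 11:20, z = 21:30)

set.seed(1)  # only to make the illustration reproducible
# sample() treats a data.frame as a list of columns, so this returns a
# data.frame holding one randomly chosen column with all 10 rows
one_col <- sample(df, size = 1)

# To get one random row instead, sample a row number explicitly
one_row <- df[sample.int(nrow(df), size = 1), ]
```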
> 
> On 6 January 2012 03:34, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
> > Very keen for direct contributions in that way, happy to help you with
> > svn etc, and you joining the project.
> >
> > In this particular example, how about :
> >
> >    rawData[sample(rawData[J("eu"), which=TRUE],size=1)]
> >
> > This solves the inefficiency of the 1st step; i.e.,
> >    intDT <- rawData[J("eu"), mult="all"]
> > which copies a subset of all the columns, whilst retaining flexibility
> > for the user, so the user can easily sample 2 rows, or use any other R
> > method to select a random subset.
> >
> > Because of potential scoping conflicts (say a column was called
> > "rawData" i.e. the same name as the table), to be more robust :
> >
> > x = sample(rawData[J("eu"), which=TRUE],size=1)
> > rawData[x]
> >
> > This is slightly different because when i is a single name (x in this
> > case), data.table knows the caller must mean the x in calling scope, not
> > the column called "x" (if any).  Is two steps like this ok?  I'm
> > guessing it was really the inefficiency that was the motivation?
> >
> > Matthew
> >
> > On Fri, 2012-01-06 at 00:20 +0100, Christoph Jäckel wrote:
> >> Hi all,
> >>
> >>
> >> I run a Monte Carlo simulation on a data.table and do that currently
> >> with a loop: on every run, I choose a subset of rows subject to
> >> certain criteria and from those rows I take a random element.
> >> Currently, I do the following: Let's say I have funds from two regions
> >> ("eu" and "us") and I want to choose a random fund from "eu" (could be
> >> "us" in the next run and a different region in the third):
> >>
> >>
> >> library(data.table)
> >> rawData <- data.table(fundID  = letters,
> >>                       compGeo = rep(c("us", "eu"), each=13))
> >> setkey(rawData, "compGeo")
> >> intDT <- rawData[J("eu"), mult="all"]
> >> intDT[sample.int(nrow(intDT), size=1)]
> >>
> >>
> >> So my idea is to just give the user the option mult="random", which
> >> does this in one step. What do you think about that feature request?
> >>
> >>
> >> With respect to the implementation: I changed a few lines in the
> >> function '[.data.table' and got this to run on my local data.table
> >> version, so I guess I could implement it (as far as I can see, one
> >> just needs to change some R code). However, I haven't done extensive
> >> testing and I'm not an expert on shared projects and subversion (never
> >> did that actually), so I guess I would need some help to get started
> >> and confirmation that I haven't broken anything ;-)
> >>
> >>
> >> Christoph
> >>
> >>
> >>
> >>
> >> _______________________________________________
> >> datatable-help mailing list
> >> datatable-help at lists.r-forge.r-project.org
> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
