[datatable-help] What's your opinion on the feature request: add option mult="random"

Sun Jan 8 20:25:54 CET 2012

How about allowing mult to be an integer (or an expression that
evaluates to an integer):

DT[X, mult="first"]
DT[X, mult=1L]   # same

DT[X, mult="last"]
DT[X, mult=.N]   # same

DT[X, .SD[2]]  # 2nd row of each group (inefficient due to .SD; there are other, longer alternatives)
DT[X, mult=2L] # same, but efficient and simple

DT[X, mult="random"]
DT[X, mult=sample(.N,size=1)]  # same, but more general

DT[X, mult=-1L]   # all but the first of each group
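
For comparison, the selections above can already be approximated with
grouped row indices. This is only a sketch, not the proposed mult=
syntax: the table DT, its columns grp, val and w, and the result names
are made up for illustration, and it groups with by= rather than
joining to X.

```r
library(data.table)

# Illustrative data (hypothetical): three groups of sizes 3, 2 and 4
DT <- data.table(grp = rep(c("a", "b", "c"), times = c(3L, 2L, 4L)),
                 val = 1:9,
                 w   = c(1, 1, 2, 1, 3, 1, 1, 1, 5),
                 key = "grp")

# .I holds the row numbers of DT within each group; .N is the group size
first_rows  <- DT[, .I[1L],  by = grp]$V1   # like mult="first"
last_rows   <- DT[, .I[.N],  by = grp]$V1   # like mult="last"
second_rows <- DT[, .I[2L],  by = grp]$V1   # 2nd row of each group
rest_rows   <- DT[, .I[-1L], by = grp]$V1   # all but the first of each group

# One random row per group; sample.int(.N, 1L) is safe even when .N == 1,
# whereas sample(x, 1) on a length-one x would sample from 1:x
random_rows <- DT[, .I[sample.int(.N, 1L)], by = grp]$V1

# Weighted variant: rows with a larger w are drawn more often
weighted_rows <- DT[, .I[sample.int(.N, 1L, prob = w)], by = grp]$V1

DT[random_rows]   # the sampled rows themselves
```

Weighted sampling, as raised in the quoted message below, then only
needs the prob= argument of sample.int().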

Matthew

On Sat, 2012-01-07 at 19:07 -0800, Steven C. Bagley wrote:
> The mult argument is becoming its own little programming language. I
> worry that this is going to get complicated in an ad hoc way. What if
> someone wants random, but with weighting? Each new value of mult is
> really shorthand for an R language construct. Maybe there is a more
> general way to express these ideas using existing R constructs? (I'm
> not sure how to do this consistently. I'm merely making an
> observation.)
> 
> 
> --Steve
> 
> 
> On Jan 6, 2012, at 5:58 AM, Christoph Jäckel wrote:
> 
> > Thanks for your feedback. @Chris: I guess Matthew's example and
> > yours do not really match because he doesn't call sample(dt, ...),
> > but sample(dt[i, which=TRUE], ...). His option, though, returns all the
> > rows that match between the keys of dt and i and takes a random
> > sample of size 1 from that, so I guess it does what I expected.
> > Nevertheless, I think an option mult="random" would still be useful.
> > Here is why:
> > 
> > 
> > I guess my first example was a little bit too simplistic, sorry for
> > that! Here is an updated, more realistic example of what I do and
> > some hints about my current implementation of mult="random":
> > 
> > 
> > require(data.table)
> > rawData <- data.table(fundID = 1:1e5,
> >                       Year   = rep(1:10, times=1e4),
> >                       key    = "Year")
> > #Let's have 10000 runs; in each run we want to draw a fund with a
> > #year that is set dynamically
> > intJoin <- J(sample(1:10, size=10000, replace=TRUE))
> > 
> > 
> > #Best solution I have come up with so far using the current options
> > #in data.table.
> > #Is there one that can beat mult="random" and is easy for the user
> > #to implement?
> > foo1 <- function(n, intJoin, rawData) {
> >     x <- integer(n)
> >     for (r in seq_len(nrow(intJoin))) {
> >       x[r] <- sample(rawData[intJoin[r], which=TRUE], size=1)
> >     }
> >     return(rawData[x])
> > }
> > system.time(finalData <- foo1(10000, intJoin, rawData))
> > #    user  system elapsed 
> > #  43.827   0.000  44.232
> > #Check that it does what it should: match random entities to the
> > #exact year in intJoin
> > cbind(finalData, intJoin)
> > #       fundID Year V1
> > #  [1,]  46556    6  6
> > #  [2,]  77642    2  2
> > #  [3,]  17325    5  5
> > #  [4,]  36617    7  7
> > #  [5,]  90697    7  7
> > #  [6,]   4536    6  6
> > #  [7,]  22273    3  3
> > #  [8,]  46825    5  5
> > #  [9,]  65788    8  8
> > # [10,]  14153    3  3
> > 
> > 
> > #My implementation of mult="random"
> > system.time(finalData <- rawData[intJoin, mult="random"])  
> > #   user  system elapsed 
> > #  0.324   0.016   0.337
> > #Pretty fast and easy to understand
> > #Check that it does what it should: match random entities to the
> > #exact year in intJoin
> > cbind(finalData, intJoin)
> > #       Year fundID V1
> > #  [1,]    6  39626  6
> > #  [2,]    2  98552  2
> > #  [3,]    5  85425  5
> > #  [4,]    7  24637  7
> > #  [5,]    7  74797  7
> > #  [6,]    6  87626  6
> > #  [7,]    3  88973  3
> > #  [8,]    5  60335  5
> > #  [9,]    8  62298  8
> > # [10,]    3  23283  3
> > 
> > 
> > If you want to try it out yourself: Just call
> > 
> > 
> > fixInNamespace("[.data.table", pos="package:data.table")
> > 
> > 
> > and change the following lines in the editor (this applies to
> > data.table 1.7.7):
> >  
> > OLD LINE:     if (!mult %in% c("first", "last", "all")) stop("mult argument can only be 'first','last' or 'all'")
> > NEW LINE:     if (!mult %in% c("first","last","all", "random")) stop("mult argument can only be 'first','last', 'all', or 'random'")
> > 
> > 
> > and 
> > 
> > 
> > OLD LINES: else {
> >                 irows = if (mult == "first") 
> >                   idx.start
> >                 else idx.end
> >                 lengths = rep(1L, length(irows))
> >             }
> > 
> > 
> > NEW LINES:  } else if (mult=="first") { 
> >               irows = idx.start
> >               lengths=rep(1L,length(irows))
> >             } else if (mult=="last") {
> >               irows = idx.end
> >               lengths=rep(1L,length(irows))
> >             } else {
> >               # sample.int() avoids sample()'s 1:n behaviour when x1 == x2
> >               irows = mapply(function(x1, x2) x1 + sample.int(x2 - x1 + 1L, size=1L) - 1L,
> >                              idx.start, idx.end)
> >               lengths = rep(1L,length(irows))
> >             }
> >               
> > However, I don't know what's going on in the line:
> > .Call("binarysearch", i, x, as.integer(leftcols - 1),
> >       as.integer(rightcols - 1), haskey(i), roll, rolltolast,
> >       idx.start, idx.end, PACKAGE = "data.table")
> > 
> > 
> > I figured out that idx.start and idx.end are changed by this
> > function call, and I guess that at this point in the function
> > idx.start and idx.end should always be of the same length and both
> > contain only integer values that represent rows of x, but here I'm
> > not 100% sure. So maybe additional checks are needed in the else
> > clause where the mapply function is called.
> > 
> > 
> > So let me know what you think. I will join the project regardless
> > of this particular issue and try to help, but I guess I should start
> > with simple things. So if any help is needed with checking the
> > documentation or things like that, just let me know and I'll try my
> > best!
> >   
> > Christoph
> > 
> > On Fri, Jan 6, 2012 at 1:52 PM, Chris Neff <caneff at gmail.com> wrote:
> >         That isn't doing quite what he does.  I don't know what you
> >         expected
> >         
> >         sample(dt, size=1)
> >         
> >         to do but it seems to essentially do this:
> >         
> >         dt[sample(1:ncol(dt),size=1),]
> >         
> >         It picks a random column number and then returns that row
> >         instead.
> >         Try it for yourself:
> >         
> >         dt=data.table(x=1:10,y=1:10,z=1:10)
> >         sample(dt, size=1)
> >         
> >         The only rows you will get are 1,1,1, 2,2,2 and 3,3,3.  Caveat
> >         as usual
> >         is I'm on 1.7.1 until my crashing bug is fixed so apologies
> >         if this
> >         works properly in later versions.
> >         
> >         Note that this diverges from what sample(df, size=1) does,
> >         which is to pick a random column and return that whole column.
> >         
> >         What he really wants is to pick a random row from each
> >         subset (I
> >         think). None of your examples do that and I can't think of a
> >         simpler
> >         way than what he suggests.
> >         
> >         On 6 January 2012 03:34, Matthew Dowle
> >         <mdowle at mdowle.plus.com> wrote:
> >         > Very keen for direct contributions in that way, happy to
> >         help you with
> >         > svn etc, and you joining the project.
> >         >
> >         > In this particular example, how about :
> >         >
> >         >    rawData[sample(rawData[J("eu"), which=TRUE],size=1)]
> >         >
> >         > This solves the inefficiency of the 1st step; i.e.,
> >         >    intDT <- rawData[J("eu"), mult="all"]
> >         > which copies a subset of all the columns, whilst retaining
> >         flexibility
> >         > for the user so user can easily sample 2 rows, or any
> >         other R method to
> >         > select a random subset.
> >         >
> >         > Because of potential scoping conflicts (say a column was
> >         called
> >         > "rawData" i.e. the same name of the table), to be more
> >         robust :
> >         >
> >         > x = sample(rawData[J("eu"), which=TRUE],size=1)
> >         > rawData[x]
> >         >
> >         > This is slightly different because when i is a single name
> >         (x in this
> >         > case), data.table knows the caller must mean the x in
> >         calling scope, not
> >         > the column called "x" (if any).  Is two steps like this
> >         ok?  I'm
> >         > guessing it was really the inefficiency that was the
> >         motivation?
> >         >
> >         > Matthew
> >         >
> >         > On Fri, 2012-01-06 at 00:20 +0100, Christoph Jäckel wrote:
> >         >> Hi all,
> >         >>
> >         >>
> >         >> I run a Monte Carlo simulation on a data.table and do
> >         that currently
> >         >> with a loop: on every run, I choose a subset of rows
> >         subject to
> >         >> certain criteria and from those rows I take a random
> >         element.
> >         >> Currently, I do the following: Let's say I have funds
> >         from two regions
> >         >> ("eu" and "us") and I want to choose a random fund from
> >         "eu" (could be
> >         >> "us" in the next run and a different region in the
> >         third):
> >         >>
> >         >>
> >         >> library(data.table)
> >         >> rawData <- data.table(fundID  = letters,
> >         >>                       compGeo = rep(c("us", "eu"),
> >         each=13))
> >         >> setkey(rawData, "compGeo")
> >         >> intDT <- rawData[J("eu"), mult="all"]
> >         >> intDT[sample.int(nrow(intDT), size=1)]
> >         >>
> >         >>
> >         >> So my idea is to just give the user the option
> >         mult="random", which
> >         >> does this in one step. What do you think about that
> >         feature request?
> >         >>
> >         >>
> >         >> With respect to the implementation: I changed a few lines
> >         in the
> >         >> function '[.data.table' and got this to run on my local
> >         data.table
> >         >> version, so I guess I could implement it (as far as I can
> >         see, one
> >         >> just needs to change some R code). However, I haven't
> >         done extensive
> >         >> testing and I'm not an expert on shared projects and
> >         subversion (never
> >         >> did that actually), so I guess I would need some help to
> >         start with
> >         >> and the confirmation I couldn't break anything ;-)
> >         >>
> >         >>
> >         >> Christoph
> >         >>
> >         >>
> >         >>
> >         >>
> >         >> _______________________________________________
> >         >> datatable-help mailing list
> >         >> datatable-help at lists.r-forge.r-project.org
> >         >>
> >         https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> >         >
> >         >
> >         
> > 
> > 
> > 
> 