[datatable-help] What's your opinion on the feature request: add option mult="random"
Matthew Dowle
mdowle at mdowle.plus.com
Sun Jan 8 20:25:54 CET 2012
How about allowing mult to be integer (or an expression that evaluates
to integer) :
DT[X, mult="first"]
DT[X, mult=1L] # same
DT[X, mult="last"]
DT[X, mult=.N] # same
DT[X, .SD[2]] # 2nd row of each group (inefficient due to .SD; there
              # are other, longer alternatives)
DT[X, mult=2L] # same, but efficient and simple
DT[X, mult="random"]
DT[X, mult=sample(.N,size=1)] # same, but more general
DT[X, mult=-1L] # all but the first of each group
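To make the intended semantics concrete, here is a rough base R sketch. It is not data.table code and says nothing about how this would be done internally. It assumes each group is described by its first and last row number (the role idx.start and idx.end play inside '[.data.table'). resolve_mult and the selector functions are hypothetical names, used only for illustration; n stands in for .N.

```r
# Hypothetical helper, not part of data.table: turn a per-group selector
# into absolute row numbers, given each group's first and last row.
resolve_mult <- function(idx.start, idx.end, selector) {
  unlist(Map(function(s, e) {
    n    <- e - s + 1L      # group size, the role .N plays in the proposal
    rows <- seq.int(s, e)   # absolute row numbers of this group
    rows[selector(n)]       # positive and negative indices behave as in `[`
  }, idx.start, idx.end))
}

# Three groups: rows 1-3, rows 4-5, and the single row 6.
idx.start <- c(1L, 4L, 6L)
idx.end   <- c(3L, 5L, 6L)

first_rows  <- resolve_mult(idx.start, idx.end, function(n) 1L)   # mult=1L
last_rows   <- resolve_mult(idx.start, idx.end, function(n) n)    # mult=.N
rest_rows   <- resolve_mult(idx.start, idx.end, function(n) -1L)  # mult=-1L
random_rows <- resolve_mult(idx.start, idx.end,
                            function(n) sample.int(n, 1L))        # mult="random"
```

A selector that points past the end of a group (2L for a one-row group, say) yields NA in this sketch. Whether mult=2L should return NA or drop that group is a question the sketch does not settle.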
Matthew
On Sat, 2012-01-07 at 19:07 -0800, Steven C. Bagley wrote:
> The mult argument is becoming its own little programming language. I
> worry that this is going to get complicated in an ad hoc way. What if
> someone wants random, but with weighting? Each new value of mult is
> really shorthand for an R language construct. Maybe there is a more
> general way to express these ideas using existing R constructs? (I'm
> not sure how to do this consistently. I'm merely making an
> observation.)
>
>
> --Steve
>
>
> On Jan 6, 2012, at 5:58 AM, Christoph Jäckel wrote:
>
> > Thanks for your feedback. @Chris: I guess Matthew's example and
> > > yours do not really match because he doesn't call sample(dt,...),
> > but sample(dt[i, which=TRUE],... His option, though, returns all the
> > rows that match between the keys of dt and i and takes a random
> > sample of size 1 from that, so I guess it does what I expected.
> > Nevertheless, I think an option mult="random" would still be useful.
> > Here is why:
> >
> >
> > I guess my first example was a little bit too simplistic, sorry for
> > that! Here is an updated, more realistic example of what I do and
> > some hints about my current implementation of mult="random":
> >
> >
> > require(data.table)
> > > rawData <- data.table(fundID = 1:1e5,
> > >                       Year = rep(1:10, times=1e4),
> > >                       key = "Year")
> > > # Let's have 10000 runs; in each run we want to draw a fund with a
> > > # year that is set dynamically
> > intJoin <- J(sample(1:10, size=10000, replace=TRUE))
> >
> >
> > > # Best solution I have come up with so far using the current options
> > > # in data.table.
> > > # Is there one that can beat mult="random" and is easy for the user
> > > # to implement?
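> > > foo1 <- function(n, intJoin, rawData) {
> > >   x <- integer(n)  # one sampled row number per run
> > >   for (r in seq_len(nrow(intJoin))) {
> > >     # rows of rawData matching the r-th join key, then one at random
> > >     x[r] <- sample(rawData[intJoin[r], which=TRUE], size=1)
> > >   }
> > >   return(rawData[x])
> > > }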
> > system.time(finalData <- foo1(10000, intJoin, rawData))
> > # user system elapsed
> > # 43.827 0.000 44.232
> > > # Check that it does what it should: match random entities to the
> > > # exact year in intJoin
> > cbind(finalData, intJoin)
> > # fundID Year V1
> > # [1,] 46556 6 6
> > # [2,] 77642 2 2
> > # [3,] 17325 5 5
> > # [4,] 36617 7 7
> > # [5,] 90697 7 7
> > # [6,] 4536 6 6
> > # [7,] 22273 3 3
> > # [8,] 46825 5 5
> > # [9,] 65788 8 8
> > # [10,] 14153 3 3
> >
> >
> > #My implementation of mult="random"
> > system.time(finalData <- rawData[intJoin, mult="random"])
> > # user system elapsed
> > # 0.324 0.016 0.337
> > #Pretty fast and easy to understand
> > > # Check that it does what it should: match random entities to the
> > > # exact year in intJoin
> > cbind(finalData, intJoin)
> > # Year fundID V1
> > # [1,] 6 39626 6
> > # [2,] 2 98552 2
> > # [3,] 5 85425 5
> > # [4,] 7 24637 7
> > # [5,] 7 74797 7
> > # [6,] 6 87626 6
> > # [7,] 3 88973 3
> > # [8,] 5 60335 5
> > # [9,] 8 62298 8
> > # [10,] 3 23283 3
> >
> >
> > If you want to try it out yourself: Just call
> >
> >
> > fixInNamespace("[.data.table", pos="package:data.table")
> >
> >
> > and change the following lines in the editor (this applies to
> > data.table 1.7.7):
> >
> > > OLD LINE: if (!mult %in% c("first", "last", "all"))
> > >     stop("mult argument can only be 'first','last' or 'all'")
> > > NEW LINE: if (!mult %in% c("first","last","all", "random"))
> > >     stop("mult argument can only be 'first','last', 'all', or 'random'")
> >
> >
> > and
> >
> >
> > > OLD LINES: else {
> > >     irows = if (mult == "first")
> > >         idx.start
> > >     else idx.end
> > >     lengths = rep(1L, length(irows))
> > > }
> >
> >
> > > NEW LINES: } else if (mult=="first") {
> > >     irows = idx.start
> > >     lengths = rep(1L, length(irows))
> > > } else if (mult=="last") {
> > >     irows = idx.end
> > >     lengths = rep(1L, length(irows))
> > > } else {
> > >     irows = mapply(function(x1, x2) {sample(x1:x2, size=1)},
> > >                    idx.start, idx.end)
> > >     lengths = rep(1L, length(irows))
> > > }
> >
> > > However, I don't know what's going on in the line
> > >     .Call("binarysearch", i, x, as.integer(leftcols - 1),
> > >           as.integer(rightcols - 1), haskey(i), roll, rolltolast,
> > >           idx.start, idx.end, PACKAGE = "data.table")
> >
> >
> > > I figured out that idx.start and idx.end are modified by this
> > > function call, and I guess that at this point in the function
> > > idx.start and idx.end should always have the same length and both
> > > contain only integer values that represent rows of x, but here I'm
> > > not 100% sure. So maybe additional checks are needed in the else
> > > clause where mapply is called.
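One check that does seem needed: sample(x1:x2, size=1) misbehaves when a group has exactly one row. If x1 == x2 == 4, the call becomes sample(4, size=1), which base R documents as drawing from 1:4, so the result can be a row outside the group. Below is a minimal sketch of a vectorised alternative that avoids sample() on the range altogether. random_rows is a hypothetical name, the inputs are assumed to look like idx.start and idx.end after the binary search, and unmatched join rows (NA or otherwise invalid indices) are not handled; the function simply stops on them.

```r
# Assumed inputs: first and last matching row of x for each row of i.
idx.start <- c(1L, 4L, 10L)
idx.end   <- c(3L, 4L, 12L)   # the second group is the single row 4

# Hypothetical replacement for the mapply line: one uniform draw per group.
random_rows <- function(idx.start, idx.end) {
  stopifnot(is.integer(idx.start), is.integer(idx.end),
            length(idx.start) == length(idx.end),
            !anyNA(idx.start), !anyNA(idx.end),
            all(idx.end >= idx.start))
  n <- idx.end - idx.start + 1L                         # rows in each group
  idx.start + as.integer(floor(runif(length(n)) * n))   # offset in 0..n-1
}

irows <- random_rows(idx.start, idx.end)
```

Since runif() never returns 1, the offset is always between 0 and n-1, so each result stays inside its own group, and a one-row group always returns that row.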
> >
> >
> > > So let me know what you think. I will join the project independent
> > > of that particular issue and try to help, but I guess I should start
> > > with simple things. So if there is any help needed on documentation
> > > checking or stuff like that, just let me know and I'll try my best!
> >
> > Christoph
> >
> > On Fri, Jan 6, 2012 at 1:52 PM, Chris Neff <caneff at gmail.com> wrote:
> > > That isn't doing quite what he does. I don't know what you expected
> >
> > sample(dt, size=1)
> >
> > to do but it seems to essentially do this:
> >
> > dt[sample(1:ncol(dt),size=1),]
> >
> > > It picks a random column number and then returns that row instead.
> > Try it for yourself:
> >
> > dt=data.table(x=1:10,y=1:10,z=1:10)
> > sample(dt, size=1)
> >
> > > The only rows you will get are 1,1,1 2,2,2 and 3,3,3. Caveat as
> > > usual: I'm on 1.7.1 until my crashing bug is fixed, so apologies if
> > > this works properly in later versions.
> >
> > > Note that this diverges from what sample(df, size=1) does, which is
> > > to pick a random column and return that whole column.
> >
> > > What he really wants is to pick a random row from each subset (I
> > > think). None of your examples do that and I can't think of a simpler
> > > way than what he suggests.
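For the data.frame half of that comparison, a small base R sketch may help. sample(x, size) indexes x with a random position out of length(x), and for a data.frame length() is the number of columns, so a column comes back. The column values here differ from the example above only so that the columns can be told apart; one_col and one_row are names chosen for this illustration.

```r
df <- data.frame(x = 1:10, y = 11:20, z = 21:30)

# sample() on a data.frame: one random column, all 10 rows.
one_col <- sample(df, size = 1)

# To get one random row instead, sample a row number explicitly.
one_row <- df[sample.int(nrow(df), size = 1), ]

dim(one_col)
dim(one_row)
```

This only covers base data.frames; what sample(dt, size=1) does on a data.table depends on how `[` treats a single number, as described above for 1.7.1.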
> >
> > On 6 January 2012 03:34, Matthew Dowle
> > <mdowle at mdowle.plus.com> wrote:
> > > > Very keen for direct contributions in that way, happy to help you
> > > > with svn etc, and you joining the project.
> > >
> > > In this particular example, how about :
> > >
> > > rawData[sample(rawData[J("eu"), which=TRUE],size=1)]
> > >
> > > This solves the inefficiency of the 1st step; i.e.,
> > > intDT <- rawData[J("eu"), mult="all"]
> > > > which copies a subset of all the columns, whilst retaining
> > > > flexibility for the user, so the user can easily sample 2 rows, or
> > > > use any other R method to select a random subset.
> > >
> > > > Because of potential scoping conflicts (say a column was called
> > > > "rawData", i.e. the same name as the table), to be more robust :
> > >
> > > x = sample(rawData[J("eu"), which=TRUE],size=1)
> > > rawData[x]
> > >
> > > > This is slightly different because when i is a single name (x in
> > > > this case), data.table knows the caller must mean the x in calling
> > > > scope, not the column called "x" (if any). Is two steps like this
> > > > ok? I'm guessing it was really the inefficiency that was the
> > > > motivation?
> > >
> > > Matthew
> > >
> > > On Fri, 2012-01-06 at 00:20 +0100, Christoph Jäckel wrote:
> > >> Hi together,
> > >>
> > >>
> > > >> I run a Monte Carlo simulation on a data.table and currently do
> > > >> that with a loop: on every run, I choose a subset of rows subject
> > > >> to certain criteria, and from those rows I take a random element.
> > > >> Currently, I do the following: Let's say I have funds from two
> > > >> regions ("eu" and "us") and I want to choose a random fund from
> > > >> "eu" (could be "us" in the next run and a different region in the
> > > >> third):
> > >>
> > >>
> > >> library(data.table)
> > > >> rawData <- data.table(fundID = letters,
> > > >>                       compGeo = rep(c("us", "eu"), each=13))
> > >> setkey(rawData, "compGeo")
> > >> intDT <- rawData[J("eu"), mult="all"]
> > >> intDT[sample.int(nrow(intDT), size=1)]
> > >>
> > >>
> > > >> So my idea is to just give the user the option mult="random",
> > > >> which does this in one step. What do you think about that feature
> > > >> request?
> > >>
> > >>
> > > >> With respect to the implementation: I changed a few lines in the
> > > >> function '[.data.table' and got this to run on my local data.table
> > > >> version, so I guess I could implement it (as far as I can see, one
> > > >> just needs to change some R code). However, I haven't done
> > > >> extensive testing and I'm not an expert on shared projects and
> > > >> subversion (never did that actually), so I guess I would need some
> > > >> help to start with and the confirmation I couldn't break anything ;-)
> > >>
> > >>
> > >> Christoph
> > >>
> > >>
> > >>
> > >>
> > >> _______________________________________________
> > >> datatable-help mailing list
> > >> datatable-help at lists.r-forge.r-project.org
> > >>
> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> > >
> > >
> > > _______________________________________________
> > > datatable-help mailing list
> > > datatable-help at lists.r-forge.r-project.org
> > >
> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> >
> >
> >
> >
> > _______________________________________________
> > datatable-help mailing list
> > datatable-help at lists.r-forge.r-project.org
> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help