[datatable-help] What's your opinion on the feature request: add option mult="random"
Christoph Jäckel
christoph.jaeckel at wi.tum.de
Fri Jan 6 14:58:29 CET 2012
Thanks for your feedback. @Chris: I guess Matthew's example and yours do
not really match because he doesn't call sample(dt, ...) but sample(dt[i,
which=TRUE], ...). His version takes all the rows of dt whose key matches i
and draws a random sample of size 1 from them, so I guess it does what I
expected. Nevertheless, I think an option mult="random" would still be
useful. Here is why:
I guess my first example was a little bit too simplistic, sorry for that!
Here is an updated, more realistic example of what I do and some hints
about my current implementation of mult="random":
require(data.table)
rawData <- data.table(fundID = 1:1e5,
                      Year = rep(1:10, times=1e4),
                      key = "Year")
#Let's have 10000 runs; in each run we want to draw a fund with a year that is
#set dynamically
intJoin <- J(sample(1:10, size=10000, replace=TRUE))
#Best solution I have come up with so far with the current options in data.table
#Is there one that can beat mult="random" and is easy for the user to implement?
foo1 <- function(n, intJoin, rawData) {
  x <- integer(n)
  for (r in seq_len(nrow(intJoin))) {
    x[r] <- sample(rawData[intJoin[r], which=TRUE], size=1)
  }
  return(rawData[x])
}
system.time(finalData <- foo1(10000, intJoin, rawData))
# user system elapsed
# 43.827 0.000 44.232
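Most of the time in the loop above goes into running one join per run. A minimal base-R sketch of a vectorized alternative follows, assuming the table is sorted by Year so that each year occupies one contiguous block of rows. The names year, draws, first, last, block_len and picked are made up for illustration and plain vectors stand in for the keyed column and for intJoin; this is not data.table's own code.

```r
set.seed(1)

# Stand-in for the keyed Year column: sorted, 10 years, 1e4 rows per year.
year <- sort(rep(1:10, times = 1e4))
# Stand-in for intJoin: the year requested in each of the 10000 runs.
draws <- sample(1:10, size = 10000, replace = TRUE)

# First and last row of each year's block, computed once.
first <- match(1:10, year)
last <- length(year) - match(1:10, rev(year)) + 1L
block_len <- last - first + 1L

# One uniform offset per run, scaled to the size of the requested block.
picked <- first[draws] + floor(runif(length(draws)) * block_len[draws])

# Every picked row must belong to the requested year.
stopifnot(all(year[picked] == draws))
```

The vector picked plays the role of x in foo1, so the selected rows would then be fetched with a single subset instead of one join per run.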
#Check that it does what it should: match random entities to the exact year in intJoin
cbind(finalData, intJoin)
# fundID Year V1
# [1,] 46556 6 6
# [2,] 77642 2 2
# [3,] 17325 5 5
# [4,] 36617 7 7
# [5,] 90697 7 7
# [6,] 4536 6 6
# [7,] 22273 3 3
# [8,] 46825 5 5
# [9,] 65788 8 8
# [10,] 14153 3 3
#My implementation of mult="random"
system.time(finalData <- rawData[intJoin, mult="random"])
# user system elapsed
# 0.324 0.016 0.337
#Pretty fast and easy to understand
#Check that it does what it should: match random entities to the exact year in intJoin
cbind(finalData, intJoin)
# Year fundID V1
# [1,] 6 39626 6
# [2,] 2 98552 2
# [3,] 5 85425 5
# [4,] 7 24637 7
# [5,] 7 74797 7
# [6,] 6 87626 6
# [7,] 3 88973 3
# [8,] 5 60335 5
# [9,] 8 62298 8
# [10,] 3 23283 3
If you want to try it out yourself: Just call
fixInNamespace("[.data.table", pos="package:data.table")
and change the following lines in the editor (this applies to data.table
1.7.7):
OLD LINE: if (!mult %in% c("first", "last", "all")) stop("mult argument can only be 'first','last' or 'all'")
NEW LINE: if (!mult %in% c("first","last","all", "random")) stop("mult argument can only be 'first','last', 'all', or 'random'")
and
OLD LINES:
    else {
        irows = if (mult == "first")
            idx.start
        else idx.end
        lengths = rep(1L, length(irows))
    }
NEW LINES:
    } else if (mult=="first") {
        irows = idx.start
        lengths=rep(1L,length(irows))
    } else if (mult=="last") {
        irows = idx.end
        lengths=rep(1L,length(irows))
    } else {
        irows = mapply(function(x1, x2) {sample(x1:x2, size=1)}, idx.start, idx.end)
        lengths = rep(1L,length(irows))
    }
However, I don't know what's going on in the call
.Call("binarysearch", i, x, as.integer(leftcols - 1),
      as.integer(rightcols - 1), haskey(i), roll, rolltolast,
      idx.start, idx.end, PACKAGE = "data.table")
I figured out that this call modifies idx.start and idx.end, and I guess
that at this point in the function they always have the same length and
contain only integer values that index rows of x, but I'm not 100% sure
about that. So additional checks may be needed in the else clause where
mapply is called.
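One check that seems worth having concerns sample() itself: when x1 equals x2, x1:x2 is a single number, and sample() then draws from 1:x1 rather than returning x1. The sketch below shows one way to draw a row between two index vectors without that pitfall. The helper name sample_in_range is hypothetical, the example vectors are invented, and how data.table 1.7.7 fills idx.start and idx.end for keys without a match is not known here, so the helper simply rejects anything that is not a valid range.

```r
# Hypothetical helper: one uniform draw from each closed range [start, end].
# It avoids sample(x1:x2, size = 1), which draws from 1:x1 when x1 == x2.
sample_in_range <- function(start, end) {
  # Assumed preconditions: equal lengths, positive row numbers, end >= start.
  stopifnot(length(start) == length(end),
            all(start >= 1L),
            all(end >= start))
  as.integer(start + floor(runif(length(start)) * (end - start + 1L)))
}

# Invented example: the second range holds a single row (5 to 5).
idx.start <- c(1L, 5L, 7L)
idx.end <- c(4L, 5L, 9L)
irows <- sample_in_range(idx.start, idx.end)

# Each draw stays inside its own range, including the single-row one.
stopifnot(all(irows >= idx.start & irows <= idx.end))
```

Because the helper is vectorized, it would also replace the mapply call with a single pass over both vectors.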
So let me know what you think. I will join the project regardless of this
particular issue and try to help, but I guess I should start with simple
things. So if any help is needed with checking documentation or the like,
just let me know and I'll try my best!
Christoph
On Fri, Jan 6, 2012 at 1:52 PM, Chris Neff <caneff at gmail.com> wrote:
> That isn't doing quite what he does. I don't know what you expected
>
> sample(dt, size=1)
>
> to do but it seems to essentially do this:
>
> dt[sample(1:ncol(dt),size=1),]
>
> It picks a random column number and then returns that row instead.
> Try it for yourself:
>
> dt=data.table(x=1:10,y=1:10,z=1:10)
> sample(dt, size=1)
>
> The only rows you will get are 1,1,1 2,2,2 and 3,3,3. Caveat as usual
> is I'm on 1.7.1 until my crashing bug is fixed so apologies if this
> works properly in later versions.
>
> Note that this diverges from what sample(df, size=1) does, which
> picks a random column and returns that whole column.
>
> What he really wants is to pick a random row from each subset (I
> think). None of your examples do that and I can't think of a simpler
> way than what he suggests.
>
> On 6 January 2012 03:34, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
> > Very keen for direct contributions in that way, happy to help you with
> > svn etc, and you joining the project.
> >
> > In this particular example, how about :
> >
> > rawData[sample(rawData[J("eu"), which=TRUE],size=1)]
> >
> > This solves the inefficiency of the 1st step; i.e.,
> > intDT <- rawData[J("eu"), mult="all"]
> > which copies a subset of all the columns, whilst retaining flexibility
> > for the user so user can easily sample 2 rows, or any other R method to
> > select a random subset.
> >
> > Because of potential scoping conflicts (say a column was called
> > "rawData" i.e. the same name of the table), to be more robust :
> >
> > x = sample(rawData[J("eu"), which=TRUE],size=1)
> > rawData[x]
> >
> > This is slightly different because when i is a single name (x in this
> > case), data.table knows the caller must mean the x in calling scope, not
> > the column called "x" (if any). Is two steps like this ok? I'm
> > guessing it was really the inefficiency that was the motivation?
> >
> > Matthew
> >
> > On Fri, 2012-01-06 at 00:20 +0100, Christoph Jäckel wrote:
> >> Hi together,
> >>
> >>
> >> I run a Monte Carlo simulation on a data.table and do that currently
> >> with a loop: on every run, I choose a subset of rows subject to
> >> certain criteria and from those rows I take a random element.
> >> Currently, I do the following: Let's say I have funds from two regions
> >> ("eu" and "us") and I want to choose a random fund from "eu" (could be
> >> "us" in the next run and a different region in the third):
> >>
> >>
> >> library(data.table)
> >> rawData <- data.table(fundID = letters,
> >> compGeo = rep(c("us", "eu"), each=13))
> >> setkey(rawData, "compGeo")
> >> intDT <- rawData[J("eu"), mult="all"]
> >> intDT[sample.int(nrow(intDT), size=1)]
> >>
> >>
> >> So my idea is to just give the user the option mult="random", which
> >> does this in one step. What do you think about that feature request?
> >>
> >>
> >> With respect to the implementation: I changed a few lines in the
> >> function '[.data.table' and got this to run on my local data.table
> >> version, so I guess I could implement it (as far as I can see, one
> >> just needs to change some R code). However, I haven't done extensive
> >> testing and I'm not an expert on shared projects and subversion (never
> >> did that actually), so I guess I would need some help to start with
> >> and the confirmation I couldn't break anything ;-)
> >>
> >>
> >> Christoph
> >>
> >>
> >>
> >>
> >> _______________________________________________
> >> datatable-help mailing list
> >> datatable-help at lists.r-forge.r-project.org
> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help