[datatable-help] datatable-help Digest, Vol 23, Issue 9

Christoph Jäckel christoph.jaeckel at wi.tum.de
Sun Jan 8 12:57:31 CET 2012


@Dennis

Thanks for your suggestion. It would indeed work in this particular case,
in which you only match on one numeric column that is statistically well
defined (here, uniformly distributed). My use case, which I guess is not
that uncommon, is more general. Instead of trying to construct another
example (I have failed twice at that now, sorry ;-), let's discuss it more
generally:

Start with the idiom x[i], where i is a data.table whose columns match the
key columns of x. Quoting the help page: "When i is a data.table, x must
have a key. i is *joined* to x using the key and the rows in x that match
are returned." When it comes to "the rows in x that match are returned",
there are currently three options:

   - mult="all". As far as I can see, this is not applicable in its current
   implementation to a real-life Monte Carlo simulation (MCS), because the
   result gets too big pretty fast. So any solution that starts by calling
   mult="all" and then does some random drawing fails. However, and this is
   what Dennis' solution does, one could use the fact that in an MCS many
   joins are identical. In the example above, although intJoin contains
   10,000 rows, there are basically only 10 distinct joins (one for every
   year). So one could expect the user to do that himself, thus keeping
   base data.table lean. In my opinion, however, this is not easy. As an
   example (I'll try again), let's assume i consists of four
   characteristics with four levels each, so there are 4^4 = 256
   combinations. Those combinations aren't uniformly distributed, so in an
   MCS of n=10,000, combination 1 (C1) might occur 2,000 times, C2 1,000
   times and C256 10 times. Now the user only has to do 256 joins, but he
   has to keep track of how often every combination occurs. After that, he
   has to sample from every combination according to its number of
   occurrences (get 2,000 random samples from the join of dt and C1, 1,000
   random samples from the join of dt and C2, etc.). And after that he has
   to put those tables together again. In my opinion, that is tough and
   error-prone.
   - mult="first": Not applicable to an MCS.
   - mult="last": Not applicable to an MCS.
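
To make the bookkeeping under mult="all" concrete, here is a rough sketch
of those steps. Everything in it is made up for illustration: the tables dt
and sim, the column names x1..x4, run and k, and the probabilities. It also
assumes a data.table version that supports on= and by=.EACHI, which is
newer than the 1.7.7 release mentioned in this thread, so take it as one
possible way to do it rather than the recommended one.

```r
library(data.table)
set.seed(1)

keys <- c("x1", "x2", "x3", "x4")

# Hypothetical lookup table: 4 characteristics with 4 levels each
# (256 combinations) and 20 funds per combination
dt <- CJ(x1 = 1:4, x2 = 1:4, x3 = 1:4, x4 = 1:4)
dt <- dt[rep(seq_len(nrow(dt)), each = 20)]
dt[, fundID := .I]            # fundID equals the row number of dt

# Hypothetical simulation input: 10,000 runs, combinations not uniform
n <- 10000
p <- c(0.4, 0.3, 0.2, 0.1)
sim <- data.table(x1 = sample(1:4, n, TRUE, p), x2 = sample(1:4, n, TRUE, p),
                  x3 = sample(1:4, n, TRUE, p), x4 = sample(1:4, n, TRUE, p))
sim[, run := .I]

# Step 1: count how often every combination occurs
counts <- sim[, .(n_draws = .N), by = keys]

# Steps 2 and 3: join every distinct combination once and draw n_draws
# matching rows (with replacement) for it
drawn <- dt[counts,
            .(fundID = fundID[sample.int(.N, i.n_draws, replace = TRUE)]),
            by = .EACHI, on = keys]

# Step 4: put the draws back in run order by numbering the occurrences of
# every combination on both sides and joining on that number as well
sim[, k := seq_len(.N), by = keys]
drawn[, k := seq_len(.N), by = keys]
res <- drawn[sim, on = c(keys, "k")]
```

Even in this form the user has to go through four separate steps and keep
the two tables aligned by hand, which is the burden described above.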

So I still think there is a valid case for mult="random", because I don't
see an easy and flexible workaround with the current options in data.table.
With respect to Steve's comment: I see your point and I think Matthew has
some thoughts on that. However, I can't think of any good examples for
weighted sampling, because i already does that implicitly. That is, in an
MCS, i puts more weight on combinations that occur more frequently, so you
don't really need weighted sampling (in the example above this is done by
the fact that C1 is in i 2,000 times, C256 only 10 times). Weighting would
be useful if you went with the approach outlined under mult="all", i.e.
you get rid of all duplicates and do every join only once. Then it would
be great to have an option to tell R/data.table how often every
combination occurs. In the example above, you would then have a weighting
vector for the different combinations C1, C2, ..., C256 of c(0.2, 0.1,
..., 0.001).
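
The relation between occurrence counts and such a weighting vector can be
sketched in base R as follows. The counts are the ones from the example
above; the second half uses made-up weights for three hypothetical
combinations A, B and C, just to show the opposite direction.

```r
set.seed(42)
n_runs <- 10000

# Occurrence counts from the example (only three of the 256 combinations)
counts <- c(C1 = 2000, C2 = 1000, C256 = 10)
weights <- counts / n_runs    # relative frequency of each combination

# The other direction: given weights for a complete set of distinct
# combinations (hypothetical values), draw how often each one occurs
w <- c(A = 0.5, B = 0.3, C = 0.2)
draws <- sample(names(w), size = n_runs, replace = TRUE, prob = w)
n_per_combo <- table(factor(draws, levels = names(w)))
```

The resulting n_per_combo plays the same role as the duplicated rows in i:
it says how many random rows to draw for each distinct combination.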

However, data.table has been out for a while now and apparently this issue
hasn't come up before, so I guess we should just keep it in mind and move
on. I could raise a feature request with lowest priority that links to this
thread.

Christoph

On Sun, Jan 8, 2012 at 12:00 PM, <
datatable-help-request at r-forge.wu-wien.ac.at> wrote:

> Today's Topics:
>
>   1. Re: What's your opinion on the feature request: add option
>      mult="random" (djmuseR)
>   2. Re: What's your opinion on the feature request:   add option
>      mult="random" (Steven C. Bagley)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Sat, 7 Jan 2012 06:32:22 -0800 (PST)
> From: djmuseR <djmuser at gmail.com>
> Subject: Re: [datatable-help] What's your opinion on the feature
>        request: add option mult="random"
> To: datatable-help at lists.r-forge.r-project.org
> Message-ID: <1325946742615-4273090.post at n4.nabble.com>
> Content-Type: text/plain; charset=us-ascii
>
> Hi:
>
> Here's one possible alternative:
>
> # I just made intJoin an integer vector rather than a one column data table
> intJoin <- sample(seq_len(10), size = 10000, replace = TRUE)
> > table(intJoin)
> intJoin
>    1    2    3    4    5    6    7    8    9   10
>  951 1001  969 1063  999 1007 1004 1035  933 1038
>
> # This function takes samples of size n_i from each year's sub-data
> # with replacement, since the sample size can be higher than the
> # number of rows in each sub-data table (10,000 in this case)
> h <- function(dt, svec) {
>     # ns[k] is the number of draws requested for year k; indexing by Year
>     # assumes the years are 1, 2, ..., K and that each occurs in svec
>     ns <- as.vector(table(svec))
>     dt[, .SD[sample(nrow(.SD), ns[Year], replace = TRUE), ], by = 'Year']
> }
> u <- h(rawData, intJoin)
> > dim(u)
> [1] 10000     2
> > head(u)
>     Year fundID
> [1,]    1  20091
> [2,]    1  92311
> [3,]    1  18341
> [4,]    1  79721
> [5,]    1  13391
> [6,]    1  15301
>
> # Check:
> > table(u$Year)
>    1    2    3    4    5    6    7    8    9   10
>  951 1001  969 1063  999 1007 1004 1035  933 1038
> > system.time(h(rawData, intJoin))
>   user  system elapsed
>   0.03    0.00    0.03
>
> Since timings differ on machines, I tried out your foo1() function for
> comparison, after converting intJoin to a data table:
> > intJoin <- J(sample(seq_len(10), size = 10000, replace = TRUE))
> > system.time(finalData <- foo1(10000, intJoin, rawData))
>   user  system elapsed
>  30.61    0.03   30.7
>
> HTH,
> Dennis
>
> --
> View this message in context:
> http://r.789695.n4.nabble.com/What-s-your-opinion-on-the-feature-request-add-option-mult-random-tp4267483p4273090.html
> Sent from the datatable-help mailing list archive at Nabble.com.
>
>
> ------------------------------
>
> Message: 2
> Date: Sat, 7 Jan 2012 19:07:53 -0800
> From: "Steven C. Bagley" <steven.bagley at gmail.com>
> Subject: Re: [datatable-help] What's your opinion on the feature
>        request:        add option mult="random"
> To: christoph.jaeckel at wi.tum.de
> Cc: datatable-help at r-forge.wu-wien.ac.at
> Message-ID: <3648349B-5BD2-4911-AB5B-9ECC653C5D83 at gmail.com>
> Content-Type: text/plain; charset="iso-8859-1"
>
> The mult argument is becoming its own little programming language. I worry
> that this is going to get complicated in an ad hoc way. What if someone
> wants random, but with weighting? Each new value of mult is really
> shorthand for an R language construct. Maybe there is a more general way to
> express these ideas using existing R constructs? (I'm not sure how to do
> this consistently. I'm merely making an observation.)
>
> --Steve
>
> On Jan 6, 2012, at 5:58 AM, Christoph Jäckel wrote:
>
> > Thanks for your feedback. @Chris: I guess Matthew's example and yours
> > do not really match because he doesn't call sample(dt, ...), but
> > sample(dt[i, which=TRUE], ...). His option, though, returns all the rows
> > that match between the keys of dt and i and takes a random sample of
> > size 1 from that, so I guess it does what I expected. Nevertheless, I
> > think an option mult="random" would still be useful. Here is why:
> >
> > I guess my first example was a little bit too simplistic, sorry for
> > that! Here is an updated, more realistic example of what I do and some
> > hints about my current implementation of mult="random":
> >
> > require(data.table)
> > rawData <- data.table(fundID = 1:1e5,
> >                       Year   = rep(1:10, times=1e4),
> >                       key    = "Year")
> > #Let's have 10000 runs; in each run we want to draw a fund with a year
> > #that is set dynamically
> > intJoin <- J(sample(1:10, size=10000, replace=TRUE))
> >
> > #Best solution I have come up with so far using the current options in
> > #data.table. Is there one that can beat mult="random" and is easy for
> > #the user to implement?
> > foo1 <- function(n, intJoin, rawData) {
> >     x <- integer(n)
> >     for (r in seq_len(nrow(intJoin))) {
> >       x[r] <- sample(rawData[intJoin[r], which=TRUE], size=1)
> >     }
> >     return(rawData[x])
> > }
> > system.time(finalData <- foo1(10000, intJoin, rawData))
> > #    user  system elapsed
> > #  43.827   0.000  44.232
> > #Check that it does what it should: match random entities to the exact
> > #year in intJoin
> > cbind(finalData, intJoin)
> > #       fundID Year V1
> > #  [1,]  46556    6  6
> > #  [2,]  77642    2  2
> > #  [3,]  17325    5  5
> > #  [4,]  36617    7  7
> > #  [5,]  90697    7  7
> > #  [6,]   4536    6  6
> > #  [7,]  22273    3  3
> > #  [8,]  46825    5  5
> > #  [9,]  65788    8  8
> > # [10,]  14153    3  3
> >
> > #My implementation of mult="random"
> > system.time(finalData <- rawData[intJoin, mult="random"])
> > #   user  system elapsed
> > #  0.324   0.016   0.337
> > #Pretty fast and easy to understand
> > #Check that it does what it should: match random entities to the exact
> > #year in intJoin
> > cbind(finalData, intJoin)
> > #       Year fundID V1
> > #  [1,]    6  39626  6
> > #  [2,]    2  98552  2
> > #  [3,]    5  85425  5
> > #  [4,]    7  24637  7
> > #  [5,]    7  74797  7
> > #  [6,]    6  87626  6
> > #  [7,]    3  88973  3
> > #  [8,]    5  60335  5
> > #  [9,]    8  62298  8
> > # [10,]    3  23283  3
> >
> > If you want to try it out yourself: Just call
> >
> > fixInNamespace("[.data.table", pos="package:data.table")
> >
> > and change the following lines in the editor (this applies to data.table
> > 1.7.7):
> >
> > OLD LINE:     if (!mult %in% c("first", "last", "all"))
> >                 stop("mult argument can only be 'first','last' or 'all'")
> > NEW LINE:     if (!mult %in% c("first", "last", "all", "random"))
> >                 stop("mult argument can only be 'first','last', 'all', or 'random'")
> >
> > and
> >
> > OLD LINES: else {
> >                 irows = if (mult == "first")
> >                   idx.start
> >                 else idx.end
> >                 lengths = rep(1L, length(irows))
> >             }
> >
> > NEW LINES:  } else if (mult=="first") {
> >               irows = idx.start
> >               lengths=rep(1L,length(irows))
> >             } else if (mult=="last") {
> >               irows = idx.end
> >               lengths=rep(1L,length(irows))
> >             } else {
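> >               # sample.int() avoids the sample() pitfall where a single
> >               # match k (x1 == x2) would be drawn from 1:k instead
> >               irows = mapply(function(x1, x2) x1 + sample.int(x2 - x1 + 1L, size=1L) - 1L,
> >                              idx.start, idx.end)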
> >               lengths = rep(1L,length(irows))
> >             }
> >
> > However, I don't know what's going on in the line
> > .Call("binarysearch", i, x, as.integer(leftcols -
> >                 1), as.integer(rightcols - 1), haskey(i), roll,
> >                 rolltolast, idx.start, idx.end, PACKAGE = "data.table")
> >
> > I figured out that idx.start and idx.end are changed by this function
> > call, and I guess that at this point in the function idx.start and
> > idx.end should always be of the same length and both contain only
> > integer values that represent rows of x, but here I'm not 100% sure. So
> > maybe additional checks are needed in the else clause where the mapply
> > function is called.
> >
> > So let me know what you think. I will join the project independently of
> > that particular issue and try to help, but I guess I should start with
> > simple things. So if any help is needed with checking documentation or
> > stuff like that, just let me know and I'll try my best!
> >
> > Christoph
> >
> > On Fri, Jan 6, 2012 at 1:52 PM, Chris Neff <caneff at gmail.com> wrote:
> > That isn't doing quite what he does.  I don't know what you expected
> >
> > sample(dt, size=1)
> >
> > to do but it seems to essentially do this:
> >
> > dt[sample(1:ncol(dt),size=1),]
> >
> > It picks a random column number and then returns the row with that
> > number instead.
> > Try it for yourself:
> >
> > dt=data.table(x=1:10,y=1:10,z=1:10)
> > sample(dt, size=1)
> >
> > The only rows you will get are 1,1,1 2,2,2 and 3,3,3.  Caveat as usual
> > is I'm on 1.7.1 until my crashing bug is fixed so apologies if this
> > works properly in later versions.
> >
> > Note that this diverges from what sample(df, size=1) does, which is
> > to pick a random column and return that whole column.
> >
> > What he really wants is to pick a random row from each subset (I
> > think). None of your examples do that and I can't think of a simpler
> > way than what he suggests.
> >
> > On 6 January 2012 03:34, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
> > > Very keen for direct contributions in that way, happy to help you with
> > > svn etc, and you joining the project.
> > >
> > > In this particular example, how about :
> > >
> > >    rawData[sample(rawData[J("eu"), which=TRUE],size=1)]
> > >
> > > This solves the inefficiency of the 1st step; i.e.,
> > >    intDT <- rawData[J("eu"), mult="all"]
> > > which copies a subset of all the columns, whilst retaining flexibility
> > > for the user so user can easily sample 2 rows, or any other R method to
> > > select a random subset.
> > >
> > > Because of potential scoping conflicts (say a column was called
> > > "rawData" i.e. the same name of the table), to be more robust :
> > >
> > > x = sample(rawData[J("eu"), which=TRUE],size=1)
> > > rawData[x]
> > >
> > > This is slightly different because when i is a single name (x in this
> > > case), data.table knows the caller must mean the x in calling scope,
> > > not the column called "x" (if any).  Is two steps like this ok?  I'm
> > > guessing it was really the inefficiency that was the motivation?
> > >
> > > Matthew
> > >
> > > On Fri, 2012-01-06 at 00:20 +0100, Christoph Jäckel wrote:
> > >> Hi together,
> > >>
> > >>
> > >> I run a Monte Carlo simulation on a data.table and do that currently
> > >> with a loop: on every run, I choose a subset of rows subject to
> > >> certain criteria and from those rows I take a random element.
> > >> Currently, I do the following: Let's say I have funds from two regions
> > >> ("eu" and "us") and I want to choose a random fund from "eu" (could be
> > >> "us" in the next run and a different region in the third):
> > >>
> > >>
> > >> library(data.table)
> > >> rawData <- data.table(fundID  = letters,
> > >>                       compGeo = rep(c("us", "eu"), each=13))
> > >> setkey(rawData, "compGeo")
> > >> intDT <- rawData[J("eu"), mult="all"]
> > >> intDT[sample.int(nrow(intDT), size=1)]
> > >>
> > >>
> > >> So my idea is to just give the user the option mult="random", which
> > >> does this in one step. What do you think about that feature request?
> > >>
> > >>
> > >> With respect to the implementation: I changed a few lines in the
> > > >> function '[.data.table' and got this to run on my local data.table
> > >> version, so I guess I could implement it (as far as I can see, one
> > >> just needs to change some R code). However, I haven't done extensive
> > >> testing and I'm not an expert on shared projects and subversion (never
> > >> did that actually), so I guess I would need some help to start with
> > >> and the confirmation I couldn't break anything ;-)
> > >>
> > >>
> > >> Christoph
> > >>
> > >>
> > >>
> > >>
> > >> _______________________________________________
> > >> datatable-help mailing list
> > >> datatable-help at lists.r-forge.r-project.org
> > >>
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> > >
> > >
> >
> >
>
>
> ------------------------------
>
>
> End of datatable-help Digest, Vol 23, Issue 9
> *********************************************
>