[datatable-help] What's your opinion on the feature request: add option mult="random"

Sun Jan 8 23:14:09 CET 2012

I like it. It is starting to look like the standard vector indexing operations, and it might be useful to think through everything that already exists there. For example, what about also allowing logical vectors in the standard way, so that mult=c(TRUE,FALSE) would select odd-numbered rows?
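
For reference, a minimal base R sketch of the recycling behaviour referred to above. The vector x is made up for illustration, and whether mult would follow these semantics exactly is an assumption, not something data.table does today.

```r
# Base R vector indexing: a logical index is recycled to the length of x
x <- c(10, 20, 30, 40, 50)

odd  <- x[c(TRUE, FALSE)]   # keeps positions 1, 3, 5
even <- x[c(FALSE, TRUE)]   # keeps positions 2, 4
print(odd)
print(even)
```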

--Steve

On Jan 8, 2012, at 11:25 AM, Matthew Dowle wrote:

> How about allowing mult to be integer (or an expression that evaluates
> to integer) :
> 
> DT[X, mult="first"]
> DT[X, mult=1L]   # same
> 
> DT[X, mult="last"]
> DT[X, mult=.N]   # same
> 
> DT[X, .SD[2]]  # 2nd row of each group (inefficient due to .SD; there
>                # are other longer alternatives)
> DT[X, mult=2L] # same, but efficient and simple
> 
> DT[X, mult="random"]
> DT[X, mult=sample(.N,size=1)]  # same, but more general
> 
> DT[X, mult=-1L]   # all but the first of each group
> 
> Matthew
> 
> On Sat, 2012-01-07 at 19:07 -0800, Steven C. Bagley wrote:
>> The mult argument is becoming its own little programming language. I
>> worry that this is going to get complicated in an ad hoc way. What if
>> someone wants random, but with weighting? Each new value of mult is
>> really shorthand for an R language construct. Maybe there is a more
>> general way to express these ideas using existing R constructs? (I'm
>> not sure how to do this consistently. I'm merely making an
>> observation.)
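
As an illustration of the "existing R constructs" point, here is a rough base R sketch of a weighted random pick per group. The funds table, its weight column and the helper pick_weighted are all hypothetical, and this says nothing about how data.table would or should implement it.

```r
# Hypothetical data: six funds in three years, with made-up sampling weights
funds <- data.frame(fundID = 1:6,
                    Year   = c(1, 1, 1, 2, 2, 3),
                    weight = c(1, 2, 7, 5, 5, 1))

# Draw one row number from `rows`, with probability proportional to `w`.
# sample.int() on the positions avoids surprises when `rows` has length one.
pick_weighted <- function(rows, w) {
  rows[sample.int(length(rows), size = 1L, prob = w)]
}

set.seed(42)
rows_by_year <- split(seq_len(nrow(funds)), funds$Year)
picked <- vapply(rows_by_year,
                 function(rows) pick_weighted(rows, funds$weight[rows]),
                 integer(1))
print(funds[picked, ])
```

The weighting lives entirely in the prob argument of sample.int(), which is the kind of generality a fixed mult="random" keyword cannot express.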
>> 
>> 
>> --Steve
>> 
>> 
>> On Jan 6, 2012, at 5:58 AM, Christoph Jäckel wrote:
>> 
>>> Thanks for your feedback. @Chris: I guess Matthew's example and
>>> yours do not really match because he doesn't call sample(dt, ...),
>>> but sample(dt[i, which=TRUE], ...). His option, though, returns all the
>>> rows that match between the keys of dt and i and takes a random
>>> sample of size 1 from that, so I guess it does what I expected.
>>> Nevertheless, I think an option mult="random" would still be useful.
>>> Here is why:
>>> 
>>> 
>>> I guess my first example was a little bit too simplistic, sorry for
>>> that! Here is an updated, more realistic example of what I do and
>>> some hints about my current implementation of mult="random":
>>> 
>>> 
>>> require(data.table)
>>> rawData <- data.table(fundID = 1:1e5,
>>>                      Year   = rep(1:10, times=1e4),
>>>                      key    = "Year")
>>> # Let's have 10000 runs; in each run we want to draw a fund with a
>>> # year that is set dynamically
>>> intJoin <- J(sample(1:10, size=10000, replace=TRUE))
>>> 
>>> 
>>> # Best solution I have come up with so far using the current options
>>> # in data.table.
>>> # Is there one that can beat mult="random" and is easy for the user
>>> # to implement?
>>> foo1 <- function(n, intJoin, rawData) {
>>>    x <- integer(n)
>>>    for (r in seq_len(nrow(intJoin))) {
>>>      x[r] <- sample(rawData[intJoin[r], which=TRUE], size=1)
>>>    }
>>>    return(rawData[x])
>>> }
>>> system.time(finalData <- foo1(10000, intJoin, rawData))
>>> #    user  system elapsed 
>>> #  43.827   0.000  44.232
>>> # Check that it does what it should: match random entities to the
>>> # exact year in intJoin
>>> cbind(finalData, intJoin)
>>> #       fundID Year V1
>>> #  [1,]  46556    6  6
>>> #  [2,]  77642    2  2
>>> #  [3,]  17325    5  5
>>> #  [4,]  36617    7  7
>>> #  [5,]  90697    7  7
>>> #  [6,]   4536    6  6
>>> #  [7,]  22273    3  3
>>> #  [8,]  46825    5  5
>>> #  [9,]  65788    8  8
>>> # [10,]  14153    3  3
>>> 
>>> 
>>> #My implementation of mult="random"
>>> system.time(finalData <- rawData[intJoin, mult="random"])  
>>> #   user  system elapsed 
>>> #  0.324   0.016   0.337
>>> #Pretty fast and easy to understand
>>> # Check that it does what it should: match random entities to the
>>> # exact year in intJoin
>>> cbind(finalData, intJoin)
>>> #       Year fundID V1
>>> #  [1,]    6  39626  6
>>> #  [2,]    2  98552  2
>>> #  [3,]    5  85425  5
>>> #  [4,]    7  24637  7
>>> #  [5,]    7  74797  7
>>> #  [6,]    6  87626  6
>>> #  [7,]    3  88973  3
>>> #  [8,]    5  60335  5
>>> #  [9,]    8  62298  8
>>> # [10,]    3  23283  3
>>> 
>>> 
>>> If you want to try it out yourself: Just call
>>> 
>>> 
>>> fixInNamespace("[.data.table", pos="package:data.table")
>>> 
>>> 
>>> and change the following lines in the editor (this applies to
>>> data.table 1.7.7):
>>> 
>>> OLD LINE:     if (!mult %in% c("first", "last", "all"))
>>>                   stop("mult argument can only be 'first','last' or 'all'")
>>> NEW LINE:     if (!mult %in% c("first","last","all", "random"))
>>>                   stop("mult argument can only be 'first','last', 'all', or 'random'")
>>> 
>>> 
>>> and 
>>> 
>>> 
>>> OLD LINES: else {
>>>                irows = if (mult == "first") 
>>>                  idx.start
>>>                else idx.end
>>>                lengths = rep(1L, length(irows))
>>>            }
>>> 
>>> 
>>> NEW LINES:  } else if (mult=="first") { 
>>>              irows = idx.start
>>>              lengths=rep(1L,length(irows))
>>>            } else if (mult=="last") {
>>>              irows = idx.end
>>>              lengths=rep(1L,length(irows))
>>>            } else {
>>>              irows = mapply(function(x1, x2) {sample(x1:x2, size=1)},
>>>                             idx.start, idx.end)
>>>              lengths = rep(1L,length(irows))
>>>            }
>>> 
>>> However, I don't know what's going on in the line
>>> .Call("binarysearch", i, x, as.integer(leftcols - 1),
>>>       as.integer(rightcols - 1), haskey(i), roll, rolltolast,
>>>       idx.start, idx.end, PACKAGE = "data.table")
>>> 
>>> 
>>> I figured out that idx.start and idx.end are changed by this
>>> function call, and I guess at this point in the function
>>> idx.start and idx.end should always be of the same length and both
>>> contain only integer values that represent rows of x, but here I'm
>>> not 100% sure. So maybe additional checks are needed in the else
>>> clause where the mapply function is called.
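
One such check concerns groups with a single matching row. In base R, sample(x, size=1) with a single number x >= 1 draws from 1:x rather than returning x, so sample(x1:x2, size=1) can return a row outside the group when x1 == x2. Below is a minimal sketch of an offset-based alternative. It assumes idx.start and idx.end are equal-length integer vectors in which every row of i found a match; the values are made up, and this is not the package's actual code.

```r
# Hypothetical start/end row numbers, one pair per row of i
idx.start <- c(1L, 5L, 6L)
idx.end   <- c(4L, 5L, 9L)   # the second group has exactly one row

set.seed(1)
# The gotcha: a length-one numeric x makes sample() draw from 1:x
gotcha <- sample(5:5, size = 1)   # some value in 1..5, not necessarily 5

# Offset-based draw: one uniform offset in 0..(n - 1) per group,
# vectorised over all groups instead of calling sample() per group
n     <- idx.end - idx.start + 1L
irows <- idx.start + as.integer(floor(runif(length(n)) * n))
print(irows)
```

Being vectorised, this also avoids the per-group function call that mapply makes, though whether that matters in practice would need timing.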
>>> 
>>> 
>>> So let me know what you think. I will join the project regardless
>>> of this particular issue and try to help, but I guess I should start
>>> with simple things. So if any help is needed with checking the
>>> documentation or things like that, just let me know and I will try
>>> my best!
>>> 
>>> Christoph
>>> 
>>> On Fri, Jan 6, 2012 at 1:52 PM, Chris Neff <caneff at gmail.com> wrote:
>>>        That isn't doing quite what he does.  I don't know what you
>>>        expected
>>> 
>>>        sample(dt, size=1)
>>> 
>>>        to do, but it seems to essentially do this:
>>> 
>>>        dt[sample(1:ncol(dt),size=1),]
>>> 
>>>        It picks a random column number and then returns that row
>>>        instead. Try it for yourself:
>>> 
>>>        dt=data.table(x=1:10,y=1:10,z=1:10)
>>>        sample(dt, size=1)
>>> 
>>>        The only rows you will get are 1,1,1 2,2,2 and 3,3,3.  The usual
>>>        caveat: I'm on 1.7.1 until my crashing bug is fixed, so apologies
>>>        if this works properly in later versions.
>>> 
>>>        Note that this diverges from what sample(df, size=1) does,
>>>        which is to pick a random column and return that whole column.
>>> 
>>>        What he really wants is to pick a random row from each subset
>>>        (I think). None of your examples do that, and I can't think of a
>>>        simpler way than what he suggests.
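
The data.frame behaviour mentioned above can be checked in base R alone. A small sketch, with df as a made-up example; it does not touch data.table, so it says nothing about the 1.7.1 behaviour described.

```r
# Base R only; df is a made-up example
df <- data.frame(x = 1:10, y = 1:10, z = 1:10)

set.seed(1)
# A data.frame is a list of columns, so length(df) is 3 and sample()
# picks one of the columns, returned as a one-column data.frame
one_col <- sample(df, size = 1)

# To draw a random row instead, sample a row number explicitly
one_row <- df[sample.int(nrow(df), size = 1), ]
print(one_col)
print(one_row)
```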
>>> 
>>>        On 6 January 2012 03:34, Matthew Dowle
>>>        <mdowle at mdowle.plus.com> wrote:
>>>> Very keen for direct contributions in that way, happy to help you
>>>> with svn etc, and you joining the project.
>>>> 
>>>> In this particular example, how about :
>>>> 
>>>>   rawData[sample(rawData[J("eu"), which=TRUE],size=1)]
>>>> 
>>>> This solves the inefficiency of the 1st step; i.e.,
>>>>   intDT <- rawData[J("eu"), mult="all"]
>>>> which copies a subset of all the columns, whilst retaining
>>>> flexibility for the user so user can easily sample 2 rows, or any
>>>> other R method to select a random subset.
>>>> 
>>>> Because of potential scoping conflicts (say a column was called
>>>> "rawData" i.e. the same name of the table), to be more robust :
>>>> 
>>>> x = sample(rawData[J("eu"), which=TRUE],size=1)
>>>> rawData[x]
>>>> 
>>>> This is slightly different because when i is a single name (x in
>>>> this case), data.table knows the caller must mean the x in calling
>>>> scope, not the column called "x" (if any).  Is two steps like this
>>>> ok?  I'm guessing it was really the inefficiency that was the
>>>> motivation?
>>>> 
>>>> Matthew
>>>> 
>>>> On Fri, 2012-01-06 at 00:20 +0100, Christoph Jäckel wrote:
>>>>> Hi all,
>>>>> 
>>>>> 
>>>>> I run a Monte Carlo simulation on a data.table and currently do
>>>>> that with a loop: on every run, I choose a subset of rows subject
>>>>> to certain criteria and from those rows I take a random element.
>>>>> Currently, I do the following: Let's say I have funds from two
>>>>> regions ("eu" and "us") and I want to choose a random fund from
>>>>> "eu" (could be "us" in the next run and a different region in the
>>>>> third):
>>>>> 
>>>>> 
>>>>> library(data.table)
>>>>> rawData <- data.table(fundID  = letters,
>>>>>                      compGeo = rep(c("us", "eu"), each=13))
>>>>> setkey(rawData, "compGeo")
>>>>> intDT <- rawData[J("eu"), mult="all"]
>>>>> intDT[sample.int(nrow(intDT), size=1)]
>>>>> 
>>>>> 
>>>>> So my idea is to just give the user the option mult="random",
>>>>> which does this in one step. What do you think about that feature
>>>>> request?
>>>>> 
>>>>> 
>>>>> With respect to the implementation: I changed a few lines in the
>>>>> function '[.data.table' and got this to run on my local data.table
>>>>> version, so I guess I could implement it (as far as I can see, one
>>>>> just needs to change some R code). However, I haven't done
>>>>> extensive testing and I'm not an expert on shared projects and
>>>>> subversion (never did that actually), so I guess I would need some
>>>>> help to start with and the confirmation that I couldn't break
>>>>> anything ;-)
>>>>> 
>>>>> 
>>>>> Christoph
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> _______________________________________________
>>>>> datatable-help mailing list
>>>>> datatable-help at lists.r-forge.r-project.org
>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> 
>> 
> 
>