[datatable-help] What's your opinion on the feature request: add option mult="random"

Chris Neff caneff at gmail.com
Fri Jan 6 13:52:40 CET 2012


That doesn't do quite what his code does.  I don't know what you expected

sample(dt, size=1)

to do but it seems to essentially do this:

dt[sample(1:ncol(dt),size=1),]

It picks a random column number and then returns the row with that
number instead.  Try it for yourself:

dt=data.table(x=1:10,y=1:10,z=1:10)
sample(dt, size=1)

The only rows you will get are 1,1,1, 2,2,2 and 3,3,3.  The usual caveat:
I'm on 1.7.1 until my crashing bug is fixed, so apologies if this
works properly in later versions.

Note that this diverges from what sample(df, size=1) does on a
data.frame, which is to pick a random column and return that whole column.
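
To make the data.frame behaviour concrete, here is a small base-R sketch. The name df and the one_col / one_row variables are only illustrative, not something from the thread:

```r
# Base R only; df is a hypothetical data.frame mirroring dt above.
df <- data.frame(x = 1:10, y = 1:10, z = 1:10)

# For a non-numeric x, sample(x, size) amounts to
# x[sample.int(length(x), size)]. For a data.frame, length() is the number
# of columns and df[i] selects column i, so the result is one whole column.
one_col <- sample(df, size = 1)

# Sampling a row needs the row count and row indexing instead.
one_row <- df[sample.int(nrow(df), size = 1), ]
```

So one_col is a 10-row, single-column data.frame, while one_row is a single row with all three columns.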

What he really wants is to pick a random row from each subset (I
think). None of your examples do that and I can't think of a simpler
way than what he suggests.
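
For reference, a rough sketch of both variants (one random row from a single region, and one random row from every region). It assumes a reasonably recent data.table where .N and .SD are available in j; I have not checked it on 1.7.1. The names idx, pick, one_eu and one_per_region are made up for the example. Note that base sample() treats a single number n as 1:n, so sampling a position with sample.int() is safer when a region might have only one row:

```r
library(data.table)

rawData <- data.table(fundID  = letters,
                      compGeo = rep(c("us", "eu"), each = 13))
setkey(rawData, compGeo)

# One random row from a single region, without copying the subset first.
# which = TRUE returns the matching row numbers; sample.int() then picks a
# position in that vector, which stays correct even if idx has length 1.
idx    <- rawData[J("eu"), which = TRUE]
pick   <- idx[sample.int(length(idx), size = 1)]
one_eu <- rawData[pick]

# One random row from every region in a single call.
one_per_region <- rawData[, .SD[sample.int(.N, size = 1)], by = compGeo]
```

The per-region form is not the same as a mult="random" option on a keyed join, but it covers the "random row from each subset" case.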

On 6 January 2012 03:34, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
> Very keen for direct contributions in that way, happy to help you with
> svn etc, and you joining the project.
>
> In this particular example, how about :
>
>    rawData[sample(rawData[J("eu"), which=TRUE],size=1)]
>
> This solves the inefficiency of the 1st step; i.e.,
>    intDT <- rawData[J("eu"), mult="all"]
> which copies a subset of all the columns, whilst retaining flexibility
> for the user, who can easily sample 2 rows or use any other R method to
> select a random subset.
>
> Because of potential scoping conflicts (say a column was called
> "rawData", i.e. the same name as the table), to be more robust:
>
> x = sample(rawData[J("eu"), which=TRUE],size=1)
> rawData[x]
>
> This is slightly different because when i is a single name (x in this
> case), data.table knows the caller must mean the x in calling scope, not
> the column called "x" (if any).  Is two steps like this ok?  I'm
> guessing it was really the inefficiency that was the motivation?
>
> Matthew
>
> On Fri, 2012-01-06 at 00:20 +0100, Christoph Jäckel wrote:
>> Hi all,
>>
>>
>> I run a Monte Carlo simulation on a data.table and do that currently
>> with a loop: on every run, I choose a subset of rows subject to
>> certain criteria and from those rows I take a random element.
>> Currently, I do the following: Let's say I have funds from two regions
>> ("eu" and "us") and I want to choose a random fund from "eu" (could be
>> "us" in the next run and a different region in the third):
>>
>>
>> library(data.table)
>> rawData <- data.table(fundID  = letters,
>>                       compGeo = rep(c("us", "eu"), each=13))
>> setkey(rawData, "compGeo")
>> intDT <- rawData[J("eu"), mult="all"]
>> intDT[sample.int(nrow(intDT), size=1)]
>>
>>
>> So my idea is to just give the user the option mult="random", which
>> does this in one step. What do you think about that feature request?
>>
>>
>> With respect to the implementation: I changed a few lines in the
>> function '[.data.table' and got this to run on my local data.table
>> version, so I guess I could implement it (as far as I can see, one
>> just needs to change some R code). However, I haven't done extensive
>> testing and I'm not an expert on shared projects and subversion (never
>> did that actually), so I guess I would need some help to start with
>> and the confirmation I couldn't break anything ;-)
>>
>>
>> Christoph
>>
>>
>>
>>
>> _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help