[datatable-help] What's your opinion on the feature request: add option mult="random"

Sat Jan 7 15:32:22 CET 2012

Hi:

Here's one possible alternative:

# I made intJoin an integer vector rather than a one-column data table
intJoin <- sample(seq_len(10), size = 10000, replace = TRUE)
> table(intJoin)
intJoin
   1    2    3    4    5    6    7    8    9   10 
 951 1001  969 1063  999 1007 1004 1035  933 1038

# This function draws n_i rows from each year's subset, where n_i is the
# number of times year i appears in svec. It samples with replacement
# because n_i can exceed the number of rows in a subset (1000 in this case).
# Note that ns[Year] assumes the Year values are 1, 2, ... and that each
# one occurs at least once in svec.
h <- function(dt, svec) {
    ns <- as.vector(table(svec))
    dt[, .SD[sample(nrow(.SD), ns[Year], replace = TRUE), ], by = 'Year']
}
u <- h(rawData, intJoin)
> dim(u)
[1] 10000     2
> head(u)
     Year fundID
[1,]    1  20091
[2,]    1  92311
[3,]    1  18341
[4,]    1  79721
[5,]    1  13391
[6,]    1  15301

# Check:
> table(u$Year)
   1    2    3    4    5    6    7    8    9   10 
 951 1001  969 1063  999 1007 1004 1035  933 1038 
> system.time(h(rawData, intJoin))
   user  system elapsed 
   0.03    0.00    0.03
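
In case it helps anyone reproduce this, here is a self-contained sketch of
the same idea. rawData is not shown in this message, so the table below is a
made-up stand-in (10 years, 1000 rows each, arbitrary fundID values). h2 is a
hypothetical variant of h() that counts with tabulate() instead of table(), so
a year that never appears in svec gets a count of zero instead of being
dropped. It assumes Year is coded as the integers 1, 2, ... and that fundID is
the only other column.

```r
library(data.table)

set.seed(1)

# Hypothetical stand-in for rawData: 10 years x 1000 funds per year.
rawData <- data.table(Year   = rep(1:10, each = 1000),
                      fundID = sample.int(100000L, 10000L))

# Requested years, one entry per row to be drawn.
intJoin <- sample(seq_len(10), size = 10000, replace = TRUE)

# h2 is an illustrative name, not part of data.table.
h2 <- function(dt, svec) {
  # ns[k] = number of rows to draw for year k (zero if k is never requested)
  ns <- tabulate(svec, nbins = max(dt$Year))
  # Within each Year group, draw ns[Year] row positions with replacement.
  dt[, .(fundID = fundID[sample(.N, ns[Year], replace = TRUE)]), by = Year]
}

u <- h2(rawData, intJoin)

# The per-year counts in the result should match the requested counts.
all(tabulate(u$Year, nbins = 10) == tabulate(intJoin, nbins = 10))
```

Indexing fundID directly in j avoids building .SD for every group, at the
price of hard-coding the column name. With more columns, the
.SD[sample(...)] form used in h() is the more general choice.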

Since timings vary between machines, I also ran your foo1() function here
for comparison, after converting intJoin to a data table:
> intJoin <- J(sample(seq_len(10), size = 10000, replace = TRUE))
> system.time(finalData <- foo1(10000, intJoin, rawData))
   user  system elapsed 
  30.61    0.03   30.7

HTH,
Dennis
