[datatable-help] What's your opinion on the feature request: add option mult="random"
djmuseR
djmuser at gmail.com
Sat Jan 7 15:32:22 CET 2012
Hi:
Here's one possible alternative:
# I just made intJoin an integer vector rather than a one column data table
intJoin <- sample(seq_len(10), size = 10000, replace = TRUE)
> table(intJoin)
intJoin
   1    2    3    4    5    6    7    8    9   10
 951 1001  969 1063  999 1007 1004 1035  933 1038
# This function takes samples of size n_i from each year's sub-data
# with replacement, since the sample size can be higher than the
# number of rows in each sub-data table (1000 in this case)
h <- function(dt, svec) {
    # ns[i] = number of rows to draw for year i
    ns <- as.vector(table(svec))
    dt[, .SD[sample(nrow(.SD), ns[Year], replace = TRUE), ], by = 'Year']
}
u <- h(rawData, intJoin)
> dim(u)
[1] 10000 2
> head(u)
     Year fundID
[1,]    1  20091
[2,]    1  92311
[3,]    1  18341
[4,]    1  79721
[5,]    1  13391
[6,]    1  15301
# Check:
> table(u$Year)

   1    2    3    4    5    6    7    8    9   10
 951 1001  969 1063  999 1007 1004 1035  933 1038
> system.time(h(rawData, intJoin))
   user  system elapsed
   0.03    0.00    0.03
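One caveat with h(): as.vector(table(svec)) only has an entry for each year that actually occurs in svec, so ns[Year] lines up with the years only when every year 1..10 was drawn at least once. Below is a minimal sketch of a variant that tolerates missing years. The names h2, nYears and svec are made up for illustration, rawData is a hypothetical stand-in (10 years x 1000 funds, inferred from the output above), and year 5 is deliberately left out of the sample to exercise the zero-count case.

```r
library(data.table)

set.seed(42)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical stand-in for rawData: 10 years, 1000 funds per year
rawData <- data.table(Year   = rep(1:10, each = 1000),
                      fundID = sample.int(1e5, 1e4))

# Sampling vector in which year 5 never occurs (assumption for the demo)
svec <- sample(c(1:4, 6:10), size = 10000, replace = TRUE)

h2 <- function(dt, svec, nYears = 10) {
    # tabulate() returns one count per year 1..nYears, with 0 for
    # years absent from svec, so ns[year] is always well defined
    ns <- tabulate(svec, nbins = nYears)
    # .N is the group size; .BY$Year is the current group's year
    dt[, .SD[sample(.N, ns[.BY$Year], replace = TRUE)], by = Year]
}

u <- h2(rawData, svec)
```

Groups with a count of zero simply contribute no rows, so u should contain nothing for year 5 while still having length(svec) rows in total. This assumes Year is coded as the integers 1..nYears, as in the example data.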
Since timings differ between machines, I also ran your foo1() function
on mine for comparison, after converting intJoin to a data table:
> intJoin <- J(sample(seq_len(10), size = 10000, replace = TRUE))
> system.time(finalData <- foo1(10000, intJoin, rawData))
   user  system elapsed
  30.61    0.03    30.7
HTH,
Dennis
--
View this message in context: http://r.789695.n4.nabble.com/What-s-your-opinion-on-the-feature-request-add-option-mult-random-tp4267483p4273090.html
Sent from the datatable-help mailing list archive at Nabble.com.