No subject
Thu Aug 26 12:23:38 CEST 2010
has to be pretty slow to benefit. If you have lots of groupings,
the serial method will generally win because it avoids the
overhead of forking a process for each group. You can use "parallel"
and "collect" from the multicore package with a data.table, with
something like:
dt[, parallel(mean(b)), by = "a"]
ans <- collect()
See below for examples and some timings. It doesn't work for
me on Windows XP, but it does on Linux.
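For reference, multicore's forking functions were later folded into R's bundled parallel package (as mcparallel and mccollect, alongside mclapply). A minimal sketch of the same per-group idea using mclapply on a plain data frame follows. The toy data, the column names and the core count are made up for illustration, and because forking is unavailable on Windows the sketch falls back to a single core there.

```r
library(parallel)  # ships with R; provides the fork-based mclapply()

# Toy data, assumed for illustration only
set.seed(1)
df <- data.frame(a = sample(1:10, 1e4, replace = TRUE),
                 b = sample(1:100, 1e4, replace = TRUE))

# Forking is not available on Windows, so use a single core there
cores <- if (.Platform$OS.type == "windows") 1L else 2L

# One task per group; results come back in group order, named by group
groups <- split(df$b, df$a)
par_means <- unlist(mclapply(groups, mean, mc.cores = cores))

# The serial equivalent, for comparison
ser_means <- tapply(df$b, df$a, mean)
```

Because mclapply returns its results in input order, there is no separate step to match results back to groups, unlike the parallel/collect approach.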
> library(data.table)
> library(multicore)
> n <- 1e8
> dt <- data.table(a = sample(1:10, n, replace = TRUE),
+ b = sample(1:100, n, replace = TRUE),
+ c = LETTERS[rep(1:500, n/500)], key = "a")
>
> (res <- dt[, list(pid = parallel(mean(b))$pid), by = "a"])
       a   pid
 [1,]  1 19547
 [2,]  2 19548
 [3,]  3 19549
 [4,]  4 19550
 [5,]  5 19551
 [6,]  6 19552
 [7,]  7 19553
 [8,]  8 19554
 [9,]  9 19555
[10,] 10 19556
> (ans <- collect())
$`19556`
[1] 50.50949
$`19555`
[1] 50.49289
$`19554`
[1] 50.48453
$`19553`
[1] 50.48849
$`19552`
[1] 50.51581
$`19551`
[1] 50.49477
$`19550`
[1] 50.50468
$`19549`
[1] 50.50396
$`19548`
[1] 50.495
$`19547`
[1] 50.51994
$`19545`
[1] 50.51657
We're not done yet: "ans" is a list named by process ID, so res
and ans still need to be merged on pid to match each mean to its
group. I won't bother with that here.
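That merge step could look roughly like the sketch below. It assumes, as in the output above, that collect() returns a list named by process ID. The small res and ans objects are hand-made stand-ins with values copied from the session above, so the snippet runs without forking anything.

```r
# Stand-ins for the objects above (first three groups only)
res <- data.frame(a = 1:3, pid = c(19547L, 19548L, 19549L))
ans <- list(`19549` = 50.50396, `19548` = 50.495,
            `19547` = 50.51994, `19545` = 50.51657)

# Look each pid up by name so every mean lands on the row of its group
res$mean_b <- unlist(ans[as.character(res$pid)], use.names = FALSE)
res
```

Looking entries up by name also skips any result with no matching row in res, such as the `19545` entry above.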
Here are some timings:
> system.time({
+ res <- dt[, list(pid = parallel(mean(b))$pid), by = "a"]
+ ans <- collect()
+ })
   user  system elapsed
  2.880   3.996   5.561
>
> system.time({
+ dt[, mean(b), by = "a"]
+ })
   user  system elapsed
  3.051   2.605   5.660
No gain there, so let's make R work harder on each grouping:
> system.time({
+ res <- dt[, list(pid = parallel(mean(sort(b)))$pid), by = "a"]
+ ans <- collect()
+ })
   user  system elapsed
 17.416   5.138   8.114
>
> system.time({
+ dt[, mean(sort(b)), by = "a"]
+ })
   user  system elapsed
 11.429   2.682  14.120
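To put numbers on that, the elapsed and CPU times printed above can be compared directly. This is only arithmetic on the figures from this one run, and the variable names are made up here.

```r
# Elapsed seconds copied from the timings above
elapsed <- c(par_mean = 5.561, ser_mean = 5.660,
             par_sort = 8.114, ser_sort = 14.120)

# Wall-clock speedup of the parallel version (serial / parallel)
speedup_mean <- elapsed[["ser_mean"]] / elapsed[["par_mean"]]  # about 1.02
speedup_sort <- elapsed[["ser_sort"]] / elapsed[["par_sort"]]  # about 1.74

# Total CPU time (user + system) for the sort case: the parallel run
# uses more CPU overall even though it finishes sooner
cpu_par_sort <- 17.416 + 5.138
cpu_ser_sort <- 11.429 + 2.682
cpu_ratio <- cpu_par_sort / cpu_ser_sort  # about 1.6
```

So the mean(sort(b)) case finishes roughly 1.7 times faster in wall-clock terms while using about 1.6 times the CPU time.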
- Tom
On Mon, Sep 13, 2010 at 10:36 AM, Branson Owen <branson.owen at gmail.com> wrote:
> I just read an article about the new plyr package using parallelization to
> speed up its performance. Just throwing out an idea for data.table to
> parallelize some operations and make use of multiple processors
> simultaneously. I don't think this is a must-have feature at the
> moment, though.