No subject

Thu Aug 26 12:23:38 CEST 2010

has to be pretty slow to benefit. If you have lots of groupings,
the serial method will generally win because it doesn't have the
overhead of starting a separate process for each group. You can use
"parallel" and "collect" from the multicore package with a
data.table like this:

dt[, parallel(mean(b)), by = "a"]
ans <- collect()

See below for examples and some timings. It doesn't work for
me on Windows XP, but it does on Linux.

> library(multicore)
> n <- 1e8
> dt <- data.table(a = sample(1:10, n, replace = TRUE),
+                  b = sample(1:100, n, replace = TRUE),
+                  c = LETTERS[rep(1:500, n/500)], key = "a")
>
> (res <- dt[, list(pid = parallel(mean(b))$pid), by = "a"])
       a   pid
 [1,]  1 19547
 [2,]  2 19548
 [3,]  3 19549
 [4,]  4 19550
 [5,]  5 19551
 [6,]  6 19552
 [7,]  7 19553
 [8,]  8 19554
 [9,]  9 19555
[10,] 10 19556
> (ans <- collect())
$`19556`
[1] 50.50949

$`19555`
[1] 50.49289

$`19554`
[1] 50.48453

$`19553`
[1] 50.48849

$`19552`
[1] 50.51581

$`19551`
[1] 50.49477

$`19550`
[1] 50.50468

$`19549`
[1] 50.50396

$`19548`
[1] 50.495

$`19547`
[1] 50.51994

$`19545`
[1] 50.51657

We're not done yet, because "ans" is a list named by pid, and we
need to merge res and ans to match each result to its group. I
won't bother with that.
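
For anyone who does want that merge step, here is a minimal sketch. It assumes only what the output above shows: "res" holds one pid per group and "ans" is a list whose names are those pids. Both objects are rebuilt by hand from the printed values so the snippet runs on its own without forking anything, and the result column name V1 is an arbitrary choice. Note that "ans" has one more entry (19545) than "res" has rows; looking results up by name leaves it out.

```r
# res and ans rebuilt by hand from the output printed above
res <- data.frame(a = 1:10, pid = 19547:19556)
ans <- list(`19556` = 50.50949, `19555` = 50.49289, `19554` = 50.48453,
            `19553` = 50.48849, `19552` = 50.51581, `19551` = 50.49477,
            `19550` = 50.50468, `19549` = 50.50396, `19548` = 50.495,
            `19547` = 50.51994, `19545` = 50.51657)

# Look each pid up by name in ans, in the row order of res.
# The entry for 19545 matches no row of res and is simply not used.
res$V1 <- unlist(ans[as.character(res$pid)], use.names = FALSE)
res
```

Matching on the names rather than on position matters here, because collect() returned the results in a different order from the groups.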

Here are some timings:

> system.time({
+     res <- dt[, list(pid = parallel(mean(b))$pid), by = "a"]
+     ans <- collect()
+ })
   user  system elapsed
  2.880   3.996   5.561
>
> system.time({
+     dt[, mean(b), by = "a"]
+ })
   user  system elapsed
  3.051   2.605   5.660

No gain there, so let's make R work harder on each grouping:

> system.time({
+     res <- dt[, list(pid = parallel(mean(sort(b)))$pid), by = "a"]
+     ans <- collect()
+ })
   user  system elapsed
 17.416   5.138   8.114
>
> system.time({
+     dt[, mean(sort(b)), by = "a"]
+ })
   user  system elapsed
 11.429   2.682  14.120
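
To put numbers on the comparison, the elapsed and user times printed above can be turned into ratios. This is just arithmetic on the figures already shown; the variable names are made up for the illustration.

```r
# elapsed and user times copied from the system.time() output above
elapsed <- data.frame(task     = c("mean(b)", "mean(sort(b))"),
                      parallel = c(5.561, 8.114),
                      serial   = c(5.660, 14.120))
# speedup = serial elapsed time divided by parallel elapsed time
elapsed$speedup <- round(elapsed$serial / elapsed$parallel, 2)
elapsed

# total user CPU time for the sort case, parallel relative to serial
cpu_ratio <- round(17.416 / 11.429, 2)
cpu_ratio
```

On these figures the parallel version finishes the sort case in a bit over half the wall-clock time while using more total CPU time, and the plain mean case is essentially a tie.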

- Tom

On Mon, Sep 13, 2010 at 10:36 AM, Branson Owen <branson.owen at gmail.com> wrote:
> I just read an article about new plyr package using parallelization to
> speed up its performance. Just throw out an idea for data.table to
> parallelize some operations and make use of multiple processors
> simultaneously. I don't think this is a must-have feature at this
> moment though.