[datatable-help] checking an approach to filtering rows in a data.table

Vincent Carey stvjc at channing.harvard.edu
Mon Mar 10 15:04:30 CET 2014


Thanks Arun, I like your approach, and I had looked at the possibility,
although I had not seen the SO posting, which is indeed relevant.  The .I
solution seemed underperformant relative to expectations, particularly for
millions of rows.  Here are some
timings for 2-300k rows.

> litd = disc_allc200k_dt[1:200000,]
> microbenchmark(rowsWmaxVinG( litd, "score", "snp" ))
Unit: milliseconds
                               expr      min       lq   median       uq
 rowsWmaxVinG(litd, "score", "snp") 86.83909 87.45823 88.16629 89.26693
      max neval
 440.0069   100
> microbenchmark(litd[litd[, .I[which.max(score)], snp]$V1 ])
Unit: milliseconds
                                       expr      min       lq  median
 uq
 litd[litd[, .I[which.max(score)], snp]$V1] 241.3669 252.2612 279.342
602.113
      max neval
 657.7055   100
> litd = disc_allc200k_dt[1:300000,]
> microbenchmark(rowsWmaxVinG( litd, "score", "snp" ))
Unit: milliseconds
                               expr      min       lq   median       uq
 rowsWmaxVinG(litd, "score", "snp") 119.6237 120.9789 121.6302 122.7155
      max neval
 489.1918   100
> microbenchmark(litd[litd[, .I[which.max(score)], snp]$V1 ])
Unit: milliseconds
                                       expr      min       lq   median
 uq
 litd[litd[, .I[which.max(score)], snp]$V1] 324.7394 347.5972 684.6746
693.456
      max neval
 1607.186   100

The two approaches do not agree in terms of values returned when there are
ties in the score within groups.  But otherwise the .N based approach seems
to work.  I would like to verify that setkeyv accomplishes the sorting
necessary for the .N based approach to be valid.

> sessionInfo()
R Under development (unstable) (2014-02-02 r64913)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices datasets  utils     tools     methods
[8] base

other attached packages:
[1] microbenchmark_1.3-0 data.table_1.9.2     weaver_1.29.1
[4] codetools_0.2-8      digest_0.6.4         BiocInstaller_1.13.3

loaded via a namespace (and not attached):
[1] Rcpp_0.11.0    plyr_1.8.1     reshape2_1.2.2 stringr_0.6.2



On Mon, Mar 10, 2014 at 9:08 AM, Arunkumar Srinivasan <aragorn168b at gmail.com
> wrote:

> Hi Vincent,
>
> Have you checked out the special variable `.I`? Have a look at
> `?data.table`. This SO post may also be relevant:
> http://stackoverflow.com/questions/21198937/subset-data-table-using-min-condition/21199009#21199009
> Arun
> ------------------------------
> From: Vincent Carey Vincent Carey <stvjc at channing.harvard.edu>
> Reply: Vincent Carey stvjc at channing.harvard.edu
> Date: March 10, 2014 at 4:33:27 AM
> To: datatable-help at lists.r-forge.r-project.org
> datatable-help at lists.r-forge.r-project.org
> Subject:  [datatable-help] checking an approach to filtering rows in a
> data.table
>
>  I have looked around for code on row filtering with data.table, but have
> not found anything addressing this use case.
>
> I want to retrieve the rows satisfying a certain condition within groups,
> in this case having the maximum value for a specific variable.  The
> following
> seems to work, but I wonder if there is a more direct approach.
>
> rowsWmaxVinG = function(dt, V, by) {
> #
> # filter dt to the rows possessing max value of
> # variable V within groups formed using by
> #
> # example: data(mtcars)
> # ddt = data.table(mtcars)
> #> rowsWmaxVinG( ddt, by="cyl", V="mpg")
> #    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
> #1: 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
> #2: 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
> #3: 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
> #
>  setkeyv(dt, c(by, V)) # sort within groups
>  dt[ cumsum(dt[, .N, by=by]$N), ]  # take last row from each group
> }
>  _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20140310/971d149c/attachment.html>


More information about the datatable-help mailing list