[datatable-help] checking an approach to filtering rows in a data.table

Mon Mar 10 15:17:51 CET 2014

Hi Vincent,

I linked to the SO post to get an idea of how to use .I. I dint mean to say that it's exactly what you're looking for. which.max returns the first index of the max value (even if there are multiple identical max values). So, it might make sense that the results are not identical.

Looking at what you're trying to do with your code, these are the two ways I'd approach it. I can't really tell which one's faster for your dataset. But it'd be great if you could post your benchmarks on these two methods.

Method 1:

# .I[.N] will get the running row number for every group's last index
ddt[ddt[, .I[.N], by=by]$V1]
Method 2:

# since you've already keyed your data.table, take adv. of the mult="last" option:
ddt[J(unique(cyl)), mult="last"]

Arun
From: Vincent Carey Vincent Carey
Reply: Vincent Carey stvjc at channing.harvard.edu
Date: March 10, 2014 at 3:04:31 PM
To: Arunkumar Srinivasan aragorn168b at gmail.com
Subject:  Re: [datatable-help] checking an approach to filtering rows in a data.table  
Thanks Arun, I like your approach, and I had looked at the possibility, although I had not seen the SO posting, which is indeed relevant.  The .I solution seemed underperformant relative to expectations, particularly for millions of rows.  Here are some
timings for 2-300k rows.

> litd = disc_allc200k_dt[1:200000,]
> microbenchmark(rowsWmaxVinG( litd, "score", "snp" ))
Unit: milliseconds
                               expr      min       lq   median       uq
 rowsWmaxVinG(litd, "score", "snp") 86.83909 87.45823 88.16629 89.26693
      max neval
 440.0069   100
> microbenchmark(litd[litd[, .I[which.max(score)], snp]$V1 ])
Unit: milliseconds
                                       expr      min       lq  median      uq
 litd[litd[, .I[which.max(score)], snp]$V1] 241.3669 252.2612 279.342 602.113
      max neval
 657.7055   100
> litd = disc_allc200k_dt[1:300000,]
> microbenchmark(rowsWmaxVinG( litd, "score", "snp" ))
Unit: milliseconds
                               expr      min       lq   median       uq
 rowsWmaxVinG(litd, "score", "snp") 119.6237 120.9789 121.6302 122.7155
      max neval
 489.1918   100
> microbenchmark(litd[litd[, .I[which.max(score)], snp]$V1 ])
Unit: milliseconds
                                       expr      min       lq   median      uq
 litd[litd[, .I[which.max(score)], snp]$V1] 324.7394 347.5972 684.6746 693.456
      max neval
 1607.186   100

The two approaches do not agree in terms of values returned when there are ties in the score within groups.  But otherwise the .N based approach seems to work.  I would like to verify that setkeyv accomplishes the sorting necessary for the .N based approach to be valid.

> sessionInfo()
R Under development (unstable) (2014-02-02 r64913)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices datasets  utils     tools     methods  
[8] base     

other attached packages:
[1] microbenchmark_1.3-0 data.table_1.9.2     weaver_1.29.1       
[4] codetools_0.2-8      digest_0.6.4         BiocInstaller_1.13.3

loaded via a namespace (and not attached):
[1] Rcpp_0.11.0    plyr_1.8.1     reshape2_1.2.2 stringr_0.6.2 

On Mon, Mar 10, 2014 at 9:08 AM, Arunkumar Srinivasan <aragorn168b at gmail.com> wrote:
Hi Vincent,

Have you checked out the special variable `.I`? Have a look at `?data.table`. This SO post may also be relevant: http://stackoverflow.com/questions/21198937/subset-data-table-using-min-condition/21199009#21199009
Arun
From: Vincent Carey Vincent Carey
Reply: Vincent Carey stvjc at channing.harvard.edu
Date: March 10, 2014 at 4:33:27 AM
To: datatable-help at lists.r-forge.r-project.org datatable-help at lists.r-forge.r-project.org
Subject:  [datatable-help] checking an approach to filtering rows in a data.table
I have looked around for code on row filtering with data.table, but have
not found anything addressing this use case.

I want to retrieve the rows satisfying a certain condition within groups, in this case having the maximum value for a specific variable.  The following
seems to work, but I wonder if there is a more direct approach.

rowsWmaxVinG = function(dt, V, by) {
#
# filter dt to the rows possessing max value of
# variable V within groups formed using by
#
# example: data(mtcars)
# ddt = data.table(mtcars)
#> rowsWmaxVinG( ddt, by="cyl", V="mpg")
#    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#1: 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
#2: 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
#3: 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
#
 setkeyv(dt, c(by, V)) # sort within groups
 dt[ cumsum(dt[, .N, by=by]$N), ]  # take last row from each group
}
_______________________________________________
datatable-help mailing list
datatable-help at lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20140310/e0bf9c50/attachment-0001.html>