[datatable-help] checking an approach to filtering rows in a data.table
Vincent Carey
stvjc at channing.harvard.edu
Mon Mar 10 15:04:30 CET 2014
Thanks Arun, I like your approach. I had looked at that possibility,
although I had not seen the SO posting, which is indeed relevant. The .I
solution seemed slower than I expected, particularly for millions of rows.
Here are some timings for 200k and 300k rows.
> litd = disc_allc200k_dt[1:200000,]
> microbenchmark(rowsWmaxVinG( litd, "score", "snp" ))
Unit: milliseconds
                               expr      min       lq   median       uq      max neval
 rowsWmaxVinG(litd, "score", "snp") 86.83909 87.45823 88.16629 89.26693 440.0069   100
> microbenchmark(litd[litd[, .I[which.max(score)], snp]$V1 ])
Unit: milliseconds
                                       expr      min       lq  median      uq      max neval
 litd[litd[, .I[which.max(score)], snp]$V1] 241.3669 252.2612 279.342 602.113 657.7055   100
> litd = disc_allc200k_dt[1:300000,]
> microbenchmark(rowsWmaxVinG( litd, "score", "snp" ))
Unit: milliseconds
                               expr      min       lq   median       uq      max neval
 rowsWmaxVinG(litd, "score", "snp") 119.6237 120.9789 121.6302 122.7155 489.1918   100
> microbenchmark(litd[litd[, .I[which.max(score)], snp]$V1 ])
Unit: milliseconds
                                       expr      min       lq   median      uq      max neval
 litd[litd[, .I[which.max(score)], snp]$V1] 324.7394 347.5972 684.6746 693.456 1607.186   100
The two approaches do not return the same values when there are ties in
the score within groups. Otherwise the .N-based approach seems to work.
I would like to verify that setkeyv accomplishes the sorting necessary
for the .N-based approach to be valid.
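A small check along these lines might help confirm both points. This is
only a sketch on made-up data: the table `dt`, its `id` column, and the
names `keyed`, `byN`, and `byI` are hypothetical and are not taken from
disc_allc200k_dt. It assumes that setkeyv sorts ascending by the key
columns in the order given, so that the last row of each group holds the
group maximum.

```r
library(data.table)

# Toy data with a deliberate tie: group "a" has two rows sharing the max score.
dt <- data.table(snp   = c("a", "a", "a", "b", "b"),
                 score = c(1, 3, 3, 2, 5),
                 id    = 1:5)

# .N-based approach: sort by (group, value), then take the last row per group.
keyed <- copy(dt)                     # copy, because setkeyv reorders by reference
setkeyv(keyed, c("snp", "score"))
byN <- keyed[cumsum(keyed[, .N, by = snp]$N)]

# .I-based approach: which.max() returns the first maximum within each group.
byI <- dt[dt[, .I[which.max(score)], by = snp]$V1]

# The maximum score per group is the same either way ...
print(identical(byN$score, byI$score))

# ... and the last row of each keyed group holds that group's maximum.
print(identical(byN$score, keyed[, max(score), by = snp]$V1))

# The row chosen within a tied group may differ between the two approaches.
print(byN)
print(byI)
```

If both comparisons print TRUE, the sort done by setkeyv is sufficient for
taking the last row per group, and any remaining disagreement between the
two results is confined to which of several tied rows is returned.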
> sessionInfo()
R Under development (unstable) (2014-02-02 r64913)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] C
attached base packages:
[1] stats graphics grDevices datasets utils tools methods
[8] base
other attached packages:
[1] microbenchmark_1.3-0 data.table_1.9.2 weaver_1.29.1
[4] codetools_0.2-8 digest_0.6.4 BiocInstaller_1.13.3
loaded via a namespace (and not attached):
[1] Rcpp_0.11.0 plyr_1.8.1 reshape2_1.2.2 stringr_0.6.2
On Mon, Mar 10, 2014 at 9:08 AM, Arunkumar Srinivasan <aragorn168b at gmail.com> wrote:
> Hi Vincent,
>
> Have you checked out the special variable `.I`? Have a look at
> `?data.table`. This SO post may also be relevant:
> http://stackoverflow.com/questions/21198937/subset-data-table-using-min-condition/21199009#21199009
> Arun
> ------------------------------
> From: Vincent Carey <stvjc at channing.harvard.edu>
> Reply: Vincent Carey <stvjc at channing.harvard.edu>
> Date: March 10, 2014 at 4:33:27 AM
> To: datatable-help at lists.r-forge.r-project.org
> Subject: [datatable-help] checking an approach to filtering rows in a
> data.table
>
> I have looked around for code on row filtering with data.table, but have
> not found anything addressing this use case.
>
> I want to retrieve the rows satisfying a certain condition within groups,
> in this case having the maximum value for a specific variable. The
> following seems to work, but I wonder if there is a more direct approach.
>
> rowsWmaxVinG = function(dt, V, by) {
>   #
>   # filter dt to the rows possessing max value of
>   # variable V within groups formed using by
>   #
>   # example: data(mtcars)
>   #   ddt = data.table(mtcars)
>   #   > rowsWmaxVinG( ddt, by="cyl", V="mpg")
>   #       mpg cyl  disp  hp drat    wt  qsec vs am gear carb
>   #   1: 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
>   #   2: 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
>   #   3: 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
>   #
>   setkeyv(dt, c(by, V))              # sort within groups
>   dt[ cumsum(dt[, .N, by=by]$N), ]   # take last row from each group
> }