<div dir="ltr">Thanks Arun, I like your approach. I had looked at that possibility, although I had not seen the SO posting, which is indeed relevant. The .I solution was slower than I expected, particularly for millions of rows. Here are some<div>
timings for 200k and 300k rows.<br><div><div><br></div><div><div><font face="courier new, monospace">> litd = disc_allc200k_dt[1:200000,]</font></div><div><font face="courier new, monospace">> microbenchmark(rowsWmaxVinG( litd, "score", "snp" ))</font></div>
<div><font face="courier new, monospace">Unit: milliseconds</font></div><div><font face="courier new, monospace"> expr min lq median uq</font></div><div><font face="courier new, monospace"> rowsWmaxVinG(litd, "score", "snp") 86.83909 87.45823 88.16629 89.26693</font></div>
<div><font face="courier new, monospace"> max neval</font></div><div><font face="courier new, monospace"> 440.0069 100</font></div><div><font face="courier new, monospace">> microbenchmark(litd[litd[, .I[which.max(score)], snp]$V1 ])</font></div>
<div><font face="courier new, monospace">Unit: milliseconds</font></div><div><font face="courier new, monospace"> expr min lq median uq</font></div><div><font face="courier new, monospace"> litd[litd[, .I[which.max(score)], snp]$V1] 241.3669 252.2612 279.342 602.113</font></div>
<div><font face="courier new, monospace"> max neval</font></div><div><font face="courier new, monospace"> 657.7055 100</font></div><div><font face="courier new, monospace">> litd = disc_allc200k_dt[1:300000,]</font></div>
<div><font face="courier new, monospace">> microbenchmark(rowsWmaxVinG( litd, "score", "snp" ))</font></div><div><font face="courier new, monospace">Unit: milliseconds</font></div><div><font face="courier new, monospace"> expr min lq median uq</font></div>
<div><font face="courier new, monospace"> rowsWmaxVinG(litd, "score", "snp") 119.6237 120.9789 121.6302 122.7155</font></div><div><font face="courier new, monospace"> max neval</font></div><div><font face="courier new, monospace"> 489.1918 100</font></div>
<div><font face="courier new, monospace">> microbenchmark(litd[litd[, .I[which.max(score)], snp]$V1 ])</font></div><div><font face="courier new, monospace">Unit: milliseconds</font></div><div><font face="courier new, monospace"> expr min lq median uq</font></div>
<div><font face="courier new, monospace"> litd[litd[, .I[which.max(score)], snp]$V1] 324.7394 347.5972 684.6746 693.456</font></div><div><font face="courier new, monospace"> max neval</font></div><div><font face="courier new, monospace"> 1607.186 100</font></div>
</div><div><br></div></div></div><div>The two approaches return different rows when scores are tied within a group; otherwise the .N-based approach seems to work. I would like to confirm that setkeyv performs the sorting that the .N-based approach relies on.</div>
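<div><br></div><div>A minimal sketch of how one might check both points, using mtcars and a small made-up table. The table tied and the names last_rows, group_max, first_max and last_max are illustrative only and not part of the code above. It assumes that setkeyv sorts ascending by the key columns and keeps tied rows in their original relative order; under that assumption the keyed approach returns the last tied row of a group, while which.max returns the first.</div>
<pre>
library(data.table)

# Check 1: after setkeyv(dt, c(by, V)) the last row of each group
# should carry the group maximum of V.
ddt = data.table(mtcars)
setkeyv(ddt, c("cyl", "mpg"))                        # sort by cyl, then mpg
last_rows = ddt[cumsum(ddt[, .N, by = "cyl"]$N), ]   # last row per group
group_max = ddt[, list(mpg = max(mpg)), by = "cyl"]  # maxima, independent of row order
stopifnot(identical(key(ddt), c("cyl", "mpg")))
stopifnot(all(last_rows$mpg == group_max$mpg))

# Check 2: ties. Rows 1 and 2 share the maximum score in group "a".
tied = data.table(snp = c("a", "a", "a", "b"),
                  score = c(2, 2, 1, 5),
                  id = 1:4)
# .I approach: which.max picks the first maximal row within each group
first_max = tied[tied[, .I[which.max(score)], by = snp]$V1]
# keyed approach: picks whichever tied row ends up last after sorting
setkeyv(tied, c("snp", "score"))
last_max = tied[cumsum(tied[, .N, by = "snp"]$N), ]
# the maxima agree even if the selected rows do not
stopifnot(all(first_max$score == last_max$score))
print(first_max$id)   # compare with last_max$id to see which tied row each keeps
print(last_max$id)
</pre>
<div>If all rows attaining the maximum are wanted, a filter such as score == max(score) within each group would be one option, at some additional cost.</div>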
<div><br></div><div><div>> sessionInfo()</div><div>R Under development (unstable) (2014-02-02 r64913)</div><div>Platform: x86_64-unknown-linux-gnu (64-bit)</div><div><br></div><div>locale:</div><div>[1] C</div><div><br>
</div><div>attached base packages:</div><div>[1] stats graphics grDevices datasets utils tools methods </div><div>[8] base </div><div><br></div><div>other attached packages:</div><div>[1] microbenchmark_1.3-0 data.table_1.9.2 weaver_1.29.1 </div>
<div>[4] codetools_0.2-8 digest_0.6.4 BiocInstaller_1.13.3</div><div><br></div><div>loaded via a namespace (and not attached):</div><div>[1] Rcpp_0.11.0 plyr_1.8.1 reshape2_1.2.2 stringr_0.6.2 </div></div>
<div><br></div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Mon, Mar 10, 2014 at 9:08 AM, Arunkumar Srinivasan <span dir="ltr"><<a href="mailto:aragorn168b@gmail.com" target="_blank">aragorn168b@gmail.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word"><div style="font-family:Helvetica,Arial;font-size:13px;color:rgba(0,0,0,1.0);margin:0px;line-height:auto">
Hi Vincent,</div><div style="font-family:Helvetica,Arial;font-size:13px;color:rgba(0,0,0,1.0);margin:0px;line-height:auto"><br></div><div style="font-family:Helvetica,Arial;font-size:13px;color:rgba(0,0,0,1.0);margin:0px;line-height:auto">
Have you checked out the special variable `.I`? Have a look at `?data.table`. This SO post may also be relevant: <a href="http://stackoverflow.com/questions/21198937/subset-data-table-using-min-condition/21199009#21199009" target="_blank">http://stackoverflow.com/questions/21198937/subset-data-table-using-min-condition/21199009#21199009</a></div>
<div><div style="font-family:helvetica,arial;font-size:13px">Arun</div></div> <div style="color:gray"><hr>From: <span style="color:black">Vincent Carey</span> <a href="mailto:stvjc@channing.harvard.edu" target="_blank">Vincent Carey</a><br>
Reply: <span style="color:black">Vincent Carey</span> <a href="mailto:stvjc@channing.harvard.edu" target="_blank">stvjc@channing.harvard.edu</a><br>Date: <span style="color:black">March 10, 2014 at 4:33:27 AM</span><br>To: <span style="color:black"><a href="mailto:datatable-help@lists.r-forge.r-project.org" target="_blank">datatable-help@lists.r-forge.r-project.org</a></span> <a href="mailto:datatable-help@lists.r-forge.r-project.org" target="_blank">datatable-help@lists.r-forge.r-project.org</a><br>
Subject: <span style="color:black"> [datatable-help] checking an approach to filtering rows in a data.table <br></span></div> <blockquote type="cite"><span><div><div><div><div class="h5">
<div dir="ltr">
<div>I have looked around for code on row filtering with
data.table, but have</div>
<div>not found anything addressing this use case.</div>
<div><br></div>
<div>I want to retrieve the rows satisfying a certain condition
within groups, in this case having the maximum value for a specific
variable. The following</div>
<div>seems to work, but I wonder if there is a more direct
approach.</div>
<div><br></div>
<div>rowsWmaxVinG = function(dt, V, by) {</div>
<div>#</div>
<div># filter dt to the rows possessing max value of</div>
<div># variable V within groups formed using by</div>
<div>#</div>
<div># example: data(mtcars)</div>
<div># ddt = data.table(mtcars)</div>
<div>#> rowsWmaxVinG( ddt, by="cyl", V="mpg")</div>
<div># mpg cyl disp hp drat
wt qsec vs am gear carb</div>
<div>#1: 33.9 4 71.1 65 4.22 1.835 19.90 1
1 4 1</div>
<div>#2: 21.4 6 258.0 110 3.08 3.215 19.44 1 0
3 1</div>
<div>#3: 19.2 8 400.0 175 3.08 3.845 17.05 0 0
3 2</div>
<div>#</div>
<div> setkeyv(dt, c(by, V)) # sort within groups</div>
<div> dt[ cumsum(dt[, .N, by=by]$N), ] # take last row
from each group</div>
<div>}</div>
</div></div></div>
</div></div></span></blockquote></div>
</blockquote></div><br></div>