<div dir="ltr">Thanks Arun, I like your approach. I had looked at that possibility, although I had not seen the SO posting, which is indeed relevant.  The .I solution seemed slower than I expected, particularly for millions of rows.  Here are some<div>

timings for 200k and 300k rows.<br><div><div><br></div><div><div><font face="courier new, monospace">> litd = disc_allc200k_dt[1:200000,]</font></div><div><font face="courier new, monospace">> microbenchmark(rowsWmaxVinG( litd, "score", "snp" ))</font></div>

<div><font face="courier new, monospace">Unit: milliseconds</font></div><div><font face="courier new, monospace">                               expr      min       lq   median       uq</font></div><div><font face="courier new, monospace"> rowsWmaxVinG(litd, "score", "snp") 86.83909 87.45823 88.16629 89.26693</font></div>

<div><font face="courier new, monospace">      max neval</font></div><div><font face="courier new, monospace"> 440.0069   100</font></div><div><font face="courier new, monospace">> microbenchmark(litd[litd[, .I[which.max(score)], snp]$V1 ])</font></div>

<div><font face="courier new, monospace">Unit: milliseconds</font></div><div><font face="courier new, monospace">                                       expr      min       lq  median      uq</font></div><div><font face="courier new, monospace"> litd[litd[, .I[which.max(score)], snp]$V1] 241.3669 252.2612 279.342 602.113</font></div>

<div><font face="courier new, monospace">      max neval</font></div><div><font face="courier new, monospace"> 657.7055   100</font></div><div><font face="courier new, monospace">> litd = disc_allc200k_dt[1:300000,]</font></div>

<div><font face="courier new, monospace">> microbenchmark(rowsWmaxVinG( litd, "score", "snp" ))</font></div><div><font face="courier new, monospace">Unit: milliseconds</font></div><div><font face="courier new, monospace">                               expr      min       lq   median       uq</font></div>

<div><font face="courier new, monospace"> rowsWmaxVinG(litd, "score", "snp") 119.6237 120.9789 121.6302 122.7155</font></div><div><font face="courier new, monospace">      max neval</font></div><div><font face="courier new, monospace"> 489.1918   100</font></div>

<div><font face="courier new, monospace">> microbenchmark(litd[litd[, .I[which.max(score)], snp]$V1 ])</font></div><div><font face="courier new, monospace">Unit: milliseconds</font></div><div><font face="courier new, monospace">                                       expr      min       lq   median      uq</font></div>

<div><font face="courier new, monospace"> litd[litd[, .I[which.max(score)], snp]$V1] 324.7394 347.5972 684.6746 693.456</font></div><div><font face="courier new, monospace">      max neval</font></div><div><font face="courier new, monospace"> 1607.186   100</font></div>

</div><div><br></div></div></div><div>The two approaches return different rows when scores are tied within a group, but otherwise the .N-based approach seems to work.  I would like to verify that setkeyv performs the sorting that the .N-based approach needs in order to be valid.</div>
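<div><br></div><div>A minimal sketch of the tie behaviour, assuming data.table is installed. The toy table and its id column are made up for illustration and are not taken from the benchmark data. setkeyv sorts in ascending order by the key columns, so the last row of each group holds that group's maximum; which.max instead returns the first maximum it meets, so the two approaches can agree on the maximum score while selecting different rows among ties.</div>
<div><br></div>
<div><font face="courier new, monospace">library(data.table)</font></div>
<div><font face="courier new, monospace"># hypothetical toy data: group "a" has a tie at score 5</font></div>
<div><font face="courier new, monospace">toy = data.table(snp = c("a", "a", "a", "b", "b"), score = c(5, 2, 5, 1, 3), id = 1:5)</font></div>
<div><font face="courier new, monospace"># .I approach: which.max returns the first maximum within each group</font></div>
<div><font face="courier new, monospace">byI = toy[toy[, .I[which.max(score)], by = snp]$V1]</font></div>
<div><font face="courier new, monospace"># .N approach on a copy, because setkeyv reorders its argument by reference</font></div>
<div><font face="courier new, monospace">srt = copy(toy)</font></div>
<div><font face="courier new, monospace">setkeyv(srt, c("snp", "score"))</font></div>
<div><font face="courier new, monospace">byN = srt[cumsum(srt[, .N, by = snp]$N)]</font></div>
<div><font face="courier new, monospace"># both approaches give the same maximum score per group</font></div>
<div><font face="courier new, monospace">stopifnot(identical(byI$score, byN$score))</font></div>
<div><font face="courier new, monospace"># compare the id columns to see which tied row each approach selected</font></div>
<div><font face="courier new, monospace">byI$id</font></div>
<div><font face="courier new, monospace">byN$id</font></div>
<div><br></div><div>Working on a copy keeps the caller's table in its original row order, which rowsWmaxVinG as posted does not do.</div>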

<div><br></div><div><div>> sessionInfo()</div><div>R Under development (unstable) (2014-02-02 r64913)</div><div>Platform: x86_64-unknown-linux-gnu (64-bit)</div><div><br></div><div>locale:</div><div>[1] C</div><div><br>

</div><div>attached base packages:</div><div>[1] stats     graphics  grDevices datasets  utils     tools     methods  </div><div>[8] base     </div><div><br></div><div>other attached packages:</div><div>[1] microbenchmark_1.3-0 data.table_1.9.2     weaver_1.29.1       </div>

<div>[4] codetools_0.2-8      digest_0.6.4         BiocInstaller_1.13.3</div><div><br></div><div>loaded via a namespace (and not attached):</div><div>[1] Rcpp_0.11.0    plyr_1.8.1     reshape2_1.2.2 stringr_0.6.2 </div></div>

<div><br></div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Mon, Mar 10, 2014 at 9:08 AM, Arunkumar Srinivasan <span dir="ltr"><<a href="mailto:aragorn168b@gmail.com" target="_blank">aragorn168b@gmail.com</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word"><div style="font-family:Helvetica,Arial;font-size:13px;color:rgba(0,0,0,1.0);margin:0px;line-height:auto">

Hi Vincent,</div><div style="font-family:Helvetica,Arial;font-size:13px;color:rgba(0,0,0,1.0);margin:0px;line-height:auto"><br></div><div style="font-family:Helvetica,Arial;font-size:13px;color:rgba(0,0,0,1.0);margin:0px;line-height:auto">

Have you checked out the special variable `.I`? Have a look at `?data.table`. This SO post may also be relevant: <a href="http://stackoverflow.com/questions/21198937/subset-data-table-using-min-condition/21199009#21199009" target="_blank">http://stackoverflow.com/questions/21198937/subset-data-table-using-min-condition/21199009#21199009</a></div>

 <div><div style="font-family:helvetica,arial;font-size:13px">Arun</div></div> <div style="color:gray"><hr>From: <span style="color:black">Vincent Carey</span> <a href="mailto:stvjc@channing.harvard.edu" target="_blank">Vincent Carey</a><br>

Reply: <span style="color:black">Vincent Carey</span> <a href="mailto:stvjc@channing.harvard.edu" target="_blank">stvjc@channing.harvard.edu</a><br>Date: <span style="color:black">March 10, 2014 at 4:33:27 AM</span><br>To: <span style="color:black"><a href="mailto:datatable-help@lists.r-forge.r-project.org" target="_blank">datatable-help@lists.r-forge.r-project.org</a></span> <a href="mailto:datatable-help@lists.r-forge.r-project.org" target="_blank">datatable-help@lists.r-forge.r-project.org</a><br>

Subject: <span style="color:black"> [datatable-help] checking an approach to filtering rows in a data.table <br></span></div> <blockquote type="cite"><span><div><div><div><div class="h5">


<div dir="ltr">

<div>I have looked around for code on row filtering with data.table, but have not found anything addressing this use case.</div>

<div><br></div>

<div>I want to retrieve the rows satisfying a certain condition within groups, in this case having the maximum value for a specific variable.  The following seems to work, but I wonder if there is a more direct approach.</div>

<div><br></div>

<div>rowsWmaxVinG = function(dt, V, by) {</div>

<div>#</div>

<div># filter dt to the rows possessing max value of</div>

<div># variable V within groups formed using by</div>

<div>#</div>

<div># example: data(mtcars)</div>

<div># ddt = data.table(mtcars)</div>

<div>#> rowsWmaxVinG( ddt, by="cyl", V="mpg")</div>

<div>#    mpg cyl  disp  hp drat    wt  qsec vs am gear carb</div>

<div>#1: 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1</div>

<div>#2: 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1</div>

<div>#3: 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2</div>

<div>#</div>

<div> setkeyv(dt, c(by, V)) # sort within groups</div>

<div> dt[ cumsum(dt[, .N, by=by]$N), ]  # take last row from each group</div>

<div>}</div>

</div></div></div>


</div></div></span></blockquote></div>

</blockquote></div><br></div>