<div dir="ltr"><div>I'd suggest:<br></div><div><br></div><div>(1) Get a table identifying the condition "meeting someone who at some point works at Stratton." (They aren't really "alums" if they haven't worked there yet, but this is the definition you seem to be looking for.) You can do this by listing every (firm, date) combination at which someone could bump into such a person, expanding each job's date range separately (grouping by job.index, since fromdate:todate only works on one row at a time):</div><div><br></div><div>meet.stratton <- unique(employ.hist[icrdn %in% stratton.people, list(date=fromdate:todate), by="fcrdn,job.index"][, list(fcrdn,date)])</div><div><br></div><div>(2) Find people who meet the conditions:</div><div><br></div><div>setkey(employ.hist,fcrdn)</div><div>met.stratton.people <- employ.hist[meet.stratton,any(date >= fromdate & date <= todate),by="icrdn,fcrdn"][V1==TRUE,unique(icrdn)]</div><div><br></div><div>(3) If you want to exclude the Stratton folks themselves, use setdiff(met.stratton.people, stratton.people).</div><div><br></div><div>--Frank</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Sat, Mar 14, 2015 at 4:19 PM, Nathaniel Graham <span dir="ltr"><<a href="mailto:npgraham1@gmail.com" target="_blank">npgraham1@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">There's a particular problem I often have, and I'm hoping someone can tell me how to speed it up in data.table.  It seems to involve a sort of recursion that data.table (as I'm using it) doesn't do well with, where for each record in a set, I do another search within the same table.  I hope the formatting of the code below is legible--it's a lot easier to read in the RStudio text editor!<div><br></div><div>I have a moderately large (more than 3 million rows) data.table of the employment histories of brokers in the US.  Each row is an employment record, with a unique individual id (icrdn), a unique firm id (fcrdn), a branch identifier (branch), start and end dates (fromdate and todate), and a few other items (each row has a unique id as well, called job.index).  
For example, this finds all the brokers who ever worked at Stratton Oakmont (from The Wolf of Wall Street):</div><div><br></div><div><font face="monospace, monospace">employ.hist[fcrdn == 18692, icrdn]</font></div><div><br></div><div>where fcrdn is the firm identifier, 18692 is Stratton's ID, and icrdn is the individual identifier.</div><div><br></div><div>What I want is to find all the individuals that ever met a Stratton alum.  Specifically, every icrdn with a job at a branch where a Stratton alum ever worked, with employment dates that overlap the alum's.  The only way I've found to do so involves something like this:</div><div><br></div><div><div><font face="monospace, monospace">find_brokers_by_single_branch <- cmpfun(function(sdt, edt, brnch) {</font></div><div><font face="monospace, monospace">  employ.hist[fromdate <= edt & todate >= sdt & branch == brnch,</font></div><div><font face="monospace, monospace">              list(icrdn, branch, job.index, fcrdn)]</font></div><div><font face="monospace, monospace">})</font></div></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">stratton.people <- employ.hist[fcrdn == 18692, icrdn]</font></div><div><font face="monospace, monospace">stratton.contacts <- employ.hist[icrdn %in% stratton.people,</font></div><div><font face="monospace, monospace">                                 find_brokers_by_single_branch(fromdate, todate, branch),</font></div><div><font face="monospace, monospace">                                 by = "job.index"]</font></div><div><br></div><div>This works, but effectively means calling the data.table '[' function thousands of times, once for each job entry a Stratton broker ever had (which are in the thousands, as many left before the government busted the place and are still in the industry).  
It's quite slow, and I'm hoping someone can show me a way to speed it up, as I have many similar tasks, some of which are vastly larger.  Memory really isn't an issue for me (32 GB) and CPU shouldn't be either (Intel i7-4770 3.4GHz), in case that helps.</div><div><br clear="all"><div><div><div dir="ltr">-------<br>Nathaniel Graham<br><a href="mailto:npgraham1@gmail.com" target="_blank">npgraham1@gmail.com</a><br><a href="mailto:npgraham1@uky.edu" target="_blank">npgraham1@uky.edu</a><div><a href="https://sites.google.com/site/npgraham1/" target="_blank">https://sites.google.com/site/npgraham1/</a><br></div></div></div></div>

</div></div>

<br>_______________________________________________<br>

datatable-help mailing list<br>

<a href="mailto:datatable-help@lists.r-forge.r-project.org">datatable-help@lists.r-forge.r-project.org</a><br>

<a href="https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help" target="_blank">https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help</a><br></blockquote></div><br></div>