<div dir="ltr">Oh. In that case, I'd suggest checking using only the month and year, not the day. You'll get some false positives, but the data should be small enough to merge, I guess. It depends on your application whether that's tolerable.<div><br></div><div>--Frank</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Sun, Mar 15, 2015 at 6:59 PM, Nathaniel Graham <span dir="ltr"><<a href="mailto:npgraham1@gmail.com" target="_blank">npgraham1@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Thanks for the suggestion! Using a data.table of all possible meeting dates & branches and joining it to the employment history didn't occur to me. Unfortunately, even after tinkering with it for a bit, the join (even though it's a temporary structure) isn't feasible due to memory usage--meet.stratton and employ.hist joined produce a table of billions of rows. So I guess I was wrong about memory not being an issue!<div><br></div><div>A note about terminology, because I wasn't very clear: I define a Stratton 'alum' as someone who actually worked at Stratton Oakmont; I don't have a term for the brokers that Stratton alums later meet, even though they're the ones I need to find.</div><div><br></div><div>In case someone stumbles across this later:</div><div><br></div><div>The meet.stratton table of possible dates and branches is specified as (and again, I hope the formatting comes through):</div><span class=""><div><br></div><div><font face="monospace, monospace">meet.stratton <- unique(employ.hist[icrdn %in% stratton.people, </font></div></span><div><font face="monospace, monospace"> list(branch, date = as.Date(fromdate:todate)), </font></div><div><font face="monospace, monospace"> by = "job.index"], </font></div><div><font face="monospace, monospace"> by = c("branch", "date"))</font><br></div><div><font face="monospace, monospace"><br></font></div><div><font face="arial, 
helvetica, sans-serif">The unique() call is important to get right. Obviously, Frank didn't have the opportunity to experiment with the data (it's too big to pass around, and it's built from proprietary data). Also, I use the branch rather than the whole firm, as it's not so clear that just working at the same firm is meaningful--many broker-dealers have branches all over the country. It's also probably easier to drop Stratton people from the final results explicitly, doing something like:</font></div><div><font face="arial, helvetica, sans-serif"><br></font></div><div><font face="monospace, monospace">met.stratton.people <- met.stratton.people[!(icrdn %in% stratton.people)]</font></div><div><font face="arial, helvetica, sans-serif"><br></font></div><div><font face="arial, helvetica, sans-serif">I'm thinking about cooking something up using foverlaps(), although I'll need to learn its ins and outs first.</font></div></div><div class="gmail_extra"><span class=""><br clear="all"><div><div><div dir="ltr">-------<br>Nathaniel Graham<br><a href="mailto:npgraham1@gmail.com" target="_blank">npgraham1@gmail.com</a><br><a href="mailto:npgraham1@uky.edu" target="_blank">npgraham1@uky.edu</a><div><a href="https://sites.google.com/site/npgraham1/" target="_blank">https://sites.google.com/site/npgraham1/</a><br></div></div></div></div>
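<div>A minimal sketch of the foverlaps() idea mentioned above, on invented toy data. The table contents, the helper name stratton.jobs, and the use of IDate columns are all assumptions for illustration, not the real data or a tested solution for it. The point is that an interval join only materializes the matching job pairs instead of one row per day:</div>

```r
library(data.table)

# Toy stand-in for employ.hist; all values are invented for illustration.
employ.hist <- data.table(
  job.index = 1:5,
  icrdn     = c(101L, 102L, 103L, 104L, 101L),
  fcrdn     = c(18692L, 200L, 200L, 200L, 200L),
  branch    = c("A", "B", "B", "B", "B"),
  fromdate  = as.IDate(c("1995-01-01", "1998-01-01", "1999-06-01",
                         "2003-01-01", "1997-01-01")),
  todate    = as.IDate(c("1996-12-31", "2000-12-31", "2001-12-31",
                         "2004-12-31", "1999-12-31"))
)

stratton.people <- employ.hist[fcrdn == 18692, unique(icrdn)]

# One row per job ever held by a Stratton alum (hypothetical helper table).
stratton.jobs <- employ.hist[icrdn %in% stratton.people,
                             list(branch, fromdate, todate)]
# foverlaps() needs y keyed, with the interval start/end as the last two key columns.
setkey(stratton.jobs, branch, fromdate, todate)

# Match every job to alum jobs at the same branch with overlapping dates.
overlaps <- foverlaps(employ.hist, stratton.jobs,
                      by.x = c("branch", "fromdate", "todate"),
                      type = "any", nomatch = 0L)

# Drop the alums themselves, as in the thread.
met.stratton.people <- overlaps[!(icrdn %in% stratton.people), unique(icrdn)]
```

<div>On this toy data, brokers 102 and 103 share branch B with alum 101 during overlapping periods, while 104 arrives after 101 has left and is not returned.</div>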
<br></span><div><div class="h5"><div class="gmail_quote">On Sun, Mar 15, 2015 at 12:16 PM, Frank Erickson <span dir="ltr"><<a href="mailto:fperickson@wisc.edu" target="_blank">fperickson@wisc.edu</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div>I'd suggest:<br></div><div><br></div><div>(1) Get a table identifying the condition "meeting someone who at some point works at Stratton." (They aren't really "alums" if they haven't worked there yet, but this is the definition you seem to be looking for.) You can do this by looking at any (firm,date) combinations that involve bumping into such a person:</div><div><br></div><div>meet.stratton <- unique(employ.hist[icrdn %in% stratton.people,list(fcrdn,date=fromdate:todate)])</div><div><br></div><div>(2) Find people who meet the conditions:</div><div><br></div><div>setkey(employ.hist,fcrdn)</div><div>met.stratton.people <- employ.hist[meet.stratton,any(date>= fromdate & date <= todate),by="icrdn,fcrdn"][V1==TRUE,unique(icrdn)]</div><div><br></div><div>(3) If you want to exclude Stratton folks, then use setdiff()</div><div><br></div><div>--Frank</div></div><div class="gmail_extra"><br><div class="gmail_quote"><div><div>On Sat, Mar 14, 2015 at 4:19 PM, Nathaniel Graham <span dir="ltr"><<a href="mailto:npgraham1@gmail.com" target="_blank">npgraham1@gmail.com</a>></span> wrote:<br></div></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div><div dir="ltr">There's a particular problem I often have, and I'm hoping someone can tell me how to speed it up in data.table. It seems to involve a sort of recursion that data.table (as I'm using it) doesn't do well with, where for each record in a set, I do another search within the same table. 
I hope the formatting of the code below is legible--it's a lot easier to read in the RStudio text editor!<div><br></div><div>I have a moderately large (more than 3 million rows) data.table of the employment histories of brokers in the US. Each row is an employment record, with a unique individual id (icrdn), a unique firm id (fcrdn), a branch identifier (branch), start and end dates (fromdate and todate), and a few other items (each row has a unique id as well, called job.index). For example, finding all the brokers that ever worked at Stratton Oakmont (from the Wolf of Wall Street):</div><div><br></div><div><font face="monospace, monospace">employ.hist[fcrdn == 18692, icrdn]</font></div><div><br></div><div>where fcrdn is the firm identifier, 18692 is Stratton's ID, and icrdn is the individual identifier.</div><div><br></div><div>What I want is to find all the individuals that ever met a Stratton alum. Specifically, every icrdn such that the branch == a branch a Stratton alum ever worked at and the start and end dates overlap. 
The only way I've found to do so involves something like this:</div><div><br></div><div><div><font face="monospace, monospace">library(compiler)  # provides cmpfun()</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace"># Jobs at branch brnch whose dates overlap the interval [sdt, edt]</font></div><div><font face="monospace, monospace">find_brokers_by_single_branch <- cmpfun(function(sdt, edt, brnch) {</font></div><div><font face="monospace, monospace"> employ.hist[fromdate <= edt & todate >= sdt & branch == brnch,</font></div><div><font face="monospace, monospace"> list(icrdn, branch, job.index, fcrdn)]</font></div><div><font face="monospace, monospace">})</font></div></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">stratton.people <- employ.hist[fcrdn == 18692, icrdn]</font></div><div><font face="monospace, monospace">stratton.contacts <- employ.hist[icrdn %in% stratton.people,</font></div><div><font face="monospace, monospace"> find_brokers_by_single_branch(fromdate, todate, branch),</font></div><div><font face="monospace, monospace"> by = "job.index"]</font></div><div><br></div><div>This works, but it effectively means calling the data.table '[' function thousands of times, once for each job entry a Stratton broker ever had (there are thousands, as many left before the government busted the place and are still in the industry). It's quite slow, and I'm hoping someone can show me a way to speed it up, as I have many similar tasks, some of which are vastly larger. Memory really isn't an issue for me (32 GB) and CPU shouldn't be either (Intel i7-4770 3.4GHz), in case that helps.</div><div><br clear="all"><div><div><div dir="ltr">-------<br>Nathaniel Graham<br><a href="mailto:npgraham1@gmail.com" target="_blank">npgraham1@gmail.com</a><br><a href="mailto:npgraham1@uky.edu" target="_blank">npgraham1@uky.edu</a><div><a href="https://sites.google.com/site/npgraham1/" target="_blank">https://sites.google.com/site/npgraham1/</a><br></div></div></div></div>
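<div>The "start and end dates overlap" condition described above can be written as a small vectorized check. A minimal base R sketch follows; the function name intervals_overlap and all of the dates are made up for illustration. Two closed intervals overlap when each starts no later than the other ends, which is looser than requiring one interval to contain the other:</div>

```r
# Two closed intervals [a.from, a.to] and [b.from, b.to] overlap
# exactly when each one starts no later than the other ends.
intervals_overlap <- function(a.from, a.to, b.from, b.to) {
  a.from <= b.to & b.from <= a.to
}

# Invented dates for illustration only: one alum job ...
alum.from <- as.Date("1997-01-01")
alum.to   <- as.Date("1999-12-31")

# ... and three other brokers' jobs at the same branch.
other.from <- as.Date(c("1998-01-01", "1999-06-01", "2003-01-01"))
other.to   <- as.Date(c("2000-12-31", "2001-12-31", "2004-12-31"))

# Vectorized over the three jobs; the alum interval is recycled.
overlap.flags <- intervals_overlap(other.from, other.to, alum.from, alum.to)

# A stricter containment test (the other job spans the whole alum job)
# would miss partial overlaps such as the first two jobs here.
contain.flags <- other.from <= alum.from & other.to >= alum.to
```

<div>Here the first two jobs overlap the alum's job only partially, so they pass the overlap test but not the containment test; the third job starts after the alum has left and fails both.</div>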
</div></div>
<br></div></div>_______________________________________________<br>
datatable-help mailing list<br>
<a href="mailto:datatable-help@lists.r-forge.r-project.org" target="_blank">datatable-help@lists.r-forge.r-project.org</a><br>
<a href="https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help" target="_blank">https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help</a><br></blockquote></div><br></div>
</blockquote></div><br></div></div></div>
</blockquote></div><br></div>