[datatable-help] multiple data.table calls

Frank Erickson fperickson at wisc.edu
Sun Mar 15 17:16:26 CET 2015


I'd suggest:

(1) Get a table identifying the condition "meeting someone who at some
point works at Stratton." (They aren't really "alums" if they haven't
worked there yet, but this is the definition you seem to be looking for.)
You can do this by looking at any (firm,date) combinations that involve
bumping into such a person:

meet.stratton <- unique(employ.hist[icrdn %in%
stratton.people,list(fcrdn,date=fromdate:todate)])

(2) Find people who meet the conditions:

setkey(employ.hist,fcrdn)
met.stratton.people <- employ.hist[meet.stratton,any(date>= startdate &
date <= todate),by="icrdn,fcrdn"][V1==TRUE,unique(icrdn)]

(3) If you want to exclude Stratton folks, then use setdiff()

--Frank

On Sat, Mar 14, 2015 at 4:19 PM, Nathaniel Graham <npgraham1 at gmail.com>
wrote:

> There's particular problem I often have, and I'm hoping someone can tell
> me how to speed it up in data.table.  It seems to involve a sort of
> recursion that data.table (as I'm using it) doesn't do well with, where for
> each record in a set, I do a another search within the same table.  I hope
> the formatting of the code below is legible--it's a lot easier to read in
> the RStudio text editor!
>
> I have a moderately large (more than 3 million rows) data.table of the
> employment histories of brokers in the US.  Each row is an employment
> record, with a unique individual id (icrdn), a unique firm id (fcrdn), a
> branch identifier (branch), start and end dates (fromdate and todate), and
> a few other items (each row has a unique id as well, called job.index).
> For example, finding all the brokers that ever worked at Stratton Oakmont
> (from the Wolf of Wall Street):
>
> employ.hist[fcrdn == 18692, icrdn]
>
> where fcrdn is the firm identifier, 18692 is Stratton's ID, and icrdn is
> the individual identifier.
>
> What I want is to find all the individuals that ever met a Stratton alum.
> Specifically, every icrdn such that the branch == a branch a Stratton alum
> ever worked at and the start and end dates overlap.  The only way I've
> found to do so involves something like this:
>
> find_brokers_by_single_branch <- cmpfun(function(sdt, edt, brnch) {
>   employ.hist[fromdate <= sdt & todate >= edt & branch == brnch,
>               list(icrdn, branch, job.index, fcrdn)]
> })
>
> stratton.people <- employ.hist[fcrdn == 18692, icrdn]
> stratton.contacts <- employ.hist[icrdn %in% stratton.people,
>                                  find_brokers_by_single_branch(fromdate,
> todate, branch),
>                                  by = "job.index"]
>
> This works, but effectively means calling the data.table '[' function
> thousands of times, once for each job entry
> a Stratton broker ever had (which are in the thousands, as many left
> before the government busted the place
> and are still in the industry).  It's quite slow, and I'm hoping someone
> can show me a way to speed it up, as I have
> many similar tasks, some of which are vastly larger.  Memory really isn't
> an issue for me (32 GB) and CPU shouldn't be either (Intel i7-4770 3.4GHz),
> in case that helps.
>
> -------
> Nathaniel Graham
> npgraham1 at gmail.com
> npgraham1 at uky.edu
> https://sites.google.com/site/npgraham1/
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20150315/18d86689/attachment.html>


More information about the datatable-help mailing list