[datatable-help] multiple data.table calls

Nathaniel Graham npgraham1 at gmail.com
Sat Mar 14 21:19:41 CET 2015


There's particular problem I often have, and I'm hoping someone can tell me
how to speed it up in data.table.  It seems to involve a sort of recursion
that data.table (as I'm using it) doesn't do well with, where for each
record in a set, I do a another search within the same table.  I hope the
formatting of the code below is legible--it's a lot easier to read in the
RStudio text editor!

I have a moderately large (more than 3 million rows) data.table of the
employment histories of brokers in the US.  Each row is an employment
record, with a unique individual id (icrdn), a unique firm id (fcrdn), a
branch identifier (branch), start and end dates (fromdate and todate), and
a few other items (each row has a unique id as well, called job.index).
For example, finding all the brokers that ever worked at Stratton Oakmont
(from the Wolf of Wall Street):

employ.hist[fcrdn == 18692, icrdn]

where fcrdn is the firm identifier, 18692 is Stratton's ID, and icrdn is
the individual identifier.

What I want is to find all the individuals that ever met a Stratton alum.
Specifically, every icrdn such that the branch == a branch a Stratton alum
ever worked at and the start and end dates overlap.  The only way I've
found to do so involves something like this:

find_brokers_by_single_branch <- cmpfun(function(sdt, edt, brnch) {
  employ.hist[fromdate <= sdt & todate >= edt & branch == brnch,
              list(icrdn, branch, job.index, fcrdn)]
})

stratton.people <- employ.hist[fcrdn == 18692, icrdn]
stratton.contacts <- employ.hist[icrdn %in% stratton.people,
                                 find_brokers_by_single_branch(fromdate,
todate, branch),
                                 by = "job.index"]

This works, but effectively means calling the data.table '[' function
thousands of times, once for each job entry
a Stratton broker ever had (which are in the thousands, as many left before
the government busted the place
and are still in the industry).  It's quite slow, and I'm hoping someone
can show me a way to speed it up, as I have
many similar tasks, some of which are vastly larger.  Memory really isn't
an issue for me (32 GB) and CPU shouldn't be either (Intel i7-4770 3.4GHz),
in case that helps.

-------
Nathaniel Graham
npgraham1 at gmail.com
npgraham1 at uky.edu
https://sites.google.com/site/npgraham1/
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20150314/e7a03793/attachment.html>


More information about the datatable-help mailing list