[datatable-help] multiple data.table calls

Nathaniel Graham npgraham1 at gmail.com
Sun Mar 15 23:59:03 CET 2015


Thanks for the suggestion!  Using a data.table of all possible meeting
dates & branches and joining it to the employment history didn't occur to
me.  Unfortunately, even after tinkering with it for a bit, the join (even
though it's a temporary structure) isn't feasible due to memory
usage--meet.stratton and employ.hist joined produce a table of billions of
rows.  So I guess I was wrong about memory not being an issue!

A note about terminology, because I wasn't very clear: I define a Stratton
'alum' as someone that actually at Stratton-Oakmont; I don't have a term
for the brokers that Stratton alums later meet, even though they're the
ones I need to find.

In case someone stumbles across this later:

The meet.stratton table of possible dates and branches is specified as (and
again, I hope the formatting comes through):

meet.stratton <- unique(employ.hist[icrdn %in% stratton.people,
                                    list(branch, date =
as.Date(fromdate:todate)),
                                    by = "job.index"],
                        by = c("branch", "date"))

The unique() call is important to get right.  Obviously, Frank didn't have
the opportunity to experiment with the data (it's too big to pass around,
and it's built from proprietary data).  Also, I use the branch rather than
the whole firm, as it's not so clear that just working at the same firm is
meaningful--many broker-dealers have branches all over the country.  It's
also probably easier to drop Stratton people from the final results
explicitly, doing something like:

met.stratton.people <- met.stratton.people[!(icrdn %in% stratton.people)]

I'm thinking about cooking something up using foverlaps(), although I'll
need to learn its ins and outs first.

-------
Nathaniel Graham
npgraham1 at gmail.com
npgraham1 at uky.edu
https://sites.google.com/site/npgraham1/

On Sun, Mar 15, 2015 at 12:16 PM, Frank Erickson <fperickson at wisc.edu>
wrote:

> I'd suggest:
>
> (1) Get a table identifying the condition "meeting someone who at some
> point works at Stratton." (They aren't really "alums" if they haven't
> worked there yet, but this is the definition you seem to be looking for.)
> You can do this by looking at any (firm,date) combinations that involve
> bumping into such a person:
>
> meet.stratton <- unique(employ.hist[icrdn %in%
> stratton.people,list(fcrdn,date=fromdate:todate)])
>
> (2) Find people who meet the conditions:
>
> setkey(employ.hist,fcrdn)
> met.stratton.people <- employ.hist[meet.stratton,any(date>= startdate &
> date <= todate),by="icrdn,fcrdn"][V1==TRUE,unique(icrdn)]
>
> (3) If you want to exclude Stratton folks, then use setdiff()
>
> --Frank
>
> On Sat, Mar 14, 2015 at 4:19 PM, Nathaniel Graham <npgraham1 at gmail.com>
> wrote:
>
>> There's particular problem I often have, and I'm hoping someone can tell
>> me how to speed it up in data.table.  It seems to involve a sort of
>> recursion that data.table (as I'm using it) doesn't do well with, where for
>> each record in a set, I do a another search within the same table.  I hope
>> the formatting of the code below is legible--it's a lot easier to read in
>> the RStudio text editor!
>>
>> I have a moderately large (more than 3 million rows) data.table of the
>> employment histories of brokers in the US.  Each row is an employment
>> record, with a unique individual id (icrdn), a unique firm id (fcrdn), a
>> branch identifier (branch), start and end dates (fromdate and todate), and
>> a few other items (each row has a unique id as well, called job.index).
>> For example, finding all the brokers that ever worked at Stratton Oakmont
>> (from the Wolf of Wall Street):
>>
>> employ.hist[fcrdn == 18692, icrdn]
>>
>> where fcrdn is the firm identifier, 18692 is Stratton's ID, and icrdn is
>> the individual identifier.
>>
>> What I want is to find all the individuals that ever met a Stratton
>> alum.  Specifically, every icrdn such that the branch == a branch a
>> Stratton alum ever worked at and the start and end dates overlap.  The only
>> way I've found to do so involves something like this:
>>
>> find_brokers_by_single_branch <- cmpfun(function(sdt, edt, brnch) {
>>   employ.hist[fromdate <= sdt & todate >= edt & branch == brnch,
>>               list(icrdn, branch, job.index, fcrdn)]
>> })
>>
>> stratton.people <- employ.hist[fcrdn == 18692, icrdn]
>> stratton.contacts <- employ.hist[icrdn %in% stratton.people,
>>                                  find_brokers_by_single_branch(fromdate,
>> todate, branch),
>>                                  by = "job.index"]
>>
>> This works, but effectively means calling the data.table '[' function
>> thousands of times, once for each job entry
>> a Stratton broker ever had (which are in the thousands, as many left
>> before the government busted the place
>> and are still in the industry).  It's quite slow, and I'm hoping someone
>> can show me a way to speed it up, as I have
>> many similar tasks, some of which are vastly larger.  Memory really isn't
>> an issue for me (32 GB) and CPU shouldn't be either (Intel i7-4770 3.4GHz),
>> in case that helps.
>>
>> -------
>> Nathaniel Graham
>> npgraham1 at gmail.com
>> npgraham1 at uky.edu
>> https://sites.google.com/site/npgraham1/
>>
>> _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org
>>
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20150315/d097561d/attachment.html>


More information about the datatable-help mailing list