[datatable-help] multiple data.table calls

Frank Erickson fperickson at wisc.edu
Mon Mar 16 14:13:21 CET 2015


Oh. In that case, I'd suggest checking only using the month and year, not
the day. You'll get some false positives, but the data should be small
enough to merge, I guess. It depends on your application whether that's
tolerable.

--Frank

On Sun, Mar 15, 2015 at 6:59 PM, Nathaniel Graham <npgraham1 at gmail.com>
wrote:

> Thanks for the suggestion!  Using a data.table of all possible meeting
> dates & branches and joining it to the employment history didn't occur to
> me.  Unfortunately, even after tinkering with it for a bit, the join (even
> though it's a temporary structure) isn't feasible due to memory
> usage--meet.stratton and employ.hist joined produce a table of billions of
> rows.  So I guess I was wrong about memory not being an issue!
>
> A note about terminology, because I wasn't very clear: I define a Stratton
> 'alum' as someone that actually at Stratton-Oakmont; I don't have a term
> for the brokers that Stratton alums later meet, even though they're the
> ones I need to find.
>
> In case someone stumbles across this later:
>
> The meet.stratton table of possible dates and branches is specified as
> (and again, I hope the formatting comes through):
>
> meet.stratton <- unique(employ.hist[icrdn %in% stratton.people,
>                                     list(branch, date =
> as.Date(fromdate:todate)),
>                                     by = "job.index"],
>                         by = c("branch", "date"))
>
> The unique() call is important to get right.  Obviously, Frank didn't have
> the opportunity to experiment with the data (it's too big to pass around,
> and it's built from proprietary data).  Also, I use the branch rather than
> the whole firm, as it's not so clear that just working at the same firm is
> meaningful--many broker-dealers have branches all over the country.  It's
> also probably easier to drop Stratton people from the final results
> explicitly, doing something like:
>
> met.stratton.people <- met.stratton.people[!(icrdn %in% stratton.people)]
>
> I'm thinking about cooking something up using foverlaps(), although I'll
> need to learn its ins and outs first.
>
> -------
> Nathaniel Graham
> npgraham1 at gmail.com
> npgraham1 at uky.edu
> https://sites.google.com/site/npgraham1/
>
> On Sun, Mar 15, 2015 at 12:16 PM, Frank Erickson <fperickson at wisc.edu>
> wrote:
>
>> I'd suggest:
>>
>> (1) Get a table identifying the condition "meeting someone who at some
>> point works at Stratton." (They aren't really "alums" if they haven't
>> worked there yet, but this is the definition you seem to be looking for.)
>> You can do this by looking at any (firm,date) combinations that involve
>> bumping into such a person:
>>
>> meet.stratton <- unique(employ.hist[icrdn %in%
>> stratton.people,list(fcrdn,date=fromdate:todate)])
>>
>> (2) Find people who meet the conditions:
>>
>> setkey(employ.hist,fcrdn)
>> met.stratton.people <- employ.hist[meet.stratton,any(date>= startdate &
>> date <= todate),by="icrdn,fcrdn"][V1==TRUE,unique(icrdn)]
>>
>> (3) If you want to exclude Stratton folks, then use setdiff()
>>
>> --Frank
>>
>> On Sat, Mar 14, 2015 at 4:19 PM, Nathaniel Graham <npgraham1 at gmail.com>
>> wrote:
>>
>>> There's particular problem I often have, and I'm hoping someone can tell
>>> me how to speed it up in data.table.  It seems to involve a sort of
>>> recursion that data.table (as I'm using it) doesn't do well with, where for
>>> each record in a set, I do a another search within the same table.  I hope
>>> the formatting of the code below is legible--it's a lot easier to read in
>>> the RStudio text editor!
>>>
>>> I have a moderately large (more than 3 million rows) data.table of the
>>> employment histories of brokers in the US.  Each row is an employment
>>> record, with a unique individual id (icrdn), a unique firm id (fcrdn), a
>>> branch identifier (branch), start and end dates (fromdate and todate), and
>>> a few other items (each row has a unique id as well, called job.index).
>>> For example, finding all the brokers that ever worked at Stratton Oakmont
>>> (from the Wolf of Wall Street):
>>>
>>> employ.hist[fcrdn == 18692, icrdn]
>>>
>>> where fcrdn is the firm identifier, 18692 is Stratton's ID, and icrdn is
>>> the individual identifier.
>>>
>>> What I want is to find all the individuals that ever met a Stratton
>>> alum.  Specifically, every icrdn such that the branch == a branch a
>>> Stratton alum ever worked at and the start and end dates overlap.  The only
>>> way I've found to do so involves something like this:
>>>
>>> find_brokers_by_single_branch <- cmpfun(function(sdt, edt, brnch) {
>>>   employ.hist[fromdate <= sdt & todate >= edt & branch == brnch,
>>>               list(icrdn, branch, job.index, fcrdn)]
>>> })
>>>
>>> stratton.people <- employ.hist[fcrdn == 18692, icrdn]
>>> stratton.contacts <- employ.hist[icrdn %in% stratton.people,
>>>                                  find_brokers_by_single_branch(fromdate,
>>> todate, branch),
>>>                                  by = "job.index"]
>>>
>>> This works, but effectively means calling the data.table '[' function
>>> thousands of times, once for each job entry
>>> a Stratton broker ever had (which are in the thousands, as many left
>>> before the government busted the place
>>> and are still in the industry).  It's quite slow, and I'm hoping someone
>>> can show me a way to speed it up, as I have
>>> many similar tasks, some of which are vastly larger.  Memory really
>>> isn't an issue for me (32 GB) and CPU shouldn't be either (Intel i7-4770
>>> 3.4GHz), in case that helps.
>>>
>>> -------
>>> Nathaniel Graham
>>> npgraham1 at gmail.com
>>> npgraham1 at uky.edu
>>> https://sites.google.com/site/npgraham1/
>>>
>>> _______________________________________________
>>> datatable-help mailing list
>>> datatable-help at lists.r-forge.r-project.org
>>>
>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>>
>>
>>
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20150316/f48874da/attachment.html>


More information about the datatable-help mailing list