[datatable-help] Rolling Joins Replicated in Java MapReduce

Dan LaBar danielrlabar at gmail.com
Wed Dec 3 22:41:35 CET 2014


You may want to look into Spark SQL.  There is currently discussion on
adding support for range joins <https://github.com/apache/spark/pull/2939>,
which I think are similar to rolling joins in data.table.

I started looking into rmr2, but Hive and Spark SQL look like better
options for my use cases.
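
For anyone prototyping this outside R, the core lookup behind a rolling
join (carry the last observation forward, as with roll=TRUE) can be
sketched as a binary search for the last key at or before each lookup
value. This is only a minimal single-machine sketch: the class and method
names are hypothetical, it is not a port of bmerge.c, and it leaves out
grouping columns and any MapReduce partitioning.

```java
import java.util.Arrays;

// Hypothetical illustration only; not taken from data.table or Spark.
final class RollingJoin {

    // Returns the index of the last element of sortedKeys that is <= query,
    // or -1 if every key is greater than query. sortedKeys must be ascending.
    static int rollIndex(long[] sortedKeys, long query) {
        int lo = 0;
        int hi = sortedKeys.length - 1;
        int ans = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (sortedKeys[mid] <= query) {
                ans = mid;      // candidate match; look for a later one
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return ans;
    }

    // For each query, picks the value attached to the most recent key at or
    // before it; null when no such key exists.
    static String[] rollJoin(long[] sortedKeys, String[] values, long[] queries) {
        String[] out = new String[queries.length];
        for (int i = 0; i < queries.length; i++) {
            int idx = rollIndex(sortedKeys, queries[i]);
            out[i] = idx >= 0 ? values[idx] : null;
        }
        return out;
    }

    public static void main(String[] args) {
        long[] times = {1, 5, 10};
        String[] labels = {"a", "b", "c"};
        long[] queries = {0, 1, 7, 12};
        // prints [null, a, b, c]
        System.out.println(Arrays.toString(rollJoin(times, labels, queries)));
    }
}
```

In a distributed setting, one plausible arrangement is to partition both
tables by the join group and run a lookup like this inside each partition,
but that is an assumption about the design, not something tested here.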


On Wed, Dec 3, 2014 at 6:00 AM, <datatable-help-request at lists.r-forge.r-project.org> wrote:

> ---------- Forwarded message ----------
> From: "Mike.Gahan" <michael.gahan at gmail.com>
> To: datatable-help at lists.r-forge.r-project.org
> Cc:
> Date: Tue, 2 Dec 2014 19:47:38 -0800 (PST)
> Subject: [datatable-help] Rolling Joins Replicated in Java MapReduce
> Hello all,
>
> I absolutely love the rolling join capabilities of data.table. They are
> extremely useful for the work I do. However, sometimes I work with data
> that is too large to fit into RAM (even when using a large server). I want
> to implement this rolling join code in a Java MapReduce setting to be able
> to leverage some of the other resources available at the company I work
> for. Unfortunately, I am not an experienced Java programmer. I figured that
> a project like this would provide an excellent incentive to learn this
> skill.
>
> My question is this: which current data.table code for rolling joins would
> be most useful to reference in starting this project? I am guessing the
> bmerge.c code
> <https://github.com/Rdatatable/data.table/blob/master/src/bmerge.c> has
> much of what I want. Is there any other code in the data.table package I
> should be aware of? Any other advice that might make this process go more
> smoothly? I know the function is based on a modified binary search
> algorithm. Is anyone aware of any libraries that might help this along?
>
> I really appreciate all help.
> Mike
>
>
>
> --
> View this message in context:
> http://r.789695.n4.nabble.com/Rolling-Joins-Replicated-in-Java-MapReduce-tp4700329.html
> Sent from the datatable-help mailing list archive at Nabble.com.
>
>
>
> ---------- Forwarded message ----------
> From: Michael Smith <my.r.help at gmail.com>
> To: "Mike.Gahan" <michael.gahan at gmail.com>,
> datatable-help at lists.r-forge.r-project.org
> Cc:
> Date: Wed, 03 Dec 2014 14:44:11 +0800
> Subject: Re: [datatable-help] Rolling Joins Replicated in Java MapReduce
> Maybe it is easier to build what you're looking for by contributing to
> plyrmr:
>
> https://github.com/RevolutionAnalytics/plyrmr
>
> It already implements "plyr for Hadoop" on top of the rmr2 package. Not
> sure whether merging is already implemented, but using rmr2 it should not
> be prohibitively difficult (hopefully).
>
> Best,
> M
>
>