[Rprotobuf-commits] r921 - papers/jss
noreply at r-forge.r-project.org
Mon Dec 1 08:58:25 CET 2014
Author: jeroenooms
Date: 2014-12-01 08:58:25 +0100 (Mon, 01 Dec 2014)
New Revision: 921
Modified:
papers/jss/article.Rnw
Log:
Rewrite mapreduce introduction.
Modified: papers/jss/article.Rnw
===================================================================
--- papers/jss/article.Rnw 2014-12-01 02:00:53 UTC (rev 920)
+++ papers/jss/article.Rnw 2014-12-01 07:58:25 UTC (rev 921)
@@ -1063,37 +1063,35 @@
\section{Application: Distributed data collection with MapReduce}
\label{sec:mapreduce}
-Protocol Buffers have been used extensively at Google for almost all
-RPC protocols, and for storing structured information in a variety of
-persistent storage systems since 2000 \citep{dean2009designs}. The
-\pkg{RProtoBuf} package has been in widespread use by hundreds of
-statisticians and software engineers at Google since 2010. This
-section describes a simplified example of a common design pattern of
-collecting a large structured data set in one language for later
-analysis in \proglang{R}.
+Protocol Buffers are used extensively at Google for almost all
+RPC protocols, and to store structured information in a variety of
+persistent storage systems \citep{dean2009designs}. Since its
+initial release in 2010, hundreds of Google's statisticians and
+software engineers have used the \pkg{RProtoBuf} package on a
+daily basis to interact with these systems from within \proglang{R}.
+This section illustrates the power of Protocol Buffers to
+collect and manage large structured data in one language
+before analyzing it in \proglang{R}. Our example uses MapReduce
+\citep{dean2008mapreduce}, which has emerged over the last
+decade as a popular design pattern for parallel processing
+of big data on distributed computing clusters.
-Many large data sets in fields such as particle physics and information
-processing are stored in binned or histogram form in order to reduce
-the data storage requirements \citep{scott2009multivariate}. In the
-last decade, the MapReduce programming model \citep{dean2008mapreduce}
-has emerged as a popular design pattern that enables the processing of
-very large data sets on large compute clusters.
-
-Many types of data analysis over large data sets may involve very rare
+Large data sets in fields such as particle physics and information
+processing are often stored in binned (histogram) form in order
+to reduce storage requirements \citep{scott2009multivariate}.
+Because analysis of such large data sets may involve very rare
phenomena or deal with highly skewed data sets or inflexible
-raw data storage systems from which unbiased sampling is not feasible.
-In such situations, MapReduce and binning may be combined as a
+raw data storage systems, unbiased sampling is often not feasible.
+In these situations, MapReduce and binning may be combined as a
pre-processing step for a wide range of statistical and scientific
analyses \citep{blocker2013}.
There are two common patterns for generating histograms of large data
-sets in a single pass with MapReduce. In the first method, each
+sets in a single pass with MapReduce. In the first method, each
mapper task generates a histogram over a subset of the data that it
has been assigned, serializes this histogram and sends it to one or
more reducer tasks which merge the intermediate histograms from the
-mappers.
-
-In the second method, illustrated in
+mappers. In the second method, illustrated in
Figure~\ref{fig:mr-histogram-pattern1}, each mapper rounds a data
point to a bucket width and outputs that bucket as a key and '1' as a
value. Reducers then sum up all of the values with the same key and
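
As an aside, the binning step described in the new text is easy to
sketch in R. The following is an illustrative sketch only, not code
from the paper: it assumes RProtoBuf's serialize_pb() for the
serialization step, and the choice of 100 buckets is arbitrary.

library(RProtoBuf)

x <- rnorm(1e6)                           # large raw sample
h <- hist(x, breaks = 100, plot = FALSE)  # bin into roughly 100 buckets

## Keep only the bucket boundaries and counts; the raw values
## no longer need to be stored.
binned <- list(breaks = h$breaks, counts = h$counts)
msg <- serialize_pb(binned, NULL)         # raw vector of protobuf bytes

print(object.size(x))                     # memory used by the raw data
length(msg)                               # bytes in the serialized summary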
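
The two single-pass patterns can likewise be prototyped on a single
machine in plain R. In the sketch below, map_histogram,
reduce_histograms, map_points, reduce_sum and width are hypothetical
stand-ins for real mapper tasks, reducer tasks and the bucket width;
none of these names appear in the paper.

## Method 1: each mapper builds a partial histogram over its shard.
map_histogram <- function(shard, width = 0.5) {
  bucket <- floor(shard / width) * width  # round each point down to its bucket
  c(table(bucket))                        # named vector: bucket -> count
}

## The reducer merges partial histograms by summing counts per bucket.
reduce_histograms <- function(partials) {
  counts <- unlist(unname(partials))
  tapply(counts, names(counts), sum)
}

## Method 2: each mapper emits a (bucket, 1) pair for every data point.
map_points <- function(shard, width = 0.5) {
  data.frame(key = floor(shard / width) * width, value = 1L)
}

## The reducer sums all values that share the same key.
reduce_sum <- function(pairs) {
  aggregate(value ~ key, data = pairs, FUN = sum)
}

## Single-machine driver standing in for a distributed cluster.
set.seed(1)
x <- rexp(1e5)
shards <- unname(split(x, rep(1:4, length.out = length(x))))

h1 <- reduce_histograms(lapply(shards, map_histogram))
h2 <- reduce_sum(do.call(rbind, lapply(shards, map_points)))

## Both patterns yield identical bucket counts.
h1 <- h1[order(as.numeric(names(h1)))]
stopifnot(all(h1 == h2$value))

The trade-off between the two: the first pattern ships one compact
message per mapper, which could itself be serialized as a Protocol
Buffer, while the second emits one key-value pair per data point, so
it carries less mapper-side state but far more intermediate data.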