[Rprotobuf-commits] r696 - papers/rjournal
noreply at r-forge.r-project.org
Fri Jan 3 21:39:06 CET 2014
Author: murray
Date: 2014-01-03 21:39:06 +0100 (Fri, 03 Jan 2014)
New Revision: 696
Modified:
papers/rjournal/eddelbuettel-francois-stokely.Rnw
Log:
Improve introductory text. Compare with the binary JSON formats like
MessagePack and BSON. Add the new high level figure about
serializing/deserializing data from an interactive R session to other
distributed systems.
Modified: papers/rjournal/eddelbuettel-francois-stokely.Rnw
===================================================================
--- papers/rjournal/eddelbuettel-francois-stokely.Rnw 2014-01-03 20:21:47 UTC (rev 695)
+++ papers/rjournal/eddelbuettel-francois-stokely.Rnw 2014-01-03 20:39:06 UTC (rev 696)
@@ -36,16 +36,24 @@
Modern data collection and analysis pipelines are increasingly being
built using collections of components to better manage software
complexity through reusability, modularity, and fault
-isolation \citep{Wegiel:2010:CTT:1932682.1869479}. Different
-programming languages are often used for the different phases of data
+isolation \citep{Wegiel:2010:CTT:1932682.1869479}.
+Data analysis patterns such as Split-Apply-Combine
+\citep{wickham2011split} explicitly break up large problems into
+manageable pieces. These patterns are frequently employed with
+different programming languages used for the different phases of data
analysis -- collection, cleaning, analysis, post-processing, and
presentation -- in order to take advantage of the unique combination of
performance, speed of development, and library support offered by
different environments. Each stage of the data
analysis pipeline may involve storing intermediate results in a
-file or sending them over the network. Programming languages such as
-Java, Ruby, Python, and R include built-in serialization support, but
-these formats are tied to the specific programming language in use.
+file or sending them over the network.
+
+Programming languages such as Java, Ruby, Python, and R include
+built-in serialization support, but these formats are tied to the
+specific programming language in use and thus lock the user into a
+single environment.
+%
+% do not facilitate
% TODO(ms): and they often don't support versioning among other faults.
CSV files can be read and written by many applications and so are
often used for exporting tabular data. However, CSV files have a
@@ -55,25 +63,43 @@
characters. JSON is another widely supported format, used mostly on
the web, that removes many of these disadvantages, but it too is
slow to parse and does not provide strong typing
-between integers and floating point. Large numbers of JSON messages
-would also be required to duplicate the field names with each message.
+between integers and floating point. Because the schema information
+is not kept separately, multiple JSON messages of the same
+type needlessly duplicate the field names with each message.
+%
+%
+%
+A number of binary formats based on JSON have been proposed to
+reduce the parsing cost and improve efficiency. MessagePack
+\citep{msgpackR} and BSON \citep{rmongodb} both have R interfaces, but
+these formats lack a separate schema for the serialized data and thus
+still duplicate field names with each message sent over the network or
+stored in a file. Such formats also lack support for versioning when
+data storage needs evolve over time.
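
For illustration, consider a minimal sketch of a protocol buffer schema
(the message and field names here are invented, not from the paper). The
schema lives in a separate \texttt{.proto} file, so field names are never
repeated inside each serialized message; only the numeric tags travel with
the data, and fields added in later schema versions leave old messages
readable:

```proto
// Hypothetical schema -- message and field names are illustrative only.
// Kept separately from the data, so serialized messages carry only
// compact numeric tags instead of repeating field names.
message SensorReading {
  optional string sensor_id = 1;  // tag 1 identifies this field on the wire
  optional double value     = 2;
  // Added in a later schema version; readers of old messages are unaffected:
  optional int64 timestamp  = 3;
}
```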
-TODO(ms): Also work in reference to Split-Apply-Combine pattern for
-data analysis \citep{wickham2011split}, since that is a great pattern
-but it seems overly optimistic to expect all of those phases to always
-be done in the same language.
+% TODO(mstokely): Take a more conversational tone here asking
+% questions and motivating protocol buffers?
This article describes the basics of Google's Protocol Buffers through
an easy-to-use R package, \CRANpkg{RProtoBuf}.  After describing the
basics of protocol buffers and \CRANpkg{RProtoBuf}, we illustrate
several common use cases for protocol buffers in data analysis.
+\section{Protocol Buffers}
+% This content is good. Maybe use and cite?
+% http://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html
+
+Protocol Buffers are one of the more popular examples of the modern
+class of language-neutral, platform-neutral, extensible mechanisms
+for sharing structured data.
+
XXX Related work on IDLs (greatly expanded)
XXX Design tradeoffs: reflection vs proto compiler
-\section{Protocol Buffers}
-
Once the data serialization needs get complex enough, application
developers typically benefit from the use of an \emph{interface
description language}, or \emph{IDL}. IDLs like Google's Protocol
@@ -139,6 +165,14 @@
languages to support protocol buffers is compiled as part of the
project page: \url{http://code.google.com/p/protobuf/wiki/ThirdPartyAddOns}
+\begin{figure}[t]
+\begin{center}
+\includegraphics[width=\textwidth]{protobuf-distributed-system-crop.pdf}
+\end{center}
+\caption{Example usage of protocol buffers to serialize and
+deserialize data from an interactive R session to other distributed
+systems.}
+\label{fig:protobuf-distributed-usecase}
+\end{figure}
+
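
As a concrete sketch of the workflow in the figure, the following R
session serializes a message to a compact binary payload that any other
protobuf-aware system could parse. It assumes the \CRANpkg{RProtoBuf}
package and its bundled \texttt{tutorial.Person} example schema are
installed; treat the exact calls as illustrative rather than normative:

```r
# Sketch assuming the RProtoBuf package and its bundled
# tutorial.Person example schema are available.
library(RProtoBuf)

# Create a message from the example schema shipped with the package.
p <- new(tutorial.Person, name = "Murray", id = 1)

# Serialize to a raw vector: a compact, language-neutral payload
# that a Python or C++ consumer could parse with the same schema.
bytes <- p$serialize(NULL)

# Round-trip: parse the bytes back into a message object.
q <- tutorial.Person$read(bytes)
stopifnot(identical(q$name, "Murray"))
```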
\section{Basic Usage: Messages and Descriptors}
This section describes how to use the R API to create and manipulate