[Rprotobuf-commits] r750 - papers/jss
noreply at r-forge.r-project.org
Sat Jan 11 22:12:39 CET 2014
Author: edd
Date: 2014-01-11 22:12:39 +0100 (Sat, 11 Jan 2014)
New Revision: 750
Modified:
papers/jss/article.Rnw
Log:
bunch of edits
Modified: papers/jss/article.Rnw
===================================================================
--- papers/jss/article.Rnw 2014-01-11 17:28:04 UTC (rev 749)
+++ papers/jss/article.Rnw 2014-01-11 21:12:39 UTC (rev 750)
@@ -20,7 +20,7 @@
\title{\pkg{RProtoBuf}: Efficient Cross-Language Data Serialization in R}
%% for pretty printing and a nice hypersummary also set:
-\Plainauthor{Dirk Eddelbuettel, Murray Stokely} %% comma-separated
+\Plainauthor{Dirk Eddelbuettel, Murray Stokely, Jeroen Ooms} %% comma-separated
\Plaintitle{RProtoBuf: Efficient Cross-Language Data Serialization in R}
\Shorttitle{\pkg{RProtoBuf}: Protocol Buffers in R} %% a short title (if necessary)
@@ -121,13 +121,12 @@
\citep{wickham2011split} explicitly break up large problems into
manageable pieces. These patterns are frequently employed with
different programming languages used for the different phases of data
-analysis -- collection, cleaning, analysis, post-processing, and
+analysis -- collection, cleaning, modeling, analysis, post-processing, and
presentation -- in order to take advantage of the unique combination of
performance, speed of development, and library support offered by
-different environments. Each stage of the data
+different environments and languages. Each stage of such a data
analysis pipeline may involve storing intermediate results in a
file or sending them over the network.
-% DE: Nice!
Given these requirements, how do we safely share intermediate results
between different applications, possibly written in different
@@ -137,34 +136,35 @@
serialization support, but these formats are tied to the specific
% DE: need to define serialization?
programming language in use and thus lock the user into a single
-environment. CSV files can be read and written by many applications
-and so are often used for exporting tabular data. However, CSV files
-have a number of disadvantages, such as a limitation of exporting only
-tabular datasets, lack of type-safety, inefficient text representation
-and parsing, and ambiguities in the format involving special
-characters. JSON is another widely-supported format used mostly on
-the web that removes many of these disadvantages, but it too suffers
-from being too slow to parse and also does not provide strong typing
-between integers and floating point. Because the schema information
-is not kept separately, multiple JSON messages of the same type
-needlessly duplicate the field names with each message.
-Lastly, XML is a well-established and widely-supported protocol with the ability to define
-just about any arbitrarily complex schema. However, it pays for this
-complexity with comparatively large and verbose messages, and added
-complexities at the parsing side.
-%
-%
-%
+environment.
+
+\emph{Comma-separated values} (CSV) files can be read and written by many
+applications and so are often used for exporting tabular data. However, CSV
+files have a number of disadvantages, such as being limited to
+tabular datasets, lack of type-safety, inefficient text representation and
+parsing, possibly limited precision, and ambiguities in the format involving
+special characters. \emph{JavaScript Object Notation} (JSON) is another
+widely-supported format used mostly on the web that removes many of these
+disadvantages, but it too suffers from being slow to parse and does
+not distinguish between integer and floating point types. Because the
+schema information is not kept separately, multiple JSON messages of the same
+type needlessly duplicate the field names with each message. Lastly,
+\emph{Extensible Markup Language} (XML) is a well-established and widely-supported
+protocol with the ability to define just about any arbitrarily complex
+schema. However, it pays for this complexity with comparatively large and
+verbose messages, and added complexities at the parsing side (which are
+somewhat mitigated by the availability of mature libraries and
+parsers).
+
A number of binary formats based on JSON have been proposed that
reduce the parsing cost and improve the efficiency. MessagePack
-and BSON both have R interfaces \citep{msgpackR,rmongodb}, but
+and BSON both have R interfaces, but % \citep{msgpackR,rmongodb}, but
% DE Why do we cite these packages, but not the numerous JSON packages?
these formats lack a separate schema for the serialized data and thus
still duplicate field names with each message sent over the network or
stored in a file. Such formats also lack support for versioning when
data storage needs evolve over time, or when application logic and
requirement changes dictate updates to the message format.
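To make the versioning point concrete, here is a hypothetical sketch (not taken from the paper; the message and field names are invented for illustration) of the forward-compatible schema evolution that Protocol Buffers support and that schema-less binary formats lack:

```proto
// Revision 1 of a hypothetical .proto file.
message LogRecord {
  required string host = 1;
  required int32  status = 2;
}

// Revision 2 of the same file: a new optional field is added
// under a fresh, previously unused tag number.
// Old readers silently skip unknown field 3; old messages still
// parse under the new schema because the field is optional.
message LogRecord {
  required string host = 1;
  required int32  status = 2;
  optional string user_agent = 3;
}
```

The key convention is that existing tag numbers are never reused or renumbered, so serialized payloads remain readable across schema revisions.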
-% DE: Need to talk about XML -- added a few lines at previous paragraph
Once the data serialization needs of an application become complex
enough, developers typically benefit from the use of an
@@ -188,19 +188,19 @@
% in the middle (full class/method details) and interesting
% applications at the end.
-Section~\ref{sec:protobuf} provides a general overview of Protocol
-Buffers. Section~\ref{sec:rprotobuf-basic} describes the interactive
-R interface provided by \CRANpkg{RProtoBuf} and introduces the two
-main abstractions: \emph{Messages} and \emph{Descriptors}.
-Section~\ref{sec:rprotobuf-classes} describes the implementation
-details of the main S4 classes making up this package.
-Section~\ref{sec:types} describes the challenges of type coercion
-between R and other languages. Section~\ref{sec:evaluation}
-introduces a general R language schema for serializing arbitrary R
-objects and evaluates it against R's built-in serialization.
-Sections~\label{sec:opencpu} and \label{sec:mapreduce} provide
-real-world use cases of \CRANpkg{RProtoBuf} in web service and
-MapReduce environments, respectively.
+The rest of the paper is organized as follows. Section~\ref{sec:protobuf}
+provides a general overview of Protocol Buffers.
+Section~\ref{sec:rprotobuf-basic} describes the interactive R interface
+provided by \CRANpkg{RProtoBuf} and introduces the two main abstractions:
+\emph{Messages} and \emph{Descriptors}. Section~\ref{sec:rprotobuf-classes}
+describes the implementation details of the main S4 classes making up this
+package. Section~\ref{sec:types} describes the challenges of type coercion
+between R and other languages. Section~\ref{sec:evaluation} introduces a
+general R language schema for serializing arbitrary R objects and evaluates
+it against R's built-in serialization. Sections~\ref{sec:opencpu}
+and \ref{sec:mapreduce} provide real-world use cases of \CRANpkg{RProtoBuf}
+in web service and MapReduce environments, respectively, before
+Section~\ref{sec:summary} concludes.
%This article describes the basics of Google's Protocol Buffers through
%an easy to use R package, \CRANpkg{RProtoBuf}. After describing the
@@ -221,10 +221,10 @@
Protocol Buffers can be described as a modern, language-neutral, platform-neutral,
extensible mechanism for sharing and storing structured data. Since their
introduction, Protocol Buffers have been widely adopted in industry with
-applications as varied as database-internal messaging (Drizzle), % DE: citation?
-Sony Playstations, Twitter, Google Search, Hadoop, and Open Street Map. While
+applications as varied as %database-internal messaging (Drizzle), % DE: citation?
+Sony Playstations, Twitter, Google Search, Hadoop, and Open Street Map.
% TODO(DE): This either needs a citation, or remove the name drop
-traditional IDLs have at time been criticized for code bloat and
+While traditional IDLs have at times been criticized for code bloat and
complexity, Protocol Buffers are based on a simple list and records
model that is comparatively flexible and simple to use.
@@ -232,22 +232,22 @@
include:
\begin{itemize}
-\item \emph{Portable}: Allows users to send and receive data between
- applications or different computers.
+\item \emph{Portable}: Enable users to send and receive data between
+ applications as well as different computers or operating systems.
\item \emph{Efficient}: Data is serialized into a compact binary
representation for transmission or storage.
\item \emph{Extensible}: New fields can be added to Protocol Buffer Schemas
- in a forward-compatible way that do not break older applications.
+ in a forward-compatible way that does not break older applications.
\item \emph{Stable}: Protocol Buffers have been in wide use for over a
decade.
\end{itemize}
Figure~\ref{fig:protobuf-distributed-usecase} illustrates an example
-communication workflow with protocol buffers and an interactive R
-session. Common use cases include populating a request RPC protocol
-buffer in R that is then serialized and sent over the network to a
-remote server. The server would then deserialize the message, act on
-the request, and respond with a new protocol buffer over the network. The key
+communication workflow with Protocol Buffers and an interactive R session.
+Common use cases include populating a request remote-procedure call (RPC)
+Protocol Buffer in R that is then serialized and sent over the network to a
+remote server. The server would then deserialize the message, act on the
+request, and respond with a new Protocol Buffer over the network. The key
difference to, say, a request to an Rserve instance is that the remote server
may not even know the R language.
@@ -267,9 +267,9 @@
%between three to ten times \textsl{smaller}, between twenty and one hundred
%times \textsl{faster}, as well as less ambiguous and easier to program.
-Many sources compare data serialization formats and show protocol
-buffers very favorably to the alternatives, such
-as \citet{Sumaray:2012:CDS:2184751.2184810}
+Many sources compare data serialization formats and show Protocol
+Buffers very favorably to the alternatives; see
+\citet{Sumaray:2012:CDS:2184751.2184810} for one such comparison.
%The flexibility of the reflection-based API is particularly well
%suited for interactive data analysis.
@@ -277,11 +277,11 @@
% XXX Design tradeoffs: reflection vs proto compiler
For added speed and efficiency, the C++, Java, and Python bindings to
-Protocol Buffers are used with a compiler that translates a protocol
-buffer schema description file (ending in \texttt{.proto}) into
+Protocol Buffers are used with a compiler that translates a Protocol
+Buffer schema description file (ending in \texttt{.proto}) into
language-specific classes that can be used to create, read, write and
-manipulate protocol buffer messages. The R interface, in contrast,
-uses a reflection-based API that is particularly well suited for
+manipulate Protocol Buffer messages. The R interface, in contrast,
+uses a reflection-based API that is particularly well-suited for
interactive data analysis. All messages in R have a single class
structure, but different accessor methods are created at runtime based
on the named fields of the specified message type.
@@ -324,8 +324,8 @@
binary \emph{payload} of the messages to files and arbitrary binary
R connections.
-The two fundamental building blocks of Protocol Buffers are Messages
-and Descriptors. Messages provide a common abstract encapsulation of
+The two fundamental building blocks of Protocol Buffers are \emph{Messages}
+and \emph{Descriptors}. Messages provide a common abstract encapsulation of
structured data fields of the type specified in a Message Descriptor.
Message Descriptors are defined in \texttt{.proto} files and define a
schema for a particular named class of messages.
@@ -353,11 +353,11 @@
%% TODO(de) Can we make this not break the width of the page?
\noindent
\begin{table}
-\begin{tabular}{@{\hskip .01\textwidth}p{.40\textwidth}@{\hskip .02\textwidth}@{\hskip .02\textwidth}p{0.55\textwidth}@{\hskip .01\textwidth}}
+\begin{tabular}{p{.40\textwidth}p{0.55\textwidth}}
\toprule
Schema : \texttt{addressbook.proto} & Example R Session\\
\cmidrule{1-2}
-\begin{minipage}{.35\textwidth}
+\begin{minipage}{.40\textwidth}
\vspace{2mm}
\begin{example}
package tutorial;
@@ -377,10 +377,10 @@
}
\end{example}
\vspace{2mm}
-\end{minipage} & \begin{minipage}{.5\textwidth}
+\end{minipage} & \begin{minipage}{.55\textwidth}
<<echo=TRUE>>=
library(RProtoBuf)
-p <- new(tutorial.Person, id=1, name="Dirk")
+p <- new(tutorial.Person,id=1,name="Dirk")
class(p)
p$name
p$name <- "Murray"
@@ -421,8 +421,8 @@
all \texttt{.proto} files provided by another R package.
The \texttt{.proto} file syntax for defining the structure of Protocol
-buffer data is described comprehensively on Google Code:
-\url{http://code.google.com/apis/protocolbuffers/docs/proto.html}.
+Buffer data is described comprehensively on Google Code\footnote{See
+\url{http://code.google.com/apis/protocolbuffers/docs/proto.html}.}.
Once the proto files are imported, all message descriptors are
available in the R search path in the \texttt{RProtoBuf:DescriptorPool}
@@ -473,7 +473,6 @@
However, as opposed to R lists, no partial matching is performed
and the name must be given entirely.
-
The \verb|[[| operator can also be used to query and set fields
of a message, supplying either its name or its tag number :
@@ -483,7 +482,7 @@
p[[ "email" ]]
@
-Protocol buffers include a 64-bit integer type, but R lacks native
+Protocol Buffers include a 64-bit integer type, but R lacks native
64-bit integer support. A workaround is available and described in
Section~\ref{sec:int64} for working with large integer values.
@@ -492,7 +491,7 @@
\subsection{Display messages}
-Protocol buffer messages and descriptors implement \texttt{show}
+Protocol Buffer messages and descriptors implement \texttt{show}
methods that provide basic information about the message :
<<>>=
@@ -509,10 +508,10 @@
\subsection{Serializing messages}
-However, the main focus of protocol buffer messages is
+However, the main focus of Protocol Buffer messages is
efficiency. Therefore, messages are transported as a sequence
of bytes. The \texttt{serialize} method is implemented for
-protocol buffer messages to serialize a message into a sequence of
+Protocol Buffer messages to serialize a message into a sequence of
bytes that represents the message.
%(raw vector in R speech) that represents the message.
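As an illustrative sketch of this round trip (assuming the \texttt{tutorial.Person} schema from the earlier addressbook example is on the search path after loading the package):

```r
library(RProtoBuf)

# build a message as in the earlier example
p <- new(tutorial.Person, id = 1, name = "Dirk")

# serialize() with connection = NULL returns the raw byte payload
raw_bytes <- serialize(p, NULL)
is.raw(raw_bytes)

# parse the bytes back into a message of the same type
q <- read(tutorial.Person, raw_bytes)
identical(p$name, q$name)   # TRUE
```

Passing a file name or binary connection instead of \texttt{NULL} writes the payload out directly, which is the usual path when handing messages to another application.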
@@ -589,7 +588,7 @@
@
-\texttt{read} can also be used as a pseudo method of the descriptor
+\texttt{read} can also be used as a pseudo-method of the descriptor
object :
<<>>=
@@ -614,8 +613,9 @@
\texttt{serialize}.
Each R object stores an external pointer to an object managed by
-the \texttt{protobuf} C++ library.
-The \CRANpkg{Rcpp} package \citep{eddelbuettel2011rcpp,eddelbuettel2013seamless} is used to
+the \texttt{protobuf} C++ library which implements the core Protocol Buffer
+functionality. The \CRANpkg{Rcpp} package
+\citep{eddelbuettel2011rcpp,eddelbuettel2013seamless} is used to
facilitate the integration of the R and C++ code for these objects.
% Message, Descriptor, FieldDescriptor, EnumDescriptor,
@@ -636,12 +636,12 @@
which provide a more concise way of wrapping C++ functions and classes
in a single entity.
-The \texttt{RProtoBuf} package combines the \emph{R typical} dispatch
-of the form \verb|method(object, arguments)| and the more traditional
-object oriented notation \verb|object$method(arguments)|.
+The \texttt{RProtoBuf} package combines a dispatch mechanism
+of the form \verb|method(object, arguments)| (common to R) and the more
+traditional object oriented notation \verb|object$method(arguments)|.
Additionally, \texttt{RProtoBuf} implements the \texttt{.DollarNames} S3 generic function
(defined in the \texttt{utils} package) for all classes to enable tab
-completion. Completion possibilities include pseudo method names for all
+completion. Completion possibilities include pseudo-method names for all
classes, plus dynamic dispatch on names or types specific to a given object.
% TODO(ms): Add column check box for doing dynamic dispatch based on type.
@@ -683,9 +683,9 @@
\toprule
\textbf{Slot} & \textbf{Description} \\
\cmidrule(r){2-2}
-\texttt{pointer} & External pointer to the \texttt{Message} object of the C++ proto library. Documentation for the
-\texttt{Message} class is available from the protocol buffer project page:
-\url{http://code.google.com/apis/protocolbuffers/docs/reference/cpp/google.protobuf.message.html#Message} \\
+\texttt{pointer} & External pointer to the \texttt{Message} object of the C++ protobuf library. Documentation for the
+\texttt{Message} class is available from the Protocol Buffer project page. \\
+%(\url{http://code.google.com/apis/protocolbuffers/docs/reference/cpp/google.protobuf.message.html#Message}) \\
\texttt{type} & Fully qualified name of the message. For example a \texttt{Person} message
has its \texttt{type} slot set to \texttt{tutorial.Person} \\[.3cm]
\textbf{Method} & \textbf{Description} \\
@@ -758,8 +758,8 @@
\textbf{Slot} & \textbf{Description} \\
\cmidrule(r){2-2}
\texttt{pointer} & External pointer to the \texttt{Descriptor} object of the C++ proto library. Documentation for the
-\texttt{Descriptor} class is available from the protocol buffer project page:
-\url{http://code.google.com/apis/protocolbuffers/docs/reference/cpp/google.protobuf.descriptor.html#Descriptor} \\
+\texttt{Descriptor} class is available from the Protocol Buffer project page.\\
+%\url{http://code.google.com/apis/protocolbuffers/docs/reference/cpp/google.protobuf.descriptor.html#Descriptor} \\
\texttt{type} & Fully qualified path of the message type. \\[.3cm]
%
\textbf{Method} & \textbf{Description} \\
@@ -781,7 +781,7 @@
\texttt{field\_count} & Return the number of fields in this descriptor.\\
\texttt{field} & Return the descriptor for the specified field in this descriptor.\\
\texttt{nested\_type\_count} & The number of nested types in this descriptor.\\
-\texttt{nested\_type} & Return the descriptor for the specified nested
+\texttt{nested\_type} & Return the descriptor for the specified nested
type in this descriptor.\\
\texttt{enum\_type\_count} & The number of enum types in this descriptor.\\
\texttt{enum\_type} & Return the descriptor for the specified enum
@@ -984,8 +984,9 @@
One of the benefits of using an Interface Definition Language (IDL)
like Protocol Buffers is that it provides a highly portable basic type
-system that different language and hardware implementations can map to
+system. This permits different language and hardware implementations to map to
the most appropriate type in different environments.
+
Table~\ref{table-get-types} details the correspondence between the
field type and the type of data that is retrieved by \verb|$| and \verb|[[|
extractors.
@@ -1005,11 +1006,11 @@
sint32 & \texttt{integer} vector & \texttt{integer} vector \\
sfixed32 & \texttt{integer} vector & \texttt{integer} vector \\[3mm]
int64 & \texttt{integer} or \texttt{character}
-vector \footnotemark & \texttt{integer} or \texttt{character} vector \\
+vector & \texttt{integer} or \texttt{character} vector \\
uint64 & \texttt{integer} or \texttt{character} vector & \texttt{integer} or \texttt{character} vector \\
sint64 & \texttt{integer} or \texttt{character} vector & \texttt{integer} or \texttt{character} vector \\
fixed64 & \texttt{integer} or \texttt{character} vector & \texttt{integer} or \texttt{character} vector \\
-sfixed64 & \texttt{integer} or \texttt{character} vector & \texttt{integer} or \texttt{character} vector \\\hline
+sfixed64 & \texttt{integer} or \texttt{character} vector & \texttt{integer} or \texttt{character} vector \\[3mm]
bool & \texttt{logical} vector & \texttt{logical} vector \\[3mm]
string & \texttt{character} vector & \texttt{character} vector \\
bytes & \texttt{character} vector & \texttt{character} vector \\[3mm]
@@ -1019,17 +1020,17 @@
\end{tabular}
\end{small}
\caption{\label{table-get-types}Correspondence between field type and
- R type retrieved by the extractors. \footnotesize{1. R lacks native
+ R type retrieved by the extractors. Note that R lacks native
64-bit integers, so the \texttt{RProtoBuf.int64AsString} option is
available to return large integers as characters to avoid losing
- precision. This option is described in Section~\ref{sec:int64}}.}
+ precision. This option is described in Section~\ref{sec:int64}.}
\end{table}
\subsection{Booleans}
R booleans can accept three values: \texttt{TRUE}, \texttt{FALSE}, and
-\texttt{NA}. However, most other languages, including the protocol
-buffer schema, only accept \texttt{TRUE} or \texttt{FALSE}. This means
+\texttt{NA}. However, most other languages, including the Protocol
+Buffer schema, only accept \texttt{TRUE} or \texttt{FALSE}. This means
that we simply cannot store R logical vectors that include all three
possible values as booleans. The library will refuse to store
\texttt{NA}s in Protocol Buffer boolean fields, and users must instead
@@ -1059,7 +1060,7 @@
\subsection{Unsigned Integers}
R lacks a native unsigned integer type. Values between $2^{31}$ and
-$2^{32} - 1$ read from unsigned int protocol buffer fields must be
+$2^{32} - 1$ read from unsigned int Protocol Buffer fields must be
stored as doubles in R.
<<>>=
@@ -1140,22 +1141,21 @@
\section{Evaluation: data.frame to Protocol Buffer Serialization}
\label{sec:evaluation}
-Saptarshi Guha wrote the RHIPE package \citep{rhipe} which includes
-protocol buffer integration with R. However, this implementation
-takes a different approach: any R object is serialized into a message
-based on a single catch-all \texttt{proto} schema. Jeroen Ooms took a
-similar approach influenced by Saptarshi in the \pkg{RProtoBufUtils}
-package (which has now been integrated in \pkg{RProtoBuf}). Unlike
-Saptarshi's package, however, RProtoBufUtils depends
-on, and extends, RProtoBuf for underlying message operations.
+The \pkg{RHIPE} package \citep{rhipe} also includes a Protocol Buffer integration with R.
+However, its implementation takes a different approach: any R object is
+serialized into a message based on a single catch-all \texttt{proto} schema.
+A similar approach was taken by the \pkg{RProtoBufUtils} package (which has now been
+integrated in \pkg{RProtoBuf}). Unlike \pkg{RHIPE}, however, \pkg{RProtoBufUtils}
+depended on, and extended, \pkg{RProtoBuf} for underlying message operations.
+%DE Shall this go away now that we sucked RPBUtils into RBP?
One key extension of \pkg{RProtoBufUtils} is the
\texttt{serialize\_pb} method to convert R objects into serialized
-protocol buffers in the catch-all schema. The \texttt{can\_serialize\_pb}
-method can be used to determine whether the given R object can safely
+Protocol Buffers in the catch-all schema. The \texttt{can\_serialize\_pb}
+method can then be used to determine whether the given R object can safely
be expressed in this way. To illustrate how this method works, we
attempt to convert all of the built-in datasets from R into this
-serialized protocol buffer representation.
+serialized Protocol Buffer representation.
<<echo=TRUE>>=
datasets <- subset(as.data.frame(data()$results), Package=="datasets")
@@ -1165,7 +1165,7 @@
There are \Sexpr{n} standard data sets included in R. We use the
\texttt{can\_serialize\_pb} method to determine how many of those can
-be safely converted to a serialized protocol buffer representation.
+be safely converted to a serialized Protocol Buffer representation.
<<echo=TRUE>>=
#datasets$valid.proto <- sapply(datasets$load.name, function(x) can_serialize_pb(eval(as.name(x))))
@@ -1177,7 +1177,7 @@
(\Sexpr{format(100*m/n,digits=1)}\%). The next section illustrates how
many bytes were used to store the data sets under four different
situations: (1) normal R serialization, (2) R serialization followed by
-gzip, (3) normal protocol buffer serialization, (4) protocol buffer
+gzip, (3) normal Protocol Buffer serialization, (4) Protocol Buffer
serialization followed by gzip.
\subsection{Compression Performance}
@@ -1601,6 +1601,7 @@
\section{Summary}
+\label{sec:summary}
% RProtoBuf has been used.