[Rprotobuf-commits] r805 - papers/jss

Tue Jan 21 06:37:30 CET 2014

Author: murray
Date: 2014-01-21 06:37:30 +0100 (Tue, 21 Jan 2014)
New Revision: 805

Modified:
   papers/jss/article.Rnw
Log:
Mostly work on section 2 to address some flaws identified by Jeroen,
Karl, and others.

Move up the basic description of the protocol buffer schema from
section 3 to section 2, including the example of how protocol buffers
are manipulated with this package.

Revert a regression -- fix the reference to BSON and MessagePack by
putting the citations next to the text about the R interfaces rather
than the formats themselves (re-apply fix from Dirk).

Add an explicit transition from section 2 to section 3 as the last
sentence of section 2.

Define payload at the beginning of section 3 just once, so we don't
repeat ourselves later in the section.

Add a sentence to section 6 that provides more context about when you
use the basic RProtoBuf functionality with specific schemas -- "This
is useful when there are pre-existing systems with defined schemas or
significant software components written in other languages that need
to be accessed from within R." before transitioning to talk about the
universal R language schema.  (This point was suggested by Karl.)



Modified: papers/jss/article.Rnw
===================================================================

--- papers/jss/article.Rnw	2014-01-21 04:21:13 UTC (rev 804)
+++ papers/jss/article.Rnw	2014-01-21 05:37:30 UTC (rev 805)
@@ -190,8 +190,8 @@
 
 A number of binary formats based on \texttt{JSON} have been proposed
 that reduce the parsing cost and improve efficiency.  \pkg{MessagePack}
-\citep{msgpackR} and \pkg{BSON} \citep{rmongodb} both have R
-interfaces, but these formats lack a separate schema for the seralized
+and \pkg{BSON} both have R
+interfaces \citep{msgpackR,rmongodb}, but these formats lack a separate schema for the serialized
 data and thus still duplicate field names with each message sent over
 the network or stored in a file.  Such formats also lack support for
 versioning when data storage needs evolve over time, or when
@@ -268,22 +268,10 @@
 \section{Protocol Buffers}
 \label{sec:protobuf}
 
-
 % JO: I'm not sure where to put this paragraph. I think it is too technical
 % for the introduction section. Maybe start this section with some explanation
 % of what a schema is and then continue with showing how PB implement this?
-Once the data serialization needs of an application become complex
-enough, developers typically benefit from the use of an
-\emph{interface description language}, or \emph{IDL}.  IDLs like
-Protocol Buffers \citep{protobuf}, Apache Thrift, and Apache Avro provide a compact
-well-documented schema for cross-language data structures and
-efficient binary interchange formats.
-Since the schema is provided separately from the encoded data, the data can be
-efficiently encoded to minimize storage costs of the stored data when compared with simple
-``schema-less'' binary interchange formats. 
-The schema can be used to generate classes for statically-typed programming languages
-such as C++ and Java, or can be used with reflection for dynamically-typed programming
-languages.
+% MS: Yes I agree, tried to address below.
 
 %FIXME Introductory section which may include references in parentheses
 %\citep{R}, or cite a reference such as \citet{R} in the text.
@@ -293,26 +281,10 @@
 
 %% TODO(de,ms)  What follows is oooooold and was lifted from the webpage
 %%              Rewrite?
-Protocol Buffers can be described as a modern, language-neutral, platform-neutral,
-extensible mechanism for sharing and storing structured data.  Since their
-introduction, Protocol Buffers have been widely adopted in industry with
-applications as varied as %database-internal messaging (Drizzle), % DE: citation?
-Sony Playstations, Twitter, Google Search, Hadoop, and Open Street Map.  
-% TODO(DE): This either needs a citation, or remove the name drop
-% MS: These are mostly from blog posts, I can't find a good reference
-% that has a long list, and the name and year citation style seems
-% less conducive to long lists of marginal citations like blog posts
-% compared to say concise CS/math style citations [3,4,5,6]. Thoughts?
+Protocol Buffers are a modern, language-neutral, platform-neutral,
+extensible mechanism for sharing and storing structured data.  Some of
+the key features provided by Protocol Buffers for data analysis include:
 
-
-
-While traditional IDLs have at times been criticized for code bloat and
-complexity, Protocol Buffers are based on a simple list and records
-model that is compartively flexible and simple to use.
-
-Some of the key features provided by Protocol Buffers for data analysis
-include:
-
 \begin{itemize}
 \item \emph{Portable}:  Enable users to send and receive data between
   applications as well as different computers or operating systems.
@@ -324,6 +296,16 @@
   decade.
 \end{itemize}
 
+% Let's place this at the top of the page or the bottom, or on a float
+% page, but not just here in the middle of the page.
+\begin{figure}[tbp]
+\begin{center}
+\includegraphics[width=\textwidth]{protobuf-distributed-system-crop.pdf}
+\end{center}
+\caption{Example protobuf usage}
+\label{fig:protobuf-distributed-usecase}
+\end{figure}
+
 Figure~\ref{fig:protobuf-distributed-usecase} illustrates an example
 communication workflow with Protocol Buffers and an interactive R session.
 Common use cases include populating a request remote-procedure call (RPC)
@@ -334,6 +316,94 @@
 the remote server may be implemented in any language, with no
 dependence on R.
 
+While traditional IDLs have at times been criticized for code bloat and
+complexity, Protocol Buffers are based on a simple list and records
+model that is flexible and simple to use.  The schema for structured
+protocol buffer data is defined in \texttt{.proto} files which may
+contain one or more message types.  Each message type has one or more
+fields.  A field is specified with a unique number, a name, a value
+type, and a field rule specifying whether the field is optional,
+required, or repeated.  The supported value types are numbers,
+enumerations, booleans, strings, raw bytes, or other nested message
+types.  The \texttt{.proto} file syntax for defining the structure of protocol
+buffer data is described comprehensively on Google Code\footnote{See 
+\url{http://code.google.com/apis/protocolbuffers/docs/proto.html}.}.
+Table~\ref{tab:proto} shows an example \texttt{.proto} file which
+defines the \texttt{tutorial.Person} type.  The R code in the right
+column shows an example of creating a new message of this type and
+populating its fields.
+
+%% TODO(de) Can we make this not break the width of the page?
+\noindent
+\begin{table}
+\begin{tabular}{p{.40\textwidth}p{0.55\textwidth}}
+\toprule
+Schema : \texttt{addressbook.proto} & Example R Session\\
+\cmidrule{1-2}
+\begin{minipage}{.40\textwidth}
+\vspace{2mm}
+\begin{example}
+package tutorial;
+message Person {
+ required string name = 1;
+ required int32 id = 2;
+ optional string email = 3;
+ enum PhoneType {
+   MOBILE = 0; HOME = 1;
+   WORK = 2;
+ }
+ message PhoneNumber {
+   required string number = 1;
+   optional PhoneType type = 2;
+ }
+ repeated PhoneNumber phone = 4;
+}
+\end{example}
+\vspace{2mm}
+\end{minipage} & \begin{minipage}{.55\textwidth}
+<<echo=TRUE>>=
+library(RProtoBuf)
+p <- new(tutorial.Person,id=1,name="Dirk")
+class(p)
+p$name
+p$name <- "Murray"
+cat(as.character(p))
+serialize(p, NULL)
+@
+\end{minipage} \\
+\bottomrule
+\end{tabular}
+\caption{The schema representation from a \texttt{.proto} file for the
+  \texttt{tutorial.Person} class (left) and simple R code for creating
+  an object of this class and accessing its fields (right).}
+\label{tab:proto}
+\end{table}
+
+
+% The schema can be used to generate model classes for statically-typed programming languages
+%such as C++ and Java, or can be used with reflection for dynamically-typed programming
+%languages.
+
+% TODO(mstokely): Maybe find a place to add this?  
+% Since their
+% introduction, Protocol Buffers have been widely adopted in industry with
+% applications as varied as %database-internal messaging (Drizzle), % DE: citation?
+% Sony Playstations, Twitter, Google Search, Hadoop, and Open Street
+% Map.  
+
+% TODO(DE): This either needs a citation, or remove the name drop
+% MS: These are mostly from blog posts, I can't find a good reference
+% that has a long list, and the name and year citation style seems
+% less conducive to long lists of marginal citations like blog posts
+% compared to say concise CS/math style citations [3,4,5,6]. Thoughts?
+
+
+% The schema can be used to generate classes for statically-typed programming languages
+% such as C++ and Java, or can be used with reflection for dynamically-typed programming
+% languages.
+
+
+
 %Protocol buffers are a language-neutral, platform-neutral, extensible
 %way of serializing structured data for use in communications
 %protocols, data storage, and more.
@@ -345,16 +415,6 @@
 %buffers are also forward compatible: updates to the \texttt{proto}
 %files do not break programs built against the previous specification.
 
-%While benchmarks are not available, Google states on the project page that in
-%comparison to XML, protocol buffers are at the same time \textsl{simpler},
-%between three to ten times \textsl{smaller}, between twenty and one hundred
-%times \textsl{faster}, as well as less ambiguous and easier to program.
-
-%The flexibility of the reflection-based API is particularly well
-%suited for interactive data analysis.
-
-% XXX Design tradeoffs: reflection vs proto compiler
-
 For added speed and efficiency, the C++, Java, and Python bindings to
 Protocol Buffers are used with a compiler that translates a Protocol
 Buffer schema description file (ending in \texttt{.proto}) into
@@ -364,7 +424,8 @@
 interactive data analysis.  
 All messages in R have a single class
 structure, but different accessor methods are created at runtime based
-on the named fields of the specified message type.
+on the named fields of the specified message type, as described in the
+next section.
 
 % In other words, given the 'proto'
 %description file, code is automatically generated for the chosen
@@ -388,40 +449,19 @@
 %languages to support protocol buffers is compiled as part of the
 %project page: \url{http://code.google.com/p/protobuf/wiki/ThirdPartyAddOns}
 
-\begin{figure}[t]
-\begin{center}
-\includegraphics[width=\textwidth]{protobuf-distributed-system-crop.pdf}
-\end{center}
-\caption{Example protobuf usage}
-\label{fig:protobuf-distributed-usecase}
-\end{figure}
-
 \section{Basic Usage: Messages and Descriptors}
 \label{sec:rprotobuf-basic}
 
 This section describes how to use the R API to create and manipulate
 protocol buffer messages in R, and how to read and write the
-binary \emph{payload} of the messages to files and arbitrary binary
+binary representation of the message (often called the \emph{payload}) to files and arbitrary binary
 R connections.
-
 The two fundamental building blocks of Protocol Buffers are \emph{Messages}
 and \emph{Descriptors}.  Messages provide a common abstract encapsulation of
 structured data fields of the type specified in a Message Descriptor.
 Message Descriptors are defined in \texttt{.proto} files and define a
 schema for a particular named class of messages.
 
-Table~\ref{tab:proto} shows an example \texttt{.proto} file which
-defines the \texttt{tutorial.Person} type.  The R code in the right
-column shows an example of creating a new message of this type and
-populating its fields.  A \texttt{.proto} file may contain one or more
-message types, and each message type has one or more fields.  A field
-is specified with a unique number, a name, a value type, and a field
-rule specifying whether the field is optional, required, or repeated.
-The supported value types are numbers, enumerations, booleans,
-strings, raw bytes, or other nested message types.
-The \texttt{.proto} file syntax for defining the structure of protocol
-buffer data is described comprehensively on Google Code\footnote{See 
-\url{http://code.google.com/apis/protocolbuffers/docs/proto.html}.}.
 
 % Commented out because we said this earlier.
 %This separation
@@ -438,51 +478,6 @@
 %languages.  The definition
 
 
-%% TODO(de) Can we make this not break the width of the page?
-\noindent
-\begin{table}
-\begin{tabular}{p{.40\textwidth}p{0.55\textwidth}}
-\toprule
-Schema : \texttt{addressbook.proto} & Example R Session\\
-\cmidrule{1-2}
-\begin{minipage}{.40\textwidth}
-\vspace{2mm}
-\begin{example}
-package tutorial;
-message Person {
- required string name = 1;
- required int32 id = 2;
- optional string email = 3;
- enum PhoneType {
-   MOBILE = 0; HOME = 1;
-   WORK = 2;
- }
- message PhoneNumber {
-   required string number = 1;
-   optional PhoneType type = 2;
- }
- repeated PhoneNumber phone = 4;
-}
-\end{example}
-\vspace{2mm}
-\end{minipage} & \begin{minipage}{.55\textwidth}
-<<echo=TRUE>>=
-library(RProtoBuf)
-p <- new(tutorial.Person,id=1,name="Dirk")
-class(p)
-p$name
-p$name <- "Murray"
-cat(as.character(p))
-serialize(p, NULL)
-@
-\end{minipage} \\
-\bottomrule
-\end{tabular}
-\caption{The schema representation from a \texttt{.proto} file for the
-  \texttt{tutorial.Person} class (left) and simple R code for creating
-  an object of this class and accessing its fields (right).}
-\label{tab:proto}
-\end{table}
 
 %This section may contain a figure such as Figure~\ref{figure:rlogo}.
 %
@@ -663,7 +658,7 @@
 the human-readable ASCII output that is created with
 \code{as.character}.
 
-The binary representation of the message (often called the payload)
+The binary representation of the message
 does not contain information that can be used to dynamically
 infer the message type, so we have to provide this information
 to the \texttt{read} function in the form of a descriptor :
@@ -1250,8 +1245,12 @@
 
 The previous sections discussed functionality in the \pkg{RProtoBuf} package
 for creating, manipulating, parsing and serializing Protocol Buffer
-messages of a specific pre-defined schema.  The package also provides
-methods for converting arbitrary R data structures into protocol
+messages of a pre-defined schema.  This is useful when there are
+pre-existing systems with defined schemas or significant software
+components written in other languages that need to be accessed from
+within R.
+
+The package also provides methods for converting arbitrary R data structures into protocol
 buffers and vice versa with a universal R object schema. The \texttt{serialize\_pb} and \texttt{unserialize\_pb}
 functions serialize arbitrary R objects into a universal Protocol Buffer 
 message: