[Rprotobuf-commits] r805 - papers/jss
noreply at r-forge.r-project.org
Tue Jan 21 06:37:30 CET 2014
Author: murray
Date: 2014-01-21 06:37:30 +0100 (Tue, 21 Jan 2014)
New Revision: 805
Modified:
papers/jss/article.Rnw
Log:
Mostly work on section 2 to address some flaws identified by Jeroen,
Karl, and others.
Move up the basic description of the protocol buffer schema from
section 3 to section 2, including the example of how protocol buffers
are manipulated with this package.
Revert a regression -- fix the reference to BSON and MessagePack by
putting the citations next to the text about the R interfaces rather
than the formats themselves (re-apply fix from Dirk).
Add an explicit transition from section 2 to section 3 as the last sentence of 2.
Define payload at the beginning of section 3 just once, so we don't
repeat ourselves later in the section.
Add a sentence to section 6 that provides more context about when you
use the basic RProtoBuf functionality with specific schemas -- "This
is useful when there are pre-existing systems with defined schemas or
significant software components written in other languages that need
to be accessed from within R." before transitioning to talk about the
universal R language schema. (This point was suggested by Karl.)
Modified: papers/jss/article.Rnw
===================================================================
--- papers/jss/article.Rnw 2014-01-21 04:21:13 UTC (rev 804)
+++ papers/jss/article.Rnw 2014-01-21 05:37:30 UTC (rev 805)
@@ -190,8 +190,8 @@
A number of binary formats based on \texttt{JSON} have been proposed
that reduce the parsing cost and improve efficiency. \pkg{MessagePack}
-\citep{msgpackR} and \pkg{BSON} \citep{rmongodb} both have R
-interfaces, but these formats lack a separate schema for the seralized
+and \pkg{BSON} both have R
+interfaces \citep{msgpackR,rmongodb}, but these formats lack a separate schema for the serialized
data and thus still duplicate field names with each message sent over
the network or stored in a file. Such formats also lack support for
versioning when data storage needs evolve over time, or when
@@ -268,22 +268,10 @@
\section{Protocol Buffers}
\label{sec:protobuf}
-
% JO: I'm not sure where to put this paragraph. I think it is too technical
% for the introduction section. Maybe start this section with some explanation
% of what a schema is and then continue with showing how PB implement this?
-Once the data serialization needs of an application become complex
-enough, developers typically benefit from the use of an
-\emph{interface description language}, or \emph{IDL}. IDLs like
-Protocol Buffers \citep{protobuf}, Apache Thrift, and Apache Avro provide a compact
-well-documented schema for cross-language data structures and
-efficient binary interchange formats.
-Since the schema is provided separately from the encoded data, the data can be
-efficiently encoded to minimize storage costs of the stored data when compared with simple
-``schema-less'' binary interchange formats.
-The schema can be used to generate classes for statically-typed programming languages
-such as C++ and Java, or can be used with reflection for dynamically-typed programming
-languages.
+% MS: Yes I agree, tried to address below.
%FIXME Introductory section which may include references in parentheses
%\citep{R}, or cite a reference such as \citet{R} in the text.
@@ -293,26 +281,10 @@
%% TODO(de,ms) What follows is oooooold and was lifted from the webpage
%% Rewrite?
-Protocol Buffers can be described as a modern, language-neutral, platform-neutral,
-extensible mechanism for sharing and storing structured data. Since their
-introduction, Protocol Buffers have been widely adopted in industry with
-applications as varied as %database-internal messaging (Drizzle), % DE: citation?
-Sony Playstations, Twitter, Google Search, Hadoop, and Open Street Map.
-% TODO(DE): This either needs a citation, or remove the name drop
-% MS: These are mostly from blog posts, I can't find a good reference
-% that has a long list, and the name and year citation style seems
-% less conducive to long lists of marginal citations like blog posts
-% compared to say concise CS/math style citations [3,4,5,6]. Thoughts?
+Protocol Buffers are a modern, language-neutral, platform-neutral,
+extensible mechanism for sharing and storing structured data. Some of
+the key features provided by Protocol Buffers for data analysis include:
-
-
-While traditional IDLs have at times been criticized for code bloat and
-complexity, Protocol Buffers are based on a simple list and records
-model that is compartively flexible and simple to use.
-
-Some of the key features provided by Protocol Buffers for data analysis
-include:
-
\begin{itemize}
\item \emph{Portable}: Enable users to send and receive data between
applications as well as different computers or operating systems.
@@ -324,6 +296,16 @@
decade.
\end{itemize}
+% Let's place this at the top of the page or the bottom, or on a float
+% page, but not just here in the middle of the page.
+\begin{figure}[tbp]
+\begin{center}
+\includegraphics[width=\textwidth]{protobuf-distributed-system-crop.pdf}
+\end{center}
+\caption{Example protobuf usage}
+\label{fig:protobuf-distributed-usecase}
+\end{figure}
+
Figure~\ref{fig:protobuf-distributed-usecase} illustrates an example
communication workflow with Protocol Buffers and an interactive R session.
Common use cases include populating a request remote-procedure call (RPC)
@@ -334,6 +316,94 @@
the remote server may be implemented in any language, with no
dependence on R.
+While traditional IDLs have at times been criticized for code bloat and
+complexity, Protocol Buffers are based on a simple list and records
+model that is flexible and simple to use. The schema for structured
+protocol buffer data is defined in \texttt{.proto} files which may
+contain one or more message types. Each message type has one or more
+fields. A field is specified with a unique number, a name, a value
+type, and a field rule specifying whether the field is optional,
+required, or repeated. The supported value types are numbers,
+enumerations, booleans, strings, raw bytes, or other nested message
+types. The \texttt{.proto} file syntax for defining the structure of protocol
+buffer data is described comprehensively on Google Code\footnote{See
+\url{http://code.google.com/apis/protocolbuffers/docs/proto.html}.}.
+Table~\ref{tab:proto} shows an example \texttt{.proto} file which
+defines the \texttt{tutorial.Person} type. The R code in the right
+column shows an example of creating a new message of this type and
+populating its fields.
+
+%% TODO(de) Can we make this not break the width of the page?
+\noindent
+\begin{table}
+\begin{tabular}{p{.40\textwidth}p{0.55\textwidth}}
+\toprule
+Schema : \texttt{addressbook.proto} & Example R Session\\
+\cmidrule{1-2}
+\begin{minipage}{.40\textwidth}
+\vspace{2mm}
+\begin{example}
+package tutorial;
+message Person {
+ required string name = 1;
+ required int32 id = 2;
+ optional string email = 3;
+ enum PhoneType {
+ MOBILE = 0; HOME = 1;
+ WORK = 2;
+ }
+ message PhoneNumber {
+ required string number = 1;
+ optional PhoneType type = 2;
+ }
+ repeated PhoneNumber phone = 4;
+}
+\end{example}
+\vspace{2mm}
+\end{minipage} & \begin{minipage}{.55\textwidth}
+<<echo=TRUE>>=
+library(RProtoBuf)
+p <- new(tutorial.Person,id=1,name="Dirk")
+class(p)
+p$name
+p$name <- "Murray"
+cat(as.character(p))
+serialize(p, NULL)
+@
+\end{minipage} \\
+\bottomrule
+\end{tabular}
+\caption{The schema representation from a \texttt{.proto} file for the
+ \texttt{tutorial.Person} class (left) and simple R code for creating
+ an object of this class and accessing its fields (right).}
+\label{tab:proto}
+\end{table}
+
+
+% The schema can be used to generate model classes for statically-typed programming languages
+%such as C++ and Java, or can be used with reflection for dynamically-typed programming
+%languages.
+
+% TODO(mstokely): Maybe find a place to add this?
+% Since their
+% introduction, Protocol Buffers have been widely adopted in industry with
+% applications as varied as %database-internal messaging (Drizzle), % DE: citation?
+% Sony Playstations, Twitter, Google Search, Hadoop, and Open Street
+% Map.
+
+% TODO(DE): This either needs a citation, or remove the name drop
+% MS: These are mostly from blog posts, I can't find a good reference
+% that has a long list, and the name and year citation style seems
+% less conducive to long lists of marginal citations like blog posts
+% compared to say concise CS/math style citations [3,4,5,6]. Thoughts?
+
+
+% The schema can be used to generate classes for statically-typed programming languages
+% such as C++ and Java, or can be used with reflection for dynamically-typed programming
+% languages.
+
+
+
%Protocol buffers are a language-neutral, platform-neutral, extensible
%way of serializing structured data for use in communications
%protocols, data storage, and more.
@@ -345,16 +415,6 @@
%buffers are also forward compatible: updates to the \texttt{proto}
%files do not break programs built against the previous specification.
-%While benchmarks are not available, Google states on the project page that in
-%comparison to XML, protocol buffers are at the same time \textsl{simpler},
-%between three to ten times \textsl{smaller}, between twenty and one hundred
-%times \textsl{faster}, as well as less ambiguous and easier to program.
-
-%The flexibility of the reflection-based API is particularly well
-%suited for interactive data analysis.
-
-% XXX Design tradeoffs: reflection vs proto compiler
-
For added speed and efficiency, the C++, Java, and Python bindings to
Protocol Buffers are used with a compiler that translates a Protocol
Buffer schema description file (ending in \texttt{.proto}) into
@@ -364,7 +424,8 @@
interactive data analysis.
All messages in R have a single class
structure, but different accessor methods are created at runtime based
-on the named fields of the specified message type.
+on the named fields of the specified message type, as described in the
+next section.
% In other words, given the 'proto'
%description file, code is automatically generated for the chosen
@@ -388,40 +449,19 @@
%languages to support protocol buffers is compiled as part of the
%project page: \url{http://code.google.com/p/protobuf/wiki/ThirdPartyAddOns}
-\begin{figure}[t]
-\begin{center}
-\includegraphics[width=\textwidth]{protobuf-distributed-system-crop.pdf}
-\end{center}
-\caption{Example protobuf usage}
-\label{fig:protobuf-distributed-usecase}
-\end{figure}
-
\section{Basic Usage: Messages and Descriptors}
\label{sec:rprotobuf-basic}
This section describes how to use the R API to create and manipulate
protocol buffer messages in R, and how to read and write the
-binary \emph{payload} of the messages to files and arbitrary binary
+binary representation of the message (often called the \emph{payload}) to files and arbitrary binary
R connections.
-
The two fundamental building blocks of Protocol Buffers are \emph{Messages}
and \emph{Descriptors}. Messages provide a common abstract encapsulation of
structured data fields of the type specified in a Message Descriptor.
Message Descriptors are defined in \texttt{.proto} files and define a
schema for a particular named class of messages.
-Table~\ref{tab:proto} shows an example \texttt{.proto} file which
-defines the \texttt{tutorial.Person} type. The R code in the right
-column shows an example of creating a new message of this type and
-populating its fields. A \texttt{.proto} file may contain one or more
-message types, and each message type has one or more fields. A field
-is specified with a unique number, a name, a value type, and a field
-rule specifying whether the field is optional, required, or repeated.
-The supported value types are numbers, enumerations, booleans,
-strings, raw bytes, or other nested message types.
-The \texttt{.proto} file syntax for defining the structure of protocol
-buffer data is described comprehensively on Google Code\footnote{See
-\url{http://code.google.com/apis/protocolbuffers/docs/proto.html}.}.
% Commented out because we said this earlier.
%This separation
@@ -438,51 +478,6 @@
%languages. The definition
-%% TODO(de) Can we make this not break the width of the page?
-\noindent
-\begin{table}
-\begin{tabular}{p{.40\textwidth}p{0.55\textwidth}}
-\toprule
-Schema : \texttt{addressbook.proto} & Example R Session\\
-\cmidrule{1-2}
-\begin{minipage}{.40\textwidth}
-\vspace{2mm}
-\begin{example}
-package tutorial;
-message Person {
- required string name = 1;
- required int32 id = 2;
- optional string email = 3;
- enum PhoneType {
- MOBILE = 0; HOME = 1;
- WORK = 2;
- }
- message PhoneNumber {
- required string number = 1;
- optional PhoneType type = 2;
- }
- repeated PhoneNumber phone = 4;
-}
-\end{example}
-\vspace{2mm}
-\end{minipage} & \begin{minipage}{.55\textwidth}
-<<echo=TRUE>>=
-library(RProtoBuf)
-p <- new(tutorial.Person,id=1,name="Dirk")
-class(p)
-p$name
-p$name <- "Murray"
-cat(as.character(p))
-serialize(p, NULL)
-@
-\end{minipage} \\
-\bottomrule
-\end{tabular}
-\caption{The schema representation from a \texttt{.proto} file for the
- \texttt{tutorial.Person} class (left) and simple R code for creating
- an object of this class and accessing its fields (right).}
-\label{tab:proto}
-\end{table}
%This section may contain a figure such as Figure~\ref{figure:rlogo}.
%
@@ -663,7 +658,7 @@
the human-readable ASCII output that is created with
\code{as.character}.
-The binary representation of the message (often called the payload)
+The binary representation of the message
does not contain information that can be used to dynamically
infer the message type, so we have to provide this information
to the \texttt{read} function in the form of a descriptor :
@@ -1250,8 +1245,12 @@
The previous sections discussed functionality in the \pkg{RProtoBuf} package
for creating, manipulating, parsing and serializing Protocol Buffer
-messages of a specific pre-defined schema. The package also provides
-methods for converting arbitrary R data structures into protocol
+messages of a pre-defined schema. This is useful when there are
+pre-existing systems with defined schemas or significant software
+components written in other languages that need to be accessed from
+within R.
+
+The package also provides methods for converting arbitrary R data structures into protocol
buffers and vice versa with a universal R object schema. The \texttt{serialize\_pb} and \texttt{unserialize\_pb}
functions serialize arbitrary R objects into a universal Protocol Buffer
message:
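
For reference, a minimal sketch of the two usage modes that the revised
section 6 text distinguishes: messages of a pre-defined schema, and arbitrary
R objects stored with the universal R object schema. It is an illustration
only and not part of the r805 diff. It assumes that the RProtoBuf package is
installed and that attaching it makes the tutorial.Person descriptor available,
as the example session in Table tab:proto relies on; the variable names are
hypothetical.

```r
# Sketch only: assumes RProtoBuf is installed and that attaching it makes the
# tutorial.Person descriptor (from addressbook.proto) available.
library(RProtoBuf)

# Mode 1: a message of a pre-defined schema.
p <- new(tutorial.Person, id = 1, name = "Murray")
payload <- serialize(p, NULL)        # raw vector holding the binary payload

# The payload does not identify its own message type, so the descriptor
# has to be supplied when reading it back.
q <- read(tutorial.Person, payload)
stopifnot(identical(q$name, "Murray"), q$id == 1)

# Mode 2: an arbitrary R object via the universal R object schema.
bytes <- serialize_pb(iris, NULL)    # NULL connection returns a raw vector
stopifnot(identical(unserialize_pb(bytes), iris))
```

The first mode fits the case named in the log message, where a schema already
exists in another system; the second needs no schema written by the user.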