[Rprotobuf-commits] r877 - papers/jss

noreply at r-forge.r-project.org
Sun Mar 23 23:44:03 CET 2014


Author: edd
Date: 2014-03-23 23:44:03 +0100 (Sun, 23 Mar 2014)
New Revision: 877

Modified:
   papers/jss/article.Rnw
   papers/jss/article.bib
Log:
two new citations (as, I think, suggested by the note)
minor twiddling with floats; slightly narrower figures


Modified: papers/jss/article.Rnw
===================================================================
--- papers/jss/article.Rnw	2014-03-17 02:28:07 UTC (rev 876)
+++ papers/jss/article.Rnw	2014-03-23 22:44:03 UTC (rev 877)
@@ -16,7 +16,6 @@
 %
 % Local helpers to make this more compatible with R Journal style.
 %
-\newcommand{\CRANpkg}[1]{\pkg{#1}}
 \RequirePackage{fancyvrb}
 \RequirePackage{alltt}
 \DefineVerbatimEnvironment{example}{Verbatim}{}
@@ -30,33 +29,32 @@
 %% for pretty printing and a nice hypersummary also set:
 \Plainauthor{Dirk Eddelbuettel, Murray Stokely, Jeroen Ooms} %% comma-separated
 \Plaintitle{RProtoBuf: Efficient Cross-Language Data Serialization in R}
-\Shorttitle{\CRANpkg{RProtoBuf}: Protocol Buffers in \proglang{R}} %% a short title (if necessary)
+\Shorttitle{\pkg{RProtoBuf}: Protocol Buffers in \proglang{R}} %% a short title (if necessary)
 
 %% an abstract and keywords
 \Abstract{
-Modern data collection and analysis pipelines often involve
-a sophisticated mix of applications written in general purpose and
-specialized programming languages.  
-Many formats commonly used to import and export data between
-different programs or systems, such as \texttt{CSV} or \texttt{JSON}, are
-verbose, inefficient, not type-safe, or tied to a specific programming language.
-Protocol Buffers are a popular
-method of serializing structured data between applications---while remaining
-independent of programming languages or operating systems.
-They offer a unique combination of features, performance, and maturity that seems
-particularly well suited for data-driven applications and numerical
-computing.
-The
-\CRANpkg{RProtoBuf} package provides a complete interface to Protocol
-Buffers from the
-\proglang{R} environment for statistical computing.
-This paper outlines the general class of data serialization
-requirements for statistical computing, describes the implementation
-of the \CRANpkg{RProtoBuf} package, and illustrates its use with
-example applications in large-scale data collection pipelines and web
-services.
-%TODO(ms) keep it less than 150 words. -- I think this may be 154,
-%depending how emacs is counting.
+  Modern data collection and analysis pipelines often involve
+  a sophisticated mix of applications written in general purpose and
+  specialized programming languages.  
+  Many formats commonly used to import and export data between
+  different programs or systems, such as \texttt{CSV} or \texttt{JSON}, are
+  verbose, inefficient, not type-safe, or tied to a specific programming language.
+  Protocol Buffers are a popular
+  method of serializing structured data between applications---while remaining
+  independent of programming languages or operating systems.
+  They offer a unique combination of features, performance, and maturity that seems
+  particularly well suited for data-driven applications and numerical
+  computing.
+  The \pkg{RProtoBuf} package provides a complete interface to Protocol
+  Buffers from the
+  \proglang{R} environment for statistical computing.
+  This paper outlines the general class of data serialization
+  requirements for statistical computing, describes the implementation
+  of the \pkg{RProtoBuf} package, and illustrates its use with
+  example applications in large-scale data collection pipelines and web
+  services.
+  %% TODO(ms) keep it less than 150 words. -- I think this may be 154,
+  %% depending how emacs is counting.
 }
 \Keywords{\proglang{R}, \pkg{Rcpp}, Protocol Buffers, serialization, cross-platform}
 \Plainkeywords{R, Rcpp, Protocol Buffers, serialization, cross-platform} %% without formatting
@@ -194,29 +192,31 @@
 Once the data serialization needs of an application become complex
 enough, developers typically benefit from the use of an
 \emph{interface description language}, or \emph{IDL}.  IDLs like
-Protocol Buffers \citep{protobuf}, Apache Thrift, and Apache Avro
+Protocol Buffers \citep{protobuf}, Apache Thrift \citep{Apache:Thrift}, and Apache Avro \citep{Apache:Avro}
 provide a compact well-documented schema for cross-language data
 structures and efficient binary interchange formats.  Since the schema
 is provided separately from the data, the data can be
 efficiently encoded to minimize storage costs when
 compared with simple ``schema-less'' binary interchange formats.
-Many sources compare data serialization formats
-and show Protocol Buffers perform favorably to the alternatives; see
-\citet{Sumaray:2012:CDS:2184751.2184810} for one such comparison.
+%Many sources compare data serialization formats
+%and show Protocol Buffers perform favorably to the alternatives; see
+%\citet{Sumaray:2012:CDS:2184751.2184810} for one such comparison.
+Protocol Buffers performs well in the comparison of such formats by
+\citet{Sumaray:2012:CDS:2184751.2184810}.
 
 This paper describes an \proglang{R} interface to Protocol Buffers,
 and is organized as follows. Section~\ref{sec:protobuf}
 provides a general high-level overview of Protocol Buffers as well as a basic
 motivation for their use.
 Section~\ref{sec:rprotobuf-basic} describes the interactive \proglang{R} interface
-provided by the \CRANpkg{RProtoBuf} package, and introduces the two main abstractions:
+provided by the \pkg{RProtoBuf} package, and introduces the two main abstractions:
 \emph{Messages} and \emph{Descriptors}.  Section~\ref{sec:rprotobuf-classes}
 details the implementation details of the main S4 classes and methods.  
 Section~\ref{sec:types} describes the challenges of type coercion
 between \proglang{R} and other languages.  Section~\ref{sec:evaluation} introduces a
 general \proglang{R} language schema for serializing arbitrary \proglang{R} objects and evaluates
 it against the serialization capabilities built directly into \proglang{R}.  Sections~\ref{sec:mapreduce}
-and \ref{sec:opencpu} provide real-world use cases of \CRANpkg{RProtoBuf}
+and \ref{sec:opencpu} provide real-world use cases of \pkg{RProtoBuf}
 in MapReduce and web service environments, respectively, before
 Section~\ref{sec:summary} concludes.
 
@@ -238,9 +238,10 @@
   decade.
 \end{itemize}
 
-\begin{figure}[bp]
+%\begin{figure}[bp]
+\begin{figure}[h!]
 \begin{center}
-\includegraphics[width=\textwidth]{figures/protobuf-distributed-system-crop.pdf}
+\includegraphics[width=0.9\textwidth]{figures/protobuf-distributed-system-crop.pdf}
 \end{center}
 \caption{Example usage of Protocol Buffers.}
 \label{fig:protobuf-distributed-usecase}
@@ -254,8 +255,8 @@
 request, and respond with a new Protocol Buffer over the network. 
 The key difference to, say, a request to an \pkg{Rserve} 
 \citep{Urbanek:2003:Rserve,CRAN:Rserve} instance is that
-the remote server may be implemented in any language, with no
-dependence on \proglang{R}.
+the remote server may be implemented in any language.
+%, with no dependence on \proglang{R}.
 
 While traditional IDLs have at times been criticized for code bloat and
 complexity, Protocol Buffers are based on a simple list and records
@@ -480,7 +481,7 @@
 
 \subsection{Parsing messages}
 
-The \CRANpkg{RProtoBuf} package defines the \code{read} and
+The \pkg{RProtoBuf} package defines the \code{read} and
 \code{readASCII} functions to read messages from files, raw vectors,
 or arbitrary connections.  \code{read} expects to read the message
 payload from binary files or connections and \code{readASCII} parses
@@ -533,13 +534,13 @@
 \section{Under the hood: S4 classes, methods, and pseudo methods}
 \label{sec:rprotobuf-classes}
 
-The \CRANpkg{RProtoBuf} package uses the S4 system to store
+The \pkg{RProtoBuf} package uses the S4 system to store
 information about descriptors and messages.  Using the S4 system
 allows the package to dispatch methods that are not
 generic in the S3 sense, such as \texttt{new} and
 \texttt{serialize}.
 Table~\ref{class-summary-table} lists the six
-primary Message and Descriptor classes in \CRANpkg{RProtoBuf}.  Each \proglang{R} object
+primary Message and Descriptor classes in \pkg{RProtoBuf}.  Each \proglang{R} object
 contains an external pointer to an object managed by the
 \texttt{protobuf} \proglang{C++} library, and the \proglang{R} objects make calls into more
 than 100 \proglang{C++} functions that provide the
@@ -564,19 +565,19 @@
   dispatch relationships.}
 \end{table}
 
-The \CRANpkg{Rcpp} package
+The \pkg{Rcpp} package
 \citep{eddelbuettel2011rcpp,eddelbuettel2013seamless} is used to 
 facilitate this integration of the \proglang{R} and \proglang{C++} code for these objects.
 Each method is wrapped individually which allows us to add 
 user-friendly custom error handling, type coercion, and performance
 improvements at the cost of a more verbose implementation.
-The \CRANpkg{RProtoBuf} package in many ways motivated
-the development of \CRANpkg{Rcpp} Modules \citep{eddelbuettel2013exposing},
+The \pkg{RProtoBuf} package in many ways motivated
+the development of \pkg{Rcpp} Modules \citep{eddelbuettel2013exposing},
 which provide a more concise way of wrapping \proglang{C++} functions and classes
 in a single entity.
 
 
-The \CRANpkg{RProtoBuf} package supports two forms for calling
+The \pkg{RProtoBuf} package supports two forms for calling
 functions with these S4 classes:
 \begin{itemize}
 \item The functional dispatch mechanism of the the form
@@ -585,7 +586,7 @@
   \verb|object$method(arguments)|.
 \end{itemize}
 
-Additionally, \CRANpkg{RProtoBuf} supports tab completion for all
+Additionally, \pkg{RProtoBuf} supports tab completion for all
 classes.  Completion possibilities include pseudo-method names for all
 classes, plus \emph{dynamic dispatch} on names or types specific to a given
 object.  This functionality is implemented with the
@@ -595,7 +596,7 @@
 \subsection{Messages}
 
 The \texttt{Message} S4 class represents Protocol Buffer Messages and
-is the core abstraction of \CRANpkg{RProtoBuf}. Each \texttt{Message}
+is the core abstraction of \pkg{RProtoBuf}. Each \texttt{Message}
 contains a pointer to a \texttt{Descriptor} which defines the schema
 of the data defined in the Message, as well as a number of
 \texttt{FieldDescriptors} for the individual fields of the message.  A
@@ -659,7 +660,7 @@
 used to retrieve descriptors that are contained in the descriptor, or
 invoke pseudo-methods.
 
-When \CRANpkg{RProtoBuf} is first loaded it calls
+When \pkg{RProtoBuf} is first loaded it calls
 \texttt{readProtoFiles} to read in the example \texttt{addressbook.proto} file
 included with the package.  The \texttt{tutorial.Person} descriptor
 and all other descriptors defined in the loaded \texttt{.proto} files are
@@ -1028,9 +1029,9 @@
 @
 
 However, most modern languages do have support for 64-bit integers, 
-which becomes problematic when \CRANpkg{RProtoBuf} is used to exchange data 
+which becomes problematic when \pkg{RProtoBuf} is used to exchange data 
 with a system that requires this integer type. To work around this, 
-\CRANpkg{RProtoBuf} allows users to get and set 64-bit integer values by specifying 
+\pkg{RProtoBuf} allows users to get and set 64-bit integer values by specifying 
 them as character strings.
 
 If we try to set an int64 field in \proglang{R} to double values, we lose
@@ -1042,7 +1043,7 @@
 length(unique(test$repeated_int64))
 @
 
-But when the values are specified as character strings, \CRANpkg{RProtoBuf}
+But when the values are specified as character strings, \pkg{RProtoBuf}
 will automatically coerce them into a true 64-bit integer types 
 before storing them in the Protocol Buffer message:
 
@@ -1055,7 +1056,7 @@
 will be returned if the \code{RProtoBuf.int64AsString} option is set
 to \texttt{TRUE}.  The character values are useful because they can
 accurately be used as unique identifiers and can easily be passed to \proglang{R}
-packages such as \CRANpkg{int64} \citep{int64} or \CRANpkg{bit64}
+packages such as \pkg{int64} \citep{int64} or \pkg{bit64}
 \citep{bit64} which represent 64-bit integers in \proglang{R}.
 
 <<>>=
@@ -1074,7 +1075,7 @@
 \section[Converting R data structures into Protocol Buffers]{Converting \proglang{R} data structures into Protocol Buffers}
 \label{sec:evaluation}
 
-The previous sections discussed functionality in the \CRANpkg{RProtoBuf} package
+The previous sections discussed functionality in the \pkg{RProtoBuf} package
 for creating, manipulating, parsing, and serializing Protocol Buffer
 messages of a defined schema.  This is useful when there are
 pre-existing systems with defined schemas or significant software
@@ -1090,12 +1091,12 @@
 identical(iris, unserialize_pb(msg))
 @
 
-In order to accomplish this, \CRANpkg{RProtoBuf} uses the same catch-all \texttt{proto}
+In order to accomplish this, \pkg{RProtoBuf} uses the same catch-all \texttt{proto}
 schema used by \pkg{RHIPE} for exchanging \proglang{R} data with Hadoop \citep{rhipe}. This 
 schema, which we will refer to as \texttt{rexp.proto}, is printed in
 %appendix \ref{rexp.proto}.
 the appendix.
-The Protocol Buffer messages generated by \CRANpkg{RProtoBuf} and
+The Protocol Buffer messages generated by \pkg{RProtoBuf} and
 \pkg{RHIPE} are naturally compatible between the two systems because they use the 
 same schema. This shows the power of using a schema-based cross-platform format such
 as Protocol Buffers: interoperability is achieved without effort or close coordination.
@@ -1201,11 +1202,11 @@
 %in multiple languages instead of requiring other programs to parse the \proglang{R}
 %serialization format. % \citep{serialization}.
 One takeaway from this table is that the universal \proglang{R} object schema
-included in \CRANpkg{RProtoBuf} does not in general provide
+included in \pkg{RProtoBuf} does not in general provide
 any significant saving in file size compared to the normal serialization
 mechanism in \proglang{R}.
 % redundant: which is seen as equally compact.
-The benefits of \CRANpkg{RProtoBuf} accrue more naturally in applications where
+The benefits of \pkg{RProtoBuf} accrue more naturally in applications where
 multiple programming languages are involved, or when a more concise
 application-specific schema has been defined.  The example in the next
 section satisfies both of these conditions.
@@ -1279,7 +1280,7 @@
 \end{tabular}
 }
 \caption{Serialization sizes for default serialization in \proglang{R} and
-  \CRANpkg{RProtoBuf} for 50 \proglang{R} data sets.}
+  \pkg{RProtoBuf} for 50 \proglang{R} data sets.}
 \label{tab:compression}
 \end{center}
 \end{table}
@@ -1317,7 +1318,7 @@
 
 \begin{figure}[h!]
 \begin{center}
-\includegraphics[width=\textwidth]{figures/histogram-mapreduce-diag1.pdf}
+\includegraphics[width=0.9\textwidth]{figures/histogram-mapreduce-diag1.pdf}
 \end{center}
 \caption{Diagram of MapReduce histogram generation pattern.}
 \label{fig:mr-histogram-pattern1}
@@ -1331,8 +1332,8 @@
 share a schema of the histogram representation to coordinate
 effectively.
 
-The \CRANpkg{HistogramTools} package \citep{histogramtools} enhances
-\CRANpkg{RProtoBuf} by providing a concise schema for \proglang{R} histogram objects:
+The \pkg{HistogramTools} package \citep{histogramtools} enhances
+\pkg{RProtoBuf} by providing a concise schema for \proglang{R} histogram objects:
 
 \begin{example}
 package HistogramTools;
@@ -1439,7 +1440,7 @@
 function calls, and arguments/return values can be posted/retrieved
 using several data interchange formats, such as Protocol Buffers.  
 OpenCPU uses the \texttt{serialize\_pb} and \texttt{unserialize\_pb} functions
-from the \CRANpkg{RProtoBuf} package to convert between \proglang{R} objects and protobuf
+from the \pkg{RProtoBuf} package to convert between \proglang{R} objects and protobuf
 messages. Therefore, clients need the \texttt{rexp.proto} descriptor mentioned
 earlier to parse and generate protobuf messages when interacting with OpenCPU.
 
@@ -1535,7 +1536,7 @@
 are contained within a list.
 
 <<eval=FALSE>>=
-library("httr")       #requires httr >= 0.2.99
+library("httr")       
 library("RProtoBuf")
 
 args <- list(n=42, mean=100)
@@ -1576,13 +1577,13 @@
 performance, and maturity, that seems particularly well suited for data-driven 
 applications and numerical computing.
 
-The \CRANpkg{RProtoBuf} package builds on the Protocol Buffers \proglang{C++} library, 
+The \pkg{RProtoBuf} package builds on the Protocol Buffers \proglang{C++} library, 
 and extends the \proglang{R} system with the ability to create, read,
 write, parse, and manipulate Protocol
-Buffer messages. \CRANpkg{RProtoBuf} has been used extensively inside Google 
+Buffer messages. \pkg{RProtoBuf} has been used extensively inside Google 
 for the past three years by statisticians, analysts, and software engineers.
 At the time of this writing there are over 300 active
-users of \CRANpkg{RProtoBuf} using it to read data from and otherwise interact
+users of \pkg{RProtoBuf} using it to read data from and otherwise interact
 with distributed systems written in \proglang{C++}, \proglang{Java}, \proglang{Python}, and 
 other languages. We hope that making Protocol Buffers available to the
 \proglang{R} community will contribute towards better software integration
@@ -1593,11 +1594,11 @@
 
 \section*{Acknowledgments}
 
-The first versions of \CRANpkg{RProtoBuf} were written during 2009-2010.
+The first versions of \pkg{RProtoBuf} were written during 2009 - 2010.
 Very significant contributions, both in code and design, were made by
 Romain Fran\c{c}ois whose continued influence on design and code is
 greatly appreciated. Several features of the package reflect
-the design of the \CRANpkg{rJava} package by Simon Urbanek.
+the design of the \pkg{rJava} package by Simon Urbanek.
 The user-defined table mechanism, implemented by Duncan Temple Lang for the
 purpose of the \pkg{RObjectTables} package, allows for the dynamic symbol lookup.
 Kenton Varda was generous with his time in reviewing code and explaining
@@ -1618,7 +1619,7 @@
 \label{rexp.proto}
 
 Below a print of the \texttt{rexp.proto} schema (originally designed by \cite{rhipe})
-that is included with the \CRANpkg{RProtoBuf} package and used by \texttt{serialize\_pb} and
+that is included with the \pkg{RProtoBuf} package and used by \texttt{serialize\_pb} and
 \texttt{unserialize\_pb}.
 
 \begin{verbatim}

Modified: papers/jss/article.bib
===================================================================
--- papers/jss/article.bib	2014-03-17 02:28:07 UTC (rev 876)
+++ papers/jss/article.bib	2014-03-23 22:44:03 UTC (rev 877)
@@ -145,8 +145,7 @@
 @misc{serialization,
   author =       {Luke Tierney},
   title =        {A New Serialization Mechanism for R},
-  url =
-                  {http://www.cs.uiowa.edu/~luke/R/serialize/serialize.ps},
+  url =          {http://www.cs.uiowa.edu/~luke/R/serialize/serialize.ps},
   year =         2003,
 }
 
@@ -431,8 +430,8 @@
 @Manual{httr,
   title =        {httr: Tools for Working with URLs and HTTP},
   author =       {Hadley Wickham},
-  year =         2012,
-  note =         {R package version 0.2},
+  year =         2014,
+  note =         {R package version 0.3},
   url =          {http://CRAN.R-project.org/package=httr},
 }
 
@@ -487,3 +486,20 @@
   url		= {http://www.ci.tuwien.ac.at/Conferences/DSC-2003/Proceedings/},
   note		= {{ISSN 1609-395X}}
 }
+
+@Misc{Apache:Avro,
+  author =       {{Apache Software Foundation}},
+  title =        {Apache Avro},
+  url =          {http://avro.apache.org},
+  note =         {Data Serialization System, Version 1.7.6},
+  year =         2014
+}
+
+@Misc{Apache:Thrift,
+  author =       {{Apache Software Foundation}},
+  title =        {Apache Thrift},
+  url =          {http://thrift.apache.org},
+  note =         {Software Framework for Scalable Cross-Language Services, Version 0.9.1},
+  year =         2013
+}
+


