[Rprotobuf-commits] r729 - papers/jss

noreply at r-forge.r-project.org noreply at r-forge.r-project.org
Thu Jan 9 02:35:01 CET 2014


Author: murray
Date: 2014-01-09 02:35:01 +0100 (Thu, 09 Jan 2014)
New Revision: 729

Added:
   papers/jss/article.tex
Modified:
   papers/jss/eddelbuettel-stokely.bib
Log:
Check in the new JSS article, and rewrite/improve my MapReduce example
application section.



Added: papers/jss/article.tex
===================================================================
--- papers/jss/article.tex	                        (rev 0)
+++ papers/jss/article.tex	2014-01-09 01:35:01 UTC (rev 729)
@@ -0,0 +1,1813 @@
+\documentclass[article]{jss}
+\usepackage{booktabs}
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%% declarations for jss.cls %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+%
+% Local helpers to make this more compatible with R Journal style.
+%
+\newcommand{\CRANpkg}[1]{\pkg{#1}}
+\RequirePackage{fancyvrb}
+\RequirePackage{alltt}
+\DefineVerbatimEnvironment{example}{Verbatim}{}
+
+%% almost as usual
+\author{Dirk Eddelbuettel\\Debian and R Projects \And 
+        Murray Stokely\\Google, Inc}
+\title{\pkg{RProtoBuf}: Efficient Cross-Language Data Serialization in R}
+
+%% for pretty printing and a nice hypersummary also set:
+\Plainauthor{Dirk Eddelbuettel, Murray Stokely} %% comma-separated
+\Plaintitle{RProtoBuf: Efficient Cross-Language Data Serialization in R}
+\Shorttitle{\pkg{RProtoBuf}: Protocol Buffers in R} %% a short title (if necessary)
+
+%% an abstract and keywords
+\Abstract{
+Modern data collection and analysis pipelines often involve
+a sophisticated mix of applications written in general purpose and
+specialized programming languages.  Protocol Buffers are a popular
+method of serializing structured data between applications---while remaining
+independent of programming language or operating system.  The
+\CRANpkg{RProtoBuf} package provides a complete interface between this
+library and the R environment for statistical computing.
+}
+\Keywords{r, protocol buffers, serialization, cross-platform}
+\Plainkeywords{r, protocol buffers, serialization, cross-platform} %% without formatting
+%% at least one keyword must be supplied
+
+%% publication information
+%% NOTE: Typically, this can be left commented and will be filled out by the technical editor
+%% \Volume{50}
+%% \Issue{9}
+%% \Month{June}
+%% \Year{2012}
+%% \Submitdate{2012-06-04}
+%% \Acceptdate{2012-06-04}
+
+%% The address of (at least) one author should be given
+%% in the following format:
+\Address{
+  Dirk Eddelbuettel\\
+  \\
+  Murray Stokely\\
+  Google, Inc.\\
+  1600 Amphitheatre Parkway\\
+  Mountain View, CA 94040\\
+  USA\\
+  E-mail: \email{mstokely at google.com}\\
+  URL: \url{http://www.stokely.org/}
+}
+%% It is also possible to add a telephone and fax number
+%% before the e-mail in the following format:
+%% Telephone: +43/512/507-7103
+%% Fax: +43/512/507-2851
+
+%% for those who use Sweave please include the following line (with % symbols):
+%% need no \usepackage{Sweave.sty}
+
+%% end of declarations %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+
+\begin{document}
+
+
+%% include your article here, just as usual
+%% Note that you should use the \pkg{}, \proglang{} and \code{} commands.
+
+
+% We don't want a left margin for Sinput or Soutput for our table 1.
+%\DefineVerbatimEnvironment{Sinput}{Verbatim} {xleftmargin=0em}
+%\DefineVerbatimEnvironment{Soutput}{Verbatim}{xleftmargin=0em}
+%\DefineVerbatimEnvironment{Scode}{Verbatim}{xleftmargin=2em}
+% Setting the topsep to 0 reduces spacing from input to output and
+% improves table 1.
+\fvset{listparameters={\setlength{\topsep}{0pt}}}
+\renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}}
+
+\title{RProtoBuf: Efficient Cross-Language Data Serialization in R}
+\author{by Dirk Eddelbuettel and Murray Stokely}
+
+%% DE: I tend to have wider option(width=...) so this
+%%     guarantees better line breaks
+
+\maketitle
+
+\abstract{Modern data collection and analysis pipelines often involve
+ a sophisticated mix of applications written in general purpose and
+ specialized programming languages.  Protocol Buffers are a popular
+ method of serializing structured data between applications---while remaining
+ independent of programming language or operating system.  The
+ \CRANpkg{RProtoBuf} package provides a complete interface between this
+ library and the R environment for statistical computing.
+ %TODO(ms) keep it less than 150 words.
+}
+
+%TODO(de) 'protocol buffers' or 'Protocol Buffers' ?
+
+\section{Introduction}
+
+Modern data collection and analysis pipelines are increasingly being
+built using collections of components to better manage software
+complexity through reusability, modularity, and fault
+isolation \citep{Wegiel:2010:CTT:1932682.1869479}.
+Data analysis patterns such as Split-Apply-Combine
+\citep{wickham2011split} explicitly break up large problems into
+manageable pieces.  These patterns are frequently employed with
+different programming languages for the different phases of data
+analysis (collection, cleaning, analysis, post-processing, and
+presentation) in order to take advantage of the unique combination of
+performance, speed of development, and library support offered by
+each environment.  Each stage of the data
+analysis pipeline may involve storing intermediate results in a
+file or sending them over the network.
+% DE: Nice!
+
+Given these requirements, how do we safely share intermediate results
+between different applications, possibly written in different
+languages, and possibly running on different computer systems, possibly
+spanning different operating systems?  Programming
+languages such as R, Julia, Java, and Python include built-in
+serialization support, but these formats are tied to the specific
+% DE: need to define serialization?
+programming language in use and thus lock the user into a single
+environment.  CSV files can be read and written by many applications
+and so are often used for exporting tabular data.  However, CSV files
+have a number of disadvantages, such as a limitation of exporting only
+tabular datasets, lack of type-safety, inefficient text representation
+and parsing, and ambiguities in the format involving special
+characters.  JSON is another widely supported format, used mostly on
+the web, that removes many of these disadvantages, but it too is
+comparatively slow to parse and does not distinguish between integer
+and floating-point numbers.  Because the schema information
+is not kept separately, multiple JSON messages of the same type
+needlessly duplicate the field names with each message.
+%
+%
+%
+A number of binary formats based on JSON have been proposed that
+reduce the parsing cost and improve the efficiency.  MessagePack
+\citep{msgpackR} and BSON \citep{rmongodb} both have R interfaces, but
+these formats lack a separate schema for the serialized data and thus
+still duplicate field names with each message sent over the network or
+stored in a file.  Such formats also lack support for versioning when
+data storage needs evolve over time, or when changes in application
+logic and requirements dictate updates to the message format.
+% DE: Need to talk about XML ?
+
+Once the data serialization needs of an application become complex
+enough, developers typically benefit from the use of an
+\emph{interface description language}, or \emph{IDL}.  IDLs like
+Protocol Buffers \citep{protobuf}, Apache Thrift, and Apache Avro provide a compact,
+well-documented schema for cross-language data structures and
+efficient binary interchange formats.  The schema can be used to
+generate model classes for statically-typed programming languages such
+as C++ and Java, or can be used with reflection for dynamically-typed
+programming languages.  Since the schema is provided separately from
+the encoded data, the data can be encoded more compactly, reducing
+storage costs compared with simple
+``schema-less'' binary interchange formats.
+
+% TODO(mstokely): Take a more conversational tone here asking
+% questions and motivating protocol buffers?
+
+% TODO(mstokely): If we go to JSS, include a larger paragraph here
+% referencing each numbered section.  I don't like these generally,
+% but its useful for this paper I think because we have a boring bit
+% in the middle (full class/method details) and interesting
+% applications at the end.
+This article describes the basics of Google's Protocol Buffers through
+an easy-to-use R package, \CRANpkg{RProtoBuf}.  After describing the
+basics of protocol buffers and \CRANpkg{RProtoBuf}, we illustrate
+several common use cases for protocol buffers in data analysis.
+
+\section{Protocol Buffers}
+
+% FIXME: add an introductory paragraph for this section.
+
+% This content is good.  Maybe use and cite?
+% http://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html
+
+
+%% TODO(de,ms)  What follows is oooooold and was lifted from the webpage
+%%              Rewrite?
+Protocol Buffers can be described as a modern, language-neutral, platform-neutral,
+extensible mechanism for sharing and storing structured data.  Since their
+introduction, Protocol Buffers have been widely adopted in industry with
+applications as varied as database-internal messaging (Drizzle), % DE: citation?
+Sony PlayStation, Twitter, Google Search, Hadoop, and OpenStreetMap.  While
+% TODO(DE): This either needs a citation, or remove the name drop
+traditional IDLs have at times been criticized for code bloat and
+complexity, Protocol Buffers are based on a simple list and records
+model that is comparatively flexible and simple to use.
+
+Some of the key features provided by Protocol Buffers for data analysis
+include:
+
+\begin{itemize}
+\item \emph{Portable}:  Allows users to send and receive data between
+  applications or different computers.
+\item \emph{Efficient}:  Data is serialized into a compact binary
+  representation for transmission or storage.
+\item \emph{Extensible}:  New fields can be added to Protocol Buffer schemas
+  in a forward-compatible way that does not break older applications.
+\item \emph{Stable}:  Protocol Buffers have been in wide use for over a
+  decade.
+\end{itemize}
+
+Figure~\ref{fig:protobuf-distributed-usecase} illustrates an example
+communication workflow with protocol buffers and an interactive R
+session.  Common use cases include populating a request RPC protocol
+buffer in R that is then serialized and sent over the network to a
+remote server.  The server would then deserialize the message, act on
+the request, and respond with a new protocol buffer over the network. The key
+difference from, say, a request to an Rserve instance is that the remote server
+need not know anything about the R language.
+
+%Protocol buffers are a language-neutral, platform-neutral, extensible
+%way of serializing structured data for use in communications
+%protocols, data storage, and more.
+
+%Protocol Buffers offer key features such as an efficient data interchange
+%format that is both language- and operating system-agnostic yet uses a
+%lightweight and highly performant encoding, object serialization and
+%de-serialization as well data and configuration management. Protocol
+%buffers are also forward compatible: updates to the \texttt{proto}
+%files do not break programs built against the previous specification.
+
+%While benchmarks are not available, Google states on the project page that in
+%comparison to XML, protocol buffers are at the same time \textsl{simpler},
+%between three to ten times \textsl{smaller}, between twenty and one hundred
+%times \textsl{faster}, as well as less ambiguous and easier to program.
+
+Comparisons of data serialization formats generally show protocol
+buffers performing favorably against the alternatives; see, for
+example, \citet{Sumaray:2012:CDS:2184751.2184810}.
+
+%The flexibility of the reflection-based API is particularly well
+%suited for interactive data analysis.
+
+% XXX Design tradeoffs: reflection vs proto compiler
+
+For added speed and efficiency, the C++, Java, and Python bindings to
+Protocol Buffers are used with a compiler that translates a protocol
+buffer schema description file (ending in \texttt{.proto}) into
+language-specific classes that can be used to create, read, write and
+manipulate protocol buffer messages.  The R interface, in contrast,
+uses a reflection-based API that is particularly well suited for
+interactive data analysis.  All messages in R have a single class
+structure, but different accessor methods are created at runtime based
+on the field names of the specified message type.
+
+% In other words, given the 'proto'
+%description file, code is automatically generated for the chosen
+%target language(s). The project page contains a tutorial for each of
+%these officially supported languages:
+%\url{http://code.google.com/apis/protocolbuffers/docs/tutorials.html}
+
+%The protocol buffers code is released under an open-source (BSD) license. The
+%protocol buffer project (\url{http://code.google.com/p/protobuf/})
+%contains a C++ library and a set of runtime libraries and compilers for
+%C++, Java and Python.
+
+%With these languages, the workflow follows standard practice of so-called
+%Interface Description Languages (IDL)
+%(c.f. \href{http://en.wikipedia.org/wiki/Interface_description_language}{Wikipedia
+%  on IDL}).  This consists of compiling a protocol buffer description file
+%(ending in \texttt{.proto}) into language specific classes that can be used
+
+%Besides the officially supported C++, Java and Python implementations, several projects have been
+%created to support protocol buffers for many languages. The list of known
+%languages to support protocol buffers is compiled as part of the
+%project page: \url{http://code.google.com/p/protobuf/wiki/ThirdPartyAddOns}
+
+\begin{figure}[t]
+\begin{center}
+\includegraphics[width=\textwidth]{protobuf-distributed-system-crop.pdf}
+\end{center}
+\caption{Example protobuf usage}
+\label{fig:protobuf-distributed-usecase}
+\end{figure}
+
+\section{Basic Usage: Messages and Descriptors}
+
+This section describes how to use the R API to create and manipulate
+protocol buffer messages in R, and how to read and write the
+binary \emph{payload} of the messages to files and arbitrary binary
+R connections.
+
+The two fundamental building blocks of Protocol Buffers are Messages
+and Descriptors.  Messages provide a common abstract encapsulation of
+structured data fields of the type specified in a Message Descriptor.
+Message Descriptors are defined in \texttt{.proto} files and define a
+schema for a particular named class of messages.
+
+Table~\ref{tab:proto} shows an example \texttt{.proto} file which
+defines the \texttt{tutorial.Person} type.  The R code in the right
+column shows an example of creating a new message of this type and
+populating its fields.
+
+% Commented out because we said this earlier.
+%This separation
+%between schema and the message objects is in contrast to
+%more verbose formats like JSON, and when combined with the efficient
+%binary representation of any Message object explains a large part of
+%the performance and storage-space advantage offered by Protocol
+%Buffers. TODO(ms): we already said some of this above.  clean up.
+
+% lifted from protobuf page:
+%With Protocol Buffers you define how you want your data to be
+%structured once, and then you can read or write structured data to and
+%from a variety of data streams using a variety of different
+%languages.  The definition
+
+
+%% TODO(de) Can we make this not break the width of the page?
+\noindent
+\begin{table}
+\begin{tabular}{@{\hskip .01\textwidth}p{.40\textwidth}@{\hskip .02\textwidth}@{\hskip .02\textwidth}p{0.55\textwidth}@{\hskip .01\textwidth}}
+\toprule
+Schema : \texttt{addressbook.proto} & Example R Session\\
+\cmidrule{1-2}
+\begin{minipage}{.35\textwidth}
+\vspace{2mm}
+\begin{example}
+package tutorial;
+message Person {
+ required string name = 1;
+ required int32 id = 2;
+ optional string email = 3;
+ enum PhoneType {
+   MOBILE = 0; HOME = 1;
+   WORK = 2;
+ }
+ message PhoneNumber {
+   required string number = 1;
+   optional PhoneType type = 2;
+ }
+ repeated PhoneNumber phone = 4;
+}
+\end{example}
+\vspace{2mm}
+\end{minipage} & \begin{minipage}{.5\textwidth}
+\begin{Schunk}
+\begin{Sinput}
+R> library(RProtoBuf)
+R> p <- new(tutorial.Person, id=1, name="Dirk")
+R> class(p)
+\end{Sinput}
+\begin{Soutput}
+[1] "Message"
+attr(,"package")
+[1] "RProtoBuf"
+\end{Soutput}
+\begin{Sinput}
+R> p$name
+\end{Sinput}
+\begin{Soutput}
+[1] "Dirk"
+\end{Soutput}
+\begin{Sinput}
+R> p$name <- "Murray"
+R> cat(as.character(p))
+\end{Sinput}
+\begin{Soutput}
+name: "Murray"
+id: 1
+\end{Soutput}
+\begin{Sinput}
+R> serialize(p, NULL)
+\end{Sinput}
+\begin{Soutput}
+ [1] 0a 06 4d 75 72 72 61 79 10 01
+\end{Soutput}
+\end{Schunk}
+\end{minipage} \\
+\bottomrule
+\end{tabular}
+\caption{The schema representation from a \texttt{.proto} file for the
+  \texttt{tutorial.Person} class (left) and simple R code for creating
+  an object of this class and accessing its fields (right).}
+\label{tab:proto}
+\end{table}
+
+%This section may contain a figure such as Figure~\ref{figure:rlogo}.
+%
+%\begin{figure}[htbp]
+%  \centering
+%  \includegraphics{Rlogo}
+%  \caption{The logo of R.}
+%  \label{figure:rlogo}
+%\end{figure}
+
+\subsection{Importing Message Descriptors from .proto files}
+
+%The three basic abstractions of \CRANpkg{RProtoBuf} are Messages,
+%which encapsulate a data structure, Descriptors, which define the
+%schema used by one or more messages, and DescriptorPools, which
+%provide access to descriptors.
+
+Before one can create a new Protocol Buffer Message or parse a
+serialized stream of bytes as a Message, one must first read in the message
+type specification from a \texttt{.proto} file.
+
+New \texttt{.proto} files are imported with the \code{readProtoFiles}
+function, which can import a single file, all files in a directory, or
+all \texttt{.proto} files provided by another R package.
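+
+The three import modes can be sketched as follows.  The calls assume
+that the example schemas shipped with \CRANpkg{RProtoBuf} are installed
+in the package's \texttt{proto} directory, and the variable \code{f} is
+only an illustrative name.  Because these schemas are already read when
+the package is loaded, the calls are shown purely to illustrate the
+\code{files}, \code{dir}, and \code{package} arguments.
+
+\begin{Schunk}
+\begin{Sinput}
+R> library(RProtoBuf)
+R> f <- system.file("proto", "addressbook.proto", package = "RProtoBuf")
+R> readProtoFiles(files = f)              # a single file
+R> readProtoFiles(dir = dirname(f))       # all files in a directory
+R> readProtoFiles(package = "RProtoBuf")  # files provided by a package
+\end{Sinput}
+\end{Schunk}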
+
+The \texttt{.proto} file syntax for defining the structure of protocol
+buffer data is described comprehensively on Google Code:
+\url{http://code.google.com/apis/protocolbuffers/docs/proto.html}.
+
+Once the proto files are imported, all message descriptors
+are available in the R search path in the \texttt{RProtoBuf:DescriptorPool}
+special environment. The underlying mechanism used here is
+described in more detail in Section~\ref{sec-lookup}.
+
+\begin{Schunk}
+\begin{Sinput}
+R> ls("RProtoBuf:DescriptorPool")
+\end{Sinput}
+\begin{Soutput}
+ [1] "rexp.CMPLX"                  
+ [2] "rexp.REXP"                   
+ [3] "rexp.STRING"                 
+ [4] "rprotobuf.HelloWorldRequest" 
+ [5] "rprotobuf.HelloWorldResponse"
+ [6] "tutorial.AddressBook"        
+ [7] "tutorial.Person"             
+ [8] "tutorial.Test1"              
+ [9] "tutorial.Test2"              
+[10] "tutorial.Test3"              
+[11] "tutorial.Test4"              
+\end{Soutput}
+\end{Schunk}
+
+%\subsection{Importing proto files}
+%In contrast to the other languages (Java, C++, Python) that are officially
+%supported by Google, the implementation used by the \texttt{RProtoBuf}
+%package does not rely on the \texttt{protoc} compiler (with the exception of
+%the two functions discussed in the previous section). This means that no
+%initial step of statically compiling the proto file into C++ code that is
+%then accessed by R code is necessary. Instead, \texttt{proto} files are
+%parsed and processed \textsl{at runtime} by the protobuf C++ library---which
+%is much more appropriate for a dynamic language.
+
+\subsection{Creating a message}
+
+New messages are created with the \texttt{new} function which accepts
+a Message Descriptor and optionally a list of ``name = value'' pairs
+to set in the message.
+%The objects contained in the special environment are
+%descriptors for their associated message types. Descriptors will be
+%discussed in detail in another part of this document, but for the
+%purpose of this section, descriptors are just used with the \texttt{new}
+%function to create messages.
+
+\begin{Schunk}
+\begin{Sinput}
+R> p1 <- new(tutorial.Person)
+R> p <- new(tutorial.Person, name = "Murray", id = 1)
+\end{Sinput}
+\end{Schunk}
+
+\subsection{Access and modify fields of a message}
+
+Once the message is created, its fields can be queried
+and modified using the dollar operator of R, making protocol
+buffer messages seem like lists.
+
+\begin{Schunk}
+\begin{Sinput}
+R> p$name
+\end{Sinput}
+\begin{Soutput}
+[1] "Murray"
+\end{Soutput}
+\begin{Sinput}
+R> p$id
+\end{Sinput}
+\begin{Soutput}
+[1] 1
+\end{Soutput}
+\begin{Sinput}
+R> p$email <- "murray@stokely.org"
+\end{Sinput}
+\end{Schunk}
+
+Unlike with R lists, however, no partial matching is performed,
+and the field name must be given in full.
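+
+The contrast with a plain R list can be sketched as follows.  The list
+\code{l} is an illustrative object that is not part of the package.
+The abbreviated message access is wrapped in \code{try} on the
+assumption that it signals an error, whose exact text is not
+reproduced here.
+
+\begin{Schunk}
+\begin{Sinput}
+R> l <- list(name = "Murray", id = 1)
+R> l$na    # a list completes the unique prefix "na" to "name"
+\end{Sinput}
+\begin{Soutput}
+[1] "Murray"
+\end{Soutput}
+\begin{Sinput}
+R> p <- new(tutorial.Person, name = "Murray", id = 1)
+R> res <- try(p$na, silent = TRUE)    # no completion for messages
+R> inherits(res, "try-error")
+\end{Sinput}
+\end{Schunk}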
+
+The \verb|[[| operator can also be used to query and set fields
+of a message, supplying either their name or their tag number:
+
+\begin{Schunk}
+\begin{Sinput}
+R> p[["name"]] <- "Murray Stokely"
+R> p[[ 2 ]] <- 3
+R> p[[ "email" ]]
+\end{Sinput}
+\begin{Soutput}
+[1] "murray@stokely.org"
+\end{Soutput}
+\end{Schunk}
+
+Protocol buffers include a 64-bit integer type, but R lacks native
+64-bit integer support.  A workaround for working with large integer
+values is described in Section~\ref{sec:int64}.
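+
+The limitation can be seen in base R alone, independently of
+\CRANpkg{RProtoBuf}: the native integer type is 32 bits wide, and a
+double represents integers exactly only up to $2^{53}$, so neither type
+can hold every 64-bit value without loss.
+
+\begin{Schunk}
+\begin{Sinput}
+R> .Machine$integer.max    # largest native (32-bit) integer
+\end{Sinput}
+\begin{Soutput}
+[1] 2147483647
+\end{Soutput}
+\begin{Sinput}
+R> 2^53 == 2^53 + 1        # doubles cannot tell these two apart
+\end{Sinput}
+\begin{Soutput}
+[1] TRUE
+\end{Soutput}
+\end{Schunk}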
+
+% TODO(mstokely): Document extensions here.
+% There are none in addressbook.proto though.
+
+\subsection{Display messages}
+
+Protocol buffer messages and descriptors implement \texttt{show}
+methods that provide basic information about the message:
+
+\begin{Schunk}
+\begin{Sinput}
+R> p
+\end{Sinput}
+\begin{Soutput}
+[1] "message of type 'tutorial.Person' with 3 fields set"
+\end{Soutput}
+\end{Schunk}
+
+For additional information, such as for debugging purposes,
+the \texttt{as.character} method provides a more complete ASCII
+representation of the contents of a message.
+
+\begin{Schunk}
+\begin{Sinput}
+R> writeLines(as.character(p))
+\end{Sinput}
+\begin{Soutput}
+name: "Murray Stokely"
+id: 3
+email: "murray@stokely.org"
+\end{Soutput}
+\end{Schunk}
+
+\subsection{Serializing messages}
+
+The main focus of protocol buffer messages is
+efficiency, so messages are transported as a compact sequence
+of bytes. The \texttt{serialize} method is implemented for
+protocol buffer messages and converts a message into the sequence of
+bytes that represents it.
+%(raw vector in R speech) that represents the message.
+
+\begin{Schunk}
+\begin{Sinput}
+R> serialize(p, NULL)
+\end{Sinput}
+\begin{Soutput}
+ [1] 0a 0e 4d 75 72 72 61 79 20 53 74 6f 6b 65 6c 79 10 03 1a 12
+[21] 6d 75 72 72 61 79 40 73 74 6f 6b 65 6c 79 2e 6f 72 67
+\end{Soutput}
+\end{Schunk}
+
+The same method can also be used to serialize messages to files:
+
+\begin{Schunk}
+\begin{Sinput}
+R> tf1 <- tempfile()
+R> serialize(p, tf1)
+R> readBin(tf1, raw(0), 500)
+\end{Sinput}
+\begin{Soutput}
+ [1] 0a 0e 4d 75 72 72 61 79 20 53 74 6f 6b 65 6c 79 10 03 1a 12
+[21] 6d 75 72 72 61 79 40 73 74 6f 6b 65 6c 79 2e 6f 72 67
+\end{Soutput}
+\end{Schunk}
+
+Or to arbitrary binary connections:
+
+\begin{Schunk}
+\begin{Sinput}
+R> tf2 <- tempfile()
+R> con <- file(tf2, open = "wb")
+R> serialize(p, con)
+R> close(con)
+R> readBin(tf2, raw(0), 500)
+\end{Sinput}
+\begin{Soutput}
+ [1] 0a 0e 4d 75 72 72 61 79 20 53 74 6f 6b 65 6c 79 10 03 1a 12
+[21] 6d 75 72 72 61 79 40 73 74 6f 6b 65 6c 79 2e 6f 72 67
+\end{Soutput}
+\end{Schunk}
+
+\texttt{serialize} can also be used in a more traditional
+object-oriented fashion using the dollar operator:
+
+\begin{Schunk}
+\begin{Sinput}
+R> # serialize to a file
+R> p$serialize(tf1)
+R> # serialize to a binary connection
+R> con <- file(tf2, open = "wb")
+R> p$serialize(con)
+R> close(con)
+\end{Sinput}
+\end{Schunk}
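+
+The serialized bytes can also be read by hand.  The following sketch
+uses only base R to decode the ten-byte payload shown in
+Table~\ref{tab:proto}.  It assumes the standard Protocol Buffers wire
+format, in which each field starts with a tag byte that holds the field
+number in its upper bits and the wire type in its lowest three bits, and
+it handles only the single-byte tags, lengths, and values that occur in
+this small example.  The variable names and the JSON string used for
+the size comparison are illustrative.
+
+\begin{Schunk}
+\begin{Sinput}
+R> x <- as.raw(c(0x0a, 0x06, 0x4d, 0x75, 0x72, 0x72, 0x61, 0x79, 0x10, 0x01))
+R> tag <- as.integer(x[1])
+R> c(bitwShiftR(tag, 3L), bitwAnd(tag, 7L))  # field number, wire type
+\end{Sinput}
+\begin{Soutput}
+[1] 1 2
+\end{Soutput}
+\begin{Sinput}
+R> len <- as.integer(x[2])     # wire type 2: a length prefix follows
+R> rawToChar(x[3:(2 + len)])
+\end{Sinput}
+\begin{Soutput}
+[1] "Murray"
+\end{Soutput}
+\begin{Sinput}
+R> tag <- as.integer(x[9])
+R> c(bitwShiftR(tag, 3L), bitwAnd(tag, 7L))  # field number, wire type
+\end{Sinput}
+\begin{Soutput}
+[1] 2 0
+\end{Soutput}
+\begin{Sinput}
+R> as.integer(x[10])           # wire type 0: the varint value itself
+\end{Sinput}
+\begin{Soutput}
+[1] 1
+\end{Soutput}
+\begin{Sinput}
+R> length(x)
+\end{Sinput}
+\begin{Soutput}
+[1] 10
+\end{Soutput}
+\begin{Sinput}
+R> nchar('{"name":"Murray","id":1}')   # an illustrative JSON encoding
+\end{Sinput}
+\begin{Soutput}
+[1] 24
+\end{Soutput}
+\end{Schunk}
+
+Field names never appear in the payload; only the tag numbers assigned
+in the \texttt{.proto} file do.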
+
+
+\subsection{Parsing messages}
+
+The \texttt{RProtoBuf} package defines the \texttt{read} and
+\texttt{readASCII} functions to read messages from files, raw vectors,
+or arbitrary connections.  \texttt{read} expects to read the message
+payload from binary files or connections and \texttt{readASCII} parses
+the human-readable ASCII output that is created with
+\code{as.character}.
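+
+A text round trip can be sketched as follows.  The example assumes that
+\code{readASCII} accepts the character representation itself as its
+input, an assumption about the method signature that is not spelled out
+in this section.  The names \code{txt} and \code{p2} are illustrative.
+
+\begin{Schunk}
+\begin{Sinput}
+R> p <- new(tutorial.Person, name = "Murray", id = 1)
+R> txt <- as.character(p)                  # human-readable ASCII form
+R> p2 <- readASCII(tutorial.Person, txt)   # parse it back into a message
+R> identical(as.character(p2), txt)
+\end{Sinput}
+\end{Schunk}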
+
+The binary representation of the message (often called the payload)
+does not contain information that can be used to dynamically
+infer the message type, so we have to provide this information
+to the \texttt{read} function in the form of a descriptor:
+
+\begin{Schunk}
+\begin{Sinput}
+R> msg <- read(tutorial.Person, tf1)
+R> writeLines(as.character(msg))
+\end{Sinput}
+\begin{Soutput}
+name: "Murray Stokely"
+id: 3
+email: "murray@stokely.org"
+\end{Soutput}
+\end{Schunk}
+
+The \texttt{input} argument of \texttt{read} can also be a binary
+readable R connection, such as a binary file connection:
+
+\begin{Schunk}
+\begin{Sinput}
+R> con <- file(tf2, open = "rb")
+R> message <- read(tutorial.Person, con)
+R> close(con)
+R> writeLines(as.character(message))
+\end{Sinput}
+\begin{Soutput}
+name: "Murray Stokely"
+id: 3
+email: "murray@stokely.org"
+\end{Soutput}
+\end{Schunk}
+
+Finally, the raw vector payload of the message can be passed to \texttt{read} directly:
+
+\begin{Schunk}
+\begin{Sinput}
+R> # reading the raw vector payload of the message
+R> payload <- readBin(tf1, raw(0), 5000)
+R> message <- read(tutorial.Person, payload)
+\end{Sinput}
+\end{Schunk}
+
+
+\texttt{read} can also be used as a pseudo method of the descriptor
+object:
+
+\begin{Schunk}
+\begin{Sinput}
+R> # reading from a file
+R> message <- tutorial.Person$read(tf1)
+R> # reading from a binary connection
+R> con <- file(tf2, open = "rb")
+R> message <- tutorial.Person$read(con)
+R> close(con)
+R> # read from the payload
+R> message <- tutorial.Person$read(payload)
+\end{Sinput}
+\end{Schunk}
+
+
+\section{Under the hood: S4 Classes, Methods, and Pseudo Methods}
+
+The \CRANpkg{RProtoBuf} package uses the S4 system to store
+information about descriptors and messages.  Using the S4 system
+allows the \texttt{RProtoBuf} package to dispatch methods that are not
+generic in the S3 sense, such as \texttt{new} and
+\texttt{serialize}.
+
+Each R object stores an external pointer to an object managed by
+the \texttt{protobuf} C++ library.
+The \CRANpkg{Rcpp} package \citep{eddelbuettel2011rcpp,eddelbuettel2013seamless} is used to
+facilitate the integration of the R and C++ code for these objects.
+
+% Message, Descriptor, FieldDescriptor, EnumDescriptor,
+% FileDescriptor, EnumValueDescriptor
+%
+% grep RPB_FUNC * | grep -v define|wc -l
+% 84
+% grep RPB_ * | grep -v RPB_FUNCTION | grep METHOD|wc -l
+% 33
+
+There are over 100 C++ functions that provide the glue code between
+R and the member functions of the six primary Message and Descriptor classes
+in the protobuf library.  Wrapping each method individually allows us
+to add user-friendly custom error handling, type coercion, and
+performance improvements at the cost of a more verbose
+implementation.  The RProtoBuf implementation in many ways motivated
+the development of Rcpp Modules \citep{eddelbuettel2013exposing},
+which provide a more concise way of wrapping C++ functions and classes
+in a single entity.
+
+The \texttt{RProtoBuf} package combines the \emph{R typical} dispatch
+of the form \verb|method(object, arguments)| and the more traditional
+object oriented notation \verb|object$method(arguments)|.
+Additionally, \texttt{RProtoBuf} implements the \texttt{.DollarNames} S3 generic function
+(defined in the \texttt{utils} package) for all classes to enable tab
+completion.  Completion possibilities include pseudo method names for all
+classes, plus dynamic dispatch on names or types specific to a given object.
+
+% TODO(ms): Add column check box for doing dynamic dispatch based on type.
+\begin{table}[h]
+\centering
+\begin{tabular}{|l|c|c|l|}
+\hline
+\textbf{Class} & \textbf{Slots} & \textbf{Methods} & \textbf{Dynamic Dispatch}\\
+\hline
+\hline
+Message & 2 & 20 & yes (field names)\\
+\hline
+Descriptor & 2 & 16 & yes (field names, enum types, nested types)\\
+\hline
+FieldDescriptor & 4 & 18 & no\\
+\hline
+EnumDescriptor & 4 & 11 & yes (enum constant names)\\
+\hline
+FileDescriptor & 3 & 6 & yes (message/field definitions)\\
+\hline
+EnumValueDescriptor & 3 & 6 & no\\
+\hline
+\end{tabular}
+\end{table}
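+
+Both calling conventions reach the same underlying method.  As a brief
+sketch using the \texttt{has} method listed in
+Table~\ref{Message-methods-table}, the two forms below are expected to
+give the same answer:
+
+\begin{Schunk}
+\begin{Sinput}
+R> p <- new(tutorial.Person, name = "Murray", id = 1)
+R> has(p, "email")     # R-style dispatch: method(object, arguments)
+R> p$has("email")      # object-oriented notation: object$method(arguments)
+R> identical(has(p, "email"), p$has("email"))
+\end{Sinput}
+\end{Schunk}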
+
+\subsection{Messages}
+
+The \texttt{Message} S4 class represents Protocol Buffer Messages and
+is the core abstraction of \CRANpkg{RProtoBuf}. Each \texttt{Message}
+contains a pointer to a \texttt{Descriptor} which defines the schema
+of the data defined in the Message, as well as a number of
+\texttt{FieldDescriptors} for the individual fields of the message.  A
+complete list of the slots and methods for \texttt{Messages}
+is available in Table~\ref{Message-methods-table}.
+
+\begin{table}[h]
+\centering
+\begin{small}
+\begin{tabular}{l|p{10cm}}
+\hline
+\textbf{Slot} & \textbf{Description} \\
+\hline
+\texttt{pointer} & External pointer to the \texttt{Message} object of the C++ proto library. Documentation for the
+\texttt{Message} class is available from the protocol buffer project page:
+\url{http://code.google.com/apis/protocolbuffers/docs/reference/cpp/google.protobuf.message.html#Message} \\
+\hline
+\texttt{type} & Fully qualified name of the message. For example a \texttt{Person} message
+has its \texttt{type} slot set to \texttt{tutorial.Person} \\[.3cm]
+\hline
+\textbf{Method} & \textbf{Description} \\
+\hline
+\texttt{has} & Indicates if a message has a given field.   \\
+\texttt{clone} & Creates a clone of the message \\
+\texttt{isInitialized} & Indicates if a message has all its required fields set\\
+\texttt{serialize} & Serializes a message to a file, binary connection, or raw vector\\
+\texttt{clear} & Clears one or several fields of a message, or the entire message\\
+\texttt{size} & The number of elements in a message field\\
+\texttt{bytesize} & The number of bytes the message would take once serialized\\
+\hline
+\texttt{swap} & Swaps elements of a repeated field of a message\\
+\texttt{set} & Sets elements of a repeated field\\
+\texttt{fetch} & Fetches elements of a repeated field\\
+\texttt{setExtension} & Sets an extension of a message\\
+\texttt{getExtension} & Gets the value of an extension of a message\\
+\texttt{add} & Adds elements to a repeated field \\
+\hline
+\texttt{str} & Displays the R structure of the message\\
+\texttt{as.character} & Character representation of a message\\
+\texttt{toString} & Character representation of a message (same as \texttt{as.character}) \\
+\texttt{as.list} & Converts the message to a named R list\\
+\texttt{update} & Updates several fields of a message at once\\
+\texttt{descriptor} & Gets the descriptor of the message type of this message\\
+\texttt{fileDescriptor} & Gets the file descriptor of this message's descriptor\\
+\hline
+\end{tabular}
+\end{small}
+\caption{\label{Message-methods-table}Description of slots and methods for the \texttt{Message} S4 class}
+\end{table}
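+
+A few of these methods in use, as a sketch.  The message matches the
+one serialized in Table~\ref{tab:proto}, whose payload is ten bytes
+long, so \code{bytesize} would be expected to agree with the length of
+the serialized raw vector.  The name \code{p2} is illustrative.
+
+\begin{Schunk}
+\begin{Sinput}
+R> p <- new(tutorial.Person, name = "Murray", id = 1)
+R> isInitialized(p)                     # are all required fields set?
+R> isInitialized(new(tutorial.Person))  # an empty message, for comparison
+R> bytesize(p) == length(serialize(p, NULL))
+R> p2 <- clone(p)                       # copy the message
+R> p2$name <- "Dirk"                    # modify the copy only
+R> p$name                               # inspect the original
+\end{Sinput}
+\end{Schunk}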
+
+\subsection{Descriptors}
+
+Descriptors describe the type of a Message.  This includes what fields
+a message contains and what the types of those fields are.  Message
+descriptors are represented in R with the \emph{Descriptor} S4
+class. The class contains the slots \texttt{pointer} and
+\texttt{type}.  Similarly to messages, the \verb|$| operator can be
+used to retrieve descriptors that are contained in the descriptor, or
+invoke pseudo-methods.
+
+When \CRANpkg{RProtoBuf} is first loaded it calls
+\texttt{readProtoFiles} to read in an example \texttt{.proto} file
+included with the package.  The \texttt{tutorial.Person} descriptor
+and any other descriptors defined in loaded \texttt{.proto} files are
+then available on the search path.
+
+\begin{Schunk}
+\begin{Sinput}
+R> # field descriptor
+R> tutorial.Person$email
+\end{Sinput}
+\begin{Soutput}
+[1] "descriptor for field 'email' of type 'tutorial.Person' "
+\end{Soutput}
+\begin{Sinput}
+R> # enum descriptor
+R> tutorial.Person$PhoneType
+\end{Sinput}
+\begin{Soutput}
+[1] "descriptor for enum 'PhoneType' of type 'tutorial.Person' with 3 values"
+\end{Soutput}
+\begin{Sinput}
+R> # nested type descriptor
+R> tutorial.Person$PhoneNumber
+\end{Sinput}
+\begin{Soutput}
+[1] "descriptor for type 'tutorial.Person.PhoneNumber' "
+\end{Soutput}
+\begin{Sinput}
+R> # same as
+R> tutorial.Person.PhoneNumber
+\end{Sinput}
+\begin{Soutput}
+[1] "descriptor for type 'tutorial.Person.PhoneNumber' "
+\end{Soutput}
+\end{Schunk}
+
+Table~\ref{Descriptor-methods-table} provides a complete list of the
+slots and available methods for Descriptors.
+
+\begin{table}[h]
+\centering
+\begin{small}
+\begin{tabular}{l|p{10cm}}
+\hline
+\textbf{Slot} & \textbf{Description} \\
+\hline
+\texttt{pointer} & External pointer to the \texttt{Descriptor} object of the C++ proto library. Documentation for the
+\texttt{Descriptor} class is available from the protocol buffer project page:
+\url{http://code.google.com/apis/protocolbuffers/docs/reference/cpp/google.protobuf.descriptor.html#Descriptor} \\
+\hline
+\texttt{type} & Fully qualified path of the message type. \\[.3cm]
+\hline
+\textbf{Method} & \textbf{Description} \\
+\hline
+\texttt{new} & Creates a prototype of a message described by this descriptor.\\
+\texttt{read} & Reads a message from a file or binary connection.\\
+\texttt{readASCII} & Reads a message in ASCII format from a file or
+text connection.\\
+\hline
+\texttt{name} & Retrieves the name of the message type associated with
+this descriptor.\\
+\texttt{as.character} & Character representation of a descriptor\\
+\texttt{toString} & Character representation of a descriptor (same as \texttt{as.character}) \\
+\texttt{as.list} & Returns a named
+list of the field, enum, and nested descriptors included in this descriptor.\\
[TRUNCATED]

To get the complete diff run:
    svnlook diff /svnroot/rprotobuf -r 729

