[Rprotobuf-commits] r942 - papers/jss

noreply at r-forge.r-project.org
Mon Apr 13 19:51:23 CEST 2015


Author: edd
Date: 2015-04-13 19:51:23 +0200 (Mon, 13 Apr 2015)
New Revision: 942

Added:
   papers/jss/jss1313.Rnw
   papers/jss/jss1313.bib
Removed:
   papers/jss/article.Rnw
   papers/jss/article.bib
Modified:
   papers/jss/Makefile
Log:
renaming article.Rnw to jss1313.Rnw as requested, along with support files, commit 1

Modified: papers/jss/Makefile
===================================================================
--- papers/jss/Makefile	2015-04-13 00:28:02 UTC (rev 941)
+++ papers/jss/Makefile	2015-04-13 17:51:23 UTC (rev 942)
@@ -1,16 +1,17 @@
-all: clean article.pdf
+article=jss1313
+all: clean ${article}.pdf
 
 clean:
-	rm -fr article.out article.aux article.log article.bbl \
-	  article.blg article.brf figures/fig-0??.pdf
+	rm -fr ${article}.out ${article}.aux ${article}.log ${article}.bbl \
+	  ${article}.blg ${article}.brf figures/fig-0??.pdf
 
-article.pdf: article.Rnw
-	R CMD Sweave article.Rnw
-	pdflatex article.tex
-	bibtex article
-	pdflatex article.tex
-	pdflatex article.tex
-	R CMD Stangle article.Rnw
+${article}.pdf: ${article}.Rnw
+	R CMD Sweave ${article}.Rnw
+	pdflatex ${article}.tex
+	bibtex ${article}
+	pdflatex ${article}.tex
+	pdflatex ${article}.tex
+	R CMD Stangle ${article}.Rnw
 
 jssarchive:
 	(cd .. && zip -r jssarchive.zip jss/)
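
A minimal sketch of the build sequence the parameterized target runs, written as a shell dry run so it can be executed without R or LaTeX installed. The function name build_steps is hypothetical and not part of this commit; only the six commands and the jss1313 basename are taken from the Makefile above.

```shell
#!/bin/sh
# Hypothetical helper (not in the repository): print the commands that the
# ${article}.pdf target would run for a given basename, without running them.
build_steps() {
    article=${1:-jss1313}   # same default as the Makefile's "article" variable
    echo "R CMD Sweave ${article}.Rnw"     # weave R chunks into LaTeX
    echo "pdflatex ${article}.tex"         # first pass: collect citations
    echo "bibtex ${article}"               # resolve the bibliography
    echo "pdflatex ${article}.tex"         # second pass: insert references
    echo "pdflatex ${article}.tex"         # third pass: settle cross-references
    echo "R CMD Stangle ${article}.Rnw"    # extract the R code
}

build_steps jss1313
```

With the basename held in one variable, a later rename should only need the first line of the Makefile changed.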

Deleted: papers/jss/article.Rnw
===================================================================
--- papers/jss/article.Rnw	2015-04-13 00:28:02 UTC (rev 941)
+++ papers/jss/article.Rnw	2015-04-13 17:51:23 UTC (rev 942)
@@ -1,1518 +0,0 @@
-\documentclass[article]{jss}
-\usepackage{booktabs}
-\usepackage{listings}
-\usepackage[toc,page]{appendix}
-
-% Line numbers for drafts.
-%\usepackage[switch, modulo]{lineno}
-%\linenumbers
-
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-% Spelling Standardization:
-% Protocol Buffers, not protocol buffers
-% large-scale, not large scale
-% Oxford comma: foo, bar, and baz.
-
-
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-%% declarations for jss.cls %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-
-%
-% Local helpers to make this more compatible with R Journal style.
-%
-\RequirePackage{fancyvrb}
-\RequirePackage{alltt}
-\DefineVerbatimEnvironment{example}{Verbatim}{}
-% Articles with many authors we should shorten to FirstAuthor, et al.
-\shortcites{sciencecloud,janus,dremel,nlme}
-\author{Dirk Eddelbuettel\\Debian Project \And 
-        Murray Stokely\\Google, Inc \And
-        Jeroen Ooms\\UCLA}
-\title{\pkg{RProtoBuf}: Efficient Cross-Language Data Serialization in \proglang{R}}
-
-%% for pretty printing and a nice hypersummary also set:
-\Plainauthor{Dirk Eddelbuettel, Murray Stokely, Jeroen Ooms} %% comma-separated
-\Plaintitle{RProtoBuf: Efficient Cross-Language Data Serialization in R}
-\Shorttitle{\pkg{RProtoBuf}: Protocol Buffers in \proglang{R}} %% a short title (if necessary)
-
-%% an abstract and keywords
-\Abstract{
-  Modern data collection and analysis pipelines often involve
-  a sophisticated mix of applications written in general purpose and
-  specialized programming languages.  
-  Many formats commonly used to import and export data between
-  different programs or systems, such as \code{CSV} or \code{JSON}, are
-  verbose, inefficient, not type-safe, or tied to a specific programming language.
-  Protocol Buffers are a popular
-  method of serializing structured data between applications---while remaining
-  independent of programming languages or operating systems.
-  They offer a unique combination of features, performance, and maturity that seems
-  particularly well suited for data-driven applications and numerical
-  computing.
-  The \pkg{RProtoBuf} package provides a complete interface to Protocol
-  Buffers from the
-  \proglang{R} environment for statistical computing.
-  This paper outlines the general class of data serialization
-  requirements for statistical computing, describes the implementation
-  of the \pkg{RProtoBuf} package, and illustrates its use with
-  example applications in large-scale data collection pipelines and web
-  services.
-  %% TODO(ms) keep it less than 150 words. -- I think this may be 154,
-  %% depending how emacs is counting.
-}
-\Keywords{\proglang{R}, \pkg{Rcpp}, Protocol Buffers, serialization, cross-platform}
-\Plainkeywords{R, Rcpp, Protocol Buffers, serialization, cross-platform} %% without formatting
-%% at least one keyword must be supplied
-
-%% publication information
-%% NOTE: Typically, this can be left commented and will be filled out by the technical editor
-%% \Volume{50}
-%% \Issue{9}
-%% \Month{June}
-%% \Year{2012}
-%% \Submitdate{2012-06-04}
-%% \Acceptdate{2012-06-04}
-
-%% The address of (at least) one author should be given
-%% in the following format:
-\Address{
-  Dirk Eddelbuettel \\
-  Debian Project \\
-  River Forest, IL, USA\\
-  E-mail: \email{edd at debian.org}\\
-  URL: \url{http://dirk.eddelbuettel.com}\\
-  \\
-  Murray Stokely\\
-  Google, Inc.\\
-  1600 Amphitheatre Parkway\\
-  Mountain View, CA, USA\\
-  E-mail: \email{mstokely at google.com}\\
-  URL: \url{http://www.stokely.org/}\\
-  \\
-  Jeroen Ooms\\
-  UCLA Department of Statistics\\
-  University of California\\
-  Los Angeles, CA, USA\\
-  E-mail: \email{jeroen.ooms at stat.ucla.edu}\\
-  URL: \url{https://jeroenooms.github.io}
-}
-%% It is also possible to add a telephone and fax number
-%% before the e-mail in the following format:
-%% Telephone: +43/512/507-7103
-%% Fax: +43/512/507-2851
-
-%% for those who use Sweave please include the following line (with % symbols):
-%% need no \usepackage{Sweave.sty}
-
-%% end of declarations %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-
-
-\begin{document}
-\SweaveOpts{concordance=FALSE,prefix.string=figures/fig}
-
-
-%% include your article here, just as usual
-%% Note that you should use the \pkg{}, \proglang{} and \code{} commands.
-
-
-% We don't want a left margin for Sinput or Soutput for our table 1.
-%\DefineVerbatimEnvironment{Sinput}{Verbatim} {xleftmargin=0em}
-%\DefineVerbatimEnvironment{Soutput}{Verbatim}{xleftmargin=0em}
-%\DefineVerbatimEnvironment{Scode}{Verbatim}{xleftmargin=2em}
-% Setting the topsep to 0 reduces spacing from input to output and
-% improves table 1.
-\fvset{listparameters={\setlength{\topsep}{0pt}}}
-\renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}}
-
-%% DE: I tend to have wider option(width=...) so this
-%%     guarantees better line breaks
-<<echo=FALSE,print=FALSE>>=
-## cf http://www.jstatsoft.org/style#q12
-options(prompt = "R> ", 
-        continue = "+  ", 
-        width = 70, 
-        useFancyQuotes = FALSE, 
-        digits = 4)
-@
-
-\maketitle
-
-\section{Introduction} 
-
-Modern data collection and analysis pipelines increasingly involve collections
-of decoupled components in order to better manage software complexity 
-through reusability, modularity, and fault isolation \citep{Wegiel:2010:CTT:1932682.1869479}.
-These pipelines are frequently built using different programming 
-languages for the different phases of data analysis --- collection,
-cleaning, modeling, analysis, post-processing, and
-presentation --- in order to take advantage of the unique combination of
-performance, speed of development, and library support offered by
-different environments and languages.  Each stage of such a data
-analysis pipeline may produce intermediate results that need to be
-stored in a file, or sent over the network for further processing. 
-
-Given these requirements, how do we safely and efficiently share intermediate results
-between different applications, possibly written in different
-languages, and possibly running on different computer systems?
-In computer programming, \emph{serialization} is the process of
-translating data structures, variables, and session state into a
-format that can be stored or transmitted and then reconstructed in the
-original form later \citep{clinec++}.
-Programming
-languages such as \proglang{R}, \proglang{Julia}, \proglang{Java}, and \proglang{Python} include built-in
-support for serialization, but the default formats 
-are usually language-specific and thereby lock the user into a single
-environment.  
-
-Data analysts and researchers often use character-separated text formats such
-as \code{CSV} \citep{shafranovich2005common} to export and import
-data. However, anyone who has ever used \code{CSV} files will have noticed
-that this method has many limitations: it is restricted to tabular data,
-lacks type-safety, and has limited precision for numeric values.  Moreover,
-ambiguities in the format itself frequently cause problems.  For example,
-conventions on which character is used as separator or decimal point vary by
-country.  \emph{Extensible Markup Language} (\code{XML}) is a
-well-established and widely-supported format with the ability to define just
-about any arbitrarily complex schema \citep{nolan2013xml}. However, it pays
-for this complexity with comparatively large and verbose messages, and added
-complexity at the parsing side (these problems are somewhat mitigated by the
-availability of mature libraries and parsers). Because \code{XML} is
-text-based and has no native notion of numeric types or arrays, it is usually not a
-very practical format to store numeric data sets as they appear in statistical
-applications.
-
-
-A more modern format is \emph{JavaScript Object Notation}
-(\code{JSON}), which is derived from the object literals of
-\proglang{JavaScript} and is already widely used on the World Wide Web.
-Several \proglang{R} packages implement functions to parse and generate
-\code{JSON} data from \proglang{R} objects \citep{rjson,RJSONIO,jsonlite}.
-\code{JSON} natively supports arrays and four primitive types: numbers, strings,
-booleans, and null. However, as it too is a text-based format, numbers are
-stored as human-readable decimal notation which is inefficient and
-leads to loss of type (double versus integer) and precision. 
-A number of binary formats based on \code{JSON} have been proposed
-that reduce the parsing cost and improve efficiency, but these formats
-are not widely supported.  Furthermore, such formats lack a separate
-schema for the serialized data and thus still duplicate field names
-with each message sent over the network or stored in a file.
-
-Once the data serialization needs of an application become complex
-enough, developers typically benefit from the use of an
-\emph{interface description language}, or \emph{IDL}.  IDLs like
-Protocol Buffers \citep{protobuf}, Apache Thrift \citep{Apache:Thrift}, and Apache Avro \citep{Apache:Avro}
-provide a compact well-documented schema for cross-language data
-structures and efficient binary interchange formats.  Since the schema
-is provided separately from the data, the data can be
-efficiently encoded to minimize storage costs when
-compared with simple ``schema-less'' binary interchange formats.
-%Many sources compare data serialization formats
-%and show Protocol Buffers perform favorably to the alternatives; see
-%\citet{Sumaray:2012:CDS:2184751.2184810} for one such comparison.
-Protocol Buffers perform well in the comparison of such formats by
-\citet{Sumaray:2012:CDS:2184751.2184810}.
-
-This paper describes an \proglang{R} interface to Protocol Buffers,
-and is organized as follows. Section~\ref{sec:protobuf}
-provides a general high-level overview of Protocol Buffers as well as a basic
-motivation for their use.
-Section~\ref{sec:rprotobuf-basic} describes the interactive \proglang{R} interface
-provided by the \pkg{RProtoBuf} package, and introduces the two main abstractions:
-\emph{Messages} and \emph{Descriptors}.  Section~\ref{sec:rprotobuf-classes}
-details the implementation of the main S4 classes and methods.
-Section~\ref{sec:types} describes the challenges of type coercion
-between \proglang{R} and other languages.  Section~\ref{sec:evaluation} introduces a
-general \proglang{R} language schema for serializing arbitrary \proglang{R} objects and compares it to
-the serialization capabilities built directly into \proglang{R}.  Sections~\ref{sec:mapreduce}
-and \ref{sec:opencpu} provide real-world use cases of \pkg{RProtoBuf}
-in MapReduce and web service environments, respectively, before
-Section~\ref{sec:summary} concludes.
-
-\section{Protocol Buffers}
-\label{sec:protobuf}
-
-Protocol Buffers are a modern, language-neutral, platform-neutral,
-extensible mechanism for sharing and storing structured data. Some of their
-features, particularly in the context of data analysis, are:
-
-\begin{itemize}
-\item \emph{Portable}:  Enable users to send and receive data between
-  applications as well as different computers or operating systems.
-\item \emph{Efficient}:  Data is serialized into a compact binary
-  representation for transmission or storage.
-\item \emph{Extensible}:  New fields can be added to Protocol Buffer schemas
-  in a forward-compatible way that does not break older applications.
-\item \emph{Stable}:  Protocol Buffers have been in wide use for over a
-  decade.
-\end{itemize}
-
-%\begin{figure}[bp]
-\begin{figure}[h!]
-\begin{center}
-\includegraphics[width=0.9\textwidth]{figures/protobuf-distributed-system-crop.pdf}
-\end{center}
-\caption{Example usage of Protocol Buffers.}
-\label{fig:protobuf-distributed-usecase}
-\end{figure}
-
-Figure~\ref{fig:protobuf-distributed-usecase} illustrates an example
-communication workflow with Protocol Buffers and an interactive \proglang{R} session.
-Common use cases include populating a remote procedure call (RPC) request
-Protocol Buffer in \proglang{R} that is then serialized and sent over the network to a
-remote server.  The server deserializes the message, acts on the
-request, and responds with a new Protocol Buffer over the network.
-The key difference to, say, a request to an \pkg{Rserve}
-\citep{Urbanek:2003:Rserve,CRAN:Rserve} instance is that
-the remote server may be implemented in any language.
-%, with no dependence on \proglang{R}.
-
-While traditional IDLs have at times been criticized for code bloat and
-complexity, Protocol Buffers are based on a simple list and records
-model that is flexible and easy to use.  The schema for structured
-Protocol Buffer data is defined in \code{.proto} files, which may
-contain one or more message types.  Each message type has one or more
-fields.  A field is specified with a unique number (called a \emph{tag number}), a name, a value
-type, and a field rule specifying whether the field is optional,
-required, or repeated.  The supported value types are numbers,
-enumerations, booleans, strings, raw bytes, or other nested message
-types.  The \code{.proto} file syntax for defining the structure of Protocol
-Buffer data is described comprehensively on Google Code\footnote{See 
-\url{http://code.google.com/apis/protocolbuffers/docs/proto.html}.}.
-Table~\ref{tab:proto} shows an example \code{.proto} file that
-defines the \code{tutorial.Person} type\footnote{The compound name
-  \code{tutorial.Person} in R is derived from the name of the
-  message (\emph{Person}) and the name of the package defined at the top of the
-  \code{.proto} file in which it is defined (\emph{tutorial}).}.  The \proglang{R} code in the right
-column shows an example of creating a new message of this type and
-populating its fields.
-
-\noindent
-\begin{table}
-\begin{tabular}{p{0.45\textwidth}p{0.5\textwidth}}
-\toprule
-Schema : \code{addressbook.proto} & Example \proglang{R} session\\
-\cmidrule{1-2}
-\begin{minipage}{.40\textwidth}
-\vspace{2mm}
-\begin{example}
-package tutorial;
-message Person {
-  required string name = 1;
-  required int32 id = 2;
-  optional string email = 3;
-  enum PhoneType {
-    MOBILE = 0; 
-    HOME = 1;
-    WORK = 2;
-  }
-  message PhoneNumber {
-    required string number = 1;
-    optional PhoneType type = 2;
-  }
-  repeated PhoneNumber phone = 4;
-}
-\end{example}
-\vspace{2mm}
-\end{minipage} & \begin{minipage}{.55\textwidth}
-<<echo=TRUE>>=
-library("RProtoBuf")
-p <- new(tutorial.Person, id=1,
-         name="Dirk")
-p$name
-p$name <- "Murray"
-cat(as.character(p))
-serialize(p, NULL)
-class(p)
-@
-\end{minipage} \\
-\bottomrule
-\end{tabular}
-\caption{The schema representation from a \code{.proto} file for the
-  \code{tutorial.Person} class (left) and simple \proglang{R} code for creating
-  an object of this class and accessing its fields (right).}
-\label{tab:proto}
-\end{table}
-
-
-For added speed and efficiency, the \proglang{C++}, \proglang{Java},
-and \proglang{Python} bindings to
-Protocol Buffers are used with a compiler that translates a Protocol
-Buffer schema description file (ending in \code{.proto}) into
-language-specific classes that can be used to create, read, write, and
-manipulate Protocol Buffer messages.  The \proglang{R} interface, in contrast,
-uses a reflection-based API that makes some operations slightly
-slower but which is much more convenient for interactive data analysis.
-All messages in \proglang{R} have a single class
-structure, but different accessor methods are created at runtime based
-on the named fields of the specified message type, as described in the
-next section.
-
-\section{Basic usage: Messages and descriptors}
-\label{sec:rprotobuf-basic}
-
-This section describes how to use the \proglang{R} API to create and manipulate
-Protocol Buffer messages in \proglang{R}, and how to read and write the
-binary representation of the message (often called the \emph{payload}) to files and arbitrary binary
-\proglang{R} connections.
-The two fundamental building blocks of Protocol Buffers are \emph{Messages}
-and \emph{Descriptors}.  Messages provide a common abstract encapsulation of
-structured data fields of the type specified in a Message Descriptor.
-Message Descriptors are defined in \code{.proto} files and define a
-schema for a particular named class of messages.
-
-% Note: We comment out subsections in favor of textbf blocks to save
-% space and shrink down this section a little bit.
-%\subsection[Importing message descriptors from .proto files]{Importing message descriptors from \code{.proto} files}
-
-\subsection*{Importing message descriptors from \code{.proto} files}
-
-To create or parse a Protocol Buffer Message, one must first read in
-the message descriptor (\emph{message type}) from a \code{.proto} file.
-A small number of message types are imported when the package is first
-loaded, including the \code{tutorial.Person} type we saw in the last
-section.
-All other types must be imported from
-\code{.proto} files using the \code{readProtoFiles}
-function, which can either import a single file, all files in a directory,
-or every \code{.proto} file provided by a particular \proglang{R} package.
-
-After importing proto files, the corresponding message descriptors are
-available by name from the \code{RProtoBuf:DescriptorPool} environment in 
-the \proglang{R} search path.  This environment is implemented with the 
-user-defined tables framework from the \pkg{RObjectTables} package
-available from the OmegaHat project \citep{RObjectTables}.  Instead of
-being associated with a static hash table, this environment
-dynamically queries the in-memory database of loaded descriptors
-during normal variable lookup.  This allows new descriptors to be
-parsed from \code{.proto} files and added to the global
-namespace.\footnote{Note that there is a significant performance
-  overhead with this RObjectTable implementation.  Because the table
-  is on the search path and is not cacheable, lookups of symbols that
-  are behind it in the search path cannot be added to the global object
-  cache, and R must perform an expensive lookup through all of the
-  attached environments and the Protocol Buffer definitions to find common
-  symbols (most notably those in base) from the global environment.
-  Fortunately, proper use of namespaces and package imports reduces
-  the impact of this for code in packages.}
-
-% Commented out for now because it's too detailed.  Let's shorten
-% section 3 per referee feedback.
-
-%<<echo=FALSE,print=FALSE>>=
-%ls("RProtoBuf:DescriptorPool")
-%@
-
-% \subsection{Creating a message}
-
-% \\
-
-\subsection*{Creating, accessing, and modifying messages}
-
-New messages are created with the \code{new} function which accepts
-a Message Descriptor and optionally a list of ``name = value'' pairs
-to set in the message.
-%The objects contained in the special environment are
-%descriptors for their associated message types. Descriptors will be
-%discussed in detail in another part of this document, but for the
-%purpose of this section, descriptors are just used with the \code{new}
-%function to create messages.
-
-<<>>=
-p <- new(tutorial.Person, name = "Murray", id = 1)
-@
-
-% \subsection*{Access and modify fields of a message}
-
-Once the message is created, its fields can be queried
-and modified using the dollar operator of \proglang{R}, making Protocol
-Buffer messages seem like lists.
-
-<<>>=
-p$name
-p$id
-p$email <- "murray at stokely.org"
-@
-
-Unlike \proglang{R} lists, no partial matching is performed,
-and the field name must be given in full.
-The \verb|[[| operator can also be used to query and set fields
-of a message, supplying either their name or their tag number:
-
-<<>>=
-p[["name"]] <- "Murray Stokely"
-p[[ 2 ]] <- 3
-p[["email"]]
-@
-
-Protocol Buffers include a 64-bit integer type, but \proglang{R} lacks native
-64-bit integer support.  A workaround is available and described in
-Section~\ref{sec:int64} for working with large integer values.
-
-\subsection*{Printing, reading, and writing Messages}
-
-%\\
-
-% \textbf{Printing, Reading, and Writing Messages}
-
-Protocol Buffer messages and descriptors implement \code{show}
-methods that provide basic information about the message:
-
-<<>>=
-p
-@
-
-%For additional information, such as for debugging purposes,
-The \code{as.character} method provides a more complete ASCII
-representation of the contents of a message.
-
-<<>>=
-writeLines(as.character(p))
-@
-
-% \subsection{Serializing messages}
-
-A primary benefit of Protocol Buffers is an efficient
-binary wire-format representation.
-The \code{serialize} method is implemented for
-Protocol Buffer messages to serialize a message into a sequence of
-bytes (raw vector) that represents the message.
-The raw bytes can then be parsed back into the original message safely
-as long as the message type is known and its descriptor is available.
-
-<<>>=
-serialize(p, NULL)
-@
-
-The same method can be used to serialize messages to files or arbitrary binary connections:
-
-<<>>=
-tf1 <- tempfile()
-serialize(p, tf1)
-readBin(tf1, raw(0), 500)
-@
-
-% TODO(mstokely): Comment out, combined with last statement. make this
-% shorter, more succinct summary of the key features of RProtoBuf.
-
-%Or to arbitrary binary connections:
-%
-%<<>>=
-%tf2 <- tempfile()
-%con <- file(tf2, open = "wb")
-%serialize(p, con)
-%close(con)
-%readBin(tf2, raw(0), 500)
-%@
-
-% TODO(mstokely): commented out per referee feedback, but see if this is
-% covered in the package documentation well.
-%
-%\code{serialize} can also be called in a more traditional
-%object-oriented fashion using the dollar operator.
-%
-%<<>>=
-%p$serialize(tf1)
-%con <- file(tf2, open = "wb")
-%p$serialize(con)
-%close(con)
-%@
-%
-%Here, we first serialize to a file \code{tf1} before we serialize to a binary
-%connection to file \code{tf2}.
-
-%\subsection{Parsing messages}
-
-The \pkg{RProtoBuf} package defines the \code{read} and
-\code{readASCII} functions to read messages from files, raw vectors,
-or arbitrary connections.  \code{read} expects to read the message
-payload from binary files or connections and \code{readASCII} parses
-the human-readable ASCII output that is created with
-\code{as.character}.
-
-The binary representation of the message
-does not contain information that can be used to dynamically
-infer the message type, so we have to provide this information
-to the \code{read} function in the form of a descriptor:
-
-<<>>=
-msg <- read(tutorial.Person, tf1)
-writeLines(as.character(msg))
-@
-
-The \code{input} argument of \code{read} can also be a binary
-readable \proglang{R} connection, such as a binary file connection, or a raw vector of serialized bytes.
-
-% <<>>=
-% con <- file(tf2, open = "rb")
-% message <- read(tutorial.Person, con)
-% close(con)
-% writeLines(as.character(message))
-% @
-
-% Finally, the raw vector payload of the message can be used:
-%
-%<<>>=
-%payload <- readBin(tf1, raw(0), 5000)
-%message <- read(tutorial.Person, payload)
-%@
-
-% TODO(mstokely): comment out and use only one style, not both per
-% referee feedback.  Also avoid using the term 'pseudo-method' which
-% is unclear.
-%
-%\code{read} can also be used as a method of the descriptor
-%object:
-%
-%<<>>=
-%message <- tutorial.Person$read(tf1)
-%con <- file(tf2, open = "rb")
-%message <- tutorial.Person$read(con)
-%close(con)
-%message <- tutorial.Person$read(payload)
-%@
-%
-%Here we read first from a file, then from a binary connection and lastly from
-%a message payload.
-
-\section{Under the hood: S4 classes and methods}
-\label{sec:rprotobuf-classes}
-
-The \pkg{RProtoBuf} package uses the S4 system to store
-information about descriptors and messages.
-Each \proglang{R} object
-contains an external pointer to an object managed by the
-\code{protobuf} \proglang{C++} library, and the \proglang{R} objects make calls into more
-than 100 \proglang{C++} functions that provide the
-glue code between the \proglang{R} language classes and the underlying \proglang{C++}
-classes.
-S4 objects are immutable, and so the methods that modify field values of a message return a new copy of the object with \proglang{R}'s usual functional copy-on-modify semantics\footnote{\pkg{RProtoBuf} was designed and implemented before Reference Classes were introduced to offer a new class system with mutable objects.  If \pkg{RProtoBuf} were
-implemented today, Reference Classes would almost certainly be a better
-design choice than S4 classes.}.
-Using the S4 system
-allows the package to dispatch methods that are not
-generic in the S3 sense, such as \code{new} and
-\code{serialize}.
-
-The \pkg{Rcpp} package
-\citep{eddelbuettel2011rcpp,eddelbuettel2013seamless} is used to 
-facilitate this integration of the \proglang{R} and \proglang{C++} code for these objects.
-Each method is wrapped individually which allows us to add 
-user-friendly custom error handling, type coercion, and performance
-improvements at the cost of a more verbose implementation.
-The \pkg{RProtoBuf} package in many ways motivated
-the development of \pkg{Rcpp} Modules \citep{eddelbuettel2013exposing},
-which provide a more concise way of wrapping \proglang{C++} functions and classes
-in a single entity.
-
-Since \pkg{RProtoBuf} users are most often switching between two or
-more different languages as part of a larger data analysis pipeline,
-both generic function and message passing OO style calling conventions
-are supported:
-
-\begin{itemize}
-\item The functional dispatch mechanism of the form
-  \verb|method(object, arguments)| (common to \proglang{R}).
-\item The message passing object-oriented notation of the form
-  \verb|object$method(arguments)|.
-\end{itemize}
-
-Additionally, \pkg{RProtoBuf} supports tab completion for all
-classes.  Completion possibilities include method names for all
-classes, plus \emph{dynamic dispatch} on names or types specific to a given
-object.  This functionality is implemented with the
-\code{.DollarNames} S3 generic function defined in the \pkg{utils}
-package that is included with \proglang{R} \citep{r}.
-
-
-Table~\ref{class-summary-table} lists the six primary Message and
-Descriptor classes in \pkg{RProtoBuf}.
-The package documentation provides a complete description of the slots and methods for
-each class.
-
-% [bp]
-\begin{table}
-\centering
-\begin{tabular}{lccl}
-\toprule
-Class               & Slots & Methods & Dynamic dispatch\\
-\cmidrule{2-4}
-Message             & 2 & 20 & yes (field names)\\
-Descriptor          & 2 & 16 & yes (field names, enum types, nested types)\\
-FieldDescriptor     & 4 & 18 & no\\
-EnumDescriptor      & 4 & 11 & yes (enum constant names)\\
-EnumValueDescriptor & 3 & \phantom{1}6 & no\\
-FileDescriptor      & 3 & \phantom{1}6 & yes (message/field definitions)\\
-\bottomrule
-\end{tabular}
-\caption{\label{class-summary-table}Overview of class, slot, method, and
-  dispatch relationships.}
-\end{table}
-
-\subsection{Messages}
-
-The \code{Message} S4 class represents Protocol Buffer Messages and
-is the core abstraction of \pkg{RProtoBuf}. Each \code{Message}
-contains a pointer to a \code{Descriptor} which defines the schema
-of the data defined in the Message, as well as a number of
-\code{FieldDescriptors} for the individual fields of the message.
-
-<<>>=
-new(tutorial.Person)
-@
-
-\subsection{Descriptors}
-
-Descriptors describe the type of a Message.  This includes what fields
-a message contains and what the types of those fields are.  Message
-descriptors are represented in \proglang{R} by the \emph{Descriptor} S4
-class. The class contains the slots \code{pointer} and
-\code{type}.  Similarly to messages, the \verb|$| operator can be
-used to retrieve descriptors that are contained in the descriptor, or
-invoke methods.
-
-When \pkg{RProtoBuf} is first loaded it calls
-\code{readProtoFiles} to read in the example \code{addressbook.proto} file
-included with the package.  The \code{tutorial.Person} descriptor
-and all other descriptors defined in the loaded \code{.proto} files are
-then available on the search path\footnote{This explains why the example in
-Table~\ref{tab:proto} lacked an explicit call to
-\code{readProtoFiles}.}.
-
-\subsubsection*{Field descriptors}
-\label{subsec-field-descriptor}
-
-<<>>=
-tutorial.Person$email 
-tutorial.Person$email$is_required()
-tutorial.Person$email$type()
-tutorial.Person$email$as.character()
-class(tutorial.Person$email)
-@
-
-\subsubsection*{Enum and EnumValue descriptors}
-\label{subsec-enum-descriptor}
-
-The \code{EnumDescriptor} type contains information about what values a
-type defines, while the \code{EnumValueDescriptor} describes an
-individual enum constant of a particular type.  The \verb|$| operator
-can be used to retrieve the value of enum constants contained in the
-EnumDescriptor, or to invoke methods.
-
-<<>>=
-tutorial.Person$PhoneType
-tutorial.Person$PhoneType$WORK
-class(tutorial.Person$PhoneType)
-tutorial.Person$PhoneType$value(1)
-tutorial.Person$PhoneType$value(name="HOME")
-tutorial.Person$PhoneType$value(number=1)
-class(tutorial.Person$PhoneType$value(1))
-@
-
-\subsubsection*{File descriptors}
-\label{subsec-file-descriptor}
-
-The class \emph{FileDescriptor} represents file descriptors in \proglang{R}.
-The \verb|$| operator can be used to retrieve named fields defined in
-the FileDescriptor, or to invoke methods.
-
-% < < > > =
-% f <- tutorial.Person$fileDescriptor()
-% f
-% f$Person
-% @
-
-\begin{Schunk}
-\begin{Sinput}
-R> f <- tutorial.Person$fileDescriptor()
-R> f
-\end{Sinput}
-\begin{Soutput}
-file descriptor for package tutorial \
-    (/usr/local/lib/R/site-library/RProtoBuf/proto/addressbook.proto)
-\end{Soutput}
-\begin{Sinput}
-R> f$Person
-\end{Sinput}
-\begin{Soutput}
-descriptor for type 'tutorial.Person' 
-\end{Soutput}
-\end{Schunk}
-
-
-\section{Type coercion}
-\label{sec:types}
-
-One of the benefits of using an Interface Definition Language (IDL)
-like Protocol Buffers is that it provides a highly portable basic type
-system. This permits different language and hardware implementations to map to
-the most appropriate type in different environments.
-
-Table~\ref{table-get-types} details the correspondence between the
-field type and the type of data that is retrieved by \verb|$| and \verb|[[|
-extractors.  Three types in particular need further attention due to
-specific differences in the \proglang{R} language: booleans, unsigned
-integers, and 64-bit integers.
-
-\begin{table}[h]
-\centering
-\begin{small}
-\begin{tabular}{lp{5cm}p{5.5cm}}
-\toprule
-Field type & \proglang{R} type (non repeated) & \proglang{R} type (repeated) \\
-\cmidrule(r){2-3}
-double	& \code{double} vector & \code{double} vector \\
-float	& \code{double} vector & \code{double} vector \\[3mm]
-uint32	  & \code{double} vector & \code{double} vector \\
-fixed32	  & \code{double} vector & \code{double} vector \\[3mm]
-int32	  & \code{integer} vector & \code{integer} vector \\
-sint32	  & \code{integer} vector & \code{integer} vector \\
-sfixed32  & \code{integer} vector & \code{integer} vector \\[3mm]
-int64	  & \code{integer} or \code{character} vector & \code{integer} or \code{character} vector \\
-uint64	  & \code{integer} or \code{character} vector & \code{integer} or \code{character} vector \\
-sint64	  & \code{integer} or \code{character} vector & \code{integer} or \code{character} vector \\
-fixed64	  & \code{integer} or \code{character} vector & \code{integer} or \code{character} vector \\
-sfixed64  & \code{integer} or \code{character} vector & \code{integer} or \code{character} vector \\[3mm]
-bool	& \code{logical} vector & \code{logical} vector \\[3mm]
-string	& \code{character} vector & \code{character} vector \\
-bytes	& \code{character} vector & \code{character} vector \\[3mm]
-enum & \code{integer} vector & \code{integer} vector \\[3mm]
-message & \code{S4} object of class \code{Message} & \code{list} of \code{S4} objects of class \code{Message} \\
-\bottomrule
-\end{tabular}
-\end{small}
-\caption{\label{table-get-types}Correspondence between field type and
-  \proglang{R} type retrieved by the extractors. \proglang{R} lacks native
-  64-bit integers, so the \code{RProtoBuf.int64AsString} option is
-  available to return large integers as characters to avoid losing
[TRUNCATED]

To get the complete diff run:
    svnlook diff /svnroot/rprotobuf -r 942

