[Rprotobuf-commits] r766 - papers/jss

Mon Jan 13 07:36:03 CET 2014

Author: jeroenooms
Date: 2014-01-13 07:36:03 +0100 (Mon, 13 Jan 2014)
New Revision: 766

Modified:
   papers/jss/article.Rnw
Log:
pass 3

Modified: papers/jss/article.Rnw
===================================================================

--- papers/jss/article.Rnw	2014-01-13 02:19:36 UTC (rev 765)
+++ papers/jss/article.Rnw	2014-01-13 06:36:03 UTC (rev 766)
@@ -1148,7 +1148,8 @@
 In addition to these low-level methods, the package also has some high level
 functionality for automatically converting R data structures into protocol
 buffers and vice versa. The \texttt{serialize\_pb} and \texttt{unserialize\_pb}
-functions convert arbitrary R objects into a universal Protocol Buffer structure:
+functions serialize arbitrary R objects into a universal Protocol Buffer 
+message:
 
 <<>>=
 msg <- serialize_pb(iris, NULL)
@@ -1156,14 +1157,26 @@
 @
 
 In order to accomplish this, \pkg{RProtoBuf} uses the same catch-all \texttt{proto}
-schema that \pkg{RHIPE} uses for exchanging R data with Hadoop \citep{rhipe}. This 
+schema used by \pkg{RHIPE} for exchanging R data with Hadoop \citep{rhipe}. This 
 schema, which we will refer to as \texttt{rexp.proto} is printed in appendix 
-\ref{rex.proto}. Even though the \texttt{RHIPE} implementation is written in Java and
-\texttt{RProtoBuf} is writting in R and \texttt{C++}, the Protocol Buffer messages
-are naturally compatible between the two systems because they use the same schema. 
-This shows the power of using a schema based cross-platform format such as Protocol 
-Buffers: interoperability is archieved without tight coordination or collaboration.
+\ref{rexp.proto}. The Protocol Buffer messages generated by \pkg{RProtoBuf} and
+\pkg{RHIPE} are naturally compatible between the two systems because they use the 
+same schema. This shows the power of using a schema-based cross-platform format such
+as Protocol Buffers: interoperability is achieved without effort or close coordination.
 
+The \texttt{rexp.proto} schema supports all main R storage types holding \emph{data}.
+These include \texttt{NULL}, \texttt{list} and vectors of type \texttt{logical}, 
+\texttt{character}, \texttt{double}, \texttt{integer} and \texttt{complex}. In addition,
+every type can contain a named set of attributes, as is the case in R. The \texttt{rexp.proto}
+schema does not support some of the special R-specific storage types, such as \texttt{function},
+\texttt{language} or \texttt{environment}. Such objects have no native equivalent 
+type in Protocol Buffers, and have little meaning outside the context of R.
+When serializing R objects using \texttt{serialize\_pb}, values or attributes of
+unsupported types are skipped with a warning. If the user really wishes to serialize these 
+objects, they need to be converted into a supported type. For example, the user can use 
+\texttt{deparse} to convert functions or language objects into strings, or \texttt{as.list}
+for environments.
+
 \subsection{Evaluation: Converting R Data Sets}
 
 To illustrate how this method works, we attempt to convert all of the built-in 
@@ -1177,14 +1190,13 @@
 
 There are \Sexpr{n} standard data sets included in the base-r \pkg{datasets}
 package. These datasets include data frames, matrices, time series, tables lists,
-and some more exotic data classes. The \texttt{can\_serialize\_pb} method can be 
+and some more exotic data classes. The \texttt{can\_serialize\_pb} method is 
 used to determine which of those can fully be converted to the \texttt{rexp.proto}
-Protocol Buffer representation:
+Protocol Buffer representation. This method simply checks if any of the values or
+attributes in an object is of an unsupported type:
 
 <<echo=TRUE>>=
-datasets$valid.proto <- sapply(datasets$name, function(x) can_serialize_pb(get(x)))
-datasets <- subset(datasets, valid.proto==TRUE)
-m <- nrow(datasets)
+m <- sum(sapply(datasets$name, function(x) can_serialize_pb(get(x))))
 @
 
 \Sexpr{m} data sets can be converted to Protocol Buffers
@@ -1194,10 +1206,19 @@
 attributes used by the \pkg{nlme} package, among which a \emph{formula} object.
 Because formulas are R \emph{language} objects, they have little meaning to
 other systems, and are not supported by the \texttt{rexp.proto} descriptor.
-When \texttt{serialize\_pb} is used on objects of this class (or other objects
-containing unsupported data types), it will serialize all other values and 
-attributes of the object, but skip over the unsupported types with a warning.
+When \texttt{serialize\_pb} is used on objects of this class, it will serialize
+the data frame and all attributes, except for the formula.
 
+<<>>=
+attr(CO2, "formula")
+msg <- serialize_pb(CO2, NULL)
+object <- unserialize_pb(msg)
+identical(CO2, object)
+identical(class(CO2), class(object))
+identical(dim(CO2), dim(object))
+attr(object, "formula")
+@
+
 \subsection{Compression Performance}
 \label{sec:compression}
 
@@ -1246,11 +1267,6 @@
 comes from its interoperability with other environments, as well as its safe
 versioning,
 
-TODO comparison of protobuf serialization sizes/times for various vectors.
-Compared to R's native serialization.  Discussion of the RHIPE approach of
-serializing any/all R objects, vs more specific Protocol Buffers for specific
-R objects.
-
 % N.B. see table.Rnw for how this table is created.
 %
 % latex table generated in R 3.0.2 by xtable 1.7-0 package
@@ -1340,6 +1356,8 @@
 \section{Descriptor lookup}
 \label{sec-lookup}
 
+%JO: is this section really relevant? Maybe just a citation will do instead?
+
 The \texttt{RProtoBuf} package uses the user defined tables framework
 that is defined as part of the \texttt{RObjectTables} package available
 from the OmegaHat project \citep{RObjectTables}.
@@ -1374,7 +1392,7 @@
 As described earlier, the primary application of Protocol Buffers is data
 interchange in the context of inter-system communications.  Network protocols
 such as HTTP provide mechanisms for client-server communication, i.e. how to
-initiate requests, authenticate, send messages, etc.  However, many network
+initiate requests, authenticate, send messages, etc.  However, network
 protocols generally do not regulate the \emph{content} of messages: they
 allow transfer of any media type, such as web pages, static files or
 multimedia content.  When designing systems where various components require
@@ -1424,10 +1442,7 @@
 languages, clients can be implemented in just a few lines of code. Below
 is example code for both R and Python that retrieves a dataset from R with 
 OpenCPU using a protobuf message. In R, we use the HTTP client from 
-the \texttt{httr} package \citep{httr}.
-% superfluous?
-%, and the protobuf parser from the \texttt{RProtoBuf} package.
-In this example we
+the \texttt{httr} package \citep{httr}. In this example we
 download a dataset which is part of the base R distribution, so we can
 verify that the object was transferred without loss of information.
 
@@ -1444,8 +1459,8 @@
 identical(output, MASS::Animals)
 @
 
-This code suggests a method for exchanging objects between R servers, however this can 
-also be done without Protocol Buffers. The main advantage of using an inter-operable format 
+This code suggests a method for exchanging objects between R servers, however this might as 
+well be done without Protocol Buffers. The main advantage of using an inter-operable format 
 is that we can actually access R objects from within another
 programming language. For example, in a very similar fashion we can retrieve the same
 dataset in a Python client. To parse messages in Python, we first compile the 
@@ -1536,7 +1551,10 @@
 output of a function call on the server, instead of directly retrieving it. Thereby 
 objects can be shared with other users or used as arguments in a subsequent
 function call. But in its essence, the HTTP API provides a simple way to perform remote 
-R function calls over HTTPS. The same request can be performed in Python as follows:
+R function calls over HTTPS. The same request can be performed in Python as demonstrated
+below. The code is a bit verbose because it shows how the REXP message is created from 
+scratch. In practice, one would probably write a function or small module to construct a Protocol
+Buffer message representing an R list from a Python dictionary object. 
 
 \begin{verbatim}
 import urllib2;
@@ -1578,8 +1596,8 @@
 \section{Application: Distributed Data Collection with MapReduce}
 \label{sec:mapreduce}
 
-The MapReduce programming model \citep{dean2008mapreduce} has emerged
-in the last decade as a popular framework for processing large data
+Over the past years, the MapReduce programming model \citep{dean2008mapreduce}
+has emerged as a powerful design pattern for processing large data
 sets in parallel on large compute clusters.  Protocol Buffers
 provide a convenient mechanism to send structured data between tasks
 in a MapReduce cluster.  In particular, the large data sets in fields