[Rprotobuf-commits] r766 - papers/jss
noreply at r-forge.r-project.org
Mon Jan 13 07:36:03 CET 2014
Author: jeroenooms
Date: 2014-01-13 07:36:03 +0100 (Mon, 13 Jan 2014)
New Revision: 766
Modified:
papers/jss/article.Rnw
Log:
pass 3
Modified: papers/jss/article.Rnw
===================================================================
--- papers/jss/article.Rnw 2014-01-13 02:19:36 UTC (rev 765)
+++ papers/jss/article.Rnw 2014-01-13 06:36:03 UTC (rev 766)
@@ -1148,7 +1148,8 @@
In addition to these low-level methods, the package also has some high level
functionality for automatically converting R data structures into protocol
buffers and vice versa. The \texttt{serialize\_pb} and \texttt{unserialize\_pb}
-functions convert arbitrary R objects into a universal Protocol Buffer structure:
+functions serialize arbitrary R objects into a universal Protocol Buffer
+message:
<<>>=
msg <- serialize_pb(iris, NULL)
@@ -1156,14 +1157,26 @@
@
In order to accomplish this, \pkg{RProtoBuf} uses the same catch-all \texttt{proto}
-schema that \pkg{RHIPE} uses for exchanging R data with Hadoop \citep{rhipe}. This
+schema used by \pkg{RHIPE} for exchanging R data with Hadoop \citep{rhipe}. This
schema, which we will refer to as \texttt{rexp.proto}, is printed in appendix
-\ref{rex.proto}. Even though the \texttt{RHIPE} implementation is written in Java and
-\texttt{RProtoBuf} is writting in R and \texttt{C++}, the Protocol Buffer messages
-are naturally compatible between the two systems because they use the same schema.
-This shows the power of using a schema based cross-platform format such as Protocol
-Buffers: interoperability is archieved without tight coordination or collaboration.
+\ref{rexp.proto}. The Protocol Buffer messages generated by \pkg{RProtoBuf} and
+\pkg{RHIPE} are naturally compatible between the two systems because they use the
+same schema. This shows the power of using a schema-based cross-platform format such
+as Protocol Buffers: interoperability is achieved without effort or close coordination.
+The \texttt{rexp.proto} schema supports all main R storage types holding \emph{data}.
+These include \texttt{NULL}, \texttt{list} and vectors of type \texttt{logical},
+\texttt{character}, \texttt{double}, \texttt{integer} and \texttt{complex}. In addition,
+every type can contain a named set of attributes, as is the case in R. The \texttt{rexp.proto}
+schema does not support some of the special R-specific storage types, such as \texttt{function},
+\texttt{language} or \texttt{environment}. Such objects have no native equivalent
+type in Protocol Buffers, and have little meaning outside the context of R.
+When serializing R objects using \texttt{serialize\_pb}, values or attributes of
+unsupported types are skipped with a warning. If the user really wishes to serialize these
+objects, they need to be converted into a supported type. For example, the user can use
+\texttt{deparse} to convert functions or language objects into strings, or \texttt{as.list}
+for environments.
+
\subsection{Evaluation: Converting R Data Sets}
To illustrate how this method works, we attempt to convert all of the built-in
@@ -1177,14 +1190,13 @@
There are \Sexpr{n} standard data sets included in the base-r \pkg{datasets}
package. These datasets include data frames, matrices, time series, tables, lists,
-and some more exotic data classes. The \texttt{can\_serialize\_pb} method can be
+and some more exotic data classes. The \texttt{can\_serialize\_pb} method is
used to determine which of those can fully be converted to the \texttt{rexp.proto}
-Protocol Buffer representation:
+Protocol Buffer representation. This method simply checks if any of the values or
+attributes in an object is of an unsupported type:
<<echo=TRUE>>=
-datasets$valid.proto <- sapply(datasets$name, function(x) can_serialize_pb(get(x)))
-datasets <- subset(datasets, valid.proto==TRUE)
-m <- nrow(datasets)
+m <- sum(sapply(datasets$name, function(x) can_serialize_pb(get(x))))
@
\Sexpr{m} data sets can be converted to Protocol Buffers
@@ -1194,10 +1206,19 @@
attributes used by the \pkg{nlme} package, among which a \emph{formula} object.
Because formulas are R \emph{language} objects, they have little meaning to
other systems, and are not supported by the \texttt{rexp.proto} descriptor.
-When \texttt{serialize\_pb} is used on objects of this class (or other objects
-containing unsupported data types), it will serialize all other values and
-attributes of the object, but skip over the unsupported types with a warning.
+When \texttt{serialize\_pb} is used on objects of this class, it will serialize
+the data frame and all attributes, except for the formula.
+<<>>=
+attr(CO2, "formula")
+msg <- serialize_pb(CO2, NULL)
+object <- unserialize_pb(msg)
+identical(CO2, object)
+identical(class(CO2), class(object))
+identical(dim(CO2), dim(object))
+attr(object, "formula")
+@
+
\subsection{Compression Performance}
\label{sec:compression}
@@ -1246,11 +1267,6 @@
comes from its interoperability with other environments, as well as its safe
versioning,
-TODO comparison of protobuf serialization sizes/times for various vectors.
-Compared to R's native serialization. Discussion of the RHIPE approach of
-serializing any/all R objects, vs more specific Protocol Buffers for specific
-R objects.
-
% N.B. see table.Rnw for how this table is created.
%
% latex table generated in R 3.0.2 by xtable 1.7-0 package
@@ -1340,6 +1356,8 @@
\section{Descriptor lookup}
\label{sec-lookup}
+%JO: is this section really relevant? Maybe just a citation will do instead?
+
The \texttt{RProtoBuf} package uses the user defined tables framework
that is defined as part of the \texttt{RObjectTables} package available
from the OmegaHat project \citep{RObjectTables}.
@@ -1374,7 +1392,7 @@
As described earlier, the primary application of Protocol Buffers is data
interchange in the context of inter-system communications. Network protocols
such as HTTP provide mechanisms for client-server communication, i.e. how to
-initiate requests, authenticate, send messages, etc. However, many network
+initiate requests, authenticate, send messages, etc. However, network
protocols generally do not regulate the \emph{content} of messages: they
allow transfer of any media type, such as web pages, static files or
multimedia content. When designing systems where various components require
@@ -1424,10 +1442,7 @@
languages, clients can be implemented in just a few lines of code. Below
is example code for both R and Python that retrieves a dataset from R with
OpenCPU using a protobuf message. In R, we use the HTTP client from
-the \texttt{httr} package \citep{httr}.
-% superfluous?
-%, and the protobuf parser from the \texttt{RProtoBuf} package.
-In this example we
+the \texttt{httr} package \citep{httr}. In this example we
download a dataset which is part of the base R distribution, so we can
verify that the object was transferred without loss of information.
@@ -1444,8 +1459,8 @@
identical(output, MASS::Animals)
@
-This code suggests a method for exchanging objects between R servers, however this can
-also be done without Protocol Buffers. The main advantage of using an inter-operable format
+This code suggests a method for exchanging objects between R servers; however, this could just as
+well be done without Protocol Buffers. The main advantage of using an inter-operable format
is that we can actually access R objects from within another
programming language. For example, in a very similar fashion we can retrieve the same
dataset in a Python client. To parse messages in Python, we first compile the
@@ -1536,7 +1551,10 @@
output of a function call on the server, instead of directly retrieving it. Thereby
objects can be shared with other users or used as arguments in a subsequent
function call. But in its essence, the HTTP API provides a simple way to perform remote
-R function calls over HTTPS. The same request can be performed in Python as follows:
+R function calls over HTTPS. The same request can be performed in Python as demonstrated
+below. The code is a bit verbose because it shows how the REXP message is created from
+scratch. In practice one would probably write a function or small module to construct a Protocol
+Buffer message representing an R list from a Python dictionary object.
\begin{verbatim}
import urllib2;
@@ -1578,8 +1596,8 @@
\section{Application: Distributed Data Collection with MapReduce}
\label{sec:mapreduce}
-The MapReduce programming model \citep{dean2008mapreduce} has emerged
-in the last decade as a popular framework for processing large data
+Over the past years, the MapReduce programming model \citep{dean2008mapreduce}
+has emerged as a powerful design pattern for processing large data
sets in parallel on large compute clusters. Protocol Buffers
provide a convenient mechanism to send structured data between tasks
in a MapReduce cluster. In particular, the large data sets in fields
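
Note on the OpenCPU hunk above: the revised text says that, in practice, one would write a small function that constructs a Protocol Buffer message representing an R list from a Python dictionary. A minimal sketch of such a helper might look like the following. It is not part of the commit. The function name dict_to_rexp is hypothetical, and the field names (rclass, realValue, stringValue, strval, rexpValue, attrName, attrValue) and the numeric rclass codes are assumptions based on the rexp.proto schema. They should be checked against the appendix of the paper before use. The message classes are passed in as arguments so that the sketch does not depend on where the generated module lives.

```python
# Assumed rclass enum codes from rexp.proto (verify against the schema).
RCLASS_STRING = 0
RCLASS_REAL = 2
RCLASS_LIST = 5


def dict_to_rexp(values, REXP, STRING):
    """Build a REXP message describing a named R list from a dict.

    REXP and STRING are the message classes that protoc generates from
    rexp.proto. Only strings and plain numbers are handled in this sketch;
    anything else raises TypeError instead of being silently dropped.
    """
    elements = []
    for value in values.values():
        if isinstance(value, str):
            # A length-one R character vector.
            elements.append(REXP(rclass=RCLASS_STRING,
                                 stringValue=[STRING(strval=value)]))
        elif isinstance(value, (int, float)) and not isinstance(value, bool):
            # A length-one R numeric (double) vector.
            elements.append(REXP(rclass=RCLASS_REAL,
                                 realValue=[float(value)]))
        else:
            raise TypeError("unsupported value type: %r" % type(value))

    # R stores list names in the "names" attribute, a character vector.
    names = REXP(rclass=RCLASS_STRING,
                 stringValue=[STRING(strval=str(key)) for key in values])

    return REXP(rclass=RCLASS_LIST,
                rexpValue=elements,
                attrName=["names"],
                attrValue=[names])
```

Assuming the schema was compiled with protoc into a module named rexp_pb2 (the name protoc derives from rexp.proto), the helper could be used as dict_to_rexp({"n": 10, "mean": 5}, rexp_pb2.REXP, rexp_pb2.STRING).SerializeToString() to obtain the request body. The sketch relies on dictionaries preserving insertion order, which holds for Python 3.7 and later.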