[Rprotobuf-commits] r765 - papers/jss
noreply at r-forge.r-project.org
Mon Jan 13 03:19:36 CET 2014
Author: jeroenooms
Date: 2014-01-13 03:19:36 +0100 (Mon, 13 Jan 2014)
New Revision: 765
Modified:
papers/jss/article.Rnw
Log:
second pass at data frames
Modified: papers/jss/article.Rnw
===================================================================
--- papers/jss/article.Rnw 2014-01-13 00:21:23 UTC (rev 764)
+++ papers/jss/article.Rnw 2014-01-13 02:19:36 UTC (rev 765)
@@ -1,5 +1,6 @@
\documentclass[article]{jss}
\usepackage{booktabs}
+\usepackage[toc,page]{appendix}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% declarations for jss.cls %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@@ -1139,25 +1140,35 @@
options("RProtoBuf.int64AsString" = FALSE)
@
-\section{Evaluation: data.frame to Protocol Buffer Serialization}
+\section{Converting R Data Structures into Protocol Buffers}
\label{sec:evaluation}
-The \pkg{RHIPE} package \citep{rhipe} also includes a Protocol integration with R.
-However, its implementation takes a different approach: any R object is
-serialized into a message based on a single catch-all \texttt{proto} schema.
-A similar approach was taken by \pkg{RProtoBufUtils} package (which has now been integrated in
-\pkg{RProtoBuf}). Unlike \pkg{RHIPE}, however, \pkg{RProtoBufUtils}
-depended upon on, and extended, \pkg{RProtoBuf} for underlying message operations.
-%DE Shall this go away now that we sucket RPBUtils into RBP?
+The previous sections discussed functionality in the \pkg{RProtoBuf} package
+for creating, manipulating, parsing and serializing Protocol Buffer messages.
+In addition to these low-level methods, the package also has some high-level
+functionality for automatically converting R data structures into protocol
+buffers and vice versa. The \texttt{serialize\_pb} and \texttt{unserialize\_pb}
+functions convert arbitrary R objects into a universal Protocol Buffer structure:
-One key extension which \pkg{RProtoBufUtils} brought to \pkg{RProtoBuf} is the
-\texttt{serialize\_pb} method to convert R objects into serialized
-Protocol Buffers in the catch-all schema. The \texttt{can\_serialize\_pb}
-method can then be used to determine whether the given R object can safely
-be expressed in this way. To illustrate how this method works, we
-attempt to convert all of the built-in datasets from R into this
-serialized Protocol Buffer representation.
+<<>>=
+msg <- serialize_pb(iris, NULL)
+identical(iris, unserialize_pb(msg))
+@
+In order to accomplish this, \pkg{RProtoBuf} uses the same catch-all \texttt{proto}
+schema that \pkg{RHIPE} uses for exchanging R data with Hadoop \citep{rhipe}. This
+schema, which we will refer to as \texttt{rexp.proto}, is printed in Appendix
+\ref{rexp.proto}. Even though the \pkg{RHIPE} implementation is written in Java and
+\pkg{RProtoBuf} is written in R and \texttt{C++}, the Protocol Buffer messages
+are naturally compatible between the two systems because they use the same schema.
+This shows the power of using a schema-based cross-platform format such as Protocol
+Buffers: interoperability is achieved without tight coordination or collaboration.
+
+\subsection{Evaluation: Converting R Data Sets}
+
+To illustrate how this method works, we attempt to convert all of the built-in
+datasets from R into this serialized Protocol Buffer representation.
+
<<echo=TRUE>>=
datasets <- as.data.frame(data(package="datasets")$results)
datasets$name <- sub("\\s+.*$", "", datasets$Item)
@@ -1167,7 +1178,7 @@
There are \Sexpr{n} standard data sets included in the base R \pkg{datasets}
package. These datasets include data frames, matrices, time series, tables, lists,
and some more exotic data classes. The \texttt{can\_serialize\_pb} method can be
-used to determine which of those can fully be converted to the \textt{rexp.proto}
+used to determine which of those can fully be converted to the \texttt{rexp.proto}
Protocol Buffer representation:
<<echo=TRUE>>=
@@ -1646,6 +1657,59 @@
%The contemporaneous work by Saptarshi Guha on \pkg{RHIPE} was a strong
%initial motivator.
+\newpage
+\begin{appendices}
+
+\section{The rexp.proto schema descriptor}
+\label{rexp.proto}
+
+Below is a listing of the \texttt{rexp.proto} schema (originally designed by \cite{rhipe})
+that is included with the \pkg{RProtoBuf} package and used by \texttt{serialize\_pb} and
+\texttt{unserialize\_pb}.
+
+\begin{verbatim}
+package rexp;
+
+message REXP {
+  enum RClass {
+    STRING = 0;
+    RAW = 1;
+    REAL = 2;
+    COMPLEX = 3;
+    INTEGER = 4;
+    LIST = 5;
+    LOGICAL = 6;
+    NULLTYPE = 7;
+  }
+  enum RBOOLEAN {
+    F=0;
+    T=1;
+    NA=2;
+  }
+
+  required RClass rclass = 1;
+  repeated double realValue = 2 [packed=true];
+  repeated sint32 intValue = 3 [packed=true];
+  repeated RBOOLEAN booleanValue = 4;
+  repeated STRING stringValue = 5;
+  optional bytes rawValue = 6;
+  repeated CMPLX complexValue = 7;
+  repeated REXP rexpValue = 8;
+  repeated string attrName = 11;
+  repeated REXP attrValue = 12;
+}
+message STRING {
+  optional string strval = 1;
+  optional bool isNA = 2 [default=false];
+}
+message CMPLX {
+  optional double real = 1 [default=0];
+  required double imag = 2;
+}
+\end{verbatim}
+\end{appendices}
+
+
\bibliography{article}
%\section[About Java]{About \proglang{Java}}
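
The RClass enum in the rexp.proto listing above suggests how R storage types line up with the schema. The following base R sketch is only an illustration of that correspondence and is not code from RProtoBuf: the helper names rexp_class and fits_rexp are hypothetical, and the rule that list elements and attribute values must all map onto an RClass is an assumption about what "fully convertible" means, not a description of how can_serialize_pb is implemented.

```r
# Hypothetical helper (not part of RProtoBuf): name the RClass value from
# rexp.proto that corresponds to an object's storage type, or NA if none.
rexp_class <- function(x) {
  classes <- c(character = "STRING", raw = "RAW", double = "REAL",
               complex = "COMPLEX", integer = "INTEGER", list = "LIST",
               logical = "LOGICAL", "NULL" = "NULLTYPE")
  unname(classes[typeof(x)])
}

# Assumed rule: an object fits the schema when its own type, every list
# element (rexpValue), and every attribute value (attrValue) map onto
# an RClass.
fits_rexp <- function(x) {
  if (is.na(rexp_class(x))) return(FALSE)
  parts <- c(if (is.list(x)) unclass(x), attributes(x))
  all(vapply(parts, fits_rexp, logical(1)))
}

rexp_class(1L)          # "INTEGER"
rexp_class(mean)        # NA: closures have no RClass
fits_rexp(iris)         # TRUE: doubles, a factor, and plain attributes
fits_rexp(list(1, sum)) # FALSE: the list contains a function
```

Keeping the type lookup separate from the recursive walk mirrors the schema itself, where a REXP message carries one rclass tag plus nested REXP messages for list elements and attributes.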