[Rprotobuf-commits] r914 - papers/jss

noreply at r-forge.r-project.org
Wed Nov 26 22:53:19 CET 2014


Author: murray
Date: 2014-11-26 22:53:18 +0100 (Wed, 26 Nov 2014)
New Revision: 914

Modified:
   papers/jss/article.Rnw
Log:
Cut section 6 in half by removing the section explaining the caveats
about formulas and types not supported by serialize_pb, and just
explain now that we serialize everything, but in one sentence explain
the caveat that we fall back to base::serialize for R-specific types
like language, function, and environment.  This basically removes the
need for 6.1 at all, so remove that section, and move a tiny bit of
the top text about which datasets we are using into the top of section
6.2 which explains the compression performance.
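
A minimal sketch of the fallback idea, in base R only. It does not call
RProtoBuf and is not the package's implementation; the helper names
to_raw_field and from_raw_field are hypothetical. It only shows that
base::serialize turns an R-specific object into a raw vector, which is
the kind of value a Protocol Buffer bytes field can hold.

```r
# Hypothetical helpers: illustrate storing R-specific objects as raw bytes.
to_raw_field <- function(x) {
  # With connection = NULL, base::serialize returns a raw vector.
  serialize(x, connection = NULL)
}

from_raw_field <- function(bytes) {
  # base::unserialize accepts a raw vector and rebuilds the object.
  unserialize(bytes)
}

# A function and a language object, two of the R-specific storage types.
square <- function(x) x^2
expr <- quote(a + b)

payload <- to_raw_field(square)
restored <- from_raw_field(payload)

is.raw(payload)                                       # payload is raw bytes
restored(4)                                           # restored function is callable
identical(from_raw_field(to_raw_field(expr)), expr)   # language object round trip
```

Because the payload is only meaningful to R, a non-R consumer of such a
message could carry the bytes around but not interpret them.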

Next step: Replace table with a plot as hadley and one of the referees
both suggested.

Another referee suggested just merging 5 and 6 together completely.
Now that section 6 is one page plus one large table, that seems
feasible, but I'll revisit that after replacing the table with a plot.
I do want to make a stark distinction between having a schema (section
5) and not (section 6).  I never really use the section 6 schema-less
method, but it's the one that is probably easier for people to play
around with if they don't have a real application with protocol buffers
yet.



Modified: papers/jss/article.Rnw
===================================================================
--- papers/jss/article.Rnw	2014-11-26 21:13:54 UTC (rev 913)
+++ papers/jss/article.Rnw	2014-11-26 21:53:18 UTC (rev 914)
@@ -868,7 +868,7 @@
 within \proglang{R}.
 The package also provides methods for converting arbitrary \proglang{R} data structures into Protocol
 Buffers and vice versa with a universal \proglang{R} object schema. The \code{serialize\_pb} and \code{unserialize\_pb}
-functions serialize arbitrary \proglang{R} objects into a universal Protocol Buffer 
+functions serialize arbitrary \proglang{R} objects into a universal Protocol Buffer
 message:
 
 <<>>=
@@ -877,76 +877,44 @@
 @
 
 In order to accomplish this, \pkg{RProtoBuf} uses the same catch-all \code{proto}
-schema used by \pkg{RHIPE} for exchanging \proglang{R} data with Hadoop \citep{rhipe}. This 
+schema used by \pkg{RHIPE} for exchanging \proglang{R} data with Hadoop \citep{rhipe}. This
 schema, which we will refer to as \code{rexp.proto}, is printed in
 %appendix \ref{rexp.proto}.
 the appendix.
 The Protocol Buffer messages generated by \pkg{RProtoBuf} and
-\pkg{RHIPE} are naturally compatible between the two systems because they use the 
+\pkg{RHIPE} are naturally compatible between the two systems because they use the
 same schema. This shows the power of using a schema-based cross-platform format such
 as Protocol Buffers: interoperability is achieved without effort or close coordination.
 
-The \code{rexp.proto} schema supports all main \proglang{R} storage types holding \emph{data}.
-These include \code{NULL}, \code{list} and vectors of type \code{logical}, 
-\code{character}, \code{double}, \code{integer}, and \code{complex}. In addition,
-every type can contain a named set of attributes, as is the case in \proglang{R}. The \code{rexp.proto}
-schema does not support some of the special \proglang{R} specific storage types, such as \code{function},
-\code{language} or \code{environment}. Such objects have no native equivalent 
-type in Protocol Buffers, and have little meaning outside the context of \proglang{R}.
-When serializing \proglang{R} objects using \code{serialize\_pb}, values or attributes of
-unsupported types are skipped with a warning. If the user really wishes to serialize these 
-objects, they need to be converted into a supported type. For example, the  can use 
-\code{deparse} to convert functions or language objects into strings, or \code{as.list}
-for environments.
+The \code{rexp.proto} schema natively supports all main \proglang{R}
+storage types holding \emph{data}.  These include \code{NULL},
+\code{list} and vectors of type \code{logical}, \code{character},
+\code{double}, \code{integer}, and \code{complex}. In addition, every
+type can contain a named set of attributes, as is the case in
+\proglang{R}. The storage types \code{function}, \code{language}, and
+\code{environment} are specific to \proglang{R} and have no equivalent
+native type in Protocol Buffers.  These three types are supported by
+first serializing them with \code{base::serialize} in \proglang{R} and
+then storing the result in a raw bytes field.
 
-\subsection[Evaluation: Converting R data sets]{Evaluation: Converting \proglang{R} data sets}
 
-To illustrate how this method works, we attempt to convert all of the built-in 
-data sets from \proglang{R} into this serialized Protocol Buffer representation.
+\subsection[Evaluation: Serializing R data sets]{Evaluation: Serializing \proglang{R} data sets}
+\label{sec:compression}
 
-<<echo=TRUE>>=
+<<echo=FALSE>>=
 datasets <- as.data.frame(data(package="datasets")$results)
 datasets$name <- sub("\\s+.*$", "", datasets$Item)
 n <- nrow(datasets)
 @
 
-There are \Sexpr{n} standard data sets included in the \pkg{datasets}
-package included with \proglang{R}. These data sets include data frames, matrices, time series, tables lists,
-and some more exotic data classes. The \code{can\_serialize\_pb} method is 
-used to determine which of those can fully be converted to the \code{rexp.proto}
-Protocol Buffer representation. This method simply checks if any of the values or
-attributes in an object is of an unsupported type:
+This section evaluates the effectiveness of serializing arbitrary
+\proglang{R} data structures into Protocol Buffers.  We use the
+\Sexpr{n} standard data sets in the \pkg{datasets} package
+distributed with \proglang{R} as our evaluation data. These data sets
+include data frames, matrices, time series, tables, lists, and some
+more exotic data classes.  For each data set, we compare how many
+bytes are used to store the data set using four different methods:
 
-<<echo=TRUE>>=
-m <- sum(sapply(datasets$name, function(x) can_serialize_pb(get(x))))
-@
-
-\Sexpr{m} data sets can be converted to Protocol Buffers
-without loss of information (\Sexpr{format(100*m/n,digits=1)}\%). Upon closer
-inspection, all other data sets are objects of class \code{nfnGroupedData}.
-This class represents a special type of data frame that has some additional 
-attributes (such as a \emph{formula} object) used by the \pkg{nlme} package \citep{nlme}.
-Because formulas are \proglang{R} \emph{language} objects, they have little meaning to
-other systems, and are not supported by the \code{rexp.proto} descriptor.
-When \code{serialize\_pb} is used on objects of this class, it will serialize
-the data frame and all attributes, except for the formula.
-
-<<>>=
-attr(CO2, "formula")
-msg <- serialize_pb(CO2, NULL)
-object <- unserialize_pb(msg)
-identical(CO2, object)
-identical(class(CO2), class(object))
-identical(dim(CO2), dim(object))
-attr(object, "formula")
-@
-
-\subsection{Compression performance}
-\label{sec:compression}
-
-This section compares how many bytes are used to store data sets
-using four different methods:
-
 \begin{itemize}
 \item normal \proglang{R} serialization \citep{serialization},
 \item \proglang{R} serialization followed by gzip,
@@ -977,9 +945,9 @@
 @
 
 Table~\ref{tab:compression} shows the sizes of 50 sample \proglang{R} data sets as
-returned by object.size() compared to the serialized sizes.
+returned by \code{object.size()} compared to the serialized sizes.
 %The summary compression sizes are listed below, and a full table for a
-%sample of 50 data sets is included on the next page.  
+%sample of 50 data sets is included on the next page.
 Note that Protocol Buffer serialization results in slightly
 smaller byte streams compared to native \proglang{R} serialization in most cases,
 but this difference disappears if the results are compressed with gzip.
@@ -1443,7 +1411,6 @@
 
 \begin{verbatim}
 package rexp;
-
 message REXP {
   enum RClass {
     STRING = 0;
@@ -1454,14 +1421,16 @@
     LIST = 5;
     LOGICAL = 6;
     NULLTYPE = 7;
+    LANGUAGE = 8;
+    ENVIRONMENT = 9;
+    FUNCTION = 10;
   }
   enum RBOOLEAN {
     F=0;
     T=1;
     NA=2;
   }
-
-  required RClass rclass = 1 ; 
+  required RClass rclass = 1;
   repeated double realValue = 2 [packed=true];
   repeated sint32 intValue = 3 [packed=true];
   repeated RBOOLEAN booleanValue = 4;
@@ -1471,6 +1440,9 @@
   repeated REXP rexpValue = 8;
   repeated string attrName = 11;
   repeated REXP attrValue = 12;
+  optional bytes languageValue = 13;
+  optional bytes environmentValue = 14;
+  optional bytes functionValue = 15;
 }
 message STRING {
   optional string strval = 1;
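
The byte-size comparison described in the evaluation text above can be
sketched for the two methods that need only base R (normal R
serialization, and R serialization followed by gzip). This is an
illustration under stated assumptions, not the code used in the paper:
the helper name serialized_sizes and the choice of three data sets are
hypothetical, and the Protocol Buffer variants are left out because
they require RProtoBuf.

```r
# Hypothetical helper: sizes in bytes for one built-in data set.
serialized_sizes <- function(name) {
  x <- get(name)                           # look up the data set by name
  bytes <- serialize(x, connection = NULL) # normal R serialization
  c(object.size   = as.numeric(object.size(x)),
    serialized    = length(bytes),
    serialized.gz = length(memCompress(bytes, type = "gzip")))
}

# One row per data set, one column per size measure.
sizes <- t(sapply(c("cars", "iris", "mtcars"), serialized_sizes))
print(sizes)
```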


