[Rprotobuf-commits] r914 - papers/jss
noreply at r-forge.r-project.org
Wed Nov 26 22:53:19 CET 2014
Author: murray
Date: 2014-11-26 22:53:18 +0100 (Wed, 26 Nov 2014)
New Revision: 914
Modified:
papers/jss/article.Rnw
Log:
Cut section 6 in half by removing the section explaining the caveats
about formulas and types not supported by serialized_pb, and just
explain now that we serialize everything, but in one sentence explain
the caveat that we fall back to base::serialize for R-specific types
like language, function, and environment. This basically removes the
need for 6.1 at all, so remove that section, and move a tiny bit of
the top text about which datasets we are using into the top of section
6.2, which explains the compression performance.
Next step: Replace table with a plot as hadley and one of the referees
both suggested.
Another referee suggested just merging 5 and 6 together completely.
Now that section 6 is one page plus one large table, that seems
feasible, but I'll revisit that after replacing the table with a plot.
I do want to make a stark distinction between having a schema (section
5) and not (section 6). I never really use the schema-less method of
section 6, but it's probably the one that is easier for people to play
around with if they don't have a real application with protocol buffers
yet.
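
As a rough illustration of the fallback mentioned in the log above, the following base R sketch shows how an R-specific object can be reduced to a raw vector with base::serialize and recovered again. It deliberately does not call RProtoBuf, so it is not the package's implementation; the variable names (f, payload, restored) are made up for the example, and the assumption is only that such a raw vector is what would be stored in a bytes field.

```r
# Base R only: a sketch of the fallback idea, not RProtoBuf's actual code.

# A function has no native Protocol Buffer equivalent.
f <- function(x) x + 1

# base::serialize with connection = NULL returns a raw vector, i.e. the
# kind of value that fits into a Protocol Buffer `bytes` field.
payload <- serialize(f, connection = NULL)
stopifnot(is.raw(payload))

# Reading the bytes back yields a working function again.
restored <- unserialize(payload)
stopifnot(identical(restored(41), 42))

# The same round trip works for language objects.
expr <- quote(a + b)
expr_back <- unserialize(serialize(expr, connection = NULL))
stopifnot(identical(expr, expr_back))
```

The trade-off is that such a payload is opaque to non-R consumers of the message: they can carry the bytes around but cannot interpret them.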
Modified: papers/jss/article.Rnw
===================================================================
--- papers/jss/article.Rnw 2014-11-26 21:13:54 UTC (rev 913)
+++ papers/jss/article.Rnw 2014-11-26 21:53:18 UTC (rev 914)
@@ -868,7 +868,7 @@
within \proglang{R}.
The package also provides methods for converting arbitrary \proglang{R} data structures into Protocol
Buffers and vice versa with a universal \proglang{R} object schema. The \code{serialize\_pb} and \code{unserialize\_pb}
-functions serialize arbitrary \proglang{R} objects into a universal Protocol Buffer
+functions serialize arbitrary \proglang{R} objects into a universal Protocol Buffer
message:
<<>>=
@@ -877,76 +877,44 @@
@
In order to accomplish this, \pkg{RProtoBuf} uses the same catch-all \code{proto}
-schema used by \pkg{RHIPE} for exchanging \proglang{R} data with Hadoop \citep{rhipe}. This
+schema used by \pkg{RHIPE} for exchanging \proglang{R} data with Hadoop \citep{rhipe}. This
schema, which we will refer to as \code{rexp.proto}, is printed in
%appendix \ref{rexp.proto}.
the appendix.
The Protocol Buffer messages generated by \pkg{RProtoBuf} and
-\pkg{RHIPE} are naturally compatible between the two systems because they use the
+\pkg{RHIPE} are naturally compatible between the two systems because they use the
same schema. This shows the power of using a schema-based cross-platform format such
as Protocol Buffers: interoperability is achieved without effort or close coordination.
-The \code{rexp.proto} schema supports all main \proglang{R} storage types holding \emph{data}.
-These include \code{NULL}, \code{list} and vectors of type \code{logical},
-\code{character}, \code{double}, \code{integer}, and \code{complex}. In addition,
-every type can contain a named set of attributes, as is the case in \proglang{R}. The \code{rexp.proto}
-schema does not support some of the special \proglang{R} specific storage types, such as \code{function},
-\code{language} or \code{environment}. Such objects have no native equivalent
-type in Protocol Buffers, and have little meaning outside the context of \proglang{R}.
-When serializing \proglang{R} objects using \code{serialize\_pb}, values or attributes of
-unsupported types are skipped with a warning. If the user really wishes to serialize these
-objects, they need to be converted into a supported type. For example, the can use
-\code{deparse} to convert functions or language objects into strings, or \code{as.list}
-for environments.
+The \code{rexp.proto} schema natively supports all main \proglang{R}
+storage types holding \emph{data}. These include \code{NULL},
+\code{list} and vectors of type \code{logical}, \code{character},
+\code{double}, \code{integer}, and \code{complex}. In addition, every
+type can contain a named set of attributes, as is the case in
+\proglang{R}. The storage types \code{function}, \code{language}, and
+\code{environment} are specific to \proglang{R} and have no equivalent
+native type in Protocol Buffers. These three types are supported by
+first serializing them with \code{base::serialize} in \proglang{R} and
+then storing the result in a raw bytes field.
-\subsection[Evaluation: Converting R data sets]{Evaluation: Converting \proglang{R} data sets}
-To illustrate how this method works, we attempt to convert all of the built-in
-data sets from \proglang{R} into this serialized Protocol Buffer representation.
+\subsection[Evaluation: Serializing R data sets]{Evaluation: Serializing \proglang{R} data sets}
+\label{sec:compression}
-<<echo=TRUE>>=
+<<echo=FALSE>>=
datasets <- as.data.frame(data(package="datasets")$results)
datasets$name <- sub("\\s+.*$", "", datasets$Item)
n <- nrow(datasets)
@
-There are \Sexpr{n} standard data sets included in the \pkg{datasets}
-package included with \proglang{R}. These data sets include data frames, matrices, time series, tables lists,
-and some more exotic data classes. The \code{can\_serialize\_pb} method is
-used to determine which of those can fully be converted to the \code{rexp.proto}
-Protocol Buffer representation. This method simply checks if any of the values or
-attributes in an object is of an unsupported type:
+This section evaluates the effectiveness of serializing arbitrary
+\proglang{R} data structures into Protocol Buffers. We use the
+\Sexpr{n} standard data sets in the \pkg{datasets} package
+distributed with \proglang{R} as our evaluation data. These data sets
+include data frames, matrices, time series, tables, lists, and some
+more exotic data classes. For each data set, we compare how many
+bytes are used to store the data set using four different methods:
-<<echo=TRUE>>=
-m <- sum(sapply(datasets$name, function(x) can_serialize_pb(get(x))))
-@
-
-\Sexpr{m} data sets can be converted to Protocol Buffers
-without loss of information (\Sexpr{format(100*m/n,digits=1)}\%). Upon closer
-inspection, all other data sets are objects of class \code{nfnGroupedData}.
-This class represents a special type of data frame that has some additional
-attributes (such as a \emph{formula} object) used by the \pkg{nlme} package \citep{nlme}.
-Because formulas are \proglang{R} \emph{language} objects, they have little meaning to
-other systems, and are not supported by the \code{rexp.proto} descriptor.
-When \code{serialize\_pb} is used on objects of this class, it will serialize
-the data frame and all attributes, except for the formula.
-
-<<>>=
-attr(CO2, "formula")
-msg <- serialize_pb(CO2, NULL)
-object <- unserialize_pb(msg)
-identical(CO2, object)
-identical(class(CO2), class(object))
-identical(dim(CO2), dim(object))
-attr(object, "formula")
-@
-
-\subsection{Compression performance}
-\label{sec:compression}
-
-This section compares how many bytes are used to store data sets
-using four different methods:
-
\begin{itemize}
\item normal \proglang{R} serialization \citep{serialization},
\item \proglang{R} serialization followed by gzip,
@@ -977,9 +945,9 @@
@
Table~\ref{tab:compression} shows the sizes of 50 sample \proglang{R} data sets as
-returned by object.size() compared to the serialized sizes.
+returned by \code{object.size()} compared to the serialized sizes.
%The summary compression sizes are listed below, and a full table for a
-%sample of 50 data sets is included on the next page.
+%sample of 50 data sets is included on the next page.
Note that Protocol Buffer serialization results in slightly
smaller byte streams compared to native \proglang{R} serialization in most cases,
but this difference disappears if the results are compressed with gzip.
@@ -1443,7 +1411,6 @@
\begin{verbatim}
package rexp;
-
message REXP {
enum RClass {
STRING = 0;
@@ -1454,14 +1421,16 @@
LIST = 5;
LOGICAL = 6;
NULLTYPE = 7;
+ LANGUAGE = 8;
+ ENVIRONMENT = 9;
+ FUNCTION = 10;
}
enum RBOOLEAN {
F=0;
T=1;
NA=2;
}
-
- required RClass rclass = 1 ;
+ required RClass rclass = 1;
repeated double realValue = 2 [packed=true];
repeated sint32 intValue = 3 [packed=true];
repeated RBOOLEAN booleanValue = 4;
@@ -1471,6 +1440,9 @@
repeated REXP rexpValue = 8;
repeated string attrName = 11;
repeated REXP attrValue = 12;
+ optional bytes languageValue = 13;
+ optional bytes environmentValue = 14;
+ optional bytes functionValue = 15;
}
message STRING {
optional string strval = 1;