[Rprotobuf-commits] r624 - papers/rjournal
noreply at r-forge.r-project.org
Sat Dec 28 02:43:57 CET 2013
Author: murray
Date: 2013-12-28 02:43:57 +0100 (Sat, 28 Dec 2013)
New Revision: 624
Added:
papers/rjournal/histogram-mapreduce-diag1.pdf
Modified:
papers/rjournal/eddelbuettel-francois-stokely.Rnw
Log:
Add a large new section showing how RProtoBuf can be used to serialize
all of the example datasets that ship with R, and compare the sizes
with R's built-in serialization method.
Steal a section on applications/MapReduce from the HistogramTools
vignette. To be improved further.
Combine each of the class slot and method tables into a single table
for each class.
Add a brief section on type coercion issues for e.g. bools and int64s.
Improve the wording in various places.
Modified: papers/rjournal/eddelbuettel-francois-stokely.Rnw
===================================================================
--- papers/rjournal/eddelbuettel-francois-stokely.Rnw 2013-12-28 00:54:18 UTC (rev 623)
+++ papers/rjournal/eddelbuettel-francois-stokely.Rnw 2013-12-28 01:43:57 UTC (rev 624)
@@ -11,6 +11,7 @@
\title{RProtoBuf: Efficient Cross-Language Data Serialization in R}
\author{by Dirk Eddelbuettel, Romain Fran\c{c}ois, and Murray Stokely}
+
\maketitle
\abstract{Modern data collection and analysis pipelines often involve
@@ -62,6 +63,10 @@
basics of protocol buffers and \CRANpkg{RProtoBuf}, we illustrate
several common use cases for protocol buffers in data analysis.
+XXX Related work on IDLs (greatly expanded )
+
+XXX Design tradeoffs: reflection vs proto compiler
+
\section{Protocol Buffers}
Once the data serialization needs get complex enough, application
@@ -474,13 +479,17 @@
\subsection{Messages}
The \texttt{Message} S4 class represents Protocol Buffer Messages and
-is the core abstraction of \CRANpkg{RProtoBuf}. The class contains
-the slots \texttt{pointer} and \texttt{type} as described on the
-Table~\ref{Message-class-table}.
+is the core abstraction of \CRANpkg{RProtoBuf}. Each \texttt{Message}
+contains a pointer to a \texttt{Descriptor} which defines the schema
+of the data defined in the Message, as well as a number of
+\texttt{FieldDescriptors} for the individual fields of the message. A
+complete list of the slots and methods for \texttt{Messages}
+is available in Table~\ref{Message-methods-table}.
\begin{table}[h]
\centering
-\begin{tabular}{|cp{10cm}|}
+\begin{small}
+\begin{tabular}{l|p{10cm}}
\hline
\textbf{Slot} & \textbf{Description} \\
\hline
@@ -489,26 +498,10 @@
\url{http://code.google.com/apis/protocolbuffers/docs/reference/cpp/google.protobuf.message.html#Message} \\
\hline
\texttt{type} & Fully qualified name of the message. For example a \texttt{Person} message
-has its \texttt{type} slot set to \texttt{tutorial.Person} \\
+has its \texttt{type} slot set to \texttt{tutorial.Person} \\[.3cm]
\hline
-\end{tabular}
-\caption{\label{Message-class-table}Description of slots for the \texttt{Message} S4 class}
-\end{table}
-
-Each \texttt{Message} contains a pointer to a \texttt{Descriptor}
-which defines the schema of the data defined in the Message, as well
-as a number of \texttt{FieldDescriptors} for the individual fields of
-the message. In addition to the field name extractors of
-\texttt{Messages} introduced in the previous section, a complete list
-of Message methods is available in Table~\ref{Message-methods-table}.
-
-\begin{table}[h]
-\centering
-\begin{small}
-\begin{tabular}{l|l}
\textbf{Method} & \textbf{Description} \\
\hline
-\hline
\texttt{has} & Indicates if a message has a given field. \\
\texttt{clone} & Creates a clone of the message \\
\texttt{isInitialized} & Indicates if a message has all its required fields set\\
@@ -534,36 +527,17 @@
\hline
\end{tabular}
\end{small}
-\caption{\label{Message-methods-table}Description of methods for the \texttt{Message} S4 class}
+\caption{\label{Message-methods-table}Description of slots and methods for the \texttt{Message} S4 class}
\end{table}
\subsection{Descriptors}
-Message descriptors are represented in R with the
-\emph{Descriptor} S4 class. The class contains
-the slots \texttt{pointer} and \texttt{type} :
+Message descriptors are represented in R with the \emph{Descriptor} S4
+class. The class contains the slots \texttt{pointer} and
+\texttt{type}. Similarly to messages, the \verb|$| operator can be
+used to retrieve descriptors that are contained in the descriptor, or
+invoke pseudo-methods.
-\begin{table}[h]
-\centering
-\begin{tabular}{|cp{10cm}|}
-\hline
-\textbf{Slot} & \textbf{Description} \\
-\hline
-\texttt{pointer} & External pointer to the \texttt{Descriptor} object of the C++ proto library. Documentation for the
-\texttt{Descriptor} class is available from the protocol buffer project page:
-\url{http://code.google.com/apis/protocolbuffers/docs/reference/cpp/google.protobuf.descriptor.html#Descriptor} \\
-\hline
-\texttt{type} & Fully qualified path of the message type. \\
-\hline
-\end{tabular}
-\caption{\label{Descriptor-class-table}Description of slots for the \texttt{Descriptor} S4 class}
-\end{table}
-
-Similarly to messages, the \verb|$| operator can be used to retrieve
-descriptors that are contained in the descriptor, or invoke
-pseudo-methods. Thise can be used to extract field descriptors, enum
-descriptors, or descriptors for a nested type.
-
<<>>=
# field descriptor
tutorial.Person$email
@@ -578,15 +552,23 @@
@
Table~\ref{Descriptor-methods-table} provides a complete list of the
-avalailable methods for Descriptors.
+slots and available methods for Descriptors.
\begin{table}[h]
\centering
\begin{small}
-\begin{tabular}{l|l}
+\begin{tabular}{l|p{10cm}}
+\hline
+\textbf{Slot} & \textbf{Description} \\
+\hline
+\texttt{pointer} & External pointer to the \texttt{Descriptor} object of the C++ proto library. Documentation for the
+\texttt{Descriptor} class is available from the protocol buffer project page:
+\url{http://code.google.com/apis/protocolbuffers/docs/reference/cpp/google.protobuf.descriptor.html#Descriptor} \\
+\hline
+\texttt{type} & Fully qualified path of the message type. \\[.3cm]
+\hline
\textbf{Method} & \textbf{Description} \\
\hline
-\hline
\texttt{new} & Creates a prototype of a message described by this descriptor.\\
\texttt{read} & Reads a message from a file or binary connection.\\
\texttt{readASCII} & Read a message in ASCII format from a file or
@@ -614,10 +596,10 @@
\hline
\end{tabular}
\end{small}
-\caption{\label{Descriptor-methods-table}Description of methods for the \texttt{Descriptor} S4 class}
+\caption{\label{Descriptor-methods-table}Description of slots and methods for the \texttt{Descriptor} S4 class}
\end{table}
-\subsection{field descriptors}
+\subsection{Field Descriptors}
\label{subsec-field-descriptor}
The class \emph{FieldDescriptor} represents field
@@ -628,7 +610,8 @@
\begin{table}[h]
\centering
-\begin{tabular}{|cp{10cm}|}
+\begin{small}
+\begin{tabular}{l|p{10cm}}
\hline
\textbf{Slot} & \textbf{Description} \\
\hline
@@ -638,21 +621,10 @@
\hline
\texttt{full\_name} & Fully qualified name of the field \\
\hline
-\texttt{type} & Name of the message type where the field is declared \\
+\texttt{type} & Name of the message type where the field is declared \\[.3cm]
\hline
-\end{tabular}
-\caption{\label{FieldDescriptor-class-table}Description of slots for the \texttt{FieldDescriptor} S4 class}
-\end{table}
-
-
-\begin{table}[h]
-\centering
-\begin{small}
-\begin{tabular}{l|l}
-\hline
\textbf{Method} & \textbf{Description} \\
\hline
-\hline
\texttt{as.character} & Character representation of a descriptor\\
\texttt{toString} & Character
representation of a descriptor (same as \texttt{as.character}) \\
@@ -675,14 +647,15 @@
\hline
\end{tabular}
\end{small}
-\caption{\label{fielddescriptor-methods-table}Description of methods for the \texttt{FieldDescriptor} S4 class}
+\caption{\label{fielddescriptor-methods-table}Description of slots and
+ methods for the \texttt{FieldDescriptor} S4 class}
\end{table}
% TODO(ms): Useful distinction to make -- FieldDescriptor does not do
% separate '$' dispatch like Messages, Descriptors, and
% EnumDescriptors do. Should it?
-\subsection{enum descriptors}
+\subsection{Enum Descriptors}
\label{subsec-enum-descriptor}
The class \emph{EnumDescriptor} is an R wrapper
@@ -701,7 +674,8 @@
\begin{table}[h]
\centering
-\begin{tabular}{|cp{10cm}|}
+\begin{small}
+\begin{tabular}{l|p{10cm}}
\hline
\textbf{Slot} & \textbf{Description} \\
\hline
@@ -711,20 +685,10 @@
\hline
\texttt{full\_name} & Fully qualified name of the enum \\
\hline
-\texttt{type} & Name of the message type where the enum is declared \\
+\texttt{type} & Name of the message type where the enum is declared \\[.3cm]
\hline
-\end{tabular}
-\caption{\label{EnumDescriptor-class-table}Description of slots for the \texttt{EnumDescriptor} S4 class}
-\end{table}
-
-\begin{table}[h]
-\centering
-\begin{small}
-\begin{tabular}{l|l}
-\hline
\textbf{Method} & \textbf{Description} \\
\hline
-\hline
\texttt{as.list} & return a named
integer vector with the values of the enum and their names.\\
\texttt{as.character} & character representation of a descriptor\\
@@ -741,10 +705,10 @@
\hline
\end{tabular}
\end{small}
-\caption{\label{enumdescriptor-methods-table}Description of methods for the \texttt{EnumDescriptor} S4 class}
+\caption{\label{enumdescriptor-methods-table}Description of slots and methods for the \texttt{EnumDescriptor} S4 class}
\end{table}
-\subsection{file descriptors}
+\subsection{File Descriptors}
\label{subsec-file-descriptor}
The class \emph{FileDescriptor} is an R wrapper
@@ -763,7 +727,8 @@
\begin{table}[h]
\centering
-\begin{tabular}{|cp{10cm}|}
+\begin{small}
+\begin{tabular}{l|p{10cm}}
\hline
\textbf{slot} & \textbf{description} \\
\hline
@@ -773,20 +738,10 @@
\hline
\texttt{filename} & fully qualified pathname of the \texttt{.proto} file.\\
\hline
-\texttt{package} & package name defined in this \texttt{.proto} file.\\
+\texttt{package} & package name defined in this \texttt{.proto} file.\\[.3cm]
\hline
-\end{tabular}
-\caption{\label{FileDescriptor-class-table}Description of slots for the \texttt{FileDescriptor} S4 class}
-\end{table}
-
-\begin{table}[h]
-\centering
-\begin{small}
-\begin{tabular}{l|l}
-\hline
\textbf{Method} & \textbf{Description} \\
\hline
-\hline
\texttt{name} & Return the filename for this FileDescriptorProto.\\
\texttt{package} & Return the file-level package name specified in this FileDescriptorProto.\\
\texttt{as.character} & character representation of a descriptor. \\
@@ -796,10 +751,10 @@
\hline
\end{tabular}
\end{small}
-\caption{\label{filedescriptor-methods-table}Description of methods for the \texttt{FileDescriptor} S4 class}
+\caption{\label{filedescriptor-methods-table}Description of slots and methods for the \texttt{FileDescriptor} S4 class}
\end{table}
-\subsection{enum value descriptors}
+\subsection{Enum Value Descriptors}
\label{subsec-enumvalue-descriptor}
The class \emph{EnumValueDescriptor} is an R wrapper
@@ -817,7 +772,8 @@
\begin{table}[h]
\centering
-\begin{tabular}{|cp{10cm}|}
+\begin{small}
+\begin{tabular}{l|p{10cm}}
\hline
\textbf{slot} & \textbf{description} \\
\hline
@@ -825,20 +781,10 @@
\hline
\texttt{name} & simple name of the enum value \\
\hline
-\texttt{full\_name} & fully qualified name of the enum value \\
+\texttt{full\_name} & fully qualified name of the enum value \\[.3cm]
\hline
-\end{tabular}
-\caption{\label{EnumValueDescriptor-class-table}Description of slots for the \texttt{EnumValueDescriptor} S4 class}
-\end{table}
-
-\begin{table}[h]
-\centering
-\begin{small}
-\begin{tabular}{l|l}
-\hline
\textbf{Method} & \textbf{Description} \\
\hline
-\hline
\texttt{number} & return the number of this EnumValueDescriptor. \\
\texttt{name} & Return the name of the enum value descriptor.\\
\texttt{enum\_type} & return the EnumDescriptor type of this EnumValueDescriptor. \\
@@ -848,15 +794,93 @@
\hline
\end{tabular}
\end{small}
-\caption{\label{enumvaluedescriptor-methods-table}Description of methods for the \texttt{EnumValueDescriptor} S4 class}
+\caption{\label{EnumValueDescriptor-methods-table}Description of slots
+ and methods for the \texttt{EnumValueDescriptor} S4 class}
\end{table}
\section{Type Coercion}
+One of the benefits of using an Interface Definition Language (IDL)
+like Protocol Buffers is that it provides a highly portable basic type
+system that different language and hardware implementations can map to
+the most appropriate type in different environments.
+Table~\ref{table-get-types} details the correspondence between the
+field type and the type of data that is retrieved by \verb|$| and \verb|[[|
+extractors.
+
+\begin{table}[h]
+\centering
+\begin{small}
+\begin{tabular}{|c|p{5cm}p{5cm}|}
+\hline
+field type & R type (non repeated) & R type (repeated) \\
+\hline
+\hline
+double & \texttt{double} vector & \texttt{double} vector \\
+float & \texttt{double} vector & \texttt{double} vector \\
+\hline
+int32 & \texttt{integer} vector & \texttt{integer} vector \\
+uint32 & \texttt{integer} vector & \texttt{integer} vector \\
+sint32 & \texttt{integer} vector & \texttt{integer} vector \\
+fixed32 & \texttt{integer} vector & \texttt{integer} vector \\
+sfixed32 & \texttt{integer} vector & \texttt{integer} vector \\
+\hline
+int64 & \texttt{integer} or \texttt{character}
+vector \footnotemark & \texttt{integer} or \texttt{character} vector \\
+uint64 & \texttt{integer} or \texttt{character} vector & \texttt{integer} or \texttt{character} vector \\
+sint64 & \texttt{integer} or \texttt{character} vector & \texttt{integer} or \texttt{character} vector \\
+fixed64 & \texttt{integer} or \texttt{character} vector & \texttt{integer} or \texttt{character} vector \\
+sfixed64 & \texttt{integer} or \texttt{character} vector & \texttt{integer} or \texttt{character} vector \\
+\hline
+bool & \texttt{logical} vector & \texttt{logical} vector \\
+\hline
+string & \texttt{character} vector & \texttt{character} vector \\
+bytes & \texttt{character} vector & \texttt{character} vector \\
+\hline
+enum & \texttt{integer} vector & \texttt{integer} vector \\
+\hline
+message & \texttt{S4} object of class \texttt{Message} & \texttt{list} of \texttt{S4} objects of class \texttt{Message} \\
+\hline
+\end{tabular}
+\end{small}
+\caption{\label{table-get-types}Correspondence between field type and
+ R type retrieved by the extractors. \footnotesize{1. R lacks native
+ 64-bit integers, so the \texttt{RProtoBuf.int64AsString} option is
+ available to return large integers as characters to avoid losing
+ precision. This option is described in Section~\ref{sec:int64}}.}
+\end{table}
+
\subsection{Booleans}
-Bools
-Int64s.
+R logical values can take three values: \texttt{TRUE}, \texttt{FALSE}, and
+\texttt{NA}. However, most other languages, including the protocol
+buffer schema, only accept \texttt{TRUE} or \texttt{FALSE}. This means
+that we simply cannot store R logical vectors that include all three
+possible values as booleans. The library will refuse to store
+\texttt{NA}s in protocol buffer boolean fields, and users must instead
+choose another type (such as integers) capable of storing three
+distinct values.
+
+<<echo=FALSE,print=FALSE>>=
+ if (!exists("protobuf_unittest.TestAllTypes",
+ "RProtoBuf:DescriptorPool")) {
+ unittest.proto.file <- system.file("unitTests", "data",
+ "unittest.proto",
+ package="RProtoBuf")
+ readProtoFiles(file=unittest.proto.file)
+ }
+@
+
+<<>>=
+a <- new(protobuf_unittest.TestAllTypes)
+a$optional_bool <- TRUE
+a$optional_bool <- FALSE
+<<eval=F>>=
+a$optional_bool <- NA
+<<echo=FALSE,eval=TRUE,print=TRUE>>=
+try(a$optional_bool <- NA,silent=TRUE)
+@
+
\subsection{64-bit integers}
\label{sec:int64}
@@ -869,7 +893,7 @@
@
Protocol Buffers are frequently used to pass data between different
-systems, however, and most other systems these days have support for
+systems, however, and most other modern systems do have support for
64-bit integers. To work around this, RProtoBuf allows users to get
and set 64-bit integer types by treating them as characters.
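
The two coercion issues described in the Booleans and 64-bit integers
subsections can be reproduced in base R alone. The following is a
minimal sketch and makes no RProtoBuf calls: the helper names
encode_logical and decode_logical, and the integer codes 0/1/2, are
assumptions chosen for illustration, not part of the package. It shows
one way to recode a three-valued logical vector as integers before
storing it, and why a large 64-bit value is safer kept as a character
string.

```r
# Base R only; nothing here calls RProtoBuf.

# 1. Three-valued logicals. Recode TRUE/FALSE/NA as integers so that NA
#    survives in an integer field. The codes (FALSE = 0, TRUE = 1, NA = 2)
#    are an arbitrary, hypothetical convention.
encode_logical <- function(x) {
  stopifnot(is.logical(x))
  ifelse(is.na(x), 2L, as.integer(x))
}
decode_logical <- function(code) {
  c(FALSE, TRUE, NA)[code + 1L]
}

flags <- c(TRUE, NA, FALSE)
codes <- encode_logical(flags)
round_trip_ok <- identical(decode_logical(codes), flags)

# 2. 64-bit integers. R's integer type is 32-bit and its doubles carry
#    53 bits of precision, so 2^53 + 1 cannot be held exactly.
big <- "9007199254740993"                          # 2^53 + 1 as text
lost_in_double <- (2^53 + 1 == 2^53)               # the +1 is rounded away
as_int <- suppressWarnings(as.integer(big))        # out of 32-bit range
kept_as_text <- identical(big, "9007199254740993") # the string is exact
```

Keeping the value as a character string is the same idea as the
RProtoBuf.int64AsString option mentioned in the table caption of this
revision; the recoding in the first part has to be agreed on by every
reader of the message, since the schema itself only sees an integer.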
@@ -919,11 +943,161 @@
@
-\section{Related work on IDLs (greatly expanded from what you have)}
+\section{Evaluation: data.frame to Protocol Buffer Serialization}
-\section{Design tradeoffs: reflection vs proto compiler (not addressed
- at all in current vignettes)}
+Saptarshi Guha wrote the RHIPE package \citep{rhipe} which includes
+protocol buffer integration with R. However, this implementation
+takes a different approach: any R object is serialized into a message
+based on a single catch-all \texttt{proto} schema. Jeroen Ooms took a
+similar approach influenced by Saptarshi in his \pkg{RProtoBufUtils}
+package. Unlike Saptarshi's package, however, RProtoBufUtils depends
+on RProtoBuf for underlying message operations. This package is
+available at \url{https://github.com/jeroenooms/RProtoBufUtils}.
+The \pkg{RProtoBufUtils} package by Jeroen Ooms provides a
+\texttt{serialize\_pb} method to convert R objects into serialized
+protocol buffers in this format, and the \texttt{can\_serialize\_pb}
+method can be used to determine whether the given R object can safely
+be expressed in this way. To show how this method works, we
+attempt to convert all of the built-in datasets from R into this
+serialized protocol buffer representation.
+
+<<echo=TRUE>>=
+library(RProtoBufUtils)
+
+datasets <- subset(as.data.frame(data()$results), Package=="datasets")
+datasets$load.name <- sub("\\s+.*$", "", datasets$Item)
+n <- nrow(datasets)
+@
+
+There are \Sexpr{n} standard data sets included in R. We use the
+\texttt{can\_serialize\_pb} method to determine how many of those can
+be safely converted to a serialized protocol buffer representation.
+
+<<echo=TRUE>>=
+datasets$valid.proto <- sapply(datasets$load.name, function(x) can_serialize_pb(eval(as.name(x))))
+datasets <- subset(datasets, valid.proto==TRUE)
+m <- nrow(datasets)
+@
+
+\Sexpr{m} data sets could be converted to Protocol Buffers
+(\Sexpr{format(100*m/n,digits=1)}\%). The next section illustrates how
+many bytes were used to store the data sets under four different
+situations: (1) normal R serialization, (2) R serialization followed by
+gzip, (3) normal protocol buffer serialization, and (4) protocol buffer
+serialization followed by gzip.
+
+\subsection{Compression Performance}
+\label{sec:compression}
+
+<<echo=FALSE,print=FALSE>>=
+datasets$object.size <- unname(sapply(datasets$load.name, function(x) object.size(eval(as.name(x)))))
+
+datasets$R.serialize.size <- unname(sapply(datasets$load.name, function(x) length(serialize(eval(as.name(x)), NULL))))
+
+datasets$R.serialize.size <- unname(sapply(datasets$load.name, function(x) length(serialize(eval(as.name(x)), NULL))))
+
+datasets$R.serialize.size.gz <- unname(sapply(datasets$load.name, function(x) length(memCompress(serialize(eval(as.name(x)), NULL), "gzip"))))
+
+datasets$RProtoBuf.serialize.size <- unname(sapply(datasets$load.name, function(x) length(serialize_pb(eval(as.name(x)), NULL))))
+
+datasets$RProtoBuf.serialize.size.gz <- unname(sapply(datasets$load.name, function(x) length(memCompress(serialize_pb(eval(as.name(x)), NULL), "gzip"))))
+
+clean.df <- data.frame(dataset=datasets$load.name,
+ object.size=datasets$object.size,
+ "serialized"=datasets$R.serialize.size,
+ "gzipped serialized"=datasets$R.serialize.size.gz,
+ "RProtoBuf"=datasets$RProtoBuf.serialize.size,
+ "gzipped RProtoBuf"=datasets$RProtoBuf.serialize.size.gz,
+ check.names=FALSE)
+@
+
+Table~\ref{tab:compression} shows the sizes in bytes of 50 sample R
+datasets as returned by \texttt{object.size()}, compared to their
+serialized sizes. The summary compression sizes are listed below, and
+the full table for the 50 sampled datasets is included on the next page.
+The sizes are comparable, but protocol buffers provide simple getters and
+setters in multiple languages instead of requiring other programs to
+parse the R serialization format \citep{serialization}. One takeaway
+from this table is that RProtoBuf does not in general provide any
+significant space savings over R's normal serialization mechanism. The
+benefit of RProtoBuf comes from its interoperability with other
+environments.
+
+TODO comparison of protobuf serialization sizes/times for various vectors. Compared to R's native serialization. Discussion of the RHIPE approach of serializing any/all R objects, vs more specific protocol buffers for specific R objects.
+
+% N.B. see table.Rnw for how this table is created.
+%
+% latex table generated in R 3.0.2 by xtable 1.7-0 package
+% Fri Dec 27 17:00:03 2013
+\begin{table}[ht]
+\begin{center}
+\scalebox{0.9}{
+\begin{tabular}{l|r|r|r|r|r}
+ \hline
+Data Set & object.size & \multicolumn{2}{c|}{R Serialization} &
+\multicolumn{2}{c}{RProtoBuf Serialization} \\
+ & & Default & gzipped & Default & gzipped \\
+ \hline
+uspop & 584.00 & 268 & 172 & 211 & 148 \\
+ Titanic & 1960.00 & 633 & 257 & 481 & 249 \\
+ volcano & 42656.00 & 42517 & 5226 & 42476 & 4232 \\
+ euro.cross & 2728.00 & 1319 & 910 & 1207 & 891 \\
+ attenu & 14568.00 & 8234 & 2165 & 7771 & 2336 \\
+ ToothGrowth & 2568.00 & 1486 & 349 & 1239 & 391 \\
+ lynx & 1344.00 & 1028 & 429 & 971 & 404 \\
+ nottem & 2352.00 & 2036 & 627 & 1979 & 641 \\
+ sleep & 2752.00 & 746 & 282 & 483 & 260 \\
+ co2 & 4176.00 & 3860 & 1473 & 3803 & 1453 \\
+ austres & 1144.00 & 828 & 439 & 771 & 410 \\
+ ability.cov & 1944.00 & 716 & 357 & 589 & 341 \\
+ EuStockMarkets & 60664.00 & 59785 & 21232 & 59674 & 19882 \\
+ treering & 64272.00 & 63956 & 17647 & 63900 & 17758 \\
+ freeny.x & 1944.00 & 1445 & 1311 & 1372 & 1289 \\
+ Puromycin & 2088.00 & 813 & 306 & 620 & 320 \\
+ warpbreaks & 2768.00 & 1231 & 310 & 811 & 343 \\
+ BOD & 1088.00 & 334 & 182 & 226 & 168 \\
+ sunspots & 22992.00 & 22676 & 6482 & 22620 & 6742 \\
+ beaver2 & 4184.00 & 3423 & 751 & 3468 & 840 \\
+ anscombe & 2424.00 & 991 & 375 & 884 & 352 \\
+ esoph & 5624.00 & 3111 & 548 & 2240 & 665 \\
+ PlantGrowth & 1680.00 & 646 & 303 & 459 & 314 \\
+ infert & 15848.00 & 14328 & 1172 & 13197 & 1404 \\
+ BJsales & 1632.00 & 1316 & 496 & 1259 & 465 \\
+ stackloss & 1688.00 & 917 & 293 & 844 & 283 \\
+ crimtab & 7936.00 & 4641 & 713 & 1655 & 576 \\
+ LifeCycleSavings & 6048.00 & 3014 & 1420 & 2825 & 1407 \\
+ Harman74.cor & 9144.00 & 6056 & 2045 & 5861 & 2070 \\
+ nhtemp & 912.00 & 596 & 240 & 539 & 223 \\
+ faithful & 5136.00 & 4543 & 1339 & 4936 & 1776 \\
+ freeny & 5296.00 & 2465 & 1518 & 2271 & 1507 \\
+ discoveries & 1232.00 & 916 & 199 & 859 & 180 \\
+ state.x77 & 7168.00 & 4251 & 1754 & 4068 & 1756 \\
+ pressure & 1096.00 & 498 & 277 & 427 & 273 \\
+ fdeaths & 1008.00 & 692 & 291 & 635 & 272 \\
+ euro & 976.00 & 264 & 186 & 202 & 161 \\
+ LakeHuron & 1216.00 & 900 & 420 & 843 & 404 \\
+ mtcars & 6736.00 & 3798 & 1204 & 3633 & 1206 \\
+ precip & 4992.00 & 1793 & 813 & 1615 & 815 \\
+ state.area & 440.00 & 422 & 246 & 405 & 235 \\
+ attitude & 3024.00 & 1990 & 544 & 1920 & 561 \\
+ randu & 10496.00 & 9794 & 8859 & 10441 & 9558 \\
+ state.name & 3088.00 & 844 & 408 & 724 & 415 \\
+ airquality & 5496.00 & 4551 & 1241 & 2874 & 1294 \\
+ airmiles & 624.00 & 308 & 170 & 251 & 148 \\
+ quakes & 33112.00 & 32246 & 9898 & 29063 & 11595 \\
+ islands & 3496.00 & 1232 & 563 & 1098 & 561 \\
+ OrchardSprays & 3600.00 & 2164 & 445 & 1897 & 483 \\
+ WWWusage & 1232.00 & 916 & 274 & 859 & 251 \\
+ \hline
+\end{tabular}
+}
+\caption{Serialization sizes with R's built-in serialization and
+ RProtoBuf for 50 sample R datasets.}
+\label{tab:compression}
+\end{center}
+\end{table}
+
\subsection{Performance considerations}
TODO RProtoBuf is quite flexible and easy to use for interactive
@@ -936,11 +1110,7 @@
about this to clarify the goals and strengths of RProtoBuf and its
reflection and object mapping.
-\subsection{Serialization comparison}
-TODO comparison of protobuf serialization sizes/times for various vectors. Compared to R's native serialization. Discussion of the RHIPE approach of serializing any/all R objects, vs more specific protocol buffers for specific R objects.
-
-
\section{Descriptor lookup}
\label{sec-lookup}
@@ -958,34 +1128,58 @@
implemented by the \texttt{RProtoBuf} package by calling an internal
method of the \texttt{protobuf} C++ library.
-\section{Other approaches}
+%\section{Other approaches}
-Saptarshi Guha wrote another package that deals with integration
-of protocol buffer messages with R, taking a different angle :
-serializing any R object as a message, based on a single catch-all
-\texttt{proto} file. Saptarshi's package is available at
-\url{http://ml.stat.purdue.edu/rhipe/doc/html/ProtoBuffers.html}.
-
-Jeroen Ooms took a similar approach influenced by Saptarshi in his
-\pkg{RProtoBufUtils} package. Unlike Saptarshi's package,
-RProtoBufUtils depends on RProtoBuf for underlying message operations.
-This package is available at
-\url{https://github.com/jeroenooms/RProtoBufUtils}.
-
% Phillip Yelland wrote another implementation, currently proprietary,
% that has significant speed advantages when querying fields from a
% large number of protocol buffers, but is less user friendly for the
% basic cases documented here.
-\section{Basic usage example - tutorial.Person}
+%\section{Basic usage example - tutorial.Person}
-\section{Application: distributed Data Collection with MapReduce}
+\section{Application: Distributed Data Collection with MapReduce}
-We could describe a common MapReduce pattern of having the MR written
-in another language output protocol buffers that are later pulled into
-R. There is some text about this in section 2 of
-http://cran.r-project.org/web/packages/HistogramTools/vignettes/HistogramTools.pdf
+TODO(mstokely): Make this better.
+Many large data sets in fields such as particle physics and
+information processing are stored in binned or histogram form in order
+to reduce the data storage requirements
+\citep{scott2009multivariate}. Protocol Buffers make a particularly
+good data transport format in distributed MapReduce environments
+where large numbers of computers process a large data set for analysis.
+
+There are two common patterns for generating histograms of large data
+sets with MapReduce. In the first method, each mapper task can
+generate a histogram over a subset of the data that it has been
+assigned, and then the histograms of each mapper are sent to one or
+more reducer tasks to merge.
+
+In the second method, each mapper rounds a data point to a bucket
+width and outputs that bucket as a key and '1' as a value. Reducers
+then sum up all of the values with the same key and output to a data store.
+
+In both methods, the mapper tasks must choose identical
+bucket boundaries even though they are analyzing disjoint parts of the
+input set that may cover different ranges, or we must implement
+multiple phases.
+
+\begin{figure}[h!]
+\begin{center}
+\includegraphics[width=\textwidth]{histogram-mapreduce-diag1.pdf}
+\end{center}
+\caption{Diagram of MapReduce Histogram Generation Pattern}
+\label{fig:mr-histogram-pattern1}
+\end{figure}
+
+Figure~\ref{fig:mr-histogram-pattern1} illustrates the second method
+described above for histogram generation of large data sets with
+MapReduce.
+
+This package is designed to be helpful if some of the Map or Reduce
+tasks are written in R, or if those components are written in other
+languages and only the resulting output histograms need to be
+manipulated in R.
+
\section{Application: Sending/receiving Interaction With Servers}
Unlike Apache Thrift, Protocol Buffers do not include a concrete RPC
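
The two histogram patterns described in the MapReduce section of this
revision can be imitated on a single machine in base R. This is a
minimal sketch under stated assumptions, not a MapReduce
implementation: the shards stand in for mapper inputs, the data are
simulated with runif, and the names map_hist and map_pairs are
hypothetical. Both methods rely on the bucket boundaries being fixed
before the map phase, which is the constraint the text points out.

```r
# Simulated input, split into three disjoint shards (one per "mapper").
set.seed(42)
shards <- split(runif(3000, min = 0, max = 10), rep(1:3, each = 1000))

# Shared bucket boundaries, agreed on before the map phase.
breaks <- seq(0, 10, by = 1)

# Method 1: each mapper builds a histogram over its shard; the reducer
# merges them by adding the per-bucket counts.
map_hist <- function(x) {
  hist(x, breaks = breaks, right = FALSE, plot = FALSE)$counts
}
merged <- Reduce(`+`, lapply(shards, map_hist))

# Method 2: each mapper rounds every point down to its bucket and emits
# a (bucket, 1) pair; the reducer sums the values that share a key.
map_pairs <- function(x) {
  data.frame(key = floor(x), value = 1L)
}
pairs  <- do.call(rbind, lapply(shards, map_pairs))
summed <- tapply(pairs$value, factor(pairs$key, levels = 0:9), sum)

# Both methods produce the same ten bucket counts.
same_counts <- all(as.vector(summed) == merged)
```

Method 1 sends one small count vector per mapper to the reducer, while
method 2 sends one pair per data point, so the first pattern moves far
less data when the shards are large.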
Added: papers/rjournal/histogram-mapreduce-diag1.pdf
===================================================================
(Binary files differ)
Property changes on: papers/rjournal/histogram-mapreduce-diag1.pdf
___________________________________________________________________
Added: svn:mime-type
+ application/octet-stream