From noreply at r-forge.r-project.org Sat Nov 15 03:00:53 2014
From: noreply at r-forge.r-project.org (noreply at r-forge.r-project.org)
Date: Sat, 15 Nov 2014 03:00:53 +0100 (CET)
Subject: [Rprotobuf-commits] r898 - papers/jss
Message-ID: <20141115020053.8F30E18781A@r-forge.r-project.org>

Author: murray
Date: 2014-11-15 03:00:52 +0100 (Sat, 15 Nov 2014)
New Revision: 898

Modified:
   papers/jss/article.Rnw
Log:
Make section 3 a little more concise and address referee feedback by not having so many small subsections, only using one style and sticking to it in this section, and not talking about ill defined 'pseudo-methods'. Sorry I didn't build this yet about to go home wanted to easily capture some edits on work computer. hopefully it doesn't break the build. will be easy to fix if i did.

Modified: papers/jss/article.Rnw
===================================================================
--- papers/jss/article.Rnw 2014-09-25 05:36:52 UTC (rev 897)
+++ papers/jss/article.Rnw 2014-11-15 02:00:52 UTC (rev 898)
@@ -356,8 +356,12 @@
 Message Descriptors are defined in \code{.proto} files and define a schema for a particular named class of messages.
-\subsection[Importing message descriptors from .proto files]{Importing message descriptors from \code{.proto} files}
+% Note: We comment out subsections in favor of textbf blocks to save
+% space and shrink down this section a little bit.
+%\subsection[Importing message descriptors from .proto files]{Importing message descriptors from \code{.proto} files}
+\textbf{Importing message descriptors from \code{.proto} files}
+
 To create or parse a Protocol Buffer Message, one must first read in the message type specification from a \code{.proto} file. A small number of message types are imported when the package is first
@@ -387,11 +391,15 @@
 Fortunately, proper use of namespaces and package imports reduces the impact of this for code in packages.}
-<<>>=
+% Commented out for now because its too detailed. Lets shorten
+% section 3 per referee feedback.
+
+<>=
 ls("RProtoBuf:DescriptorPool")
 @
-\subsection{Creating a message}
+% \subsection{Creating a message}
+\textbf{Creating, accessing, and modifying a message.}
 New messages are created with the \code{new} function which accepts a Message Descriptor and optionally a list of ``name = value'' pairs
@@ -403,11 +411,10 @@
 %function to create messages.
 <<>>=
-p1 <- new(tutorial.Person)
 p <- new(tutorial.Person, name = "Murray", id = 1)
 @
-\subsection{Access and modify fields of a message}
+%\subsection{Access and modify fields of a message}
 Once the message is created, its fields can be queried and modified using the dollar operator of \proglang{R}, making Protocol
@@ -434,9 +441,9 @@
 64-bit integer support. A workaround is available and described in Section~\ref{sec:int64} for working with large integer values.
+%\subsection{Display messages}
+\textbf{Printing, Reading, and Writing Messages}
-\subsection{Display messages}
-
 Protocol Buffer messages and descriptors implement \code{show} methods that provide basic information about the message:
@@ -452,7 +459,7 @@
 writeLines(as.character(p))
 @
-\subsection{Serializing messages}
+% \subsection{Serializing messages}
 One of the primary benefits of Protocol Buffers is the efficient binary wire-format representation.
@@ -484,21 +491,24 @@
 readBin(tf2, raw(0), 500)
 @
-\code{serialize} can also be called in a more traditional
-object oriented fashion using the dollar operator.
+% TODO(mstokely): commentd out per referee feedback, but see if this is
+% covered in the package documentation well.
+%
+%\code{serialize} can also be called in a more traditional
+%object oriented fashion using the dollar operator.
+%
+%<<>>=
+%p$serialize(tf1)
+%con <- file(tf2, open = "wb")
+%p$serialize(con)
+%close(con)
+%@
+%
+%Here, we first serialize to a file \code{tf1} before we serialize to a binary
+%connection to file \code{tf2}.
-<<>>=
-p$serialize(tf1)
-con <- file(tf2, open = "wb")
-p$serialize(con)
-close(con)
-@
+%\subsection{Parsing messages}
-Here, we first serialize to a file \code{tf1} before we serialize to a binary
-connection to file \code{tf2}.
-
-\subsection{Parsing messages}
-
 The \pkg{RProtoBuf} package defines the \code{read} and \code{readASCII} functions to read messages from files, raw vectors, or arbitrary connections. \code{read} expects to read the message
@@ -533,22 +543,25 @@
 message <- read(tutorial.Person, payload)
 @
+% TODO(mstokely): comment out and use only one style, not both per
+% referee feedback. Also avoid using the term 'pseudo-method' which
+% is unclear.
+%
+%\code{read} can also be used as a pseudo-method of the descriptor
+%object:
+%
+%<<>>=
+%message <- tutorial.Person$read(tf1)
+%con <- file(tf2, open = "rb")
+%message <- tutorial.Person$read(con)
+%close(con)
+%message <- tutorial.Person$read(payload)
+%@
+%
+%Here we read first from a file, then from a binary connection and lastly from
+%a message payload.
-\code{read} can also be used as a pseudo-method of the descriptor
-object:
-<<>>=
-message <- tutorial.Person$read(tf1)
-con <- file(tf2, open = "rb")
-message <- tutorial.Person$read(con)
-close(con)
-message <- tutorial.Person$read(payload)
-@
-
-Here we read first from a file, then from a binary connection and lastly from
-a message payload.
-
-
 \section{Under the hood: S4 classes, methods, and pseudo methods}
 \label{sec:rprotobuf-classes}

From noreply at r-forge.r-project.org Sat Nov 15 03:35:45 2014
From: noreply at r-forge.r-project.org (noreply at r-forge.r-project.org)
Date: Sat, 15 Nov 2014 03:35:45 +0100 (CET)
Subject: [Rprotobuf-commits] r899 - papers/jss
Message-ID: <20141115023546.1F4CC18783D@r-forge.r-project.org>

Author: murray
Date: 2014-11-15 03:35:44 +0100 (Sat, 15 Nov 2014)
New Revision: 899

Modified:
   papers/jss/article.Rnw
Log:
Comment out additional redundant information to make section 3 a more concise introduction.

Modified: papers/jss/article.Rnw
===================================================================
--- papers/jss/article.Rnw 2014-11-15 02:00:52 UTC (rev 898)
+++ papers/jss/article.Rnw 2014-11-15 02:35:44 UTC (rev 899)
@@ -394,13 +394,16 @@
 % Commented out for now because its too detailed. Lets shorten
 % section 3 per referee feedback.
-<>=
-ls("RProtoBuf:DescriptorPool")
-@
+%<>=
+%ls("RProtoBuf:DescriptorPool")
+%@
 % \subsection{Creating a message}
-\textbf{Creating, accessing, and modifying a message.}
+\\
+
+\textbf{Creating, accessing, and modifying messages.}
+
 New messages are created with the \code{new} function which accepts a Message Descriptor and optionally a list of ``name = value'' pairs to set in the message.
@@ -442,6 +445,9 @@
 Section~\ref{sec:int64} for working with large integer values.
 %\subsection{Display messages}
+
+\\
+
 \textbf{Printing, Reading, and Writing Messages}
 Protocol Buffer messages and descriptors implement \code{show}
@@ -451,8 +457,8 @@
 p
 @
-For additional information, such as for debugging purposes,
-the \code{as.character} method provides a more complete ASCII
+%For additional information, such as for debugging purposes,
+The \code{as.character} method provides a more complete ASCII
 representation of the contents of a message.
 <<>>=
@@ -473,7 +479,7 @@
 serialize(p, NULL)
 @
-The same method can be used to serialize messages to files:
+The same method can be used to serialize messages to files or arbitrary binary connections:
 <<>>=
 tf1 <- tempfile()
@@ -481,15 +487,18 @@
 readBin(tf1, raw(0), 500)
 @
-Or to arbitrary binary connections:
+% TODO(mstokely): Comment out, combined with last statement. make this
+% shorter, more succinct summary of the key features of RProtoBuf.
-<<>>=
-tf2 <- tempfile()
-con <- file(tf2, open = "wb")
-serialize(p, con)
-close(con)
-readBin(tf2, raw(0), 500)
-@
+%Or to arbitrary binary connections:
+%
+%<<>>=
+%tf2 <- tempfile()
+%con <- file(tf2, open = "wb")
+%serialize(p, con)
+%close(con)
+%readBin(tf2, raw(0), 500)
+%@
 % TODO(mstokely): commentd out per referee feedback, but see if this is
 % covered in the package documentation well.
@@ -527,22 +536,22 @@
 @
 The \code{input} argument of \code{read} can also be a binary
-readable \proglang{R} connection, such as a binary file connection:
+readable \proglang{R} connection, such as a binary file connection, or a raw vector of serialized bytes.
-<<>>=
-con <- file(tf2, open = "rb")
-message <- read(tutorial.Person, con)
-close(con)
-writeLines(as.character(message))
-@
+% <<>>=
+% con <- file(tf2, open = "rb")
+% message <- read(tutorial.Person, con)
+% close(con)
+% writeLines(as.character(message))
+% @
-Finally, the raw vector payload of the message can be used:
+% Finally, the raw vector payload of the message can be used:
+%
+%<<>>=
+%payload <- readBin(tf1, raw(0), 5000)
+%message <- read(tutorial.Person, payload)
+%@
-<<>>=
-payload <- readBin(tf1, raw(0), 5000)
-message <- read(tutorial.Person, payload)
-@
-
 % TODO(mstokely): comment out and use only one style, not both per
 % referee feedback. Also avoid using the term 'pseudo-method' which
 % is unclear.
@@ -561,7 +570,6 @@
 %Here we read first from a file, then from a binary connection and lastly from
 %a message payload.
-
 \section{Under the hood: S4 classes, methods, and pseudo methods}
 \label{sec:rprotobuf-classes}

From noreply at r-forge.r-project.org Wed Nov 19 02:08:02 2014
From: noreply at r-forge.r-project.org (noreply at r-forge.r-project.org)
Date: Wed, 19 Nov 2014 02:08:02 +0100 (CET)
Subject: [Rprotobuf-commits] r900 - papers/jss
Message-ID: <20141119010802.6314E187860@r-forge.r-project.org>

Author: murray
Date: 2014-11-19 02:08:02 +0100 (Wed, 19 Nov 2014)
New Revision: 900

Modified:
   papers/jss/article.Rnw
Log:
Address referee #1 feedback about section 4. Answer many of the specific questions from the referee: Why use S4 and not RC (answer: written before RC availalbe, acknowledge that RC would be better). objects made mutable by usual functional copy on modify semantics in mutation methods, avoid pseudo-method. Fix typographical issue with subheadings. Make it more concise with \subsection* and combine a few.

Modified: papers/jss/article.Rnw
===================================================================
--- papers/jss/article.Rnw 2014-11-15 02:35:44 UTC (rev 899)
+++ papers/jss/article.Rnw 2014-11-19 01:08:02 UTC (rev 900)
@@ -360,7 +360,7 @@
 % space and shrink down this section a little bit.
 %\subsection[Importing message descriptors from .proto files]{Importing message descriptors from \code{.proto} files}
-\textbf{Importing message descriptors from \code{.proto} files}
+\subsection*{Importing message descriptors from \code{.proto} files}
 To create or parse a Protocol Buffer Message, one must first read in the message type specification from a \code{.proto} file.
@@ -400,9 +400,9 @@
 % \subsection{Creating a message}
-\\
+% \\
-\textbf{Creating, accessing, and modifying messages.}
+\subsection*{Creating, accessing, and modifying messages.}
 New messages are created with the \code{new} function which accepts a Message Descriptor and optionally a list of ``name = value'' pairs
@@ -417,7 +417,7 @@
 p <- new(tutorial.Person, name = "Murray", id = 1)
 @
-%\subsection{Access and modify fields of a message}
+% \subsection*{Access and modify fields of a message}
 Once the message is created, its fields can be queried and modified using the dollar operator of \proglang{R}, making Protocol
@@ -444,11 +444,11 @@
 64-bit integer support. A workaround is available and described in Section~\ref{sec:int64} for working with large integer values.
-%\subsection{Display messages}
+\subsection*{Printing, reading, and writing Messages}
-\\
+%\\
-\textbf{Printing, Reading, and Writing Messages}
+% \textbf{Printing, Reading, and Writing Messages}
 Protocol Buffer messages and descriptors implement \code{show} methods that provide basic information about the message:
@@ -556,7 +556,7 @@
 % referee feedback. Also avoid using the term 'pseudo-method' which
 % is unclear.
 %
-%\code{read} can also be used as a pseudo-method of the descriptor
+%\code{read} can also be used as a method of the descriptor
 %object:
 %
 %<<>>=
@@ -570,40 +570,25 @@
 %Here we read first from a file, then from a binary connection and lastly from
 %a message payload.
-\section{Under the hood: S4 classes, methods, and pseudo methods}
+\section{Under the hood: S4 classes and methods}
 \label{sec:rprotobuf-classes}
 The \pkg{RProtoBuf} package uses the S4 system to store
-information about descriptors and messages. Using the S4 system
-allows the package to dispatch methods that are not
-generic in the S3 sense, such as \code{new} and
-\code{serialize}.
-Table~\ref{class-summary-table} lists the six
-primary Message and Descriptor classes in \pkg{RProtoBuf}. Each \proglang{R} object
+information about descriptors and messages.
+Each \proglang{R} object
 contains an external pointer to an object managed by the \code{protobuf} \proglang{C++} library, and the \proglang{R} objects make calls into more than 100 \proglang{C++} functions that provide the glue code between the \proglang{R} language classes and the underlying \proglang{C++} classes.
+S4 objects are immutable, and so the methods that modify field values of a message return a new copy of the object with R's usual functional copy on modify semantics\footnote{RProtoBuf was designed and implemented before Reference Classes were introduced to offer a new class system with mutable objects. If RProtoBuf were
+implemented today Reference Classes would almost certainly be a better
+design choice than S4 classes.}.
+Using the S4 system
+allows the package to dispatch methods that are not
+generic in the S3 sense, such as \code{new} and
+\code{serialize}.
-\begin{table}[bp]
-\centering
-\begin{tabular}{lccl}
-\toprule
-Class & Slots & Methods & Dynamic dispatch\\
-\cmidrule{2-4}
-Message & 2 & 20 & yes (field names)\\
-Descriptor & 2 & 16 & yes (field names, enum types, nested types)\\
-FieldDescriptor & 4 & 18 & no\\
-EnumDescriptor & 4 & 11 & yes (enum constant names)\\
-EnumValueDescriptor & 3 & \phantom{1}6 & no\\
-FileDescriptor & 3 & \phantom{1}6 & yes (message/field definitions)\\
-\bottomrule
-\end{tabular}
-\caption{\label{class-summary-table}Overview of class, slot, method and
- dispatch relationships.}
-\end{table}
-
 The \pkg{Rcpp} package \citep{eddelbuettel2011rcpp,eddelbuettel2013seamless} is used to facilitate this integration of the \proglang{R} and \proglang{C++} code for these objects.
@@ -615,9 +600,11 @@
 which provide a more concise way of wrapping \proglang{C++} functions and classes in a single entity.
+Since \pkg{RProtoBuf} users are most often switching between two or
+more different languages as part of a larger data analysis pipeline,
+both generic function and message passing OO style calling conventions
+are supported:
-The \pkg{RProtoBuf} package supports two forms for calling
-functions with these S4 classes:
 \begin{itemize}
 \item The functional dispatch mechanism of the the form \verb|method(object, arguments)| (common to \proglang{R}), and
@@ -626,12 +613,38 @@
 \end{itemize}
 Additionally, \pkg{RProtoBuf} supports tab completion for all
-classes. Completion possibilities include pseudo-method names for all
+classes. Completion possibilities include method names for all
 classes, plus \emph{dynamic dispatch} on names or types specific to a given object. This functionality is implemented with the \code{.DollarNames} S3 generic function defined in the \pkg{utils} package that is included with \proglang{R} \citep{r}.
+
+Table~\ref{class-summary-table} lists the six primary Message and
+Descriptor classes in \pkg{RProtoBuf}.
+% Please see the package
+%documentation for a complete description of the slots and methods for
+%each class.
+
+
+\begin{table}[bp]
+\centering
+\begin{tabular}{lccl}
+\toprule
+Class & Slots & Methods & Dynamic dispatch\\
+\cmidrule{2-4}
+Message & 2 & 20 & yes (field names)\\
+Descriptor & 2 & 16 & yes (field names, enum types, nested types)\\
+FieldDescriptor & 4 & 18 & no\\
+EnumDescriptor & 4 & 11 & yes (enum constant names)\\
+EnumValueDescriptor & 3 & \phantom{1}6 & no\\
+FileDescriptor & 3 & \phantom{1}6 & yes (message/field definitions)\\
+\bottomrule
+\end{tabular}
+\caption{\label{class-summary-table}Overview of class, slot, method and
+ dispatch relationships.}
+\end{table}
+
 \subsection{Messages}
 The \code{Message} S4 class represents Protocol Buffer Messages and
@@ -697,7 +710,7 @@
 class. The class contains the slots \code{pointer} and \code{type}. Similarly to messages, the \verb|$| operator can be used to retrieve descriptors that are contained in the descriptor, or
-invoke pseudo-methods.
+invoke methods.
 When \pkg{RProtoBuf} is first loaded it calls \code{readProtoFiles} to read in the example \code{addressbook.proto} file
@@ -824,7 +837,7 @@
 The \verb|$| operator can be used to retrieve the value of enum
 constants contained in the EnumDescriptor, or to invoke
-pseudo-methods.
+methods.
 The \code{EnumDescriptor} contains information about what values this type
 defines, while the \code{EnumValueDescriptor} describes an
@@ -879,7 +892,7 @@
 Table~\ref{EnumValueDescriptor-methods-table} describes the methods
 defined for the \code{EnumValueDescriptor} class.
-The \verb|$| operator can be used to invoke pseudo-methods.
+The \verb|$| operator can be used to invoke methods.
 <<>>=
 tutorial.Person$PhoneType$value(1)
@@ -951,7 +964,7 @@
 defined for the \code{FileDescriptor} class.
 The \verb|$| operator can be used to retrieve named fields defined in
-the FileDescriptor, or to invoke pseudo-methods.
+the FileDescriptor, or to invoke methods.
 <<>>=
 f <- tutorial.Person$fileDescriptor()

From noreply at r-forge.r-project.org Wed Nov 19 02:20:51 2014
From: noreply at r-forge.r-project.org (noreply at r-forge.r-project.org)
Date: Wed, 19 Nov 2014 02:20:51 +0100 (CET)
Subject: [Rprotobuf-commits] r901 - papers/jss
Message-ID: <20141119012051.1A1DD187854@r-forge.r-project.org>

Author: murray
Date: 2014-11-19 02:20:50 +0100 (Wed, 19 Nov 2014)
New Revision: 901

Modified:
   papers/jss/article.Rnw
Log:
Remove tables 3-8 with full slot and method descriptions for each class in RProtoBuf. This belongs in the documentation, not the paper. Refer the reader to the full docs once at the beginning of this section and remove all the references to tables in each subsection. Further minor edits are needed here. This was suggested by referee #1.

Modified: papers/jss/article.Rnw
===================================================================
--- papers/jss/article.Rnw 2014-11-19 01:08:02 UTC (rev 900)
+++ papers/jss/article.Rnw 2014-11-19 01:20:50 UTC (rev 901)
@@ -622,12 +622,12 @@
 Table~\ref{class-summary-table} lists the six primary Message and
 Descriptor classes in \pkg{RProtoBuf}.
-% Please see the package
-%documentation for a complete description of the slots and methods for
-%each class.
+Please see the package
+documentation for a complete description of the slots and methods for
+each class.
-
-\begin{table}[bp]
+% [bp]
+\begin{table}
 \centering
 \begin{tabular}{lccl}
 \toprule
@@ -651,57 +651,12 @@
 is the core abstraction of \pkg{RProtoBuf}. Each \code{Message} contains a pointer to a \code{Descriptor} which defines the schema of the data defined in the Message, as well as a number of
-\code{FieldDescriptors} for the individual fields of the message. A
-complete list of the slots and methods for \code{Messages}
-is available in Table~\ref{Message-methods-table}.
+\code{FieldDescriptors} for the individual fields of the message.
 <<>>=
 new(tutorial.Person)
 @
-\begin{table}[tbp]
-\centering
-\begin{small}
-\begin{tabular}{lp{10cm}}
-\toprule
-Slot & Description \\
-\cmidrule(r){2-2}
-\code{pointer} & External pointer to the \code{Message} object of the \proglang{C++} protobuf library. Documentation for the
-\code{Message} class is available from the Protocol Buffer project page. \\
-%(\url{http://code.google.com/apis/protocolbuffers/docs/reference/cpp/google.protobuf.message.html#Message}) \\
-\code{type} & Fully qualified name of the message. For example a \code{Person} message
-has its \code{type} slot set to \code{tutorial.Person} \\[.3cm]
-
-Method & Description \\
-\cmidrule(r){2-2}
-\code{has} & Indicates if a message has a given field. \\
-\code{clone} & Creates a clone of the message \\
-\code{isInitialized} & Indicates if a message has all its required fields set\\
-\code{serialize} & serialize a message to a file, binary connection, or raw vector\\
-\code{clear} & Clear one or several fields of a message, or the entire message\\
-\code{size} & The number of elements in a message field\\
-\code{bytesize} & The number of bytes the message would take once serialized\\[3mm]
-%
-\code{swap} & swap elements of a repeated field of a message\\
-\code{set} & set elements of a repeated field\\
-\code{fetch} & fetch elements of a repeated field\\
-\code{setExtension} & set an extension of a message\\
-\code{getExtension} & get the value of an extension of a message\\
-\code{add} & add elements to a repeated field \\[3mm]
-%
-\code{str} & the \proglang{R} structure of the message\\
-\code{as.character} & character representation of a message\\
-\code{toString} & character representation of a message (same as \code{as.character}) \\
-\code{as.list} & converts message to a named \proglang{R} list\\
-\code{update} & updates several fields of a message at once\\
-\code{descriptor} & get the descriptor of the message type of this message\\
-\code{fileDescriptor} & get the file descriptor of this message's descriptor\\
-\hline
-\end{tabular}
-\end{small}
-\caption{\label{Message-methods-table}Description of slots and methods for the \code{Message} S4 class.}
-\end{table}
-
 \subsection{Descriptors}
 Descriptors describe the type of a Message. This includes what fields
@@ -730,110 +685,26 @@
 tutorial.Person.PhoneNumber
 @
-Table~\ref{Descriptor-methods-table} provides a complete list of the
-slots and available methods for Descriptors.
-
-\begin{table}[tbp]
-\centering
-\begin{small}
-\begin{tabular}{lp{10cm}}
-\toprule
-Slot & Description \\
-\cmidrule(r){2-2}
-\code{pointer} & External pointer to the \code{Descriptor} object of the \proglang{C++} proto library. Documentation for the
-\code{Descriptor} class is available from the Protocol Buffer project page.\\
-%\url{http://code.google.com/apis/protocolbuffers/docs/reference/cpp/google.protobuf.descriptor.html#Descriptor} \\
-\code{type} & Fully qualified path of the message type. \\[.3cm]
-%
-
-Method & Description \\
-\cmidrule(r){2-2}
-\code{new} & Creates a prototype of a message described by this descriptor.\\
-\code{read} & Reads a message from a file or binary connection.\\
-\code{readASCII} & Read a message in ASCII format from a file or
-text connection.\\
-\code{name} & Retrieve the name of the message type associated with
-this descriptor.\\
-\code{as.character} & character representation of a descriptor\\
-\code{toString} & character representation of a descriptor (same as \code{as.character}) \\
-\code{as.list} & return a named
-list of the field, enum, and nested descriptors included in this descriptor.\\
-\code{asMessage} & return DescriptorProto message. \\
-\code{fileDescriptor} & Retrieve the file descriptor of this
-descriptor.\\
-\code{containing\_type} & Retrieve the descriptor describing the message type containing this descriptor.\\
-\code{field\_count} & Return the number of fields in this descriptor.\\
-\code{field} & Return the descriptor for the specified field in this descriptor.\\
-\code{nested\_type\_count} & The number of nested types in this descriptor.\\
-\code{nested\_type} & Return the descriptor for the specified nested
-type in this descriptor.\\
-\code{enum\_type\_count} & The number of enum types in this descriptor.\\
-\code{enum\_type} & Return the descriptor for the specified enum
-type in this descriptor.\\
-\bottomrule
-\end{tabular}
-\end{small}
-\caption{\label{Descriptor-methods-table}Description of slots and methods for the \code{Descriptor} S4 class.}
-\end{table}
-
 \subsection{Field descriptors}
 \label{subsec-field-descriptor}
 The class \emph{FieldDescriptor} represents field descriptors in \proglang{R}. This is a wrapper S4 class around the \code{google::protobuf::FieldDescriptor} \proglang{C++} class.
-Table~\ref{fielddescriptor-methods-table} describes the methods
-defined for the \code{FieldDescriptor} class.
-\begin{table}[tbp]
-\centering
-\begin{small}
-\begin{tabular}{lp{10cm}}
-\toprule
-Slot & Description \\
-\cmidrule(r){2-2}
-\code{pointer} & External pointer to the \code{FieldDescriptor} \proglang{C++} variable \\
-\code{name} & Simple name of the field \\
-\code{full\_name} & Fully qualified name of the field \\
-\code{type} & Name of the message type where the field is declared \\[.3cm]
-%
+<<>>=
+tutorial.Person$email
+tutorial.Person$email$is_required()
+tutorial.Person$email$type()
+tutorial.Person$email$as.character()
+@
-Method & Description \\
-\cmidrule(r){2-2}
-\code{as.character} & Character representation of a descriptor\\
-\code{toString} & Character representation of a descriptor (same as \code{as.character}) \\
-\code{asMessage} & Return FieldDescriptorProto message. \\
-\code{name} & Return the name of the field descriptor.\\
-\code{fileDescriptor} & Return the fileDescriptor where this field is defined.\\
-\code{containing\_type} & Return the containing descriptor of this field.\\
-\code{is\_extension} & Return TRUE if this field is an extension.\\
-\code{number} & Gets the declared tag number of the field.\\
-\code{type} & Gets the type of the field.\\
-\code{cpp\_type} & Gets the \proglang{C++} type of the field.\\
-\code{label} & Gets the label of a field (optional, required, or repeated).\\
-\code{is\_repeated} & Return TRUE if this field is repeated.\\
-\code{is\_required} & Return TRUE if this field is required.\\
-\code{is\_optional} & Return TRUE if this field is optional.\\
-\code{has\_default\_value} & Return TRUE if this field has a default value.\\
-\code{default\_value} & Return the default value.\\
-\code{message\_type} & Return the message type if this is a message type field.\\
-\code{enum\_type} & Return the enum type if this is an enum type field.\\
-\bottomrule
-\end{tabular}
-\end{small}
-\caption{\label{fielddescriptor-methods-table}Description of slots and
- methods for the \code{FieldDescriptor} S4 class.}
-\end{table}
-
-
 \subsection{Enum descriptors}
 \label{subsec-enum-descriptor}
 The class \emph{EnumDescriptor} represents enum descriptors in \proglang{R}. This is a wrapper S4 class around the \code{google::protobuf::EnumDescriptor} \proglang{C++} class.
-Table~\ref{enumdescriptor-methods-table} describes the methods
-defined for the \code{EnumDescriptor} class.
 The \verb|$| operator can be used to retrieve the value of enum
 constants contained in the EnumDescriptor, or to invoke
@@ -848,49 +719,12 @@
 tutorial.Person$PhoneType$WORK
 @
-\begin{table}[tbp]
-\centering
-\begin{small}
-\begin{tabular}{lp{10cm}}
-\toprule
-Slot & Description \\
-\cmidrule(r){2-2}
-\code{pointer} & External pointer to the \code{EnumDescriptor} \proglang{C++} variable \\
-\code{name} & Simple name of the enum \\
-\code{full\_name} & Fully qualified name of the enum \\
-\code{type} & Name of the message type where the enum is declared \\[.3cm]
-%
-
-Method & Description \\
-\cmidrule(r){2-2}
-\code{as.list} & return a named
-integer vector with the values of the enum and their names.\\
-\code{as.character} & character representation of a descriptor\\
-\code{toString} & character
-representation of a descriptor (same as \code{as.character}) \\
-\code{asMessage} & return EnumDescriptorProto message. \\
-\code{name} & Return the name of the enum descriptor.\\
-\code{fileDescriptor} & Return the fileDescriptor where this field is defined.\\
-\code{containing\_type} & Return the containing descriptor of this field.\\
-\code{length} & Return the number of constants in this enum.\\
-\code{has} & Return TRUE if this enum contains the specified named constant string.\\
-\code{value\_count} & Return the number of constants in this enum (same as \code{length}).\\
-\code{value} & Return the EnumValueDescriptor of an enum value of specified index, name, or number.\\
-\bottomrule
-\end{tabular}
-\end{small}
-\caption{\label{enumdescriptor-methods-table}Description of slots and methods
- for the \code{EnumDescriptor} S4 class.}
-\end{table}
-
 \subsection{Enum value descriptors}
 \label{subsec-enumvalue-descriptor}
 The class \emph{EnumValueDescriptor} represents enumeration value descriptors in \proglang{R}. This is a wrapper S4 class around the \code{google::protobuf::EnumValueDescriptor} \proglang{C++} class.
-Table~\ref{EnumValueDescriptor-methods-table} describes the methods
-defined for the \code{EnumValueDescriptor} class.
 The \verb|$| operator can be used to invoke methods.
@@ -900,68 +734,13 @@
 tutorial.Person$PhoneType$value(number=1)
 @
-\begin{table}[tbp]
-\centering
-\begin{small}
-\begin{tabular}{lp{10cm}}
-\toprule
-Slot & Description \\
-\cmidrule(r){2-2}
-\code{pointer} & External pointer to the \code{EnumValueDescriptor} \proglang{C++} variable \\
-\code{name} & simple name of the enum value \\
-\code{full\_name} & fully qualified name of the enum value \\[.3cm]
-%
-Method & Description \\
-\cmidrule(r){2-2}
-\code{number} & return the number of this EnumValueDescriptor. \\
-\code{name} & Return the name of the enum value descriptor.\\
-\code{enum\_type} & return the EnumDescriptor type of this EnumValueDescriptor. \\
-\code{as.character} & character representation of a descriptor. \\
-\code{toString} & character representation of a descriptor (same as \code{as.character}). \\
-\code{asMessage} & return EnumValueDescriptorProto message. \\
-\bottomrule
-\end{tabular}
-\end{small}
-\caption{\label{EnumValueDescriptor-methods-table}Description of slots
- and methods for the \code{EnumValueDescriptor} S4 class.}
-\end{table}
-
 \subsection{File descriptors}
 \label{subsec-file-descriptor}
-\begin{table}[tbp]
-\centering
-\begin{small}
-\begin{tabular}{lp{10cm}}
-\toprule
-Slot & Description \\
-\cmidrule(r){2-2}
-\code{pointer} & external pointer to the \code{FileDescriptor} object of the \proglang{C++} proto library. Documentation for the
-\code{FileDescriptor} class is available from the Protocol Buffer project page:
-\url{http://developers.google.com/protocol-buffers/docs/reference/cpp/google.protobuf.descriptor.html#FileDescriptor} \\
-\code{filename} & fully qualified pathname of the \code{.proto} file.\\
-\code{package} & package name defined in this \code{.proto} file.\\[.3cm]
-
-Method & Description \\
-\cmidrule(r){2-2}
-\code{name} & Return the filename for this FileDescriptorProto.\\
-\code{package} & Return the file-level package name specified in this FileDescriptorProto.\\
-\code{as.character} & character representation of a descriptor. \\
-\code{toString} & character representation of a descriptor (same as \code{as.character}). \\
-\code{asMessage} & return FileDescriptorProto message. \\
-\code{as.list} & return named list of descriptors defined in this file descriptor.\\
-\bottomrule
-\end{tabular}
-\end{small}
-\caption{\label{filedescriptor-methods-table}Description of slots and methods for the \code{FileDescriptor} S4 class.}
-\end{table}
-
 The class \emph{FileDescriptor} represents file descriptors in \proglang{R}. This is a wrapper S4 class around the \code{google::protobuf::FileDescriptor} \proglang{C++} class.
-Table~\ref{filedescriptor-methods-table} describes the methods
-defined for the \code{FileDescriptor} class.
 The \verb|$| operator can be used to retrieve named fields defined in the FileDescriptor, or to invoke methods.

From noreply at r-forge.r-project.org Wed Nov 19 18:22:57 2014
From: noreply at r-forge.r-project.org (noreply at r-forge.r-project.org)
Date: Wed, 19 Nov 2014 18:22:57 +0100 (CET)
Subject: [Rprotobuf-commits] r902 - papers/jss
Message-ID: <20141119172257.C924B18637D@r-forge.r-project.org>

Author: murray
Date: 2014-11-19 18:22:57 +0100 (Wed, 19 Nov 2014)
New Revision: 902

Modified:
   papers/jss/article.Rnw
Log:
Make section 4 more concise and less manual-like per referee #1 feedback.
Modified: papers/jss/article.Rnw =================================================================== --- papers/jss/article.Rnw 2014-11-19 01:20:50 UTC (rev 901) +++ papers/jss/article.Rnw 2014-11-19 17:22:57 UTC (rev 902) @@ -675,73 +675,40 @@ Table~\ref{tab:proto} lacked an explicit call to \code{readProtoFiles}.}. -<<>>= -tutorial.Person$email - -tutorial.Person$PhoneType - -tutorial.Person$PhoneNumber - -tutorial.Person.PhoneNumber -@ - -\subsection{Field descriptors} +\subsubsection*{Field descriptors} \label{subsec-field-descriptor} -The class \emph{FieldDescriptor} represents field -descriptors in \proglang{R}. This is a wrapper S4 class around the -\code{google::protobuf::FieldDescriptor} \proglang{C++} class. - <<>>= -tutorial.Person$email +tutorial.Person$email tutorial.Person$email$is_required() tutorial.Person$email$type() tutorial.Person$email$as.character() +class(tutorial.Person$email) @ -\subsection{Enum descriptors} +\subsubsection*{Enum and EnumValue descriptors} \label{subsec-enum-descriptor} -The class \emph{EnumDescriptor} represents enum descriptors in \proglang{R}. -This is a wrapper S4 class around the -\code{google::protobuf::EnumDescriptor} \proglang{C++} class. +The \code{EnumDescriptor} contains information about what values a +type defines, while the \code{EnumValueDescriptor} describes an +individual enum constant of a particular type. The \verb|$| operator +can be used to retrieve the value of enum constants contained in the +EnumDescriptor, or to invoke methods. -The \verb|$| operator can be used to retrieve the value of enum -constants contained in the EnumDescriptor, or to invoke -methods. - -The \code{EnumDescriptor} contains information about what values this type -defines, while the \code{EnumValueDescriptor} describes an -individual enum constant of a particular type. 
- <<>>= tutorial.Person$PhoneType tutorial.Person$PhoneType$WORK -@ - -\subsection{Enum value descriptors} -\label{subsec-enumvalue-descriptor} - -The class \emph{EnumValueDescriptor} represents enumeration value -descriptors in \proglang{R}. This is a wrapper S4 class around the -\code{google::protobuf::EnumValueDescriptor} \proglang{C++} class. - -The \verb|$| operator can be used to invoke methods. - -<<>>= +class(tutorial.Person$PhoneType) tutorial.Person$PhoneType$value(1) tutorial.Person$PhoneType$value(name="HOME") tutorial.Person$PhoneType$value(number=1) +class(tutorial.Person$PhoneType$value(1)) @ - -\subsection{File descriptors} +\subsubsection*{File descriptors} \label{subsec-file-descriptor} The class \emph{FileDescriptor} represents file descriptors in \proglang{R}. -This is a wrapper S4 class around the -\code{google::protobuf::FileDescriptor} \proglang{C++} class. - The \verb|$| operator can be used to retrieve named fields defined in the FileDescriptor, or to invoke methods. From noreply at r-forge.r-project.org Tue Nov 25 00:06:12 2014 From: noreply at r-forge.r-project.org (noreply at r-forge.r-project.org) Date: Tue, 25 Nov 2014 00:06:12 +0100 (CET) Subject: [Rprotobuf-commits] r903 - papers/jss Message-ID: <20141124230612.7433518779C@r-forge.r-project.org> Author: murray Date: 2014-11-25 00:06:05 +0100 (Tue, 25 Nov 2014) New Revision: 903 Modified: papers/jss/article.Rnw Log: Remove a duplicate word in the last paragraph of the introduction, add a word to fix a linewrap in section 4, and add a paragraph to the beginning of section 7 to make it more interesting. More work on sections 6,7, and 8 is sorely needed. 
Modified: papers/jss/article.Rnw =================================================================== --- papers/jss/article.Rnw 2014-11-19 17:22:57 UTC (rev 902) +++ papers/jss/article.Rnw 2014-11-24 23:06:05 UTC (rev 903) @@ -215,7 +215,7 @@ Section~\ref{sec:rprotobuf-basic} describes the interactive \proglang{R} interface provided by the \pkg{RProtoBuf} package, and introduces the two main abstractions: \emph{Messages} and \emph{Descriptors}. Section~\ref{sec:rprotobuf-classes} -details the implementation details of the main S4 classes and methods. +details the implementation of the main S4 classes and methods. Section~\ref{sec:types} describes the challenges of type coercion between \proglang{R} and other languages. Section~\ref{sec:evaluation} introduces a general \proglang{R} language schema for serializing arbitrary \proglang{R} objects and evaluates @@ -689,7 +689,7 @@ \subsubsection*{Enum and EnumValue descriptors} \label{subsec-enum-descriptor} -The \code{EnumDescriptor} contains information about what values a +The \code{EnumDescriptor} type contains information about what values a type defines, while the \code{EnumValueDescriptor} describes an individual enum constant of a particular type. The \verb|$| operator can be used to retrieve the value of enum constants contained in the @@ -1074,6 +1074,12 @@ \section{Application: Distributed data collection with MapReduce} \label{sec:mapreduce} +Protocol Buffers have been used extensively at Google for almost all +RPC protocols, and for storing structured information in a variety of +persistent storage systems since 2000 \citep{dean2009designs}. The +\pkg{RProtoBuf} package has been in widespread use by hundreds of +analysts at Google since 2010. + Many large data sets in fields such as particle physics and information processing are stored in binned or histogram form in order to reduce the data storage requirements \citep{scott2009multivariate}. 
In the From noreply at r-forge.r-project.org Tue Nov 25 00:25:50 2014 From: noreply at r-forge.r-project.org (noreply at r-forge.r-project.org) Date: Tue, 25 Nov 2014 00:25:50 +0100 (CET) Subject: [Rprotobuf-commits] r904 - papers/jss Message-ID: <20141124232550.B106C1877A6@r-forge.r-project.org> Author: murray Date: 2014-11-25 00:25:50 +0100 (Tue, 25 Nov 2014) New Revision: 904 Modified: papers/jss/article.Rnw papers/jss/article.bib Log: Improve section 7 and update the conclusions to note we've been using this package at Google for 5 years, not 3 years. Done with section 7 updates. 6 and 8 need work. Modified: papers/jss/article.Rnw =================================================================== --- papers/jss/article.Rnw 2014-11-24 23:06:05 UTC (rev 903) +++ papers/jss/article.Rnw 2014-11-24 23:25:50 UTC (rev 904) @@ -1078,7 +1078,10 @@ RPC protocols, and for storing structured information in a variety of persistent storage systems since 2000 \citep{dean2009designs}. The \pkg{RProtoBuf} package has been in widespread use by hundreds of -analysts at Google since 2010. +statisticians and software engineers at Google since 2010. This +section describes a simplified example of a common design pattern of +collecting a large structured data set in one language for later +analysis in \proglang{R}. Many large data sets in fields such as particle physics and information processing are stored in binned or histogram form in order to reduce @@ -1196,8 +1199,9 @@ @ \end{center} -One of the authors has used this design pattern for several -large-scale studies of distributed storage systems +One of the authors has used this design pattern with large-scale \proglang{C++} +MapReduces over very large data sets to write out histogram protocol +buffers for several large-scale studies of distributed storage systems \citep{sciencecloud,janus}. 
\section{Application: Data interchange in web services} @@ -1372,7 +1376,7 @@ and extends the \proglang{R} system with the ability to create, read, write, parse, and manipulate Protocol Buffer messages. \pkg{RProtoBuf} has been used extensively inside Google -for the past three years by statisticians, analysts, and software engineers. +for the past five years by statisticians, analysts, and software engineers. At the time of this writing there are over 300 active users of \pkg{RProtoBuf} using it to read data from and otherwise interact with distributed systems written in \proglang{C++}, \proglang{Java}, \proglang{Python}, and Modified: papers/jss/article.bib =================================================================== --- papers/jss/article.bib 2014-11-24 23:06:05 UTC (rev 903) +++ papers/jss/article.bib 2014-11-24 23:25:50 UTC (rev 904) @@ -1,3 +1,9 @@ + at article{dean2009designs, + title={Designs, lessons and advice from building large distributed systems}, + author={Dean, Jeff}, + journal={Keynote from LADIS}, + year={2009} +} @article{eddelbuettel2011rcpp, title = {Rcpp: Seamless R and C++ Integration}, author = {Dirk Eddelbuettel and Romain Fran{\c{c}}ois}, From noreply at r-forge.r-project.org Tue Nov 25 02:43:33 2014 From: noreply at r-forge.r-project.org (noreply at r-forge.r-project.org) Date: Tue, 25 Nov 2014 02:43:33 +0100 (CET) Subject: [Rprotobuf-commits] r905 - papers/jss Message-ID: <20141125014333.346D818735F@r-forge.r-project.org> Author: murray Date: 2014-11-25 02:43:32 +0100 (Tue, 25 Nov 2014) New Revision: 905 Modified: papers/jss/article.Rnw Log: Use listings package to add line numbers so we can explain an example in section 7 better per referee #1 feedback. 
Modified: papers/jss/article.Rnw =================================================================== --- papers/jss/article.Rnw 2014-11-24 23:25:50 UTC (rev 904) +++ papers/jss/article.Rnw 2014-11-25 01:43:32 UTC (rev 905) @@ -1,5 +1,6 @@ \documentclass[article]{jss} \usepackage{booktabs} +\usepackage{listings} \usepackage[toc,page]{appendix} %%%%%%%%%%%%%%%%%%%%%%%%%%%%% @@ -504,7 +505,7 @@ % covered in the package documentation well. % %\code{serialize} can also be called in a more traditional -%object oriented fashion using the dollar operator. +%object-oriented fashion using the dollar operator. % %<<>>= %p$serialize(tf1) @@ -608,7 +609,7 @@ \begin{itemize} \item The functional dispatch mechanism of the the form \verb|method(object, arguments)| (common to \proglang{R}), and -\item The traditional object oriented notation +\item The message passing object-oriented notation of the form \verb|object$method(arguments)|. \end{itemize} @@ -1171,27 +1172,33 @@ outfile.close() \end{Code} -The Protocol Buffer can then be read into \proglang{R} and converted to a native -\proglang{R} histogram object for plotting. Here, the schema is read first, -then the (serialized) histogram is read into the variable \code{hist} which -is then converted a histogram object which is display as a plot. +The Protocol Buffer created from this \proglang{Python} script can then be read into \proglang{R} and converted to a native +\proglang{R} histogram object for plotting. Line~1 in the listing below attaches the \pkg{HistogramTools} package which imports \pkg{RProtoBuf}. Line~2 then reads all of the \code{.proto} descriptor definitions provided by \pkg{HistogramTools} and adds them to the environment as described in Section~\ref{sec:rprotobuf-basic}. Line~3 parses the serialized protocol buffer using the \code{HistogramTools.HistogramState} schema. 
Line~8 converts the protocol buffer representation of the histogram to a native \proglang{R} histogram object with \code{as.histogram} and passes the result to \code{plot}. -\begin{Code} -library("RProtoBuf") -library("HistogramTools") +% Here, the schema is read first, +%then the (serialized) histogram is read into the variable \code{hist} which +%is then converted a histogram object which is display as a plot. -readProtoFiles(package="HistogramTools") +\lstdefinelanguage{jss} + {sensitive=false, + morecomment=[l]{R>}} -hist <- HistogramTools.HistogramState$read("/tmp/hist.pb") -hist +\lstset{language=jss, basicstyle=\ttfamily, numbers=left, numberstyle=\tiny, stepnumber=2, numbersep=5pt, columns=fullflexible, keepspaces=true, showstringspaces=false, commentstyle=\textsl} +%\begin{Code} +\begin{lstlisting} +R> library("HistogramTools") +R> readProtoFiles(package="HistogramTools") +R> hist <- HistogramTools.HistogramState$read("/tmp/hist.pb") +R> hist + [1] "message of type 'HistogramTools.HistogramState' with 3 fields set" -plot(as.histogram(hist)) -\end{Code} +R> plot(as.histogram(hist)) +\end{lstlisting} +%\end{Code} \begin{center} <>= -require(RProtoBuf) require(HistogramTools) readProtoFiles(package="HistogramTools") hist <- HistogramTools.HistogramState$read("hist.pb") From noreply at r-forge.r-project.org Tue Nov 25 03:26:09 2014 From: noreply at r-forge.r-project.org (noreply at r-forge.r-project.org) Date: Tue, 25 Nov 2014 03:26:09 +0100 (CET) Subject: [Rprotobuf-commits] r906 - in pkg: . inst/unitTests man Message-ID: <20141125022609.811E91877E4@r-forge.r-project.org> Author: murray Date: 2014-11-25 03:26:08 +0100 (Tue, 25 Nov 2014) New Revision: 906 Modified: pkg/ChangeLog pkg/inst/unitTests/runit.golden.message.R pkg/man/P.Rd Log: Minor cosmetic improvements suggested by Tim Hesterberg in code review. 
Modified: pkg/ChangeLog =================================================================== --- pkg/ChangeLog 2014-11-25 01:43:32 UTC (rev 905) +++ pkg/ChangeLog 2014-11-25 02:26:08 UTC (rev 906) @@ -1,3 +1,9 @@ +2014-11-24 Murray Stokely + + * inst/unitTests/runit.golden.message.R: remove trailing + whitespace. + * man/P.Rd: Improve output of example. + 2014-09-15 Murray Stokely Address feedback from anonymous reviewers for our Journal of Modified: pkg/inst/unitTests/runit.golden.message.R =================================================================== --- pkg/inst/unitTests/runit.golden.message.R 2014-11-25 01:43:32 UTC (rev 905) +++ pkg/inst/unitTests/runit.golden.message.R 2014-11-25 02:26:08 UTC (rev 906) @@ -10,18 +10,18 @@ .tearDown <- function(){} test.import <- function(){ - checkTrue( exists( "protobuf_unittest_import.ImportMessage", "RProtoBuf:DescriptorPool" ) , + checkTrue( exists( "protobuf_unittest_import.ImportMessage", "RProtoBuf:DescriptorPool" ) , msg = "exists( protobuf_unittest_import.ImportMessage ) " ) - checkTrue( exists( "protobuf_unittest_import.ImportEnum", "RProtoBuf:DescriptorPool" ) , + checkTrue( exists( "protobuf_unittest_import.ImportEnum", "RProtoBuf:DescriptorPool" ) , msg = "exists( protobuf_unittest_import.ImportEnum ) " ) - checkEquals( - names(as.list( protobuf_unittest_import.ImportMessage)), - "d", + checkEquals( + names(as.list( protobuf_unittest_import.ImportMessage)), + "d", msg = "names( protobuf_unittest_import.ImportMessage ) == 'd'" ) import_enum <- as.list(protobuf_unittest_import.ImportEnum ) - checkTrue( all( c("IMPORT_FOO", "IMPORT_BAR", "IMPORT_BAZ") %in% names(import_enum) ), + checkTrue( all( c("IMPORT_FOO", "IMPORT_BAR", "IMPORT_BAZ") %in% names(import_enum) ), msg = "expected names for 'protobuf_unittest_import.ImportEnum'" ) - checkEquals( unlist(unname(import_enum)), 7:9, + checkEquals( unlist(unname(import_enum)), 7:9, msg = "expected values for 'protobuf_unittest_import.ImportEnum' " ) } @@ -30,10 
+30,9 @@ checkTrue( exists( "protobuf_unittest.TestAllTypes.NestedMessage", "RProtoBuf:DescriptorPool" ), msg = "exists( protobuf_unittest_import.TestAllTypes.NestedMessage ) " ) checkTrue( exists( "protobuf_unittest.TestAllTypes.NestedEnum", "RProtoBuf:DescriptorPool" ), msg = "exists( protobuf_unittest_import.TestAllTypes.NestedEnum ) " ) checkTrue( exists( "protobuf_unittest.TestAllTypes.OptionalGroup", "RProtoBuf:DescriptorPool" ) , msg = "exists( protobuf_unittest.TestAllTypes.OptionalGroup ) " ) - - types <- c("int32", "int64", "uint32", "uint64", "sint32", "sint64", - "fixed32", "fixed64", "sfixed32", "sfixed64", "float", "double", - "bool", "string", "bytes" ) + types <- c("int32", "int64", "uint32", "uint64", "sint32", "sint64", + "fixed32", "fixed64", "sfixed32", "sfixed64", "float", "double", + "bool", "string", "bytes" ) fieldnames <- names( as.list( protobuf_unittest.TestAllTypes ) ) prefixes <- c("optional", "default", "repeated" ) for( prefix in prefixes ){ @@ -43,16 +42,13 @@ checkTrue( sprintf("%s_foreign_enum" , prefix ) %in% fieldnames, msg = sprintf( "%s_foreign_enum in field names" , prefix ) ) checkTrue( sprintf("%s_import_enum" , prefix ) %in% fieldnames, msg = sprintf( "%s_import_enum in field names" , prefix ) ) } - checkTrue( exists( "protobuf_unittest.ForeignMessage", "RProtoBuf:DescriptorPool" ) , msg = "exists( protobuf_unittest.ForeignMessage ) " ) checkEquals( names(as.list(protobuf_unittest.ForeignMessage)), "c" ) - checkTrue( exists( "protobuf_unittest.ForeignEnum", "RProtoBuf:DescriptorPool" ) , msg = "exists( protobuf_unittest.ForeignEnum ) " ) foreign_enum <- as.list( protobuf_unittest.ForeignEnum ) checkEquals( length(foreign_enum), 3L, msg = "length( protobuf_unittest.ForeignEnum ) == 3" ) checkTrue( all( c("FOREIGN_FOO", "FOREIGN_BAR", "FOREIGN_BAZ") %in% names( foreign_enum ) ), msg = "expected names for enum `protobuf_unittest.ForeignEnum`" ) checkEquals( unlist(unname(as.list(protobuf_unittest.ForeignEnum))), 4:6, msg = 
"expected values for enum `protobuf_unittest.ForeignEnum`" ) - } # Early versions of RProtoBuf did not support repeated messages properly. Modified: pkg/man/P.Rd =================================================================== --- pkg/man/P.Rd 2014-11-25 01:43:32 UTC (rev 905) +++ pkg/man/P.Rd 2014-11-25 02:26:08 UTC (rev 906) @@ -28,6 +28,6 @@ } \dontshow{ Person <- P("tutorial.Person") } -as.character( Person ) +cat(as.character( Person )) } \keyword{ interface } From noreply at r-forge.r-project.org Tue Nov 25 03:39:23 2014 From: noreply at r-forge.r-project.org (noreply at r-forge.r-project.org) Date: Tue, 25 Nov 2014 03:39:23 +0100 (CET) Subject: [Rprotobuf-commits] r907 - papers/jss Message-ID: <20141125023923.9ACEC183FAD@r-forge.r-project.org> Author: edd Date: 2014-11-25 03:39:23 +0100 (Tue, 25 Nov 2014) New Revision: 907 Modified: papers/jss/article.Rnw Log: three small edits Modified: papers/jss/article.Rnw =================================================================== --- papers/jss/article.Rnw 2014-11-25 02:26:08 UTC (rev 906) +++ papers/jss/article.Rnw 2014-11-25 02:39:23 UTC (rev 907) @@ -230,7 +230,7 @@ Protocol Buffers are a modern, language-neutral, platform-neutral, extensible mechanism for sharing and storing structured data. Some of -the key features provided by Protocol Buffers for data analysis include: +the key features provided by Protocol Buffers for data analysis are: \begin{itemize} \item \emph{Portable}: Enable users to send and receive data between @@ -385,7 +385,7 @@ namespace.\footnote{Note that there is a significant performance overhead with this RObjectTable implementation. 
Because the table is on the search path and isn't cacheable, lookups of symbols that - are behind it in the search path can't be added to the global object + are behind it in the search path cannot be added to the global object cache, and R must perform an expensive lookup through all of the attached environments and the protocol buffer definitions to find common symbols (most notably those in base) from the global environment. @@ -623,8 +623,7 @@ Table~\ref{class-summary-table} lists the six primary Message and Descriptor classes in \pkg{RProtoBuf}. -Please see the package -documentation for a complete description of the slots and methods for +The package documentation provides a complete description of the slots and methods for each class. % [bp] From mstokely at google.com Tue Nov 25 03:52:36 2014 From: mstokely at google.com (Murray Stokely) Date: Mon, 24 Nov 2014 18:52:36 -0800 Subject: [Rprotobuf-commits] r907 - papers/jss In-Reply-To: <20141125023923.9ACEC183FAD@r-forge.r-project.org> References: <20141125023923.9ACEC183FAD@r-forge.r-project.org> Message-ID: Yay Dirk, let's finish this up and get it done. =) I'm addressing referee #2 feedback on section 7 now. Then I will probably look at section 8 because the best we can do to resolve the comment is add a few sentences of motivational text here and there. Addressing feedback in section 6 requires code changes for the general R object to protobuf stuff and rexp.proto which I don't use because all my use cases are typed. Again I would try to get around it by writing more motivational text, talking about future work, rather than doing all that implementation work now.
- Murray
From noreply at r-forge.r-project.org Tue Nov 25 03:59:01 2014 From: noreply at r-forge.r-project.org (noreply at r-forge.r-project.org) Date: Tue, 25 Nov 2014 03:59:01 +0100 (CET) Subject: [Rprotobuf-commits] r908 - papers/jss Message-ID: <20141125025901.695051877F7@r-forge.r-project.org> Author: murray Date: 2014-11-25 03:59:00 +0100 (Tue, 25 Nov 2014) New Revision: 908 Modified: papers/jss/article.Rnw Log: Improve example in section 7 using some of the specific advantages suggested by referee #2 and point out why we've given the user a simplified example and how it differs from the real MapReduce context where this would be more useful. Modified: papers/jss/article.Rnw =================================================================== --- papers/jss/article.Rnw 2014-11-25 02:39:23 UTC (rev 907) +++ papers/jss/article.Rnw 2014-11-25 02:59:00 UTC (rev 908) @@ -1142,7 +1142,11 @@ This HistogramState message type is designed to be helpful if some of the Map or Reduce tasks are written in \proglang{R}, or if those components are written in other languages and only the resulting output histograms -need to be manipulated in \proglang{R}. For example, to create HistogramState
+ +\subsection*{A trivial single-machine example for Python to R serialization} + +To create HistogramState messages in Python for later consumption by \proglang{R}, we first compile the \code{histogram.proto} descriptor into a python module using the \code{protoc} compiler: @@ -1205,7 +1209,18 @@ @ \end{center} -One of the authors has used this design pattern with large-scale \proglang{C++} +This simple example uses a constant histogram generated in +\proglang{Python} to illustrate the serialization concepts without +requiring the reader to be familiar with the interface of any +particular MapReduce implementation. In practice, using Protocol +Buffers to pass histograms between another programming language and R +would provide a much greater benefit in a distributed context. +For example, a first-class data type to represent histograms would +prevent individual histograms from being split up and would allow the +use of combiners on Map workers to process large data sets more +efficiently than simply passing around lists of counts and buckets. + +One of the authors has used this design pattern with \proglang{C++} MapReduces over very large data sets to write out histogram protocol buffers for several large-scale studies of distributed storage systems \citep{sciencecloud,janus}. From noreply at r-forge.r-project.org Tue Nov 25 04:09:11 2014 From: noreply at r-forge.r-project.org (noreply at r-forge.r-project.org) Date: Tue, 25 Nov 2014 04:09:11 +0100 (CET) Subject: [Rprotobuf-commits] r909 - papers/jss Message-ID: <20141125030911.2EA7B184EAB@r-forge.r-project.org> Author: murray Date: 2014-11-25 04:09:00 +0100 (Tue, 25 Nov 2014) New Revision: 909 Modified: papers/jss/article.Rnw Log: Add acknowledgements for Tim Hesterberg and two anonymous referees. 
Modified: papers/jss/article.Rnw =================================================================== --- papers/jss/article.Rnw 2014-11-25 02:59:00 UTC (rev 908) +++ papers/jss/article.Rnw 2014-11-25 03:09:00 UTC (rev 909) @@ -1418,12 +1418,14 @@ The user-defined table mechanism, implemented by Duncan Temple Lang for the purpose of the \pkg{RObjectTables} package, allows for the dynamic symbol lookup. Kenton Varda was generous with his time in reviewing code and explaining -obscure Protocol Buffer semantics. Karl Millar was very +obscure Protocol Buffer semantics. Karl Millar and Tim Hesterberg were very helpful in reviewing code and offering suggestions. Saptarshi Guha's work on RHIPE and implementation of a universal message type for \proglang{R} language objects allowed us to add the \code{serialize_pb} and \code{unserialize_pb} methods for turning arbitrary R objects into Protocol Buffers without -a specialized pre-defined schema. +a specialized pre-defined schema. Feedback from two anonymous +referees greatly improved both the presentation of this paper and the +package contents. \newpage \appendix From noreply at r-forge.r-project.org Tue Nov 25 08:17:09 2014 From: noreply at r-forge.r-project.org (noreply at r-forge.r-project.org) Date: Tue, 25 Nov 2014 08:17:09 +0100 (CET) Subject: [Rprotobuf-commits] r910 - papers/jss Message-ID: <20141125071710.07E931868CC@r-forge.r-project.org> Author: murray Date: 2014-11-25 08:17:09 +0100 (Tue, 25 Nov 2014) New Revision: 910 Modified: papers/jss/article.Rnw Log: Updates to section 7: Add a better transition to start section 7 reminding the user what the application in section 6 was about and how/why this one is different. Remove several duplicate sentences about the basics of protocol buffer .proto files and such which are explained earlier in the paper. Remove a few sentences that provide unnecessary level of detail about OpenCPU. 
In this section it is an example web service and so we don't need to advertise to the reader its other capabilities that are unnecessary for the example. Backreference to the section about serialize_pb here. Avoid saying 'protobuf' or 'protobuf messages' as this terminology was only used in this section. Instead, spell out protocol buffers. End result is 13-lines shorter/more concise (1/3 - 1/2 of a page), and I think clearer as well. We still don't mention the word RESTful anywhere here, where OpenCPU is a RESTful web service where just one argument of the POST request is encoded with the protocol buffer, instead of a more general non web/REST type of RPC server that would tend to be a more natural fit for protocol buffers. Modified: papers/jss/article.Rnw =================================================================== --- papers/jss/article.Rnw 2014-11-25 03:09:00 UTC (rev 909) +++ papers/jss/article.Rnw 2014-11-25 07:17:09 UTC (rev 910) @@ -1209,7 +1209,7 @@ @ \end{center} -This simple example uses a constant histogram generated in + This simple example uses a constant histogram generated in \proglang{Python} to illustrate the serialization concepts without requiring the reader to be familiar with the interface of any particular MapReduce implementation. In practice, using Protocol @@ -1228,8 +1228,12 @@ \section{Application: Data interchange in web services} \label{sec:opencpu} -As described earlier, the primary application of Protocol Buffers is data -interchange in the context of inter-system communications. Network protocols +The previous section described an application where data from a +program written in another language was output to persistent storage +and then read into \proglang{R} for further analysis. This section +describes another common use case where Protocol Buffers are used as +the interchange format for client-server communication. 
+Network protocols such as HTTP provide mechanisms for client-server communication, i.e., how to initiate requests, authenticate, send messages, etc. However, network protocols generally do not regulate the \emph{content} of messages: they @@ -1240,47 +1244,51 @@ messages (buffers) on the network. Protocol Buffers solve exactly this problem by providing a cross-platform method for serializing arbitrary structures into well defined messages, which can then be exchanged using any -protocol. The descriptors (\code{.proto} files) are used to formally define -the interface of a remote API or network application. Libraries to parse and -generate protobuf messages are available for many programming languages, -making it relatively straightforward to implement clients and servers. +protocol. +%The descriptors (\code{.proto} files) are used to formally define +%the interface of a remote API or network application. +%Libraries to parse and +%generate protobuf messages are available for many programming languages, +%making it relatively straightforward to implement clients and servers. \subsection[Interacting with R through HTTPS and Protocol Buffers]{Interacting with \proglang{R} through HTTPS and Protocol Buffers} One example of a system that supports Protocol Buffers to interact -with \proglang{R} is OpenCPU \citep{opencpu}. OpenCPU is a framework for embedded statistical -computation and reproducible research based on \proglang{R} and \LaTeX. It exposes a -HTTP(S) API to access and manipulate \proglang{R} objects and allows for performing -remote \proglang{R} function calls. Clients do not need to understand -or generate any \proglang{R} code: HTTP requests are automatically mapped to -function calls, and arguments/return values can be posted/retrieved -using several data interchange formats, such as Protocol Buffers. 
-OpenCPU uses the \code{serialize\_pb} and \code{unserialize\_pb} functions -from the \pkg{RProtoBuf} package to convert between \proglang{R} objects and protobuf -messages. Therefore, clients need the \code{rexp.proto} descriptor mentioned -earlier to parse and generate protobuf messages when interacting with OpenCPU. +with \proglang{R} is OpenCPU \citep{opencpu}. OpenCPU is a framework +for embedded statistical computation and reproducible research based +on \proglang{R} and \LaTeX. It exposes a HTTP(S) API to access and +manipulate \proglang{R} objects and execute remote \proglang{R} +function calls. Clients do not need to understand or generate any +\proglang{R} code: HTTP requests are automatically mapped to function +calls, and arguments/return values can be posted/retrieved using +several data interchange formats, such as Protocol Buffers. OpenCPU +uses the \code{rexp.proto} descriptor and the \code{serialize\_pb} and +\code{unserialize\_pb} functions described in +Section~\ref{sec:evaluation} to convert between \proglang{R} objects +and protocol buffer messages. \subsection[HTTP GET: Retrieving an R object]{HTTP GET: Retrieving an \proglang{R} object} The \code{HTTP GET} method is used to read a resource from OpenCPU. For example, -to access the data set \code{Animals} from the package \code{MASS}, a +to access the data set \code{Animals} from the package \code{MASS}, a client performs the following HTTP request: \begin{verbatim} GET https://public.opencpu.org/ocpu/library/MASS/data/Animals/pb \end{verbatim} The postfix \code{/pb} in the URL tells the server to send this -object in the form of a protobuf message. Alternative formats include -\code{/json}, \code{/csv}, \code{/rds} and others. If the request -is successful, OpenCPU returns the serialized object with HTTP status -code 200 and HTTP response header \code{Content-Type: application/x-protobuf}. +object in the form of a protocol buffer message. 
+% Alternative formats include \code{/json}, \code{/csv}, \code{/rds} and others. +If the request +is successful, OpenCPU returns the serialized object with HTTP status +code 200 and HTTP response header \code{Content-Type: application/x-protobuf}. The latter is the conventional MIME type that formally notifies the client to -interpret the response as a protobuf message. +interpret the response as a protocol buffer. -Because both HTTP and Protocol Buffers have libraries available for many +Because both HTTP and Protocol Buffers have libraries available for many languages, clients can be implemented in just a few lines of code. Below -is example code for both \proglang{R} and Python that retrieves a data set from \proglang{R} with -OpenCPU using a protobuf message. In \proglang{R}, we use the HTTP client from +is example code for both \proglang{R} and Python that retrieves an \proglang{R} data set encoded as a protocol buffer message from OpenCPU. +In \proglang{R}, we use the HTTP client from the \code{httr} package \citep{httr}. In this example we download a data set which is part of the base \proglang{R} distribution, so we can verify that the object was transferred without loss of information. @@ -1295,28 +1303,28 @@ identical(output, MASS::Animals) @ -This code suggests a method for exchanging objects between \proglang{R} servers, however this might as -well be done without Protocol Buffers. The main advantage of using an inter-operable format -is that we can actually access \proglang{R} objects from within another -programming language. For example, in a very similar fashion we can retrieve the same -data set in a Python client. 
To parse messages in Python, we first compile the -\code{rexp.proto} descriptor into a python module using the \code{protoc} compiler: +Similarly, to retrieve the same data set in a Python client, we first +compile the \code{rexp.proto} descriptor into a python module +using the \code{protoc} compiler: \begin{verbatim} protoc rexp.proto --python_out=. \end{verbatim} -This generates Python module called \code{rexp\_pb2.py}, containing both the -descriptor information as well as methods to read and manipulate the \proglang{R} object -message. In the example below we use the HTTP client from the \code{urllib2} -module. +This generates Python module called \code{rexp\_pb2.py}, containing +both the descriptor information as well as methods to read and +manipulate the \proglang{R} object message. We use the +HTTP client from the \code{urllib2} module in our example to retrieve the +encoded protocol buffer from the remote server then parse and print it +from Python. + \begin{verbatim} import urllib2 from rexp_pb2 import REXP req = urllib2.Request('https://public.opencpu.org/ocpu/library/MASS/data/Animals/pb') res = urllib2.urlopen(req) - + msg = REXP() msg.ParseFromString(res.read()) print(msg) @@ -1324,35 +1332,28 @@ The \code{msg} object contains all data from the Animals data set. From here we can easily extract the desired fields for further use in Python. - \subsection[HTTP POST: Calling an R function]{HTTP POST: Calling an \proglang{R} function} -The example above shows how the \code{HTTP GET} method retrieves a -resource from OpenCPU, for example an \proglang{R} object. The \code{HTTP POST} -method on the other hand is used for calling functions and running scripts, -which is the primary purpose of the framework. As before, the \code{/pb} -postfix requests to retrieve the output as a protobuf message, in this -case the function return value. However, OpenCPU allows us to supply the -arguments of the function call in the form of protobuf messages as well. 
-This is a bit more work, because clients needs to both generate messages -containing \proglang{R} objects to post to the server, as well as retrieve and parse -protobuf messages returned by the server. Using Protocol Buffers to post -function arguments is not required, and for simple (scalar) arguments -the standard \code{application/x-www-form-urlencoded} format might be sufficient. -However, with Protocol Buffers the client can perform function calls with -more complex arguments such as \proglang{R} vectors or lists. The result is a complete -RPC system to do arbitrary \proglang{R} function calls from within -any programming language. +The previous example used a simple \code{HTTP GET} method to retrieve +an \proglang{R} object from a remote service (OpenCPU) encoded as a +protocol buffer. +In many cases simple \code{HTTP GET} methods are insufficient, and a +more complete RPC system may need to create compact protocol buffers +for each request to send to the remote server in addition to parsing +the response protocol buffers. -The following example \proglang{R} client code performs the remote function call -\code{stats::rnorm(n=42, mean=100)}. The function arguments (in this -case \code{n} and \code{mean}) as well as the return value (a vector -with 42 random numbers) are transferred using a protobuf message. RPC in -OpenCPU works like the \code{do.call} function in \proglang{R}, hence all arguments -are contained within a list. +The OpenCPU framework allows us to do arbitrary \proglang{R} function +calls from within any programming language by encoding the arguments +in the request protocol buffer. The following example \proglang{R} +client code performs the remote function call \code{stats::rnorm(n=42, +mean=100)}. The function arguments (in this case \code{n} and +\code{mean}) as well as the return value (a vector with 42 random +numbers) are transferred using protocol buffer messages. 
RPC in OpenCPU +works like the \code{do.call} function in \proglang{R}, hence all +arguments are contained within a list. <>= -library("httr") +library("httr") library("RProtoBuf") args <- list(n=42, mean=100) @@ -1369,7 +1370,7 @@ output <- unserialize_pb(req$content) print(output) @ -The OpenCPU server basically performs the following steps to process the above RPC request: +The OpenCPU server basically performs the following steps to process the above RPC request: <>= fnargs <- unserialize_pb(inputmsg) From noreply at r-forge.r-project.org Wed Nov 26 03:25:16 2014 From: noreply at r-forge.r-project.org (noreply at r-forge.r-project.org) Date: Wed, 26 Nov 2014 03:25:16 +0100 (CET) Subject: [Rprotobuf-commits] r911 - papers/jss Message-ID: <20141126022516.B5429187866@r-forge.r-project.org> Author: edd Date: 2014-11-26 03:25:15 +0100 (Wed, 26 Nov 2014) New Revision: 911 Modified: papers/jss/article.Rnw Log: s/protocol buffers/Protocol Buffers/ Modified: papers/jss/article.Rnw =================================================================== --- papers/jss/article.Rnw 2014-11-25 07:17:09 UTC (rev 910) +++ papers/jss/article.Rnw 2014-11-26 02:25:15 UTC (rev 911) @@ -1265,7 +1265,7 @@ uses the \code{rexp.proto} descriptor and the \code{serialize\_pb} and \code{unserialize\_pb} functions described in Section~\ref{sec:evaluation} to convert between \proglang{R} objects -and protocol buffer messages. +and Protocol Buffer messages. \subsection[HTTP GET: Retrieving an R object]{HTTP GET: Retrieving an \proglang{R} object} @@ -1277,17 +1277,17 @@ GET https://public.opencpu.org/ocpu/library/MASS/data/Animals/pb \end{verbatim} The postfix \code{/pb} in the URL tells the server to send this -object in the form of a protocol buffer message. +object in the form of a Protocol Buffer message. % Alternative formats include \code{/json}, \code{/csv}, \code{/rds} and others. 
If the request is successful, OpenCPU returns the serialized object with HTTP status code 200 and HTTP response header \code{Content-Type: application/x-protobuf}. The latter is the conventional MIME type that formally notifies the client to -interpret the response as a protocol buffer. +interpret the response as a Protocol Buffer. Because both HTTP and Protocol Buffers have libraries available for many languages, clients can be implemented in just a few lines of code. Below -is example code for both \proglang{R} and Python that retrieves an \proglang{R} data set encoded as a protocol buffer message from OpenCPU. +is example code for both \proglang{R} and Python that retrieves an \proglang{R} data set encoded as a Protocol Buffer message from OpenCPU. In \proglang{R}, we use the HTTP client from the \code{httr} package \citep{httr}. In this example we download a data set which is part of the base \proglang{R} distribution, so we can @@ -1315,7 +1315,7 @@ both the descriptor information as well as methods to read and manipulate the \proglang{R} object message. We use the HTTP client from the \code{urllib2} module in our example to retrieve the -encoded protocol buffer from the remote server then parse and print it +encoded Protocol Buffer from the remote server then parse and print it from Python. \begin{verbatim} @@ -1336,19 +1336,19 @@ The previous example used a simple \code{HTTP GET} method to retrieve an \proglang{R} object from a remote service (OpenCPU) encoded as a -protocol buffer. +Protocol Buffer. In many cases simple \code{HTTP GET} methods are insufficient, and a -more complete RPC system may need to create compact protocol buffers +more complete RPC system may need to create compact Protocol Buffers for each request to send to the remote server in addition to parsing -the response protocol buffers. +the response Protocol Buffers. 
The OpenCPU framework allows us to do arbitrary \proglang{R} function calls from within any programming language by encoding the arguments -in the request protocol buffer. The following example \proglang{R} +in the request Protocol Buffer. The following example \proglang{R} client code performs the remote function call \code{stats::rnorm(n=42, mean=100)}. The function arguments (in this case \code{n} and \code{mean}) as well as the return value (a vector with 42 random -numbers) are transferred using protocol buffer messages. RPC in OpenCPU +numbers) are transferred using Protocol Buffer messages. RPC in OpenCPU works like the \code{do.call} function in \proglang{R}, hence all arguments are contained within a list. From noreply at r-forge.r-project.org Wed Nov 26 04:09:13 2014 From: noreply at r-forge.r-project.org (noreply at r-forge.r-project.org) Date: Wed, 26 Nov 2014 04:09:13 +0100 (CET) Subject: [Rprotobuf-commits] r912 - in pkg: . vignettes Message-ID: <20141126030913.E2B8A1851D9@r-forge.r-project.org> Author: edd Date: 2014-11-26 04:09:13 +0100 (Wed, 26 Nov 2014) New Revision: 912 Modified: pkg/ChangeLog pkg/configure.in pkg/vignettes/RProtoBuf-intro.Rnw Log: * vignettes/RProtoBuf-intro.Rnw: Applied a few corrections spotted by Tim Hesterberg and communicated in email. Modified: pkg/ChangeLog =================================================================== --- pkg/ChangeLog 2014-11-26 02:25:15 UTC (rev 911) +++ pkg/ChangeLog 2014-11-26 03:09:13 UTC (rev 912) @@ -1,3 +1,8 @@ +2014-11-25 Dirk Eddelbuettel + + * vignettes/RProtoBuf-intro.Rnw: Applied a few corrections spotted by + Tim Hesterberg and communicated in email. 
+ 2014-11-24 Murray Stokely * inst/unitTests/runit.golden.message.R: remove trailing Modified: pkg/configure.in =================================================================== --- pkg/configure.in 2014-11-26 02:25:15 UTC (rev 911) +++ pkg/configure.in 2014-11-26 03:09:13 UTC (rev 912) @@ -8,7 +8,7 @@ AC_PREREQ(2.61) # Process this file with autoconf to produce a configure script. -AC_INIT([RProtoBuf],[0.4]) +AC_INIT([RProtoBuf],[0.4.1]) m4_include([m4/m4-ax_cxx_compile_stdcxx_0x.m4]) # We are using C++ Modified: pkg/vignettes/RProtoBuf-intro.Rnw =================================================================== --- pkg/vignettes/RProtoBuf-intro.Rnw 2014-11-26 02:25:15 UTC (rev 911) +++ pkg/vignettes/RProtoBuf-intro.Rnw 2014-11-26 03:09:13 UTC (rev 912) @@ -367,7 +367,7 @@ the S4 system, the \verb|@| operator is very rarely used. Fields of the message are retrieved or modified using the \verb|$| or \verb|[[| operators as seen on the previous section, and pseudo-methods can also -be called using the \verb|$| operator. The table~\ref{Message-methods-table} +be called using the \verb|$| operator. Table~\ref{Message-methods-table} describes the methods defined for the \texttt{Message} class : \begin{table}[h] @@ -431,7 +431,7 @@ \verb|$| is also used to call methods on the message, and the \verb|[[| operator can use the tag number of the field. -The table~\ref{table-get-types} details correspondance between +Table~\ref{table-get-types} details correspondance between the field type and the type of data that is retrieved by \verb|$| and \verb|[[|. @@ -495,7 +495,7 @@ writeLines( message$as.character() ) @ -The table~\ref{table-message-field-setters} describes the R types that +Table~\ref{table-message-field-setters} describes the R types that are allowed in the right hand side depending on the target type of the field. 
@@ -726,8 +726,8 @@ \subsubsection{Message\$setExtension method} \label{Message-method-setExtension} -The \texttt{setExtension} method can be used to get values -of a repeated field. +The \texttt{setExtension} method can be used to set an extension field of the +Message. <>= if (!exists("protobuf_unittest.TestAllTypes", @@ -882,7 +882,7 @@ \end{table} Similarly to messages, the \verb|$| operator can be used to extract -information from the descriptor, or invoke pseuso-methods. +information from the descriptor, or invoke pseudo-methods. Table~\ref{Descriptor-methods-table} describes the methods defined for the \texttt{Descriptor} class : \begin{table}[h] From noreply at r-forge.r-project.org Wed Nov 26 22:13:54 2014 From: noreply at r-forge.r-project.org (noreply at r-forge.r-project.org) Date: Wed, 26 Nov 2014 22:13:54 +0100 (CET) Subject: [Rprotobuf-commits] r913 - in pkg: . inst inst/proto inst/unitTests Message-ID: <20141126211354.B4296187322@r-forge.r-project.org> Author: murray Date: 2014-11-26 22:13:54 +0100 (Wed, 26 Nov 2014) New Revision: 913 Modified: pkg/ChangeLog pkg/inst/NEWS.Rd pkg/inst/proto/rexp.proto pkg/inst/unitTests/runit.serialize_pb.R Log: Address referee feedback by adding support for serializing function, language, and environment objects with serialize_pb. It's still not particularly useful since these are mostly R language constructs, but still I agree it will make our exposition clearer in section 6 and makes this functionality feel more complete. Add a unit test verifying that all 106 built-in datasets in R can be round-trip serialized/unserialized into protocol buffers without error. 
Modified: pkg/ChangeLog =================================================================== --- pkg/ChangeLog 2014-11-26 03:09:13 UTC (rev 912) +++ pkg/ChangeLog 2014-11-26 21:13:54 UTC (rev 913) @@ -1,3 +1,21 @@ +2014-11-26 Murray Stokely + + Address feedback from anonymous reviewer for JSS to make this + package more complete: + + * inst/unitTests/runit.serialize_pb.R: Add a test to verify that + we can serialize all 100+ built-in datasets with R and get an + identical object to the original once unserialized. + + * R/rexp_obj.R: Serialize function, language, and environment + objects by just falling back to R's native serialization and using + raw bytes to store them. This at least lets us round-trip encode + all native R types, even though these three only make sense in the + context of R. Greatly simplify the can_serialize_pb function. + + * inst/proto/rexp.proto: Add support for function, language, and + environment objects. + 2014-11-25 Dirk Eddelbuettel * vignettes/RProtoBuf-intro.Rnw: Applied a few corrections spotted by Modified: pkg/inst/NEWS.Rd =================================================================== --- pkg/inst/NEWS.Rd 2014-11-26 03:09:13 UTC (rev 912) +++ pkg/inst/NEWS.Rd 2014-11-26 21:13:54 UTC (rev 913) @@ -18,7 +18,12 @@ \item Update the default print methods to use \code{cat()} with \code{fill=TRUE} instead of \code{show()} to eliminate the confusing \code{[1]} since the classes in \cpkg{RProtoBuf} are not vectorized. - \item Add unit tests. + \item Add support for serializing function, language, and + environment objects by falling back to R's native serialization + with \code{serialize_pb} and \code{unserialize_pb} to make it + easy to serialize into a protocol buffer all 100+ of the + built-in datasets with R. + \item Add unit tests for all of the above. 
} \section{Changes in RProtoBuf version 0.4.1 (2014-03-25)}{ Modified: pkg/inst/proto/rexp.proto =================================================================== --- pkg/inst/proto/rexp.proto 2014-11-26 03:09:13 UTC (rev 912) +++ pkg/inst/proto/rexp.proto 2014-11-26 21:13:54 UTC (rev 913) @@ -1,8 +1,15 @@ // Originally written by Saptarshi Guha for RHIPE (http://www.rhipe.org) -// Released under Apache License 2.0, and reused with permission here +// Released under Apache License 2.0, and reused with permission here +// Extended in November 2014 with new types to support encoding +// language, environment, and function types from R. package rexp; +option java_package = "org.godhuli.rhipe"; +option java_outer_classname = "REXPProtos"; + +// TODO(mstokely): Refine this using the new protobuf 2.6 oneof field +// for unions. message REXP { enum RClass { STRING = 0; @@ -13,6 +20,9 @@ LIST = 5; LOGICAL = 6; NULLTYPE = 7; + LANGUAGE = 8; + ENVIRONMENT = 9; + FUNCTION = 10; } enum RBOOLEAN { F=0; @@ -20,7 +30,7 @@ NA=2; } - required RClass rclass = 1 ; + required RClass rclass = 1; repeated double realValue = 2 [packed=true]; repeated sint32 intValue = 3 [packed=true]; repeated RBOOLEAN booleanValue = 4; @@ -32,6 +42,9 @@ repeated string attrName = 11; repeated REXP attrValue = 12; + optional bytes languageValue = 13; + optional bytes environmentValue = 14; + optional bytes functionValue = 15; } message STRING { optional string strval = 1; @@ -41,4 +54,3 @@ optional double real = 1 [default=0]; required double imag = 2; } - Modified: pkg/inst/unitTests/runit.serialize_pb.R =================================================================== --- pkg/inst/unitTests/runit.serialize_pb.R 2014-11-26 03:09:13 UTC (rev 912) +++ pkg/inst/unitTests/runit.serialize_pb.R 2014-11-26 21:13:54 UTC (rev 913) @@ -3,11 +3,11 @@ test.serialize_pb <- function() { #verify that rexp.proto is loaded RProtoBuf:::pb(rexp.REXP) - + #serialize a nested list x <- list(foo=cars, bar=Titanic)
checkEquals(unserialize_pb(serialize_pb(x, NULL)), x) - + #a bit of everything, copied from jsonlite package set.seed('123') myobject <- list( @@ -22,6 +22,20 @@ somemissings = c(1,2,NA,NaN,5, Inf, 7 -Inf, 9, NA), myrawvec = charToRaw('This is a test') ); - + checkEquals(unserialize_pb(serialize_pb(myobject, NULL)), myobject) } + +test.serialize_pb.alldatasets <- function() { + datasets <- as.data.frame(data(package="datasets")$results) + datasets$name <- sub("\\s+.*$", "", datasets$Item) + + encoded.datasets <- sapply(datasets$name, + function(x) serialize_pb(get(x), NULL)) + + unserialized.datasets <- sapply(encoded.datasets, unserialize_pb) + + checkTrue(all(sapply(names(unserialized.datasets), + function(name) identical(get(name), + unserialized.datasets[[name]])))) +} \ No newline at end of file From noreply at r-forge.r-project.org Wed Nov 26 22:53:19 2014 From: noreply at r-forge.r-project.org (noreply at r-forge.r-project.org) Date: Wed, 26 Nov 2014 22:53:19 +0100 (CET) Subject: [Rprotobuf-commits] r914 - papers/jss Message-ID: <20141126215319.361631872F2@r-forge.r-project.org> Author: murray Date: 2014-11-26 22:53:18 +0100 (Wed, 26 Nov 2014) New Revision: 914 Modified: papers/jss/article.Rnw Log: Cut section 6 in half by removing the section explaining the caveats about formulas and types not supported by serialize_pb, and just explain now that we serialize everything, but in one sentence explain the caveat that we fall back to base::serialize for R-specific types like language, function, and environment. This basically removes the need for 6.1 at all, so remove that section, and move a tiny bit of the top text about which datasets we are using into the top of section 6.2 which explains the compression performance. Next step: Replace table with a plot as Hadley and one of the referees both suggested. Another referee suggested just merging 5 and 6 together completely.
Now that section 6 is one page plus one large table, that seems feasible, but I'll revisit that after replacing the table with a plot. I do want to make a stark distinction between having a schema (section 5) and not (section 6). I never really use the schema-less method of section 6, but it's probably the one that is easier for people to play around with if they don't have a real application with Protocol Buffers yet. Modified: papers/jss/article.Rnw =================================================================== --- papers/jss/article.Rnw 2014-11-26 21:13:54 UTC (rev 913) +++ papers/jss/article.Rnw 2014-11-26 21:53:18 UTC (rev 914) @@ -868,7 +868,7 @@ within \proglang{R}. The package also provides methods for converting arbitrary \proglang{R} data structures into Protocol Buffers and vice versa with a universal \proglang{R} object schema. The \code{serialize\_pb} and \code{unserialize\_pb} -functions serialize arbitrary \proglang{R} objects into a universal Protocol Buffer +functions serialize arbitrary \proglang{R} objects into a universal Protocol Buffer message: <<>>= @@ -877,76 +877,44 @@ @ In order to accomplish this, \pkg{RProtoBuf} uses the same catch-all \code{proto} -schema used by \pkg{RHIPE} for exchanging \proglang{R} data with Hadoop \citep{rhipe}. This +schema used by \pkg{RHIPE} for exchanging \proglang{R} data with Hadoop \citep{rhipe}. This schema, which we will refer to as \code{rexp.proto}, is printed in %appendix \ref{rexp.proto}. the appendix. The Protocol Buffer messages generated by \pkg{RProtoBuf} and -\pkg{RHIPE} are naturally compatible between the two systems because they use the +\pkg{RHIPE} are naturally compatible between the two systems because they use the same schema. This shows the power of using a schema-based cross-platform format such as Protocol Buffers: interoperability is achieved without effort or close coordination. -The \code{rexp.proto} schema supports all main \proglang{R} storage types holding \emph{data}.
-These include \code{NULL}, \code{list} and vectors of type \code{logical}, -\code{character}, \code{double}, \code{integer}, and \code{complex}. In addition, -every type can contain a named set of attributes, as is the case in \proglang{R}. The \code{rexp.proto} -schema does not support some of the special \proglang{R} specific storage types, such as \code{function}, -\code{language} or \code{environment}. Such objects have no native equivalent -type in Protocol Buffers, and have little meaning outside the context of \proglang{R}. -When serializing \proglang{R} objects using \code{serialize\_pb}, values or attributes of -unsupported types are skipped with a warning. If the user really wishes to serialize these -objects, they need to be converted into a supported type. For example, the can use -\code{deparse} to convert functions or language objects into strings, or \code{as.list} -for environments. +The \code{rexp.proto} schema natively supports all main \proglang{R} +storage types holding \emph{data}. These include \code{NULL}, +\code{list} and vectors of type \code{logical}, \code{character}, +\code{double}, \code{integer}, and \code{complex}. In addition, every +type can contain a named set of attributes, as is the case in +\proglang{R}. The storage types \code{function}, \code{language}, and +\code{environment} are specific to \proglang{R} and have no equivalent +native type in Protocol Buffers. These three types are supported by +first serializing with \code{base::serialize} in \proglang{R} and +then stored in a raw bytes field. -\subsection[Evaluation: Converting R data sets]{Evaluation: Converting \proglang{R} data sets} -To illustrate how this method works, we attempt to convert all of the built-in -data sets from \proglang{R} into this serialized Protocol Buffer representation. 
+\subsection[Evaluation: Serializing R data sets]{Evaluation: Serializing \proglang{R} data sets} +\label{sec:compression} -<>= +<>= datasets <- as.data.frame(data(package="datasets")$results) datasets$name <- sub("\\s+.*$", "", datasets$Item) n <- nrow(datasets) @ -There are \Sexpr{n} standard data sets included in the \pkg{datasets} -package included with \proglang{R}. These data sets include data frames, matrices, time series, tables lists, -and some more exotic data classes. The \code{can\_serialize\_pb} method is -used to determine which of those can fully be converted to the \code{rexp.proto} -Protocol Buffer representation. This method simply checks if any of the values or -attributes in an object is of an unsupported type: +This section evaluates the effectiveness of serializing arbitrary +\proglang{R} data structures into Protocol Buffers. We use the +\Sexpr{n} standard data sets included in the \pkg{datasets} package +included with \proglang{R} as our evaluation data. These data sets +include data frames, matrices, time series, tables, lists, and some +more exotic data classes. For each data set, we compare how many +bytes are used to store the data set using four different methods: -<>= -m <- sum(sapply(datasets$name, function(x) can_serialize_pb(get(x)))) -@ - -\Sexpr{m} data sets can be converted to Protocol Buffers -without loss of information (\Sexpr{format(100*m/n,digits=1)}\%). Upon closer -inspection, all other data sets are objects of class \code{nfnGroupedData}. -This class represents a special type of data frame that has some additional -attributes (such as a \emph{formula} object) used by the \pkg{nlme} package \citep{nlme}. -Because formulas are \proglang{R} \emph{language} objects, they have little meaning to -other systems, and are not supported by the \code{rexp.proto} descriptor. -When \code{serialize\_pb} is used on objects of this class, it will serialize -the data frame and all attributes, except for the formula. 
- -<<>>= -attr(CO2, "formula") -msg <- serialize_pb(CO2, NULL) -object <- unserialize_pb(msg) -identical(CO2, object) -identical(class(CO2), class(object)) -identical(dim(CO2), dim(object)) -attr(object, "formula") -@ - -\subsection{Compression performance} -\label{sec:compression} - -This section compares how many bytes are used to store data sets -using four different methods: - \begin{itemize} \item normal \proglang{R} serialization \citep{serialization}, \item \proglang{R} serialization followed by gzip, @@ -977,9 +945,9 @@ @ Table~\ref{tab:compression} shows the sizes of 50 sample \proglang{R} data sets as -returned by object.size() compared to the serialized sizes. +returned by \code{object.size()} compared to the serialized sizes. %The summary compression sizes are listed below, and a full table for a -%sample of 50 data sets is included on the next page. +%sample of 50 data sets is included on the next page. Note that Protocol Buffer serialization results in slightly smaller byte streams compared to native \proglang{R} serialization in most cases, but this difference disappears if the results are compressed with gzip. 
@@ -1443,7 +1411,6 @@ \begin{verbatim} package rexp; - message REXP { enum RClass { STRING = 0; @@ -1454,14 +1421,16 @@ LIST = 5; LOGICAL = 6; NULLTYPE = 7; + LANGUAGE = 8; + ENVIRONMENT = 9; + FUNCTION = 10; } enum RBOOLEAN { F=0; T=1; NA=2; } - - required RClass rclass = 1 ; + required RClass rclass = 1; repeated double realValue = 2 [packed=true]; repeated sint32 intValue = 3 [packed=true]; repeated RBOOLEAN booleanValue = 4; @@ -1471,6 +1440,9 @@ repeated REXP rexpValue = 8; repeated string attrName = 11; repeated REXP attrValue = 12; + optional bytes languageValue = 13; + optional bytes environmentValue = 14; + optional bytes functionValue = 15; } message STRING { optional string strval = 1; From noreply at r-forge.r-project.org Thu Nov 27 02:45:53 2014 From: noreply at r-forge.r-project.org (noreply at r-forge.r-project.org) Date: Thu, 27 Nov 2014 02:45:53 +0100 (CET) Subject: [Rprotobuf-commits] r915 - papers/jss Message-ID: <20141127014553.1CDB1185FB9@r-forge.r-project.org> Author: murray Date: 2014-11-27 02:45:52 +0100 (Thu, 27 Nov 2014) New Revision: 915 Modified: papers/jss/article.bib Log: Add bib entries for faithful and crimtab datasets in base R which are the two outlier datasets in the protobuf vs R serialization comparison.
Modified: papers/jss/article.bib =================================================================== --- papers/jss/article.bib 2014-11-26 21:53:18 UTC (rev 914) +++ papers/jss/article.bib 2014-11-27 01:45:52 UTC (rev 915) @@ -1,3 +1,19 @@ + at article{garson1900metric, + title={The Metric System of Identification of Criminals, as Used in Great Britain and Ireland}, + author={Garson, John George}, + journal={Journal of the Anthropological Institute of Great Britain and Ireland}, + pages={161--198}, + year={1900}, + publisher={JSTOR} +} + at article{azzalini1990look, + title={A look at some data on the Old Faithful geyser}, + author={Azzalini, A and Bowman, AW}, + journal={Applied Statistics}, + pages={357--365}, + year={1990}, + publisher={JSTOR} +} @article{dean2009designs, title={Designs, lessons and advice from building large distributed systems}, author={Dean, Jeff}, From noreply at r-forge.r-project.org Thu Nov 27 02:48:43 2014 From: noreply at r-forge.r-project.org (noreply at r-forge.r-project.org) Date: Thu, 27 Nov 2014 02:48:43 +0100 (CET) Subject: [Rprotobuf-commits] r916 - papers/jss Message-ID: <20141127014843.A94071877B5@r-forge.r-project.org> Author: murray Date: 2014-11-27 02:48:43 +0100 (Thu, 27 Nov 2014) New Revision: 916 Modified: papers/jss/article.Rnw Log: Improve section 6 to address referee feedback: Replace massive full page 50 row table with a more succinct plot of the relevant data points, and label the outliers of this plot so we can talk about the interesting cases in the text. Add a small 3-row table beneath the plot showing the data for the two outliers plus the aggregate of all datasets. Explain why protocol buffers are so much more space-efficient for one dataset, and slightly less efficient for another. 
Modified: papers/jss/article.Rnw =================================================================== --- papers/jss/article.Rnw 2014-11-27 01:45:52 UTC (rev 915) +++ papers/jss/article.Rnw 2014-11-27 01:48:43 UTC (rev 916) @@ -941,21 +941,61 @@ "gzipped serialized"=datasets$R.serialize.size.gz, "RProtoBuf"=datasets$RProtoBuf.serialize.size, "gzipped RProtoBuf"=datasets$RProtoBuf.serialize.size.gz, + "ratio.serialized" = datasets$R.serialize.size / datasets$object.size, + "ratio.rprotobuf" = datasets$RProtoBuf.serialize.size / datasets$object.size, + "ratio.serialized.gz" = datasets$R.serialize.size.gz / datasets$object.size, + "ratio.rprotobuf.gz" = datasets$RProtoBuf.serialize.size.gz / datasets$object.size, + "savings.serialized" = 1-(datasets$R.serialize.size / datasets$object.size), + "savings.rprotobuf" = 1-(datasets$RProtoBuf.serialize.size / datasets$object.size), + "savings.serialized.gz" = 1-(datasets$R.serialize.size.gz / datasets$object.size), + "savings.rprotobuf.gz" = 1-(datasets$RProtoBuf.serialize.size.gz / datasets$object.size), check.names=FALSE) + +all.df<-data.frame(dataset="TOTAL", object.size=sum(datasets$object.size), + "serialized"=sum(datasets$R.serialize.size), + "gzipped serialized"=sum(datasets$R.serialize.size.gz), + "RProtoBuf"=sum(datasets$RProtoBuf.serialize.size), + "gzipped RProtoBuf"=sum(datasets$RProtoBuf.serialize.size.gz), + "ratio.serialized" = sum(datasets$R.serialize.size) / sum(datasets$object.size), + "ratio.rprotobuf" = sum(datasets$RProtoBuf.serialize.size) / sum(datasets$object.size), + "ratio.serialized.gz" = sum(datasets$R.serialize.size.gz) / sum(datasets$object.size), + "ratio.rprotobuf.gz" = sum(datasets$RProtoBuf.serialize.size.gz) / sum(datasets$object.size), + "savings.serialized" = 1-(sum(datasets$R.serialize.size) / sum(datasets$object.size)), + "savings.rprotobuf" = 1-(sum(datasets$RProtoBuf.serialize.size) / sum(datasets$object.size)), + "savings.serialized.gz" = 1-(sum(datasets$R.serialize.size.gz) / 
sum(datasets$object.size)), + "savings.rprotobuf.gz" = 1-(sum(datasets$RProtoBuf.serialize.size.gz) / sum(datasets$object.size)), + check.names=FALSE) +clean.df<-rbind(clean.df, all.df) @ -Table~\ref{tab:compression} shows the sizes of 50 sample \proglang{R} data sets as -returned by \code{object.size()} compared to the serialized sizes. -%The summary compression sizes are listed below, and a full table for a -%sample of 50 data sets is included on the next page. +Figure~\ref{fig:compression} shows the space savings $\left(1 - \frac{\textrm{Compressed Size}}{\textrm{Uncompressed Size}}\right)$ for each of the data sets using each of these four methods. The associated table shows the exact data sizes for two outliers and the aggregate of all \Sexpr{n} data sets. Note that Protocol Buffer serialization results in slightly smaller byte streams compared to native \proglang{R} serialization in most cases, but this difference disappears if the results are compressed with gzip. %Sizes are comparable but Protocol Buffers provide simple getters and setters %in multiple languages instead of requiring other programs to parse the \proglang{R} %serialization format. % \citep{serialization}. -One takeaway from this table is that the universal \proglang{R} object schema -included in \pkg{RProtoBuf} does not in general provide + +The \code{crimtab} dataset of anthropometry measurements of British +prisoners \citep{garson1900metric} +shows the greatest difference in the space savings when +using Protocol Buffers compared to \proglang{R} native serialization. +This dataset is a 42x22 table of integers, most equal to 0. Small +integer values like this can be very efficiently encoded by the +\emph{Varint} integer encoding scheme used by Protocol Buffers, which +uses a variable number of bytes for each value.
+ +The other extreme is represented by the \code{faithful} dataset of +waiting time and eruptions of the Old Faithful geyser in Yellowstone +National Park, Wyoming, USA \citep{azzalini1990look}. This dataset is +a data frame with 272 observations of 2 numeric variables. The +\proglang{R} native serialization of repeated numeric values is more +space-efficient, resulting in a slightly smaller object size compared +to the serialized Protocol Buffer equivalent. + +This evaluation shows that the \code{rexp.proto} universal +\proglang{R} object schema included in \pkg{RProtoBuf} does not in +general provide any significant saving in file size compared to the normal serialization mechanism in \proglang{R}. % redundant: which is seen as equally compact. @@ -964,81 +1004,62 @@ application-specific schema has been defined. The example in the next section satisfies both of these conditions. -% latex table generated in \proglang{R} 3.0.2 by xtable 1.7-0 package -% Fri Dec 27 17:00:03 2013 -\begin{table}[h!] +\begin{figure}[t!] \begin{center} - \small -\scalebox{0.9}{ -\begin{tabular}{lrrrrr} +<>= +plot(clean.df$savings.serialized, clean.df$savings.rprotobuf, pch=1, col="red", las=1, xlab="Serialization Space Savings", ylab="Protocol Buffer Space Savings") +points(clean.df$savings.serialized.gz, clean.df$savings.rprotobuf.gz,pch=2, col="blue") +# grey dotted diagonal +abline(a=0,b=1, col="grey",lty=3) + +# find point furthest off the X axis. +clean.df$savings.diff <- clean.df$savings.serialized - clean.df$savings.rprotobuf +clean.df$savings.diff.gz <- clean.df$savings.serialized.gz - clean.df$savings.rprotobuf.gz + +# The one to label. 
+tmp.df <- clean.df[which(clean.df$savings.diff == min(clean.df$savings.diff)),] +# This minimum means most to the left of our line, so pos=2 is label to the left +text(tmp.df$savings.serialized, tmp.df$savings.rprotobuf, labels=tmp.df$dataset, pos=2) +text(tmp.df$savings.serialized.gz, tmp.df$savings.rprotobuf.gz, labels=tmp.df$dataset, pos=2) + +tmp.df <- clean.df[which(clean.df$savings.diff == max(clean.df$savings.diff)),] +# This maximum means most to the right of the diagonal, so pos=4 is label to the right +text(tmp.df$savings.serialized, tmp.df$savings.rprotobuf, labels=tmp.df$dataset, pos=4) +text(tmp.df$savings.serialized.gz, tmp.df$savings.rprotobuf.gz, labels=tmp.df$dataset, pos=4) + +#outlier.dfs <- clean.df[c(which(clean.df$savings.diff == min(clean.df$savings.diff)), + +legend("topleft", c("Raw", "Gzip Compressed"), pch=1:2, col=c("red", "blue")) + +interesting.df <- clean.df[unique(c(which(clean.df$savings.diff == min(clean.df$savings.diff)), + which(clean.df$savings.diff == max(clean.df$savings.diff)), + which(clean.df$savings.diff.gz == max(clean.df$savings.diff.gz)), + which(clean.df$dataset == "TOTAL"))),c("dataset", "object.size", "serialized", "gzipped serialized", "RProtoBuf", "gzipped RProtoBuf", "savings.serialized", "savings.serialized.gz", "savings.rprotobuf", "savings.rprotobuf.gz")] +# Print without .00 in xtable +interesting.df$object.size <- as.integer(interesting.df$object.size) +@ + +% latex table generated in R 3.0.2 by xtable 1.7-0 package +% Wed Nov 26 15:31:30 2014 +%\begin{table}[ht] +%\begin{center} +\begin{tabular}{rlrrrrr} \toprule Data Set & object.size & \multicolumn{2}{c}{\proglang{R} Serialization} & - \multicolumn{2}{c}{RProtoBuf Serial.} \\ + \multicolumn{2}{c}{RProtoBuf Serialization} \\ & & default & gzipped & default & gzipped \\ \cmidrule(r){2-6} - uspop & 584 & 268 & 172 & 211 & 148 \\ - Titanic & 1960 & 633 & 257 & 481 & 249 \\ - volcano & 42656 & 42517 & 5226 & 42476 & 4232 \\ - euro.cross & 2728 & 1319 & 910 & 
1207 & 891 \\ - attenu & 14568 & 8234 & 2165 & 7771 & 2336 \\ - ToothGrowth & 2568 & 1486 & 349 & 1239 & 391 \\ - lynx & 1344 & 1028 & 429 & 971 & 404 \\ - nottem & 2352 & 2036 & 627 & 1979 & 641 \\ - sleep & 2752 & 746 & 282 & 483 & 260 \\ - co2 & 4176 & 3860 & 1473 & 3803 & 1453 \\ - austres & 1144 & 828 & 439 & 771 & 410 \\ - ability.cov & 1944 & 716 & 357 & 589 & 341 \\ - EuStockMarkets & 60664 & 59785 & 21232 & 59674 & 19882 \\ - treering & 64272 & 63956 & 17647 & 63900 & 17758 \\ - freeny.x & 1944 & 1445 & 1311 & 1372 & 1289 \\ - Puromycin & 2088 & 813 & 306 & 620 & 320 \\ - warpbreaks & 2768 & 1231 & 310 & 811 & 343 \\ - BOD & 1088 & 334 & 182 & 226 & 168 \\ - sunspots & 22992 & 22676 & 6482 & 22620 & 6742 \\ - beaver2 & 4184 & 3423 & 751 & 3468 & 840 \\ - anscombe & 2424 & 991 & 375 & 884 & 352 \\ - esoph & 5624 & 3111 & 548 & 2240 & 665 \\ - PlantGrowth & 1680 & 646 & 303 & 459 & 314 \\ - infert & 15848 & 14328 & 1172 & 13197 & 1404 \\ - BJsales & 1632 & 1316 & 496 & 1259 & 465 \\ - stackloss & 1688 & 917 & 293 & 844 & 283 \\ - crimtab & 7936 & 4641 & 713 & 1655 & 576 \\ - LifeCycleSavings & 6048 & 3014 & 1420 & 2825 & 1407 \\ - Harman74.cor & 9144 & 6056 & 2045 & 5861 & 2070 \\ - nhtemp & 912 & 596 & 240 & 539 & 223 \\ - faithful & 5136 & 4543 & 1339 & 4936 & 1776 \\ - freeny & 5296 & 2465 & 1518 & 2271 & 1507 \\ - discoveries & 1232 & 916 & 199 & 859 & 180 \\ - state.x77 & 7168 & 4251 & 1754 & 4068 & 1756 \\ - pressure & 1096 & 498 & 277 & 427 & 273 \\ - fdeaths & 1008 & 692 & 291 & 635 & 272 \\ - euro & 976 & 264 & 186 & 202 & 161 \\ - LakeHuron & 1216 & 900 & 420 & 843 & 404 \\ - mtcars & 6736 & 3798 & 1204 & 3633 & 1206 \\ - precip & 4992 & 1793 & 813 & 1615 & 815 \\ - state.area & 440 & 422 & 246 & 405 & 235 \\ - attitude & 3024 & 1990 & 544 & 1920 & 561 \\ - randu & 10496 & 9794 & 8859 & 10441 & 9558 \\ - state.name & 3088 & 844 & 408 & 724 & 415 \\ - airquality & 5496 & 4551 & 1241 & 2874 & 1294 \\ - airmiles & 624 & 308 & 170 & 251 & 148 \\ - 
quakes & 33112 & 32246 & 9898 & 29063 & 11595 \\ - islands & 3496 & 1232 & 563 & 1098 & 561 \\ - OrchardSprays & 3600 & 2164 & 445 & 1897 & 483 \\ - WWWusage & 1232 & 916 & 274 & 859 & 251 \\ - \bottomrule -% Total & 391176 & 327537 & 99161 & 313456 & 100308 \\ - Relative Size & 100\% & 83.7\% & 25.3\% & 80.1\% & 25.6\%\\ - \bottomrule + crimtab & 7,936 & 4,641 (41.5\%) & 713 (91.0\%) & 1,655 (79.2\%) & 576 (92.7\%)\\ + faithful & 5,136 & 4,543 (11.5\%) & 1,339 (73.9\%) & 4,936 (3.9\%) & 1,776 (65.5\%)\\ + \hline + All & 605,256 & 461,667 (24\%) & 138,937 (77\%) & 435,360 (28\%) & 142,134 (77\%)\\ +\hline \end{tabular} -} -\caption{Serialization sizes for default serialization in \proglang{R} and - \pkg{RProtoBuf} for 50 \proglang{R} data sets.} -\label{tab:compression} \end{center} -\end{table} +\caption{(Top) Relative space savings of Protocol Buffers and native \proglang{R} serialization over the raw object sizes of each of the \Sexpr{n} data sets in the \pkg{datasets} package. Points to the left of the dotted $y=x$ line represent datasets that are more efficiently encoded with Protocol Buffers. 
(Bottom) Absolute space savings of two outlier datasets and the aggregate performance of all datasets.} +\label{fig:compression} +\end{figure} - \section{Application: Distributed data collection with MapReduce} \label{sec:mapreduce} @@ -1164,7 +1185,7 @@ [1] "message of type 'HistogramTools.HistogramState' with 3 fields set" -R> plot(as.histogram(hist)) +R> plot(as.histogram(hist), main="") \end{lstlisting} %\end{Code} @@ -1173,7 +1194,7 @@ require(HistogramTools) readProtoFiles(package="HistogramTools") hist <- HistogramTools.HistogramState$read("hist.pb") -plot(as.histogram(hist)) +plot(as.histogram(hist), main="") @ \end{center} From noreply at r-forge.r-project.org Thu Nov 27 03:14:07 2014 From: noreply at r-forge.r-project.org (noreply at r-forge.r-project.org) Date: Thu, 27 Nov 2014 03:14:07 +0100 (CET) Subject: [Rprotobuf-commits] r917 - pkg/R Message-ID: <20141127021408.016FC183F26@r-forge.r-project.org> Author: murray Date: 2014-11-27 03:14:07 +0100 (Thu, 27 Nov 2014) New Revision: 917 Modified: pkg/R/rexp_obj.R Log: Oops left this out of previous submit, the code implementing the serialization for environments, functions, and languages. Could do something better, especially for environments, but this is fine for now. Modified: pkg/R/rexp_obj.R =================================================================== --- pkg/R/rexp_obj.R 2014-11-27 01:48:43 UTC (rev 916) +++ pkg/R/rexp_obj.R 2014-11-27 02:14:07 UTC (rev 917) @@ -1,3 +1,9 @@ +# Functions to convert an arbitrary R object into a protocol buffer +# using the universal rexp.proto descriptor. 
+# +# Written by Jeroen Ooms +# Modified 2014 by Murray Stokely to support language and environment types + rexp_obj <- function(obj){ sm <- typeof(obj); msg <- switch(sm, @@ -8,10 +14,13 @@ "integer" = rexp_integer(obj), "list" = rexp_list(obj), "logical" = rexp_logical(obj), + "language" = rexp_language(obj), + "environment" = rexp_environment(obj), + "function" = rexp_function(obj), "NULL" = rexp_null(), {warning("Unsupported R object type:", sm); rexp_null()} ); - + attrib <- attributes(obj) msg$attrName <- names(attrib) msg$attrValue <- lapply(attrib, rexp_obj) @@ -25,6 +34,21 @@ new(pb(rexp.REXP), rclass = 0, stringValue=xvalue) } +# For objects that only make sense in R, we just fall back +# to R's default serialization. + +rexp_language <- function(obj){ + new(pb(rexp.REXP), rclass= 8, languageValue = base::serialize(obj, NULL)) +} + +rexp_environment <- function(obj){ + new(pb(rexp.REXP), rclass= 9, environmentValue = base::serialize(obj, NULL)) +} + +rexp_function <- function(obj){ + new(pb(rexp.REXP), rclass= 10, functionValue = base::serialize(obj, NULL)) +} + rexp_raw <- function(obj){ new(pb(rexp.REXP), rclass= 1, rawValue = obj) } @@ -62,7 +86,7 @@ unrexp <- function(msg){ stopifnot(is(msg, "Message")) stopifnot(msg@type == "rexp.REXP") - + myrexp <- as.list(msg) xobj <- switch(as.character(myrexp$rclass), "0" = unrexp_string(myrexp), @@ -73,15 +97,18 @@ "5" = unrexp_list(myrexp), "6" = unrexp_logical(myrexp), "7" = unrexp_null(), + "8" = unrexp_language(myrexp), + "9" = unrexp_environment(myrexp), + "10" = unrexp_function(myrexp), stop("Unsupported rclass:", myrexp$rclass) ) - + if(length(myrexp$attrValue)){ attrib <- lapply(myrexp$attrValue, unrexp) names(attrib) <- myrexp$attrName attributes(xobj) <- attrib } - + xobj } @@ -125,6 +152,21 @@ NULL } +unrexp_language <- function(myrexp){ + xvalue <- myrexp$languageValue + unserialize(xvalue) +} + +unrexp_environment <- function(myrexp){ + xvalue <- myrexp$environmentValue + unserialize(xvalue) +} 
+ +unrexp_function <- function(myrexp){ + xvalue <- myrexp$functionValue + unserialize(xvalue) +} + #Helper function to lookup a PB descriptor pb <- function(name){ descriptor <- deparse(substitute(name)) @@ -134,28 +176,8 @@ get(descriptor, "RProtoBuf:DescriptorPool") } -#Checks if object can be serialized +#Checks if object can be serialized can_serialize_pb <- rexp_valid <- function(obj) { - valid.types <- c("character", "raw", "double", "complex", "integer", - "list", "logical", "NULL") - sm <- typeof(obj) - if (sm %in% valid.types) { - if (sm == "list") { - if (any(! unlist(lapply(obj, rexp_valid)))) { - return(FALSE) - } - } - } else { - return(FALSE) - } - attrib <- attributes(obj) - if (is.null(attrib)) { - return(TRUE) - } - if (rexp_valid(names(attrib))) { - if (rexp_valid(unname(attrib))) { - return(TRUE) - } - } - return(FALSE) +# We can now serialize everything. just call back to R serialization + return(TRUE) } From noreply at r-forge.r-project.org Sun Nov 30 23:42:55 2014 From: noreply at r-forge.r-project.org (noreply at r-forge.r-project.org) Date: Sun, 30 Nov 2014 23:42:55 +0100 (CET) Subject: [Rprotobuf-commits] r918 - pkg/inst/proto Message-ID: <20141130224255.CB02D187731@r-forge.r-project.org> Author: murray Date: 2014-11-30 23:42:55 +0100 (Sun, 30 Nov 2014) New Revision: 918 Modified: pkg/inst/proto/rexp.proto Log: Correct duplicate field id used. Modified: pkg/inst/proto/rexp.proto =================================================================== --- pkg/inst/proto/rexp.proto 2014-11-27 02:14:07 UTC (rev 917) +++ pkg/inst/proto/rexp.proto 2014-11-30 22:42:55 UTC (rev 918) @@ -44,7 +44,7 @@ repeated REXP attrValue = 12; optional bytes languageValue = 13; optional bytes environmentValue = 14; - optional bytes functionValue = 14; + optional bytes functionValue = 15; } message STRING { optional string strval = 1;