From noreply at r-forge.r-project.org Mon Dec 1 02:08:24 2014 From: noreply at r-forge.r-project.org (noreply at r-forge.r-project.org) Date: Mon, 1 Dec 2014 02:08:24 +0100 (CET) Subject: [Rprotobuf-commits] r919 - papers/jss Message-ID: <20141201010824.48124187867@r-forge.r-project.org> Author: jeroenooms Date: 2014-12-01 02:08:23 +0100 (Mon, 01 Dec 2014) New Revision: 919 Modified: papers/jss/article.Rnw papers/jss/article.bib Log: Update citations Modified: papers/jss/article.Rnw =================================================================== --- papers/jss/article.Rnw 2014-11-30 22:42:55 UTC (rev 918) +++ papers/jss/article.Rnw 2014-12-01 01:08:23 UTC (rev 919) @@ -91,7 +91,7 @@ University of California\\ Los Angeles, CA, USA\\ E-mail: \email{jeroen.ooms at stat.ucla.edu}\\ - URL: \url{http://jeroenooms.github.io} + URL: \url{https://jeroenooms.github.io} } %% It is also possible to add a telephone and fax number %% before the e-mail in the following format: Modified: papers/jss/article.bib =================================================================== --- papers/jss/article.bib 2014-11-30 22:42:55 UTC (rev 918) +++ papers/jss/article.bib 2014-12-01 01:08:23 UTC (rev 919) @@ -117,12 +117,12 @@ url = {http://CRAN.R-project.org/package=rjson}, } - at Manual{jsonlite, - title = {jsonlite: A Smarter JSON Encoder/Decoder for R}, + at article{jsonlite, + title = {The jsonlite Package: A Practical and Consistent Mapping Between JSON Data and R Objects}, + journal = {arXiv: Computation (stat.CO); Mathematical Software (cs.MS); Software Engineering (cs.SE)}, author = {Jeroen Ooms}, year = 2014, - note = {R package version 0.9.4}, - url = {http://github.com/jeroenooms/jsonlite#readme}, + url = {http://arxiv.org/abs/1403.2805}, } @Manual{rmongodb, @@ -457,13 +457,12 @@ url = {http://CRAN.R-project.org/package=httr}, } - at Manual{opencpu, - title = {OpenCPU System for Embedded Statistical Computation - and Reproducible Research}, + at article{opencpu, + journal = {arXiv: Computation (stat.CO); Mathematical Software (cs.MS); Software Engineering (cs.SE)}, + title = {The OpenCPU System: Towards a Universal Interface for Scientific Computing through Separation of Concerns}, author = {Jeroen Ooms}, - year = 2013, - note = {R package version 1.2.2}, - url = {http://www.opencpu.org}, + year = 2014, + url = {http://arxiv.org/abs/1406.4806}, } @article{shafranovich2005common, From noreply at r-forge.r-project.org Mon Dec 1 03:00:55 2014 From: noreply at r-forge.r-project.org (noreply at r-forge.r-project.org) Date: Mon, 1 Dec 2014 03:00:55 +0100 (CET) Subject: [Rprotobuf-commits] r920 - papers/jss Message-ID: <20141201020055.82D34183E72@r-forge.r-project.org> Author: jeroenooms Date: 2014-12-01 03:00:53 +0100 (Mon, 01 Dec 2014) New Revision: 920 Modified: papers/jss/article.Rnw Log: Use shorter ocpu URL Modified: papers/jss/article.Rnw =================================================================== --- papers/jss/article.Rnw 2014-12-01 01:08:23 UTC (rev 919) +++ papers/jss/article.Rnw 2014-12-01 02:00:53 UTC (rev 920) @@ -1263,7 +1263,7 @@ client performs the following HTTP request: \begin{verbatim} - GET https://public.opencpu.org/ocpu/library/MASS/data/Animals/pb + GET https://demo.ocpu.io/MASS/data/Animals/pb \end{verbatim} The postfix \code{/pb} in the URL tells the server to send this object in the form of a Protocol Buffer message. 
@@ -1286,7 +1286,7 @@ library("RProtoBuf") library("httr") -req <- GET('https://public.opencpu.org/ocpu/library/MASS/data/Animals/pb') +req <- GET('https://demo.ocpu.io/MASS/data/Animals/pb') output <- unserialize_pb(req$content) identical(output, MASS::Animals) @@ -1311,7 +1311,7 @@ import urllib2 from rexp_pb2 import REXP -req = urllib2.Request('https://public.opencpu.org/ocpu/library/MASS/data/Animals/pb') +req = urllib2.Request('https://demo.ocpu.io/MASS/data/Animals/pb') res = urllib2.urlopen(req) msg = REXP() @@ -1349,7 +1349,7 @@ payload <- serialize_pb(args, NULL) req <- POST ( - url = "https://public.opencpu.org/ocpu/library/stats/R/rnorm/pb", + url = "https://demo.ocpu.io/stats/R/rnorm/pb", body = payload, add_headers ( "Content-Type" = "application/x-protobuf" From noreply at r-forge.r-project.org Mon Dec 1 08:58:25 2014 From: noreply at r-forge.r-project.org (noreply at r-forge.r-project.org) Date: Mon, 1 Dec 2014 08:58:25 +0100 (CET) Subject: [Rprotobuf-commits] r921 - papers/jss Message-ID: <20141201075825.D1A561875B9@r-forge.r-project.org> Author: jeroenooms Date: 2014-12-01 08:58:25 +0100 (Mon, 01 Dec 2014) New Revision: 921 Modified: papers/jss/article.Rnw Log: Rewrite mapreduce introduction. Modified: papers/jss/article.Rnw =================================================================== --- papers/jss/article.Rnw 2014-12-01 02:00:53 UTC (rev 920) +++ papers/jss/article.Rnw 2014-12-01 07:58:25 UTC (rev 921) @@ -1063,37 +1063,35 @@ \section{Application: Distributed data collection with MapReduce} \label{sec:mapreduce} -Protocol Buffers have been used extensively at Google for almost all -RPC protocols, and for storing structured information in a variety of -persistent storage systems since 2000 \citep{dean2009designs}. The -\pkg{RProtoBuf} package has been in widespread use by hundreds of -statisticians and software engineers at Google since 2010. This -section describes a simplified example of a common design pattern of -collecting a large structured data set in one language for later -analysis in \proglang{R}. +Protocol Buffers are used extensively at Google for almost all +RPC protocols, and to store structured information on a variety of +persistent storage systems \citep{dean2009designs}. Since the +initial release in 2010, hundreds of Google's statisticians and +software engineers use the \pkg{RProtoBuf} package on daily basis +to interact with these systems from within \proglang{R}. +The current section illustrates the power of Protocol Buffers to +collect and manage large structured data in one language +before analyzing it in \proglang{R}. Our example uses MapReduce +\citep{dean2008mapreduce}, which has emerged in the last +decade as a popular design pattern to facilitate parallel +processing of big data using distributed computing clusters. -Many large data sets in fields such as particle physics and information -processing are stored in binned or histogram form in order to reduce -the data storage requirements \citep{scott2009multivariate}. In the -last decade, the MapReduce programming model \citep{dean2008mapreduce} -has emerged as a popular design pattern that enables the processing of -very large data sets on large compute clusters. - -Many types of data analysis over large data sets may involve very rare +Big data sets in fields such as particle physics and information +processing are often stored in binned (histogram) form in order +to reduce storage requirements \citep{scott2009multivariate}. 
+Because analysis over such large data sets may involve very rare phenomenon or deal with highly skewed data sets or inflexible -raw data storage systems from which unbiased sampling is not feasible. -In such situations, MapReduce and binning may be combined as a +raw data storage systems, unbiased sampling is often not feasible. +In these situations, MapReduce and binning may be combined as a pre-processing step for a wide range of statistical and scientific analyses \citep{blocker2013}. There are two common patterns for generating histograms of large data -sets in a single pass with MapReduce. In the first method, each +sets in a single pass with MapReduce. In the first method, each mapper task generates a histogram over a subset of the data that it has been assigned, serializes this histogram and sends it to one or more reducer tasks which merge the intermediate histograms from the -mappers. - -In the second method, illustrated in +mappers. In the second method, illustrated in Figure~\ref{fig:mr-histogram-pattern1}, each mapper rounds a data point to a bucket width and outputs that bucket as a key and '1' as a value. Reducers then sum up all of the values with the same key and From noreply at r-forge.r-project.org Mon Dec 1 22:54:51 2014 From: noreply at r-forge.r-project.org (noreply at r-forge.r-project.org) Date: Mon, 1 Dec 2014 22:54:51 +0100 (CET) Subject: [Rprotobuf-commits] r922 - papers/jss Message-ID: <20141201215452.03F3B18788A@r-forge.r-project.org> Author: murray Date: 2014-12-01 22:54:51 +0100 (Mon, 01 Dec 2014) New Revision: 922 Modified: papers/jss/article.Rnw Log: Make the dotted y=x line in figure 2 dashed with a bigger width to make it more visible. Suggested by Steve Scott. Also add some commented out code to add line numbers to make review copies for folks that have offered to do a final review before our resubmit. Modified: papers/jss/article.Rnw =================================================================== --- papers/jss/article.Rnw 2014-12-01 07:58:25 UTC (rev 921) +++ papers/jss/article.Rnw 2014-12-01 21:54:51 UTC (rev 922) @@ -3,6 +3,10 @@ \usepackage{listings} \usepackage[toc,page]{appendix} +% Line numbers for drafts. +%\usepackage[switch, modulo]{lineno} +%\linenumbers + %%%%%%%%%%%%%%%%%%%%%%%%%%%%% % Spelling Standardization: % Protocol Buffers, not protocol buffers @@ -1010,7 +1014,7 @@ plot(clean.df$savings.serialized, clean.df$savings.rprotobuf, pch=1, col="red", las=1, xlab="Serialization Space Savings", ylab="Protocol Buffer Space Savings") points(clean.df$savings.serialized.gz, clean.df$savings.rprotobuf.gz,pch=2, col="blue") # grey dotted diagonal -abline(a=0,b=1, col="grey",lty=3) +abline(a=0,b=1, col="grey",lty=2,lwd=3) # find point furthest off the X axis. clean.df$savings.diff <- clean.df$savings.serialized - clean.df$savings.rprotobuf @@ -1056,7 +1060,7 @@ \hline \end{tabular} \end{center} -\caption{(Top) Relative space savings of Protocol Buffers and native \proglang{R} serialization over the raw object sizes of each of the \Sexpr{n} data sets in the \pkg{datasets} package. Points to the left of the dotted $y=x$ line represent datasets that are more efficiently encoded with Protocol Buffers. (Bottom) Absolute space savings of two outlier datasets and the aggregate performance of all datasets.} +\caption{(Top) Relative space savings of Protocol Buffers and native \proglang{R} serialization over the raw object sizes of each of the \Sexpr{n} data sets in the \pkg{datasets} package. 
Points to the left of the dashed $y=x$ line represent datasets that are more efficiently encoded with Protocol Buffers. (Bottom) Absolute space savings of two outlier datasets and the aggregate performance of all datasets.} \label{fig:compression} \end{figure} From noreply at r-forge.r-project.org Mon Dec 1 22:58:07 2014 From: noreply at r-forge.r-project.org (noreply at r-forge.r-project.org) Date: Mon, 1 Dec 2014 22:58:07 +0100 (CET) Subject: [Rprotobuf-commits] r923 - papers/jss Message-ID: <20141201215807.A725B18788A@r-forge.r-project.org> Author: murray Date: 2014-12-01 22:58:07 +0100 (Mon, 01 Dec 2014) New Revision: 923 Modified: papers/jss/article.Rnw Log: Add a missing article to Jeroen's nice rewording of this section. Modified: papers/jss/article.Rnw =================================================================== --- papers/jss/article.Rnw 2014-12-01 21:54:51 UTC (rev 922) +++ papers/jss/article.Rnw 2014-12-01 21:58:07 UTC (rev 923) @@ -1069,9 +1069,9 @@ Protocol Buffers are used extensively at Google for almost all RPC protocols, and to store structured information on a variety of -persistent storage systems \citep{dean2009designs}. Since the +persistent storage systems \citep{dean2009designs}. Since the initial release in 2010, hundreds of Google's statisticians and -software engineers use the \pkg{RProtoBuf} package on daily basis +software engineers use the \pkg{RProtoBuf} package on a daily basis to interact with these systems from within \proglang{R}. The current section illustrates the power of Protocol Buffers to collect and manage large structured data in one language From noreply at r-forge.r-project.org Mon Dec 1 23:53:20 2014 From: noreply at r-forge.r-project.org (noreply at r-forge.r-project.org) Date: Mon, 1 Dec 2014 23:53:20 +0100 (CET) Subject: [Rprotobuf-commits] r924 - in pkg: . R inst Message-ID: <20141201225320.994B6187870@r-forge.r-project.org> Author: murray Date: 2014-12-01 23:53:20 +0100 (Mon, 01 Dec 2014) New Revision: 924 Modified: pkg/ChangeLog pkg/R/serialize.R pkg/R/wrapper_ZeroCopyInputStream.R pkg/inst/NEWS.Rd Log: Address a FIXME in the code and comment from JSS referee about aboiding file.create to get absolute pathname of temporary file. Use normalizePath with mustWork=FALSE as suggested by Jeroen. Modified: pkg/ChangeLog =================================================================== --- pkg/ChangeLog 2014-12-01 21:58:07 UTC (rev 923) +++ pkg/ChangeLog 2014-12-01 22:53:20 UTC (rev 924) @@ -1,3 +1,9 @@ +2014-12-01 Murray Stokely + + * R/wrapper_ZeroCopyInputStream.R: Avoid file.create for getting + absolute path of a temporary file name (JSS reviewer feedback) + * R/serialize.R: Idem. 
+ 2014-11-26 Murray Stokely Address feedback from anonymous reviewer for JSS to make this Modified: pkg/R/serialize.R =================================================================== --- pkg/R/serialize.R 2014-12-01 21:58:07 UTC (rev 923) +++ pkg/R/serialize.R 2014-12-01 22:53:20 UTC (rev 924) @@ -14,10 +14,10 @@ if( is.character( connection ) ){ # pretend it is a file name if( !file.exists(connection) ){ - # FIXME: hack to grab the absolute path name - file.create( connection ) - file <- file_path_as_absolute(connection) - unlink( file ) + if( !file.exists( dirname(connection) ) ){ + stop( "directory does not exist" ) + } + file <- normalizePath(connection, mustWork=FALSE) } else{ file <- file_path_as_absolute(connection) } Modified: pkg/R/wrapper_ZeroCopyInputStream.R =================================================================== --- pkg/R/wrapper_ZeroCopyInputStream.R 2014-12-01 21:58:07 UTC (rev 923) +++ pkg/R/wrapper_ZeroCopyInputStream.R 2014-12-01 22:53:20 UTC (rev 924) @@ -128,9 +128,7 @@ if( !file.exists( dirname(filename) ) ){ stop( "directory does not exist" ) } - file.create( filename ) - filename <- file_path_as_absolute(filename) - unlink( filename ) + filename <- normalizePath(filename, mustWork=FALSE) } else{ filename <- file_path_as_absolute(filename) } Modified: pkg/inst/NEWS.Rd =================================================================== --- pkg/inst/NEWS.Rd 2014-12-01 21:58:07 UTC (rev 923) +++ pkg/inst/NEWS.Rd 2014-12-01 22:53:20 UTC (rev 924) @@ -2,7 +2,7 @@ \title{News for Package \pkg{RProtoBuf}} \newcommand{\cpkg}{\href{http://CRAN.R-project.org/package=#1}{\pkg{#1}}} -\section{Changes in RProtoBuf version 0.4.2 (2014-??-??)}{ +\section{Changes in RProtoBuf version 0.4.2 (2014-12-??)}{ \itemize{ \item Address changes suggested by anonymous reviewers for our Journal of Statistical Software submission. @@ -23,6 +23,8 @@ with \code{serialize_pb} and \code{unserialize_pb} to make it easy to serialize into a protocol buffer all 100+ of the built-in datasets with R. + \item Use \code{normalizePath} instead of creating a temporary + file with \code{file.create} when getting absolute path names. \item Add unit tests for all of the above. } From noreply at r-forge.r-project.org Tue Dec 2 01:40:57 2014 From: noreply at r-forge.r-project.org (noreply at r-forge.r-project.org) Date: Tue, 2 Dec 2014 01:40:57 +0100 (CET) Subject: [Rprotobuf-commits] r925 - papers/jss Message-ID: <20141202004057.2E47E187863@r-forge.r-project.org> Author: murray Date: 2014-12-02 01:40:56 +0100 (Tue, 02 Dec 2014) New Revision: 925 Modified: papers/jss/article.Rnw Log: Grammatical improvements throughout the paper suggested by Tim Hesterberg. Modified: papers/jss/article.Rnw =================================================================== --- papers/jss/article.Rnw 2014-12-01 22:53:20 UTC (rev 924) +++ papers/jss/article.Rnw 2014-12-02 00:40:56 UTC (rev 925) @@ -172,20 +172,20 @@ lacks type-safety, and has limited precision for numeric values. Moreover, ambiguities in the format itself frequently cause problems. For example, conventions on which characters is used as separator or decimal point vary by -country. \emph{Extensible Markup Language} (\code{XML}) is another +country. \emph{Extensible Markup Language} (\code{XML}) is a well-established and widely-supported format with the ability to define just about any arbitrarily complex schema \citep{nolan2013xml}. 
However, it pays for this complexity with comparatively large and verbose messages, and added -complexity at the parsing side (which are somewhat mitigated by the -availability of mature libraries and parsers). Because \code{XML} is +complexity at the parsing side (these problems are somewhat mitigated by the +availability of mature libraries and parsers). Because \code{XML} is text-based and has no native notion of numeric types or arrays, it usually not a very practical format to store numeric data sets as they appear in statistical applications. -A more modern format is \emph{JavaScript ObjectNotation} +A more modern format is \emph{JavaScript ObjectNotation} (\code{JSON}), which is derived from the object literals of -\proglang{JavaScript}, and already widely-used on the world wide web. +\proglang{JavaScript}, and already widely-used on the world wide web. Several \proglang{R} packages implement functions to parse and generate \code{JSON} data from \proglang{R} objects \citep{rjson,RJSONIO,jsonlite}. \code{JSON} natively supports arrays and four primitive types: numbers, strings, @@ -220,11 +220,11 @@ Section~\ref{sec:rprotobuf-basic} describes the interactive \proglang{R} interface provided by the \pkg{RProtoBuf} package, and introduces the two main abstractions: \emph{Messages} and \emph{Descriptors}. Section~\ref{sec:rprotobuf-classes} -details the implementation of the main S4 classes and methods. +details the implementation of the main S4 classes and methods. Section~\ref{sec:types} describes the challenges of type coercion between \proglang{R} and other languages. Section~\ref{sec:evaluation} introduces a -general \proglang{R} language schema for serializing arbitrary \proglang{R} objects and evaluates -it against the serialization capabilities built directly into \proglang{R}. Sections~\ref{sec:mapreduce} +general \proglang{R} language schema for serializing arbitrary \proglang{R} objects and compares it to +the serialization capabilities built directly into \proglang{R}. Sections~\ref{sec:mapreduce} and \ref{sec:opencpu} provide real-world use cases of \pkg{RProtoBuf} in MapReduce and web service environments, respectively, before Section~\ref{sec:summary} concludes. @@ -233,8 +233,8 @@ \label{sec:protobuf} Protocol Buffers are a modern, language-neutral, platform-neutral, -extensible mechanism for sharing and storing structured data. Some of -the key features provided by Protocol Buffers for data analysis are: +extensible mechanism for sharing and storing structured data. Key +features provided by Protocol Buffers for data analysis include: \begin{itemize} \item \emph{Portable}: Enable users to send and receive data between @@ -260,9 +260,9 @@ communication work flow with Protocol Buffers and an interactive \proglang{R} session. Common use cases include populating a request remote-procedure call (RPC) Protocol Buffer in \proglang{R} that is then serialized and sent over the network to a -remote server. The server would then deserialize the message, act on the -request, and respond with a new Protocol Buffer over the network. -The key difference to, say, a request to an \pkg{Rserve} +remote server. The server deserializes the message, acts on the +request, and responds with a new Protocol Buffer over the network. +The key difference to, say, a request to an \pkg{Rserve} \citep{Urbanek:2003:Rserve,CRAN:Rserve} instance is that the remote server may be implemented in any language. %, with no dependence on \proglang{R}. 
@@ -367,8 +367,8 @@ \subsection*{Importing message descriptors from \code{.proto} files} -To create or parse a Protocol Buffer Message, one must first read in -the message type specification from a \code{.proto} file. +To create or parse a Protocol Buffer Message, one must first read in +the message descriptor (\emph{message type}) from a \code{.proto} file. A small number of message types are imported when the package is first loaded, including the \code{tutorial.Person} type we saw in the last section. @@ -472,8 +472,8 @@ % \subsection{Serializing messages} -One of the primary benefits of Protocol Buffers is the efficient -binary wire-format representation. +A primary benefit of Protocol Buffers is an efficient +binary wire-format representation. The \code{serialize} method is implemented for Protocol Buffer messages to serialize a message into a sequence of bytes (raw vector) that represents the message. @@ -1098,8 +1098,8 @@ mappers. In the second method, illustrated in Figure~\ref{fig:mr-histogram-pattern1}, each mapper rounds a data point to a bucket width and outputs that bucket as a key and '1' as a -value. Reducers then sum up all of the values with the same key and -output to a data store. +value. Reducers count how many times each key occurs and outputs a +histogram to a data store. \begin{figure}[h!] \begin{center} @@ -1154,20 +1154,17 @@ \begin{Code} from histogram_pb2 import HistogramState; - hist = HistogramState() - hist.counts.extend([2, 6, 2, 4, 6]) hist.breaks.extend(range(6)) hist.name="Example Histogram Created in Python" - outfile = open("/tmp/hist.pb", "wb") outfile.write(hist.SerializeToString()) outfile.close() \end{Code} The Protocol Buffer created from this \proglang{Python} script can then be read into \proglang{R} and converted to a native -\proglang{R} histogram object for plotting. Line~1 in the listing below attaches the \pkg{HistogramTools} package which imports \pkg{RProtoBuf}. Line~2 then reads all of the \code{.proto} descriptor definitions provided by \pkg{HistogramTools} and adds them to the environment as described in Section~\ref{sec:rprotobuf-basic}. Line~3 parses the serialized protocol buffer using the \code{HistogramTools.HistogramState} schema. Line~8 converts the protocol buffer representation of the histogram to a native \proglang{R} histogram object with \code{as.histogram} and passes the result to \code{plot}. +\proglang{R} histogram object for plotting. Line~1 in the listing below attaches the \pkg{HistogramTools} package which imports \pkg{RProtoBuf}. Line~2 then reads all of the \code{.proto} descriptor definitions provided by \pkg{HistogramTools} and adds them to the environment as described in Section~\ref{sec:rprotobuf-basic}. Line~3 parses the serialized protocol buffer using the \code{HistogramTools.HistogramState} schema. The last line converts the protocol buffer representation of the histogram to a native \proglang{R} histogram object with \code{as.histogram} and passes the result to \code{plot}. % Here, the schema is read first, %then the (serialized) histogram is read into the variable \code{hist} which @@ -1220,7 +1217,7 @@ \label{sec:opencpu} The previous section described an application where data from a -program written in another language was output to persistent storage +program written in another language was saved to persistent storage and then read into \proglang{R} for further analysis. 
This section describes another common use case where Protocol Buffers are used as the interchange format for client-server communication. @@ -1232,7 +1229,7 @@ multimedia content. When designing systems where various components require exchange of specific data structures, we need something on top of the network protocol that prescribes how these structures are to be represented in -messages (buffers) on the network. Protocol Buffers solve exactly this +messages (buffers) on the network. Protocol Buffers solve this problem by providing a cross-platform method for serializing arbitrary structures into well defined messages, which can then be exchanged using any protocol. @@ -1312,10 +1309,8 @@ \begin{verbatim} import urllib2 from rexp_pb2 import REXP - req = urllib2.Request('https://demo.ocpu.io/MASS/data/Animals/pb') res = urllib2.urlopen(req) - msg = REXP() msg.ParseFromString(res.read()) print(msg) @@ -1394,7 +1389,7 @@ users of \pkg{RProtoBuf} using it to read data from and otherwise interact with distributed systems written in \proglang{C++}, \proglang{Java}, \proglang{Python}, and other languages. We hope that making Protocol Buffers available to the -\proglang{R} community will contribute towards better software integration +\proglang{R} community will contribute to better software integration and allow for building even more advanced applications and analysis pipelines with \proglang{R}. @@ -1465,7 +1460,7 @@ repeated REXP attrValue = 12; optional bytes languageValue = 13; optional bytes environmentValue = 14; - optional bytes functionValue = 14; + optional bytes functionValue = 15; } message STRING { optional string strval = 1; From noreply at r-forge.r-project.org Tue Dec 2 04:39:47 2014 From: noreply at r-forge.r-project.org (noreply at r-forge.r-project.org) Date: Tue, 2 Dec 2014 04:39:47 +0100 (CET) Subject: [Rprotobuf-commits] r926 - in pkg: . inst man Message-ID: <20141202033947.B597D184C54@r-forge.r-project.org> Author: edd Date: 2014-12-02 04:39:46 +0100 (Tue, 02 Dec 2014) New Revision: 926 Modified: pkg/ChangeLog pkg/inst/NEWS.Rd pkg/man/Descriptor-class.Rd pkg/man/EnumDescriptor-class.Rd pkg/man/Message-class.Rd Log: minor fixes for documentation Modified: pkg/ChangeLog =================================================================== --- pkg/ChangeLog 2014-12-02 00:40:56 UTC (rev 925) +++ pkg/ChangeLog 2014-12-02 03:39:46 UTC (rev 926) @@ -1,3 +1,9 @@ +2014-12-01 Dirk Eddelbuettel + + * man/Message-class.Rd: Completed documentation + * man/Descriptor-class.Rd: Ditto + * man/EnumDescriptor-class.Rd: Ditto + 2014-12-01 Murray Stokely * R/wrapper_ZeroCopyInputStream.R: Avoid file.create for getting Modified: pkg/inst/NEWS.Rd =================================================================== --- pkg/inst/NEWS.Rd 2014-12-02 00:40:56 UTC (rev 925) +++ pkg/inst/NEWS.Rd 2014-12-02 03:39:46 UTC (rev 926) @@ -26,6 +26,7 @@ \item Use \code{normalizePath} instead of creating a temporary file with \code{file.create} when getting absolute path names. \item Add unit tests for all of the above. 
+ } } \section{Changes in RProtoBuf version 0.4.1 (2014-03-25)}{ Modified: pkg/man/Descriptor-class.Rd =================================================================== --- pkg/man/Descriptor-class.Rd 2014-12-02 00:40:56 UTC (rev 925) +++ pkg/man/Descriptor-class.Rd 2014-12-02 03:39:46 UTC (rev 926) @@ -15,6 +15,9 @@ \alias{field,Descriptor-method} \alias{nested_type,Descriptor-method} \alias{enum_type,Descriptor,ANY,ANY-method} +\alias{[[,Descriptor-method} +\alias{names,Descriptor-method} +\alias{length,Descriptor-method} \title{Class "Descriptor" } \description{ full descriptive information about a protocol buffer @@ -81,6 +84,9 @@ If \code{name} is used, the enum type will be retrieved using its name, with the \code{FindEnumTypeByName} C++ method } + \item{[[}{\code{signature(x = "Descriptor")}: extracts a field identified by its name or declared tag number} + \item{names}{\code{signature(x = "Descriptor")} : extracts names of this descriptor} + \item{length}{\code{signature(x = "Descriptor")} : extracts length of this descriptor} } } Modified: pkg/man/EnumDescriptor-class.Rd =================================================================== --- pkg/man/EnumDescriptor-class.Rd 2014-12-02 00:40:56 UTC (rev 925) +++ pkg/man/EnumDescriptor-class.Rd 2014-12-02 03:39:46 UTC (rev 926) @@ -18,6 +18,9 @@ \alias{value-methods} \alias{value,EnumDescriptor-method} +\alias{[[,EnumDescriptor-method} +\alias{names,EnumDescriptor-method} + \title{Class "EnumDescriptor" } \description{ R representation of an enum descriptor. This is a thin wrapper around the \code{EnumDescriptor} c++ class. } @@ -60,6 +63,8 @@ using the name of the constant, using the \code{FindValueByName} C++ method. } + \item{[[}{\code{signature(x = "EnumDescriptor")}: extracts field identified by its name or declared tag number} + \item{names}{\code{signature(x = "EnumDescriptor")} : extracts names of this enum} } } Modified: pkg/man/Message-class.Rd =================================================================== --- pkg/man/Message-class.Rd 2014-12-02 00:40:56 UTC (rev 925) +++ pkg/man/Message-class.Rd 2014-12-02 03:39:46 UTC (rev 926) @@ -11,6 +11,7 @@ \alias{show,Message-method} \alias{update,Message-method} \alias{length,Message-method} +\alias{names,Message-method} \alias{str,Message-method} \alias{toString,Message-method} \alias{identical,Message,Message-method} @@ -68,7 +69,9 @@ \item{==}{\code{signature(e1 = "Message", e2 = "Message")}: Same as \code{identical} } \item{!=}{\code{signature(e1 = "Message", e2 = "Message")}: Negation of \code{identical} } \item{all.equal}{\code{signature(e1 = "Message", e2 = "Message")}: Test near equality } - } + \item{names}{\code{signature(x = "Message")}: extracts the names of the message. } + + } } \references{ The \code{Message} class from the C++ proto library. From mstokely at google.com Tue Dec 2 04:42:10 2014 From: mstokely at google.com (Murray Stokely) Date: Mon, 1 Dec 2014 19:42:10 -0800 Subject: [Rprotobuf-commits] r926 - in pkg: . inst man In-Reply-To: <20141202033947.B597D184C54@r-forge.r-project.org> References: <20141202033947.B597D184C54@r-forge.r-project.org> Message-ID: Thanks, Dirk! 
- Murray

From noreply at r-forge.r-project.org Wed Dec 3 20:43:16 2014
From: noreply at r-forge.r-project.org (noreply at r-forge.r-project.org)
Date: Wed, 3 Dec 2014 20:43:16 +0100 (CET)
Subject: [Rprotobuf-commits] r927 - papers/jss
Message-ID: <20141203194316.7CE3D18444D@r-forge.r-project.org>

Author: murray
Date: 2014-12-03 20:43:16 +0100 (Wed, 03 Dec 2014)
New Revision: 927

Modified:
   papers/jss/article.Rnw
Log:
Improve the plot and point out 3 outliers now and explain them in the text.
Correct an error in the space savings definition. Change trivial example to
simple example.

Suggestions from: Andy Chu

Modified: papers/jss/article.Rnw
===================================================================
--- papers/jss/article.Rnw	2014-12-02 03:39:46 UTC (rev 926)
+++ papers/jss/article.Rnw	2014-12-03 19:43:16 UTC (rev 927)
@@ -972,20 +972,20 @@
 clean.df<-rbind(clean.df, all.df)
 @

-Figure~\ref{fig:compression} shows the space savings $\left(1 - \frac{\textrm{Uncompressed Size}}{\textrm{Compressed Size}}\right)$ for each of the data sets using each of these four methods. The associated table shows the exact data sizes for two outliers and the aggregate of all \Sexpr{n} data sets.
+Figure~\ref{fig:compression} shows the space savings $\left(1 - \frac{\textrm{Compressed Size}}{\textrm{Uncompressed Size}}\right)$ for each of the data sets using each of these four methods. The associated table shows the exact data sizes for some outliers and the aggregate of all \Sexpr{n} data sets.
 Note that Protocol Buffer serialization results in slightly
-smaller byte streams compared to native \proglang{R} serialization in most cases,
-but this difference disappears if the results are compressed with gzip.
+smaller byte streams compared to native \proglang{R} serialization in most cases (red dots),
+but this difference disappears if the results are compressed with gzip (blue triangles).
%Sizes are comparable but Protocol Buffers provide simple getters and setters %in multiple languages instead of requiring other programs to parse the \proglang{R} %serialization format. % \citep{serialization}. The \code{crimtab} dataset of anthropometry measurements of British -prisoners \citep{garson1900metric} -shows the greatest difference in the space savings when +prisoners \citep{garson1900metric} and the \code{airquality} dataset of air quality measurements in New York show the +greatest difference in the space savings when using Protocol Buffers compared to \proglang{R} native serialization. -This dataset is a 42x22 table of integers, most equal to 0. Small -integer values like this can be very efficiently encoded by the +The \code{crimtab} dataset is a 42x22 table of integers, most equal to 0, and the \code{airquality} dataset is a data frame of 154 observations of 1 numeric and 5 integer variables. In both data sets, the large number of small +integer values can be very efficiently encoded by the \emph{Varint} integer encoding scheme used by Protocol Buffers which use a variable number of bytes for each value. @@ -1008,10 +1008,16 @@ application-specific schema has been defined. The example in the next section satisfies both of these conditions. -\begin{figure}[t!] +\begin{figure}[hbt!] \begin{center} -<>= -plot(clean.df$savings.serialized, clean.df$savings.rprotobuf, pch=1, col="red", las=1, xlab="Serialization Space Savings", ylab="Protocol Buffer Space Savings") +<>= +old.mar<-par("mar") +new.mar<-old.mar +new.mar[3]<-0 +new.mar[4]<-0 +my.cex<-1.3 +par("mar"=new.mar) +plot(clean.df$savings.serialized, clean.df$savings.rprotobuf, pch=1, col="red", las=1, xlab="Serialization Space Savings", ylab="Protocol Buffer Space Savings", xlim=c(0,1),ylim=c(0,1),cex.lab=my.cex, cex.axis=my.cex) points(clean.df$savings.serialized.gz, clean.df$savings.rprotobuf.gz,pch=2, col="blue") # grey dotted diagonal abline(a=0,b=1, col="grey",lty=2,lwd=3) @@ -1023,17 +1029,27 @@ # The one to label. tmp.df <- clean.df[which(clean.df$savings.diff == min(clean.df$savings.diff)),] # This minimum means most to the left of our line, so pos=2 is label to the left -text(tmp.df$savings.serialized, tmp.df$savings.rprotobuf, labels=tmp.df$dataset, pos=2) -text(tmp.df$savings.serialized.gz, tmp.df$savings.rprotobuf.gz, labels=tmp.df$dataset, pos=2) +text(tmp.df$savings.serialized, tmp.df$savings.rprotobuf, labels=tmp.df$dataset, pos=2, cex=my.cex) +# Some gziped version +# text(tmp.df$savings.serialized.gz, tmp.df$savings.rprotobuf.gz, labels=tmp.df$dataset, pos=2, cex=my.cex) + +# Second one is also an outlier +tmp.df <- clean.df[which(clean.df$savings.diff == sort(clean.df$savings.diff)[2]),] +# This minimum means most to the left of our line, so pos=2 is label to the left +text(tmp.df$savings.serialized, tmp.df$savings.rprotobuf, labels=tmp.df$dataset, pos=2, cex=my.cex) +#text(tmp.df$savings.serialized.gz, tmp.df$savings.rprotobuf.gz, labels=tmp.df$dataset, pos=my.cex) + + tmp.df <- clean.df[which(clean.df$savings.diff == max(clean.df$savings.diff)),] # This minimum means most to the right of the diagonal, so pos=4 is label to the right -text(tmp.df$savings.serialized, tmp.df$savings.rprotobuf, labels=tmp.df$dataset, pos=4) -text(tmp.df$savings.serialized.gz, tmp.df$savings.rprotobuf.gz, labels=tmp.df$dataset, pos=4) +# Only show the gziped one. 
+#text(tmp.df$savings.serialized, tmp.df$savings.rprotobuf, labels=tmp.df$dataset, pos=4, cex=my.cex) +text(tmp.df$savings.serialized.gz, tmp.df$savings.rprotobuf.gz, labels=tmp.df$dataset, pos=4, cex=my.cex) #outlier.dfs <- clean.df[c(which(clean.df$savings.diff == min(clean.df$savings.diff)), -legend("topleft", c("Raw", "Gzip Compressed"), pch=1:2, col=c("red", "blue")) +legend("topleft", c("Raw", "Gzip Compressed"), pch=1:2, col=c("red", "blue"), cex=my.cex) interesting.df <- clean.df[unique(c(which(clean.df$savings.diff == min(clean.df$savings.diff)), which(clean.df$savings.diff == max(clean.df$savings.diff)), @@ -1041,7 +1057,9 @@ which(clean.df$dataset == "TOTAL"))),c("dataset", "object.size", "serialized", "gzipped serialized", "RProtoBuf", "gzipped RProtoBuf", "savings.serialized", "savings.serialized.gz", "savings.rprotobuf", "savings.rprotobuf.gz")] # Print without .00 in xtable interesting.df$object.size <- as.integer(interesting.df$object.size) +par("mar"=old.mar) @ +\includegraphics[width=0.45\textwidth]{figures/fig-SER} % latex table generated in R 3.0.2 by xtable 1.7-0 package % Wed Nov 26 15:31:30 2014 @@ -1054,13 +1072,14 @@ & & default & gzipped & default & gzipped \\ \cmidrule(r){2-6} crimtab & 7,936 & 4,641 (41.5\%) & 713 (91.0\%) & 1,655 (79.2\%) & 576 (92.7\%)\\ + airquality & 5,496 & 4,551 (17.2\%) & 1,241 (77.4\%) & 2,874 (47.7\%) & 1,294 (76.5\%)\\ faithful & 5,136 & 4,543 (11.5\%) & 1,339 (73.9\%) & 4,936 (3.9\%) & 1,776 (65.5\%)\\ \hline All & 605,256 & 461,667 (24\%) & 138,937 (77\%) & 435,360 (28\%) & 142,134 (77\%)\\ \hline \end{tabular} \end{center} -\caption{(Top) Relative space savings of Protocol Buffers and native \proglang{R} serialization over the raw object sizes of each of the \Sexpr{n} data sets in the \pkg{datasets} package. Points to the left of the dashed $y=x$ line represent datasets that are more efficiently encoded with Protocol Buffers. (Bottom) Absolute space savings of two outlier datasets and the aggregate performance of all datasets.} +\caption{(Top) Relative space savings of Protocol Buffers and native \proglang{R} serialization over the raw object sizes of each of the \Sexpr{n} data sets in the \pkg{datasets} package. Points to the left of the dashed $y=x$ line represent datasets that are more efficiently encoded with Protocol Buffers. (Bottom) Absolute space savings of three outlier datasets and the aggregate performance of all datasets.} \label{fig:compression} \end{figure} @@ -1135,7 +1154,7 @@ written in other languages and only the resulting output histograms need to be manipulated in \proglang{R}. -\subsection*{A trivial single-machine example for Python to R serialization} +\subsection*{A simple single-machine example for Python to R serialization} To create HistogramState messages in Python for later consumption by \proglang{R}, we first compile the From noreply at r-forge.r-project.org Wed Dec 3 23:09:22 2014 From: noreply at r-forge.r-project.org (noreply at r-forge.r-project.org) Date: Wed, 3 Dec 2014 23:09:22 +0100 (CET) Subject: [Rprotobuf-commits] r928 - / Message-ID: <20141203220922.C44551810FB@r-forge.r-project.org> Author: jeroenooms Date: 2014-12-03 23:09:22 +0100 (Wed, 03 Dec 2014) New Revision: 928 Added: .travis.yml Log: Add travis file. Added: .travis.yml =================================================================== --- .travis.yml (rev 0) +++ .travis.yml 2014-12-03 22:09:22 UTC (rev 928) @@ -0,0 +1,24 @@ +# Sample .travis.yml for R projects. 
+# +# See README.md for instructions, or for more configuration options, +# see the wiki: +# https://github.com/craigcitro/r-travis/wiki + +language: c + +before_install: + - sudo apt-get install libprotobuf-dev libprotoc-dev + - curl -OL http://raw.github.com/craigcitro/r-travis/master/scripts/travis-tool.sh + - chmod 755 ./travis-tool.sh + - ./travis-tool.sh bootstrap +install: + - ./travis-tool.sh install_deps +script: ./travis-tool.sh run_tests + +after_failure: + - ./travis-tool.sh dump_logs + +notifications: + email: + on_success: change + on_failure: change From noreply at r-forge.r-project.org Thu Dec 4 02:45:57 2014 From: noreply at r-forge.r-project.org (noreply at r-forge.r-project.org) Date: Thu, 4 Dec 2014 02:45:57 +0100 (CET) Subject: [Rprotobuf-commits] r929 - in pkg: . inst/unitTests Message-ID: <20141204014557.A1023185D50@r-forge.r-project.org> Author: murray Date: 2014-12-04 02:45:57 +0100 (Thu, 04 Dec 2014) New Revision: 929 Modified: pkg/ChangeLog pkg/inst/unitTests/runit.int64.R Log: Save the options and restore them on.exit to make this test indempotent. This might be responsible for some unit test failures if R CMD CHECK now runs the same testSuite twice, which I don't think it does. Modified: pkg/ChangeLog =================================================================== --- pkg/ChangeLog 2014-12-03 22:09:22 UTC (rev 928) +++ pkg/ChangeLog 2014-12-04 01:45:57 UTC (rev 929) @@ -1,3 +1,8 @@ +2014-12-04 Murray Stokely + + * inst/unitTests/runit.int64.R: restore options on exit from this + function to make the test indempotent. + 2014-12-01 Dirk Eddelbuettel * man/Message-class.Rd: Completed documentation Modified: pkg/inst/unitTests/runit.int64.R =================================================================== --- pkg/inst/unitTests/runit.int64.R 2014-12-03 22:09:22 UTC (rev 928) +++ pkg/inst/unitTests/runit.int64.R 2014-12-04 01:45:57 UTC (rev 929) @@ -15,6 +15,10 @@ # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA. test.int64 <- function() { + # Preserve option. 
+ old.optval <- options("RProtoBuf.int64AsString") + on.exit(options(old.optval)) + if (!exists("protobuf_unittest.TestAllTypes", "RProtoBuf:DescriptorPool")) { unittest.proto.file <- system.file("unitTests", "data", From noreply at r-forge.r-project.org Mon Dec 15 02:10:08 2014 From: noreply at r-forge.r-project.org (noreply at r-forge.r-project.org) Date: Mon, 15 Dec 2014 02:10:08 +0100 (CET) Subject: [Rprotobuf-commits] r930 - papers/jss Message-ID: <20141215011008.3385A1865E6@r-forge.r-project.org> Author: edd Date: 2014-12-15 02:10:07 +0100 (Mon, 15 Dec 2014) New Revision: 930 Modified: papers/jss/article.R papers/jss/article.Rnw Log: one "isn't" replaced with "is not"; one sentence reworked Modified: papers/jss/article.R =================================================================== --- papers/jss/article.R 2014-12-04 01:45:57 UTC (rev 929) +++ papers/jss/article.R 2014-12-15 01:10:07 UTC (rev 930) @@ -1,7 +1,7 @@ -### R code from vignette source 'article.Rnw' +### R code from vignette source '/home/edd/svn/rprotobuf/papers/jss/article.Rnw' ################################################### -### code chunk number 1: article.Rnw:125-131 +### code chunk number 1: article.Rnw:130-136 ################################################### ## cf http://www.jstatsoft.org/style#q12 options(prompt = "R> ", @@ -12,7 +12,7 @@ ################################################### -### code chunk number 2: article.Rnw:313-321 +### code chunk number 2: article.Rnw:318-326 ################################################### library("RProtoBuf") p <- new(tutorial.Person, id=1, @@ -25,20 +25,13 @@ ################################################### -### code chunk number 3: article.Rnw:376-377 +### code chunk number 3: article.Rnw:421-422 ################################################### -ls("RProtoBuf:DescriptorPool") - - -################################################### -### code chunk number 4: article.Rnw:391-393 -################################################### -p1 <- new(tutorial.Person) p <- new(tutorial.Person, name = "Murray", id = 1) ################################################### -### code chunk number 5: article.Rnw:402-405 +### code chunk number 4: article.Rnw:431-434 ################################################### p$name p$id @@ -46,7 +39,7 @@ ################################################### -### code chunk number 6: article.Rnw:413-416 +### code chunk number 5: article.Rnw:442-445 ################################################### p[["name"]] <- "Murray Stokely" p[[ 2 ]] <- 3 @@ -54,25 +47,25 @@ ################################################### -### code chunk number 7: article.Rnw:429-430 +### code chunk number 6: article.Rnw:461-462 ################################################### p ################################################### -### code chunk number 8: article.Rnw:437-438 +### code chunk number 7: article.Rnw:469-470 ################################################### writeLines(as.character(p)) ################################################### -### code chunk number 9: article.Rnw:451-452 +### code chunk number 8: article.Rnw:483-484 ################################################### serialize(p, NULL) ################################################### -### code chunk number 10: article.Rnw:457-460 +### code chunk number 9: article.Rnw:489-492 ################################################### tf1 <- tempfile() serialize(p, tf1) @@ -80,92 +73,42 @@ ################################################### -### code chunk number 11: 
article.Rnw:465-470 +### code chunk number 10: article.Rnw:538-540 ################################################### -tf2 <- tempfile() -con <- file(tf2, open = "wb") -serialize(p, con) -close(con) -readBin(tf2, raw(0), 500) - - -################################################### -### code chunk number 12: article.Rnw:476-480 -################################################### -p$serialize(tf1) -con <- file(tf2, open = "wb") -p$serialize(con) -close(con) - - -################################################### -### code chunk number 13: article.Rnw:500-502 -################################################### msg <- read(tutorial.Person, tf1) writeLines(as.character(msg)) ################################################### -### code chunk number 14: article.Rnw:508-512 +### code chunk number 11: article.Rnw:660-661 ################################################### -con <- file(tf2, open = "rb") -message <- read(tutorial.Person, con) -close(con) -writeLines(as.character(message)) - - -################################################### -### code chunk number 15: article.Rnw:517-519 -################################################### -payload <- readBin(tf1, raw(0), 5000) -message <- read(tutorial.Person, payload) - - -################################################### -### code chunk number 16: article.Rnw:526-531 -################################################### -message <- tutorial.Person$read(tf1) -con <- file(tf2, open = "rb") -message <- tutorial.Person$read(con) -close(con) -message <- tutorial.Person$read(payload) - - -################################################### -### code chunk number 17: article.Rnw:610-611 -################################################### new(tutorial.Person) ################################################### -### code chunk number 18: article.Rnw:675-682 +### code chunk number 12: article.Rnw:685-690 ################################################### tutorial.Person$email +tutorial.Person$email$is_required() +tutorial.Person$email$type() +tutorial.Person$email$as.character() +class(tutorial.Person$email) -tutorial.Person$PhoneType -tutorial.Person$PhoneNumber - -tutorial.Person.PhoneNumber - - ################################################### -### code chunk number 19: article.Rnw:798-800 +### code chunk number 13: article.Rnw:702-709 ################################################### tutorial.Person$PhoneType tutorial.Person$PhoneType$WORK - - -################################################### -### code chunk number 20: article.Rnw:849-852 -################################################### +class(tutorial.Person$PhoneType) tutorial.Person$PhoneType$value(1) tutorial.Person$PhoneType$value(name="HOME") tutorial.Person$PhoneType$value(number=1) +class(tutorial.Person$PhoneType$value(1)) ################################################### -### code chunk number 21: article.Rnw:921-924 +### code chunk number 14: article.Rnw:719-722 ################################################### f <- tutorial.Person$fileDescriptor() f @@ -173,7 +116,7 @@ ################################################### -### code chunk number 22: article.Rnw:987-990 +### code chunk number 15: article.Rnw:785-788 ################################################### if (!exists("JSSPaper.Example1", "RProtoBuf:DescriptorPool")) { readProtoFiles(file="int64.proto") @@ -181,7 +124,7 @@ ################################################### -### code chunk number 23: article.Rnw:1012-1016 +### code chunk number 16: article.Rnw:810-814 
################################################### as.integer(2^31-1) as.integer(2^31 - 1) + as.integer(1) @@ -190,20 +133,20 @@ ################################################### -### code chunk number 24: article.Rnw:1028-1029 +### code chunk number 17: article.Rnw:826-827 ################################################### 2^53 == (2^53 + 1) ################################################### -### code chunk number 25: article.Rnw:1080-1082 +### code chunk number 18: article.Rnw:878-880 ################################################### msg <- serialize_pb(iris, NULL) identical(iris, unserialize_pb(msg)) ################################################### -### code chunk number 26: article.Rnw:1113-1116 +### code chunk number 19: article.Rnw:908-911 ################################################### datasets <- as.data.frame(data(package="datasets")$results) datasets$name <- sub("\\s+.*$", "", datasets$Item) @@ -211,26 +154,8 @@ ################################################### -### code chunk number 27: article.Rnw:1126-1127 +### code chunk number 20: article.Rnw:929-972 ################################################### -m <- sum(sapply(datasets$name, function(x) can_serialize_pb(get(x)))) - - -################################################### -### code chunk number 28: article.Rnw:1140-1147 -################################################### -attr(CO2, "formula") -msg <- serialize_pb(CO2, NULL) -object <- unserialize_pb(msg) -identical(CO2, object) -identical(class(CO2), class(object)) -identical(dim(CO2), dim(object)) -attr(object, "formula") - - -################################################### -### code chunk number 29: article.Rnw:1163-1182 -################################################### datasets$object.size <- unname(sapply(datasets$name, function(x) object.size(eval(as.name(x))))) datasets$R.serialize.size <- unname(sapply(datasets$name, function(x) length(serialize(eval(as.name(x)), NULL)))) @@ -249,42 +174,117 @@ "gzipped serialized"=datasets$R.serialize.size.gz, "RProtoBuf"=datasets$RProtoBuf.serialize.size, "gzipped RProtoBuf"=datasets$RProtoBuf.serialize.size.gz, + "ratio.serialized" = datasets$R.serialize.size / datasets$object.size, + "ratio.rprotobuf" = datasets$RProtoBuf.serialize.size / datasets$object.size, + "ratio.serialized.gz" = datasets$R.serialize.size.gz / datasets$object.size, + "ratio.rprotobuf.gz" = datasets$RProtoBuf.serialize.size.gz / datasets$object.size, + "savings.serialized" = 1-(datasets$R.serialize.size / datasets$object.size), + "savings.rprotobuf" = 1-(datasets$RProtoBuf.serialize.size / datasets$object.size), + "savings.serialized.gz" = 1-(datasets$R.serialize.size.gz / datasets$object.size), + "savings.rprotobuf.gz" = 1-(datasets$RProtoBuf.serialize.size.gz / datasets$object.size), check.names=FALSE) +all.df<-data.frame(dataset="TOTAL", object.size=sum(datasets$object.size), + "serialized"=sum(datasets$R.serialize.size), + "gzipped serialized"=sum(datasets$R.serialize.size.gz), + "RProtoBuf"=sum(datasets$RProtoBuf.serialize.size), + "gzipped RProtoBuf"=sum(datasets$RProtoBuf.serialize.size.gz), + "ratio.serialized" = sum(datasets$R.serialize.size) / sum(datasets$object.size), + "ratio.rprotobuf" = sum(datasets$RProtoBuf.serialize.size) / sum(datasets$object.size), + "ratio.serialized.gz" = sum(datasets$R.serialize.size.gz) / sum(datasets$object.size), + "ratio.rprotobuf.gz" = sum(datasets$RProtoBuf.serialize.size.gz) / sum(datasets$object.size), + "savings.serialized" = 1-(sum(datasets$R.serialize.size) / 
sum(datasets$object.size)), + "savings.rprotobuf" = 1-(sum(datasets$RProtoBuf.serialize.size) / sum(datasets$object.size)), + "savings.serialized.gz" = 1-(sum(datasets$R.serialize.size.gz) / sum(datasets$object.size)), + "savings.rprotobuf.gz" = 1-(sum(datasets$RProtoBuf.serialize.size.gz) / sum(datasets$object.size)), + check.names=FALSE) +clean.df<-rbind(clean.df, all.df) + ################################################### -### code chunk number 30: article.Rnw:1390-1395 +### code chunk number 21: SER ################################################### -require(RProtoBuf) +old.mar<-par("mar") +new.mar<-old.mar +new.mar[3]<-0 +new.mar[4]<-0 +my.cex<-1.3 +par("mar"=new.mar) +plot(clean.df$savings.serialized, clean.df$savings.rprotobuf, pch=1, col="red", las=1, xlab="Serialization Space Savings", ylab="Protocol Buffer Space Savings", xlim=c(0,1),ylim=c(0,1),cex.lab=my.cex, cex.axis=my.cex) +points(clean.df$savings.serialized.gz, clean.df$savings.rprotobuf.gz,pch=2, col="blue") +# grey dotted diagonal +abline(a=0,b=1, col="grey",lty=2,lwd=3) + +# find point furthest off the X axis. +clean.df$savings.diff <- clean.df$savings.serialized - clean.df$savings.rprotobuf +clean.df$savings.diff.gz <- clean.df$savings.serialized.gz - clean.df$savings.rprotobuf.gz + +# The one to label. +tmp.df <- clean.df[which(clean.df$savings.diff == min(clean.df$savings.diff)),] +# This minimum means most to the left of our line, so pos=2 is label to the left +text(tmp.df$savings.serialized, tmp.df$savings.rprotobuf, labels=tmp.df$dataset, pos=2, cex=my.cex) + +# Some gziped version +# text(tmp.df$savings.serialized.gz, tmp.df$savings.rprotobuf.gz, labels=tmp.df$dataset, pos=2, cex=my.cex) + +# Second one is also an outlier +tmp.df <- clean.df[which(clean.df$savings.diff == sort(clean.df$savings.diff)[2]),] +# This minimum means most to the left of our line, so pos=2 is label to the left +text(tmp.df$savings.serialized, tmp.df$savings.rprotobuf, labels=tmp.df$dataset, pos=2, cex=my.cex) +#text(tmp.df$savings.serialized.gz, tmp.df$savings.rprotobuf.gz, labels=tmp.df$dataset, pos=my.cex) + + +tmp.df <- clean.df[which(clean.df$savings.diff == max(clean.df$savings.diff)),] +# This minimum means most to the right of the diagonal, so pos=4 is label to the right +# Only show the gziped one. 
+#text(tmp.df$savings.serialized, tmp.df$savings.rprotobuf, labels=tmp.df$dataset, pos=4, cex=my.cex) +text(tmp.df$savings.serialized.gz, tmp.df$savings.rprotobuf.gz, labels=tmp.df$dataset, pos=4, cex=my.cex) + +#outlier.dfs <- clean.df[c(which(clean.df$savings.diff == min(clean.df$savings.diff)), + +legend("topleft", c("Raw", "Gzip Compressed"), pch=1:2, col=c("red", "blue"), cex=my.cex) + +interesting.df <- clean.df[unique(c(which(clean.df$savings.diff == min(clean.df$savings.diff)), + which(clean.df$savings.diff == max(clean.df$savings.diff)), + which(clean.df$savings.diff.gz == max(clean.df$savings.diff.gz)), + which(clean.df$dataset == "TOTAL"))),c("dataset", "object.size", "serialized", "gzipped serialized", "RProtoBuf", "gzipped RProtoBuf", "savings.serialized", "savings.serialized.gz", "savings.rprotobuf", "savings.rprotobuf.gz")] +# Print without .00 in xtable +interesting.df$object.size <- as.integer(interesting.df$object.size) +par("mar"=old.mar) + + +################################################### +### code chunk number 22: article.Rnw:1211-1215 +################################################### require(HistogramTools) readProtoFiles(package="HistogramTools") hist <- HistogramTools.HistogramState$read("hist.pb") -plot(as.histogram(hist)) +plot(as.histogram(hist), main="") ################################################### -### code chunk number 31: article.Rnw:1463-1470 (eval = FALSE) +### code chunk number 23: article.Rnw:1303-1310 (eval = FALSE) ################################################### ## library("RProtoBuf") ## library("httr") ## -## req <- GET('https://public.opencpu.org/ocpu/library/MASS/data/Animals/pb') +## req <- GET('https://demo.ocpu.io/MASS/data/Animals/pb') ## output <- unserialize_pb(req$content) ## ## identical(output, MASS::Animals) ################################################### -### code chunk number 32: article.Rnw:1529-1545 (eval = FALSE) +### code chunk number 24: article.Rnw:1360-1376 (eval = FALSE) ################################################### -## library("httr") +## library("httr") ## library("RProtoBuf") ## ## args <- list(n=42, mean=100) ## payload <- serialize_pb(args, NULL) ## ## req <- POST ( -## url = "https://public.opencpu.org/ocpu/library/stats/R/rnorm/pb", +## url = "https://demo.ocpu.io/stats/R/rnorm/pb", ## body = payload, ## add_headers ( ## "Content-Type" = "application/x-protobuf" @@ -296,7 +296,7 @@ ################################################### -### code chunk number 33: article.Rnw:1549-1552 (eval = FALSE) +### code chunk number 25: article.Rnw:1380-1383 (eval = FALSE) ################################################### ## fnargs <- unserialize_pb(inputmsg) ## val <- do.call(stats::rnorm, fnargs) Modified: papers/jss/article.Rnw =================================================================== --- papers/jss/article.Rnw 2014-12-04 01:45:57 UTC (rev 929) +++ papers/jss/article.Rnw 2014-12-15 01:10:07 UTC (rev 930) @@ -233,8 +233,8 @@ \label{sec:protobuf} Protocol Buffers are a modern, language-neutral, platform-neutral, -extensible mechanism for sharing and storing structured data. Key -features provided by Protocol Buffers for data analysis include: +extensible mechanism for sharing and storing structured data. 
Some of their +features, particularly in the context of data analysis, are: \begin{itemize} \item \emph{Portable}: Enable users to send and receive data between @@ -388,7 +388,7 @@ parsed from \code{.proto} files and added to the global namespace.\footnote{Note that there is a significant performance overhead with this RObjectTable implementation. Because the table - is on the search path and isn't cacheable, lookups of symbols that + is on the search path and is not cacheable, lookups of symbols that are behind it in the search path cannot be added to the global object cache, and R must perform an expensive lookup through all of the attached environments and the protocol buffer definitions to find common From noreply at r-forge.r-project.org Mon Dec 15 04:01:41 2014 From: noreply at r-forge.r-project.org (noreply at r-forge.r-project.org) Date: Mon, 15 Dec 2014 04:01:41 +0100 (CET) Subject: [Rprotobuf-commits] r931 - papers/jss Message-ID: <20141215030142.166D7184817@r-forge.r-project.org> Author: edd Date: 2014-12-15 04:01:40 +0100 (Mon, 15 Dec 2014) New Revision: 931 Added: papers/jss/JSS_1313_comments.txt papers/jss/response-to-reviewers.tex Log: added referee report; started a point-by-point reply --- which will need a lot of work still Added: papers/jss/JSS_1313_comments.txt =================================================================== --- papers/jss/JSS_1313_comments.txt (rev 0) +++ papers/jss/JSS_1313_comments.txt 2014-12-15 03:01:40 UTC (rev 931) @@ -0,0 +1,224 @@ +This submission is important, but needs some work on both the paper and +the software before it can be accepted. The authors should address the +concerns of the two reviewers (below). + + +Overall, I think this is a strong paper. Cross-language communication +is a challenging problem, and good solutions for R are important to +establish R as a well-behaved member of a data analysis pipeline. The +paper is well written, and I recommend that it be accepted subject to +the suggestions below. + +# More big picture, less details + +Overall, I think the paper provides too much detail on relatively +unimportant topics and not enough on the reasoning behind important +design decisions. I think you could comfortably reduce the paper by +5-10 pages, referring the interested reader to the documentation for +more detail. + +I'd recommend shrinking section 3 to ~2 pages, and removing the +subheadings. This section should quickly orient the reader to the +RProtobuf API so they understand the big picture before learning more +details in the subsequent sections. I'd recommend picking one OO style +and sticking to it in this section - two is confusing. + +Section 4 dives into the details without giving a good overview and +motivation. Why use S4 and not RC? How are the objects made mutable? +Why do you provide both generic function and message passing OO +styles? What does `$` do in this context? What the heck is a +pseudo-method? Spend more time on those big issues rather than +describing each class in detail. Reduce class descriptions to a +bulleted list giving a high-level overview, then encourage the reader +to refer to the documentation for further details. Similarly, Tables +3-5 belong in the documentation, not in a vignette/paper. + +Section 7 is weak. I think the important message is that RProtobuf is +being used in practice at large scale for for large data, and is +useful for communicating between R and Python. 
How can you make that
+message stronger while avoiding (for the purposes of this paper) the
+relatively unimportant details of the map-reduce setup?
+
+# R <-> Protobuf translation
+
+The discussion of R <-> Protobuf could be improved. Table 9 would be
+much simpler if instead of Message, you provided a "vectorised"
+Messages class (this would also make the interface more consistent and
+hence the package easier to use).
+
+Along these lines, I think it would make sense to combine sections 5
+and 6 and discuss translation challenges in both direction
+simultaneously. At the minimum, add the equivalent for Table 9 that
+shows how important R classes are converted to their protobuf
+equivalents.
+
+You should discuss how missing values are handled for strings and
+integers, and why enums are not equivalent to factors. I think you
+could make explicit how coercion of factors, dates, times and matrices
+occurs, and the implications of this on sharing data structures
+between programming languages. For example, how do you share date/time
+data between R and python using RProtoBuf?
+
+Table 10 is dying to be a plot, and a natural companion would be to
+show how long it takes to serialise data frames using both RProtoBuf
+and R's native serialisation. Is there a performance penalty to using
+protobufs?
+
+# RObjectTables magic
+
+The use of RObjectTables magic makes me uneasy. It doesn't seem like a
+good fit for an infrastructure package and it's not clear what
+advantages it has over explicitly loading a protobuf definition into
+an object.
+
+Using global state makes understanding code much harder. In Table 1,
+it's not obvious where `tutorial.Person` comes from. Is it loaded by
+default by RProtobuf? This need some explanation. In Section 7, what
+does `readProtoFiles()` do? Why does `RProtobuf` need to be attached
+as well as `HistogramTools`? This needs more explanation, and a
+comment on the implications of this approach on CRAN packages and
+namespaces.
+
+I'd prefer you eliminate this magic from the magic, but failing that,
+you need a good explanation of why.
+
+# Code comments
+
+* Using `file.create()` to determine the absolute path seems like a bad
+idea.
+
+
+# Minor niggles
+
+* Don't refer to the message passing style of OO as traditional.
+
+* In Section 3.4, if messages isn't a vectorised class, the default
+  print method should use `cat()` to eliminate the confusing `[1]`.
+
+* The REXP definition would have been better defined using an enum that
+  matches R's SEXPTYPE "enum". But I guess that ship has sailed.
+
+* Why does `serialize_pb(CO2, NULL)` fail silently? Shouldn't it at least
+  warn that the serialization is partial?
+
+
+-------------------------------------------------------
+-------------------------------------------------------
+
+
+
+The paper gives an overview of the RProtoBuf package which implements an
+R interface to the Protocol Buffers library for an efficient
+serialization of objects. The paper is well written and easy to read.
+Introductory code is clear and the package provides objects to play with
+immediately without the need to jump through hoops, making it appealing.
+The software implementation is executed well.
+
+There are, however, a few inconsistencies in the implementation and some
+issues with specific sections in the paper. In the following both issues
+will be addressed sequentially by their occurrence in the paper.
+
+
+p.4 illustrates the use of messages.
The class implements list-like +access via [[ and $, but strangely names() return NULL and length() +doesn't correspond to the number of fields leading to startling results like + + > p +[1] "message of type 'tutorial.Person' with 2 fields set" + > length(p) +[1] 2 + > p[[3]] +[1] "" + +The inconsistencies get even more bizarre with descriptors (p.9): + + > tutorial.Person$email +[1] "descriptor for field 'email' of type 'tutorial.Person' " + > tutorial.Person[["email"]] +Error in tutorial.Person[["email"]] : this S4 class is not subsettable + > names(tutorial.Person) +NULL + > length(tutorial.Person) +[1] 1 + +It appears that there is no way to find out the fields of a descriptor +directly (although the low-level object methods seem to be exposed as +$field_count() and $fields() - but that seems extremely cumbersome). +Again, implementing names() and subsetting may help here. + +Another inconsistency concerns the as.list() method which by design +coerces objects to lists (see ?as.list), but the implementation for +EnumDescriptor breaks that contract and returns a vector instead: + + > is.list(as.list(tutorial.Person$PhoneType)) +[1] FALSE + > str(as.list(tutorial.Person$PhoneType)) + Named int [1:3] 0 1 2 + - attr(*, "names")= chr [1:3] "MOBILE" "HOME" "WORK" + +As with the other interfaces, names() returns NULL so it is again quite +difficult to perform even simple operations such as finding out the +values. It may be natural use some of the standard methods like names(), +levels() or similar. As with the previous cases, the lack of [[ support +makes it impossible to map named enum values to codes and vice-versa. + +In general, the package would benefit from one pass of checks to assess +the consistency of the API. Since the authors intend direct interaction +with the objects via basic standard R methods, the classes should behave +consistently. + +Finally, most classes implement coercion to characters, which is not +mentioned and is not quite intuitive for some objects. For example, one +may think that as.character() on a file descriptor returns let's say the +filename, but we get: + + > cat(as.character(tutorial.Person$fileDescriptor())) +syntax = "proto2"; + +package tutorial; + +option java_package = "com.example.tutorial"; +option java_outer_classname = "AddressBookProtos"; +[...] + +It is not necessary clear what java_package has to do with a file +descriptor in R. Depending on the intention here, it may be useful to +explain this feature. + +Other comments: + +p.17: "does not support ... function, language or environment. Such +objects have no native equivalent type in Protocol Buffers, and have +little meaning outside the context or R" +That is certainly false. Native mirror of environments are hash tables - +a very useful type indeed. Language objects are just lists, so there is +no reason to not include them - they can be useful to store expressions +that may not be necessary specific to R. Further on p. 18 your run into +the same problem that could be fixed so easily. + +The examples in sections 7 and 8 are somewhat weak. It does not seem +clear why one would wish to unleash the power of PB just to transfer +breaks and counts for plotting - even a simple ASCII file would do that +just fine. The main point in the example is presumably that there are +code generation methods for Hadoop based on PB IDL such that Hadoop can +be made aware of the data types, thus making a histogram a proper record +that won't be split, can be combined etc. 
-- yet that is not mentioned
+nor a way presented how that can be leveraged in practice. The Python
+example code simply uses a static example with constants to simulate the
+output of a reducer so it doesn't illustrate the point - the reader is
+left confused why something as trivial would require PB while a savvy
+reader is not able to replicate the illustrated process. Possibly
+explaining the benefits and providing more details on how one would
+write such a job would make it much more relevant.
+
+Section 8 is not very well motivated. It is much easier to use other
+formats for HTTP exchange - JSON is probably the most popular, but even
+CSV works in simple settings. PB is a much less common standard. The
+main advantage of PB is the performance over the alternatives, but HTTP
+services are not necessarily known for their high-throughput so why one
+would sacrifice interoperability by using PB (they are still more hassle
+and require special installations)? It would be useful if the reason
+could be made explicit here or a better example chosen.
+
+

Added: papers/jss/response-to-reviewers.tex
===================================================================
--- papers/jss/response-to-reviewers.tex	                        (rev 0)
+++ papers/jss/response-to-reviewers.tex	2014-12-15 03:01:40 UTC (rev 931)
@@ -0,0 +1,301 @@
+
+\documentclass[10pt]{article}
+\usepackage{url}
+\usepackage{vmargin}
+\setpapersize{USletter}
+% left top right bottom -- headheight headsep footheight footskip
+\setmarginsrb{1in}{1in}{1in}{0.5in}{0pt}{0mm}{10pt}{0.5in}
+\usepackage{charter}
+
+\setlength{\parskip}{1ex plus1ex minus1ex}
+\setlength{\parindent}{0pt}
+
+\newcommand{\proglang}[1]{\textsf{#1}}
+\newcommand{\pkg}[1]{{\fontseries{b}\selectfont #1}}
+
+\newcommand{\pointRaised}[2]{\smallskip %\hrule
+  \textsl{{\fontseries{b}\selectfont #1}: #2}\newline}
+\newcommand{\simplePointRaised}[1]{\bigskip \hrule\textsl{#1} }
+\newcommand{\reply}[1]{\textbf{Reply}:\ #1 \smallskip } %\hrule \smallskip}
+
+\begin{document}
+
+\author{Dirk Eddelbuettel\\Debian Project \and
+  Murray Stokely\\Google, Inc \and
+  Jeroen Ooms\\UCLA}
+\title{Submission JSS 1313: \\ Response to Reviewers' Comments}
+\maketitle
+\thispagestyle{empty}
+
+Thank you for reviewing our manuscript, and for giving us an opportunity to
+rewrite, extend and tighten both the paper and the underlying package.
+
+\smallskip
+We truly appreciate the comments and suggestions. Below, we have regrouped the sets
+of comments, and have provided detailed point-by-point replies.
+%
+We hope that this satisfies the request for changes necessary to proceed with
+the publication of the revised and updated manuscript, along with the revised
+and updated package (which was recently resubmitted to CRAN as version 0.4.2).
+
+\section*{Response to Reviewer \#1}
+
+\pointRaised{Comment 1}{Overall, I think this is a strong paper. Cross-language communication
+  is a challenging problem, and good solutions for R are important to
+  establish R as a well-behaved member of a data analysis pipeline. The
+  paper is well written, and I recommend that it be accepted subject to
+  the suggestions below.}
+\reply{Thank you. We are providing a point-by-point reply below.}
+
+\subsubsection*{More big picture, less details}
+
+\pointRaised{Comment 2}{Overall, I think the paper provides too much detail on
+  relatively unimportant topics and not enough on the reasoning behind
+  important design decisions. I think you could comfortably reduce the paper
I think you could comfortably reduce the paper + by 5-10 pages, referring the interested reader to the documentation for + more detail.} +\reply{The paper was rewritten throughout and is now much tighter at just 23 pages.} + +\pointRaised{Comment 3}{I'd recommend shrinking section 3 to ~2 pages, and removing the + subheadings. This section should quickly orient the reader to the + RProtobuf API so they understand the big picture before learning more + details in the subsequent sections. I'd recommend picking one OO style + and sticking to it in this section - two is confusing.} +\reply{We followed this recommendation and reduced section 3 to about 2 1/2 pages.} + +\pointRaised{Comment 3}{Section 4 dives into the details without giving a good overview and + motivation. Why use S4 and not RC? How are the objects made mutable? + Why do you provide both generic function and message passing OO + styles? What does \$ do in this context? What the heck is a + pseudo-method? Spend more time on those big issues rather than + describing each class in detail. Reduce class descriptions to a + bulleted list giving a high-level overview, then encourage the reader + to refer to the documentation for further details. Similarly, Tables + 3-5 belong in the documentation, not in a vignette/paper.} +\reply{Done. TO BE EXPANDED} + +\pointRaised{Comment 4}{Section 7 is weak. I think the important message is that RProtobuf is + being used in practice at large scale for for large data, and is + useful for communicating between R and Python. How can you make that + message stronger while avoiding (for the purposes of this paper) the + relatively unimportant details of the map-reduce setup?} +\reply{TBD} + +\subsubsection*{R to/from Protobuf translation} + +\pointRaised{Comment 5}{The discussion of R to/from Protobuf could be improved. Table 9 would be + much simpler if instead of Message, you provided a "vectorised" + Messages class (this would also make the interface more consistent and + hence the package easier to use).} +\reply{TBD} + +\pointRaised{Comment 6}{Along these lines, I think it would make sense to combine sections 5 + and 6 and discuss translation challenges in both direction + simultaneously. At the minimum, add the equivalent for Table 9 that + shows how important R classes are converted to their protobuf + equivalents.} +\reply{TBD} + +\pointRaised{Comment 7}{You should discuss how missing values are handled for strings and + integers, and why enums are not equivalent to factors. I think you + could make explicit how coercion of factors, dates, times and matrices + occurs, and the implications of this on sharing data structures + between programming languages. For example, how do you share date/time + data between R and python using RProtoBuf?} +\reply{TBD} + +\pointRaised{Comment 8}{Table 10 is dying to be a plot, and a natural companion would be to + show how long it takes to serialise data frames using both RProtoBuf + and R's native serialisation. Is there a performance penalty to using + protobufs?} +\reply{TBD} + +\subsubsection*{RObjectTables magic} + +\pointRaised{Comment 9}{The use of RObjectTables magic makes me uneasy. It doesn't seem like a + good fit for an infrastructure package and it's not clear what + advantages it has over explicitly loading a protobuf definition into + an object.} +\reply{TBD} + +\pointRaised{Comment 10}{Using global state makes understanding code much harder. In Table 1, + it's not obvious where \texttt{tutorial.Person} comes from. 
Is it loaded by + default by RProtobuf? This need some explanation. In Section 7, what + does \texttt{readProtoFiles()} do? Why does \texttt{RProtobuf} need to be attached + as well as \texttt{HistogramTools}? This needs more explanation, and a + comment on the implications of this approach on CRAN packages and + namespaces.} +\reply{TBD} + +\pointRaised{Comment 11}{ + I'd prefer you eliminate this magic from the magic, but failing that, + you need a good explanation of why.} +\reply{TBD} + +\subsubsection*{Code comments} + +\pointRaised{Comment 12}{Using \texttt{file.create()} to determine the absolute path seems like a bad idea.} +\reply{TBD} + + +\subsubsection*{Minor niggles} + +\pointRaised{Comment 13}{Don't refer to the message passing style of OO as traditional.} +\reply{TBD} + +\pointRaised{Comment 14}{In Section 3.4, if messages isn't a vectorised class, the default + print method should use \texttt{cat()} to eliminate the confusing \texttt{[1]}.} +\reply{TBD} + +\pointRaised{Comment 15}{The REXP definition would have been better defined using an enum that + matches R's SEXPTYPE "enum". But I guess that ship has sailed.} +\reply{TBD} + +\pointRaised{Comment 16}{Why does \texttt{serialize\_pb(CO2, NULL)} fail silently? Shouldn't it at least + warn that the serialization is partial?} +\reply{TBD} + + +\section*{Response to Reviewer \#2} + +\pointRaised{Comment 1}{The paper gives an overview of the RProtoBuf package which implements an + R interface to the Protocol Buffers library for an efficient + serialization of objects. The paper is well written and easy to read. + Introductory code is clear and the package provides objects to play with + immediately without the need to jump through hoops, making it appealing. + The software implementation is executed well.} +\reply{Thank you.} + +\pointRaised{Comment 2}{There are, however, a few inconsistencies in the implementation and some + issues with specific sections in the paper. In the following both issues + will be addressed sequentially by their occurrence in the paper.} +\reply{TBD} + +\pointRaised{Comment 3}{p.4 illustrates the use of messages. The class implements list-like + access via \texttt{[[} and \$, but strangely \texttt{names()} return NULL and \texttt{length() } + doesn't correspond to the number of fields leading to startling results like +the following:} + +\begin{verbatim} + > p +[1] "message of type 'tutorial.Person' with 2 fields set" + > length(p) +[1] 2 + > p[[3]] +[1] "" +\end{verbatim} +\reply{TBD} + +\pointRaised{Comment 3 cont.}{The inconsistencies get even more bizarre with descriptors (p.9):} + +\begin{verbatim} + > tutorial.Person$email +[1] "descriptor for field 'email' of type 'tutorial.Person' " + > tutorial.Person[["email"]] +Error in tutorial.Person[["email"]] : this S4 class is not subsettable + > names(tutorial.Person) +NULL + > length(tutorial.Person) +[1] 1 +\end{verbatim} +\reply{TBD} + +\pointRaised{Comment 3 cont.}{It appears that there is no way to find out the fields of a descriptor + directly (although the low-level object methods seem to be exposed as + \texttt{\$field\_count()} and \texttt{\$fields()} - but that seems extremely cumbersome). 
+ Again, implementing names() and subsetting may help here.} +\reply{TBD} + +\pointRaised{Comment 4}{Another inconsistency concerns the \texttt{as.list()} method which by design + coerces objects to lists (see \texttt{?as.list}), but the implementation for + EnumDescriptor breaks that contract and returns a vector instead:} + +\begin{verbatim} + > is.list(as.list(tutorial.Person$PhoneType)) +[1] FALSE + > str(as.list(tutorial.Person$PhoneType)) + Named int [1:3] 0 1 2 + - attr(*, "names")= chr [1:3] "MOBILE" "HOME" "WORK" +\end{verbatim} + +\pointRaised{Comment 4 cont}{As with the other interfaces, names() returns NULL so it is again quite + difficult to perform even simple operations such as finding out the + values. It may be natural use some of the standard methods like names(), + levels() or similar. As with the previous cases, the lack of [[ support + makes it impossible to map named enum values to codes and vice-versa.} +\reply{TBD} + +\pointRaised{Comment 5}{In general, the package would benefit from one pass of checks to assess + the consistency of the API. Since the authors intend direct interaction + with the objects via basic standard R methods, the classes should behave + consistently.} +\reply{TBD} + +\pointRaised{Comment 6}{Finally, most classes implement coercion to characters, which is not + mentioned and is not quite intuitive for some objects. For example, one + may think that as.character() on a file descriptor returns let's say the + filename, but we get:} + +\begin{verbatim} + > cat(as.character(tutorial.Person$fileDescriptor())) +syntax = "proto2"; + +package tutorial; + +option java_package = "com.example.tutorial"; +option java_outer_classname = "AddressBookProtos"; +[...] +\end{verbatim} +\reply{TBD} + +\pointRaised{Comment 7}{It is not necessary clear what java\_package has to do with a file + descriptor in R. Depending on the intention here, it may be useful to + explain this feature. +} +\reply{TBD} + +\subsubsection*{Other comments:} + +\pointRaised{Comment 8}{p.17: "does not support ... function, language or environment. Such + objects have no native equivalent type in Protocol Buffers, and have + little meaning outside the context or R" + That is certainly false. Native mirror of environments are hash tables - + a very useful type indeed. Language objects are just lists, so there is + no reason to not include them - they can be useful to store expressions + that may not be necessary specific to R. Further on p. 18 your run into + the same problem that could be fixed so easily.} +\reply{TBD} + +\pointRaised{Comment 9}{The examples in sections 7 and 8 are somewhat weak. It does not seem + clear why one would wish to unleash the power of PB just to transfer + breaks and counts for plotting - even a simple ASCII file would do that + just fine. The main point in the example is presumably that there are + code generation methods for Hadoop based on PB IDL such that Hadoop can + be made aware of the data types, thus making a histogram a proper record + that won't be split, can be combined etc. -- yet that is not mentioned + nor a way presented how that can be leveraged in practice. The Python + example code simply uses a static example with constants to simulate the + output of a reducer so it doesn't illustrate the point - the reader is + left confused why something as trivial would require PB while a savvy + reader is not able to replicate the illustrated process. 
Possibly + explaining the benefits and providing more details on how one would + write such a job would make it much more relevant.} +\reply{TBD} + + +\pointRaised{Comment 10}{Section 8 is not very well motivated. It is much easier to use other + formats for HTTP exchange - JSON is probably the most popular, but even + CSV works in simple settings. PB is a much less common standard. The + main advantage of PB is the performance over the alternatives, but HTTP + services are not necessarily known for their high-throughput so why one + would sacrifice interoperability by using PB (they are still more hassle + and require special installations)? It would be useful if the reason + could be made explicit here or a better example chosen.} +\reply{TBD} + +\end{document} + +%%% Local Variables: +%%% mode: latex +%%% TeX-master: t +%%% End: From noreply at r-forge.r-project.org Mon Dec 15 19:52:34 2014 From: noreply at r-forge.r-project.org (noreply at r-forge.r-project.org) Date: Mon, 15 Dec 2014 19:52:34 +0100 (CET) Subject: [Rprotobuf-commits] r932 - papers/jss Message-ID: <20141215185234.E8D351876E8@r-forge.r-project.org> Author: murray Date: 2014-12-15 19:52:34 +0100 (Mon, 15 Dec 2014) New Revision: 932 Modified: papers/jss/response-to-reviewers.tex Log: Add more point to point replies. Still working. I can mostly finish this up today. Modified: papers/jss/response-to-reviewers.tex =================================================================== --- papers/jss/response-to-reviewers.tex 2014-12-15 03:01:40 UTC (rev 931) +++ papers/jss/response-to-reviewers.tex 2014-12-15 18:52:34 UTC (rev 932) @@ -72,14 +72,20 @@ bulleted list giving a high-level overview, then encourage the reader to refer to the documentation for further details. Similarly, Tables 3-5 belong in the documentation, not in a vignette/paper.} -\reply{Done. TO BE EXPANDED} +\reply{Done. RProtoBuf was designed and implemented before RC were + available, and this is noted in a footnote now. Explanation of how + they are made mutable haas been added. Better explanation of the + two styles and '\$' as been added, while no longer using the + confusing term + 'pseudo-method' anywhere. Moved Tables 3-5 into the documentation + and out of the paper, as suggested.} \pointRaised{Comment 4}{Section 7 is weak. I think the important message is that RProtobuf is being used in practice at large scale for for large data, and is useful for communicating between R and Python. How can you make that message stronger while avoiding (for the purposes of this paper) the relatively unimportant details of the map-reduce setup?} -\reply{TBD} +\reply{Done. Rewritten with more motivation taking into account this feedback.} \subsubsection*{R to/from Protobuf translation} @@ -87,15 +93,29 @@ much simpler if instead of Message, you provided a "vectorised" Messages class (this would also make the interface more consistent and hence the package easier to use).} -\reply{TBD} +\reply{This is an area for future work and is a space explored in + another package called Motobuf by other authors.} \pointRaised{Comment 6}{Along these lines, I think it would make sense to combine sections 5 and 6 and discuss translation challenges in both direction simultaneously. 
At the minimum, add the equivalent for Table 9 that shows how important R classes are converted to their protobuf equivalents.} -\reply{TBD} +\reply{We have updated these sections to make it clearer that the main + distinction is between schema-based datastructures (section 5) and + schema-less use where a catch-all .proto is used (section 6). + Neither section is meant to focus on only a single direction of the + conversion, but how conversion works when you have a schema or not. + How important R classes are converted to their protobuf equivalents + isn't super useful as a C++, Java, or Python program is unlikely to + want to read in an R data.frame exactly as it is defined. Much more + likely is an application-specific message format is defined between the + two services, such as the HistogramTools example in the next section. + Much more detail has been added to an interesting part of section 6 -- + which datasets exactly are better served with RProtoBuf than + base::serialize and why?} + \pointRaised{Comment 7}{You should discuss how missing values are handled for strings and integers, and why enums are not equivalent to factors. I think you could make explicit how coercion of factors, dates, times and matrices @@ -108,7 +128,16 @@ show how long it takes to serialise data frames using both RProtoBuf and R's native serialisation. Is there a performance penalty to using protobufs?} -\reply{TBD} +\reply{Table 10 has been replaced with a plot, the outliers are + labeled, and the text now includes some interesting explanation + about the outliers. Page 4 explains that the R implementation of + protocol buffers uses reflection to make operations slower but makes + it more convenient for interactive data analysis. None of the + built-in datasets are large enough for performance to really come up + as an issue, and for any serialization method examples could be + found that significantly favor one over another, so we don't think + there will be benefit to adding anything here. +} \subsubsection*{RObjectTables magic} @@ -116,7 +145,8 @@ good fit for an infrastructure package and it's not clear what advantages it has over explicitly loading a protobuf definition into an object.} -\reply{TBD} +\reply{More information about the advantages and disadvantages of this + approach have been added.} \pointRaised{Comment 10}{Using global state makes understanding code much harder. In Table 1, it's not obvious where \texttt{tutorial.Person} comes from. Is it loaded by @@ -125,19 +155,23 @@ as well as \texttt{HistogramTools}? 
This needs more explanation, and a comment on the implications of this approach on CRAN packages and namespaces.} -\reply{TBD} +\reply{We followed this recommendation and added explanation for how +\texttt{tutorial.Person} is loaded, specifically : \emph{A small number of message types are imported when the +package is first loaded, including the tutorial.Person type we saw in +the last section.} We removed the superfluous attach of \texttt{RProtoBuf}.} \pointRaised{Comment 11}{ I'd prefer you eliminate this magic from the magic, but failing that, you need a good explanation of why.} -\reply{TBD} +\reply{We've added more explanation about this.} \subsubsection*{Code comments} \pointRaised{Comment 12}{Using \texttt{file.create()} to determine the absolute path seems like a bad idea.} -\reply{TBD} +\reply{We followed this recommendation and removed two instances of + \texttt{file.create()} for this purpose with calls to + \texttt{normalizePath} with \texttt{mustWork=FALSE}.} - \subsubsection*{Minor niggles} \pointRaised{Comment 13}{Don't refer to the message passing style of OO as traditional.} From noreply at r-forge.r-project.org Mon Dec 15 22:46:52 2014 From: noreply at r-forge.r-project.org (noreply at r-forge.r-project.org) Date: Mon, 15 Dec 2014 22:46:52 +0100 (CET) Subject: [Rprotobuf-commits] r933 - papers/jss Message-ID: <20141215214652.17AEA1859B0@r-forge.r-project.org> Author: murray Date: 2014-12-15 22:46:51 +0100 (Mon, 15 Dec 2014) New Revision: 933 Modified: papers/jss/response-to-reviewers.tex Log: More point-by-point responses. Modified: papers/jss/response-to-reviewers.tex =================================================================== --- papers/jss/response-to-reviewers.tex 2014-12-15 18:52:34 UTC (rev 932) +++ papers/jss/response-to-reviewers.tex 2014-12-15 21:46:51 UTC (rev 933) @@ -158,7 +158,8 @@ \reply{We followed this recommendation and added explanation for how \texttt{tutorial.Person} is loaded, specifically : \emph{A small number of message types are imported when the package is first loaded, including the tutorial.Person type we saw in -the last section.} We removed the superfluous attach of \texttt{RProtoBuf}.} +the last section.} Thank you also for spotting the superfluous attach +of \texttt{RProtoBuf}, it has been removed from the example.} \pointRaised{Comment 11}{ I'd prefer you eliminate this magic from the magic, but failing that, @@ -175,21 +176,26 @@ \subsubsection*{Minor niggles} \pointRaised{Comment 13}{Don't refer to the message passing style of OO as traditional.} -\reply{TBD} +\reply{Done, we don't refer to this style as traditional anywhere in + the manuscript anymore.} \pointRaised{Comment 14}{In Section 3.4, if messages isn't a vectorised class, the default print method should use \texttt{cat()} to eliminate the confusing \texttt{[1]}.} -\reply{TBD} +\reply{Done} \pointRaised{Comment 15}{The REXP definition would have been better defined using an enum that matches R's SEXPTYPE "enum". But I guess that ship has sailed.} -\reply{TBD} +\reply{Acknowledged. We chose to maintain compatibility with RHIPE here. The main +use of RProtoBuf is not with rexp.proto however -- it with +application-specific schemas in .proto files for sending data between +applications. Users that want to do something very R-specific are +welcome to use their own \texttt{.proto} files with an enum to represent R SEXPTYPEs.} \pointRaised{Comment 16}{Why does \texttt{serialize\_pb(CO2, NULL)} fail silently? 
Shouldn't it at least warn that the serialization is partial?} -\reply{TBD} +\reply{Fixed, \texttt{serialize\_pb} now works for all built-in datatypes in R + and no longer fails silently if it encounters something it can't serialize.} - \section*{Response to Reviewer \#2} \pointRaised{Comment 1}{The paper gives an overview of the RProtoBuf package which implements an @@ -203,7 +209,8 @@ \pointRaised{Comment 2}{There are, however, a few inconsistencies in the implementation and some issues with specific sections in the paper. In the following both issues will be addressed sequentially by their occurrence in the paper.} -\reply{TBD} +\reply{These and others have been identified and addressed. Thank you + for taking the time to enumerate these issues.} \pointRaised{Comment 3}{p.4 illustrates the use of messages. The class implements list-like access via \texttt{[[} and \$, but strangely \texttt{names()} return NULL and \texttt{length() } @@ -218,7 +225,21 @@ > p[[3]] [1] "" \end{verbatim} -\reply{TBD} +\reply{We've corrected the list-like accessor, fixed \texttt{length()} to + correspond to the number of set fields, and added \texttt{names()}:} +\begin{verbatim} +> p +message of type 'tutorial.Person' with 0 fields set +> length(p) +[1] 0 +> p[[3]] +[1] "" +> p$id <- 1 +> length(p) +[1] 1 +> names(p) +[1] "name" "id" "email" "phone" +\end{verbatim} \pointRaised{Comment 3 cont.}{The inconsistencies get even more bizarre with descriptors (p.9):} @@ -232,13 +253,31 @@ > length(tutorial.Person) [1] 1 \end{verbatim} -\reply{TBD} +\reply{We agree, and have addressed this inconsistency. Thank you:} +\begin{verbatim} +> tutorial.Person$email +descriptor for field 'email' of type 'tutorial.Person' +> tutorial.Person[["email"]] +descriptor for field 'email' of type 'tutorial.Person' +> names(tutorial.Person) +[1] "name" "id" "email" "phone" "PhoneNumber" +[6] "PhoneType" +> length(tutorial.Person) +[1] 6 +\end{verbatim} \pointRaised{Comment 3 cont.}{It appears that there is no way to find out the fields of a descriptor directly (although the low-level object methods seem to be exposed as \texttt{\$field\_count()} and \texttt{\$fields()} - but that seems extremely cumbersome). Again, implementing names() and subsetting may help here.} -\reply{TBD} +\reply{\texttt{names} and subsetting implemented. Thank you for the + suggestion.:} +\begin{verbatim} +> tutorial.Person[[1]] +descriptor for field 'name' of type 'tutorial.Person' +> tutorial.Person[[2]] +descriptor for field 'id' of type 'tutorial.Person' +\end{verbatim} \pointRaised{Comment 4}{Another inconsistency concerns the \texttt{as.list()} method which by design coerces objects to lists (see \texttt{?as.list}), but the implementation for @@ -252,18 +291,36 @@ - attr(*, "names")= chr [1:3] "MOBILE" "HOME" "WORK" \end{verbatim} +\reply{Fixed, thank you. New output:} +\begin{verbatim} +> is.list(as.list(tutorial.Person$PhoneType)) +[1] TRUE +> str(as.list(tutorial.Person$PhoneType)) +List of 3 + $ MOBILE: int 0 + $ HOME : int 1 + $ WORK : int 2 +\end{verbatim} + \pointRaised{Comment 4 cont}{As with the other interfaces, names() returns NULL so it is again quite difficult to perform even simple operations such as finding out the values. It may be natural use some of the standard methods like names(), levels() or similar. As with the previous cases, the lack of [[ support makes it impossible to map named enum values to codes and vice-versa.} -\reply{TBD} +\reply{Fixed, thank you. 
New output:}
+\begin{verbatim}
+> names(tutorial.Person$PhoneType)
+[1] "MOBILE" "HOME" "WORK"
+> tutorial.Person$PhoneType[["HOME"]]
+[1] 1
+\end{verbatim}
 
 \pointRaised{Comment 5}{In general, the package would benefit from one pass of checks to assess
   the consistency of the API. Since the authors intend direct interaction
   with the objects via basic standard R methods, the classes should behave
   consistently.}
-\reply{TBD}
+\reply{We made several passes, correcting issues as documented in
+  \texttt{ChangeLog} and now present in our latest 0.4.2 release on CRAN.}
 
 \pointRaised{Comment 6}{Finally, most classes implement coercion to characters, which is not
   mentioned and is not quite intuitive for some objects. For example, one
@@ -280,13 +337,29 @@
 option java_outer_classname = "AddressBookProtos";
 [...]
 \end{verbatim}
-\reply{TBD}
+\reply{In choosing the debug output for a file descriptor we agree
+  that \texttt{filename} is a reasonable thing to expect, but we also
+  think that the contents of the \texttt{.proto} file is also
+  reasonable, and also more useful. We document this in
+  ``FileDescriptor-class'', the vignette, and other sources.
+  \texttt{@filename} is one of the slots of the FileDescriptor class
+  and so very easy to find. The contents of the \texttt{.proto} are
+  not as easily accessible in a slot, however, and so we find it much
+  more useful to be output with \texttt{as.character()}.}
 
 \pointRaised{Comment 7}{It is not necessary clear what java\_package has to do with a file
   descriptor in R. Depending on the intention here, it may be useful to
   explain this feature.
 }
-\reply{TBD}
+\reply{This snippet has been removed as part of the general move of
+  less relevant details to the package documentation, but for
+  reference the \texttt{.proto} file syntax is defined in the Protocol Buffers
+  language guide which is referenced earlier. It is a cross-platform
+  library and so this syntax specifies some parameters when Java code
+  is used to access the structures defined in this file. No such
+  special syntax is required in the \texttt{.proto} files for R
+  language code and so this line about java\_package was not relevant
+  or needed in any way for RProtoBuf and is documented elsewhere.}
 
 \subsubsection*{Other comments:}
 
@@ -298,7 +371,14 @@
   no reason to not include them - they can be useful to store expressions
   that may not be necessary specific to R. Further on p. 18 your run into
   the same problem that could be fixed so easily.}
-\reply{TBD}
+\reply{You are right. Environments are more than just hash
+  tables because they include other configuration parameters that are
+  necessary to serialize as well to make sure
+  serialization/unserialization is idempotent, but we agree it is
+  cleaner for both the package and the exposition in the paper to just make
+  sure we serialize everything. We can now fall back to
+  \texttt{base::serialize} and storing the bits in a rawString type of
+  RProtoBuf to make the R schema-less serialization more complete.}
 
 \pointRaised{Comment 9}{The examples in sections 7 and 8 are somewhat weak.
It does not seem clear why one would wish to unleash the power of PB just to transfer From noreply at r-forge.r-project.org Tue Dec 16 02:18:04 2014 From: noreply at r-forge.r-project.org (noreply at r-forge.r-project.org) Date: Tue, 16 Dec 2014 02:18:04 +0100 (CET) Subject: [Rprotobuf-commits] r934 - papers/jss Message-ID: <20141216011804.9FAC3187794@r-forge.r-project.org> Author: murray Date: 2014-12-16 02:18:04 +0100 (Tue, 16 Dec 2014) New Revision: 934 Modified: papers/jss/response-to-reviewers.tex Log: Address remaining points in referee feedback. Modified: papers/jss/response-to-reviewers.tex =================================================================== --- papers/jss/response-to-reviewers.tex 2014-12-15 21:46:51 UTC (rev 933) +++ papers/jss/response-to-reviewers.tex 2014-12-16 01:18:04 UTC (rev 934) @@ -54,14 +54,17 @@ important design decisions. I think you could comfortably reduce the paper by 5-10 pages, referring the interested reader to the documentation for more detail.} -\reply{The paper was rewritten throughout and is now much tighter at just 23 pages.} +\reply{The paper is now 6-pages much tighter at just 23 pages. + Sections 3 - 8 (all but sec 1 introduction, sec 2 protocol buffers, + and sec 9 conclusion have been rewritten to address the specific and + general feedback in these reviews)} \pointRaised{Comment 3}{I'd recommend shrinking section 3 to ~2 pages, and removing the subheadings. This section should quickly orient the reader to the RProtobuf API so they understand the big picture before learning more details in the subsequent sections. I'd recommend picking one OO style and sticking to it in this section - two is confusing.} -\reply{We followed this recommendation and reduced section 3 to about 2 1/2 pages.} +\reply{We followed this recommendation and reduced section 3 to about $2\frac{1}{2}$ pages.} \pointRaised{Comment 3}{Section 4 dives into the details without giving a good overview and motivation. Why use S4 and not RC? How are the objects made mutable? @@ -74,10 +77,10 @@ 3-5 belong in the documentation, not in a vignette/paper.} \reply{Done. RProtoBuf was designed and implemented before RC were available, and this is noted in a footnote now. Explanation of how - they are made mutable haas been added. Better explanation of the - two styles and '\$' as been added, while no longer using the + they are made mutable has been added. Better explanation of the + two styles and '\$' as been added. We are no longer using the confusing term - 'pseudo-method' anywhere. Moved Tables 3-5 into the documentation + 'pseudo-method' anywhere. We moved Tables 3-5 into the documentation and out of the paper, as suggested.} \pointRaised{Comment 4}{Section 7 is weak. I think the important message is that RProtobuf is @@ -93,15 +96,27 @@ much simpler if instead of Message, you provided a "vectorised" Messages class (this would also make the interface more consistent and hence the package easier to use).} -\reply{This is an area for future work and is a space explored in - another package called Motobuf by other authors.} +\reply{This is a good observation that only became clear to us after + significant usage of \texttt{RProtoBuf}. Providing a full ``vectorized'' Messages class would require slicing + operators that let you quickly extract a given field from each + element of the message vector in order to be really useful. 
This + would require significant amounts of C++ code for efficient + manipulation on the order of data.table or other similar large C++ R + packages on CRAN. There is another package called Motobuf by other authors + that takes this approach but in practice, at Google at least, the + ease-of-use provided by the simple Message interface of RProtoBuf + has won with users. It is still future work to keep the simple + interactive interface of RProtoBuf with the vectorized efficiency of + Motobuf. For now, users typically do their slicing of vectors like + this through a distributed database (NewSQL is the term of the day?) + like Dremel or other system and then just get the response Protocol + Buffers in return to the request.} \pointRaised{Comment 6}{Along these lines, I think it would make sense to combine sections 5 and 6 and discuss translation challenges in both direction simultaneously. At the minimum, add the equivalent for Table 9 that shows how important R classes are converted to their protobuf equivalents.} - \reply{We have updated these sections to make it clearer that the main distinction is between schema-based datastructures (section 5) and schema-less use where a catch-all .proto is used (section 6). @@ -122,7 +137,13 @@ occurs, and the implications of this on sharing data structures between programming languages. For example, how do you share date/time data between R and python using RProtoBuf?} -\reply{TBD} +\reply{All of these details are application-specific, whereas + RProtoBuf is an infrastructure package. Distributed systems define + their own interfaces, with their own date/time fields, usually as + int64s of fractional seconds since the unix epoch for the systems I + have worked on. An example is given for Histograms in the next + section. Factors could be represented as repeated enums in protocol + buffers, certainly, if that is how one wanted to define a schema.} \pointRaised{Comment 8}{Table 10 is dying to be a plot, and a natural companion would be to show how long it takes to serialise data frames using both RProtoBuf @@ -135,9 +156,8 @@ it more convenient for interactive data analysis. None of the built-in datasets are large enough for performance to really come up as an issue, and for any serialization method examples could be - found that significantly favor one over another, so we don't think - there will be benefit to adding anything here. -} + found that significantly favor one over another in runtime, so we + don't think there will be benefit to adding anything here. } \subsubsection*{RObjectTables magic} @@ -181,13 +201,13 @@ \pointRaised{Comment 14}{In Section 3.4, if messages isn't a vectorised class, the default print method should use \texttt{cat()} to eliminate the confusing \texttt{[1]}.} -\reply{Done} +\reply{Done, thanks.} \pointRaised{Comment 15}{The REXP definition would have been better defined using an enum that matches R's SEXPTYPE "enum". But I guess that ship has sailed.} \reply{Acknowledged. We chose to maintain compatibility with RHIPE here. The main -use of RProtoBuf is not with rexp.proto however -- it with -application-specific schemas in .proto files for sending data between +use of RProtoBuf is not with \texttt{rexp.proto} however -- it with +application-specific schemas in \texttt{.proto} files for sending data between applications. 
Users that want to do something very R-specific are welcome to use their own \texttt{.proto} files with an enum to represent R SEXPTYPEs.} @@ -324,7 +344,7 @@ \pointRaised{Comment 6}{Finally, most classes implement coercion to characters, which is not mentioned and is not quite intuitive for some objects. For example, one - may think that as.character() on a file descriptor returns let's say the + may think that \texttt{as.character()} on a file descriptor returns let's say the filename, but we get:} \begin{verbatim} @@ -337,10 +357,12 @@ option java_outer_classname = "AddressBookProtos"; [...] \end{verbatim} -\reply{In choosing the debug output for a file descriptor we agree +\reply{The behavior is documented in the package documentation but + seemed like a minor detail not important for an already-long paper. + In choosing the debug output for a file descriptor we agree that \texttt{filename} is a reasonable thing to expect, but we also think that the contents of the \texttt{.proto} file is also - reasonable, and also more useful. We document this in + reasonable, but more useful. We document this in ``FileDescriptor-class'', the vignette, and other sources. \texttt{@filename} is one of the slots of the FileDescriptor class and so very easy to find. The contents of the \texttt{.proto} are @@ -394,9 +416,17 @@ reader is not able to replicate the illustrated process. Possibly explaining the benefits and providing more details on how one would write such a job would make it much more relevant.} -\reply{TBD} +\reply{Yes, we added more detail about the advantages of using a + proper data type for the histograms in this example that you mentioned here -- the + ability to write combiners, prevent arbitrary splitting of the + records, etc that can greatly improve performance. We agree with + the other reviewer that we don't want to get bogged down in details + about a particular MapReduce implementation (such as Hadoop) and so + now we specifically mention that goal here. + I think we make a better connection now between the + abstract MapReduce example given, and then the simpler Python + example code with a static example.} - \pointRaised{Comment 10}{Section 8 is not very well motivated. It is much easier to use other formats for HTTP exchange - JSON is probably the most popular, but even CSV works in simple settings. PB is a much less common standard. The @@ -405,7 +435,17 @@ would sacrifice interoperability by using PB (they are still more hassle and require special installations)? It would be useful if the reason could be made explicit here or a better example chosen.} -\reply{TBD} +\reply{This section has been reworded to make it shorter and more + crisp, with fewer extraneous details about OpenCPU. +Protocol + Buffers is an efficient protocol used between distributed systems at + many of the world's largest internet companies (Twitter, Sony, + Google, etc.) but the design and implementation of a large + enterprise-scale distributed system with a complex RPC system and + serialization needs is well beyond the scope of what we can add to a + paper about RProtoBuf. 
We chose this example because it is a much + more accessible example that any reader can use to easily + send/receive RPCs and parse the results with RProtoBuf.} \end{document} From noreply at r-forge.r-project.org Wed Dec 17 00:02:00 2014 From: noreply at r-forge.r-project.org (noreply at r-forge.r-project.org) Date: Wed, 17 Dec 2014 00:02:00 +0100 (CET) Subject: [Rprotobuf-commits] r935 - papers/jss Message-ID: <20141216230200.7B3021877C7@r-forge.r-project.org> Author: edd Date: 2014-12-17 00:02:00 +0100 (Wed, 17 Dec 2014) New Revision: 935 Added: papers/jss/article-submitted-2014-03.pdf Log: initial submission Added: papers/jss/article-submitted-2014-03.pdf =================================================================== (Binary files differ) Property changes on: papers/jss/article-submitted-2014-03.pdf ___________________________________________________________________ Added: svn:mime-type + application/octet-stream From noreply at r-forge.r-project.org Wed Dec 17 03:04:32 2014 From: noreply at r-forge.r-project.org (noreply at r-forge.r-project.org) Date: Wed, 17 Dec 2014 03:04:32 +0100 (CET) Subject: [Rprotobuf-commits] r936 - papers/jss Message-ID: <20141217020432.5B3EA1878AB@r-forge.r-project.org> Author: edd Date: 2014-12-17 03:04:25 +0100 (Wed, 17 Dec 2014) New Revision: 936 Modified: papers/jss/response-to-reviewers.tex Log: Halfway done another pass. This is coming together very well too. Modified: papers/jss/response-to-reviewers.tex =================================================================== --- papers/jss/response-to-reviewers.tex 2014-12-16 23:02:00 UTC (rev 935) +++ papers/jss/response-to-reviewers.tex 2014-12-17 02:04:25 UTC (rev 936) @@ -54,17 +54,18 @@ important design decisions. I think you could comfortably reduce the paper by 5-10 pages, referring the interested reader to the documentation for more detail.} -\reply{The paper is now 6-pages much tighter at just 23 pages. - Sections 3 - 8 (all but sec 1 introduction, sec 2 protocol buffers, - and sec 9 conclusion have been rewritten to address the specific and - general feedback in these reviews)} +\reply{The paper is now six pages shorter at just 23 pages. + Sections 3 - 8 (all but Section 1 (``Introduction''), Section 2 (``Protocol Buffers''), + and Section 9 (``Conclusion'') have been thoroughly rewritten to address the specific and + general feedback in these reviews.} \pointRaised{Comment 3}{I'd recommend shrinking section 3 to ~2 pages, and removing the subheadings. This section should quickly orient the reader to the RProtobuf API so they understand the big picture before learning more details in the subsequent sections. I'd recommend picking one OO style and sticking to it in this section - two is confusing.} -\reply{We followed this recommendation and reduced section 3 to about $2\frac{1}{2}$ pages.} +\reply{We followed this recommendation, reduced section 3 to about + $2\frac{1}{2}$ pages, removed the subheadings and tightened the exposition.} \pointRaised{Comment 3}{Section 4 dives into the details without giving a good overview and motivation. Why use S4 and not RC? How are the objects made mutable? @@ -76,12 +77,11 @@ to refer to the documentation for further details. Similarly, Tables 3-5 belong in the documentation, not in a vignette/paper.} \reply{Done. RProtoBuf was designed and implemented before RC were - available, and this is noted in a footnote now. Explanation of how + available, and this is now noted explicitly in a new footnote. 
Explanation of how they are made mutable has been added. Better explanation of the two styles and '\$' as been added. We are no longer using the - confusing term - 'pseudo-method' anywhere. We moved Tables 3-5 into the documentation - and out of the paper, as suggested.} + confusing term 'pseudo-method' anywhere. We also moved Tables 3-5 into the + documentation and out of the paper, as suggested.} \pointRaised{Comment 4}{Section 7 is weak. I think the important message is that RProtobuf is being used in practice at large scale for for large data, and is @@ -103,8 +103,8 @@ would require significant amounts of C++ code for efficient manipulation on the order of data.table or other similar large C++ R packages on CRAN. There is another package called Motobuf by other authors - that takes this approach but in practice, at Google at least, the - ease-of-use provided by the simple Message interface of RProtoBuf + that takes this approach but in practice (at least for the several hundred + users at Google), the ease-of-use provided by the simple Message interface of RProtoBuf has won with users. It is still future work to keep the simple interactive interface of RProtoBuf with the vectorized efficiency of Motobuf. For now, users typically do their slicing of vectors like @@ -117,9 +117,9 @@ simultaneously. At the minimum, add the equivalent for Table 9 that shows how important R classes are converted to their protobuf equivalents.} -\reply{We have updated these sections to make it clearer that the main - distinction is between schema-based datastructures (section 5) and - schema-less use where a catch-all .proto is used (section 6). +\reply{Done. We have updated these sections to make it clearer that the main + distinction is between schema-based datastructures (Section 5) and + schema-less use where a catch-all \texttt{.proto} is used (Section 6). Neither section is meant to focus on only a single direction of the conversion, but how conversion works when you have a schema or not. How important R classes are converted to their protobuf equivalents @@ -129,7 +129,7 @@ two services, such as the HistogramTools example in the next section. Much more detail has been added to an interesting part of section 6 -- which datasets exactly are better served with RProtoBuf than - base::serialize and why?} + \texttt{base::serialize} and why?} \pointRaised{Comment 7}{You should discuss how missing values are handled for strings and integers, and why enums are not equivalent to factors. I think you @@ -140,19 +140,19 @@ \reply{All of these details are application-specific, whereas RProtoBuf is an infrastructure package. Distributed systems define their own interfaces, with their own date/time fields, usually as - int64s of fractional seconds since the unix epoch for the systems I + a double of fractional seconds since the unix epoch for the systems I have worked on. An example is given for Histograms in the next - section. Factors could be represented as repeated enums in protocol - buffers, certainly, if that is how one wanted to define a schema.} + section. Factors could be represented as repeated enums in Protocol + Buffers, certainly, if that is how one wanted to define a schema.} \pointRaised{Comment 8}{Table 10 is dying to be a plot, and a natural companion would be to show how long it takes to serialise data frames using both RProtoBuf and R's native serialisation. Is there a performance penalty to using protobufs?} -\reply{Table 10 has been replaced with a plot, the outliers are +\reply{Done. 
Table 10 has been replaced with a plot, the outliers are
   labeled, and the text now includes some interesting explanation
   about the outliers. Page 4 explains that the R implementation of
-  protocol buffers uses reflection to make operations slower but makes
+  Protocol Buffers uses reflection, which makes operations slower but
   more convenient for interactive data analysis. None of the
   built-in datasets are large enough for performance to really come
   up as an issue, and for any serialization method examples could be
@@ -165,7 +165,7 @@
   good fit for an infrastructure package and it's not clear what
   advantages it has over explicitly loading a protobuf definition
   into an object.}
-\reply{More information about the advantages and disadvantages of this
+\reply{Done. More information about the advantages and disadvantages of this
   approach have been added.}
 
 \pointRaised{Comment 10}{Using global state makes understanding code much harder. In Table 1,
@@ -175,28 +175,28 @@
 as well as \texttt{HistogramTools}? This needs more explanation, and a
 comment on the implications of this approach on CRAN packages and
 namespaces.}
-\reply{We followed this recommendation and added explanation for how
-\texttt{tutorial.Person} is loaded, specifically : \emph{A small number of message types are imported when the
-package is first loaded, including the tutorial.Person type we saw in
-the last section.} Thank you also for spotting the superfluous attach
-of \texttt{RProtoBuf}, it has been removed from the example.}
+\reply{Done. We followed this recommendation and added explanation for how
+  \texttt{tutorial.Person} is loaded, specifically: \emph{A small number of message types are imported when the
+  package is first loaded, including the tutorial.Person type we saw in
+  the last section.} Thank you also for spotting the superfluous attach
+  of \texttt{RProtoBuf}; it has been removed from the example.}
 
 \pointRaised{Comment 11}{ I'd prefer you eliminate this magic from the magic, but failing that,
 you need a good explanation of why.}
-\reply{We've added more explanation about this.}
+\reply{Done. We've added more explanation about this.}
 
 \subsubsection*{Code comments}
 \pointRaised{Comment 12}{Using \texttt{file.create()} to determine the absolute path seems like
 a bad idea.}
-\reply{We followed this recommendation and removed two instances of
+\reply{Done. We followed this recommendation and replaced two instances of
   \texttt{file.create()} for this purpose with calls to
   \texttt{normalizePath} with \texttt{mustWork=FALSE}.}
 
 \subsubsection*{Minor niggles}
 
 \pointRaised{Comment 13}{Don't refer to the message passing style of OO as traditional.}
-\reply{Done, we don't refer to this style as traditional anywhere in
+\reply{Done. We don't refer to this style as traditional anywhere in
 the manuscript anymore.}
 
 \pointRaised{Comment 14}{In Section 3.4, if messages isn't a vectorised class, the default
@@ -213,7 +213,7 @@
 
 \pointRaised{Comment 16}{Why does \texttt{serialize\_pb(CO2, NULL)} fail silently? Shouldn't it
 at least warn that the serialization is partial?}
-\reply{Fixed, \texttt{serialize\_pb} now works for all built-in datatypes in R
+\reply{Done. 
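As a brief illustration (a sketch assuming a current release of the
  package, not output reproduced from its tests), the round trip on the
  dataset the reviewer mentions can be checked as follows:
\begin{verbatim}
library("RProtoBuf")
msg <- serialize_pb(CO2, NULL)        # previously dropped attributes silently
identical(unserialize_pb(msg), CO2)   # expected to be TRUE after the fix
\end{verbatim}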
We fixed this and \texttt{serialize\_pb} now works for all built-in datatypes in R and no longer fails silently if it encounters something it can't serialize.} \section*{Response to Reviewer \#2} From noreply at r-forge.r-project.org Wed Dec 17 03:34:24 2014 From: noreply at r-forge.r-project.org (noreply at r-forge.r-project.org) Date: Wed, 17 Dec 2014 03:34:24 +0100 (CET) Subject: [Rprotobuf-commits] r937 - papers/jss Message-ID: <20141217023425.0822C1878E8@r-forge.r-project.org> Author: edd Date: 2014-12-17 03:34:24 +0100 (Wed, 17 Dec 2014) New Revision: 937 Modified: papers/jss/response-to-reviewers.tex Log: a few more edits Modified: papers/jss/response-to-reviewers.tex =================================================================== --- papers/jss/response-to-reviewers.tex 2014-12-17 02:04:25 UTC (rev 936) +++ papers/jss/response-to-reviewers.tex 2014-12-17 02:34:24 UTC (rev 937) @@ -229,7 +229,7 @@ \pointRaised{Comment 2}{There are, however, a few inconsistencies in the implementation and some issues with specific sections in the paper. In the following both issues will be addressed sequentially by their occurrence in the paper.} -\reply{These and others have been identified and addressed. Thank you +\reply{Done. These and others have been identified and addressed. Thank you for taking the time to enumerate these issues.} \pointRaised{Comment 3}{p.4 illustrates the use of messages. The class implements list-like @@ -245,7 +245,7 @@ > p[[3]] [1] "" \end{verbatim} -\reply{We've corrected the list-like accessor, fixed \texttt{length()} to +\reply{Done. We have corrected the list-like accessor, fixed \texttt{length()} to correspond to the number of set fields, and added \texttt{names()}:} \begin{verbatim} > p @@ -273,7 +273,8 @@ > length(tutorial.Person) [1] 1 \end{verbatim} -\reply{We agree, and have addressed this inconsistency. Thank you:} +\reply{Done. We agree, and have addressed this inconsistency. Thank you for + catching this.} \begin{verbatim} > tutorial.Person$email descriptor for field 'email' of type 'tutorial.Person' @@ -290,8 +291,8 @@ directly (although the low-level object methods seem to be exposed as \texttt{\$field\_count()} and \texttt{\$fields()} - but that seems extremely cumbersome). Again, implementing names() and subsetting may help here.} -\reply{\texttt{names} and subsetting implemented. Thank you for the - suggestion.:} +\reply{Done. We have implemented \texttt{names} and subsetting. Thank you for the + suggestion.} \begin{verbatim} > tutorial.Person[[1]] descriptor for field 'name' of type 'tutorial.Person' @@ -311,7 +312,7 @@ - attr(*, "names")= chr [1:3] "MOBILE" "HOME" "WORK" \end{verbatim} -\reply{Fixed, thank you. New output:} +\reply{Done, thank you. New output below:} \begin{verbatim} > is.list(as.list(tutorial.Person$PhoneType)) [1] TRUE @@ -327,7 +328,7 @@ values. It may be natural use some of the standard methods like names(), levels() or similar. As with the previous cases, the lack of [[ support makes it impossible to map named enum values to codes and vice-versa.} -\reply{Fixed, thank you. New output:} +\reply{Done, thank you. New output:} \begin{verbatim} > names(tutorial.Person$PhoneType) [1] "MOBILE" "HOME" "WORK" @@ -339,7 +340,7 @@ the consistency of the API. 
Since the authors intend direct
 interaction with the objects via basic standard R methods, the
 classes should behave consistently.}
-\reply{We made several passes, correcting issues as documented in
+\reply{We made several passes, correcting issues as documented in the
   \texttt{ChangeLog} and now present in our latest 0.4.2 release on CRAN.}
 
 \pointRaised{Comment 6}{Finally, most classes implement coercion to characters, which is not
@@ -362,7 +363,7 @@
   In choosing the debug output for a file descriptor we agree that
   \texttt{filename} is a reasonable thing to expect, but we also
   think that the contents of the \texttt{.proto} file is also
-  reasonable, but more useful. We document this in
+  reasonable, but more useful. We document this in the help for
   ``FileDescriptor-class'', the vignette, and other sources.
   \texttt{@filename} is one of the slots of the FileDescriptor class
   and so very easy to find. The contents of the \texttt{.proto} are
@@ -373,7 +374,7 @@
   descriptor in R. Depending on the intention here, it may be useful
   to explain this feature. }
-\reply{This snippet has been removed as part of the general move of
+\reply{Done. This snippet has been removed as part of the general move of
   less relevant details to the package documentation, but for reference
   the \texttt{.proto} file syntax is defined in the Protocol Buffers
   language guide which is referenced earlier. It is a cross platform
@@ -393,13 +394,13 @@
 no reason to not include them - they can be useful to store
 expressions that may not be necessary specific to R. Further on p. 18
 your run into the same problem that could be fixed so easily.}
-\reply{You are right. Environments are more than just hash
+\reply{Acknowledged. Environments are more than just hash
   tables because they include other configuration parameters that are
   necessary to serialize as well to make sure
   serialization/unserialization is idempotent, but we agree it is
   cleaner for both the package and the exposition in the paper to just
   make sure we serialize everything. We can now fall back to
-  \texttt{base::serialize} and storing the bits in a rawString type of
+  \texttt{base::serialize()} and store the bits in a rawString type of
   RProtoBuf to make the R schema-less serialization more complete.}
 
 \pointRaised{Comment 9}{The examples in sections 7 and 8 are somewhat weak. It does not seem
@@ -435,9 +436,8 @@
 would sacrifice interoperability by using PB (they are still more
 hassle and require special installations)? It would be useful if the
 reason could be made explicit here or a better example chosen.}
-\reply{This section has been reworded to make it shorter and more
-  crisp, with fewer extraneous details about OpenCPU.
-Protocol
+\reply{Done. This section has been reworded to make it shorter and more
+  crisp, with fewer extraneous details about OpenCPU. Protocol
 Buffers is an efficient protocol used between distributed systems at
 many of the world's largest internet companies (Twitter, Sony,
 Google, etc.) but the design and implementation of a large