[Rprotobuf-commits] r946 - in papers/jss: . JSSemails
noreply at r-forge.r-project.org
Mon Apr 13 21:53:43 CEST 2015
Author: edd
Date: 2015-04-13 21:53:43 +0200 (Mon, 13 Apr 2015)
New Revision: 946
Added:
papers/jss/JSSemails/article-submitted-2014-03.pdf
papers/jss/JSSemails/response-to-reviewers.tex
Removed:
papers/jss/article-submitted-2014-03.pdf
papers/jss/response-to-reviewers.tex
Log:
some file cleanups, part two
Copied: papers/jss/JSSemails/article-submitted-2014-03.pdf (from rev 943, papers/jss/article-submitted-2014-03.pdf)
===================================================================
(Binary files differ)
Copied: papers/jss/JSSemails/response-to-reviewers.tex (from rev 943, papers/jss/response-to-reviewers.tex)
===================================================================
--- papers/jss/JSSemails/response-to-reviewers.tex (rev 0)
+++ papers/jss/JSSemails/response-to-reviewers.tex 2015-04-13 19:53:43 UTC (rev 946)
@@ -0,0 +1,455 @@
+
+\documentclass[10pt]{article}
+\usepackage{url}
+\usepackage{vmargin}
+\setpapersize{USletter}
+% left top right bottom -- headheight headsep footheight footskip
+\setmarginsrb{1in}{1in}{1in}{0.5in}{0pt}{0mm}{10pt}{0.5in}
+\usepackage{charter}
+
+\setlength{\parskip}{1ex plus1ex minus1ex}
+\setlength{\parindent}{0pt}
+
+\newcommand{\proglang}[1]{\textsf{#1}}
+\newcommand{\pkg}[1]{{\fontseries{b}\selectfont #1}}
+
+\newcommand{\pointRaised}[2]{\smallskip %\hrule
+ \textsl{{\fontseries{b}\selectfont #1}: #2}\newline}
+\newcommand{\simplePointRaised}[1]{\bigskip \hrule\textsl{#1} }
+\newcommand{\reply}[1]{\textbf{Reply}:\ #1 \smallskip } %\hrule \smallskip}
+
+\begin{document}
+
+\author{Dirk Eddelbuettel\\Debian Project \and
+ Murray Stokely\\Google, Inc \and
+ Jeroen Ooms\\UCLA}
+\title{Submission JSS 1313: \\ Response to Reviewers' Comments}
+\maketitle
+\thispagestyle{empty}
+
+Thank you for reviewing our manuscript, and for giving us an opportunity to
+rewrite, extend, and tighten both the paper and the underlying package.
+
+\smallskip
+We truly appreciate the comments and suggestions. Below, we have grouped the
+comments by theme and provided detailed point-by-point replies.
+%
+We hope that this satisfies the request for changes necessary to proceed with
+the publication of the revised and updated manuscript, along with the revised
+and updated package (which was recently resubmitted to CRAN as version 0.4.2).
+
+\section*{Response to Reviewer \#1}
+
+\pointRaised{Comment 1}{Overall, I think this is a strong paper. Cross-language communication
+ is a challenging problem, and good solutions for R are important to
+ establish R as a well-behaved member of a data analysis pipeline. The
+ paper is well written, and I recommend that it be accepted subject to
+ the suggestions below.}
+\reply{Thank you. We are providing a point-by-point reply below.}
+
+\subsubsection*{More big picture, less details}
+
+\pointRaised{Comment 2}{Overall, I think the paper provides too much detail on
+ relatively unimportant topics and not enough on the reasoning behind
+ important design decisions. I think you could comfortably reduce the paper
+ by 5-10 pages, referring the interested reader to the documentation for
+ more detail.}
+\reply{The paper is now six pages shorter, at just 23 pages.
+  Sections 3--8 (all but Section 1, ``Introduction'', Section 2, ``Protocol Buffers'',
+  and Section 9, ``Conclusion'') have been thoroughly rewritten to address the specific and
+  general feedback in these reviews.}
+
+\pointRaised{Comment 3}{I'd recommend shrinking section 3 to $\sim$2 pages, and removing the
+ subheadings. This section should quickly orient the reader to the
+ RProtobuf API so they understand the big picture before learning more
+ details in the subsequent sections. I'd recommend picking one OO style
+ and sticking to it in this section - two is confusing.}
+\reply{We followed this recommendation, reduced section 3 to about
+ $2\frac{1}{2}$ pages, removed the subheadings and tightened the exposition.}
+
+\pointRaised{Comment 4}{Section 4 dives into the details without giving a good overview and
+ motivation. Why use S4 and not RC? How are the objects made mutable?
+ Why do you provide both generic function and message passing OO
+ styles? What does \$ do in this context? What the heck is a
+ pseudo-method? Spend more time on those big issues rather than
+ describing each class in detail. Reduce class descriptions to a
+ bulleted list giving a high-level overview, then encourage the reader
+ to refer to the documentation for further details. Similarly, Tables
+ 3-5 belong in the documentation, not in a vignette/paper.}
+\reply{Done. RProtoBuf was designed and implemented before Reference Classes (RC) were
+  available, and this is now noted explicitly in a new footnote. An explanation of how
+  the objects are made mutable has been added, as has a better explanation of the
+  two styles and of `\$'. We no longer use the
+  confusing term `pseudo-method' anywhere. We also moved Tables 3--5 into the
+  documentation and out of the paper, as suggested.}
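+
+For concreteness, the two styles and the role of `\$' can be contrasted in a few
+lines (a minimal sketch; the field values are of course arbitrary):
+\begin{verbatim}
+> p <- new(tutorial.Person, id = 1, name = "Ada")
+> p$name                          # '$' reads a field
+[1] "Ada"
+> bytes1 <- p$serialize(NULL)     # message-passing style
+> bytes2 <- serialize(p, NULL)    # generic-function style
+> identical(bytes1, bytes2)
+[1] TRUE
+\end{verbatim}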
+
+\pointRaised{Comment 5}{Section 7 is weak. I think the important message is that RProtobuf is
+  being used in practice at large scale for large data, and is
+ useful for communicating between R and Python. How can you make that
+ message stronger while avoiding (for the purposes of this paper) the
+ relatively unimportant details of the map-reduce setup?}
+\reply{Done. We have rewritten this section with stronger motivation, taking this feedback into account.}
+
+\subsubsection*{R to/from Protobuf translation}
+
+\pointRaised{Comment 6}{The discussion of R to/from Protobuf could be improved. Table 9 would be
+ much simpler if instead of Message, you provided a "vectorised"
+ Messages class (this would also make the interface more consistent and
+ hence the package easier to use).}
+\reply{This is a good observation that only became clear to us after
+  significant usage of \texttt{RProtoBuf}. To be really useful, a full ``vectorized''
+  Messages class would require slicing operators that let you quickly
+  extract a given field from each element of the message vector. This
+  would require significant amounts of C++ code for efficient
+  manipulation, on the order of data.table or other similarly large C++-based R
+  packages on CRAN. Another package, Motobuf, by other authors takes
+  this approach, but in practice (at least for the several hundred
+  users at Google) the ease of use provided by the simple Message interface of RProtoBuf
+  has won users over. Combining the simple
+  interactive interface of RProtoBuf with the vectorized efficiency of
+  Motobuf remains future work. For now, users typically perform this kind of
+  slicing through a distributed database (``NewSQL'' being the term of the day)
+  such as Dremel or a similar system, and then simply receive the resulting Protocol
+  Buffers in response to the request.}
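+
+In the meantime, a field can be sliced out of an ordinary list of messages with
+base R apply idioms, which is usually sufficient interactively (a minimal sketch):
+\begin{verbatim}
+> msgs <- lapply(1:3, function(i) new(tutorial.Person, id = i))
+> sapply(msgs, function(m) m$id)   # extract one field from each message
+[1] 1 2 3
+\end{verbatim}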
+
+\pointRaised{Comment 7}{Along these lines, I think it would make sense to combine sections 5
+  and 6 and discuss translation challenges in both directions
+ simultaneously. At the minimum, add the equivalent for Table 9 that
+ shows how important R classes are converted to their protobuf
+ equivalents.}
+\reply{Done. We have updated these sections to make it clearer that the main
+  distinction is between schema-based data structures (Section 5) and
+  schema-less use where a catch-all \texttt{.proto} is used (Section 6).
+  Neither section is meant to focus on only a single direction of the
+  conversion, but rather on how conversion works with or without a schema.
+  A table showing how important R classes are converted to their protobuf
+  equivalents would not be very useful, as a C++, Java, or Python program is
+  unlikely to want to read in an R data.frame exactly as it is defined in R.
+  Much more likely, an application-specific message format is defined between the
+  two services, such as the HistogramTools example in the next section.
+  Much more detail has been added to an interesting part of Section 6:
+  which datasets exactly are better served by RProtoBuf than by
+  \texttt{base::serialize}, and why.}
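+
+The distinction can be summarized in a few lines (a minimal sketch;
+\texttt{iris} merely stands in for an arbitrary R object):
+\begin{verbatim}
+> p <- new(tutorial.Person, id = 1)   # schema-based: application-defined type
+> bytes <- p$serialize(NULL)
+> msg <- serialize_pb(iris, NULL)     # schema-less: catch-all rexp.proto
+> df <- unserialize_pb(msg)           # back to an R object
+\end{verbatim}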
+
+\pointRaised{Comment 8}{You should discuss how missing values are handled for strings and
+ integers, and why enums are not equivalent to factors. I think you
+ could make explicit how coercion of factors, dates, times and matrices
+ occurs, and the implications of this on sharing data structures
+ between programming languages. For example, how do you share date/time
+ data between R and python using RProtoBuf?}
+\reply{All of these details are application-specific, whereas
+  RProtoBuf is an infrastructure package. Distributed systems define
+  their own interfaces, with their own date/time fields, usually as
+  a double of fractional seconds since the Unix epoch in the systems we
+  have worked on. An example is given for histograms in the next
+  section. Factors could certainly be represented as repeated enums in Protocol
+  Buffers, if that is how one wanted to define a schema.}
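+
+For example, a schema that stores timestamps as a double of fractional seconds
+since the Unix epoch maps directly onto R using base functions only (a minimal
+sketch; the field name mentioned in the comment is hypothetical):
+\begin{verbatim}
+> secs <- as.numeric(Sys.time())   # fractional seconds since 1970-01-01 UTC
+> # store 'secs' in a schema field such as 'optional double timestamp = 1;'
+> back <- as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")
+\end{verbatim}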
+
+\pointRaised{Comment 9}{Table 10 is dying to be a plot, and a natural companion would be to
+ show how long it takes to serialise data frames using both RProtoBuf
+ and R's native serialisation. Is there a performance penalty to using
+ protobufs?}
+\reply{Done. Table 10 has been replaced with a plot, the outliers are
+  labeled, and the text now includes some interesting explanation
+  of the outliers. Page 4 explains that the R implementation of
+  Protocol Buffers uses reflection, which makes operations slower but
+  more convenient for interactive data analysis. None of the
+  built-in datasets is large enough for performance to really become
+  an issue, and for any serialization method examples could be
+  found that significantly favor it over another in runtime, so we
+  do not think there would be a benefit to adding timings here.}
+
+\subsubsection*{RObjectTables magic}
+
+\pointRaised{Comment 10}{The use of RObjectTables magic makes me uneasy. It doesn't seem like a
+ good fit for an infrastructure package and it's not clear what
+ advantages it has over explicitly loading a protobuf definition into
+ an object.}
+\reply{Done. More information about the advantages and disadvantages of this
+  approach has been added.}
+
+\pointRaised{Comment 11}{Using global state makes understanding code much harder. In Table 1,
+  it's not obvious where \texttt{tutorial.Person} comes from. Is it loaded by
+  default by RProtobuf? This needs some explanation. In Section 7, what
+ does \texttt{readProtoFiles()} do? Why does \texttt{RProtobuf} need to be attached
+ as well as \texttt{HistogramTools}? This needs more explanation, and a
+ comment on the implications of this approach on CRAN packages and
+ namespaces.}
+\reply{Done. We followed this recommendation and added an explanation of how
+  \texttt{tutorial.Person} is loaded, specifically: \emph{A small number of message types are imported when the
+  package is first loaded, including the tutorial.Person type we saw in
+  the last section.} Thank you also for spotting the superfluous attach
+  of \texttt{RProtoBuf}; it has been removed from the example.}
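+
+Additional message definitions can be loaded explicitly at any time; a minimal
+sketch (the file name is hypothetical):
+\begin{verbatim}
+> library(RProtoBuf)                 # imports tutorial.Person among others
+> readProtoFiles("myschema.proto")   # register further application-specific types
+> p <- new(tutorial.Person, id = 1)
+\end{verbatim}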
+
+\pointRaised{Comment 12}{
+  I'd prefer you eliminate this magic from the package, but failing that,
+ you need a good explanation of why.}
+\reply{Done. We've added more explanation about this.}
+
+\subsubsection*{Code comments}
+
+\pointRaised{Comment 13}{Using \texttt{file.create()} to determine the absolute path seems like a bad idea.}
+\reply{Done. We followed this recommendation and replaced two instances of
+  \texttt{file.create()} used for this purpose with calls to
+  \texttt{normalizePath} with \texttt{mustWork=FALSE}.}
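+
+For reference, the replacement looks as follows (a sketch; the file name is illustrative):
+\begin{verbatim}
+> # before: file.create() was called only to learn the absolute path
+> absPath <- normalizePath("addressbook.pb", mustWork = FALSE)
+\end{verbatim}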
+
+\subsubsection*{Minor niggles}
+
+\pointRaised{Comment 14}{Don't refer to the message passing style of OO as traditional.}
+\reply{Done. We no longer refer to this style as traditional anywhere in
+  the manuscript.}
+
+\pointRaised{Comment 15}{In Section 3.4, if messages isn't a vectorised class, the default
+ print method should use \texttt{cat()} to eliminate the confusing \texttt{[1]}.}
+\reply{Done, thanks.}
+
+\pointRaised{Comment 16}{The REXP definition would have been better defined using an enum that
+ matches R's SEXPTYPE "enum". But I guess that ship has sailed.}
+\reply{Acknowledged. We chose to maintain compatibility with RHIPE here. The main
+use of RProtoBuf is not with \texttt{rexp.proto}, however -- it is with
+application-specific schemas in \texttt{.proto} files for sending data between
+applications. Users who want to do something very R-specific are
+welcome to use their own \texttt{.proto} files with an enum to represent R SEXPTYPEs.}
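+
+Such a user-defined schema could, for instance, start with an enum mirroring R's
+SEXP types (a hypothetical sketch, not part of \texttt{rexp.proto} or RProtoBuf):
+\begin{verbatim}
+// hypothetical user-defined schema
+enum SEXPTYPE {
+  NILSXP  = 0;   // NULL
+  SYMSXP  = 1;   // symbols
+  LISTSXP = 2;   // pairlists
+  // ... remaining SEXP types ...
+}
+\end{verbatim}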
+
+\pointRaised{Comment 17}{Why does \texttt{serialize\_pb(CO2, NULL)} fail silently? Shouldn't it at least
+ warn that the serialization is partial?}
+\reply{Done. We fixed this and \texttt{serialize\_pb} now works for all built-in datatypes in R
+ and no longer fails silently if it encounters something it can't serialize.}
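+
+For instance, the previously problematic dataset now round-trips; a minimal
+sketch of the expected behavior described above:
+\begin{verbatim}
+> msg <- serialize_pb(CO2, NULL)   # no longer drops fields silently
+> co2 <- unserialize_pb(msg)
+> identical(dim(co2), dim(CO2))
+[1] TRUE
+\end{verbatim}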
+
+\section*{Response to Reviewer \#2}
+
+\pointRaised{Comment 1}{The paper gives an overview of the RProtoBuf package which implements an
+ R interface to the Protocol Buffers library for an efficient
+ serialization of objects. The paper is well written and easy to read.
+ Introductory code is clear and the package provides objects to play with
+ immediately without the need to jump through hoops, making it appealing.
+ The software implementation is executed well.}
+\reply{Thank you.}
+
+\pointRaised{Comment 2}{There are, however, a few inconsistencies in the implementation and some
+ issues with specific sections in the paper. In the following both issues
+ will be addressed sequentially by their occurrence in the paper.}
+\reply{Done. These and others have been identified and addressed. Thank you
+ for taking the time to enumerate these issues.}
+
+\pointRaised{Comment 3}{p.4 illustrates the use of messages. The class implements list-like
+  access via \texttt{[[} and \$, but strangely \texttt{names()} returns NULL and \texttt{length()}
+  doesn't correspond to the number of fields, leading to startling results like
+the following:}
+
+\begin{verbatim}
+ > p
+[1] "message of type 'tutorial.Person' with 2 fields set"
+ > length(p)
+[1] 2
+ > p[[3]]
+[1] ""
+\end{verbatim}
+\reply{Done. We have corrected the list-like accessor, fixed \texttt{length()} to
+ correspond to the number of set fields, and added \texttt{names()}:}
+\begin{verbatim}
+> p
+message of type 'tutorial.Person' with 0 fields set
+> length(p)
+[1] 0
+> p[[3]]
+[1] ""
+> p$id <- 1
+> length(p)
+[1] 1
+> names(p)
+[1] "name" "id" "email" "phone"
+\end{verbatim}
+
+\pointRaised{Comment 3 cont.}{The inconsistencies get even more bizarre with descriptors (p.9):}
+
+\begin{verbatim}
+ > tutorial.Person$email
+[1] "descriptor for field 'email' of type 'tutorial.Person' "
+ > tutorial.Person[["email"]]
+Error in tutorial.Person[["email"]] : this S4 class is not subsettable
+ > names(tutorial.Person)
+NULL
+ > length(tutorial.Person)
+[1] 1
+\end{verbatim}
+\reply{Done. We agree, and have addressed this inconsistency. Thank you for
+ catching this.}
+\begin{verbatim}
+> tutorial.Person$email
+descriptor for field 'email' of type 'tutorial.Person'
+> tutorial.Person[["email"]]
+descriptor for field 'email' of type 'tutorial.Person'
+> names(tutorial.Person)
+[1] "name" "id" "email" "phone" "PhoneNumber"
+[6] "PhoneType"
+> length(tutorial.Person)
+[1] 6
+\end{verbatim}
+
+\pointRaised{Comment 3 cont.}{It appears that there is no way to find out the fields of a descriptor
+ directly (although the low-level object methods seem to be exposed as
+ \texttt{\$field\_count()} and \texttt{\$fields()} - but that seems extremely cumbersome).
+ Again, implementing names() and subsetting may help here.}
+\reply{Done. We have implemented \texttt{names} and subsetting. Thank you for the
+ suggestion.}
+\begin{verbatim}
+> tutorial.Person[[1]]
+descriptor for field 'name' of type 'tutorial.Person'
+> tutorial.Person[[2]]
+descriptor for field 'id' of type 'tutorial.Person'
+\end{verbatim}
+
+\pointRaised{Comment 4}{Another inconsistency concerns the \texttt{as.list()} method which by design
+ coerces objects to lists (see \texttt{?as.list}), but the implementation for
+ EnumDescriptor breaks that contract and returns a vector instead:}
+
+\begin{verbatim}
+ > is.list(as.list(tutorial.Person$PhoneType))
+[1] FALSE
+ > str(as.list(tutorial.Person$PhoneType))
+ Named int [1:3] 0 1 2
+ - attr(*, "names")= chr [1:3] "MOBILE" "HOME" "WORK"
+\end{verbatim}
+
+\reply{Done, thank you. New output below:}
+\begin{verbatim}
+> is.list(as.list(tutorial.Person$PhoneType))
+[1] TRUE
+> str(as.list(tutorial.Person$PhoneType))
+List of 3
+ $ MOBILE: int 0
+ $ HOME : int 1
+ $ WORK : int 2
+\end{verbatim}
+
+\pointRaised{Comment 4 cont.}{As with the other interfaces, names() returns NULL so it is again quite
+  difficult to perform even simple operations such as finding out the
+  values. It may be natural to use some of the standard methods like names(),
+  levels() or similar. As with the previous cases, the lack of [[ support
+  makes it impossible to map named enum values to codes and vice-versa.}
+\reply{Done, thank you. New output:}
+\begin{verbatim}
+> names(tutorial.Person$PhoneType)
+[1] "MOBILE" "HOME" "WORK"
+> tutorial.Person$PhoneType[["HOME"]]
+[1] 1
+\end{verbatim}
+
+\pointRaised{Comment 5}{In general, the package would benefit from one pass of checks to assess
+ the consistency of the API. Since the authors intend direct interaction
+ with the objects via basic standard R methods, the classes should behave
+ consistently.}
+\reply{We made several passes, correcting issues as documented in the
+  \texttt{ChangeLog}; the fixes are included in our latest 0.4.2 release on CRAN.}
+
+\pointRaised{Comment 6}{Finally, most classes implement coercion to characters, which is not
+ mentioned and is not quite intuitive for some objects. For example, one
+  may think that \texttt{as.character()} on a file descriptor returns, let's say, the
+ filename, but we get:}
+
+\begin{verbatim}
+ > cat(as.character(tutorial.Person$fileDescriptor()))
+syntax = "proto2";
+
+package tutorial;
+
+option java_package = "com.example.tutorial";
+option java_outer_classname = "AddressBookProtos";
+[...]
+\end{verbatim}
+\reply{The behavior is documented in the package documentation but
+  seemed like a minor detail not important enough for an already-long paper.
+  In choosing the debug output for a file descriptor, we agree
+  that the \texttt{filename} is a reasonable thing to expect, but we
+  consider the contents of the \texttt{.proto} file both reasonable
+  and more useful. We document this in the help for
+  ``FileDescriptor-class'', the vignette, and other sources.
+  \texttt{@filename} is one of the slots of the FileDescriptor class
+  and so very easy to find. The contents of the \texttt{.proto} file are
+  not as easily accessible in a slot, however, and so we find it much
+  more useful to output them with \texttt{as.character()}.}
+
+\pointRaised{Comment 7}{It is not necessarily clear what java\_package has to do with a file
+  descriptor in R. Depending on the intention here, it may be useful to
+  explain this feature.
+}
+\reply{Done. This snippet has been removed as part of the general move of
+  less relevant details to the package documentation, but for
+  reference the \texttt{.proto} file syntax is defined in the Protocol Buffers
+  language guide which is referenced earlier. Protocol Buffers is a
+  cross-platform library, and this syntax specifies some parameters used
+  when Java code accesses the structures defined in the file. No such
+  special syntax is required in \texttt{.proto} files used from R,
+  so this line about java\_package was not relevant
+  to RProtoBuf in any way and is documented elsewhere.}
+
+\subsubsection*{Other comments:}
+
+\pointRaised{Comment 8}{p.17: "does not support ... function, language or environment. Such
+  objects have no native equivalent type in Protocol Buffers, and have
+  little meaning outside the context of R"
+  That is certainly false. Native mirrors of environments are hash tables -
+  a very useful type indeed. Language objects are just lists, so there is
+  no reason to not include them - they can be useful to store expressions
+  that may not be necessarily specific to R. Further, on p. 18 you run into
+  the same problem that could be fixed so easily.}
+\reply{Acknowledged. Environments are more than just hash
+  tables because they include other configuration parameters that
+  also need to be serialized to make sure
+  serialization and unserialization are idempotent, but we agree it is
+  cleaner, for both the package and the exposition in the paper, to just make
+  sure we serialize everything. We now fall back to
+  \texttt{base::serialize()} and store the bits in a rawString type in
+  RProtoBuf, which makes the schema-less serialization of R objects more complete.}
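+
+The fallback itself relies only on base R; a minimal sketch of the idea (the
+resulting raw vector is what gets stored in the catch-all raw-bytes field):
+\begin{verbatim}
+> e <- new.env(); e$x <- 1
+> bits <- base::serialize(e, NULL)   # raw vector capturing the environment
+> unserialize(bits)$x
+[1] 1
+\end{verbatim}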
+
+\pointRaised{Comment 9}{The examples in sections 7 and 8 are somewhat weak. It does not seem
+ clear why one would wish to unleash the power of PB just to transfer
+ breaks and counts for plotting - even a simple ASCII file would do that
+ just fine. The main point in the example is presumably that there are
+ code generation methods for Hadoop based on PB IDL such that Hadoop can
+ be made aware of the data types, thus making a histogram a proper record
+ that won't be split, can be combined etc. -- yet that is not mentioned
+ nor a way presented how that can be leveraged in practice. The Python
+ example code simply uses a static example with constants to simulate the
+ output of a reducer so it doesn't illustrate the point - the reader is
+ left confused why something as trivial would require PB while a savvy
+ reader is not able to replicate the illustrated process. Possibly
+ explaining the benefits and providing more details on how one would
+ write such a job would make it much more relevant.}
+\reply{Yes, we have added more detail about the advantages you mention of using a
+  proper data type for the histograms in this example -- the
+  ability to write combiners, to prevent arbitrary splitting of the
+  records, and so on, which can greatly improve performance. We agree with
+  the other reviewer that we don't want to get bogged down in details
+  about a particular MapReduce implementation (such as Hadoop), and so
+  we now state that goal explicitly.
+  We think the revised text makes a better connection between the
+  abstract MapReduce example and the simpler Python
+  example code with its static example.}
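+
+To make the connection concrete, a base R histogram already carries exactly the
+two components such a record type would hold (a minimal sketch; the field names
+in the comments are illustrative, not the actual HistogramTools schema):
+\begin{verbatim}
+> h <- hist(rnorm(1000), plot = FALSE)   # base R histogram
+> breaks <- h$breaks   # would populate a 'repeated double breaks' field
+> counts <- h$counts   # would populate a 'repeated int32 counts' field
+\end{verbatim}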
+
+\pointRaised{Comment 10}{Section 8 is not very well motivated. It is much easier to use other
+ formats for HTTP exchange - JSON is probably the most popular, but even
+ CSV works in simple settings. PB is a much less common standard. The
+ main advantage of PB is the performance over the alternatives, but HTTP
+ services are not necessarily known for their high-throughput so why one
+ would sacrifice interoperability by using PB (they are still more hassle
+ and require special installations)? It would be useful if the reason
+ could be made explicit here or a better example chosen.}
+\reply{Done. This section has been reworded to make it shorter and
+  crisper, with fewer extraneous details about OpenCPU. Protocol
+  Buffers is an efficient protocol used between distributed systems at
+  many of the world's largest internet companies (Twitter, Sony,
+  Google, etc.), but the design and implementation of a large,
+  enterprise-scale distributed system with a complex RPC system and
+  serialization needs is well beyond the scope of what we can add to a
+  paper about RProtoBuf. We chose this example because it is much
+  more accessible: any reader can use it to easily
+  send and receive RPCs and parse the results with RProtoBuf.}
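+
+As an illustration of how little client code such an exchange requires, a minimal
+sketch (the endpoint URL is hypothetical and assumed to return an
+\texttt{rexp.proto}-serialized payload):
+\begin{verbatim}
+> library(httr)
+> library(RProtoBuf)
+> r <- GET("http://example.org/dataset/pb")       # hypothetical endpoint
+> obj <- unserialize_pb(content(r, as = "raw"))   # decode the payload
+\end{verbatim}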
+
+\end{document}
+
+%%% Local Variables:
+%%% mode: latex
+%%% TeX-master: t
+%%% End:
Deleted: papers/jss/article-submitted-2014-03.pdf
===================================================================
(Binary files differ)
Deleted: papers/jss/response-to-reviewers.tex
===================================================================
--- papers/jss/response-to-reviewers.tex 2015-04-13 19:51:21 UTC (rev 945)
+++ papers/jss/response-to-reviewers.tex 2015-04-13 19:53:43 UTC (rev 946)
@@ -1,455 +0,0 @@
[TRUNCATED]
To get the complete diff run:
svnlook diff /svnroot/rprotobuf -r 946