[Rprotobuf-commits] r946 - in papers/jss: . JSSemails

noreply at r-forge.r-project.org noreply at r-forge.r-project.org
Mon Apr 13 21:53:43 CEST 2015


Author: edd
Date: 2015-04-13 21:53:43 +0200 (Mon, 13 Apr 2015)
New Revision: 946

Added:
   papers/jss/JSSemails/article-submitted-2014-03.pdf
   papers/jss/JSSemails/response-to-reviewers.tex
Removed:
   papers/jss/article-submitted-2014-03.pdf
   papers/jss/response-to-reviewers.tex
Log:
some file cleanups, part two

Copied: papers/jss/JSSemails/article-submitted-2014-03.pdf (from rev 943, papers/jss/article-submitted-2014-03.pdf)
===================================================================
(Binary files differ)

Copied: papers/jss/JSSemails/response-to-reviewers.tex (from rev 943, papers/jss/response-to-reviewers.tex)
===================================================================
--- papers/jss/JSSemails/response-to-reviewers.tex	                        (rev 0)
+++ papers/jss/JSSemails/response-to-reviewers.tex	2015-04-13 19:53:43 UTC (rev 946)
@@ -0,0 +1,455 @@
+
+\documentclass[10pt]{article}
+\usepackage{url}
+\usepackage{vmargin}
+\setpapersize{USletter}
+% left top right bottom -- headheight headsep footheight footskop
+\setmarginsrb{1in}{1in}{1in}{0.5in}{0pt}{0mm}{10pt}{0.5in}
+\usepackage{charter}
+
+\setlength{\parskip}{1ex plus1ex minus1ex}
+\setlength{\parindent}{0pt}
+
+\newcommand{\proglang}[1]{\textsf{#1}}
+\newcommand{\pkg}[1]{{\fontseries{b}\selectfont #1}}
+
+\newcommand{\pointRaised}[2]{\smallskip %\hrule 
+  \textsl{{\fontseries{b}\selectfont #1}: #2}\newline}
+\newcommand{\simplePointRaised}[1]{\bigskip \hrule\textsl{#1} }
+\newcommand{\reply}[1]{\textbf{Reply}:\ #1 \smallskip } %\hrule \smallskip}
+
+\begin{document}
+
+\author{Dirk Eddelbuettel\\Debian Project \and 
+        Murray Stokely\\Google, Inc \and
+        Jeroen Ooms\\UCLA}
+\title{Submission JSS 1313: \\ Response to Reviewers' Comments}
+\maketitle 
+\thispagestyle{empty}
+
+Thank you for reviewing our manuscript, and for giving us an opportunity to
+rewrite, extend and and tighten both the paper and the underlying package.
+
+\smallskip
+We truly appreciate the comments and suggestions. Below, we have regrouped the sets
+of comments, and have provided detailed point-by-point replies.
+%
+We hope that this satisfies the request for changes necessary to proceed with
+the publication of the revised and updated manuscript, along with the revised
+and updated package (which was recently resubmitted to CRAN as version 0.4.2).
+
+\section*{Response to Reviewer \#1}
+
+\pointRaised{Comment 1}{Overall, I think this is a strong paper. Cross-language communication
+  is a challenging problem, and good solutions for R are important to
+  establish R as a well-behaved member of a data analysis pipeline. The
+  paper is well written, and I recommend that it be accepted subject to
+  the suggestions below.}
+\reply{Thank you. We are providing a point-by-point reply below.}
+
+\subsubsection*{More big picture, less details}
+
+\pointRaised{Comment 2}{Overall, I think the paper provides too much detail on
+  relatively unimportant topics and not enough on the reasoning behind
+  important design decisions. I think you could comfortably reduce the paper
+  by 5-10 pages, referring the interested reader to the documentation for
+  more detail.}
+\reply{The paper is now six pages shorter at just 23 pages.
+  Sections 3 - 8 (all but Section 1 (``Introduction''), Section 2 (``Protocol Buffers''),
+  and Section 9 (``Conclusion'') have been thoroughly rewritten to address the specific and
+  general feedback in these reviews.}
+
+\pointRaised{Comment 3}{I'd recommend shrinking section 3 to ~2 pages, and removing the
+  subheadings. This section should quickly orient the reader to the
+  RProtobuf API so they understand the big picture before learning more
+  details in the subsequent sections. I'd recommend picking one OO style
+  and sticking to it in this section - two is confusing.}
+\reply{We followed this recommendation, reduced section 3 to about
+  $2\frac{1}{2}$ pages, removed the subheadings and tightened the exposition.}
+
+\pointRaised{Comment 3}{Section 4 dives into the details without giving a good overview and
+  motivation. Why use S4 and not RC? How are the objects made mutable?
+  Why do you provide both generic function and message passing OO
+  styles? What does \$ do in this context? What the heck is a
+  pseudo-method? Spend more time on those big issues rather than
+  describing each class in detail. Reduce class descriptions to a
+  bulleted list giving a high-level overview, then encourage the reader
+  to refer to the documentation for further details. Similarly, Tables
+  3-5 belong in the documentation, not in a vignette/paper.}
+\reply{Done. RProtoBuf was designed and implemented before RC were
+  available, and this is now noted explicitly in a new footnote.  Explanation of how
+  they are made mutable has been added.  Better explanation of the
+  two styles and '\$' as been added.  We are no longer using the
+  confusing term 'pseudo-method' anywhere.  We also moved Tables 3-5 into the
+  documentation and out of the paper, as suggested.}
+
+\pointRaised{Comment 4}{Section 7 is weak. I think the important message is that RProtobuf is
+  being used in practice at large scale for for large data, and is
+  useful for communicating between R and Python. How can you make that
+  message stronger while avoiding (for the purposes of this paper) the
+  relatively unimportant details of the map-reduce setup?}
+\reply{Done.  Rewritten with more motivation taking into account this feedback.}
+
+\subsubsection*{R to/from Protobuf translation}
+
+\pointRaised{Comment 5}{The discussion of R to/from Protobuf could be improved. Table 9 would be
+  much simpler if instead of Message, you provided a "vectorised"
+  Messages class (this would also make the interface more consistent and
+  hence the package easier to use).}
+\reply{This is a good observation that only became clear to us after
+  significant usage of \texttt{RProtoBuf}.  Providing a full ``vectorized'' Messages class would require slicing
+  operators that let you quickly extract a given field from each
+  element of the message vector in order to be really useful.  This
+  would require significant amounts of C++ code for efficient
+  manipulation on the order of data.table or other similar large C++ R
+  packages on CRAN.  There is another package called Motobuf by other authors
+  that takes this approach but in practice (at least for the several hundred
+  users at Google), the ease-of-use provided by the simple Message interface of RProtoBuf
+  has won with users.  It is still future work to keep the simple
+  interactive interface of RProtoBuf with the vectorized efficiency of
+  Motobuf.  For now, users typically do their slicing of vectors like
+  this through a distributed database (NewSQL is the term of the day?)
+  like Dremel or other system and then just get the response Protocol
+  Buffers in return to the request.}
+
+\pointRaised{Comment 6}{Along these lines, I think it would make sense to combine sections 5
+  and 6 and discuss translation challenges in both direction
+  simultaneously. At the minimum, add the equivalent for Table 9 that
+  shows how important R classes are converted to their protobuf
+  equivalents.}
+\reply{Done. We have updated these sections to make it clearer that the main
+  distinction is between schema-based datastructures (Section 5) and
+  schema-less use where a catch-all \texttt{.proto} is used (Section 6).
+  Neither section is meant to focus on only a single direction of the
+  conversion, but how conversion works when you have a schema or not.
+  How important R classes are converted to their protobuf equivalents
+  isn't super useful as a C++, Java, or Python program is unlikely to
+  want to read in an R data.frame exactly as it is defined.  Much more
+  likely is an application-specific message format is defined between the
+  two services, such as the HistogramTools example in the next section.
+  Much more detail has been added to an interesting part of section 6 --
+  which datasets exactly are better served with RProtoBuf than
+  \texttt{base::serialize} and why?}
+
+\pointRaised{Comment 7}{You should discuss how missing values are handled for strings and
+  integers, and why enums are not equivalent to factors. I think you
+  could make explicit how coercion of factors, dates, times and matrices
+  occurs, and the implications of this on sharing data structures
+  between programming languages. For example, how do you share date/time
+  data between R and python using RProtoBuf?}
+\reply{All of these details are application-specific, whereas
+  RProtoBuf is an infrastructure package.  Distributed systems define
+  their own interfaces, with their own date/time fields, usually as
+  a double of fractional seconds since the unix epoch for the systems I
+  have worked on.  An example is given for Histograms in the next
+  section.  Factors could be represented as repeated enums in Protocol
+  Buffers, certainly, if that is how one wanted to define a schema.}
+
+\pointRaised{Comment 8}{Table 10 is dying to be a plot, and a natural companion would be to
+  show how long it takes to serialise data frames using both RProtoBuf
+  and R's native serialisation. Is there a performance penalty to using
+  protobufs?}
+\reply{Done. Table 10 has been replaced with a plot, the outliers are
+  labeled, and the text now includes some interesting explanation
+  about the outliers.  Page 4 explains that the R implementation of
+  Protocol Buffers uses reflection to make operations slower but makes
+  it more convenient for interactive data analysis.  None of the
+  built-in datasets are large enough for performance to really come up
+  as an issue, and for any serialization method examples could be
+  found that significantly favor one over another in runtime, so we
+  don't think there will be benefit to adding anything here.  }
+
+\subsubsection*{RObjectTables magic}
+
+\pointRaised{Comment 9}{The use of RObjectTables magic makes me uneasy. It doesn't seem like a
+  good fit for an infrastructure package and it's not clear what
+  advantages it has over explicitly loading a protobuf definition into
+  an object.}
+\reply{Done. More information about the advantages and disadvantages of this
+  approach have been added.}
+
+\pointRaised{Comment 10}{Using global state makes understanding code much harder. In Table 1,
+  it's not obvious where \texttt{tutorial.Person} comes from. Is it loaded by
+  default by RProtobuf? This need some explanation. In Section 7, what
+  does \texttt{readProtoFiles()} do? Why does \texttt{RProtobuf} need to be attached
+  as well as \texttt{HistogramTools}? This needs more explanation, and a
+  comment on the implications of this approach on CRAN packages and
+  namespaces.}
+\reply{Done. We followed this recommendation and added explanation for how
+  \texttt{tutorial.Person} is loaded, specifically : \emph{A small number of message types are imported when the
+    package is first loaded, including the tutorial.Person type we saw in
+    the last section.}  Thank you also for spotting the superfluous attach
+  of \texttt{RProtoBuf}, it has been removed from the example.}
+
+\pointRaised{Comment 11}{
+  I'd prefer you eliminate this magic from the magic, but failing that,
+  you need a good explanation of why.}
+\reply{Done. We've added more explanation about this.}
+
+\subsubsection*{Code comments}
+
+\pointRaised{Comment 12}{Using \texttt{file.create()} to determine the absolute path seems like a bad idea.}
+\reply{Done. We followed this recommendation and removed two instances of
+  \texttt{file.create()} for this purpose with calls to
+  \texttt{normalizePath} with \texttt{mustWork=FALSE}.}
+
+\subsubsection*{Minor niggles}
+
+\pointRaised{Comment 13}{Don't refer to the message passing style of OO as traditional.}
+\reply{Done. We don't refer to this style as traditional anywhere in
+  the manuscript anymore.}
+
+\pointRaised{Comment 14}{In Section 3.4, if messages isn't a vectorised class, the default
+   print method should use \texttt{cat()} to eliminate the confusing \texttt{[1]}.}
+\reply{Done, thanks.}
+
+\pointRaised{Comment 15}{The REXP definition would have been better defined using an enum that
+   matches R's SEXPTYPE "enum". But I guess that ship has sailed.}
+\reply{Acknowledged.  We chose to maintain compatibility with RHIPE here.  The main
+use of RProtoBuf is not with \texttt{rexp.proto} however -- it with
+application-specific schemas in \texttt{.proto} files for sending data between
+applications.  Users that want to do something very R-specific are
+welcome to use their own \texttt{.proto} files with an enum to represent R SEXPTYPEs.}
+
+\pointRaised{Comment 16}{Why does \texttt{serialize\_pb(CO2, NULL)} fail silently? Shouldn't it at least
+   warn that the serialization is partial?}
+\reply{Done. We fixed this and \texttt{serialize\_pb} now works for all built-in datatypes in R
+  and no longer fails silently if it encounters something it can't serialize.}
+
+\section*{Response to Reviewer \#2}
+
+\pointRaised{Comment 1}{The paper gives an overview of the RProtoBuf package which implements an 
+  R interface to the Protocol Buffers library for an efficient 
+  serialization of objects. The paper is well written and easy to read. 
+  Introductory code is clear and the package provides objects to play with 
+  immediately without the need to jump through hoops, making it appealing. 
+  The software implementation is executed well.}
+\reply{Thank you.}
+
+\pointRaised{Comment 2}{There are, however, a few inconsistencies in the implementation and some 
+  issues with specific sections in the paper. In the following both issues 
+  will be addressed sequentially by their occurrence in the paper.}
+\reply{Done. These and others have been identified and addressed.  Thank you
+  for taking the time to enumerate these issues.}
+
+\pointRaised{Comment 3}{p.4 illustrates the use of messages. The class implements list-like 
+  access via \texttt{[[} and \$, but strangely \texttt{names()} return NULL and \texttt{length() }
+  doesn't correspond to the number of fields leading to startling results like
+the following:}
+
+\begin{verbatim}
+ > p
+[1] "message of type 'tutorial.Person' with 2 fields set"
+ > length(p)
+[1] 2
+ > p[[3]]
+[1] ""
+\end{verbatim}
+\reply{Done. We have corrected the list-like accessor, fixed \texttt{length()} to
+  correspond to the number of set fields, and added \texttt{names()}:}
+\begin{verbatim}
+> p
+message of type 'tutorial.Person' with 0 fields set
+> length(p)
+[1] 0
+> p[[3]]
+[1] ""
+> p$id <- 1
+> length(p)
+[1] 1
+> names(p)
+[1] "name"  "id"    "email" "phone"
+\end{verbatim}
+
+\pointRaised{Comment 3 cont.}{The inconsistencies get even more bizarre with descriptors (p.9):}
+
+\begin{verbatim}
+ > tutorial.Person$email
+[1] "descriptor for field 'email' of type 'tutorial.Person' "
+ > tutorial.Person[["email"]]
+Error in tutorial.Person[["email"]] : this S4 class is not subsettable
+ > names(tutorial.Person)
+NULL
+ > length(tutorial.Person)
+[1] 1
+\end{verbatim}
+\reply{Done. We agree, and have addressed this inconsistency.  Thank you for
+  catching this.}
+\begin{verbatim}
+> tutorial.Person$email
+descriptor for field 'email' of type 'tutorial.Person' 
+> tutorial.Person[["email"]]
+descriptor for field 'email' of type 'tutorial.Person' 
+> names(tutorial.Person)
+[1] "name"        "id"          "email"       "phone"       "PhoneNumber"
+[6] "PhoneType"  
+> length(tutorial.Person)
+[1] 6
+\end{verbatim}
+
+\pointRaised{Comment 3 cont.}{It appears that there is no way to find out the fields of a descriptor 
+  directly (although the low-level object methods seem to be exposed as 
+  \texttt{\$field\_count()} and \texttt{\$fields()} - but that seems extremely cumbersome). 
+  Again, implementing names() and subsetting may help here.}
+\reply{Done. We have implemented \texttt{names} and subsetting.  Thank you for the
+  suggestion.}
+\begin{verbatim}
+> tutorial.Person[[1]]
+descriptor for field 'name' of type 'tutorial.Person' 
+> tutorial.Person[[2]]
+descriptor for field 'id' of type 'tutorial.Person' 
+\end{verbatim}
+
+\pointRaised{Comment 4}{Another inconsistency concerns the \texttt{as.list()} method which by design 
+  coerces objects to lists (see \texttt{?as.list}), but the implementation for 
+  EnumDescriptor breaks that contract and returns a vector instead:}
+
+\begin{verbatim}
+ > is.list(as.list(tutorial.Person$PhoneType))
+[1] FALSE
+ > str(as.list(tutorial.Person$PhoneType))
+  Named int [1:3] 0 1 2
+  - attr(*, "names")= chr [1:3] "MOBILE" "HOME" "WORK"
+\end{verbatim}
+
+\reply{Done, thank you. New output below:}
+\begin{verbatim}
+> is.list(as.list(tutorial.Person$PhoneType))
+[1] TRUE
+> str(as.list(tutorial.Person$PhoneType))
+List of 3
+ $ MOBILE: int 0
+ $ HOME  : int 1
+ $ WORK  : int 2
+\end{verbatim}
+
+\pointRaised{Comment 4 cont}{As with the other interfaces, names() returns NULL so it is again quite
+  difficult to perform even simple operations such as finding out the 
+  values. It may be natural use some of the standard methods like names(), 
+  levels() or similar. As with the previous cases, the lack of [[ support
+  makes it impossible to map named enum values to codes and vice-versa.}
+\reply{Done, thank you.  New output:}
+\begin{verbatim}
+> names(tutorial.Person$PhoneType)
+[1] "MOBILE" "HOME"   "WORK"  
+> tutorial.Person$PhoneType[["HOME"]]
+[1] 1
+\end{verbatim}
+
+\pointRaised{Comment 5}{In general, the package would benefit from one pass of checks to assess
+  the consistency of the API. Since the authors intend direct interaction
+  with the objects via basic standard R methods, the classes should behave 
+  consistently.}
+\reply{We made several passes, correcting issues as documented in the
+  \texttt{ChangeLog} and now present in our latest 0.4.2 release on CRAN.}
+
+\pointRaised{Comment 6}{Finally, most classes implement coercion to characters, which is not 
+  mentioned and is not quite intuitive for some objects. For example, one
+  may think that \texttt{as.character()} on a file descriptor returns let's say the 
+  filename, but we get:}
+
+\begin{verbatim}
+ > cat(as.character(tutorial.Person$fileDescriptor()))
+syntax = "proto2";
+
+package tutorial;
+
+option java_package = "com.example.tutorial";
+option java_outer_classname = "AddressBookProtos";
+[...]
+\end{verbatim}
+\reply{The behavior is documented in the package documentation but
+  seemed like a minor detail not important for an already-long paper.
+  In choosing the debug output for a file descriptor we agree
+  that \texttt{filename} is a reasonable thing to expect, but we also
+  think that the contents of the \texttt{.proto} file is also
+  reasonable, but more useful.  We document this in the help for
+  ``FileDescriptor-class'', the vignette, and other sources.
+  \texttt{@filename} is one of the slots of the FileDescriptor class
+  and so very easy to find.  The contents of the \texttt{.proto} are
+  not as easily accessible in a slot, however, and so we find it much
+  more useful to be output with \texttt{as.character()}.}
+
+\pointRaised{Comment 7}{It is not necessary clear what java\_package has to do with a file 
+  descriptor in R. Depending on the intention here, it may be useful to 
+  explain this feature.
+}
+\reply{Done. This snippet has been removed as part of the general move of
+  less relevant details to the package documentation, but for
+  reference the \texttt{.proto} file syntax is defined in the Protocol Buffers
+  language guide which is referenced earlier. It is a cross platform
+  library and so this syntax specifies some parameters when Java code
+  is used to access the structures defined in this file.  No such
+  special syntax is required in the \texttt{.proto} files for R
+  language code and so this line about java\_package was not relevant
+  or needed in any way for RProtoBuf and is documented elsewhere.}
+
+\subsubsection*{Other comments:}
+
+\pointRaised{Comment 8}{p.17: "does not support ... function, language or environment. Such 
+  objects have no native equivalent type in Protocol Buffers, and have 
+  little meaning outside the context or R"
+  That is certainly false. Native mirror of environments are hash tables - 
+  a very useful type indeed. Language objects are just lists, so there is
+  no reason to not include them - they can be useful to store expressions
+  that may not be necessary specific to R. Further on p. 18 your run into
+  the same problem that could be fixed so easily.}
+\reply{Acknowledged.  Environments are more than just hash
+  tables because they include other configuration parameters that are
+  necessary to serialize as well to make sure
+  serialization/unserialization is indempotent, but we agree it is
+  cleaner and the package and the exposition in the paper to just make
+  sure we serialize everything.  We can now fall back to
+  \texttt{base::serialize()} and storing the bits in a rawString type of
+  RProtoBuf to make the R schema-less serialization more complete.}
+
+\pointRaised{Comment 9}{The examples in sections 7 and 8 are somewhat weak. It does not seem 
+  clear why one would wish to unleash the power of PB just to transfer 
+  breaks and counts for plotting - even a simple ASCII file would do that
+  just fine. The main point in the example is presumably that there are 
+  code generation methods for Hadoop based on PB IDL such that Hadoop can
+  be made aware of the data types, thus making a histogram a proper record 
+  that won't be split, can be combined etc. -- yet that is not mentioned 
+  nor a way presented how that can be leveraged in practice. The Python 
+  example code simply uses a static example with constants to simulate the 
+  output of a reducer so it doesn't illustrate the point - the reader is 
+  left confused why something as trivial would require PB while a savvy 
+  reader is not able to replicate the illustrated process. Possibly 
+  explaining the benefits and providing more details on how one would 
+  write such a job would make it much more relevant.}
+\reply{Yes, we added more detail about the advantages of using a
+  proper data type for the histograms in this example that you mentioned here -- the
+  ability to write combiners, prevent arbitrary splitting of the
+  records, etc that can greatly improve performance.  We agree with
+  the other reviewer that we don't want to get bogged down in details
+  about a particular MapReduce implementation (such as Hadoop) and so
+  now we specifically mention that goal here.
+  I think we make a better connection now between the
+  abstract MapReduce example given, and then the simpler Python
+  example code with a static example.}
+
+\pointRaised{Comment 10}{Section 8 is not very well motivated. It is much easier to use other 
+  formats for HTTP exchange - JSON is probably the most popular, but even
+  CSV works in simple settings. PB is a much less common standard. The 
+  main advantage of PB is the performance over the alternatives, but HTTP
+  services are not necessarily known for their high-throughput so why one
+  would sacrifice interoperability by using PB (they are still more hassle 
+  and require special installations)? It would be useful if the reason 
+  could be made explicit here or a better example chosen.}
+\reply{Done. This section has been reworded to make it shorter and more
+  crisp, with fewer extraneous details about OpenCPU. Protocol
+  Buffers is an efficient protocol used between distributed systems at
+  many of the world's largest internet companies (Twitter, Sony,
+  Google, etc.) but the design and implementation of a large
+  enterprise-scale distributed system with a complex RPC system and
+  serialization needs is well beyond the scope of what we can add to a
+  paper about RProtoBuf.  We chose this example because it is a much
+  more accessible example that any reader can use to easily
+  send/receive RPCs and parse the results with RProtoBuf.}
+
+\end{document}
+
+%%% Local Variables: 
+%%% mode: latex
+%%% TeX-master: t
+%%% End: 

Deleted: papers/jss/article-submitted-2014-03.pdf
===================================================================
(Binary files differ)

Deleted: papers/jss/response-to-reviewers.tex
===================================================================
--- papers/jss/response-to-reviewers.tex	2015-04-13 19:51:21 UTC (rev 945)
+++ papers/jss/response-to-reviewers.tex	2015-04-13 19:53:43 UTC (rev 946)
@@ -1,455 +0,0 @@
-
-\documentclass[10pt]{article}
-\usepackage{url}
-\usepackage{vmargin}
-\setpapersize{USletter}
-% left top right bottom -- headheight headsep footheight footskop
-\setmarginsrb{1in}{1in}{1in}{0.5in}{0pt}{0mm}{10pt}{0.5in}
-\usepackage{charter}
-
-\setlength{\parskip}{1ex plus1ex minus1ex}
-\setlength{\parindent}{0pt}
-
-\newcommand{\proglang}[1]{\textsf{#1}}
-\newcommand{\pkg}[1]{{\fontseries{b}\selectfont #1}}
-
-\newcommand{\pointRaised}[2]{\smallskip %\hrule 
-  \textsl{{\fontseries{b}\selectfont #1}: #2}\newline}
-\newcommand{\simplePointRaised}[1]{\bigskip \hrule\textsl{#1} }
-\newcommand{\reply}[1]{\textbf{Reply}:\ #1 \smallskip } %\hrule \smallskip}
-
-\begin{document}
-
-\author{Dirk Eddelbuettel\\Debian Project \and 
-        Murray Stokely\\Google, Inc \and
-        Jeroen Ooms\\UCLA}
-\title{Submission JSS 1313: \\ Response to Reviewers' Comments}
-\maketitle 
-\thispagestyle{empty}
-
-Thank you for reviewing our manuscript, and for giving us an opportunity to
-rewrite, extend and and tighten both the paper and the underlying package.
-
-\smallskip
-We truly appreciate the comments and suggestions. Below, we have regrouped the sets
-of comments, and have provided detailed point-by-point replies.
-%
-We hope that this satisfies the request for changes necessary to proceed with
-the publication of the revised and updated manuscript, along with the revised
-and updated package (which was recently resubmitted to CRAN as version 0.4.2).
-
-\section*{Response to Reviewer \#1}
-
-\pointRaised{Comment 1}{Overall, I think this is a strong paper. Cross-language communication
-  is a challenging problem, and good solutions for R are important to
-  establish R as a well-behaved member of a data analysis pipeline. The
-  paper is well written, and I recommend that it be accepted subject to
-  the suggestions below.}
-\reply{Thank you. We are providing a point-by-point reply below.}
-
-\subsubsection*{More big picture, less details}
-
-\pointRaised{Comment 2}{Overall, I think the paper provides too much detail on
-  relatively unimportant topics and not enough on the reasoning behind
-  important design decisions. I think you could comfortably reduce the paper
-  by 5-10 pages, referring the interested reader to the documentation for
-  more detail.}
-\reply{The paper is now six pages shorter at just 23 pages.
-  Sections 3 - 8 (all but Section 1 (``Introduction''), Section 2 (``Protocol Buffers''),
-  and Section 9 (``Conclusion'') have been thoroughly rewritten to address the specific and
-  general feedback in these reviews.}
-
-\pointRaised{Comment 3}{I'd recommend shrinking section 3 to ~2 pages, and removing the
-  subheadings. This section should quickly orient the reader to the
-  RProtobuf API so they understand the big picture before learning more
-  details in the subsequent sections. I'd recommend picking one OO style
-  and sticking to it in this section - two is confusing.}
-\reply{We followed this recommendation, reduced section 3 to about
-  $2\frac{1}{2}$ pages, removed the subheadings and tightened the exposition.}
-
-\pointRaised{Comment 3}{Section 4 dives into the details without giving a good overview and
-  motivation. Why use S4 and not RC? How are the objects made mutable?
-  Why do you provide both generic function and message passing OO
-  styles? What does \$ do in this context? What the heck is a
-  pseudo-method? Spend more time on those big issues rather than
-  describing each class in detail. Reduce class descriptions to a
-  bulleted list giving a high-level overview, then encourage the reader
-  to refer to the documentation for further details. Similarly, Tables
-  3-5 belong in the documentation, not in a vignette/paper.}
-\reply{Done. RProtoBuf was designed and implemented before RC were
-  available, and this is now noted explicitly in a new footnote.  Explanation of how
-  they are made mutable has been added.  Better explanation of the
-  two styles and '\$' as been added.  We are no longer using the
-  confusing term 'pseudo-method' anywhere.  We also moved Tables 3-5 into the
-  documentation and out of the paper, as suggested.}
-
-\pointRaised{Comment 4}{Section 7 is weak. I think the important message is that RProtobuf is
-  being used in practice at large scale for for large data, and is
-  useful for communicating between R and Python. How can you make that
-  message stronger while avoiding (for the purposes of this paper) the
-  relatively unimportant details of the map-reduce setup?}
-\reply{Done.  Rewritten with more motivation taking into account this feedback.}
-
-\subsubsection*{R to/from Protobuf translation}
-
-\pointRaised{Comment 5}{The discussion of R to/from Protobuf could be improved. Table 9 would be
-  much simpler if instead of Message, you provided a "vectorised"
-  Messages class (this would also make the interface more consistent and
-  hence the package easier to use).}
-\reply{This is a good observation that only became clear to us after
-  significant usage of \texttt{RProtoBuf}.  Providing a full ``vectorized'' Messages class would require slicing
-  operators that let you quickly extract a given field from each
-  element of the message vector in order to be really useful.  This
-  would require significant amounts of C++ code for efficient
-  manipulation on the order of data.table or other similar large C++ R
-  packages on CRAN.  There is another package called Motobuf by other authors
-  that takes this approach but in practice (at least for the several hundred
-  users at Google), the ease-of-use provided by the simple Message interface of RProtoBuf
-  has won with users.  It is still future work to keep the simple
-  interactive interface of RProtoBuf with the vectorized efficiency of
-  Motobuf.  For now, users typically do their slicing of vectors like
-  this through a distributed database (NewSQL is the term of the day?)
-  like Dremel or other system and then just get the response Protocol
-  Buffers in return to the request.}
-
-\pointRaised{Comment 6}{Along these lines, I think it would make sense to combine sections 5
-  and 6 and discuss translation challenges in both direction
-  simultaneously. At the minimum, add the equivalent for Table 9 that
-  shows how important R classes are converted to their protobuf
-  equivalents.}
-\reply{Done. We have updated these sections to make it clearer that the main
-  distinction is between schema-based datastructures (Section 5) and
-  schema-less use where a catch-all \texttt{.proto} is used (Section 6).
-  Neither section is meant to focus on only a single direction of the
-  conversion, but how conversion works when you have a schema or not.
-  How important R classes are converted to their protobuf equivalents
-  isn't super useful as a C++, Java, or Python program is unlikely to
-  want to read in an R data.frame exactly as it is defined.  Much more
-  likely is an application-specific message format is defined between the
-  two services, such as the HistogramTools example in the next section.
-  Much more detail has been added to an interesting part of section 6 --
-  which datasets exactly are better served with RProtoBuf than
-  \texttt{base::serialize} and why?}
-
-\pointRaised{Comment 7}{You should discuss how missing values are handled for strings and
-  integers, and why enums are not equivalent to factors. I think you
-  could make explicit how coercion of factors, dates, times and matrices
-  occurs, and the implications of this on sharing data structures
-  between programming languages. For example, how do you share date/time
-  data between R and python using RProtoBuf?}
-\reply{All of these details are application-specific, whereas
-  RProtoBuf is an infrastructure package.  Distributed systems define
-  their own interfaces, with their own date/time fields, usually as
-  a double of fractional seconds since the unix epoch for the systems I
-  have worked on.  An example is given for Histograms in the next
-  section.  Factors could be represented as repeated enums in Protocol
-  Buffers, certainly, if that is how one wanted to define a schema.}
-
-\pointRaised{Comment 8}{Table 10 is dying to be a plot, and a natural companion would be to
-  show how long it takes to serialise data frames using both RProtoBuf
-  and R's native serialisation. Is there a performance penalty to using
-  protobufs?}
-\reply{Done. Table 10 has been replaced with a plot, the outliers are
-  labeled, and the text now includes some interesting explanation
-  about the outliers.  Page 4 explains that the R implementation of
-  Protocol Buffers uses reflection to make operations slower but makes
-  it more convenient for interactive data analysis.  None of the
-  built-in datasets are large enough for performance to really come up
-  as an issue, and for any serialization method examples could be
-  found that significantly favor one over another in runtime, so we
-  don't think there will be benefit to adding anything here.  }
-
-\subsubsection*{RObjectTables magic}
-
[TRUNCATED]

To get the complete diff run:
    svnlook diff /svnroot/rprotobuf -r 946


More information about the Rprotobuf-commits mailing list