[Rprotobuf-commits] r789 - papers/jss

noreply at r-forge.r-project.org
Wed Jan 15 05:30:14 CET 2014


Author: edd
Date: 2014-01-15 05:30:13 +0100 (Wed, 15 Jan 2014)
New Revision: 789

Modified:
   papers/jss/article.Rnw
Log:
some tweaks to intro


Modified: papers/jss/article.Rnw
===================================================================
--- papers/jss/article.Rnw	2014-01-15 02:30:53 UTC (rev 788)
+++ papers/jss/article.Rnw	2014-01-15 04:30:13 UTC (rev 789)
@@ -113,7 +113,7 @@
 
 %TODO(de) 'protocol buffers' or 'Protocol Buffers' ?
 
-\section{Introduction: Friends don't let friends use CSV}
+\section{Introduction} % TODO(DE) More sober: Friends don't let friends use CSV}
 
 Modern data collection and analysis pipelines involve collections
 of decoupled components in order to manage and control complexity 
@@ -140,59 +140,59 @@
 are usually language specific and thereby lock the user into a single
 environment.  
 
-Traditionally, scientists and statisticians often use character seperated
-text formats such as \texttt{CSV} \citep{shafranovich2005common} to 
-export and import data. However, anyone who has ever used \texttt{CSV} will have
-noticed that this method has many limitations: it is restricted to tabular 
-data, lacks type-safety, and has limited precision for numeric values.
-Moreover, ambiguities in the format itself frequently cause problems. 
-For example, conventions on which characters used as seperator and decimal
-point vary by country.
-\emph{Extensible Markup Language} (\texttt{XML}) is another
-well-established and widely-supported format with the ability to define
-just about any arbitrarily complex schema \citep{nolan2013xml}. However,
-it pays for this complexity with comparatively large and verbose messages,
-and added complexitiy at the parsing side (which are somewhat mitigated
-by the availability of mature libraries and parsers). Because \texttt{XML}
-is text based and has no native notion of numeric types or arrays, it 
-usually not a very practical format to store numeric datasets as they appear
-in statistical applications.
-A more modern, widely used format is \emph{JavaScript Object Notation}
-(\texttt{JSON}), which is derived from the object literals of
-\proglang{JavaScript}, and used mostly on the web. \texttt{JSON} natively
-supports arrays and distinguishes 4 primitive types: numbers, strings, 
-booleans and null. However, because it is a text-based format, numbers are
+%\paragraph*{Friends don't let friends use CSV!}
+Data analysts and researchers often use character-separated text formats such
+as \texttt{CSV} \citep{shafranovich2005common} to export and import
+data. However, anyone who has ever used \texttt{CSV} files will have noticed
+that this method has many limitations: it is restricted to tabular data,
+lacks type-safety, and has limited precision for numeric values.  Moreover,
+ambiguities in the format itself frequently cause problems.  For example,
+conventions on which characters are used as separator and decimal point vary by
+country.  \emph{Extensible Markup Language} (\texttt{XML}) is another
+well-established and widely-supported format with the ability to define just
+about any arbitrarily complex schema \citep{nolan2013xml}. However, it pays
+for this complexity with comparatively large and verbose messages, and added
+complexity on the parsing side (which is somewhat mitigated by the
+availability of mature libraries and parsers). Because \texttt{XML} is text
+based and has no native notion of numeric types or arrays, it is usually not a
+very practical format to store numeric datasets as they appear in statistical
+applications.  A more modern, widely used format is \emph{JavaScript Object
+  Notation} (\texttt{JSON}), which is derived from the object literals of
+\proglang{JavaScript}, and is used increasingly on the web. \texttt{JSON} natively
+supports arrays and distinguishes four primitive types: numbers, strings,
+booleans and null. However, as it too is a text-based format, numbers are
 stored as human-readable decimal notation which is somewhat inefficient and
-leads to loss of type (double vs integer) and precision. Several R packages
-implement functions to parse and generate \texttt{JSON} data from R objects. 
-A number of \texttt{JSON} dialects has been proposed, such as \texttt{BSON} and 
-\texttt{MessagePack} which both add binary support. However, these derivatives
-are not compatible with existing JSON software, and have not been widely adopted.
+leads to loss of type (double versus integer) and precision. Several R packages
+implement functions to parse and generate \texttt{JSON} data from R objects.
+A number of \texttt{JSON} variants have been proposed, such as \texttt{BSON}
+and \texttt{MessagePack}, which both add binary support. However, these
+derivatives are not compatible with existing \texttt{JSON} software, and have not seen
+wide adoption.
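+To make the precision issue concrete (a minimal sketch, assuming the
+\CRANpkg{jsonlite} package and its default of four significant digits
+when encoding doubles), a simple round trip through \texttt{JSON} need
+not return the original value:
+<<eval=FALSE>>=
+library(jsonlite)
+# Round trip pi through JSON text; under jsonlite's default settings the
+# decimal representation keeps only 4 significant digits, so this is FALSE.
+identical(pi, fromJSON(toJSON(pi)))
+@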
  
-\subsection{Why Protocol Buffers}
-
-In 2008, Google released an open source version of Protocol Buffers: the data 
+%\paragraph*{Enter Protocol Buffers:}
+In 2008, following several years of internal use, Google released an open
+source version of Protocol Buffers. It provides a data
 interchange format that was designed and used for their internal infrastructure.
-Google officially provides high quality parsing libraries for \texttt{Java}, 
-\texttt{C++} and \texttt{Python}, and community developed open source implementations
+Google officially provides high-quality parsing libraries for \proglang{Java},
+\proglang{C++}, and \proglang{Python}, and community-developed open source implementations
 are available for many other languages. 
 Protocol Buffers take a quite different approach from many other popular formats.
 They offer a unique combination of features, performance, and maturity that seems
 particularly well suited for data-driven applications and numerical computing.
 Protocol Buffers are a binary format that natively supports all common primitive types
-found in modern programming languages. The advantage of this is that numeric values
-are serialized exactly the same way as they are stored in memory. Therefore there is
+found in modern programming languages. A key advantage is that numeric values
+are serialized exactly the same way as they are stored in memory. There is
 no loss of precision, no overhead, and parsing messages is very efficient: the system can 
 simply copy bytes to memory without any further processing. 
-But the most powerful feature of protocol buffers is that it decouples the content
+But the most powerful feature of Protocol Buffers is that they decouple the content
 from the structure using a schema, very similar to a database. This further increases
 performance by eliminating redundancy, while at the same time providing foundations
 for defining an \emph{Interface Description Language}, or \emph{IDL}.
 Many sources compare data serialization formats and show that Protocol Buffers compare
 very favorably to the alternatives; see \citet{Sumaray:2012:CDS:2184751.2184810} 
 for one such comparison.
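+The structure is defined ahead of time in a \texttt{.proto} file. As a
+brief hypothetical sketch (field names and tags chosen for illustration
+only), a message type describing a person might be declared as:
+\begin{verbatim}
+message Person {
+  required string name  = 1;  // each field carries a unique numbered tag
+  optional int32  age   = 2;  // primitive types are declared explicitly
+  repeated string email = 3;  // repeated fields hold zero or more values
+}
+\end{verbatim}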
+% TODO(DE): Mention "future proof" forward compatibility of schemata
 
-
 %  The schema can be used to
 %generate model classes for statically-typed programming languages such
 %as C++ and Java, or can be used with reflection for dynamically-typed
@@ -213,8 +213,8 @@
 % in the middle (full class/method details) and interesting
 % applications at the end.
 
-This paper describes an R interface to protocol buffers.
-The rest of the paper is organized as follows. Section~\ref{sec:protobuf}
+This paper describes an R interface to Protocol Buffers,
+and is organized as follows. Section~\ref{sec:protobuf}
 provides a general overview of Protocol Buffers.
 Section~\ref{sec:rprotobuf-basic} describes the interactive R interface
 provided by \CRANpkg{RProtoBuf} and introduces the two main abstractions:


