[Rprotobuf-commits] r845 - papers/jss
noreply at r-forge.r-project.org
Thu Jan 23 15:27:20 CET 2014
Author: edd
Date: 2014-01-23 15:27:19 +0100 (Thu, 23 Jan 2014)
New Revision: 845
Modified:
papers/jss/article.Rnw
Log:
a few micro-edits in Section 1
plural s in two places, hyphenating, minor rewording
Modified: papers/jss/article.Rnw
===================================================================
--- papers/jss/article.Rnw 2014-01-23 05:55:47 UTC (rev 844)
+++ papers/jss/article.Rnw 2014-01-23 14:27:19 UTC (rev 845)
@@ -42,7 +42,7 @@
verbose, inefficient, not type-safe, or tied to a specific programming language.
Protocol Buffers are a popular
method of serializing structured data between applications---while remaining
-independent of programming languages or operating system.
+independent of programming languages or operating systems.
They offer a unique combination of features, performance, and maturity that seems
particularly well suited for data-driven applications and numerical
computing.
@@ -53,7 +53,7 @@
This paper outlines the general class of data serialization
requirements for statistical computing, describes the implementation
of the \CRANpkg{RProtoBuf} package, and illustrates its use with
-examples applications in large-scale data collection pipelines and web
+example applications in large-scale data collection pipelines and web
services.
%TODO(ms) keep it less than 150 words. -- I think this may be 154,
%depending how emacs is counting.
@@ -146,11 +146,11 @@
These pipelines are frequently built using different programming
languages for the different phases of data analysis -- collection,
cleaning, modeling, analysis, post-processing, and
-presentation in order to take advantage of the unique combination of
+presentation -- in order to take advantage of the unique combination of
performance, speed of development, and library support offered by
different environments and languages. Each stage of such a data
analysis pipeline may produce intermediate results that need to be
-stored in a file or sent over the network for further processing.
+stored in a file, or sent over the network for further processing.
% JO Perhaps also mention that serialization is needed for distributed
% systems to make systems scale up?
@@ -173,7 +173,7 @@
environment.
%\paragraph*{Friends don't let friends use CSV!}
-Data analysts and researchers often use character separated text formats such
+Data analysts and researchers often use character-separated text formats such
as \texttt{CSV} \citep{shafranovich2005common} to export and import
data. However, anyone who has ever used \texttt{CSV} files will have noticed
that this method has many limitations: it is restricted to tabular data,
@@ -185,18 +185,18 @@
about any arbitrarily complex schema \citep{nolan2013xml}. However, it pays
for this complexity with comparatively large and verbose messages, and added
complexity at the parsing side (which are somewhat mitigated by the
-availability of mature libraries and parsers). Because \texttt{XML} is text
-based and has no native notion of numeric types or arrays, it usually not a
+availability of mature libraries and parsers). Because \texttt{XML} is
+text-based and has no native notion of numeric types or arrays, it is usually not a
very practical format to store numeric datasets as they appear in statistical
applications.
%
-A more modern, widely used format is \emph{JavaScript ObjectNotation}
+A more modern format is \emph{JavaScript Object Notation}
(\texttt{JSON}), which is derived from the object literals of
-\proglang{JavaScript}, and used increasingly on the world wide web.
+\proglang{JavaScript}, and already widely used on the world wide web.
Several \proglang{R} packages implement functions to parse and generate
\texttt{JSON} data from \proglang{R} objects \citep{rjson,RJSONIO,jsonlite}.
-\texttt{JSON} natively supports arrays and 4 primitive types: numbers, strings,
+\texttt{JSON} natively supports arrays and four primitive types: numbers, strings,
booleans, and null. However, as it too is a text-based format, numbers are
stored as human-readable decimal notation which is inefficient and
leads to loss of type (double versus integer) and precision.
@@ -222,7 +222,7 @@
Protocol Buffers \citep{protobuf}, Apache Thrift, and Apache Avro
provide a compact well-documented schema for cross-language data
structures and efficient binary interchange formats. Since the schema
-is provided separately from the encoded data, the data can be
+is provided separately from the data, the data can be
efficiently encoded to minimize storage costs when
compared with simple ``schema-less'' binary interchange formats.
Many sources compare data serialization formats
@@ -270,9 +270,8 @@
Section~\ref{sec:rprotobuf-basic} describes the interactive \proglang{R} interface
provided by the \CRANpkg{RProtoBuf} package, and introduces the two main abstractions:
\emph{Messages} and \emph{Descriptors}. Section~\ref{sec:rprotobuf-classes}
-details the implementation details of the main S4 classes and methods
-contained in this
-package. Section~\ref{sec:types} describes the challenges of type coercion
+details the implementation of the main S4 classes and methods.
+Section~\ref{sec:types} describes the challenges of type coercion
between \proglang{R} and other languages. Section~\ref{sec:evaluation} introduces a
general \proglang{R} language schema for serializing arbitrary \proglang{R} objects and evaluates
it against the serialization capabilities built directly into \proglang{R}. Sections~\ref{sec:mapreduce}
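
For context, the Message and Descriptor abstractions referenced above map onto
a handful of interactive calls in R. The following is only a minimal sketch,
assuming the tutorial.Person message from the addressbook.proto example that
ships with RProtoBuf (other schemas can be loaded with readProtoFiles()):

    library(RProtoBuf)

    # build a message from the tutorial.Person descriptor
    p <- new(tutorial.Person, name = "Alice", id = 123,
             email = "alice@example.com")

    bytes <- serialize(p, NULL)        # compact binary wire format (raw vector)
    q <- read(tutorial.Person, bytes)  # parse the bytes back into a message
    q$name                             # "Alice"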