[Rprotobuf-commits] r927 - papers/jss

Wed Dec 3 20:43:16 CET 2014

Author: murray
Date: 2014-12-03 20:43:16 +0100 (Wed, 03 Dec 2014)
New Revision: 927

Modified:
   papers/jss/article.Rnw
Log:
Improve the plot, label three outliers instead of two, and explain them in the
text.  Correct the space savings definition (the fraction was inverted).  Rename the "trivial" example to "simple" example.

Suggestions from: Andy Chu


Modified: papers/jss/article.Rnw
===================================================================

--- papers/jss/article.Rnw	2014-12-02 03:39:46 UTC (rev 926)
+++ papers/jss/article.Rnw	2014-12-03 19:43:16 UTC (rev 927)
@@ -972,20 +972,20 @@
 clean.df<-rbind(clean.df, all.df)
 @
 
-Figure~\ref{fig:compression} shows the space savings $\left(1 - \frac{\textrm{Uncompressed Size}}{\textrm{Compressed Size}}\right)$ for each of the data sets using each of these four methods.  The associated table shows the exact data sizes for two outliers and the aggregate of all \Sexpr{n} data sets.
+Figure~\ref{fig:compression} shows the space savings $\left(1 - \frac{\textrm{Compressed Size}}{\textrm{Uncompressed Size}}\right)$ for each of the data sets using each of these four methods.  The associated table shows the exact data sizes for some outliers and the aggregate of all \Sexpr{n} data sets.
 Note that Protocol Buffer serialization results in slightly
-smaller byte streams compared to native \proglang{R} serialization in most cases,
-but this difference disappears if the results are compressed with gzip.
+smaller byte streams compared to native \proglang{R} serialization in most cases (red dots),
+but this difference disappears if the results are compressed with gzip (blue triangles).
 %Sizes are comparable but Protocol Buffers provide simple getters and setters
 %in multiple languages instead of requiring other programs to parse the \proglang{R}
 %serialization format. % \citep{serialization}.
 
 The \code{crimtab} dataset of anthropometry measurements of British
-prisoners \citep{garson1900metric}
-shows the greatest difference in the space savings when
+prisoners \citep{garson1900metric} and the \code{airquality} dataset of air quality measurements in New York show the
+greatest difference in the space savings when
 using Protocol Buffers compared to \proglang{R} native serialization.
-This dataset is a 42x22 table of integers, most equal to 0.  Small
-integer values like this can be very efficiently encoded by the
+The \code{crimtab} dataset is a 42x22 table of integers, most equal to 0, and the \code{airquality} dataset is a data frame of 153 observations of 1 numeric and 5 integer variables.  In both data sets, the large number of small
+integer values can be very efficiently encoded by the
 \emph{Varint} integer encoding scheme used by Protocol Buffers which
 use a variable number of bytes for each value.
 
@@ -1008,10 +1008,16 @@
 application-specific schema has been defined.  The example in the next
 section satisfies both of these conditions.
 
-\begin{figure}[t!]
+\begin{figure}[hbt!]
 \begin{center}
-<<echo=FALSE,fig=TRUE,width=8,height=4>>=
-plot(clean.df$savings.serialized, clean.df$savings.rprotobuf, pch=1, col="red", las=1, xlab="Serialization Space Savings", ylab="Protocol Buffer Space Savings")
+<<label=SER,echo=FALSE,include=FALSE,fig=TRUE>>=
+old.mar<-par("mar")
+new.mar<-old.mar
+new.mar[3]<-0
+new.mar[4]<-0
+my.cex<-1.3
+par("mar"=new.mar)
+plot(clean.df$savings.serialized, clean.df$savings.rprotobuf, pch=1, col="red", las=1, xlab="Serialization Space Savings", ylab="Protocol Buffer Space Savings", xlim=c(0,1),ylim=c(0,1),cex.lab=my.cex, cex.axis=my.cex)
 points(clean.df$savings.serialized.gz, clean.df$savings.rprotobuf.gz,pch=2, col="blue")
 # grey dotted diagonal
 abline(a=0,b=1, col="grey",lty=2,lwd=3)
@@ -1023,17 +1029,27 @@
 # The one to label.
 tmp.df <- clean.df[which(clean.df$savings.diff == min(clean.df$savings.diff)),]
 # This minimum means most to the left of our line, so pos=2 is label to the left
-text(tmp.df$savings.serialized, tmp.df$savings.rprotobuf, labels=tmp.df$dataset, pos=2)
-text(tmp.df$savings.serialized.gz, tmp.df$savings.rprotobuf.gz, labels=tmp.df$dataset, pos=2)
+text(tmp.df$savings.serialized, tmp.df$savings.rprotobuf, labels=tmp.df$dataset, pos=2, cex=my.cex)
 
+# Same label for the gzipped version (disabled)
+# text(tmp.df$savings.serialized.gz, tmp.df$savings.rprotobuf.gz, labels=tmp.df$dataset, pos=2, cex=my.cex)
+
+# Second one is also an outlier
+tmp.df <- clean.df[which(clean.df$savings.diff == sort(clean.df$savings.diff)[2]),]
+# This minimum means most to the left of our line, so pos=2 is label to the left
+text(tmp.df$savings.serialized, tmp.df$savings.rprotobuf, labels=tmp.df$dataset, pos=2, cex=my.cex)
+#text(tmp.df$savings.serialized.gz, tmp.df$savings.rprotobuf.gz, labels=tmp.df$dataset, pos=2, cex=my.cex)
+
+
 tmp.df <- clean.df[which(clean.df$savings.diff == max(clean.df$savings.diff)),]
 # This minimum means most to the right of the diagonal, so pos=4 is label to the right
-text(tmp.df$savings.serialized, tmp.df$savings.rprotobuf, labels=tmp.df$dataset, pos=4)
-text(tmp.df$savings.serialized.gz, tmp.df$savings.rprotobuf.gz, labels=tmp.df$dataset, pos=4)
+# Only show the gzipped one.
+#text(tmp.df$savings.serialized, tmp.df$savings.rprotobuf, labels=tmp.df$dataset, pos=4, cex=my.cex)
+text(tmp.df$savings.serialized.gz, tmp.df$savings.rprotobuf.gz, labels=tmp.df$dataset, pos=4, cex=my.cex)
 
 #outlier.dfs <- clean.df[c(which(clean.df$savings.diff == min(clean.df$savings.diff)),
 
-legend("topleft", c("Raw", "Gzip Compressed"), pch=1:2, col=c("red", "blue"))
+legend("topleft", c("Raw", "Gzip Compressed"), pch=1:2, col=c("red", "blue"), cex=my.cex)
 
 interesting.df <- clean.df[unique(c(which(clean.df$savings.diff == min(clean.df$savings.diff)),
                              which(clean.df$savings.diff == max(clean.df$savings.diff)),
@@ -1041,7 +1057,9 @@
 			     which(clean.df$dataset == "TOTAL"))),c("dataset", "object.size", "serialized", "gzipped serialized", "RProtoBuf", "gzipped RProtoBuf", "savings.serialized", "savings.serialized.gz", "savings.rprotobuf", "savings.rprotobuf.gz")]
 # Print without .00 in xtable
 interesting.df$object.size <- as.integer(interesting.df$object.size)
+par("mar"=old.mar)
 @
+\includegraphics[width=0.45\textwidth]{figures/fig-SER}
 
 % latex table generated in R 3.0.2 by xtable 1.7-0 package
 % Wed Nov 26 15:31:30 2014
@@ -1054,13 +1072,14 @@
   & & default & gzipped & default & gzipped \\
   \cmidrule(r){2-6}
  crimtab & 7,936 & 4,641 (41.5\%) & 713 (91.0\%) & 1,655 (79.2\%) & 576 (92.7\%)\\
+ airquality & 5,496 & 4,551 (17.2\%) & 1,241 (77.4\%) & 2,874 (47.7\%) & 1,294 (76.5\%)\\
  faithful & 5,136 & 4,543 (11.5\%) & 1,339 (73.9\%) & 4,936 (3.9\%) & 1,776 (65.5\%)\\
    \hline
  All & 605,256 & 461,667 (24\%) & 138,937 (77\%) & 435,360 (28\%) & 142,134 (77\%)\\
 \hline
 \end{tabular}
 \end{center}
-\caption{(Top) Relative space savings of Protocol Buffers and native \proglang{R} serialization over the raw object sizes of each of the \Sexpr{n} data sets in the \pkg{datasets} package. Points to the left of the dashed $y=x$ line represent datasets that are more efficiently encoded with Protocol Buffers. (Bottom) Absolute space savings of two outlier datasets and the aggregate performance of all datasets.}
+\caption{(Top) Relative space savings of Protocol Buffers and native \proglang{R} serialization over the raw object sizes of each of the \Sexpr{n} data sets in the \pkg{datasets} package. Points to the left of the dashed $y=x$ line represent datasets that are more efficiently encoded with Protocol Buffers. (Bottom) Absolute space savings of three outlier datasets and the aggregate performance of all datasets.}
 \label{fig:compression}
 \end{figure}
 
@@ -1135,7 +1154,7 @@
 written in other languages and only the resulting output histograms
 need to be manipulated in \proglang{R}.
 
-\subsection*{A trivial single-machine example for Python to R serialization}
+\subsection*{A simple single-machine example for Python to R serialization}
 
 To create HistogramState
 messages in Python for later consumption by \proglang{R}, we first compile the