# [Rcpp-commits] r572 - papers/rjournal

Fri Feb 5 11:14:48 CET 2010

Author: romain
Date: 2010-02-05 11:14:48 +0100 (Fri, 05 Feb 2010)
New Revision: 572

Modified:
papers/rjournal/EddelbuettelFrancois.tex
Log:
some more content, attempt at a summary

Modified: papers/rjournal/EddelbuettelFrancois.tex
===================================================================
--- papers/rjournal/EddelbuettelFrancois.tex	2010-02-05 02:30:04 UTC (rev 571)
+++ papers/rjournal/EddelbuettelFrancois.tex	2010-02-05 10:14:48 UTC (rev 572)
@@ -113,11 +113,6 @@
\section{Classic Rcpp}
\label{sec:classic_rcpp}

-% [Romain:] Why 'at least initial'
-% [Dirk:] For 'Classic Rcpp'
-% [Romain:] I'd argue it is still the case with the new api
-% [Dirk:] Conceded in last rewrite: 'has always been'
-%         (and I think we can nuke the comments)
The core focus of \pkg{Rcpp}---particularly for the earlier API described in
this section---has always been on allowing the programmer to add C++-based
functions. We use this term in the standard mathematical sense of providing
@@ -168,17 +163,10 @@
\code{Rcpp.h} is needed to use the \pkg{Rcpp} API.  Second, given two
\code{SEXP} types---the bread-and-butter of all internal R programming---a
third is returned.  Third, both inputs are converted to C++ vector types that
-are \textsl{templated} (meaning that a type-indepedent framework can be
+are \textsl{templated} (meaning that a type-independent framework can be
applied to create actual vectors of the specified type). Here a standard \code{double}
type is used to create a vector of doubles from the template type.
-% [Romain:] I think the previous sentence is confusing, one might think
-% that the same vector can hold int and double
-% [Dirk:] Better?
-% [Romain:] I think so, maybe the (...) should be a footnote
-% [Dirk:] Sorry, which '(...)' ?
-% [Romain:] (which means ... base types)
-% [Dirk:] Ah. Better now?
-Fourth, the usefulness off these classes can be seen when we query the
+Fourth, the usefulness of these classes can be seen when we query the
vectors directly for their size---using the \code{size} member function---in
order to reserve a new result type of appropriate length whereas use based
on C arrays would have required additional parameters for the length of
@@ -194,16 +182,19 @@

We argue that this usage is already easier to read, write and debug than the
C macro-based approach supported by R itself. Possible performance issues and
-other potentual limitations will be discussed throughout the article and
+other potential limitations will be discussed throughout the article and
reviewed at the end.

\section{New \pkg{Rcpp} API}
\label{sec:new_rcpp}

-Having discussed the Classic Rcpp' API and its deployment in the previous
-section, we now turn to the New Rcpp'. The new API is a complete redesign
-based on the usage experience of several years of Rcpp deployment, as well as
-current C++ design approaches.
+% [Romain]: removing this pedestrian sentence :
+% Having discussed the Classic Rcpp' API and its deployment in the previous
+% section, we now turn to the New Rcpp'.
+In late 2009, the Rcpp API was dramatically extended, leading to a
+complete redesign based on the usage experience of several
+years of Rcpp deployment, needs from other projects,
+as well as current C++ design approaches.

\subsection{Rcpp Class hierarchy}

@@ -216,10 +207,11 @@
a very thin wrapper around the \code{SEXP} it encapsulates, the
\code{SEXP} is indeed the only data member of an \code{RObject}.

-The \code{RObject} class takes advantage of the explicit life cyle of
-c++ objects to implement garbage collection of R objects. The
-\code{RObject} effectively treats its underlying \code{SEXP} as
-a resource. The constructor of the \code{RObject} class takes
+The \code{RObject} class takes advantage of the explicit life cycle of
+C++ objects to manage exposure of the underlying R object to the
+garbage collector. The \code{RObject} effectively treats
+its underlying \code{SEXP} as a resource.
+The constructor of the \code{RObject} class takes
the necessary measures to guarantee that the underlying \code{SEXP}
is protected from the garbage collector, and the destructor
assumes the responsibility to withdraw that protection.
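
The constructor/destructor pairing described above can be sketched without R at all. The following is a minimal illustration only, assuming a hypothetical \code{MockObject} with a protection counter in place of a real \code{SEXP}; the names \code{Guard}, \code{mock\_protect} and \code{mock\_release} are invented for this sketch and are not part of \pkg{Rcpp} or the R API.

```cpp
#include <cassert>

// Hypothetical stand-in for an R object: it only counts how many
// guards currently protect it. This is not a real SEXP.
struct MockObject {
    int protect_count = 0;
};

// Hypothetical protect/release helpers (assumptions, not R API calls).
inline void mock_protect(MockObject* x) { ++x->protect_count; }
inline void mock_release(MockObject* x) { --x->protect_count; }

// Thin wrapper: the guarded handle is its only data member.
class Guard {
public:
    // Acquire protection when the wrapper is constructed.
    explicit Guard(MockObject* x) : obj_(x) {
        assert(x != nullptr);
        mock_protect(obj_);
    }
    // A copy protects the same object a second time.
    Guard(const Guard& other) : obj_(other.obj_) { mock_protect(obj_); }
    // Protect the new target before releasing the old one.
    Guard& operator=(const Guard& other) {
        if (this != &other) {
            mock_protect(other.obj_);
            mock_release(obj_);
            obj_ = other.obj_;
        }
        return *this;
    }
    // Withdraw protection when the wrapper goes out of scope.
    ~Guard() { mock_release(obj_); }

    MockObject* get() const { return obj_; }

private:
    MockObject* obj_;
};

// Returns the protection count observed while a Guard is alive.
inline int count_inside_scope(MockObject& x) {
    Guard g(&x);
    return x.protect_count;
}
```

In this sketch the count rises while a \code{Guard} is in scope and falls back when it is destroyed, so no explicit release call appears in user code.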
@@ -241,20 +233,19 @@

Similarly, the member functions \code{hasSlot} and \code{slot}
can be used to manage slots of an S4 object. These functions throw
-c++ exceptions when used on objects that are not S4 objects.
+C++ exceptions when used on objects that are not S4 objects, or when
+trying to access a slot that does not exist for a class.

-% example of using attr or slot ?
-% mention proxy pattern ?
-
\subsection{Derived classes}

Internally, an R object must have one type amongst the set of
predefined types, commonly referred to as SEXP types. R internals
\citep{R:ints} documents these various types.
-\pkg{Rcpp} associates a dedicated C++ class for most SEXP types.
+\pkg{Rcpp} associates a dedicated C++ class with most SEXP types, and
+each class therefore exposes only the functionality that is relevant to
+the R object that it encapsulates.

-Each class contains functionality that is relevant to the R object
-that it encapsulates. For example \code{Rcpp::Environment} contains
+For example \code{Rcpp::Environment} contains
member functions to manage objects in the associated environment.
Classes related to vectors (\code{IntegerVector}, \code{NumericVector},
\code{RawVector}, \code{LogicalVector}, \code{CharacterVector},
@@ -264,7 +255,7 @@
The following sub-sections present typical uses of Rcpp classes in
comparison with the same code expressed using functions of the R api.

-\subsection{Numeric vectors}  % [Dirk] I think we need upper case
+\subsection{Numeric vectors}

The following code snippet is extracted from Writing R extensions
\citep{R:exts}. It creates a \code{numeric} vector of two elements
@@ -280,17 +271,6 @@

Although this is one of the simplest examples in Writing R extensions,
it seems verbose and it is not trivial at first sight what is happening.
-%\begin{itemize}
-%\item \code{allocVector} is used to allocate memory. We must supply to it
-%the type of data (\code{REALSXP}) and the number of elements.
-%\item once allocated, the \code{ab} object must be protected from
-%garbage collection. Since the garbage collector can happen at any time,
-%not protecting an object means its memory might be reclaimed before we are
-%finished with it.
-%\item The \code{REAL} macro returns a pointer to the beginning of the
-%actual array; its indexing is does not resemble either R or C++.
-%\end{itemize}
-% [Dirk] More compact without enumerate list?
\code{allocVector} is used to allocate memory; we must also supply it with
the type of data (\code{REALSXP}) and the number of elements.  Once
allocated, the \code{ab} object must be protected from garbage
@@ -308,18 +288,6 @@
ab[1] = 67.89;
\end{example}

-% The code contains much less idiomatic decorations. Here are the steps involved:
-% \begin{itemize}
-% \item The \code{NumericVector} constructor is given the number
-% of elements the vector contains (2), this hides a call to the
-% \code{allocVector} we saw previously.
-% \item Also hidden is protection of the
-% object from garbage collection, which is a behavior that \code{NumericVector}
-% inherits from \code{RObject}
-% \item values are assigned to the first and second elements of the vector.
-% This is achieved \code{NumericVector} overloads the \code{operator[]}.
-% \end{itemize}
-% [Dirk] Idem: no bullets
The code contains fewer idiomatic decorations. The \code{NumericVector}
constructor is given the number of elements the vector contains (2); this
hides a call to the \code{allocVector} we saw previously. Also hidden is
@@ -354,47 +322,60 @@
UNPROTECT(1);
\end{example}

-Using the \pkg{Rcpp::CharacterVector} class, we can express this code as :
+This requires the programmer to know about \code{PROTECT}, \code{UNPROTECT},
+\code{SEXP}, \code{allocVector}, \code{SET\_STRING\_ELT} and \code{mkChar}.

+Using the \pkg{Rcpp::CharacterVector} class, we can express the same
+code more concisely:
+
\begin{example}
CharacterVector ab(2) ;
ab[0] = "foo" ;
ab[1] = "bar" ;
\end{example}

-Additionally, if C++0x initializer list is implemented by the compiler, the
-code can be trimmed to the essential :
+\section{R and C++ data interchange}

-\begin{example}
-CharacterVector ab = \{"foo","bar"\};
-\end{example}
-
-\section{R and C++ data interchange} % [Dirk] Reorder to fit on 1 line
-
functions to perform conversion of C++ objects to R objects and back.

+\subsection{C++ to R : wrap}
+
The C++ to R conversion is performed by the \code{Rcpp::wrap} templated
function. It uses advanced template metaprogramming techniques
to convert a wide and extensible set of types and classes to the
-most appropriate type of R object. \code{wrap} will
-currently handle these C++ types:
+most appropriate type of R object. The signature of the \code{wrap}
+template is:
+
+\begin{example}
+template <typename T>
+SEXP wrap(const T& object) ;
+\end{example}
+
+The templated function takes a reference to a wrappable
+object and converts this object into a \code{SEXP}, which is what R expects.
+The currently wrappable types are:
\begin{itemize}
-\item primitive types, \code{int}, \code{double}, ... are converted
-into R vectors of the appropriate type;
-\item \code{std::string} are converted to R character vectors;
-\item STL containers such as \code{std::vector<T>} or \code{std::list<T>}
-are wrappable as long as the template type T that they contain is wrappable;
-\item STL maps (e.g. \code{std::map<std::string,T>});
-which uses \code{std::string} for keys are also wrappable as long as
+\item primitive types, \code{int}, \code{double}, ... which are converted
+into atomic R vectors of the appropriate type;
+\item \code{std::string} objects, which are converted to R atomic character vectors;
+\item STL-like containers such as \code{std::vector<T>} or \code{std::list<T>},
+as long as the template parameter type \code{T} is itself wrappable;
+\item STL-like maps which use \code{std::string} for keys
+(e.g. \code{std::map<std::string,T>}), as long as
the type \code{T} is wrappable;
\item any type that implements implicit conversion to \code{SEXP} through the
-\code{operator SEXP()} are wrappable.
+\code{operator SEXP()};
+\item any type for which the \code{wrap} template is partially or fully
+specialized.
+% [Romain]: should we mention RInside as an example
\end{itemize}

-In addition, the \code{wrap} template may be partially or fully specialized by
-third party code to extend its capabilities. The design allow composition,
-so for example objects of the class
+Whether an object is wrappable is resolved at compile time, and the
+dispatch of the appropriate implementation is performed by the compiler
+using modern techniques of template metaprogramming and class traits.
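
To make the idea of compile-time resolution concrete, here is a small sketch of a trait that answers "is this type wrappable?" for the categories listed above. It is an illustration under stated assumptions: the trait name \code{is\_wrappable} and the type \code{Opaque} are hypothetical and are not claimed to match the internals of \pkg{Rcpp}.

```cpp
#include <map>
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical trait (illustration only): false unless a
// specialization below says otherwise.
template <typename T>
struct is_wrappable : std::false_type {};

// Primitive types and strings are wrappable.
template <> struct is_wrappable<int> : std::true_type {};
template <> struct is_wrappable<double> : std::true_type {};
template <> struct is_wrappable<std::string> : std::true_type {};

// A vector is wrappable when its element type is wrappable.
template <typename T>
struct is_wrappable<std::vector<T> > : is_wrappable<T> {};

// A string-keyed map is wrappable when its mapped type is wrappable.
template <typename T>
struct is_wrappable<std::map<std::string, T> > : is_wrappable<T> {};

// A type with no specialization is rejected.
struct Opaque {};

// Both checks are evaluated by the compiler, not at run time.
static_assert(
    is_wrappable<std::vector<std::map<std::string, int> > >::value,
    "a vector of string-keyed maps of int composes");
static_assert(!is_wrappable<Opaque>::value,
              "an unknown type is not wrappable");
```

Because each container specialization inherits from the trait of its element type, composition falls out of the recursion rather than being enumerated case by case.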
+
+The design allows composition, so for example objects of the class
\code{std::vector< std::map<std::string,int> >} are wrappable. This is
because \code{int} is wrappable (as a primitive type), consequently
\code{std::map<std::string,int>} is wrappable (as an STL-like map of
@@ -415,11 +396,11 @@
v.push_back( m1) ;
v.push_back( m2) ;

-wrap( v ) ;
+Rcpp::wrap( v ) ;
\end{example}

The code creates a list of two named vectors, equal to the list that
-can be created by the following R code:
+can be created by the following R code.

\begin{example}
list(
@@ -427,6 +408,8 @@
c( bar = 2L, bling = 3L, foo = 1L) )
\end{example}

+\subsection{R to C++ : as}
+
The reverse conversion is implemented by variations of the
\code{Rcpp::as} template. \code{as} offers less flexibility and currently
handles conversion of R objects into primitive types (bool, int, std::string, ...),
@@ -436,13 +419,16 @@
be fully or partially specialized to manage conversion of R data
structures to third party types.

+\subsection{Implicit use of converters}
+
The converters offered by \code{wrap} and \code{as} provide a very
useful framework to implement the logic of the code in terms of C++
-data structures and then explicitely convert data back to R, ...
+data structures and then explicitly convert data back to R.

-The converters are also used implicitely in various places in the
-\code{Rcpp} api. Consider the following code that uses the
-\code{Rcpp::Environment} class to interchange data between C++ and R.
+The converters are also used implicitly
+in various places in the \code{Rcpp} API.
+Consider the following code that uses the \code{Rcpp::Environment} class to
+interchange data between C++ and R.

\begin{example}
# assuming the global environment contains
@@ -463,15 +449,24 @@
global["y"] = map ;
\end{example}

-In the first part of the example, \code{as} is used implicitely to convert
-the object "x" from the global environment into an instance
-of the \code{std::vector<double>} class. In the second part of the example,
-\code{wrap} is used implicitely to convert the object of class
-\code{std::map<std::string,std::string>} into an R object, a named
-character vector in this case.
+In the first part of the example, the code extracts a
+\code{std::vector<double>} from the global environment. This is
+achieved by the templated \code{operator[]} of \code{Environment}
+that first extracts the requested object from the environment as a \code{SEXP},
+and then outsources to \code{Rcpp::as} the creation of the
+requested type.

-\section{Other examples}
+In the second part of the example, the \code{operator[]}
+first delegates to \code{wrap} the production of an R object based on the
+type that is passed in (\code{std::map<std::string,std::string>}),
+and then assigns the object to the requested name.

+The same mechanism is used throughout the API, including access to and
+modification of object attributes, slots, elements of generic vectors (lists),
+and function arguments.
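
One way such implicit conversion can be arranged is with a small proxy object returned by \code{operator[]}. The sketch below is an assumption about the general technique, not a description of how \code{Rcpp::Environment} is implemented: \code{Store}, \code{from\_text} and \code{to\_text} are hypothetical names, and values are held as text so that the example runs without R.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Hypothetical converters standing in for as/wrap; values are kept as text.
template <typename T>
T from_text(const std::string& s) {
    std::istringstream in(s);
    T v = T();
    in >> v;
    assert(!in.fail());  // the stored text must parse as a T
    return v;
}

template <typename T>
std::string to_text(const T& v) {
    std::ostringstream out;
    out << v;
    return out.str();
}

// Hypothetical container playing the role of an environment.
class Store {
public:
    // Proxy returned by operator[]; it converts on read and on write.
    class Proxy {
    public:
        Proxy(Store& s, const std::string& name) : store_(s), name_(name) {}

        // Reading: convert the stored value to the requested type.
        template <typename T>
        operator T() const { return from_text<T>(store_.data_[name_]); }

        // Writing: convert the value, then bind it to the name.
        template <typename T>
        Proxy& operator=(const T& v) {
            store_.data_[name_] = to_text(v);
            return *this;
        }

    private:
        Store& store_;
        std::string name_;
    };

    Proxy operator[](const std::string& name) { return Proxy(*this, name); }

private:
    std::map<std::string, std::string> data_;
};
```

With this in place, \code{s["x"] = 42;} followed by \code{int x = s["x"];} round-trips the value, and neither converter is named at the call site.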
+
+\section{Function calls}
+
The last example shows how to use \pkg{Rcpp} to emulate the R code below.

\begin{example}
@@ -488,7 +483,7 @@
\end{example}

We first pull out the \code{rnorm} function from the environment
-called \samp{package:stats} in the search path, then call the function
+called \samp{package:stats} in the search path, then simply call the function
using syntax similar to calling the function in R. The \code{Rcpp::Named}
class is a utility class that is used to emulate named arguments.

@@ -538,10 +533,8 @@
return res ;
\end{example}

-For more examples, the reader is invited to
-refer to the documentation included in \pkg{Rcpp}
-as well as the many examples that the package contains as part of
-its unit tests.
+More examples are available as part of the documentation
+included in \pkg{Rcpp} as well as its unit tests.

\section{Using code `inline'}
\label{sec:inline}
@@ -555,21 +548,8 @@
with \pkg{Rcpp} by allowing for the use of additional header files and
libraries. This works particularly well with the \pkg{Rcpp} package where
headers and the library are automatically found if the appropriate option
-\code{Rcpp} to \texttt{cfunction} is set to true.
+\code{Rcpp} to \texttt{cfunction} is set to \code{TRUE}.

-% [Romain] : the next paragraph is very confusing
-% [Dirk] Is this better?
-% [Romain] Not sure. It seems to be only readable backwards. what about a
-%
-%          it might also be useful to show a quick example of inlining
-%          c++ code, for example say that we use it for our unit tests
-%          and show an example unit test
-% [Dirk] Done in last round
-% [Romain] But this shows the old api !!! and the same code as above so that
-%          people get to see it twice. I'd prefer moving these bits after
-%          the new Rcpp api section and show new api code inlined
-% [Dirk]  Agreed -- Will to past 'New Cpp API'
The use of \pkg{inline} is possible as \pkg{Rcpp} can be installed and
updated just like any other R package using \textsl{e.g.} the
\code{install.packages()} function for initial installation as well as
@@ -591,7 +571,8 @@
variable \code{src}, the function header is defined by the argument
\code{signature}---and we only need to enable \code{Rcpp=TRUE} to obtain a
new function \code{fun} based on the C++ code in \code{src} where we also
-switched fromn the classic Rcpp API to the new one:
+switched from the classic Rcpp API to the new one:
+
\begin{example}
src <- '
Rcpp::NumericVector xa(a);
@@ -599,17 +580,20 @@
int n_xa = xa.size(), n_xb = xb.size();
int nab = n_xa + n_xb - 1;
Rcpp::NumericVector xab(nab);
-  for (int i = 0; i < nab; i++) xab[i] = 0.0;
for (int i = 0; i < n_xa; i++)
for (int j = 0; j < n_xb; j++)
xab[i + j] += xa[i] * xb[j];
return xab;
-';
-fun <- cfunction(signature(a="numeric",
-                           b="numeric"),
-                 src, Rcpp=TRUE)
+'
+fun <- cfunction( signature(a="numeric", b="numeric"),
+	src, Rcpp=TRUE)
\end{example}

+% [Romain]: I've removed the line
+% for (int i = 0; i < nab; i++) xab[i] = 0.0;
+% because the constructor now does it automatically to match
+% what numeric( 10 ) would do in R
+
The main difference to the previous solution is that the input parameters are
directly passed to types \code{Rcpp::NumericVector}, and that the return
vector is automatically converted to a \code{SEXP} type through implicit
@@ -635,44 +619,22 @@
the best of it. The classic Rcpp translation of the convolve example from
\cite{R:exts} appears in sections~\ref{sec:classic_rcpp} and
\ref{sec:inline} where the latter example showed the use with the new API.
-%
-% [Dirk] Showing this example is now a little redundant as we just showed it
-%        for inline.  Shall we nuke it?
-% \begin{example}
-% #include <Rcpp.h>

-% RcppExport SEXP convolve3cpp(SEXP a, SEXP b)\{
-%     Rcpp::NumericVector xa(a);
-%     Rcpp::NumericVector xb(b);
-%     int n_xa = xa.size() ;
-%     int n_xb = xb.size() ;
-%     int nab = n_xa + n_xb - 1;
-%     Rcpp::NumericVector xab(nab);
-
-%     for (int i = 0; i < nab; i++) xab[i] = 0.0;
-%     for (int i = 0; i < n_xa; i++)
-%         for (int j = 0; j < n_xb; j++)
-%             xab[i + j] += xa[i] * xa[j];
-
-%     return xab ;
-% \}
-% \end{example}
-%
The \code{operator[]} is implemented as
efficiently as possible, using inlining and caching,
but this implementation is still less efficient than the
-reference C imlementation described in \cite{R:exts}.
+reference C implementation described in \cite{R:exts}.

-In order to achieve maximulm efficiency, the reference implementation
+In order to achieve maximum efficiency, the reference implementation
extracts the underlying array pointer (\code{double*}) and works
with pointer arithmetic, which is a built-in operation as opposed to
calling the \code{operator[]} on a user-defined class which has to
pay the price of object encapsulation.

-Modelled after containers of the standard template library,
+Modelled after containers of the C++ standard template library,
the \code{NumericVector} class provides two member functions \code{begin}
and \code{end} that can be used to retrieve, respectively,
-the pointer to the first and past to end elements of the underlying array.
+pointers to the first and one-past-the-end elements of the underlying array.
We can revisit the code to take advantage of this feature :

\begin{example}
@@ -690,7 +652,6 @@
double* pb = xb.begin() ;
double* pab = xab.begin() ;
int i,j=0;
-    for (i = 0; i < nab; i++) pab[i] = 0.0;
for (i = 0; i < n_xa; i++)
for (j = 0; j < n_xb; j++)
pab[i + j] += pa[i] * pb[j];
@@ -699,15 +660,17 @@
\}
\end{example}
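
For readers without an R toolchain at hand, the two access styles can be compared on plain C++ containers. This is a sketch under stated assumptions: \code{std::vector<double>} stands in for \code{NumericVector}, the function names are invented for illustration, and the code only demonstrates that both styles compute the same convolution. It makes no claim about the relative timings reported in this article.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Convolution using operator[] on the container itself.
inline std::vector<double> convolve_indexed(const std::vector<double>& a,
                                            const std::vector<double>& b) {
    assert(!a.empty() && !b.empty());
    // The constructor zero-initializes, so no explicit fill loop is needed.
    std::vector<double> ab(a.size() + b.size() - 1);
    for (std::size_t i = 0; i < a.size(); i++)
        for (std::size_t j = 0; j < b.size(); j++)
            ab[i + j] += a[i] * b[j];
    return ab;
}

// Same computation through raw pointers to the underlying arrays.
inline std::vector<double> convolve_pointer(const std::vector<double>& a,
                                            const std::vector<double>& b) {
    assert(!a.empty() && !b.empty());
    std::vector<double> ab(a.size() + b.size() - 1);
    const double* pa = a.data();
    const double* pb = b.data();
    double* pab = ab.data();
    const std::size_t n_a = a.size(), n_b = b.size();
    for (std::size_t i = 0; i < n_a; i++)
        for (std::size_t j = 0; j < n_b; j++)
            pab[i + j] += pa[i] * pb[j];
    return ab;
}
```

Both functions rely on the container being zero-filled at construction, which mirrors the removal of the explicit initialization loop in the example above.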

-The following timings show the time taken (in milliseconds)
-by 1000 replicates of each function with \code{a} and
-\code{b} containing 100 elements.
+We benchmarked the various implementations using
+1000 replicates of each function with \code{a} and
+\code{b} containing 100 elements. The timings are summarized in the
+table below:

\begin{center}
\begin{tabular}{cc}
Method & elapsed time (ms) \\
\hline
R API & 34 \\
+\hline
\code{RcppVector<double>} & 353 \\
\code{NumericVector::operator[]} & 55 \\
\code{NumericVector::begin} & 36 \\
@@ -715,30 +678,60 @@
\end{tabular}
\end{center}

-% need to comment the results, give reasons why the RcppVector<double> is
-% 10 times less efficient than the reference, show that 55-36 is the price for
-% encapsulation and say that the difference between 34 and 36 is not
-% significant
+The first implementation, using the traditional R API, unsurprisingly
+appears to be the most efficient. It takes advantage of pointer
+arithmetic and does not pay the price of object encapsulation.

+The last implementation comes close. Replicating the experiment
+shows that the difference is not significant.
+
+The third implementation illustrates the price of object encapsulation
+and calling an overloaded \code{operator[]} as opposed to using
+pointer arithmetic.
+
+Finally, the second implementation, from the classic Rcpp API,
+is clearly behind in terms of efficiency. The difference is mainly
+caused by the many unnecessary copies that the \code{RcppVector<double>}
+class performs. First, both objects (\code{a} and \code{b})
+are copied into C++ structures (\code{xa} and \code{xb}).
+Then, the result is constructed as another \code{RcppVector<double>}
+(\code{xab}) that is filled using the \code{operator()} which checks
+every time that the index is valid for the object. Finally, \code{xab}
+is converted back to an R object.
+
\section{Summary}

-% The \code{Rcpp} package provides comprehensive set of C++
-% classes aimed at significantly reducing the complexity and
-% discipline involved in combining R with compiled code.
-%
-% By assuming the responsibility of protection against garbage
-% collection automatically and transparently and encapsulating R objects
-% in C++ classes, \pkg{Rcpp} empowers the developper to concentrate on
-% the problem at hand instead of manually keeping track of
-% the \code{PROTECT}/\code{UNPROTECT} dance and without requiring
-% the expertise of knowing the details of the many macros and functions
-% of the R internal API.
-%
-% Evidently, C++ has a price and we have shown how to take advantage
-% of \code{Rcpp} to reduce --- if not eliminate --- the overhead while
-% significantly improving code clarity and maintainability.
+The \code{Rcpp} package simplifies integration of compiled code
+with R.

+The class hierarchy allows manipulation of R data structures in C++
+using member functions and operators directly related to the type
+of object being used, therefore reducing the level of expertise
+required to master the various functions and macros offered by the
+traditional R internal api. The classes assume the entire
+responsibility of garbage collection of objects, relieving the
+programmer from book-keeping operations with the protection stack
+and enabling him/her to focus on the scientific problem.

+Data interchange between R and C++, performed by the
+\code{wrap} and \code{as} template functions, allows the programmer
+to write logic in terms of C++ data structures, facilitating use
+of modern libraries such as the standard template library and its
+containers and algorithms. \code{wrap} and \code{as} are extensible
+by design and can be used either explicitly or implicitly throughout
+the API.
+
+Because it uses only thin wrappers around \code{SEXP} objects,
+the footprint of the \code{Rcpp} API is very lightweight and does not
+induce a significant performance price.
+
+Using the \code{Rcpp} API dramatically reduces the complexity
+of the code, which improves code readability and maintainability.
+The redesign of \code{Rcpp} was motivated by the needs of other
+projects such as \code{RInside} for easy embedding
+of R in a C++ application and \code{RProtoBuf} that interfaces
+with the protocol buffer library.
+
\bibliography{EddelbuettelFrancois}
