[Rcpp-devel] New package: RcppMLPACK, integration with MLPACK using Rcpp

Dirk Eddelbuettel edd at debian.org
Fri Jul 25 03:30:26 CEST 2014


KK,

Nice work!!  

Looking forward to playing with this some more, and CCing Conrad and Ryan.
Some more comments below.

On 24 July 2014 at 20:25, Qiang Kou wrote:
| RcppMLPACK is almost done, and I really hope it is useful for other people.
| Testing and bug reports are very welcome, for the results as well as the code.
| Now you can try it from my repo: https://github.com/thirdwing/RcppMLPACK 
| 
| I am afraid there is a known problem on Windows with the size_t type.
| 
| MLPACK is a scalable C++ machine learning library providing an intuitive and
| simple API. It implements a wide array of machine learning methods and uses
| Armadillo as input/output. For more detail about MLPACK, please visit its
| homepage: http://www.mlpack.org/ 
| 
| Since we have Rcpp and RcppArmadillo, which can integrate C++ and Armadillo
| with R seamlessly, RcppMLPACK becomes something very natural. The RcppMLPACK
| package includes the source code from the MLPACK library. Thus users do not
| need to install MLPACK itself in order to use RcppMLPACK. 
| 
| I use k-means as an example. With RcppMLPACK, a k-means method can be
| implemented as below. The interface between R and C++ is handled by Rcpp and
| RcppArmadillo.
| 
| #include "RcppMLPACK.h"
| 
| using namespace mlpack::kmeans;
| using namespace Rcpp;
| 
| // [[Rcpp::export]]
| List kmeans(const arma::mat& data, const int& clusters) {
|     
|     arma::Col<size_t> assignments;
| 
|     // Initialize with the default arguments.
|     KMeans<> k;
| 
|     k.Cluster(data, clusters, assignments); 
| 
|     return List::create(_["clusters"] = clusters,
|                         _["result"]   = assignments);
| }
| 
| The inline package provides a complete wrapper around the compilation, linking,
| and loading steps, so all of them can be done within an R session. There is no
| reason RcppMLPACK should not support inline compilation as well:

It also works via sourceCpp() as Rcpp Attributes uses the same plugin:

   R> sourceCpp("/tmp/rcppmlpackEx.cpp")   # saved your code in /tmp/rcppmlpackEx.cpp
   R> data(trees, package="datasets")
   R> kmeans(t(trees), 3)
   KMeans::Cluster(): converged after 9 iterations.
   $clusters
   [1] 3

   $result
         [,1]
   [1,]    2
   [2,]    2
   [3,]    2
   [.... rest of output omitted for brevity ...]


All it takes is to add one line

   // [[Rcpp::depends(RcppMLPACK)]]

to the source code you showed above.

| library(inline)
| library(RcppMLPACK)
| code <- '
|   arma::mat data = as<arma::mat>(test);
|   int clusters = as<int>(n);
|   arma::Col<size_t> assignments;
|   mlpack::kmeans::KMeans<> k;
|   k.Cluster(data, clusters, assignments); 
|   return List::create(_["clusters"] = clusters,
|                       _["result"]   = assignments);
| '
| mlKmeans <- cxxfunction(signature(test="numeric", n="integer"), code,
|                         plugin="RcppMLPACK")
| data(trees, package="datasets")
| mlKmeans(t(trees), 3)
| 
| There is one point we need to pay attention to: Armadillo matrices in MLPACK
| are stored in a column-major format for speed. That means observations are
| stored as columns and dimensions as rows. So when using MLPACK, an additional
| transpose may be needed.
| 
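To make the layout point concrete, here is a minimal standalone sketch in plain C++, with no Armadillo or MLPACK involved. The helper name transpose_col_major is made up for illustration. Both R and Armadillo store matrices column by column; what differs is the orientation MLPACK expects, with one observation per column. An R data matrix with observations in rows therefore has to be transposed first, which is what t(trees) does in the examples above.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper, not part of MLPACK or RcppMLPACK: transpose a
// column-major matrix of `rows` x `cols` elements into a column-major
// matrix of `cols` x `rows` elements.
std::vector<double> transpose_col_major(const std::vector<double>& x,
                                        std::size_t rows,
                                        std::size_t cols) {
    if (x.size() != rows * cols)
        throw std::invalid_argument("size does not match rows * cols");

    std::vector<double> out(x.size());
    for (std::size_t j = 0; j < cols; ++j) {
        for (std::size_t i = 0; i < rows; ++i) {
            // Element (i, j) of x becomes element (j, i) of out.
            // out has `cols` rows, so its column stride is `cols`.
            out[j + i * cols] = x[i + j * rows];
        }
    }
    return out;
}
```

For a 3 x 2 input (three observations, two variables) stored as {1, 2, 3, 10, 20, 30}, the result is the 2 x 3 buffer {1, 10, 2, 20, 3, 30}, so the two values of each observation end up next to each other in memory.
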
| The package also contains an RcppMLPACK.package.skeleton() function for people
| who want to use MLPACK code in their own package. It follows the structure of
| RcppArmadillo.package.skeleton().
| 
| library(RcppMLPACK)
| RcppMLPACK.package.skeleton("foobar")
| Creating directories ...
| Creating DESCRIPTION ...
| Creating NAMESPACE ...
| Creating Read-and-delete-me ...
| Saving functions and data ...
| Making help files ...
| Done.
| Further steps are described in './foobar/Read-and-delete-me'.
| 
| Adding RcppMLPACK settings
|  >> added Imports: Rcpp
|  >> added LinkingTo: Rcpp, RcppArmadillo, BH, RcppMLPACK
|  >> added useDynLib and importFrom directives to NAMESPACE
|  >> added Makevars file with RcppMLPACK settings
|  >> added Makevars.win file with RcppMLPACK settings
|  >> added example src file using MLPACK classes
|  >> invoked Rcpp::compileAttributes to create wrappers
| 
| system("ls -R foobar")
| foobar:
| DESCRIPTION  man  NAMESPACE  R  Read-and-delete-me  src
| 
| foobar/man:
| foobar-package.Rd
| 
| foobar/R:
| RcppExports.R
| 
| foobar/src:
| kmeans.cpp  Makevars  Makevars.win  RcppExports.cpp

Nice one too!
 
| Even without performance testing, we would expect the C++ implementation to
| be faster. A small wine data set from the UCI data set repository is used for
| benchmarking, with a script based on the rbenchmark package:
| 
| suppressMessages(library(rbenchmark))
| res <- benchmark(mlKmeans(t(wine),3),
|                  kmeans(wine,3),
|                  columns=c("test", "replications", "elapsed",
|                  "relative", "user.self", "sys.self"), order="relative")
| 
| For 100 replications, the MLPACK version of k-means (0.028s) is about 33 times
| faster than kmeans in R (0.947s). However, we should note that R returns more
| information than just the clustering result, and that the R function performs
| many more checks.
| 
| There is one important issue with MLPACK: it uses the size_t type heavily.
| 
| Wrapping this type will cause problems, since on 64-bit Windows size_t is
| defined as unsigned long long int. No such error showed up during testing on
| my Ubuntu machine.

That is a known issue with R insisting on C++ 1998 without the interim
changes.  The simplest way around it (in the context of R and CRAN) is to
enable C++11 -- I do so in RcppCNPy and RcppBDT as I need 'long long' in
both.
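
Another route, independent of the compiler standard, is to avoid handing size_t back to R at all. R has no unsigned 64-bit integer type, but a double represents every integer below 2^53 exactly, which is far more than any cluster label needs. The following is only a sketch under that assumption, in plain C++ without Rcpp; the helper name labels_to_numeric and its one_based flag are made up for illustration, and this is not what RcppMLPACK currently does.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: copy size_t labels into doubles so the result can be
// returned to R as an ordinary numeric vector. Set one_based to shift the
// labels from 0, 1, 2, ... to 1, 2, 3, ...
std::vector<double> labels_to_numeric(const std::vector<std::size_t>& labels,
                                      bool one_based = false) {
    // 2^53: every non-negative integer below this converts to double exactly.
    const double limit = 9007199254740992.0;

    std::vector<double> out;
    out.reserve(labels.size());
    for (std::size_t i = 0; i < labels.size(); ++i) {
        const double v = static_cast<double>(labels[i]);
        if (v >= limit)
            throw std::range_error("label too large for an exact double");
        out.push_back(one_based ? v + 1.0 : v);
    }
    return out;
}
```

With labels {0, 1, 2, 1} this gives {0, 1, 2, 1}, or {1, 2, 3, 2} when one_based is true. The cost is one extra copy of the assignment vector, which is small next to the clustering itself.
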
 
| Testing and bug reports are very welcome, for the results as well as the code.

Very exciting. I am sure you'll get a ton of good feedback.

Once again, nice work and congratulations. 

Dirk

-- 
http://dirk.eddelbuettel.com | @eddelbuettel | edd at debian.org

