[Rcpp-devel] New package: RcppMLPACK, integration with MLPACK using Rcpp

Qiang Kou qkou at umail.iu.edu
Fri Jul 25 02:25:16 CEST 2014


RcppMLPACK is almost done, and I really hope it is useful for other people.
Testing and bug reports are very welcome, not only for the code but also
for the results. You can try it now from my repo:
https://github.com/thirdwing/RcppMLPACK

I am afraid there will be problems on Windows related to the *size_t* type.

MLPACK is a scalable C++ machine learning library providing an intuitive
and simple API. It implements a wide array of machine learning methods and
uses Armadillo types for input and output. For more details about MLPACK,
please visit its homepage: http://www.mlpack.org/

Since we have Rcpp and RcppArmadillo, which can integrate C++ and Armadillo
with R seamlessly, RcppMLPACK becomes something very natural. The
RcppMLPACK package includes the source code from the MLPACK library. Thus
users do not need to install MLPACK itself in order to use RcppMLPACK.

I use k-means as an example. With RcppMLPACK, a k-means method can be
implemented as shown below. The interface between R and C++ is handled by
Rcpp and RcppArmadillo.

#include "RcppMLPACK.h"

using namespace mlpack::kmeans;
using namespace Rcpp;

// [[Rcpp::export]]
List kmeans(const arma::mat& data, const int& clusters) {

    arma::Col<size_t> assignments;

    // Initialize with the default arguments.
    KMeans<> k;

    k.Cluster(data, clusters, assignments);

    return List::create(_["clusters"] = clusters,
                        _["result"]   = assignments);
}

The *inline* package provides a complete wrapper around the compilation,
linking, and loading steps, so all of them can be done in an R session.
RcppMLPACK supports inline compilation as well.

library(inline)
library(RcppMLPACK)
code <- '
  arma::mat data = as<arma::mat>(test);
  int clusters = as<int>(n);
  arma::Col<size_t> assignments;
  mlpack::kmeans::KMeans<> k;
  k.Cluster(data, clusters, assignments);
  return List::create(_["clusters"] = clusters,
                      _["result"]   = assignments);
'
mlKmeans <- cxxfunction(signature(test="numeric", n="integer"), code,
                        plugin="RcppMLPACK")
data(trees, package="datasets")
mlKmeans(t(trees), 3)

There is one point we need to pay attention to: Armadillo matrices are
stored in a *column-major format*, and for speed MLPACK expects
*observations to be stored as columns and dimensions as rows*. R data
usually has observations as rows, so an additional transpose may be needed
when calling MLPACK, as with t(trees) above.
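
To make the layout point concrete, here is a small standalone sketch in
plain C++. It does not use Armadillo or MLPACK; the ColMajor struct and the
transpose() helper are hypothetical names introduced only for illustration.
It shows that after the transpose each observation occupies one contiguous
column in column-major storage.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type, not part of MLPACK or Armadillo: a dense matrix
// stored column-major, i.e. element (r, c) lives at data[c * rows + r].
struct ColMajor {
    std::size_t rows, cols;
    std::vector<double> data;
    double& at(std::size_t r, std::size_t c)       { return data[c * rows + r]; }
    double  at(std::size_t r, std::size_t c) const { return data[c * rows + r]; }
};

// Turn an "observations as rows" matrix (the usual R layout) into an
// "observations as columns" matrix, which is what t() does on the R side.
ColMajor transpose(const ColMajor& m) {
    assert(m.data.size() == m.rows * m.cols);
    ColMajor out{m.cols, m.rows, std::vector<double>(m.data.size())};
    for (std::size_t c = 0; c < m.cols; ++c)
        for (std::size_t r = 0; r < m.rows; ++r)
            out.at(c, r) = m.at(r, c);
    return out;
}
```

With 2 observations of 3 dimensions each, the transposed matrix has 3 rows
and 2 columns, and the three values of the first observation sit next to
each other at the start of the storage.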

The package also contains a RcppMLPACK.package.skeleton() function for
people who want to use MLPACK code in their own package. It follows the
structure of RcppArmadillo.package.skeleton().

library(RcppMLPACK)
RcppMLPACK.package.skeleton("foobar")
Creating directories ...
Creating DESCRIPTION ...
Creating NAMESPACE ...
Creating Read-and-delete-me ...
Saving functions and data ...
Making help files ...
Done.
Further steps are described in './foobar/Read-and-delete-me'.

Adding RcppMLPACK settings
 >> added Imports: Rcpp
 >> added LinkingTo: Rcpp, RcppArmadillo, BH, RcppMLPACK
 >> added useDynLib and importFrom directives to NAMESPACE
 >> added Makevars file with RcppMLPACK settings
 >> added Makevars.win file with RcppMLPACK settings
 >> added example src file using MLPACK classes
 >> invoked Rcpp::compileAttributes to create wrappers

system("ls -R foobar")
foobar:
DESCRIPTION  man  NAMESPACE  R  Read-and-delete-me  src

foobar/man:
foobar-package.Rd

foobar/R:
RcppExports.R

foobar/src:
kmeans.cpp  Makevars  Makevars.win  RcppExports.cpp

Even without performance testing, we would expect the C++ implementation
to be faster. A small wine data set from the UCI Machine Learning
Repository is used for benchmarking, with a script using the rbenchmark
package:

suppressMessages(library(rbenchmark))
res <- benchmark(mlKmeans(t(wine),3),
                 kmeans(wine,3),
                 columns=c("test", "replications", "elapsed",
                 "relative", "user.self", "sys.self"), order="relative")

For 100 replications, the MLPACK version of k-means (0.028s) is about 33
times faster than kmeans in R (0.947s). However, we should note that R
returns more information than just the clustering result, and the R
function performs many more checks.

There is an important problem in MLPACK: it uses the *size_t* type heavily.

Wrapping this type can cause problems, since on 64-bit Windows *size_t* is
defined as *unsigned long long int*. I found no errors of this kind when
testing on Ubuntu.
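
One possible way to sidestep the type question, sketched here as an
assumption and not as what RcppMLPACK actually does, is to copy *size_t*
results into plain int (the type behind R integer vectors) before handing
them back. The helper name to_r_integers() is hypothetical, and the sketch
uses std::vector in place of arma::Col<size_t> so that it stands alone.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not part of Rcpp or MLPACK): copy size_t cluster
// labels into int, refusing values that an R integer cannot represent.
std::vector<int> to_r_integers(const std::vector<std::size_t>& labels) {
    const std::size_t max_int =
        static_cast<std::size_t>(std::numeric_limits<int>::max());
    std::vector<int> out;
    out.reserve(labels.size());
    for (std::size_t v : labels) {
        if (v > max_int)
            throw std::range_error("label does not fit in an R integer");
        out.push_back(static_cast<int>(v));
    }
    assert(out.size() == labels.size());
    return out;
}
```

Because the conversion is explicit, the code behaves the same whether
*size_t* is an alias for unsigned long or for unsigned long long on a given
platform.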

Testing and bug reports are very welcome, not only for the code but also
for the results.

Best,

KK
-- 
Qiang Kou
qkou at umail.iu.edu
School of Informatics and Computing, Indiana University