[Rcpp-devel] New package: RcppMLPACK, integration with MLPACK using Rcpp

Smith, Dale (Norcross) Dale.Smith at Fiserv.com
Mon Jul 28 15:34:40 CEST 2014


Excellent work, thanks very much.

Dale Smith, Ph.D.
Senior Financial Quantitative Analyst
Financial & Risk Management Solutions
Fiserv
Office: 678-375-5315
www.fiserv.com

From: rcpp-devel-bounces at r-forge.wu-wien.ac.at [mailto:rcpp-devel-bounces at r-forge.wu-wien.ac.at] On Behalf Of Qiang Kou
Sent: Thursday, July 24, 2014 8:25 PM
To: rcpp-devel at lists.r-forge.r-project.org
Subject: [Rcpp-devel] New package: RcppMLPACK, integration with MLPACK using Rcpp

RcppMLPACK is almost done, and I really hope it will be useful to other people. Testing and bug reports are very welcome, for the results as well as the code. You can try it now from my repo: https://github.com/thirdwing/RcppMLPACK

I am afraid there are likely to be problems on Windows related to the size_t type.

MLPACK is a scalable C++ machine learning library providing an intuitive and simple API. It implements a wide array of machine learning methods and uses Armadillo matrices for input and output. For more details about MLPACK, please visit its homepage: http://www.mlpack.org/

Since Rcpp and RcppArmadillo already integrate C++ and Armadillo with R seamlessly, RcppMLPACK is a natural next step. The RcppMLPACK package includes the source code of the MLPACK library, so users do not need to install MLPACK itself in order to use RcppMLPACK.

I use k-means as an example. With RcppMLPACK, a k-means method can be implemented as shown below. The interface between R and C++ is handled by Rcpp and RcppArmadillo.

#include "RcppMLPACK.h"

using namespace mlpack::kmeans;
using namespace Rcpp;

// [[Rcpp::export]]
List kmeans(const arma::mat& data, const int& clusters) {

    arma::Col<size_t> assignments;

    // Initialize with the default arguments.
    KMeans<> k;

    k.Cluster(data, clusters, assignments);

    return List::create(_["clusters"] = clusters,
                        _["result"]   = assignments);
}

The inline package provides a complete wrapper around the compilation, linking, and loading steps, so everything can be done within an R session. There is no reason RcppMLPACK should not support inline compilation too, so it works through the "RcppMLPACK" plugin:

library(inline)
library(RcppMLPACK)
code <- '
  arma::mat data = as<arma::mat>(test);
  int clusters = as<int>(n);
  arma::Col<size_t> assignments;
  mlpack::kmeans::KMeans<> k;
  k.Cluster(data, clusters, assignments);
  return List::create(_["clusters"] = clusters,
                      _["result"]   = assignments);
'
mlKmeans <- cxxfunction(signature(test="numeric", n ="integer"), code, plugin="RcppMLPACK")
data(trees, package="datasets")
mlKmeans(t(trees), 3)

There is one point we need to pay attention to: Armadillo matrices are stored in column-major format, and for speed MLPACK expects observations as columns and dimensions as rows. R data sets usually have observations as rows, so an additional transpose may be needed when using MLPACK, as with t(trees) above.
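To make the layout concrete, here is a minimal sketch in plain C++ (standard library only, no Armadillo or MLPACK). The names ColMajorMatrix and observationsToColumns are hypothetical and exist only for this illustration. It assumes the input has one observation per row and produces the transposed, column-major layout, in which each observation occupies one contiguous column.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type: a dense matrix stored column-major, the same
// storage order Armadillo uses. Element (r, c) lives at values[r + c * n_rows].
struct ColMajorMatrix {
    std::size_t n_rows;
    std::size_t n_cols;
    std::vector<double> values;

    double at(std::size_t r, std::size_t c) const {
        assert(r < n_rows && c < n_cols);
        return values[r + c * n_rows];
    }
};

// Takes observations given one per row (the usual R layout) and returns the
// transposed matrix: one observation per column, one dimension per row.
ColMajorMatrix observationsToColumns(const std::vector<std::vector<double>>& obs) {
    const std::size_t n_obs = obs.size();
    const std::size_t n_dims = n_obs == 0 ? 0 : obs[0].size();
    ColMajorMatrix m{n_dims, n_obs, std::vector<double>(n_dims * n_obs)};
    for (std::size_t j = 0; j < n_obs; ++j) {
        assert(obs[j].size() == n_dims);  // every observation has the same length
        for (std::size_t i = 0; i < n_dims; ++i) {
            m.values[i + j * n_dims] = obs[j][i];  // observation j fills column j
        }
    }
    return m;
}
```

With two observations of three dimensions each, the result has 3 rows and 2 columns, and the values of the second observation sit next to each other in memory. In real code, t() on the R side or arma::trans() on the C++ side does the same job.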

The package also contains an RcppMLPACK.package.skeleton() function for people who want to use MLPACK code in their own packages. It follows the structure of RcppArmadillo.package.skeleton().

library(RcppMLPACK)
RcppMLPACK.package.skeleton("foobar")
Creating directories ...
Creating DESCRIPTION ...
Creating NAMESPACE ...
Creating Read-and-delete-me ...
Saving functions and data ...
Making help files ...
Done.
Further steps are described in './foobar/Read-and-delete-me'.

Adding RcppMLPACK settings
 >> added Imports: Rcpp
 >> added LinkingTo: Rcpp, RcppArmadillo, BH, RcppMLPACK
 >> added useDynLib and importFrom directives to NAMESPACE
 >> added Makevars file with RcppMLPACK settings
 >> added Makevars.win file with RcppMLPACK settings
 >> added example src file using MLPACK classes
 >> invoked Rcpp::compileAttributes to create wrappers

system("ls -R foobar")
foobar:
DESCRIPTION  man  NAMESPACE  R  Read-and-delete-me  src

foobar/man:
foobar-package.Rd

foobar/R:
RcppExports.R

foobar/src:
kmeans.cpp  Makevars  Makevars.win  RcppExports.cpp

Even without performance testing, we would expect the C++ implementation to be faster. A small wine data set from the UCI Machine Learning Repository is used for benchmarking, with the following script based on the rbenchmark package:

suppressMessages(library(rbenchmark))
res <- benchmark(mlKmeans(t(wine),3),
                 kmeans(wine,3),
                 columns=c("test", "replications", "elapsed",
                 "relative", "user.self", "sys.self"), order="relative")

With 100 replications, the MLPACK version of k-means (0.028s) is about 33 times faster than kmeans in R (0.947s). However, we should note that R's kmeans returns more information than just the clustering result and performs much more checking.

There is one important problem with MLPACK: it uses the size_t type heavily.

Wrapping this type can cause problems, because on 64-bit Windows size_t is defined as unsigned long long int. I found no errors of this kind during testing on Ubuntu.
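One possible workaround, sketched here in plain C++ and not taken from RcppMLPACK, is to copy size_t values into doubles (R's numeric type) before handing them back, so that the wrapper never depends on how the platform defines size_t. The helper name sizesToDoubles and the constant kMaxExactDouble are hypothetical. The range check relies on the fact that a double represents every integer up to 2^53 exactly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Largest value up to which every integer is exactly representable
// as a double: 2^53.
const std::uint64_t kMaxExactDouble = 9007199254740992ULL;

// Hypothetical helper, not part of RcppMLPACK: copies size_t values into
// doubles so the result does not depend on the platform's size_t definition.
std::vector<double> sizesToDoubles(const std::vector<std::size_t>& in) {
    std::vector<double> out;
    out.reserve(in.size());
    for (std::size_t v : in) {
        if (static_cast<std::uint64_t>(v) > kMaxExactDouble) {
            throw std::range_error("value cannot be represented exactly as a double");
        }
        out.push_back(static_cast<double>(v));
    }
    assert(out.size() == in.size());  // one output per input
    return out;
}
```

Cluster assignments are small indices, so the check should never fire for them; it is there only to make the conversion safe for arbitrary size_t values.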

Again, testing and bug reports are very welcome, for the results as well as the code.

Best,

KK
--
Qiang Kou
qkou at umail.iu.edu
School of Informatics and Computing, Indiana University
