[Rcpp-devel] New package: RcppMLPACK, integration with MLPACK using Rcpp
Dirk Eddelbuettel
edd at debian.org
Fri Jul 25 03:30:26 CEST 2014
KK,
Nice work!!
Looking forward to playing with this some more, and CCing Conrad and Ryan.
Some more comments below.
On 24 July 2014 at 20:25, Qiang Kou wrote:
| RcppMLPACK is almost done, and I really hope it is useful for other people.
| Testing and bug reports are very welcome, for the results as well as the code.
| Now you can try it from my repo: https://github.com/thirdwing/RcppMLPACK
|
| I am afraid there are known problems on Windows with the size_t type.
|
| MLPACK is a scalable C++ machine learning library providing an intuitive and
| simple API. It implements a wide array of machine learning methods and uses
| Armadillo as input/output. For more detail about MLPACK, please visit its
| homepage: http://www.mlpack.org/
|
| Since we have Rcpp and RcppArmadillo, which can integrate C++ and Armadillo
| with R seamlessly, RcppMLPACK becomes something very natural. The RcppMLPACK
| package includes the source code from the MLPACK library. Thus users do not
| need to install MLPACK itself in order to use RcppMLPACK.
|
| I use k-means as an example. With RcppMLPACK, a k-means method can be
| implemented as below. The interface between R and C++ is handled by Rcpp and
| RcppArmadillo.
|
| #include "RcppMLPACK.h"
|
| using namespace mlpack::kmeans;
| using namespace Rcpp;
|
| // [[Rcpp::export]]
| List kmeans(const arma::mat& data, const int& clusters) {
|
|     arma::Col<size_t> assignments;
|
|     // Initialize with the default arguments.
|     KMeans<> k;
|
|     k.Cluster(data, clusters, assignments);
|
|     return List::create(_["clusters"] = clusters,
|                         _["result"] = assignments);
| }
|
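For readers who have not met k-means before, the following is a minimal sketch of the plain Lloyd iteration that a call such as k.Cluster() conceptually performs. It is not MLPACK's implementation: the function name lloyd_kmeans, the flat std::vector layout, and the choice to seed the centroids with the first k observations are all assumptions made for illustration, and it uses only the C++ standard library.

```cpp
#include <cstddef>
#include <limits>
#include <stdexcept>
#include <vector>

// Hypothetical helper, NOT MLPACK's KMeans<>: a plain Lloyd iteration.
// Data is laid out with one observation per column, i.e. observation j
// occupies data[j * n_dims] .. data[j * n_dims + n_dims - 1].
std::vector<std::size_t> lloyd_kmeans(const std::vector<double>& data,
                                      std::size_t n_dims,
                                      std::size_t k,
                                      std::size_t max_iter = 100) {
    if (n_dims == 0 || k == 0 || data.size() % n_dims != 0 ||
        data.size() / n_dims < k) {
        throw std::invalid_argument("inconsistent data size, n_dims, or k");
    }
    const std::size_t n_obs = data.size() / n_dims;

    // Assumption: seed the centroids with the first k observations.
    std::vector<double> centroids(
        data.begin(), data.begin() + static_cast<std::ptrdiff_t>(k * n_dims));
    std::vector<std::size_t> assignments(n_obs, 0);

    for (std::size_t iter = 0; iter < max_iter; ++iter) {
        // Assignment step: attach each observation to its nearest centroid.
        bool changed = false;
        for (std::size_t j = 0; j < n_obs; ++j) {
            std::size_t best = 0;
            double best_dist = std::numeric_limits<double>::infinity();
            for (std::size_t c = 0; c < k; ++c) {
                double dist = 0.0;
                for (std::size_t d = 0; d < n_dims; ++d) {
                    const double diff =
                        data[j * n_dims + d] - centroids[c * n_dims + d];
                    dist += diff * diff;
                }
                if (dist < best_dist) {
                    best_dist = dist;
                    best = c;
                }
            }
            if (assignments[j] != best) {
                assignments[j] = best;
                changed = true;
            }
        }
        // Stop once a full pass leaves every assignment unchanged.
        if (!changed && iter > 0) break;

        // Update step: move each centroid to the mean of its members.
        std::vector<double> sums(k * n_dims, 0.0);
        std::vector<std::size_t> counts(k, 0);
        for (std::size_t j = 0; j < n_obs; ++j) {
            const std::size_t c = assignments[j];
            ++counts[c];
            for (std::size_t d = 0; d < n_dims; ++d) {
                sums[c * n_dims + d] += data[j * n_dims + d];
            }
        }
        for (std::size_t c = 0; c < k; ++c) {
            if (counts[c] == 0) continue;  // empty cluster keeps its centroid
            for (std::size_t d = 0; d < n_dims; ++d) {
                centroids[c * n_dims + d] =
                    sums[c * n_dims + d] / static_cast<double>(counts[c]);
            }
        }
    }
    return assignments;
}
```

The returned labels are zero-based std::size_t values, which is the same kind of result that arma::Col<size_t> assignments carries in the code quoted above.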
| The inline package provides a complete wrapper around the compilation, linking,
| and loading steps, so everything can be done within an R session. RcppMLPACK
| supports inline compilation as well.
It also works via sourceCpp() as Rcpp Attributes uses the same plugin:
R> sourceCpp("/tmp/rcppmlpackEx.cpp") # saved your code in /tmp/rcppmlpackEx.cpp
R> data(trees, package="datasets")
R> kmeans(t(trees), 3)
KMeans::Cluster(): converged after 9 iterations.
$clusters
[1] 3
$result
[,1]
[1,] 2
[2,] 2
[3,] 2
[.... rest of output omitted for brevity ...]
All it takes is to add one line
// [[Rcpp::depends(RcppMLPACK)]]
in the source code you show above.
| library(inline)
| library(RcppMLPACK)
| code <- '
| arma::mat data = as<arma::mat>(test);
| int clusters = as<int>(n);
| arma::Col<size_t> assignments;
| mlpack::kmeans::KMeans<> k;
| k.Cluster(data, clusters, assignments);
| return List::create(_["clusters"] = clusters,
| _["result"] = assignments);
| '
| mlKmeans <- cxxfunction(signature(test="numeric", n ="integer"), code, plugin=
| "RcppMLPACK")
| data(trees, package="datasets")
| mlKmeans(t(trees), 3)
|
| There is one point we need to pay attention to: Armadillo matrices are stored
| in column-major format, so for speed MLPACK expects observations as columns
| and dimensions as rows. When calling MLPACK from R, an additional transpose
| may therefore be needed.
|
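To make the layout point concrete, here is a minimal sketch in plain C++ with no Armadillo or MLPACK dependency. It assumes the usual R convention of an n_obs x n_dims matrix held in column-major order and rearranges it into the n_dims x n_obs column-major layout described above, which is what the t() call in the examples does on the R side. The function name obs_rows_to_obs_cols is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: transpose a column-major n_obs x n_dims buffer
// (observations as rows, the common R convention) into a column-major
// n_dims x n_obs buffer (observations as columns).
std::vector<double> obs_rows_to_obs_cols(const std::vector<double>& r_matrix,
                                         std::size_t n_obs,
                                         std::size_t n_dims) {
    assert(r_matrix.size() == n_obs * n_dims);
    std::vector<double> out(r_matrix.size());
    for (std::size_t o = 0; o < n_obs; ++o) {
        for (std::size_t d = 0; d < n_dims; ++d) {
            // Input element (o, d) sits at o + d * n_obs;
            // output element (d, o) sits at d + o * n_dims.
            out[d + o * n_dims] = r_matrix[o + d * n_obs];
        }
    }
    return out;
}
```

After the transpose, the n_dims values of each observation are contiguous in memory, so a distance computation over one observation walks a single contiguous run.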
| The package also contains a RcppMLPACK.package.skeleton() function for people
| who want to use MLPACK code in their own package. It follows the structure of
| RcppArmadillo.package.skeleton().
|
| library(RcppMLPACK)
| RcppMLPACK.package.skeleton("foobar")
| Creating directories ...
| Creating DESCRIPTION ...
| Creating NAMESPACE ...
| Creating Read-and-delete-me ...
| Saving functions and data ...
| Making help files ...
| Done.
| Further steps are described in './foobar/Read-and-delete-me'.
|
| Adding RcppMLPACK settings
| >> added Imports: Rcpp
| >> added LinkingTo: Rcpp, RcppArmadillo, BH, RcppMLPACK
| >> added useDynLib and importFrom directives to NAMESPACE
| >> added Makevars file with RcppMLPACK settings
| >> added Makevars.win file with RcppMLPACK settings
| >> added example src file using MLPACK classes
| >> invoked Rcpp::compileAttributes to create wrappers
|
| system("ls -R foobar")
| foobar:
| DESCRIPTION man NAMESPACE R Read-and-delete-me src
|
| foobar/man:
| foobar-package.Rd
|
| foobar/R:
| RcppExports.R
|
| foobar/src:
| kmeans.cpp Makevars Makevars.win RcppExports.cpp
Nice one too!
| Even without performance testing, we can expect the C++ implementation to be
| faster. A small wine data set from the UCI data set repository is used for
| benchmarking, with a script using the rbenchmark package:
|
| suppressMessages(library(rbenchmark))
| res <- benchmark(mlKmeans(t(wine),3),
| kmeans(wine,3),
| columns=c("test", "replications", "elapsed",
| "relative", "user.self", "sys.self"), order="relative")
|
| For 100 replications, the MLPACK version of k-means (0.028s) is 33 times faster
| than kmeans in R (0.947s). However, we should note that R returns more
| information than the clustering result alone and performs many more checks.
|
| There is an important problem in MLPACK: it uses the size_t type heavily.
|
| Wrapping this type will cause problems, since on 64-bit Windows size_t is
| defined as unsigned long long int. No such errors were found during testing
| on my Ubuntu system.
That is a known issue with R insisting on C++ 1998 without the interim
changes. The simplest way around it (in the context of R and CRAN) is to
enable C++11 -- I do so in RcppCNPy and RcppBDT as I need 'long long' in
both.
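Independently of the compiler-standard question, one way to sidestep wrapping size_t is to convert the labels explicitly before they are handed back to R. The following is only a sketch of that idea, assuming the labels are available as a std::vector<std::size_t>. The helper name labels_to_int is hypothetical, and this is not what RcppMLPACK itself does. The range-based for loop needs C++11.

```cpp
#include <cstddef>
#include <limits>
#include <stdexcept>
#include <vector>

// Hypothetical helper: narrow size_t labels to int (the width of an R
// integer) with an explicit range check instead of a silent truncation.
std::vector<int> labels_to_int(const std::vector<std::size_t>& labels) {
    const std::size_t max_int =
        static_cast<std::size_t>(std::numeric_limits<int>::max());
    std::vector<int> out;
    out.reserve(labels.size());
    for (std::size_t v : labels) {
        if (v > max_int) {
            throw std::range_error("label does not fit in an R integer");
        }
        out.push_back(static_cast<int>(v));
    }
    return out;
}
```

Cluster labels are small in practice, so the check should never fire for k-means output. It is there so that a different size_t-valued result cannot overflow unnoticed.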
| Testing and bug reports are very welcome, for the results as well as the code.
Very exciting. I am sure you'll get a ton of good feedback.
Once again, nice work and congratulations.
Dirk
--
http://dirk.eddelbuettel.com | @eddelbuettel | edd at debian.org