<div dir="ltr"><div>RcppMLPACK is almost done, and I really hope it will be useful to other people. Testing and bug reports are very welcome, for both the code and the results it produces. You can try it now from my repo: <a href="https://github.com/thirdwing/RcppMLPACK">https://github.com/thirdwing/RcppMLPACK</a> </div>
<div><br></div><div>I am afraid there will be problems on Windows with the<b> size_t</b> type.</div><div><br></div><div>MLPACK is a scalable C++ machine learning library providing an intuitive and simple API. It implements a wide array of machine learning methods and uses Armadillo matrices for input and output. For more details about MLPACK, please visit its homepage: <a href="http://www.mlpack.org/">http://www.mlpack.org/</a> </div>
<div><br></div><div>Since Rcpp and RcppArmadillo already integrate C++ and Armadillo with R seamlessly, RcppMLPACK follows quite naturally. The RcppMLPACK package includes the source code of the MLPACK library, so users do not need to install MLPACK itself in order to use RcppMLPACK. </div>
<div><br></div><div>I will use k-means as an example. With RcppMLPACK, a k-means function can be implemented as shown below. The interface between R and C++ is handled by Rcpp and RcppArmadillo.<br></div><div><br></div><div>#include "RcppMLPACK.h"</div>
<div><br></div><div>using namespace mlpack::kmeans;</div><div>using namespace Rcpp;</div><div><br></div><div>// [[Rcpp::export]]</div><div>List kmeans(const arma::mat& data, const int& clusters) {</div><div> </div>
<div> arma::Col<size_t> assignments;</div><div><br></div><div> // Initialize with the default arguments.</div><div> KMeans<> k;</div><div><br></div><div> k.Cluster(data, clusters, assignments); </div>
<div><br></div><div> return List::create(_["clusters"] = clusters,</div><div> _["result"] = assignments);</div><div>}</div><div><br></div><div>The <b>inline</b> package provides a complete wrapper around the compilation, linking, and loading steps, so everything can be done within an R session. RcppMLPACK supports inline compilation as well.</div>
<div><br></div><div>library(inline)</div><div>library(RcppMLPACK)</div><div>code <- '</div><div> arma::mat data = as<arma::mat>(test);</div><div> int clusters = as<int>(n);</div><div> arma::Col<size_t> assignments;</div>
<div> mlpack::kmeans::KMeans<> k;</div><div> k.Cluster(data, clusters, assignments); </div><div> return List::create(_["clusters"] = clusters,</div><div> _["result"] = assignments);</div>
<div>'</div><div>mlKmeans <- cxxfunction(signature(test="numeric", n ="integer"), code, plugin="RcppMLPACK")</div><div>data(trees, package="datasets")</div><div>mlKmeans(t(trees), 3)</div>
<div><br></div><div><div>There is one point we need to pay attention to: Armadillo matrices are stored in <b>column-major format</b>, so for speed MLPACK expects <b>observations as columns and dimensions as rows</b>. When passing R data to MLPACK, an additional transpose may therefore be needed, as with t(trees) above.</div>
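<div><br></div><div>As a rough illustration of this layout, the following sketch uses only the C++ standard library, not Armadillo or MLPACK. The ColMajorMatrix struct and the transpose() helper are hypothetical names made up for this example. They mimic column-major indexing, where element (row, col) sits at index row + col * n_rows, to show why storing each observation as a column keeps its values contiguous in memory.</div>

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a column-major matrix (not Armadillo itself):
// element (row, col) lives at index row + col * n_rows.
struct ColMajorMatrix {
    std::size_t n_rows;
    std::size_t n_cols;
    std::vector<double> values;

    ColMajorMatrix(std::size_t rows, std::size_t cols)
        : n_rows(rows), n_cols(cols), values(rows * cols, 0.0) {}

    double& at(std::size_t row, std::size_t col) {
        return values[row + col * n_rows];
    }
    double at(std::size_t row, std::size_t col) const {
        return values[row + col * n_rows];
    }
};

// Turn a matrix with one observation per row (the usual R layout)
// into one with one observation per column.
ColMajorMatrix transpose(const ColMajorMatrix& m) {
    ColMajorMatrix t(m.n_cols, m.n_rows);
    for (std::size_t col = 0; col < m.n_cols; ++col)
        for (std::size_t row = 0; row < m.n_rows; ++row)
            t.at(col, row) = m.at(row, col);
    return t;
}
```

<div>With two observations (1, 2, 3) and (4, 5, 6) stored one per row, the underlying buffer is 1, 4, 2, 5, 3, 6. After transpose() it becomes 1, 2, 3, 4, 5, 6, so the values of each observation sit next to each other.</div>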
</div><div><br></div><div>The package also contains a RcppMLPACK.package.skeleton() function for people who want to use MLPACK code in their own package. It follows the structure of RcppArmadillo.package.skeleton().</div>
<div><br></div><div>library(RcppMLPACK)</div><div>RcppMLPACK.package.skeleton("foobar")</div><div>Creating directories ...</div><div>Creating DESCRIPTION ...</div><div>Creating NAMESPACE ...</div><div>Creating Read-and-delete-me ...</div>
<div>Saving functions and data ...</div><div>Making help files ...</div><div>Done.</div><div>Further steps are described in './foobar/Read-and-delete-me'.</div><div><br></div><div>Adding RcppMLPACK settings</div><div>
>> added Imports: Rcpp</div><div> >> added LinkingTo: Rcpp, RcppArmadillo, BH, RcppMLPACK</div><div> >> added useDynLib and importFrom directives to NAMESPACE</div><div> >> added Makevars file with RcppMLPACK settings</div>
<div> >> added Makevars.win file with RcppMLPACK settings</div><div> >> added example src file using MLPACK classes</div><div> >> invoked Rcpp::compileAttributes to create wrappers</div><div><br></div><div>
system("ls -R foobar")<br></div><div>foobar:</div><div>DESCRIPTION man NAMESPACE R Read-and-delete-me src</div><div><br></div><div>foobar/man:</div><div>foobar-package.Rd</div><div><br></div><div>foobar/R:</div>
<div>RcppExports.R</div><div><br></div><div>foobar/src:</div><div>kmeans.cpp Makevars Makevars.win RcppExports.cpp</div><div><br></div><div>Even without performance testing, we would expect the C++ implementation to be faster. The small wine data set from the UCI Machine Learning Repository is used for benchmarking, with the following script based on the rbenchmark package:</div>
<div><br></div><div>suppressMessages(library(rbenchmark))</div><div>res <- benchmark(mlKmeans(t(wine),3),</div><div> kmeans(wine,3),</div><div> columns=c("test", "replications", "elapsed",</div>
<div> "relative", "user.self", "sys.self"), order="relative")</div><div><br></div><div>For 100 replications, the MLPACK version of k-means (0.028s) is about 33 times faster than kmeans in R (0.947s). However, we should note that R's kmeans returns more information than just the clustering result and does much more checking.</div>
<div><br></div><div>There is an important problem in MLPACK: it uses the <b>size_t</b> type heavily. <br></div><div><br></div><div>Wrapping this type may cause problems, since on 64-bit Windows <b>size_t</b> is defined as <b>unsigned long long int</b>. I have not seen this kind of error during testing on Ubuntu.</div>
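<div><br></div><div>One possible workaround, sketched here under assumptions, is to convert the size_t values to int before handing them back to R, since R's integer vectors are 32-bit signed. The helper to_r_integers() below is hypothetical and not part of RcppMLPACK or MLPACK. It uses std::vector in place of arma::Col<size_t> so that it compiles with the standard library alone.</div>

```cpp
#include <cstddef>
#include <limits>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not part of RcppMLPACK or MLPACK):
// convert size_t cluster assignments into plain int values,
// which map directly onto R's 32-bit integer vectors.
std::vector<int> to_r_integers(const std::vector<std::size_t>& assignments) {
    const std::size_t max_int =
        static_cast<std::size_t>(std::numeric_limits<int>::max());
    std::vector<int> out;
    out.reserve(assignments.size());
    for (std::size_t value : assignments) {
        // Refuse values that cannot be represented as an int.
        if (value > max_int)
            throw std::overflow_error("assignment does not fit in an R integer");
        out.push_back(static_cast<int>(value));
    }
    return out;
}
```

<div>Cluster labels are small numbers in practice, so the range check should never fire for k-means output; it is there only to make the narrowing conversion explicit on every platform.</div>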
<div><br></div><div>Testing and bug reports are very welcome, for both the code and the results.</div><div><br></div><div>Best,</div><div><br></div><div>KK</div>-- <br><div dir="ltr">Qiang Kou<div><a href="mailto:qkou@umail.iu.edu" target="_blank">qkou@umail.iu.edu</a><br>
<div>School of Informatics and Computing, Indiana University</div><div><br></div></div></div>
</div>