[Sprint-user] about randomForest in sprint library

Terry Sloan tms at epcc.ed.ac.uk
Thu May 9 17:58:32 CEST 2013


A SPRINT user asked in July 2012
=================================

I am a statistics student working on a research project that applies random forests to a data set in R.

My data set has about 1.7 million observations and 60 factors, some of which have more than 50 levels.

The original randomForest library for R on Windows cannot handle such a large data set, nor factors with more than 32 levels. I notice that your library "sprint" is a high-performance computing library, although it can only be used on Linux and Mac.

Could sprint on Linux or Mac handle factors with more than 32 levels? I ask because I see that sprint is built on the original randomForest library.
And can the sprint library handle large data sets?

If the answers are yes, I will try to use a Linux or Mac computer to deal with it.

A member of the SPRINT team replied in July 2012
================================================

Thank you very much for your enquiry about the SPRINT implementation of randomForest.

As you have noticed, SPRINT uses the underlying R implementation of randomForest. This means that if R's randomForest cannot run on your existing dataset, then the SPRINT implementation will not work either. The SPRINT implementation uses a task-parallel approach, since it was developed for microarray data analysis, where the dataset typically fits into the memory of a single R process.
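
To illustrate why a task-parallel approach does not help with a dataset that is too large for one process, here is a minimal Python sketch (not SPRINT's actual code, and not R). It assumes that the requested number of trees is divided among workers, that every worker sees the complete dataset, and that the partial results are concatenated into one forest. The names split_tree_counts, grow_trees, predict and parallel_forest are hypothetical, the "trees" are toy nearest-class-mean classifiers standing in for real decision trees, and threads stand in for separate R processes.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def split_tree_counts(n_trees, n_workers):
    """Divide n_trees as evenly as possible among n_workers."""
    base, extra = divmod(n_trees, n_workers)
    return [base + (1 if i < extra else 0) for i in range(n_workers)]


def class_means(rows):
    """Return the mean of x for each label, given rows of (x, label)."""
    sums, counts = {}, {}
    for x, label in rows:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def grow_trees(data, n_trees, seed):
    """Toy stand-in for tree growing (assumption: not a real decision tree).

    Each 'tree' is the set of per-class means of one bootstrap sample.
    Note that the worker needs the whole dataset to draw its samples.
    """
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        sample = [rng.choice(data) for _ in data]  # bootstrap resample
        trees.append(class_means(sample))
    return trees


def predict(forest, x):
    """Majority vote: each tree votes for the class whose mean is nearest to x."""
    votes = {}
    for tree in forest:
        label = min(tree, key=lambda lab: abs(x - tree[lab]))
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)


def parallel_forest(data, n_trees, n_workers):
    """Grow n_trees in total, split across n_workers, then combine the parts."""
    counts = split_tree_counts(n_trees, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Every task receives the full dataset; only the tree count differs.
        parts = pool.map(grow_trees, [data] * n_workers, counts, range(n_workers))
        return [tree for part in parts for tree in part]


# Two well-separated classes on a single numeric feature.
data = [(x / 10, "low") for x in range(20)] + [(5 + x / 10, "high") for x in range(20)]
forest = parallel_forest(data, n_trees=10, n_workers=4)
```

The work is divided by trees, not by rows, so the memory needed per worker is the same as for a serial run. This is the sense in which parallelising over trees speeds up the computation without making a larger dataset fit.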

An alternative you may wish to consider is Random Jungle. For more information see

D. F. Schwarz, I. R. König, and A. Ziegler. On safari to Random Jungle: a fast implementation of Random Forests for high-dimensional data. Bioinformatics, 26(14):1752-1758, 2010.

You can find out more about the SPRINT implementation of randomForest in the SPRINT paper available at

​http://dl.acm.org/citation.cfm?doid=1996023.1996024


-- 
----------------------------------------------------------------------
  Terry Sloan                        Email: t.sloan at epcc.ed.ac.uk
  EPCC                               Phone: +44 131 650 5155
                                     WWW  : http://www.epcc.ed.ac.uk/
----------------------------------------------------------------------

The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.


