From tms at epcc.ed.ac.uk Thu May 9 17:58:32 2013
From: tms at epcc.ed.ac.uk (Terry Sloan)
Date: Thu, 09 May 2013 16:58:32 +0100
Subject: [Sprint-user] about randomForest in sprint library
Message-ID: <518BC7A8.7040801@epcc.ed.ac.uk>

A SPRINT user asked in July 2012
=================================

I am a statistics student doing a research project on a data set with random forest in R. My data set has about 1.7 million observations and 60 factors, some of which have more than 50 levels. The original Windows library randomForest in R cannot handle such a big data set or factors with more than 32 levels.

I notice that your library "sprint" is a high-performance computational library, although it can be used on Linux and Mac. Could sprint on Linux or Mac handle factors with more than 32 levels? I ask because I see that sprint is built on the original randomForest library. And can the sprint library handle big data sets? If the answers are yes, I will try to use a Linux or Mac computer to deal with it.

A member of the SPRINT team replied in July 2012
================================================

Thank you very much for your enquiry about the SPRINT implementation of randomForest.

As you have noticed, SPRINT uses the underlying R implementation of randomForest. This means that if randomForest in R cannot handle your existing dataset, then the SPRINT implementation will not work either. The SPRINT implementation uses a task-parallel approach, since it was developed for microarray data analysis, where the dataset typically fits into the memory of a single R process.

An alternative you may wish to consider is Random Jungle. For more information see:

D. F. Schwarz, I. R. König, and A. Ziegler. On safari to Random Jungle: a fast implementation of Random Forests for high-dimensional data. Bioinformatics, 26(14):1752-1758, 2010.

You can find out more about the SPRINT implementation of randomForest in the SPRINT paper, available at http://dl.acm.org/citation.cfm?doid=1996023.1996024

-- 
----------------------------------------------------------------------
Terry Sloan    Email: t.sloan at epcc.ed.ac.uk
EPCC           Phone: +44 131 650 5155
               WWW  : http://www.epcc.ed.ac.uk/
----------------------------------------------------------------------
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

From tms at epcc.ed.ac.uk Thu May 9 18:21:29 2013
From: tms at epcc.ed.ac.uk (Terry Sloan)
Date: Thu, 09 May 2013 17:21:29 +0100
Subject: [Sprint-user] sprint papply
Message-ID: <518BCD09.4020000@epcc.ed.ac.uk>

A SPRINT user asked in February 2013
===================================

The sprint manual says that the data passed to papply() can be an array, a list or an ff object. I just tried papply() on a list, but got an error:

x <- list(a = 1:10, beta = exp(-3:3), logic = c(TRUE,FALSE,FALSE,TRUE))
papply(x, mean)
Error in papply(x, mean) : could not find function "is.ff"

It runs fine with lapply(x, mean). Could you let me know whether I am missing anything?

A member of the SPRINT team replied in February 2013
====================================================

Thanks for raising the issue with papply. On investigation, it seems that papply expects you to have called 'library(ff)' before calling papply, and also that our implementation of papply is limited to working on lists of numbers or lists of matrices only.

I will update our documentation and improve the error messages in the code, and make a note of the improvements needed to fully implement papply.

In the meantime, here are a couple of working examples of papply:

library(sprint)
library(ff)
papply(list(1:10), mean)
[[1]]
[1] 5.5

listt = list(matrix(sin(1:100), ncol=20),
             matrix(sin(1:100), ncol=25),
             matrix(sin(1:10), ncol=5))
papply(listt, mean)

From tms at epcc.ed.ac.uk Thu May 9 18:28:12 2013
From: tms at epcc.ed.ac.uk (Terry Sloan)
Date: Thu, 09 May 2013 17:28:12 +0100
Subject: [Sprint-user] SPRINT availability on Mac
Message-ID: <518BCE9C.3080602@epcc.ed.ac.uk>

A SPRINT user asked in March 2013
=================================

I've recently been using parallel multicore computing in R ('snowfall', 'multicore') on my Mac Pro with 12 physical and 24 available cores, and would like to encourage you to make 'sprint' available for Mac if that would be appropriate. Although there are rumors that 'sprint' is now available for Mac, I can't find it on any CRAN site.

A member of the SPRINT team replied in March 2013
=================================================

In relation to your query, the current SPRINT release (1.0.4, available at http://cran.r-project.org/web/packages/sprint/index.html) runs on Mac OS X and can provide quite substantial performance benefits on Mac Pro models with specifications such as yours.