[Biomod-commits] problems with RF and warning messages
Wilfried Thuiller
wilfried.thuiller at ujf-grenoble.fr
Wed Nov 18 07:04:12 CET 2009
Dear Daisy,
Very interesting.
> I am having some problems running the model with my presence and
> absence data. I am running the model for a single species and have
> used a weight matrix so the sum of my observations (944 observations)
> is equal to the absences (38181 absences). I am having 2 problems:
>
> 1) When I run the models everything goes smoothly until the model RF.
> Then I get the error message: Model=Breiman and Cutler's random
> forests for classification and regression Error: cannot allocate
> vector of size 447.8 Mb. I understand this is a common problem but I
> only receive this error with the RF model. If I turn off all the other
> models it still does not run. Additionally, the RF does run when I use
> pseudo-absences instead of weighting the absences in my data. I have
> also made sure to: rm(list = ls()), gc(), memory.limit(size = 4095).
> Any advice would be helpful.
Very interesting. RF is usually not very demanding computationally. It seems that using the weight parameter increases memory use far too much.
It sounds like RF is not very well coded for handling weights (an issue in the randomForest library).
In any case, I do wonder why you are using so many absences. Many of them probably carry the same information anyway.
It might work better if you run the pseudo-absence procedure on those absences and repeat the process several times.
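As a rough illustration of that repeat-and-resample idea, here is a minimal R sketch. It is not BIOMOD's own pseudo-absence procedure: the data are simulated, the variable names (temp, prec) and all sizes are made up, and a plain glm stands in for whichever model is being fitted.

```r
set.seed(1)

# Hypothetical data: occ = 1 for presences, 0 for absences (simulated)
n_pres <- 100
n_abs  <- 4000
dat <- data.frame(
  occ  = c(rep(1, n_pres), rep(0, n_abs)),
  temp = c(rnorm(n_pres, 12, 2),    rnorm(n_abs, 8, 4)),
  prec = c(rnorm(n_pres, 900, 150), rnorm(n_abs, 700, 300))
)

pres_idx <- which(dat$occ == 1)
abs_idx  <- which(dat$occ == 0)

n_rep  <- 5     # number of repetitions (assumed value)
n_draw <- 1000  # absences drawn per repetition (assumed value)

# Each repetition keeps all presences plus a fresh random subset of absences
fits <- lapply(seq_len(n_rep), function(i) {
  keep <- c(pres_idx, sample(abs_idx, n_draw))
  glm(occ ~ temp + prec, family = binomial, data = dat[keep, ])
})

# Average the predicted probabilities over the repetitions
pred      <- sapply(fits, predict, newdata = dat, type = "response")
mean_pred <- rowMeans(pred)
summary(mean_pred)
```

Each fit only sees n_pres + n_draw rows, so memory use stays small, and the spread between repetitions gives an idea of how much the result depends on which absences were drawn.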
Another solution is to select the absences which are the most informative. To do that, extract the explanatory variables at your absence sites, then compute a dissimilarity matrix. That gives you the environmental differences between your absences. Then run a cluster analysis, which will group your absences based on environmental similarity. You can cut the clustering at, let's say, 5000 groups.
Then sample one or two sites in each group (or more; the number could be weighted by the number of sites in each group, to still represent the environmental bias in the data). These will be your absences.
This is not implemented in BIOMOD yet, but I am wondering whether it should be.
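The absence-selection steps described above could be sketched in base R along these lines. Everything specific here is an assumption for illustration: the environmental variables (temp, prec, elev) are simulated, the distance is Euclidean on standardized variables, the clustering is average linkage, and the numbers are far smaller than 38181 absences and 5000 groups.

```r
set.seed(42)

# Hypothetical environmental values at the absence sites (simulated)
n_abs <- 2000
env <- data.frame(
  temp = rnorm(n_abs, 8, 4),
  prec = rnorm(n_abs, 700, 300),
  elev = runif(n_abs, 0, 2500)
)

# 1) Dissimilarity matrix on standardized variables
d <- dist(scale(env))

# 2) Hierarchical clustering, cut into a fixed number of groups
n_groups <- 200
hc  <- hclust(d, method = "average")
grp <- cutree(hc, k = n_groups)

# 3a) One randomly chosen site per group
# (index via sample.int so single-site groups are handled correctly)
picked <- unlist(lapply(split(seq_len(n_abs), grp), function(ix) {
  ix[sample.int(length(ix), 1)]
}))

# 3b) Variant: number of sites per group proportional to group size,
#     with at least one site per group
target    <- 400                              # rough total wanted (assumed)
sizes     <- tabulate(grp, nbins = n_groups)  # sites per group
n_per_grp <- pmax(1, round(target * sizes / n_abs))
picked_w <- unlist(lapply(seq_len(n_groups), function(g) {
  ix <- which(grp == g)
  ix[sample.int(length(ix), min(length(ix), n_per_grp[g]))]
}))

length(picked)    # one row index per group
length(picked_w)  # close to 'target', never fewer than n_groups
```

One caution: dist() on all 38181 absences would hold 38181 * 38180 / 2 (about 729 million) distances, close to 5.8 GB as doubles, so it would hit the same memory limit. Clustering a random subset first, or using a method that avoids the full matrix (kmeans, or clara from the cluster package), might be more practical.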
>
> The following is how I have set up my model:
> Models(GLM = T, TypeGLM = "poly", Test = "AIC", GBM = T, No.trees =
> 3000, GAM = T, Spline = 3, CTA = T, CV.tree = 50, ANN = T, CV.ann = 2,
> SRE = T, Perc025=T, Perc05=F, MDA = T, MARS = T, RF = T, NbRunEval =
> 3, DataSplit = 80, Yweights=weightsall$COPORNA, Roc = T,
> Optimized.Threshold.Roc = T,Kappa = T, TSS=T, KeepPredIndependent = T,
> VarImport=5, NbRepPA=0, strategy="circles", coor= NULL, distance= 0 ,
> nb.absences=NULL)
>
> 2) After running this model I also receive an abundance of warning
> messages including:
> In gbm.perf(model.sp, method = "OOB", plot.it = F) : OOB generally
> underestimates the optimal number of iterations although predictive
> performance is reasonably competitive. Using cv.folds>0 when calling
> gbm usually results in improved predictive performance.
> In glm.fit(X, y, wt, offset = object$offset, family = object$family,
> ... : algorithm did not converge
> In glm.fit(X, y, wt, offset = object$offset, family = object$family,
> ... : fitted probabilities numerically 0 or 1 occurred
>
Yes and no.
We need to optimize GBM a bit better. We use OOB instead of CV because, in my experience, it is faster and does not change the results compared to CV. We might need to check this using virtual data, but I am afraid we do not have time at the moment.
The warning comes from the gbm library. It is not a BIOMOD warning.
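For anyone who wants to compare the two estimates directly, a small sketch along these lines might do. It assumes the gbm package is installed and uses simulated data with made-up names (x1, x2, y). The settings (500 trees, shrinkage 0.05, 5 folds) are arbitrary and are not what BIOMOD uses internally.

```r
best <- NULL
if (requireNamespace("gbm", quietly = TRUE)) {
  set.seed(7)

  # Simulated presence/absence data with two predictors
  n <- 500
  d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
  d$y <- rbinom(n, 1, plogis(1.5 * d$x1 - d$x2))

  # Fit once with cross-validation switched on, so both estimates
  # of the best number of trees can be read from the same model
  fit <- gbm::gbm(y ~ x1 + x2, data = d, distribution = "bernoulli",
                  n.trees = 500, shrinkage = 0.05, cv.folds = 5,
                  n.cores = 1, verbose = FALSE)

  best <- c(
    # the OOB estimate triggers the note quoted above, silenced here
    OOB = suppressWarnings(suppressMessages(
      gbm::gbm.perf(fit, method = "OOB", plot.it = FALSE))),
    cv  = gbm::gbm.perf(fit, method = "cv", plot.it = FALSE)
  )
  print(best)
}
```

If the two numbers are close for your data, the OOB shortcut is probably harmless. If the OOB estimate is much smaller, CV may be worth the extra run time.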
For GLM, yes, this is often the case. A GLM is supposed to fit probabilities of occurrence, not 0s and 1s. When the model tends to over-fit, or when there are too many absences compared to presences (or the reverse), it tends to fit probabilities of occurrence that are very close to 0 or 1 (a kind of bimodal distribution). glm does not like that and lets you know. In practical terms it is not very important, but you need to be aware that there is probably an over-fitting problem.
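To see where this warning comes from, here is a minimal base R sketch with a deliberately extreme toy data set (not your data) in which one predictor separates absences from presences perfectly. The warnings are collected rather than printed, so they can be inspected.

```r
# Toy data: every x <= 5 is an absence, every x > 5 a presence
x <- 1:10
y <- c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)

# Collect the warnings raised while fitting
warn_log <- new.env()
warn_log$msgs <- character(0)

fit <- withCallingHandlers(
  glm(y ~ x, family = binomial),
  warning = function(w) {
    warn_log$msgs <- c(warn_log$msgs, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

print(warn_log$msgs)  # warnings raised by glm.fit
range(fitted(fit))    # fitted probabilities pushed towards 0 and 1
```

With real data the separation is rarely this clean, but a strongly unbalanced or over-parameterized model (for example with polynomial terms) can push the fitted values towards 0 and 1 in the same way.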
Hope it helps,
Wilfried
> Do I need to worry about these?
>
> Thank you,
>
> Daisy Englert
> Macquarie University
> Sydney, NSW
> _______________________________________________
> Biomod-commits mailing list
> Biomod-commits at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/biomod-commits
--------------------------
Dr. Wilfried Thuiller
Laboratoire d'Ecologie Alpine, UMR CNRS 5553
Université Joseph Fourier
BP53, 38041 Grenoble cedex 9, France
tel: +33 (0)4 76 63 54 53
fax: +33 (0)4 76 51 42 79
Email: wilfried.thuiller at ujf-grenoble.fr
Home page: http://www.will.chez-alice.fr
Website: http://www-leca.ujf-grenoble.fr/equipes/tde.htm
FP6 European MACIS project: http://www.macis-project.net
FP6 European EcoChange project: http://www.ecochange-project.eu