[Biomod-commits] Re : prevalence and pseudoabsences
Brenna Forester
forestb at students.wwu.edu
Mon Apr 25 21:30:41 CEST 2011
Thank you Wilfried; your answers are extremely helpful!
I was originally modeling my species with presences = pseudo-absences based on my reading of the literature. Then I found the Jimenez-Valverde, Lobo and Hortal paper in Community Ecology (2009) titled "The effect of prevalence and its interaction with sample size on the reliability of species distribution models". Using a virtual species, they found that biased prevalence is only significant with extremely unbalanced samples, given many caveats (such as reliable training data & relevant predictors). In practice, they recommend using as large a sample size as possible, to improve model stability & improve sample coverage over the environmental and spatial gradients of the study area. This includes using as many absences as possible, down to a prevalence of 0.01. They include discussion of appropriate use of probabilities (since they are skewed due to prevalence) & appropriate assessment of model performance (e.g. don't use kappa).
Anyway - it is a very interesting paper & made me want to try modeling my species using a prevalence of 0.1. So I ran my models in three ways:
prevalence = 0.1 (presence = 304 / PA = 2736)
prevalence = 0.5 (presence = 304 / PA = 304)
prevalence = 0.5 (presence = 304 / PA = 2736 weighted)
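For reference, the prevalence arithmetic behind these three runs can be checked in a few lines of R. This is a minimal sketch, not BIOMOD code: the weighting rule (each pseudo-absence weighted by n_pres / n_pa) is an assumption about how a weighted prevalence of 0.5 can be obtained, and the variable names are made up for illustration.

```r
# Counts taken from the three runs listed above
n_pres <- 304
n_pa   <- 2736

# Unweighted prevalence: presences over all records (304 / 3040)
prev_raw <- n_pres / (n_pres + n_pa)

# Assumed weighting scheme: presences keep weight 1, each pseudo-absence
# gets n_pres / n_pa, so the pseudo-absence weights sum to n_pres
w_pres <- 1
w_pa   <- n_pres / n_pa

# Weighted prevalence: weighted presences over the weighted total
prev_weighted <- (n_pres * w_pres) / (n_pres * w_pres + n_pa * w_pa)

print(c(raw = prev_raw, weighted = prev_weighted))
```

Under that assumed scheme, the 2736 pseudo-absences together carry the same total weight as the 304 presences, which is why the third run has the same nominal prevalence as the second.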
I compared ROC and TSS scores for cross-validation, sensitivity and specificity. Models with prevalence = 0.1 had the best specificity scores, but worst CV and sensitivity. Prevalence = 0.5 (304/304) had the best CV and sensitivity scores (with one exception), with specificity second to the prevalence = 0.1 models. Prevalence = 0.5 (304/2736 weighted) was in the middle. Most of these differences were relatively small.
Right now I'm assessing the stability of my 304/304 models to PA pulls, since 304 PAs sample only a small share of the possible absences (total grid cells in my study area = 6808).
With real (not virtual) data sets, there are obviously many interacting factors that influence final CV/sens/spec scores. I was surprised to see the relatively small differences made by changing prevalence and # of PA records. I'd be interested to hear your & others' thoughts on these issues. I wonder how your upcoming paper using virtual datasets compares with the Jimenez-Valverde et al. paper?
Thanks for an interesting discussion!
Brenna
________________________________
From: Wilfried Thuiller [wilfried.thuiller at ujf-grenoble.fr]
Sent: Saturday, April 23, 2011 6:29 AM
To: Brenna Forester
Cc: biomod-commits at lists.r-forge.r-project.org
Subject: Re: Re : [Biomod-commits] prevalence and pseudoabsences
Dear Brenna,
Thanks Bruno & Wilfried,
So to clarify: I run pseudo.abs - in my case as so:
PA1 <- pseudo.abs(coor=Sp.Env[,2:3], status=Sp.Env[,1], strategy="random",
env=Sp.Env[,4:10], nb.points=2736, species.name="Rhodiola",
add.pres=F, create.dataset=T, plot=T, pcol="red", acol="grey80")
This creates two objects, "PA1" (a vector of cell numbers chosen as absences) and "Dataset.Rhodiola.random.partial", a dataframe of coordinates and "status" (zero).
I would then create a new dataset that has just my presence records (304) and these 2736 absences. I would run that dataset (Sp.Env.PA1) in the Initial.State() and Models() functions, for example, as so:
Initial.State(Response=Sp.Env.PA1[,c(1)], Explanatory=Sp.Env.PA1[,4:10],
IndependentResponse=NULL, IndependentExplanatory=NULL,
sp.name="Rhodiola")
Models(GLM = T, TypeGLM = "simple", Test = "AIC", GBM = T, No.trees = 5000,
GAM = T, CTA = T, CV.tree = 100, ANN = T, CV.ann = 5, SRE = F, FDA = T,
MARS = T, RF = T, NbRunEval = 10, DataSplit = 70, Yweights=NULL,
NbRepPA=0, Roc=T, Optimized.Threshold.Roc=T, Kappa=T, TSS=T,
KeepPredIndependent = F, VarImport=5)
I keep NbRepPA = 0 so it uses the entire dataset to evaluate the model, maintaining my prevalence at 0.1 (304 presence records/3040 total records in the dataset).
I think I am correct on everything to this point?
Yes, you are correct.
So my question is: I want to do 5 PA pulls (as I would if I ran it in the Models() function, NbRepPA = 5), maintaining my 0.1 prevalence. But I would then have run Models() five times on 5 datasets (each with different PA pulls). How does BIOMOD create a final model when using PA pulls (e.g. NbRepPA = 5) within the Models() function, and can I replicate that when I run my PA pulls manually as above?
There is no final model when using several PA sets. There are as many "final models" as PA sets.
If you want to use several sets of PA yourself, make predictions from every model (using the Projections function for instance on the overall area). Then you'll need to combine them yourself.
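The manual route described above could be sketched in base R as follows. This is only an illustration under stated assumptions: it does not call pseudo.abs() or any other BIOMOD function, the status vector is a hypothetical stand-in for the real presence column, and every non-presence cell is assumed to be an eligible pseudo-absence.

```r
set.seed(42)  # arbitrary seed so the pulls are reproducible

n_cells <- 6808   # grid cells in the study area (from the thread)
n_pres  <- 304
n_pa    <- 2736
n_pulls <- 5

# Hypothetical layout: the first 304 cells are presences; with real data,
# use the actual status column instead
status     <- c(rep(1, n_pres), rep(0, n_cells - n_pres))
candidates <- which(status == 0)

# Each pull samples pseudo-absence cells without replacement
pa_pulls <- lapply(seq_len(n_pulls), function(i) sample(candidates, n_pa))

# One set of row indices per pull: all presences plus that pull's PAs
datasets <- lapply(pa_pulls, function(pa) c(which(status == 1), pa))

# Prevalence of each pull (304 presences out of 3040 records)
print(sapply(datasets, function(idx) mean(status[idx])))
```

Each element of datasets could then be used to subset the full data frame before a separate modelling run, giving one set of models and projections per pull.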
There are several alternatives for combining projections from different models from different PA sets and from different repetitions from cross-validation:
Either you create a simple average and standard deviation from projections in probability values. You can then derive a confidence interval if you want.
You could also perform a weighted sum using weights derived from TSS or ROC for instance. It will give more weights to the best models (from the cross-validation column in Evaluation.results.TSS).
You could also perform what we usually call a committee averaging, where you let the models vote for a presence or an absence. For this, you do not use the probability of occurrence anymore, but rather the presence-absence data directly. You then sum the presence-absence maps. If you have 5 repetitions, 5 models and 5 sets of PA, you thus have a maximum of 125. When the sum is equal to 125, it means all repetitions, PA sets and models agree that this is a presence, and when you get zero, it means the reverse, obviously. Values between 0 and 125 give you the probability of agreement among the models for a presence (after rescaling everything by 125, for instance). This ensemble approach is very close to the Bayesian philosophy with posterior probabilities. I really like this approach, much better than looking at the probabilities of occurrence themselves.
Now, I am not entirely sure why you want to keep your prevalence. Regression-like models are not really good with artificially unbalanced datasets (prevalence different from 0.5). They are supposed to work well if the prevalence is the true prevalence of the species. This is the case with a perfect stratified sampling, but it is absolutely not the case when using random sets of pseudo-absences.
Still, the results are usually similar anyway. The main difference is the "true" probability given by the models, which will be higher when the pseudo-absences are downweighted. However, when the predictions are rescaled between 0 and 1, results are usually very similar.
I think Wisz and Guisan recently showed that using weighted pseudo-absences was better. We also have a paper close to being accepted in Methods in Ecology and Evolution showing the same with virtual datasets.
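The three ways of combining projections listed above can be sketched in base R on a small simulated example. Everything here is made up for illustration: the probability matrix, the evaluation scores standing in for TSS, and the per-model thresholds are assumptions, and no BIOMOD function is used.

```r
set.seed(1)  # arbitrary seed for the simulated values

n_cells  <- 6   # tiny made-up grid
n_models <- 4   # models x PA sets x CV repetitions, flattened into columns

# Simulated probabilities of occurrence: one row per cell, one column per model
prob <- matrix(runif(n_cells * n_models), nrow = n_cells)

# Made-up evaluation scores (standing in for TSS) and binarisation thresholds
tss       <- c(0.60, 0.70, 0.50, 0.80)
threshold <- c(0.50, 0.40, 0.55, 0.45)

# 1. Simple average and standard deviation across models
ens_mean <- rowMeans(prob)
ens_sd   <- apply(prob, 1, sd)

# 2. Weighted mean, with weights proportional to the evaluation score
w         <- tss / sum(tss)
ens_wmean <- as.vector(prob %*% w)

# 3. Committee averaging: binarise each model with its own threshold,
#    then take the share of models voting "presence"
votes         <- sweep(prob, 2, threshold, FUN = ">=") * 1
ens_committee <- rowSums(votes) / n_models

print(round(cbind(ens_mean, ens_sd, ens_wmean, ens_committee), 3))
```

With 5 repetitions, 5 models and 5 PA sets, the same code would simply have 125 columns, and ens_committee would be the vote count rescaled by 125.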
Hope it helps,
Wilfried
I hope this isn't too confusing!
Thank you!
Brenna
________________________________
From: Bruno Lafourcade [brunolafourcade at aol.com]
Sent: Thursday, April 21, 2011 11:37 PM
To: wilfried.thuiller at ujf-grenoble.fr; Brenna Forester
Cc: biomod-commits at r-forge.wu-wien.ac.at
Subject: Re : [Biomod-commits] prevalence and pseudoabsences
Hi Brenna,
The pseudo-absence procedure within the Models function is automated and generates a
weighting to give a prevalence of 0.5 for each run.
To make sure that the prevalence doesn't change, you have to build your own pseudo-absence
data outside of the Models function (even prior to Initial.State). In that way, the Models function
will not recognize your data as being pseudo.abs and will not weight them, just like for any
standard input data.
Use the pseudo.abs() function for this purpose. Don't hesitate to ask for details on how to use it.
Best,
Bruno
-------
Bruno Lafourcade
Statistical tools engineer
Laboratoire d'Ecologie Alpine, bureau 308
CNRS - UMR 5553, 2233 rue de la piscine
38400 Saint Martin d'Hères
-------
-----Original message-----
From: Wilfried Thuiller <wilfried.thuiller at ujf-grenoble.fr>
To: Brenna Forester <forestb at students.wwu.edu>
Cc: biomod-commits at lists.r-forge.r-project.org <biomod-commits at r-forge.wu-wien.ac.at>
Sent: Friday, 22 April 2011 7:09
Subject: Re: [Biomod-commits] prevalence and pseudoabsences
Dear Brenna,
Yes and no...
If you do not ask for pseudo-absences (NbPA=0), there is no weighting and all your pseudo-absences will be used at once. Prevalence = 0.1
If you add NbPA = 3040 (or more), yes, there is. The prevalence = 0.5
Does it help?
Wilfried
On 22 Apr 2011, at 00:53, Brenna Forester wrote:
Hello,
I see in the "Presentation Manual for BIOMOD" (page 18) the following statement: "In all procedures, BIOMOD ensures that the prevalence of the original data is conserved in the calibration and evaluation datasets."
I have 304 presence records and am running my pseudoabsence pulls with 3040 absences (a prevalence of 0.1). The number of pixels in my study area is 6808.