[adegenet-forum] more xval confusion: getting variable results

Wed Feb 26 17:48:48 CET 2014

Judge for yourself; using exactly your distribution:
###
> fac <- rep(letters[1:6], c(95,  43,  61,  72, 164, 125))
> table(fac) - table(sample(fac, size=504, replace=FALSE))
fac
 a  b  c  d  e  f 
10  6  5  8 14 13 

## in the above case, all is fine. Let's try 1000 times:
> set.seed(1)
> counter <- 0  # must be initialised before the loop
> for(i in 1:1000) {if(any(table(fac) - table(sample(fac, size=504, replace=FALSE)) < 1)) counter <- counter+1}
> counter
[1] 12

So in 1000 resamplings, 12 left at least one group with no individuals in the validation set, so the data could not be cross-validated. 43 is not a small sample size for e.g. estimating allele frequencies, but for cross-validation with 90% of the data used as the training set, it may not always be enough. Selecting a smaller training set should help.
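
The simulated rate can be cross-checked analytically. This is only a minimal sketch, assuming the validation set is a simple random sample (without replacement) of the 56 individuals left out of training. That mirrors the sample() call above, not necessarily what xvalDapc does internally. p_absent is a hypothetical helper name:

```r
# Group sizes quoted in the thread (560 individuals in total)
group_sizes <- c(a = 95, b = 43, c = 61, d = 72, e = 164, f = 125)
n_total <- sum(group_sizes)

# Probability that a given group has no individual in the validation set,
# i.e. zero "successes" when drawing n_valid individuals without replacement
p_absent <- function(training_frac) {
  n_valid <- n_total - round(training_frac * n_total)
  dhyper(0, m = group_sizes, n = n_total - group_sizes, k = n_valid)
}

p_90 <- p_absent(0.9)  # 90% training set, as in the original call
p_80 <- p_absent(0.8)  # a smaller training set
print(signif(p_90, 3))
print(signif(p_80, 3))
```

For group b (n = 43) with a 90% training set this comes out at roughly 1%, in line with the 12 failures out of 1000 above, and it drops sharply with an 80% training set.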

In any case, the fact that cross-validation leads to selecting anywhere from 20 to 80 PCs may also mean that this number does not matter that much. This would be the case if e.g. PCs 20:80 had a very small variance, so that retaining or dropping them barely changes the information passed on to the discriminant analysis.
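
One way to check this is to look at how much of the total variance PCs 20:80 carry. A minimal sketch using base R's prcomp on a placeholder matrix: x below is random data for illustration only and should be replaced by the actual (NA-free) allele table; the centring/scaling choices here are assumptions and may differ from what dapc uses:

```r
set.seed(1)
# Placeholder data: 200 individuals x 100 variables (NOT real genotypes)
x <- matrix(rnorm(200 * 100), nrow = 200)

pca <- prcomp(x)  # centred, unscaled PCA
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Share of the total variance carried by PCs 20 to 80
share_20_80 <- sum(var_explained[20:80])
print(share_20_80)

# Cumulative variance, to see where the curve flattens out
cum_var <- cumsum(var_explained)
print(round(cum_var[c(20, 80)], 3))
```

If the share carried by PCs 20:80 is small on the real data, the exact number of PCs retained in that range should matter little.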

Cheers
Thibaut

--
######################################
Dr Thibaut JOMBART
MRC Centre for Outbreak Analysis and Modelling
Department of Infectious Disease Epidemiology
Imperial College - School of Public Health
St Mary’s Campus
Norfolk Place
London W2 1PG
United Kingdom
Tel. : 0044 (0)20 7594 3658
t.jombart at imperial.ac.uk
http://sites.google.com/site/thibautjombart/
http://adegenet.r-forge.r-project.org/
________________________________________
From: Nikki Vollmer [nlv209 at hotmail.com]
Sent: 26 February 2014 13:59
To: Jombart, Thibaut; adegenet-forum at lists.r-forge.r-project.org
Subject: RE: [adegenet-forum] more xval confusion: getting variable results

Really group size?  Here are mine: 95,  43,  61,  72, 164, 125.  Is 43 really that small?

> From: t.jombart at imperial.ac.uk
> To: nlv209 at hotmail.com; adegenet-forum at lists.r-forge.r-project.org
> Subject: RE: [adegenet-forum] more xval confusion: getting variable results
> Date: Wed, 26 Feb 2014 11:52:00 +0000
>
> Hello,
>
> the results come from the fact that some groups probably have very small sample sizes in your data. Therefore, the re-sampling used for the cross validation may have i) no individuals to train the method on, and/or ii) no individuals to cross-validate with.
>
> Caitlin Collins has modified the cross-validation procedure for this kind of situation, but it is still in (one of) the devel versions of adegenet. You can either contact her directly, or just discard the smallest groups from your analysis.
>
> Cheers
> Thibaut
>
>
>
>
> ________________________________________
> From: adegenet-forum-bounces at lists.r-forge.r-project.org [adegenet-forum-bounces at lists.r-forge.r-project.org] on behalf of Nikki Vollmer [nlv209 at hotmail.com]
> Sent: 25 February 2014 19:25
> To: adegenet-forum at lists.r-forge.r-project.org
> Subject: [adegenet-forum] more xval confusion: getting variable results
>
> Hello again,
>
> I have been running xvalDapc and have been getting variable results and am not sure how to interpret this.
>
> I have a dataset of combined microsatellite (19 loci) and SNP (39 loci) data for 560 individuals. From initially running find.clusters I have 6 groups/clusters (which makes sense with my data) that I am testing with xval to eventually run a DAPC.
>
> For xvalDapc I have been using the following settings:
> n.pca.max=100, n.da=NULL, training.set=0.9, n.pca=NULL
>
> First off, if I try anything over 4 replicates I often get the following message:
>
> Warning message:
> In xvalDapc.matrix(objNoNa@tab, grp$grp, n.pca.max = 100, n.da = NULL, :
> At least one group was absent from the training / validating sets.
> Try using smaller training sets.
>
> So, I have run the command many, many times with both 3 and 4 reps (occasionally, but not as often, getting the above warning message) and keep getting very variable results. For instance, if I run xval 6 times with 4 reps, no two runs give me the same "best" number of PCs. Sometimes I get 20 PCs as best, other times I get 80. Overall, I never get the same thing twice, but all classification success rates are greater than 0.80, and most are over 0.90. Based on the xval results, I feel there is no way to unambiguously pick a best number of PCs to use in a subsequent DAPC.
>
> My first thought with this inconsistency would be to run more reps, but then I get the warning message very often, and when the runs with the higher reps do proceed, I get many groups that aren't assigned to a training set. So if I am stuck with using fewer reps, and am stuck with the inconsistent results, can that be interpreted as my dataset not being very informative...and/or, I hate to say it, but that I need more loci to increase assignment consistency with DAPC?
>
> Thanks for any help you can offer, it is much appreciated!
>
> Nikki