[adegenet-forum] differentiation test
Jombart, Thibaut
t.jombart at imperial.ac.uk
Tue Apr 24 11:24:27 CEST 2012
Dear Sophie,
I hope you don't mind me forwarding this to the forum where it belongs. I answered this question off-list a while back and this time I would like the answer to be available to other users as well.
I assume that data are stored as a genlight object, although with enough RAM you could do without.
Let us use a toy dataset simulated using glSim:
##
x <- glSim(100, n.snp.nonstruc=990, n.snp.struc = 10)
> x
=== S4 class genlight ===
100 genotypes, 1000 binary SNPs
Ploidy: 1
0 (0 %) missing data
@pop: individual membership for 2 populations
> pop(x)
[1] A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A
[38] A A A A A A A A A A A A A B B B B B B B B B B B B B B B B B B B B B B B B
[75] B B B B B B B B B B B B B B B B B B B B B B B B B B
Levels: A B
##
There are only 2 populations, but the approach is applicable with any number of groups.
The simplest approach to do what you are looking for is using one ANOVA for each SNP: SNP frequency will be predicted by groups. That will provide tests (F) and effect size measures as well (R^2).
For instance, for the 1st SNP:
##
> anova(lm(as.matrix(x[,1])~pop(x)))
Analysis of Variance Table
Response: as.matrix(x[, 1])
Df Sum Sq Mean Sq F value Pr(>F)
pop(x) 1 0.25 0.25000 2.2152 0.1399
Residuals 98 11.06 0.11286
# to get just the p-value:
> anova(lm(as.matrix(x[,1])~pop(x)))$"Pr(>F)"[1]
##
To do this for all SNPs, I recommend computing all lm first, and then extracting statistics and p-values. Computing all lm is going to take time and surely you just want this done once.
##
listLm <- lapply(1:nLoc(x), function(i) lm(as.matrix(x[,i])~pop(x))) #all LMs - will take time
save(listLm, file="listLm.RData") # save it for later (to load it into R, use 'load')
pVal <- sapply(listLm, function(e) anova(e)$"Pr(>F)"[1]) # get all pvalues
R2 <- sapply(listLm, function(e) summary(e)$"adj.r.squared") # adjusted R²
##
then you can plot results using simple things such as:
##
plot(pVal, type="h")
##
As a last note, here I use lm because I'm merely interested in comparing average allele frequencies between groups. If you were for interested in making allele predictions, glm (probably Poisson) would make more sense.
Cheers
Thibaut
________________________________________
From: Sophie Bouchet [sophie.bouchet at moulon.inra.fr]
Sent: 24 April 2012 09:28
To: Jombart, Thibaut
Subject: differentiation test
Dear Mr Jombart,
I have 50K SNPs markers genotyped for 400 samples that represent 5
differentiated genetic groups.
I would like to calculate the differentiation at each locus and its
significance.
I found some packages to calculate Fst or variance components but could
not find functions to calculate significance for each locus.
I would like to detect outliers as well.
Could you advice me R package or software to use in that purpose?
Thanks in advance,
Sophie
--
Sophie Bouchet
INRA
post-doctoral position
UMR de Génétique Végétale
équipe Génétique Quantitative et Méthodologie de la Sélection
Ferme du Moulon
91190 Gif sur Yvette
France
Tel : +33 (0)1 69 33 23 38
More information about the adegenet-forum
mailing list