From Mark.Coulson.ic at uhi.ac.uk  Thu Feb  1 18:01:46 2018
From: Mark.Coulson.ic at uhi.ac.uk (Mark Coulson)
Date: Thu, 1 Feb 2018 17:01:46 +0000
Subject: [adegenet-forum] Kmeans and DAPC on poolSeq data
In-Reply-To: <22A6ABF6-1D2B-4DB6-9D52-5899300649A8@wsl.ch>
References: <22A6ABF6-1D2B-4DB6-9D52-5899300649A8@wsl.ch>
Message-ID: <AM0PR0602MB358623E02691E2E3C2FD287EEAFA0@AM0PR0602MB3586.eurprd06.prod.outlook.com>

Hi Ben,

I have used allelotype data, with a matrix of the frequency of the A allele in each group as input, to run DAPC and it worked well. My groups were already defined, but couldn't the same type of input be passed to find.clusters?

Mark


-----Original Message-----
From: adegenet-forum-bounces at lists.r-forge.r-project.org [mailto:adegenet-forum-bounces at lists.r-forge.r-project.org] On Behalf Of Benjamin Dauphin
Sent: 31 January 2018 09:18
To: adegenet-forum at lists.r-forge.r-project.org
Subject: [adegenet-forum] Kmeans and DAPC on poolSeq data

Dear all,

I am new to pool sequencing data and wonder whether I can use k-means (find.clusters) and DAPC to investigate population structure from pool-seq data (allele frequencies). How does find.clusters deal with allele frequencies?

Dataset: 7 pools and 100,000 SNPs

Any comment or help would be much appreciated.
Best regards
Ben


_______________________________________________
adegenet-forum mailing list
adegenet-forum at lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/adegenet-forum
Inverness College UHI, a partner in the University of the Highlands and Islands www.inverness.uhi.ac.uk Board of Management of Inverness College (known as Inverness College UHI), Scottish Charity No SC021197.

From Mark.Coulson.ic at uhi.ac.uk  Thu Feb  1 21:36:12 2018
From: Mark.Coulson.ic at uhi.ac.uk (Mark Coulson)
Date: Thu, 1 Feb 2018 20:36:12 +0000
Subject: [adegenet-forum] How to interpret Density Plot for K=2
In-Reply-To: <BN6PR06MB3266D6C050395557B388697C80E40@BN6PR06MB3266.namprd06.prod.outlook.com>
References: <BN6PR06MB3266D6C050395557B388697C80E40@BN6PR06MB3266.namprd06.prod.outlook.com>
Message-ID: <AM0PR0602MB3586993CB33D0B7300B33A6BEAFA0@AM0PR0602MB3586.eurprd06.prod.outlook.com>

Hi Nikki,

Your interpretation of the plot seems correct; however, did you run the xvalDapc cross-validation? You may have kept too many PCs and so be overfitting the data. xvalDapc will find the optimal number of PCs to retain for your two groups; then use that number of PCs to run a new DAPC. It will likely result in more overlap between the two groups, which would be more consistent with the low differentiation you are seeing based on FST.
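
A rough sketch of that workflow, for illustration only: the genotype matrix, the group labels and all parameter values below are simulated or made up (they are not Nikki's data), and only xvalDapc() and dapc() are adegenet functions. The name of the list element holding the selected number of PCs is quoted from memory, so check names(xval) on your installation.

```r
library(adegenet)

set.seed(42)
# Hypothetical data: 100 individuals x 200 biallelic loci, coded 0/1/2,
# in two weakly differentiated groups (only 10 loci differ slightly).
n   <- 100
p   <- 200
grp <- factor(rep(c("A", "B"), each = n / 2))
x   <- matrix(rbinom(n * p, size = 2, prob = 0.3), nrow = n)
x[grp == "B", 1:10] <- rbinom((n / 2) * 10, size = 2, prob = 0.4)

# Cross-validate over a range of retained PCs (90% training, 10% test).
xval <- xvalDapc(x, grp, n.pca.max = 60, training.set = 0.9,
                 result = "groupMean", n.rep = 10, xval.plot = FALSE)

# Number of PCs with the lowest root mean squared error.
best <- as.integer(xval[["Number of PCs Achieving Lowest MSE"]])

# Refit the DAPC with that number of PCs and one discriminant function.
dapc_cv <- dapc(x, grp, n.pca = best, n.da = 1)
best
```

Plotting the refitted object with scatter(dapc_cv) then gives the density plot for the cross-validated number of PCs, which can be compared with the original one.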

Hope this helps.

Mark

From: adegenet-forum-bounces at lists.r-forge.r-project.org [mailto:adegenet-forum-bounces at lists.r-forge.r-project.org] On Behalf Of Nikki Vollmer
Sent: 30 January 2018 18:08
To: adegenet-forum at lists.r-forge.r-project.org
Subject: [adegenet-forum] How to interpret Density Plot for K=2


Hi,


I am trying to analyze ~200 RADseq loci for ~200 individuals.  STRUCTURE results suggest the best number of populations given the data is 2.  Pairwise Fst values are quite low for my taxa (<0.003) with pvalue 0.01802.  I was trying to do a DAPC on this same data to compare results. DAPC similarly suggested the best # of clusters is 2 and I was able to plot a 1-dimensional density plot for the one DF I kept (attached).  However, I am not sure how to interpret the plot.  Is it correct to say that because the two peaks do not overlap that suggests the 2 clusters are quite differentiated from one another (similar to two clusters on a scatter plot being in opposite quadrants)?  (...or is that logic flawed?)


I am trying to figure out if these 2 groups are very genetically differentiated or not, and I am not clear what the density plot is supporting/suggesting.


I very much appreciate any guidance on this matter!


Thank you,

Nikki



From thibautjombart at gmail.com  Fri Feb  2 17:53:34 2018
From: thibautjombart at gmail.com (Thibaut Jombart)
Date: Fri, 2 Feb 2018 16:53:34 +0000
Subject: [adegenet-forum] Kmeans and DAPC on poolSeq data
In-Reply-To: <AM0PR0602MB358623E02691E2E3C2FD287EEAFA0@AM0PR0602MB3586.eurprd06.prod.outlook.com>
References: <22A6ABF6-1D2B-4DB6-9D52-5899300649A8@wsl.ch>
 <AM0PR0602MB358623E02691E2E3C2FD287EEAFA0@AM0PR0602MB3586.eurprd06.prod.outlook.com>
Message-ID: <CANPRA+pAJ3t67b1b-XYkHjNR4Bb8b9QZ5rHDsdsEA8Nwjg0hvQ@mail.gmail.com>

Hi there

find.clusters is implemented for matrices as well, and should deal nicely
with any kind of quantitative data. So it should apply readily to your
data. Same for DAPC.
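
A minimal sketch of what this looks like with a plain matrix, assuming rows are the statistical individuals (samples or pools) and columns are SNP allele frequencies. The data are simulated and the values of n.pca, n.clust and n.da are arbitrary choices for the example, not recommendations.

```r
library(adegenet)

set.seed(1)
# Hypothetical data: 40 samples (rows) x 200 SNPs (columns) of allele
# frequencies, drawn from two groups that differ at every locus.
n     <- 40
p     <- 200
base  <- runif(p, 0.3, 0.7)
shift <- rep(c(-0.15, 0.15), each = n / 2)
freq  <- t(sapply(shift, function(s) base + s + rnorm(p, sd = 0.03)))
freq  <- pmin(pmax(freq, 0), 1)  # keep frequencies within [0, 1]
rownames(freq) <- paste0("sample", seq_len(n))

# K-means on the retained PCs; fixing n.pca and n.clust avoids the prompts.
grp <- find.clusters(freq, n.pca = 10, n.clust = 2)

# DAPC on the same matrix, using the inferred groups.
res <- dapc(freq, grp$grp, n.pca = 5, n.da = 1)
table(grp$grp)
```

Leaving n.pca and n.clust unset makes find.clusters() interactive: it shows the PCA variance and BIC curves and asks for both values.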

Best
Thibaut


--
Dr Thibaut Jombart
Lecturer, Department of Infectious Disease Epidemiology, Imperial College
London
Head of RECON: repidemicsconsortium.org
WHO Consultant - outbreak analysis
https://thibautjombart.netlify.com
Twitter: @TeebzR
+44(0)20 7594 3658


From benjamin.dauphin at wsl.ch  Fri Feb  2 10:07:00 2018
From: benjamin.dauphin at wsl.ch (Benjamin Dauphin)
Date: Fri, 2 Feb 2018 10:07:00 +0100
Subject: [adegenet-forum] Kmeans and DAPC on poolSeq data
In-Reply-To: <AM0PR0602MB358623E02691E2E3C2FD287EEAFA0@AM0PR0602MB3586.eurprd06.prod.outlook.com>
References: <22A6ABF6-1D2B-4DB6-9D52-5899300649A8@wsl.ch>
 <AM0PR0602MB358623E02691E2E3C2FD287EEAFA0@AM0PR0602MB3586.eurprd06.prod.outlook.com>
Message-ID: <77640E6F-D646-48E2-86BB-6FE894DBDD54@wsl.ch>

Hi Mark, 

Thanks for your response. I've run find.clusters() with the matrix of allele frequencies as input, and then ran the DAPC on the same matrix (not a genind or genlight object), assigning the groups generated by k-means (grp$grp). It works, but I get a strange "inverted parabolic curve" for the k-means analysis.
Is this a common picture for pool-seq data?

Thanks,
Ben

-------------- next part --------------
A non-text attachment was scrubbed...
Name: kmean_HJ_cohorts.pdf
Type: application/pdf
Size: 5078 bytes
Desc: not available
URL: <http://lists.r-forge.r-project.org/pipermail/adegenet-forum/attachments/20180202/247294b3/attachment.pdf>


From thibautjombart at gmail.com  Fri Feb  2 18:17:47 2018
From: thibautjombart at gmail.com (Thibaut Jombart)
Date: Fri, 2 Feb 2018 17:17:47 +0000
Subject: [adegenet-forum] snapclust
In-Reply-To: <BDCFDBC3-73B7-4AB3-8E74-1576D52D4F99@gmail.com>
References: <BDCFDBC3-73B7-4AB3-8E74-1576D52D4F99@gmail.com>
Message-ID: <CANPRA+r0i+h5ipnviNOi_qEXQ5o7v6eh55QEobr1J-7RA3Pp-w@mail.gmail.com>

Hi there,

I would analyse the empirical data separately. If you have clearly
identified parental populations (i.e. prior knowledge, not identified by
the method), sure you can benchmark the method using simulated hybrids.
Otherwise, simulations will be of less interest.

How would you go about bootstrapping the final probabilities?

Best
Thibaut



On 31 January 2018 at 00:18, Danielle Louise <danielledanielle89 at gmail.com>
wrote:

> Hello. I am looking at implementing your snapclust function, and I am
> reading through your recent paper.
>
>  I have a few questions regarding incorporating empirical data. I have
> simulated data sets with parental and F1 F2 and BC and I am wondering how
> to incorporate the empirical data - do I add it in to the simulated data
> and measure the accuracy of the assignment to classes to then determine the
> reliability of detection of hybrids in the empirical data? The tutorial
> gives a good outline of using the simulated data, but I think I am missing
> something when it comes to checking the empirical data, so I am asking for
> some really practical advice about how to incorporate the empirical data ?
> Also should we bootstrap the final probabilities to clarify the results?
>
> Thanks
> Dan

From thibautjombart at gmail.com  Fri Feb  2 18:22:30 2018
From: thibautjombart at gmail.com (Thibaut Jombart)
Date: Fri, 2 Feb 2018 17:22:30 +0000
Subject: [adegenet-forum] How to interpret Density Plot for K=2
In-Reply-To: <AM0PR0602MB3586993CB33D0B7300B33A6BEAFA0@AM0PR0602MB3586.eurprd06.prod.outlook.com>
References: <BN6PR06MB3266D6C050395557B388697C80E40@BN6PR06MB3266.namprd06.prod.outlook.com>
 <AM0PR0602MB3586993CB33D0B7300B33A6BEAFA0@AM0PR0602MB3586.eurprd06.prod.outlook.com>
Message-ID: <CANPRA+rHqNQm06MqGTxXXKqDFkKuGAE5ttUdBOMt8WvRX9jFzw@mail.gmail.com>

Hi there,

I would definitely second Mark's comment and use cross-validation here.

Also for the clustering, I would give snapclust a try - I have just pushed
a new version on github which is now properly documented. Especially check
what the 'optimal k' is according to the various goodness of fit stats
(snapclust.choose.k) - AIC, AICc, BIC, KIC.

Best
Thibaut



From thibautjombart at gmail.com  Fri Feb  2 18:25:45 2018
From: thibautjombart at gmail.com (Thibaut Jombart)
Date: Fri, 2 Feb 2018 17:25:45 +0000
Subject: [adegenet-forum] Kmeans and DAPC on poolSeq data
In-Reply-To: <77640E6F-D646-48E2-86BB-6FE894DBDD54@wsl.ch>
References: <22A6ABF6-1D2B-4DB6-9D52-5899300649A8@wsl.ch>
 <AM0PR0602MB358623E02691E2E3C2FD287EEAFA0@AM0PR0602MB3586.eurprd06.prod.outlook.com>
 <77640E6F-D646-48E2-86BB-6FE894DBDD54@wsl.ch>
Message-ID: <CANPRA+qzbO32Mv6vYujrGSM+DNtG6pk1_Kn8_HaNyjMaeTOwdg@mail.gmail.com>

Hi again,

Such a plot typically indicates no clustering. Just to confirm: are we
talking about 7 rows and 100,000 columns?

If so, your pools are technically your statistical individuals, and the
method explores clustering solutions for 1-6 clusters for 7 individuals,
which won't go far - not enough individuals to detect clustering really.
Apologies if I misunderstood.
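
To make the sample-size point concrete, here is a small base-R sketch, not adegenet's own code. It runs k-means for k = 1 to 6 on a simulated, unstructured 7 x 1000 matrix of allele frequencies and computes a BIC of the form n*log(WSS/n) + k*log(n). That formula is an assumption about what find.clusters reports, and the dimensions are hypothetical.

```r
set.seed(7)
# Hypothetical data: 7 pools (rows) x 1000 SNPs (columns), no structure.
n_pools <- 7
n_snps  <- 1000
freq    <- matrix(runif(n_pools * n_snps), nrow = n_pools)

# With 7 rows, k can only range from 1 to 6.
ks  <- 1:6
wss <- sapply(ks, function(k) {
  kmeans(freq, centers = k, nstart = 10)$tot.withinss
})

# BIC-like criterion (assumed form, see lead-in).
bic <- n_pools * log(wss / n_pools) + ks * log(n_pools)
data.frame(k = ks, WSS = round(wss, 1), BIC = round(bic, 1))
```

The curve has only six points, and at k = 6 all but two pools sit alone in their own cluster, so its shape says little about real population structure.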

Best
Thibaut



From benjamin.dauphin at unine.ch  Fri Feb  2 22:01:40 2018
From: benjamin.dauphin at unine.ch (DAUPHIN Benjamin)
Date: Fri, 2 Feb 2018 21:01:40 +0000
Subject: [adegenet-forum] Kmeans and DAPC on poolSeq data
In-Reply-To: <CANPRA+qzbO32Mv6vYujrGSM+DNtG6pk1_Kn8_HaNyjMaeTOwdg@mail.gmail.com>
References: <22A6ABF6-1D2B-4DB6-9D52-5899300649A8@wsl.ch>
 <AM0PR0602MB358623E02691E2E3C2FD287EEAFA0@AM0PR0602MB3586.eurprd06.prod.outlook.com>
 <77640E6F-D646-48E2-86BB-6FE894DBDD54@wsl.ch>,
 <CANPRA+qzbO32Mv6vYujrGSM+DNtG6pk1_Kn8_HaNyjMaeTOwdg@mail.gmail.com>
Message-ID: <40a77a4d0903435b96871c3582004ee1@vRana01.UNINE.CH>

Thanks Thibaut.
Yes, I have 7 pools (= 7 rows, i.e. 7 individuals in the analysis), and I expect two clusters representing two already characterized lineages. I found 4 likely clusters with HCPC, but I want to double-check this with k-means if possible.
Best
Ben


From thibautjombart at gmail.com  Mon Feb  5 12:32:59 2018
From: thibautjombart at gmail.com (Thibaut Jombart)
Date: Mon, 5 Feb 2018 11:32:59 +0000
Subject: [adegenet-forum] Kmeans and DAPC on poolSeq data
In-Reply-To: <40a77a4d0903435b96871c3582004ee1@vRana01.UNINE.CH>
References: <22A6ABF6-1D2B-4DB6-9D52-5899300649A8@wsl.ch>
 <AM0PR0602MB358623E02691E2E3C2FD287EEAFA0@AM0PR0602MB3586.eurprd06.prod.outlook.com>
 <77640E6F-D646-48E2-86BB-6FE894DBDD54@wsl.ch>
 <CANPRA+qzbO32Mv6vYujrGSM+DNtG6pk1_Kn8_HaNyjMaeTOwdg@mail.gmail.com>
 <40a77a4d0903435b96871c3582004ee1@vRana01.UNINE.CH>
Message-ID: <CANPRA+rR73YxU0nBU2tGOKqfSRbYZr4bDVWvhsbz_kuzGLnLBA@mail.gmail.com>

Hi Ben

while I'm not aware of hard rules for numbers of individuals needed to
detect a specific number of clusters, and I appreciate it will depend on
how clear-cut differences are, I don't think it is realistic to look for 4
clusters amongst 7 observations. Even 2 clusters will already be a stretch,
unless differences are really very obvious.

Cheers
Thibaut



From thibautjombart at gmail.com  Wed Feb  7 19:57:28 2018
From: thibautjombart at gmail.com (Thibaut Jombart)
Date: Wed, 7 Feb 2018 18:57:28 +0000
Subject: [adegenet-forum] new release, snapclust, podcast
Message-ID: <CANPRA+qGkQaoqwFfjWersRD4Sx=gviV5kEEAjUnf+AcMq7F5sg@mail.gmail.com>

Dear all,

a new version of adegenet (2.1.1) has now been released on CRAN. This
version implements snapclust, a fast maximum-likelihood genetic clustering
method, recently published in Methods in Ecology and Evolution.

snapclust is presented in the following podcast:
https://www.youtube.com/watch?v=Vl3cf0XHG7Q

Comments and feedback welcome!

Best
Thibaut



From 17197751 at students.latrobe.edu.au  Fri Feb  9 01:48:39 2018
From: 17197751 at students.latrobe.edu.au (JEREMY SAMUEL BENWELL-CLARKE)
Date: Fri, 9 Feb 2018 00:48:39 +0000
Subject: [adegenet-forum] Analysing mixed ploidy datasets
Message-ID: <SY3PR01MB15646BCB509043E46D83877FC7F20@SY3PR01MB1564.ausprd01.prod.outlook.com>

Hi everyone,

I'm trying to analyse a mixed ploidy dataset. The majority of my samples are diploid but there are a few triploids too. To make ploidy even in my raw data matrix I added zeros to all my diploid samples, which makes them triploids. I then use the read.genalex function from the poppr package to read in my data setting ploidy=3. However, I don't want '0' to be recognised as an extra allele and I want the true diploid samples to be separate from the true triploid samples. Therefore, I use the recode_polyploids function from poppr and set newploidy=T. Here is my code:

library(poppr)
genclone <- read.genalex("C:/Users/...", ploidy = 3)       # diploids are padded with "0"
genclone <- recode_polyploids(genclone, newploidy = TRUE)  # drop "0" alleles, reset ploidy


I'm wondering if this is the right way (statistically speaking) to analyse mixed ploidy datasets in R? Will my estimates of genetic diversity and structure be accurate?
I know there is the POLYSAT package, which seems to have been developed specifically for dealing with mixed ploidy datasets; however, I would rather stick to using adegenet and poppr, as I'm familiar with the functions and the structure of genind and genclone objects.

Any help would be much appreciated!

Cheers,

Jeremy

From zkamvar at gmail.com  Mon Feb 12 22:10:16 2018
From: zkamvar at gmail.com (Zhian Kamvar)
Date: Mon, 12 Feb 2018 15:10:16 -0600
Subject: [adegenet-forum] Analysing mixed ploidy datasets
In-Reply-To: <mailman.5.1518174022.7345.adegenet-forum@lists.r-forge.r-project.org>
References: <mailman.5.1518174022.7345.adegenet-forum@lists.r-forge.r-project.org>
Message-ID: <C8DCF942-696B-4F08-B65C-CFEA93B655CA@gmail.com>

Hi Jeremy,

> I'm wondering if this is the right way (statistically speaking) to analyse mixed ploidy datasets in R? Will my estimates of genetic diversity and structure be accurate?

I think the answer is... it depends. Meirmans, Liu, and van Tienderen just came out with a review on this topic: https://doi.org/10.1093/jhered/esy006 (*). For multivariate analyses (sPCA, PCA, DAPC, etc.), you will want to reduce the effect of polyploidy by converting your data to allele frequencies with:

tab(myData, freq = TRUE)

If you have an organism that changes ploidy by life stage, you may need to analyze them separately. Moreover, if you use Bruvo's distance, the choice of model is very important.
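
To see why frequencies help here, a minimal base R sketch (not adegenet's own code) of what the conversion amounts to at a single locus: allele counts are divided by the number of allele copies each individual carries, so diploids and triploids end up on the same 0-1 scale. The matrix `counts` and its row labels are made up for illustration.

```r
# Hypothetical allele counts at one locus (columns = alleles A and B).
counts <- matrix(
  c(1, 1,   # diploid heterozygote AB
    2, 1,   # triploid AAB
    2, 0,   # diploid homozygote AA
    3, 0),  # triploid AAA
  ncol = 2, byrow = TRUE,
  dimnames = list(c("dip_AB", "tri_AAB", "dip_AA", "tri_AAA"), c("A", "B"))
)

# Divide each row by the number of allele copies that individual carries.
freqs <- counts / rowSums(counts)
print(freqs)
```

With counts, the AA diploid and the AAA triploid differ (2 vs 3 copies); as frequencies they are identical, which is the effect one is after in a multivariate analysis.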

> I know there is the POLYSAT package, which seems to have been developed specifically for dealing with mixed ploidy datasets, however, I would rather stick to using adegenet and poppr, as I'm familiar with the functions and the structure of genind and genclone objects.

I know conversion is a PITA, but I would highly recommend using POLYSAT. There are far more tools to deal with polyploidy in that package that can complement any analyses you would perform with poppr or adegenet. A few years ago, I wrote a tiny function to help with conversion from genind to polysat data:

https://gist.github.com/zkamvar/aeaff83b9d126d55aade

In fact, Clarke and Schreier came out with a paper a year ago talking about how to resolve ambiguous ploidy, and it's only available in polysat: http://onlinelibrary.wiley.com/doi/10.1111/1755-0998.12639/abstract


Hope that helps,
Zhian

* The last figure in this paper will change slightly since they used an old version of poppr to calculate Bruvo's distance, which had a silent bug.

-----
Zhian N. Kamvar, Ph. D.
Postdoctoral Researcher (Everhart Lab)
Department of Plant Pathology
University of Nebraska-Lincoln
ORCID: 0000-0003-1458-7108



From briank.lists at gmail.com  Fri Feb 16 16:57:46 2018
From: briank.lists at gmail.com (brian knaus)
Date: Fri, 16 Feb 2018 07:57:46 -0800
Subject: [adegenet-forum] snapclust when HW is not expected
Message-ID: <CAEp5mpA4Ag9DBrZGu8TbN2GothVqVkETXcc7UnVJ1YqZHkCtkQ@mail.gmail.com>

Hi and congrats on your snapclust paper! I was thinking of trying the
method on a couple of projects I'm working on. However, I work with fungi
and fungus-like plant pathogens that exhibit a mixture of reproductive
modes (e.g., selfing, clonality, mitotic reproduction). This means that we
do not necessarily expect Hardy-Weinberg assumptions to be met. Your manual
seems to come out pretty early stating that HW is important. I would guess
that linkage disequilibrium (non-independence of loci) may be an issue
also. So this raises my question: in systems where HW may not be assumed
and where there may be linkage disequilibrium, would I be better off using
DAPC than snapclust?

Thanks!
Brian

From thibautjombart at gmail.com  Fri Feb 16 18:24:54 2018
From: thibautjombart at gmail.com (Thibaut Jombart)
Date: Fri, 16 Feb 2018 17:24:54 +0000
Subject: [adegenet-forum] snapclust when HW is not expected
In-Reply-To: <CAEp5mpA4Ag9DBrZGu8TbN2GothVqVkETXcc7UnVJ1YqZHkCtkQ@mail.gmail.com>
References: <CAEp5mpA4Ag9DBrZGu8TbN2GothVqVkETXcc7UnVJ1YqZHkCtkQ@mail.gmail.com>
Message-ID: <CANPRA+oePz3Z=HPJuPzX=h_1dy4NGRQU3XJstB5DK7Fvi93dwg@mail.gmail.com>

Hi Brian

thanks for reposting your question here. I am assuming that by 'DAPC' you
mean the K-means clustering presented in the DAPC paper, not the factorial
method itself. It is an interesting topic, and there are many possible
answers. I'll try to mention a few.

snapclust uses HW to compute the likelihood, like most other model-based
(likelihood, Bayesian) clustering methods I know of. Similarly, it assumes
independence of loci, in that: (global log-likelihood) = sum(log-likelihood
of every locus)
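
As a rough illustration of those two assumptions, here is a base R sketch for the simplest case (one diploid individual, biallelic loci). It is not snapclust's code; the function name `hw_loglik` and all numbers are hypothetical.

```r
# Log-likelihood of one diploid individual under HW and independent loci.
# g: number of copies of the reference allele at each locus (0, 1 or 2)
# p: frequency of that allele in the candidate cluster, one value per locus
hw_loglik <- function(g, p) {
  # Under HW the genotype is Binomial(2, p): (1-p)^2, 2p(1-p), p^2 for g = 0, 1, 2.
  per_locus <- dbinom(g, size = 2, prob = p, log = TRUE)
  # Independence of loci: per-locus log-likelihoods simply add up.
  sum(per_locus)
}

g   <- c(0, 1, 2)          # hypothetical genotypes at three loci
p_a <- c(0.5, 0.5, 0.5)    # hypothetical allele frequencies in cluster A
p_b <- c(0.1, 0.5, 0.9)    # hypothetical allele frequencies in cluster B
print(c(A = hw_loglik(g, p_a), B = hw_loglik(g, p_b)))
```

If genotype frequencies depart from HW, or loci are linked, the per-locus terms are no longer the right ones to add, which is where the approximation comes in.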

Deviation from HW and linkage between loci will have the same kind of
effect: the computed likelihood will be an approximation of the true,
unknown likelihood. How good is the approximation in a particular case? I
don't think we know, in general, but I'd like to see such a study
published. And then, the next question is: how does it change the
clustering solution? Again, more work would be interesting on this topic.

I suspect attitudes will vary, pretty much depending on whether one decides
to be purist or pragmatic. As an anecdote, while developing various Bayesian
or ML methods, it happened several times that I realised the likelihood was
'wrong' (coding error), sometimes with even one full component of the
likelihood entirely left out, and the reason I had not flagged it before was
that results were still okay. Similarly, a linear regression may still give
sensible results despite non-normally distributed residuals. k-means
clustering is often used without checking that groups have similar
within-group variances. And ML phylogenies from full alignments are
commonplace, while the likelihood also assumes independence of loci - see
Joe Felsenstein's cheeky comment on that in his pruning algorithm paper.

In short: it could be a problem, but we (at least, I) don't know what
impact it'll have. I know, disappointing. My 2 cents would be:
- fairly evenly distributed LD: snapclust should be fine
- a bit of clonality mixed up with some recombination / sexual
reproduction: should be worth looking at
- full clonality: work on haplotype frequencies / MLST type of markers (see
apex package), and then snapclust will be fine
- never rely on a single method if you can avoid it; I like using a
hierarchical clustering and further exploration using factorial methods
(PCA, DAPC) as a complement
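
The last point can be sketched in base R on simulated data: cluster the same genotypes with two unrelated methods and cross-tabulate the results. Everything here is an assumption for illustration (simulated diploid genotypes, k-means and Ward clustering as stand-ins, not snapclust or find.clusters).

```r
set.seed(1)
n_loci <- 50
n_per_group <- 20

# Simulated diploid genotypes (0/1/2 allele copies) for one group.
sim_group <- function(p) {
  matrix(rbinom(n_per_group * n_loci, size = 2, prob = p), nrow = n_per_group)
}
# Two well-separated groups with allele frequencies 0.1 and 0.9.
geno <- rbind(sim_group(0.1), sim_group(0.9))

# Method 1: k-means with several random starts.
km <- kmeans(geno, centers = 2, nstart = 10)$cluster
# Method 2: hierarchical clustering (Ward) cut into two groups.
hc <- cutree(hclust(dist(geno), method = "ward.D2"), k = 2)

# Cross-tabulate the partitions; agreement shows as one non-zero cell per row.
agreement <- table(kmeans = km, hclust = hc)
print(agreement)

# Factorial view of the same data, coloured by hierarchical cluster.
pca <- prcomp(geno)
plot(pca$x[, 1:2], col = hc, pch = 19)
```

Where the two partitions disagree on real data, those individuals are the ones worth a closer look.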

Please feel free to comment / discuss, everyone. I might put this in a
podcast, time allowing.

Best
Thibaut



From briank.lists at gmail.com  Fri Feb 16 23:56:06 2018
From: briank.lists at gmail.com (brian knaus)
Date: Fri, 16 Feb 2018 14:56:06 -0800
Subject: [adegenet-forum] snapclust when HW is not expected
In-Reply-To: <CANPRA+oePz3Z=HPJuPzX=h_1dy4NGRQU3XJstB5DK7Fvi93dwg@mail.gmail.com>
References: <CAEp5mpA4Ag9DBrZGu8TbN2GothVqVkETXcc7UnVJ1YqZHkCtkQ@mail.gmail.com>
 <CANPRA+oePz3Z=HPJuPzX=h_1dy4NGRQU3XJstB5DK7Fvi93dwg@mail.gmail.com>
Message-ID: <CAEp5mpAcweGicgbvUy+iX1F0=J5rsp+R5wHuyVMsfdOkUh8YSg@mail.gmail.com>

Thank you for a very thoughtful response! I think a summary is that we can
bend the rules, just try not to break things. And I think that was a
message expressed by Pritchard's group. They had a paper where they used
STRUCTURE on Helicobacter pylori. I think an issue though is that there are
many in the biological community who do not understand the methods well enough
to know if and when they may have gone too far. I appreciate your
recommendations, but for many of these projects we have a reason to expect
mixed mating modes, but we do not know how much of any particular mode to
expect. In fact, the research goal is frequently to infer mating mode. Or
perhaps which groups of samples may be outcrossing and which are not. I
suspect that might be a lot to ask for.

I appreciate your insights! And I find it encouraging that you would like
to see more work on this. Perhaps we'll get to that one day?
Brian


From thibautjombart at gmail.com  Mon Feb 19 09:41:54 2018
From: thibautjombart at gmail.com (Thibaut Jombart)
Date: Mon, 19 Feb 2018 08:41:54 +0000
Subject: [adegenet-forum] snapclust when HW is not expected
In-Reply-To: <CAEp5mpAcweGicgbvUy+iX1F0=J5rsp+R5wHuyVMsfdOkUh8YSg@mail.gmail.com>
References: <CAEp5mpA4Ag9DBrZGu8TbN2GothVqVkETXcc7UnVJ1YqZHkCtkQ@mail.gmail.com>
 <CANPRA+oePz3Z=HPJuPzX=h_1dy4NGRQU3XJstB5DK7Fvi93dwg@mail.gmail.com>
 <CAEp5mpAcweGicgbvUy+iX1F0=J5rsp+R5wHuyVMsfdOkUh8YSg@mail.gmail.com>
Message-ID: <CANPRA+oCpw3ek3wHi3TrrkgoYEu6GY-w27EOdqbbBLuUdVe9yg@mail.gmail.com>

Again, it'd be fun to give it a try on an actual case study ;)

If one treats clonal data as independent loci (rather than as a single
one), this will result in fully correlated allele frequencies, but this
shouldn't change the clustering itself. It will change summary statistics
(AIC etc.), but as both the deviance and the number of parameters will be
overestimated, I am not sure how much this would impact the choice of the
'optimal K'.
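
For the fully idealised case only, the arithmetic can be checked in a few lines of R. Assume the same information is counted m times as if it were m independent loci, so that both the log-likelihood and the number of parameters are multiplied by m. The log-likelihoods, parameter counts and m below are made-up numbers, and real clonal data will not be this tidy.

```r
# Hypothetical log-likelihoods and parameter counts for K = 1..4.
loglik   <- c(-520, -480, -476, -474)
n_params <- c(10, 20, 30, 40)

# AIC = 2 * (number of parameters) - 2 * log-likelihood
aic <- function(ll, k) 2 * k - 2 * ll

m <- 5  # number of times the same (fully linked) information is counted
aic_once <- aic(loglik, n_params)
aic_dup  <- aic(m * loglik, m * n_params)

print(which.min(aic_once))   # K chosen with the information counted once
print(which.min(aic_dup))    # K chosen with it counted m times
print(aic_dup / aic_once)    # every AIC value is scaled by the same factor m
```

In this toy setting the ranking of K does not move, but the differences between models are inflated by m, so any rule based on the size of the AIC gap would be affected.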

Best
Thibaut



From jdpalacio at utexas.edu  Tue Feb 20 15:04:54 2018
From: jdpalacio at utexas.edu (Juan D Palacio Mejia)
Date: Tue, 20 Feb 2018 08:04:54 -0600
Subject: [adegenet-forum] snapclust, table.value
Message-ID: <4E649EFD-5585-4F1D-9E51-E9D1D06AB055@utexas.edu>

Hi there,

I have a silly question: when I run a clustering using snapclust or find.clusters, the legend in the table.value graph gives me square sizes with odd values, such as 0.5 and 1.5. How can I change it to whole numbers?

Thanks guys,

Juan


From thibautjombart at gmail.com  Thu Feb 22 12:24:24 2018
From: thibautjombart at gmail.com (Thibaut Jombart)
Date: Thu, 22 Feb 2018 11:24:24 +0000
Subject: [adegenet-forum] snapclust, table.value
In-Reply-To: <4E649EFD-5585-4F1D-9E51-E9D1D06AB055@utexas.edu>
References: <4E649EFD-5585-4F1D-9E51-E9D1D06AB055@utexas.edu>
Message-ID: <CANPRA+rZipE1WGgGy+UwApAymukX1e33NMfbO7hDeoGPGwm14w@mail.gmail.com>

Hi Juan

The basic graphics of ade4 are not that customisable. Maybe try using
adegraphics?

https://cran.r-project.org/web/packages/adegraphics/vignettes/adegraphics.html

Best
Thibaut



From arsalane at protonmail.com  Thu Feb 22 12:27:19 2018
From: arsalane at protonmail.com (Arsalan Emami-Khoyi)
Date: Thu, 22 Feb 2018 06:27:19 -0500
Subject: [adegenet-forum] Spatial Data with large DNA alignement
Message-ID: <oyi0PuEpbw8ATVKtqPUmD5oJZM0szSKlh6GD42GjfwaTHdLmPvr4QuiHGpd9xcL2il5e-4qS-oglW6ZJgXMqgMOsuf9trsIP_86s4HJKZ9Q=@protonmail.com>

Hello Thibaut,
I hope that you are fine and all goes well.
A quick question:
When I import a large alignment of DNA sequences using the different tools, how can I efficiently assign a geographical location to each individual or population?
E.g.,
an alignment of 1000 sequences from 10 populations, each sampled for 100 animals.
What I do at the moment is make a 1000-line text file with the x and y coordinates for each individual.
Of course, each block of 100 rows has exactly the same coordinates.
I guess there should be a more efficient way of doing that, am I correct?
Heaps of thanks in advance
Regards

Arsalan Emami-Khoyi
Postdoctoral Research Fellow in Wildlife Genomics
University of Johannesburg_Center for Ecological Genomics and Wildlife Conservation
Auckland Park 2006
South Africa
Email : Arsalane at uj.ac.za
Phone :+27 (0)11 559 3373
Cellphone:+27 79 88 14 628
Website :https://sites.google.com/site/drpeterteske/postdocs

From thibautjombart at gmail.com  Thu Feb 22 15:00:51 2018
From: thibautjombart at gmail.com (Thibaut Jombart)
Date: Thu, 22 Feb 2018 14:00:51 +0000
Subject: [adegenet-forum] Spatial Data with large DNA alignement
In-Reply-To: <oyi0PuEpbw8ATVKtqPUmD5oJZM0szSKlh6GD42GjfwaTHdLmPvr4QuiHGpd9xcL2il5e-4qS-oglW6ZJgXMqgMOsuf9trsIP_86s4HJKZ9Q=@protonmail.com>
References: <oyi0PuEpbw8ATVKtqPUmD5oJZM0szSKlh6GD42GjfwaTHdLmPvr4QuiHGpd9xcL2il5e-4qS-oglW6ZJgXMqgMOsuf9trsIP_86s4HJKZ9Q=@protonmail.com>
Message-ID: <CANPRA+qjR=GMPU+TQN1LUgHJc3C432fm3rWJbyduW2q7psCrKg@mail.gmail.com>

Hello,

By the sound of it, base R will make your life easier - have a look at ?rep

For instance:
> rep(c("foo", "bar"), each = 10)
 [1] "foo" "foo" "foo" "foo" "foo" "foo" "foo" "foo" "foo" "foo" "bar" "bar"
[13] "bar" "bar" "bar" "bar" "bar" "bar" "bar" "bar"
>
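
Applied to coordinates, the same idea could look like the sketch below: keep one row per population and expand it to one row per individual. The population names, coordinates and sample size are invented for the example.

```r
# Hypothetical coordinates: one row per population.
pop_xy <- data.frame(
  pop = paste0("pop", 1:10),
  x   = seq(10, 100, by = 10),
  y   = seq(-5, -50, by = -5)
)
n_per_pop <- 100

# Sequences sorted by population: repeat each population row 100 times.
ind_xy <- pop_xy[rep(seq_len(nrow(pop_xy)), each = n_per_pop), ]
rownames(ind_xy) <- NULL
print(dim(ind_xy))

# Sequences in arbitrary order: look coordinates up from a per-individual label.
set.seed(1)
ind_pop <- sample(pop_xy$pop, 1000, replace = TRUE)  # hypothetical labels
ind_xy2 <- pop_xy[match(ind_pop, pop_xy$pop), c("x", "y")]
```

The `match()` variant is the safer one when the order of sequences in the alignment is not guaranteed to follow the populations.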

Cheers
Thibaut



From dollykc at gmail.com  Thu Feb 22 21:34:15 2018
From: dollykc at gmail.com (Katharine Coykendall)
Date: Thu, 22 Feb 2018 15:34:15 -0500
Subject: [adegenet-forum] different number of individuals in fastas
Message-ID: <CAMs2N1thgmCfAhvOyuPrY7Yr-gwzqwLQYQcsMQ59zvsq8UkG9Q@mail.gmail.com>

Hello,
 I've been following the instructions found at
http://popgen.nescent.org/PopDiffSequenceData.html to do a hierarchical F
analysis on my data.  I have two fasta files with the same 117 individuals
in each file.  The files are different genes.  When I put them together
using the read.multiFASTA command, it tells me that there are 120
sequences.  When I turn it into a genind object it says there are 120
individuals and 19 alleles.  I checked both files to make sure that the
sample names are the same between them and they are. I was wondering if it
had to do with add.gaps=TRUE so I tried to set it to FALSE, but I get an
error
Error in file(con, "rb") : cannot open the connection
In addition: Warning message:
In file(con, "rb") : cannot open file 'FALSE': No such file or directory

Is there an easy way to look at the multiFASTA or genind file to see where
the extra three individuals are coming from?

Thanks,
Katharine

From dollykc at gmail.com  Thu Feb 22 23:02:55 2018
From: dollykc at gmail.com (Katharine Coykendall)
Date: Thu, 22 Feb 2018 17:02:55 -0500
Subject: [adegenet-forum] different number of individuals in fastas
In-Reply-To: <CAMs2N1thgmCfAhvOyuPrY7Yr-gwzqwLQYQcsMQ59zvsq8UkG9Q@mail.gmail.com>
References: <CAMs2N1thgmCfAhvOyuPrY7Yr-gwzqwLQYQcsMQ59zvsq8UkG9Q@mail.gmail.com>
Message-ID: <CAMs2N1t-qTiKgzcjet4DR7D7HP2XUDybu_tZEe2g=Q0JWXGsPg@mail.gmail.com>

Figured it out!  In one of my fasta files, I had three of the sample names
capitalized and in the other file they weren't.
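
For anyone hitting the same thing, a quick base R check along these lines would have flagged it. The two name vectors are hypothetical stand-ins for the sequence labels read from each file, however you extract them.

```r
# Hypothetical sample names taken from the two FASTA files.
names_gene1 <- c("ind01", "ind02", "ind03", "ind04")
names_gene2 <- c("ind01", "IND02", "ind03", "Ind04")

# Names present in one file but not the other (exact, case-sensitive match).
only_in_1 <- setdiff(names_gene1, names_gene2)
only_in_2 <- setdiff(names_gene2, names_gene1)
print(only_in_1)
print(only_in_2)

# Mismatches that vanish when case is ignored point to capitalisation.
case_only <- only_in_2[tolower(only_in_2) %in% tolower(names_gene1)]
print(case_only)

# Number of distinct individuals a merge by exact name would see.
print(length(union(names_gene1, names_gene2)))
```

Normalising both sets of names with `tolower()` (or fixing them in the files) before combining the loci brings the count back to the expected number of individuals.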
