From nathan.truelove at manchester.ac.uk  Tue Sep  3 14:44:06 2013
From: nathan.truelove at manchester.ac.uk (Nathan Truelove)
Date: Tue, 3 Sep 2013 12:44:06 +0000
Subject: [adegenet-forum] $li in sPCA analysis
In-Reply-To: <2CB2DA8E426F3541AB1907F98ABA6570638B5287@icexch-m1.ic.ac.uk>
References: <CAHPXwHdpJ4bk4G8UBAZenwim-aURxKpvWQXpuBmNb-qCp_aevQ@mail.gmail.com>
 <2CB2DA8E426F3541AB1907F98ABA6570638B5234@icexch-m1.ic.ac.uk>,
 <CAHPXwHdx=VdgVm9PyqSTokjMDyKBGME82HoRONTX_u4EVK86Ew@mail.gmail.com>
 <2CB2DA8E426F3541AB1907F98ABA6570638B5287@icexch-m1.ic.ac.uk>
Message-ID: <DE5F03D8-75DE-4D8D-B699-C151F77C0519@postgrad.manchester.ac.uk>

Hi Adegenet Forum,

Thanks in advance to anyone who has some advice to share with the forum on sPCA. If you're in a rush, just read the parts in bold.

*I've been using sPCA to look at spatial genetic patterns among lobster populations*. I found significant local structure with the function local.rtest and no global structure using global.rtest. I've followed Thibaut's advice in his previous sPCA email to the forum and used $li to interpret local structure. Using the screeplot function, I selected for interpretation the local eigenvalue that had the highest levels of negative spatial autocorrelation and genetic variance. The $li values from this eigenvalue were then used to create an interpolated map.

*My question for the forum is*: *What do the positive and negative $li values associated with the local eigenvalue mean?* Do they correspond to levels of local (positive) and global (negative) scores at each location? Or are the $li values associated with the local eigenvalues simply a score for detecting local spatial genetic structure among sites, with nothing to do with global structure?

Best Wishes,

Nate

On Aug 11, 2013, at 4:35 PM, Jombart, Thibaut wrote:


Hello,

I think you attached the wrong file.

Negative values and local structure are not related. Local structure = sharp differences between neighbours. These would be overlooked by the lagged vector.

If the structure is clear enough, use $li.

As you have many overlapping points, s.value is suboptimal. You should consider using colorplot, or interpolated maps. See the tutorial on sPCA for some examples:
http://cran.r-project.org/web/packages/adegenet/vignettes/adegenet-spca.pdf

Best
Thibaut
________________________________________
From: dooshra at gmail.com [dooshra at gmail.com] on behalf of Hanan Sela [hans at tauex.tau.ac.il]
Sent: 11 August 2013 12:19
To: Jombart, Thibaut
Subject: Re: [adegenet-forum] li vs. ls in sPCA analysis

Hello Thibaut,
Thank you for the response.
In the file I have attached, I see that with the $li variable there are no negative values in the southern sites, while with the $ls values there are negative values in the south. It seems that I see more local spatial structure with $ls than with $li. When I tested the data with the local test I got significant results. Which variable is better to present in a paper?
Thank you
Hanan
Mr. Hanan Sela Ph.D.
Curator of the Lieberman Cereal Germplasm Bank
The Institute for Cereal Crops Improvement
Tel-Aviv University
P.O. Box 39040
Tel Aviv 69978
Israel

hans at tauex.tau.ac.il
Phone: 972-3-6405773
Cell: 972-50-5727458 , local U.S 17203600603
Fax: 972-3-6407857


On Sun, Aug 11, 2013 at 12:37 PM, Jombart, Thibaut <t.jombart at imperial.ac.uk> wrote:
Hello,

the lagged vector is the spatially weighted average of the original vector. That is, the value of the score at a given location is the weighted average of the neighbouring values. This basically smooths the patterns so that they can be detected / visualized more easily.
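
As a rough illustration of the weighted average described above, here is a minimal base-R sketch. The five sites, the chain of neighbours and the scores are all made up for the example, and `W` is an assumed row-standardised weight matrix; nothing here is taken from a real sPCA object.

```r
# Toy example: five sites along a line, each linked to its adjacent sites.
# 'adj' and 'W' are hypothetical; in practice the weights come from the
# connection network chosen for the analysis.
adj <- matrix(0, 5, 5)
for (i in 1:4) {
  adj[i, i + 1] <- 1
  adj[i + 1, i] <- 1
}
W <- adj / rowSums(adj)  # row-standardised weights: each row sums to 1

# Made-up scores standing in for one column of $li.
scores <- c(-2.0, -0.5, -1.5, 1.0, 3.0)

# Lagged scores: each site receives the weighted average of its
# neighbours' scores (its own score is not included).
lagged <- as.vector(W %*% scores)
print(lagged)
```

With equal weights, the lagged value of the middle site is simply the mean of the scores of its two neighbours, which is why sharp site-to-site differences tend to be averaged out in the lagged scores.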

Cheers
Thibaut.

--
######################################
Dr Thibaut JOMBART
MRC Centre for Outbreak Analysis and Modelling
Department of Infectious Disease Epidemiology
Imperial College - School of Public Health
St Mary's Campus
Norfolk Place
London W2 1PG
United Kingdom
Tel. : 0044 (0)20 7594 3658
t.jombart at imperial.ac.uk
http://sites.google.com/site/thibautjombart/
http://adegenet.r-forge.r-project.org/
________________________________________
From: adegenet-forum-bounces at lists.r-forge.r-project.org [adegenet-forum-bounces at lists.r-forge.r-project.org] on behalf of Hanan Sela [hans at tauex.tau.ac.il]
Sent: 11 August 2013 06:21
To: adegenet-forum at lists.r-forge.r-project.org
Subject: [adegenet-forum] li vs. ls in sPCA analysis

Hello
I have plotted the first PC of an sPCA analysis using s.value, once with z=my.pca$li[,1]
and once with z=my.pca$ls[,1]. The patterns seem to differ (see attached file). I do not understand what the lagged PC represents. What is the meaning of "denoisified" in the practical day presentation (Google does not know)? How do I interpret the difference? Please explain.
Thank you

Mr. Hanan Sela Ph.D.
Curator of the Lieberman Cereal Germplasm Bank
The Institute for Cereal Crops Improvement
Tel-Aviv University
P.O. Box 39040
Tel Aviv 69978
Israel

hans at tauex.tau.ac.il
Phone: 972-3-6405773
Cell: 972-50-5727458 , local U.S 17203600603
Fax: 972-3-6407857


On Thu, Aug 1, 2013 at 7:15 PM, <adegenet-forum-request at lists.r-forge.r-project.org> wrote:

Today's Topics:

  1. Fwd: Question about pre-processing of SNP data for machine
     learning (Daniel Murrell)
  2. Re: Fwd: Question about pre-processing of SNP data for
     machine learning (Jombart, Thibaut)
  3. Re: Fwd: Question about pre-processing of SNP data for
     machine learning (Daniel Murrell)


----------------------------------------------------------------------

Message: 1
Date: Thu, 1 Aug 2013 15:26:00 +0100
From: Daniel Murrell <dsm38 at cam.ac.uk>
To: adegenet-forum at lists.r-forge.r-project.org
Subject: [adegenet-forum] Fwd: Question about pre-processing of SNP
       data for machine learning
Message-ID:
       <CADK=3HwmiEO5v6fCQUYNkHFQ520avQJ9LFOAdu=Yu-Z+8h7BCg at mail.gmail.com>
Content-Type: text/plain; charset="windows-1252"

Hi All

This is my first time using adegenet. I'm trying to perform some
pre-processing on 1.3M SNPs (~800 individuals) so that I can use them for a
machine learning task. My data was stored in a format which had to be
converted to a genlight object. The data was split so that the information
for the SNPs in each chromosome was in a separate file. I've read each file
in, converted that to a genlight object and then concatenated the genlight
objects using cbind. All of that seems to work ok (except the position and
chromosome data went back to NULL during the concatenation and I had to
reset it on the combined genlight object).

So, now I want to do my own processing on each SNP and when I try to access
the information for this SNP over the 800 individuals, it takes ages to
extract. Is this because the encoding is done row wise, and so the whole
object needs to be decoded for me to get out the information I require? Is
there a way to transpose this genlight object so that I can access the data
for a single SNP across all individuals quickly?

Thank you
Daniel

---------- Forwarded message ----------
From: Jombart, Thibaut <t.jombart at imperial.ac.uk>
Date: Fri, Jul 19, 2013 at 4:27 PM
Subject: RE: Question about pre-processing of SNP data for machine learning
To: Daniel Murrell <dsm38 at cam.ac.uk>


Dear Daniel,

yes, adegenet is designed for that kind of task. Please look at the
tutorial on adegenet-basics where you'll find examples of dimension
reduction for SNP data, to be found on:
http://adegenet.r-forge.r-project.org/

Don't hesitate to use the adegenet-forum for further questions (see
contacts on the website).
Best
Thibaut

--
######################################
Dr Thibaut JOMBART
MRC Centre for Outbreak Analysis and Modelling
Department of Infectious Disease Epidemiology
Imperial College - School of Public Health
St Mary's Campus
Norfolk Place
London W2 1PG
United Kingdom
Tel. : 0044 (0)20 7594 3658
t.jombart at imperial.ac.uk
http://sites.google.com/site/thibautjombart/
http://adegenet.r-forge.r-project.org/
________________________________________
From: dsmurrell at gmail.com [dsmurrell at gmail.com] on behalf of Daniel Murrell
[dsm38 at cam.ac.uk]
Sent: 19 July 2013 16:23
To: Jombart, Thibaut
Subject: Question about pre-processing of SNP data for machine learning

Dear Thibaut

I'm trying to build a model that uses SNP data as input. The problem I have
is that there is too much of it and I need a way to reduce the number or
the dimensionality of the data points so that I can use them as input to
machine learning algorithms (genome wide, 1.3 million SNPs, 800
individuals). I've done some searching and found this paper:
http://www.ncbi.nlm.nih.gov/pubmed/18076475 (pdf attached).

I also found your adegenet package and wondered if it's designed for doing
something like this? I'm not from this field and I'm having some trouble
working this out. Can you point me to anything that might help?

I'm not sure whether I should be keeping a subset of SNPs and how to find
that subset from the 1.3 million, or whether I should be reducing the
dimensionality.

Thank you
Daniel

------------------------------

Message: 2
Date: Thu, 1 Aug 2013 15:22:27 +0000
From: "Jombart, Thibaut" <t.jombart at imperial.ac.uk>
To: Daniel Murrell <dsm38 at cam.ac.uk>,
       "adegenet-forum at lists.r-forge.r-project.org"
       <adegenet-forum at lists.r-forge.r-project.org>
Subject: Re: [adegenet-forum] Fwd: Question about pre-processing of
       SNP data for machine learning
Message-ID:
       <2CB2DA8E426F3541AB1907F98ABA6570638ABF4F at icexch-m1.ic.ac.uk>
Content-Type: text/plain; charset="Windows-1252"


Dear Daniel,

the loss of attributes after cbind indeed is a glitch. Would you mind creating a ticket about it?
https://sourceforge.net/p/adegenet/tickets/

You're right about the issue. The encoding is indeed done row-wise, so the conversion is done many times over. There's no option for transposing the data, but one solution would be converting your data to integers by blocks so that conversion takes place less often, while still keeping RAM requirements reasonable.
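
A minimal sketch of the block-wise idea, assuming adegenet is installed. The simulated genotypes, the block size and the per-SNP mean are placeholders chosen for illustration only; the real processing step and a sensible block size would depend on the data and the available RAM.

```r
library(adegenet)

# Hypothetical small genlight object standing in for the real data
# (20 individuals x 1000 SNPs instead of ~800 x 1.3M).
set.seed(1)
geno <- matrix(sample(0:2, 20 * 1000, replace = TRUE), nrow = 20)
x <- new("genlight", geno)

block_size <- 250  # assumption: tune so one decoded block fits in RAM
starts <- seq(1, nLoc(x), by = block_size)
snp_means <- numeric(nLoc(x))

for (s in starts) {
  idx <- s:min(s + block_size - 1, nLoc(x))
  # Decode one block of SNPs to an integer matrix (individuals x SNPs).
  block <- as.matrix(x[, idx])
  # Per-SNP processing goes here; a column mean is used as a stand-in.
  snp_means[idx] <- colMeans(block, na.rm = TRUE)
}
```

Each SNP column is then read from an ordinary matrix, so the decoding cost is paid once per block rather than once per SNP.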

All the best

Thibaut

------------------------------

Message: 3
Date: Thu, 1 Aug 2013 17:14:37 +0100
From: Daniel Murrell <dsm38 at cam.ac.uk>
To: "Jombart, Thibaut" <t.jombart at imperial.ac.uk>
Cc: "adegenet-forum at lists.r-forge.r-project.org"
       <adegenet-forum at lists.r-forge.r-project.org>
Subject: Re: [adegenet-forum] Fwd: Question about pre-processing of
       SNP data for machine learning
Message-ID:
       <CADK=3Hz=iJSJePuCOSwCkFOQUWHQyAmk+YS=-qWD+EO5vOBihA at mail.gmail.com>
Content-Type: text/plain; charset="windows-1252"

Dear Thibaut

Ok, I could try that. I could also try and use the genlight object in a
transposed manner just for the purposes of holding the data so that I can
access individual SNPs easily. I mean nothing else would work except the
containment.

Thanks for the help
Regards
Daniel

------------------------------

_______________________________________________
adegenet-forum mailing list
adegenet-forum at lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/adegenet-forum

End of adegenet-forum Digest, Vol 60, Issue 2
*********************************************

From Jutta.Geismar at senckenberg.de  Wed Sep  4 15:03:35 2013
From: Jutta.Geismar at senckenberg.de (Jutta Geismar)
Date: Wed, 04 Sep 2013 15:03:35 +0200
Subject: [adegenet-forum] Question about genetic structure in admixed
	populations
Message-ID: <52274BC7020000CB0000539A@snggwia.senckenberg.de>


Dear Mr Jombart and DAPC users,
 
I used DAPC to analyze genetic structure in a small region with 20
microsatellite markers. I analyzed 330 individuals (14 sampling sites)
and found little genetic differentiation (FST, Jost's D), but a
significant isolation-by-distance pattern. A cluster analysis in
STRUCTURE resulted in four clusters (STRUCTURE Harvester), but all
individuals had more or less equal posterior probability in all four
inferred clusters. Therefore I assume a panmictic population structure.
Since STRUCTURE is known to have problems analyzing datasets under IBD,
I also analyzed the data with DAPC. DAPC resulted in 3 or 4 clusters (I
tested up to K=7 to be sure), but in both cases these were randomly
distributed among all individuals without a geographic context. Only 94
individuals were not assigned to one cluster with more than 90%
probability and would therefore be counted as "admixed" (as in the
example in the DAPC tutorial). To me the results of STRUCTURE and DAPC
conflict with each other, but I don't know what a panmictic population
would look like in DAPC. Distances between sites are small and it is
very likely that gene flow occurs among my sampling points, which might
cause problems in genetic cluster analyses. I don't know if I made a
mistake in my thinking, which is why I want to explain my procedure
briefly:
1. I ran dapc and retained 1/3 of the sample size as PCs (as
   suggested) and counted the DAs in the plot (100% of the variability
   was included, 110 PCs, 13 DAs).
2. To reduce variability I used optim.a.score (smart FALSE). The
   best a-score was around 0.2 (61 PCs).
3. After that I estimated the number of clusters with find.clusters,
   using the number of PCs suggested by the a-score, and repeated the
   dapc (conserved variance was still 98%, 61 PCs, 2 DAs).
I chose the k after which the decrease in BIC became smaller than the
previous decrease, rather than the k with the lowest BIC.
If there are mistakes in my procedure I would appreciate some advice.
But even if the procedure is okay, I cannot explain the contradiction
between these two analyses.
Thanks a lot in advance for some help.
Jutta Geismar 
PhD student
Germany
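
For readers following the 90% criterion mentioned above, here is a small base-R sketch of how such a count could be made. The `posterior` matrix is invented for the example and merely stands in for a table of membership probabilities (rows are individuals, columns are clusters); the 0.9 cut-off is the one quoted in the message.

```r
# Hypothetical posterior membership probabilities
# (rows: individuals, columns: clusters; each row sums to 1).
posterior <- rbind(
  c(0.97, 0.02, 0.01),
  c(0.50, 0.45, 0.05),
  c(0.05, 0.92, 0.03),
  c(0.34, 0.33, 0.33)
)

threshold <- 0.9                      # the 90% cut-off mentioned above
max_post <- apply(posterior, 1, max)  # highest membership per individual
admixed <- max_post < threshold       # TRUE where no cluster exceeds the cut-off

sum(admixed)  # number of individuals that would be counted as admixed
```

In this toy table the second and fourth individuals fall below the cut-off, so two individuals would be counted as admixed.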

From mirainoshojo at gmail.com  Wed Sep  4 16:45:40 2013
From: mirainoshojo at gmail.com (Valeria Montano)
Date: Wed, 4 Sep 2013 16:45:40 +0200
Subject: [adegenet-forum] $li in sPCA analysis
In-Reply-To: <DE5F03D8-75DE-4D8D-B699-C151F77C0519@postgrad.manchester.ac.uk>
References: <CAHPXwHdpJ4bk4G8UBAZenwim-aURxKpvWQXpuBmNb-qCp_aevQ@mail.gmail.com>
 <2CB2DA8E426F3541AB1907F98ABA6570638B5234@icexch-m1.ic.ac.uk>
 <CAHPXwHdx=VdgVm9PyqSTokjMDyKBGME82HoRONTX_u4EVK86Ew@mail.gmail.com>
 <2CB2DA8E426F3541AB1907F98ABA6570638B5287@icexch-m1.ic.ac.uk>
 <DE5F03D8-75DE-4D8D-B699-C151F77C0519@postgrad.manchester.ac.uk>
Message-ID: <CADEmh=tU0XfymPAj4CJgehnd-a1zrnCOS79eRdmq9RT0i2QH-A@mail.gmail.com>

Hi Nate,

the $li scores are the scores of each locality on a given component, just
as in classic PCA: they are simply the coordinates of the entities on the
component you are interested in. Because each component is centred on
zero, the scores take both positive and negative values, and the sign only
indicates on which side of the axis a location falls. This holds for
components with positive eigenvalues and for those with negative
eigenvalues, which are associated with global and local spatial structure
respectively. Whether a structure is significant, be it global (positive
eigenvalues) or local (negative eigenvalues), is evaluated by the global
and local rtests, on the basis of the overall correlation between the
genetic data and the spatial distribution of the localities. Each positive
or negative component (with its own genetic variance and Moran's I) is
thus a partial representation of the global or local spatial structure.
So in your case, since you have a significant local structure, you can
plot the first, second, third, etc. negative components one by one and see
what pattern each of them shows. Sometimes there is interesting
information in the smaller components.
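
To make the difference between the sign of a score and the sign of the
spatial autocorrelation concrete, here is a minimal base R sketch. It does
not use adegenet and is not how spca computes its components: the ten sites
on a transect, the neighbour weights W and the helper moran_i are all
made up for illustration. Both score vectors contain positive and negative
values, yet only the second one has the negative Moran's I that goes with a
local structure.

```r
# Ten hypothetical sites along a transect; the neighbours of a site are
# the adjacent sites (an assumption made for this toy example).
n <- 10
W <- matrix(0, n, n)
for (i in seq_len(n)) {
  nb <- c(i - 1, i + 1)
  nb <- nb[nb >= 1 & nb <= n]
  W[i, nb] <- 1 / length(nb)          # row-standardised spatial weights
}

# Moran's I of a score vector z under the weights W (illustrative helper).
moran_i <- function(z, W) {
  z <- z - mean(z)                    # scores are centred on zero
  (length(z) / sum(W)) * sum(W * outer(z, z)) / sum(z^2)
}

global_scores <- seq(-1, 1, length.out = n)     # smooth gradient
local_scores  <- rep(c(-1, 1), length.out = n)  # neighbours differ sharply

moran_i(global_scores, W)   # positive: neighbouring sites have similar scores
moran_i(local_scores, W)    # negative: neighbouring sites have opposite scores
```

In both cases a negative score only says that a site sits on the negative
side of the axis; whether the pattern is global or local depends on how the
scores of neighbouring sites relate to each other.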

Ehm, as usual it's a bit of a messy explanation (I am not good at
explaining), but I hope this helps. Otherwise I hope you will get better
replies.

Ciao

Valeria

On 3 September 2013 14:44, Nathan Truelove
<nathan.truelove at manchester.ac.uk> wrote:

>  Hi Adegenet Forum,
>
>  Thanks in advance to anyone who has some advice to share with the forum
> on SPCA. If you're in a rush just read the parts in bold.
>
>  *I've been using SPCA to look at spatial genetics patterns among lobster
> populations*. I found positive local structure with the function
> local.rtest and no global structure using global.rtest. I've followed
> Thibaut's advice in his previous sPCA email to forum and used $li to
> interpret local structure. I selected the local eigenvalue that had the
> highest levels of negative spatial autocorrelation and genetic variance for
> interpretation using the screeplot function. The $li values from this
> eigenvalue were then used to create an interpolated map.
>
>  *My question for the forum is*: *What do the positive and negative $li
> values associated with the local eigenvalue mean? *Do they correspond to
> levels of local (positive) and global (negative) scores at each location?
> Or are the $li values associated with the local eigenvalues simply a score
> for detecting local spatial genetic structure among sites and have nothing
> to do with global structure?
>
>  Best Wishes,
>
>  Nate
>
>   On Aug 11, 2013, at 4:35 PM, Jombart, Thibaut wrote:
>
>
> Hello,
>
> I think you attached the wrong file.
>
> Negative values and local structure are not related. Local structure =
> sharp differences between neighbours. These would be overlooked by the
> lagged vector.
>
> If the structure is clear enough, use $li.
>
> As you have many overlapping points, s.value is suboptimal. You should
> consider using the colorplot, or interpolated maps. See the tutorial on
> sPCA for some example:
> http://cran.r-project.org/web/packages/adegenet/vignettes/adegenet-spca.pdf
>
> Best
> Thibaut
> ________________________________________
> From: dooshra at gmail.com [dooshra at gmail.com] on behalf of Hanan Sela [
> hans at tauex.tau.ac.il]
> Sent: 11 August 2013 12:19
> To: Jombart, Thibaut
> Subject: Re: [adegenet-forum] li vs. ls in sPCA analysis
>
> Hello Thibaut,
> Thank you for the response.
> In the file I have attached I see that with the $li variable there are no
> negative values in the southern sites while with the $ls values there are
> negative values in the south. It seems that I see more local spatial
> structure with $ls than with $li. When I tested the data with the local
> test I got significant results. Which variable is better to present in a
> paper?
> Thank you
> Hanan
> Mr. Hanan Sela Ph.D.
> Curator of the Lieberman Cereal Germplasm Bank
> The Institute for Cereal Crops Improvement
> Tel-Aviv University
> P.O. Box 39040
> Tel Aviv 69978
> Israel
>
> hans at tauex.tau.ac.il
> Phone: 972-3-6405773
> Cell: 972-50-5727458 , local U.S 17203600603
> Fax: 972-3-6407857
>
>
> On Sun, Aug 11, 2013 at 12:37 PM, Jombart, Thibaut
> <t.jombart at imperial.ac.uk> wrote:
> Hello,
>
> the lagged vector is the spatially weighted average of the original
> vector. That is, the value of the score at a given location is the weighted
> average of the neighbouring values. This basically smooths the patterns so
> that they can be detected / visualized more easily.
>
> Cheers
> Thibaut.
>
> --
> ######################################
> Dr Thibaut JOMBART
> MRC Centre for Outbreak Analysis and Modelling
> Department of Infectious Disease Epidemiology
> Imperial College - School of Public Health
> St Mary's Campus
> Norfolk Place
> London W2 1PG
> United Kingdom
> Tel. : 0044 (0)20 7594 3658
> t.jombart at imperial.ac.uk
> http://sites.google.com/site/thibautjombart/
> http://adegenet.r-forge.r-project.org/
> ________________________________________
> From: adegenet-forum-bounces at lists.r-forge.r-project.org
> [adegenet-forum-bounces at lists.r-forge.r-project.org] on behalf of Hanan
> Sela [hans at tauex.tau.ac.il]
> Sent: 11 August 2013 06:21
> To: adegenet-forum at lists.r-forge.r-project.org
> Subject: [adegenet-forum] li vs. ls in sPCA analysis
>
> Hello
> I have plotted the first PC of the sPCA analysis using s.value, once with
> z=my.pca$li[,1]
> and once with z=my.pca$ls[,1]. The patterns seem to differ (see attached
> file). I do not understand what the lagged PC represents. What is the
> meaning of "denoisified" in the practical day presentation (Google does
> not know)? How do I interpret the difference? Please explain.
> Thank you
>
> Mr. Hanan Sela Ph.D.
> Curator of the Lieberman Cereal Germplasm Bank
> The Institute for Cereal Crops Improvement
> Tel-Aviv University
> P.O. Box 39040
> Tel Aviv 69978
> Israel
>
> hans at tauex.tau.ac.il
> Phone: 972-3-6405773
> Cell: 972-50-5727458 , local U.S 17203600603
> Fax: 972-3-6407857
>
>
> On Thu, Aug 1, 2013 at 7:15 PM,
> <adegenet-forum-request at lists.r-forge.r-project.org> wrote:
>
> Today's Topics:
>
>   1. Fwd: Question about pre-processing of SNP data for machine
>      learning (Daniel Murrell)
>   2. Re: Fwd: Question about pre-processing of SNP data for
>      machine learning (Jombart, Thibaut)
>   3. Re: Fwd: Question about pre-processing of SNP data for
>      machine learning (Daniel Murrell)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Thu, 1 Aug 2013 15:26:00 +0100
> From: Daniel Murrell <dsm38 at cam.ac.uk>
> To: adegenet-forum at lists.r-forge.r-project.org
> Subject: [adegenet-forum] Fwd: Question about pre-processing of SNP
>        data for machine learning
> Message-ID:
>        <CADK=3HwmiEO5v6fCQUYNkHFQ520avQJ9LFOAdu=Yu-Z+8h7BCg at mail.gmail.com>
> Content-Type: text/plain; charset="windows-1252"
>
> Hi All
>
> This is my first time using adegenet. I'm trying to perform some
> pre-processing on 1.3M SNPs (~800 individuals) so that I can use them for a
> machine learning task. My data was stored in a format which had to be
> converted to a genlight object. The data was split so that the information
> for the SNPs in each chromosome was in a separate file. I've read each file
> in, converted that to a genlight object and then concatenated the genlight
> objects using cbind. All of that seems to work ok (except the position and
> chromosome data went back to NULL during the concatenation and I had to
> reset it on the combined genlight object).
>
> So, now I want to do my own processing on each SNP and when I try to access
> the information for this SNP over the 800 individuals, it takes ages to
> extract. Is this because the encoding is done row wise, and so the whole
> object needs to be decoded for me to get out the information I require? Is
> there a way to transpose this genlight object so that I can access the data
> for a single SNP across all individual quickly?
>
> Thank you
> Daniel
>
> ---------- Forwarded message ----------
> From: Jombart, Thibaut <t.jombart at imperial.ac.uk>
> Date: Fri, Jul 19, 2013 at 4:27 PM
> Subject: RE: Question about pre-processing of SNP data for machine learning
> To: Daniel Murrell <dsm38 at cam.ac.uk>
>
>
> Dear Daniel,
>
> yes, adegenet is designed for that kind of task. Please look at the
> tutorial on adegenet-basics where you'll find examples of dimension
> reduction for SNP data, to be found on:
> http://adegenet.r-forge.r-project.org/
>
> Don't hesitate to use the adegenet-forum for further questions (see
> contacts on the website).
> Best
> Thibaut
>
> --
> ######################################
> Dr Thibaut JOMBART
> MRC Centre for Outbreak Analysis and Modelling
> Department of Infectious Disease Epidemiology
> Imperial College - School of Public Health
> St Mary's Campus
> Norfolk Place
> London W2 1PG
> United Kingdom
> Tel. : 0044 (0)20 7594 3658
> t.jombart at imperial.ac.uk
> http://sites.google.com/site/thibautjombart/
> http://adegenet.r-forge.r-project.org/
> ________________________________________
> From: dsmurrell at gmail.com [dsmurrell at gmail.com] on behalf of Daniel
> Murrell [dsm38 at cam.ac.uk]
> Sent: 19 July 2013 16:23
> To: Jombart, Thibaut
> Subject: Question about pre-processing of SNP data for machine learning
>
> Dear Thibaut
>
> I'm trying to build a model that uses SNP data as input. The problem I have
> is that there is too much of it and I need a way to reduce the number or
> the dimensionality of the data points so that I can use them as input to
> machine learning algorithms (genome wide, 1.3 million SNPs, 800
> individuals). I've done some searching and found this paper:
> http://www.ncbi.nlm.nih.gov/pubmed/18076475 (pdf attached).
>
> I also found your adegenet package and wondered if it's designed for doing
> something like this? I'm not from this field and I'm having some trouble
> working this out. Can you point me to anything that might help?
>
> I'm not sure whether I should be keeping a subset of SNPs and how to find
> that subset from the 1.3 million, or whether I should be reducing the
> dimensionality.
>
> Thank you
> Daniel
>
> ------------------------------
>
> Message: 2
> Date: Thu, 1 Aug 2013 15:22:27 +0000
> From: "Jombart, Thibaut" <t.jombart at imperial.ac.uk>
> To: Daniel Murrell <dsm38 at cam.ac.uk>,
>        "adegenet-forum at lists.r-forge.r-project.org"
>        <adegenet-forum at lists.r-forge.r-project.org>
> Subject: Re: [adegenet-forum] Fwd: Question about pre-processing of
>        SNP data for machine learning
> Message-ID:
>        <2CB2DA8E426F3541AB1907F98ABA6570638ABF4F at icexch-m1.ic.ac.uk>
> Content-Type: text/plain; charset="Windows-1252"
>
>
> Dear Daniel,
>
> the loss of attributes after cbind indeed is a glitch. Would you mind
> creating a ticket about it?
> https://sourceforge.net/p/adegenet/tickets/
>
> You're right about the issue. The encoding is indeed done row-wise, so the
> conversion is done many times over. There's no option for transposing the
> data, but one solution would be converting your data to integers by blocks
> so that conversion takes place less often, while still keeping RAM
> requirements reasonable.
>
> All the best
>
> Thibaut
>
>
>
> ------------------------------
>
> Message: 3
> Date: Thu, 1 Aug 2013 17:14:37 +0100
> From: Daniel Murrell <dsm38 at cam.ac.uk>
> To: "Jombart, Thibaut" <t.jombart at imperial.ac.uk>
> Cc: "adegenet-forum at lists.r-forge.r-project.org"
>        <adegenet-forum at lists.r-forge.r-project.org>
> Subject: Re: [adegenet-forum] Fwd: Question about pre-processing of
>        SNP data for machine learning
> Message-ID:
>        <CADK=3Hz=iJSJePuCOSwCkFOQUWHQyAmk+YS=-qWD+EO5vOBihA at mail.gmail.com>
> Content-Type: text/plain; charset="windows-1252"
>
> Dear Thibaut
>
> Ok, I could try that. I could also try to use the genlight object in a
> transposed manner, just for the purpose of holding the data, so that I can
> access individual SNPs easily. I mean that nothing else would work except
> the containment.
>
> Thanks for the help
> Regards
> Daniel
>
>
> ------------------------------
>
> _______________________________________________
> adegenet-forum mailing list
> adegenet-forum at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/adegenet-forum
>
> End of adegenet-forum Digest, Vol 60, Issue 2
> *********************************************
>

From mirainoshojo at gmail.com  Thu Sep  5 10:59:43 2013
From: mirainoshojo at gmail.com (Valeria Montano)
Date: Thu, 5 Sep 2013 10:59:43 +0200
Subject: [adegenet-forum] Question about genetic structure in admixed
	populations
In-Reply-To: <52274BC7020000CB0000539A@snggwia.senckenberg.de>
References: <52274BC7020000CB0000539A@snggwia.senckenberg.de>
Message-ID: <CADEmh=vJH1LeFd-QPe5BfaueDBaGGGR9fNBb3BCDv70Jo7P2=w@mail.gmail.com>

Dear Jutta,

cluster analysis can be tricky when the samples analysed are distributed
along a gradient: if there is no clear-cut subdivision, this can lead to
contradictory results (have a look at this paper
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.192.3029&rep=rep1&type=pdf).
You may want to consider using TESS or BAPS with the admixture model
option. Both programs allow the geographic coordinates to be included as
prior information, and the admixture model is a way to model spatial
gradients. If you tested IBD with a Mantel test, just be careful: a
significant Mantel test does not necessarily reflect IBD, because the
correlation between geographic and genetic distances can be significant
under different spatial/migratory schemes. I think your DAPC is ok, apart
from the fact that there is no need to run find.clusters with the number of
PCs indicated by optim.a.score. That procedure is meant to optimize the
discriminant space among clusters in the DAPC. To assign individuals to
clusters you can simply retain all the variance (even though in your case
it makes little difference, given that you already retain 98%). The only
other thing: I would try a maximum number of clusters of around 20, more
than your number of sampling locations. You can also give sPCA a try.
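
As a rough illustration of the order of operations suggested above, here is
a minimal R sketch. It is only a sketch under assumptions: the nancycats
microsatellite data shipped with adegenet stand in for the real dataset,
the values n.pca = 200, n.pca = 60 and n.sim = 5 are arbitrary choices, and
the argument names are the ones I remember from the adegenet help pages, so
please check ?find.clusters, ?dapc and ?optim.a.score for your version.

```r
library(adegenet)

data(nancycats)   # example microsatellite dataset, a stand-in for real data
x <- nancycats

# 1) Look for clusters keeping (nearly) all the variance. A large n.pca is
#    used here on the assumption that it is capped at the number of
#    available PCs; max.n.clust = 20 follows the suggestion above.
grp <- find.clusters(x, n.pca = 200, max.n.clust = 20,
                     choose.n.clust = FALSE, criterion = "diffNgroup")

# 2) First DAPC on the inferred groups, retaining many PCs.
dapc_all <- dapc(x, pop = grp$grp, n.pca = 60, n.da = 2)

# 3) The a-score is only used to tune the number of PCs kept in the DAPC,
#    not the number of PCs given to find.clusters.
opt <- optim.a.score(dapc_all, smart = FALSE, n.sim = 5, plot = FALSE)
dapc_opt <- dapc(x, pop = grp$grp, n.pca = opt$best, n.da = 2)
```

The automatic choice of k (criterion = "diffNgroup") is used here only to
keep the sketch non-interactive; inspecting the BIC curve by hand, as you
did, is equally valid.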

Hope this helps

Ciao

Valeria


On 4 September 2013 15:03, Jutta Geismar <Jutta.Geismar at senckenberg.de> wrote:

>  Dear Mr Jombart and DAPC users,
>
> I used DAPC to analyze genetic structure in a small region with 20
> microsatellite markers. I analyzed 330 individuals (14 sampling sites) and
> found little genetic differences (FST, D Jost), but a significant isolation
> by distance pattern. A cluster analysis in STRUCTURE resulted in four
> clusters (STRUCTURE Harvester) but all individuals had more or less equal
> posterior probability in all of the four inferred clusters. Therefore I
> assume a panmictic population structure. Since STRUCTURE is known for some
> problems analyzing datasets under IBD I analyzed the data with DAPC. DAPC
> resulted in 3 or 4 clusters (and tested up until K=7 to be sure), but in
> both cases these were randomly distributed among all individuals without a
> geographic context. Only 94 individuals were not assigned to one cluster
> with more than 90% and therefore would be counted as "admixed" (example in
> DAPC tutorial). For me the results of STRUCTURE and DAPC are in conflict to
> each other, but I don't know how a panmictic population would look like in
> DAPC. Distances between sites are small and it is very likely that gene
> flow occurs among my sampling points, which might cause problems in genetic
> cluster analyses. I don't know if I made any mistake in my thinking, that's
> why I want to explain my procedure briefly:
>
> 1.       I used dapc and chose 1/3 of the sample size as PC (as
> suggested) and counted DAs in the plot (100% of the variability was
> included, 110 PC, 13 DA)
>
> 2.       To reduce variability I used optim.a.score (smart = FALSE). The
> best a-score was around 0.2 (61 PCs).
>
> 3.       After that I estimated the number of clusters with
> find.clusters, using the number of PCs given by the a-score, and
> repeated the dapc (conserved variance was still 98%, 61 PCs, 2 DAs).
>
> I chose the k after which the decrease in BIC was smaller than at the
> previous step, rather than the k with the lowest BIC.
>
> If there are mistakes in my procedure I would appreciate some advice.
> But even if the procedure is okay, I cannot explain the contradiction between
> these two analyses. ****
>
> Thanks a lot in advance for some help.****
>
> Jutta Geismar ****
>
> PhD student
>
> Germany****

From mirainoshojo at gmail.com  Sun Sep  8 20:44:07 2013
From: mirainoshojo at gmail.com (Valeria Montano)
Date: Sun, 8 Sep 2013 20:44:07 +0200
Subject: [adegenet-forum] Question about genetic structure in admixed
	populations
In-Reply-To: <522C9934020000CB000053D5@snggwia.senckenberg.de>
References: <52274BC7020000CB0000539A@snggwia.senckenberg.de>
 <CADEmh=vJH1LeFd-QPe5BfaueDBaGGGR9fNBb3BCDv70Jo7P2=w@mail.gmail.com>
 <522C9934020000CB000053D5@snggwia.senckenberg.de>
Message-ID: <CADEmh=sVic5YSnRFGA6Cb8Z=9jWhWG=SJV3QjFLGUNBuzD8j5Q@mail.gmail.com>

Hi Jutta!

well, ehm...sooo,

you already know about the limitations of STRUCTURE and, in general, of
Bayesian approaches to cluster analysis.

For what it's worth, I can give you my opinion/suggestion in brief:

1) STRUCTURE and DAPC can give different results in several cases,
depending on the evolutionary processes ongoing among specific inds/pops. I
wish they always agreed - that would make our lives happier. In general,
relying on one method rather than another is a decision that can be made
based on knowledge of the models assumed by the different approaches and
their limitations, and certainly on the feeling you have about your case
study given all the results you already have. Personally, I never take a
best k out of find.clusters unless the BIC shows a very clear cut-off (i.e.
the curve nicely rising after a certain K), but this is really a
personal standard.

2) My understanding of continuously distributed populations (as seems to
be the case for your data) is that there is actually no best
clustering one can do. When the spatial distribution of the allele
frequencies is organized in gradients or clines, clusters are not the
best tool to describe the data. That is why a method such as BAPS is
useful. GENELAND is cool too, but there is no explicit modelling of
gradients, plus the integration of the spatial info has never been totally
clear to me. I find BAPS and TESS more straightforward. In this sense, they
are good approaches to optimize a number of "clusters", although what you
find cannot really be called clusters (in the STRUCTURE or DAPC
meaning).
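
The personal BIC standard in point 1 could be written down roughly as
follows. This is only a sketch in Python with made-up BIC values, not output
from find.clusters; the function name clear_bic_cutoff and the min_rise
threshold are assumptions chosen for the illustration.

```python
def clear_bic_cutoff(bic_values, min_rise=2.0):
    """Return the K at the BIC minimum if the curve clearly rises afterwards, else None.

    bic_values[i] is the BIC for K = i + 1 clusters. "Clearly rises" is taken
    here to mean: at least two larger K were tested, and each of them has a BIC
    at least min_rise above the minimum. Both conditions are illustrative
    assumptions, not a rule from adegenet.
    """
    best = min(range(len(bic_values)), key=bic_values.__getitem__)
    later = bic_values[best + 1:]
    if len(later) < 2:
        return None  # minimum sits at (or next to) the largest K tested
    if all(value - bic_values[best] >= min_rise for value in later):
        return best + 1
    return None


# Made-up curves, for illustration only.
elbow_curve = [410.0, 402.5, 398.0, 401.0, 405.5, 411.0]
flat_curve = [410.0, 405.0, 402.0, 400.5, 399.8, 399.5]

print(clear_bic_cutoff(elbow_curve))  # K with a clear cut-off, if any
print(clear_bic_cutoff(flat_curve))   # None when the curve only flattens out
```

A curve that keeps decreasing or merely flattens gives None, which matches
the idea of not picking any "best" K in that situation.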

It took me a while to learn how to manage the sense of panic/disorientation
provoked by the absence of a best clustering in some genetic datasets, but
afterwards I even developed a preference for gradients, although I admit
clusters are very useful.

Hope this is of some use
Best wishes

Valeria


On 8 September 2013 15:35, Jutta Geismar <Jutta.Geismar at senckenberg.de> wrote:

> Dear Valeria,
>
> thank you very much for your quick answer. I'm aware of the problems
> STRUCTURE has in analyzing genetic data from continuous populations (see
> also
> http://onlinelibrary.wiley.com/doi/10.1111/j.1365-2664.2008.01606.x/pdf).
> That is one reason I don't want to use STRUCTURE as the only cluster
> analysis. I haven't tried BAPS yet, but I gave GENELAND a try to include
> spatial information. Besides testing for IBD with a Mantel test, I also
> modified the geographic distances by resistance values etc. that I
> inferred from an SDM. A spatial autocorrelation analysis didn't show a
> clear pattern of spatial relation (also in different distance classes).
> A PCA shows a big cloud around the center point. Each of the first two
> axes explained about 19 % of the variance.
>
> Thanks for confirming that my DAPC script is correct. I set the maximum
> number of clusters to 50 so as not to miss any structural shifts.
>
> Nonetheless, I cannot explain the contrary results of STRUCTURE indicating
> a panmictic population (4 parallel stripes) and DAPC assigning most
> individuals to one specific cluster.
>
> Thanks again for your comments. I will have a look at BAPS.
>
> Best wishes,
>
> Jutta

From Frederik.VandenBroeck at bio.kuleuven.be  Mon Sep  9 13:33:43 2013
From: Frederik.VandenBroeck at bio.kuleuven.be (Frederik Van den Broeck)
Date: Mon, 9 Sep 2013 11:33:43 +0000
Subject: [adegenet-forum] adegenet-forum Digest, Vol 61, Issue 4
In-Reply-To: <mailman.21.1378720818.6733.adegenet-forum@lists.r-forge.r-project.org>
References: <mailman.21.1378720818.6733.adegenet-forum@lists.r-forge.r-project.org>
Message-ID: <02E355FCF1052B4B9BBB570769EDF76F10497DB2@ICTS-S-MBX13.luna.kuleuven.be>

Dear Jutta,

Have you already tried individual-based distance methods (which I prefer in most cases), such as the inverse proportion of shared alleles or Euclidean distances? Have you tried a PCA? All of this can be done quickly in adegenet and will give you major insight into the structure of your data. Another program for studying genetic structure that I also like a lot is SPAGeDi (http://ebe.ulb.ac.be/ebe/SPAGeDi.html).
I know this doesn't answer your questions, but I merely wanted to mention some alternatives to cluster analysis that could also give you insight into population structure.

Kind regards
Frederik


From Jutta.Geismar at senckenberg.de  Sun Sep  8 15:35:16 2013
From: Jutta.Geismar at senckenberg.de (Jutta Geismar)
Date: Sun, 08 Sep 2013 15:35:16 +0200
Subject: [adegenet-forum] Antw: Re: Question about genetic structure in
 admixed populations
In-Reply-To: <CADEmh=vJH1LeFd-QPe5BfaueDBaGGGR9fNBb3BCDv70Jo7P2=w@mail.gmail.com>
References: <52274BC7020000CB0000539A@snggwia.senckenberg.de>
 <CADEmh=vJH1LeFd-QPe5BfaueDBaGGGR9fNBb3BCDv70Jo7P2=w@mail.gmail.com>
Message-ID: <522C9934020000CB000053D5@snggwia.senckenberg.de>


Dear Valeria,
 
thank you very much for your quick answer. I'm aware of the problems
STRUCTURE has in analyzing genetic data from continuous populations (see
also
http://onlinelibrary.wiley.com/doi/10.1111/j.1365-2664.2008.01606.x/pdf).
That is one reason I don't want to use STRUCTURE as the only cluster
analysis. I haven't tried BAPS yet, but I gave GENELAND a try to include
spatial information. Besides testing for IBD with a Mantel test, I also
modified the geographic distances by resistance values etc. that I
inferred from an SDM. A spatial autocorrelation analysis didn't show a
clear pattern of spatial relation (also in different distance classes).
A PCA shows a big cloud around the center point. Each of the first two
axes explained about 19 % of the variance.

Thanks for confirming that my DAPC script is correct. I set the maximum
number of clusters to 50 so as not to miss any structural shifts.

Nonetheless, I cannot explain the contrary results of STRUCTURE
indicating a panmictic population (4 parallel stripes) and DAPC
assigning most individuals to one specific cluster.

Thanks again for your comments. I will have a look at BAPS.

Best wishes,
Jutta

From Jutta.Geismar at senckenberg.de  Wed Sep 11 12:05:58 2013
From: Jutta.Geismar at senckenberg.de (Jutta Geismar)
Date: Wed, 11 Sep 2013 12:05:58 +0200
Subject: [adegenet-forum] Digest, Vol 61, Issue 4
Message-ID: <52305CA6020000CB00005435@snggwia.senckenberg.de>

Dear Frederik,
 
thank you for your comments.
I did a PCA based on individual genetic distances, which showed a big cloud of points; I mentioned it in my last answer. Did you mean I should try it with genetic relatedness? Which other individual-based distance methods do you have in mind?
I also worked with SPAGeDi, but the results were more or less the same as those I got with a spatial autocorrelation analysis, which was to be expected.
Since these analyses gave me no clear information, I hoped to find more explicit structure information with a cluster approach.
 
Kind regards
Jutta

From t.jombart at imperial.ac.uk  Wed Sep 11 17:58:52 2013
From: t.jombart at imperial.ac.uk (Jombart, Thibaut)
Date: Wed, 11 Sep 2013 15:58:52 +0000
Subject: [adegenet-forum] $li in sPCA analysis
In-Reply-To: <DE5F03D8-75DE-4D8D-B699-C151F77C0519@postgrad.manchester.ac.uk>
References: <CAHPXwHdpJ4bk4G8UBAZenwim-aURxKpvWQXpuBmNb-qCp_aevQ@mail.gmail.com>
 <2CB2DA8E426F3541AB1907F98ABA6570638B5234@icexch-m1.ic.ac.uk>,
 <CAHPXwHdx=VdgVm9PyqSTokjMDyKBGME82HoRONTX_u4EVK86Ew@mail.gmail.com>
 <2CB2DA8E426F3541AB1907F98ABA6570638B5287@icexch-m1.ic.ac.uk>,
 <DE5F03D8-75DE-4D8D-B699-C151F77C0519@postgrad.manchester.ac.uk>
Message-ID: <2CB2DA8E426F3541AB1907F98ABA6570638DF333@icexch-m1.ic.ac.uk>

Hello, 

the values in $li have arbitrary signs. They are simply scores synthesizing the spatial structures in the data (linear combinations of variables optimizing the variance and Moran's I).
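
The sign point can be checked on a toy example. The sketch below is plain
Python, not adegenet code: the 2x2 matrix, the small centred data table and
the helper names are all made up for the illustration. It only relies on the
general fact that if v is an eigenvector then -v is one too, with the same
eigenvalue, so the scores computed from it can flip sign without the
analysis changing.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def is_eigenvector(matrix, vector, eigenvalue, tol=1e-12):
    """Check matrix * vector == eigenvalue * vector, up to a small tolerance."""
    product = matvec(matrix, vector)
    return all(abs(p - eigenvalue * x) <= tol for p, x in zip(product, vector))


# Toy symmetric matrix standing in for the matrix that is diagonalised.
A = [[2.0, 1.0], [1.0, 2.0]]
v = [1 / math.sqrt(2), 1 / math.sqrt(2)]  # unit eigenvector, eigenvalue 3
v_flipped = [-x for x in v]               # equally valid eigenvector

# Toy centred data: 3 sites (rows) by 2 variables (columns).
data = [[0.5, 1.5], [-1.0, 0.0], [0.5, -1.5]]
scores = matvec(data, v)                  # plays the role of the scores
scores_flipped = matvec(data, v_flipped)  # same pattern, opposite signs

print(is_eigenvector(A, v, 3.0), is_eigenvector(A, v_flipped, 3.0))
print(scores)
print(scores_flipped)
```

Only the contrast between sites is meaningful here: which sites fall on
the same side and how far apart they are, not which side is labelled
positive.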

Cheers
Thibaut

--
######################################
Dr Thibaut JOMBART
MRC Centre for Outbreak Analysis and Modelling
Department of Infectious Disease Epidemiology
Imperial College - School of Public Health
St Mary's Campus
Norfolk Place
London W2 1PG
United Kingdom
Tel. : 0044 (0)20 7594 3658
t.jombart at imperial.ac.uk
http://sites.google.com/site/thibautjombart/
http://adegenet.r-forge.r-project.org/
________________________________________
From: adegenet-forum-bounces at lists.r-forge.r-project.org [adegenet-forum-bounces at lists.r-forge.r-project.org] on behalf of Nathan Truelove [nathan.truelove at manchester.ac.uk]
Sent: 03 September 2013 13:44
To: adegenet-forum at lists.r-forge.r-project.org
Subject: Re: [adegenet-forum] $li in sPCA analysis

Hi Adegenet Forum,

Thanks in advance to anyone who has some advice to share with the forum on SPCA. If you're in a rush just read the parts in bold.

I've been using SPCA to look at spatial genetic patterns among lobster populations. I found positive local structure with the function local.rtest and no global structure using global.rtest. I've followed Thibaut's advice in his previous sPCA email to the forum and used $li to interpret local structure. I selected the local eigenvalue that had the highest levels of negative spatial autocorrelation and genetic variance for interpretation using the screeplot function. The $li values from this eigenvalue were then used to create an interpolated map.

My question for the forum is: What do the positive and negative $li values associated with the local eigenvalue mean? Do they correspond to levels of local (positive) and global (negative) scores at each location? Or are the $li values associated with the local eigenvalues simply a score for detecting local spatial genetic structure among sites and have nothing to do with global structure?

Best Wishes,

Nate

On Aug 11, 2013, at 4:35 PM, Jombart, Thibaut wrote:


Hello,

I think you attached the wrong file.

Negative values and local structure are not related. Local structure = sharp differences between neighbours. These would be overlooked by the lagged vector.

If the structure is clear enough, use $li.

As you have many overlapping points, s.value is suboptimal. You should consider using the colorplot, or interpolated maps. See the tutorial on sPCA for some example:
http://cran.r-project.org/web/packages/adegenet/vignettes/adegenet-spca.pdf

Best
Thibaut
________________________________________
From: dooshra at gmail.com [dooshra at gmail.com] on behalf of Hanan Sela [hans at tauex.tau.ac.il]
Sent: 11 August 2013 12:19
To: Jombart, Thibaut
Subject: Re: [adegenet-forum] li vs. ls in sPCA analysis

Hello Thibaut,
Thank you for the response.
In the file I have attached I see that with the $li variable there are no negative values in the southern sites, while with the $ls values there are negative values in the south. It seems that I see more local spatial structure with $ls than with $li. When I tested the data with the local test I got significant results. Which variable is better to present in a paper?
Thank you
Hanan
Mr. Hanan Sela Ph.D.
Curator of the Lieberman Cereal Germplasm Bank
The Institute for Cereal Crops Improvement
Tel-Aviv University
P.O. Box 39040
Tel Aviv 69978
Israel

hans at tauex.tau.ac.il
Phone: 972-3-6405773
Cell: 972-50-5727458 , local U.S 17203600603
Fax: 972-3-6407857


On Sun, Aug 11, 2013 at 12:37 PM, Jombart, Thibaut <t.jombart at imperial.ac.uk> wrote:
Hello,

the lagged vector is the spatially weighted average of the original vector. That is, the value of the score at a given location is the weighted average of the neighbouring values. This basically smooths the patterns so that they can be detected / visualized more easily.
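The weighted average described above can be sketched in a few lines of base R. This is only a toy illustration: the five sites on a line, their scores, and the row-standardised neighbour weights are all assumptions made up for the example, not values taken from an actual sPCA object.

```r
# Toy example: five sites along a line, each connected to its immediate
# neighbours. The scores are hypothetical stand-ins for one column of $li.
scores <- c(1, 2, 3, 4, 5)

# Binary neighbour matrix for the chain 1-2-3-4-5.
W <- matrix(0, nrow = 5, ncol = 5)
for (i in 1:4) {
  W[i, i + 1] <- 1
  W[i + 1, i] <- 1
}

# Row-standardise the weights so that each row sums to 1 (an assumption
# of this sketch).
W <- W / rowSums(W)

# Lagged scores: at each site, the weighted average of the neighbouring values.
lagged <- as.vector(W %*% scores)
print(lagged)
```

Each lagged value depends only on the neighbours of a site, not on the site's own score, which is why a sharp difference between a site and its neighbours does not show up in the lagged vector.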

Cheers
Thibaut.

--
######################################
Dr Thibaut JOMBART
MRC Centre for Outbreak Analysis and Modelling
Department of Infectious Disease Epidemiology
Imperial College - School of Public Health
St Mary's Campus
Norfolk Place
London W2 1PG
United Kingdom
Tel. : 0044 (0)20 7594 3658
t.jombart at imperial.ac.uk
http://sites.google.com/site/thibautjombart/
http://adegenet.r-forge.r-project.org/
________________________________________
From: adegenet-forum-bounces at lists.r-forge.r-project.org [adegenet-forum-bounces at lists.r-forge.r-project.org] on behalf of Hanan Sela [hans at tauex.tau.ac.il]
Sent: 11 August 2013 06:21
To: adegenet-forum at lists.r-forge.r-project.org
Subject: [adegenet-forum] li vs. ls in sPCA analysis

Hello
I have plotted the first PC of an sPCA analysis using s.value, once with z=my.pca$li[,1]
and once with z=my.pca$ls[,1]. The patterns seem to differ (see attached file). I do not understand what the lagged PC represents. What is the meaning of "denoisified" in the practical day presentation (Google does not know)? How do I interpret the difference? Please explain.
Thank you

Mr. Hanan Sela Ph.D.
Curator of the Lieberman Cereal Germplasm Bank
The Institute for Cereal Crops Improvement
Tel-Aviv University
P.O. Box 39040
Tel Aviv 69978
Israel

hans at tauex.tau.ac.il
Phone: 972-3-6405773
Cell: 972-50-5727458 , local U.S 17203600603
Fax: 972-3-6407857


On Thu, Aug 1, 2013 at 7:15 PM, <adegenet-forum-request at lists.r-forge.r-project.org> wrote:

Today's Topics:

  1. Fwd: Question about pre-processing of SNP data for machine
     learning (Daniel Murrell)
  2. Re: Fwd: Question about pre-processing of SNP data for
     machine learning (Jombart, Thibaut)
  3. Re: Fwd: Question about pre-processing of SNP data for
     machine learning (Daniel Murrell)


----------------------------------------------------------------------

Message: 1
Date: Thu, 1 Aug 2013 15:26:00 +0100
From: Daniel Murrell <dsm38 at cam.ac.uk>
To: adegenet-forum at lists.r-forge.r-project.org
Subject: [adegenet-forum] Fwd: Question about pre-processing of SNP
       data for machine learning
Message-ID:
       <CADK=3HwmiEO5v6fCQUYNkHFQ520avQJ9LFOAdu=Yu-Z+8h7BCg at mail.gmail.com>
Content-Type: text/plain; charset="windows-1252"

Hi All

This is my first time using adegenet. I'm trying to perform some
pre-processing on 1.3M SNPs (~800 individuals) so that I can use them for a
machine learning task. My data was stored in a format which had to be
converted to a genlight object. The data was split so that the information
for the SNPs in each chromosome was in a separate file. I've read each file
in, converted that to a genlight object and then concatenated the genlight
objects using cbind. All of that seems to work ok (except the position and
chromosome data went back to NULL during the concatenation and I had to
reset it on the combined genlight object).

So, now I want to do my own processing on each SNP and when I try to access
the information for this SNP over the 800 individuals, it takes ages to
extract. Is this because the encoding is done row wise, and so the whole
object needs to be decoded for me to get out the information I require? Is
there a way to transpose this genlight object so that I can access the data
for a single SNP across all individuals quickly?

Thank you
Daniel

---------- Forwarded message ----------
From: Jombart, Thibaut <t.jombart at imperial.ac.uk>
Date: Fri, Jul 19, 2013 at 4:27 PM
Subject: RE: Question about pre-processing of SNP data for machine learning
To: Daniel Murrell <dsm38 at cam.ac.uk>


Dear Daniel,

yes, adegenet is designed for that kind of task. Please look at the
tutorial on adegenet-basics where you'll find examples of dimension
reduction for SNP data, to be found on:
http://adegenet.r-forge.r-project.org/

Don't hesitate to use the adegenet-forum for further questions (see
contacts on the website).
Best
Thibaut

--
######################################
Dr Thibaut JOMBART
MRC Centre for Outbreak Analysis and Modelling
Department of Infectious Disease Epidemiology
Imperial College - School of Public Health
St Mary's Campus
Norfolk Place
London W2 1PG
United Kingdom
Tel. : 0044 (0)20 7594 3658
t.jombart at imperial.ac.uk
http://sites.google.com/site/thibautjombart/
http://adegenet.r-forge.r-project.org/
________________________________________
From: dsmurrell at gmail.com [dsmurrell at gmail.com] on behalf of Daniel Murrell [dsm38 at cam.ac.uk]
Sent: 19 July 2013 16:23
To: Jombart, Thibaut
Subject: Question about pre-processing of SNP data for machine learning

Dear Thibaut

I'm trying to build a model that uses SNP data as input. The problem I have
is that there is too much of it and I need a way to reduce the number or
the dimensionality of the data points so that I can use them as input to
machine learning algorithms (genome wide, 1.3 million SNPs, 800
individuals). I've done some searching and found this paper:
http://www.ncbi.nlm.nih.gov/pubmed/18076475 (pdf attached).

I also found your adegenet package and wondered if it's designed for doing
something like this? I'm not from this field and I'm having some trouble
working this out. Can you point me to anything that might help?

I'm not sure whether I should be keeping a subset of SNPs and how to find
that subset from the 1.3 million, or whether I should be reducing the
dimensionality.

Thank you
Daniel

------------------------------

Message: 2
Date: Thu, 1 Aug 2013 15:22:27 +0000
From: "Jombart, Thibaut" <t.jombart at imperial.ac.uk>
To: Daniel Murrell <dsm38 at cam.ac.uk>,
       "adegenet-forum at lists.r-forge.r-project.org"
       <adegenet-forum at lists.r-forge.r-project.org>
Subject: Re: [adegenet-forum] Fwd: Question about pre-processing of
       SNP data for machine learning
Message-ID:
       <2CB2DA8E426F3541AB1907F98ABA6570638ABF4F at icexch-m1.ic.ac.uk>
Content-Type: text/plain; charset="Windows-1252"


Dear Daniel,

the loss of attributes after cbind indeed is a glitch. Would you mind creating a ticket about it?
https://sourceforge.net/p/adegenet/tickets/

You're right about the issue. The encoding is indeed done row-wise, so the conversion is done many times over when SNPs are accessed one at a time. There's no option for transposing the data, but one solution would be converting your data to integers by blocks, so that conversion takes place less often while still keeping RAM requirements reasonable.
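The block-wise idea could look roughly like the sketch below. It is not adegenet code: `process_by_block`, its arguments and the toy data are hypothetical names introduced for illustration, and a plain integer matrix stands in for the genlight object. With real data one would presumably pass the genlight object and a decoder such as `as.matrix`, with the number of SNPs taken from `nLoc()`; that combination is an assumption and is not tested here.

```r
# Apply a per-SNP function block by block, decoding each block only once.
# x          : individuals x SNPs object that supports x[, idx] (here a matrix)
# block_size : number of SNPs decoded at a time (trade-off: speed vs RAM)
# per_snp    : function applied to the vector of genotypes of one SNP
# decode     : function turning a block into an ordinary matrix (assumed)
# n_snp      : total number of SNPs
process_by_block <- function(x, block_size, per_snp,
                             decode = as.matrix, n_snp = ncol(x)) {
  starts <- seq(1, n_snp, by = block_size)
  out <- vector("list", length(starts))
  for (b in seq_along(starts)) {
    idx <- starts[b]:min(starts[b] + block_size - 1, n_snp)
    block <- decode(x[, idx, drop = FALSE])   # one conversion per block
    out[[b]] <- apply(block, 2, per_snp)      # then cheap column access
  }
  unlist(out)
}

# Toy stand-in data: 3 individuals x 5 SNPs coded as 0/1/2.
geno <- matrix(c(0, 1, 2,
                 1, 1, 1,
                 2, 0, 2,
                 0, 0, 1,
                 2, 2, 2), nrow = 3)

snp_means <- process_by_block(geno, block_size = 2, per_snp = mean)
print(snp_means)
```

Larger blocks mean fewer conversions but more memory held at once, so the block size is the knob to tune.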

All the best

Thibaut


------------------------------

Message: 3
Date: Thu, 1 Aug 2013 17:14:37 +0100
From: Daniel Murrell <dsm38 at cam.ac.uk>
To: "Jombart, Thibaut" <t.jombart at imperial.ac.uk>
Cc: "adegenet-forum at lists.r-forge.r-project.org"
       <adegenet-forum at lists.r-forge.r-project.org>
Subject: Re: [adegenet-forum] Fwd: Question about pre-processing of
       SNP data for machine learning
Message-ID:
       <CADK=3Hz=iJSJePuCOSwCkFOQUWHQyAmk+YS=-qWD+EO5vOBihA at mail.gmail.com>
Content-Type: text/plain; charset="windows-1252"

Dear Thibaut

Ok, I could try that. I could also try to use the genlight object in a
transposed manner, just for the purposes of holding the data, so that I can
access individual SNPs easily. I mean that nothing else would work except the
containment.

Thanks for the help
Regards
Daniel

------------------------------

_______________________________________________
adegenet-forum mailing list
adegenet-forum at lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/adegenet-forum

End of adegenet-forum Digest, Vol 60, Issue 2
*********************************************


From danica_714 at hotmail.com  Wed Sep 18 12:03:30 2013
From: danica_714 at hotmail.com (Danica Fabrigar)
Date: Wed, 18 Sep 2013 11:03:30 +0100
Subject: [adegenet-forum] help with scaleGEN
Message-ID: <DUB124-W46E00239D27D4EB5C91F38A2200@phx.gbl>

Hi adegenet users,
 
I am having some trouble understanding how scaleGen is supposed to be used when plotting a PCA.

I get very different results when running the following two sets of commands (note: "scale=FALSE" is omitted in the second call to scaleGen):

A)
obj <- scaleGen(mosquitoind, scale=FALSE, missing="mean")
pca.obj <- dudi.pca(obj, center=FALSE, scale=FALSE, scannf=FALSE, nf=3)

B)
# scale is left at its default in this call
obj2 <- scaleGen(mosquitoind, missing="mean")
pca.obj2 <- dudi.pca(obj2, center=FALSE, scale=FALSE, scannf=FALSE, nf=3)


I guess my question is, what is the appropriate way of using scaleGen if I want to scale my missing data to the mean allele frequency?


Thanks in advance,
Danica

From t.jombart at imperial.ac.uk  Wed Sep 18 16:53:53 2013
From: t.jombart at imperial.ac.uk (Jombart, Thibaut)
Date: Wed, 18 Sep 2013 14:53:53 +0000
Subject: [adegenet-forum] help with scaleGEN
In-Reply-To: <DUB124-W46E00239D27D4EB5C91F38A2200@phx.gbl>
References: <DUB124-W46E00239D27D4EB5C91F38A2200@phx.gbl>
Message-ID: <2CB2DA8E426F3541AB1907F98ABA6570638EF17F@icexch-m1.ic.ac.uk>

Hello, 

I think some clarification should help here.

"Scaling" means transforming a variable so that its variance is 1. It is usually used to remove the effects of variances that are inherently different across a bunch of variables (typically because of different units). In genetics, most of the time, I think scaling is a bad idea: all variables have the same unit, and differences in variances are probably meaningful.

missing="mean" refers to the procedure for replacing missing data. They are set to the origin, which is the mean of the corresponding allele frequencies (typically the 'non-informative' point in PCA).
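The two notions can be told apart on a toy table using base R only. This is a sketch under stated assumptions: the allele frequencies are made up, and base `scale()` stands in for the centring/scaling step, so the exact variance denominator may differ from what scaleGen uses internally.

```r
# Toy allele-frequency table: 4 individuals x 2 alleles, one missing value.
freq <- matrix(c(0, 0.5, 1,   NA,
                 1, 0.5, 0.5, 0),
               nrow = 4,
               dimnames = list(NULL, c("a1", "a2")))

# Missing data: replace each NA by the mean frequency of its allele (column).
filled <- apply(freq, 2, function(v) {
  v[is.na(v)] <- mean(v, na.rm = TRUE)
  v
})

# Centring only: column means become 0, variances are left untouched.
centred <- scale(filled, center = TRUE, scale = FALSE)

# Centring and scaling: each column is also divided by its standard
# deviation, so every allele ends up with variance 1.
scaled <- scale(filled, center = TRUE, scale = TRUE)

print(centred[4, "a1"])      # the replaced value sits at the origin
print(apply(centred, 2, var))
print(apply(scaled, 2, var))
```

The replaced value ends up at 0 after centring whether or not the columns are scaled, so the handling of missing data and the choice of scaling are independent decisions.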

Best
Thibaut



From danica_714 at hotmail.com  Thu Sep 19 10:57:46 2013
From: danica_714 at hotmail.com (Danica Fabrigar)
Date: Thu, 19 Sep 2013 09:57:46 +0100
Subject: [adegenet-forum] help with scaleGEN
In-Reply-To: <2CB2DA8E426F3541AB1907F98ABA6570638EF17F@icexch-m1.ic.ac.uk>
References: <DUB124-W46E00239D27D4EB5C91F38A2200@phx.gbl>,
 <2CB2DA8E426F3541AB1907F98ABA6570638EF17F@icexch-m1.ic.ac.uk>
Message-ID: <DUB124-W33EA2B446AFDDAB9D22837A2210@phx.gbl>

Hi Thibaut,
Thank you for the clarification. I got confused myself there.
What you've said makes a lot of sense. Are there cases in genetics in which scaling would be a good idea?

Regards,
Danica



From t.jombart at imperial.ac.uk  Thu Sep 19 13:41:06 2013
From: t.jombart at imperial.ac.uk (Jombart, Thibaut)
Date: Thu, 19 Sep 2013 11:41:06 +0000
Subject: [adegenet-forum] help with scaleGEN
In-Reply-To: <DUB124-W33EA2B446AFDDAB9D22837A2210@phx.gbl>
References: <DUB124-W46E00239D27D4EB5C91F38A2200@phx.gbl>,
 <2CB2DA8E426F3541AB1907F98ABA6570638EF17F@icexch-m1.ic.ac.uk>,
 <DUB124-W33EA2B446AFDDAB9D22837A2210@phx.gbl>
Message-ID: <2CB2DA8E426F3541AB1907F98ABA6570638EF5A0@icexch-m1.ic.ac.uk>

I haven't seen many, but one can think of a few cases, yes. 

In multiallelic markers such as microsatellites, one may want to give the same 'weight' to each marker, and thus use a scaling such that the total variance (i.e. summed over alleles) is the same for all markers. But this is already a bit different from standardizing alleles, at least in practice (on a theoretical level, the procedure is nearly identical: we divide vectors/matrices by their norm).

Same idea could apply to SNPs of different genes. 
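A minimal base R sketch of this per-marker scaling is given below. Everything in it is an assumption for illustration: the data are random numbers rather than real allele frequencies, and the `marker` vector assigning columns to markers is hypothetical.

```r
# Toy centred table: 10 individuals x 5 allele columns (random stand-in data).
set.seed(1)
X <- scale(matrix(rnorm(50), nrow = 10), center = TRUE, scale = FALSE)

# Hypothetical assignment of columns to markers: m1 has 2 alleles, m2 has 3.
marker <- c("m1", "m1", "m2", "m2", "m2")

# Total variance of each marker: sum of the variances of its alleles.
allele_var <- apply(X, 2, var)
marker_var <- tapply(allele_var, marker, sum)

# Divide every allele column by the square root of its marker's total variance.
divisor <- as.vector(sqrt(marker_var[marker]))
X_scaled <- sweep(X, 2, divisor, "/")

# After scaling, every marker contributes the same total variance (1),
# while the relative variances of alleles within a marker are preserved.
total_after <- tapply(apply(X_scaled, 2, var), marker, sum)
print(total_after)
```

Unlike standardizing each allele to unit variance, this keeps the differences in variance between alleles of the same marker.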

Cheers
Thibaut

________________________________________
From: Danica Fabrigar [danica_714 at hotmail.com]
Sent: 19 September 2013 09:57
To: Jombart, Thibaut; adegenet-forum at lists.r-forge.r-project.org
Subject: RE: [adegenet-forum] help with scaleGEN

Hi Thibaut,

Thank you for the clarification. I got confused myself there.

What you've said makes a lot of sense. Are there cases in genetics in which scaling would be a good idea?


Regards,
Danica


 ________________________________________
> From: t.jombart at imperial.ac.uk
> To: danica_714 at hotmail.com; adegenet-forum at lists.r-forge.r-project.org
> Subject: RE: [adegenet-forum] help with scaleGEN
> Date: Wed, 18 Sep 2013 14:53:53 +0000
>
> Hello,
>
> I think some clarification should help here.
>
> "scaling" means transforming a variable so that its variance is 1. It is usually used to remove the effects of variances that differ inherently across a set of variables (typically because of different units). In genetics, most of the time, I think scaling is a bad idea: all variables have the same unit, and differences in variances are probably meaningful.
>
> missing="mean" refers to the procedure for replacing missing data. They are set to the origin, which is the mean of the corresponding allele frequencies (typically the 'non-informative' point in PCA).
>
> Best
> Thibaut
>
>
> ________________________________________
> From: adegenet-forum-bounces at lists.r-forge.r-project.org [adegenet-forum-bounces at lists.r-forge.r-project.org] on behalf of Danica Fabrigar [danica_714 at hotmail.com]
> Sent: 18 September 2013 11:03
> To: adegenet-forum at lists.r-forge.r-project.org
> Subject: [adegenet-forum] help with scaleGEN
>
> Hi adegenet users,
>
> I am having some trouble interpreting how scaleGen is supposed to be used when plotting a PCA.
>
> I get very different results when running the following two commands (note: "scale=FALSE" is omitted in the second object):
>
> A)
> obj <- scaleGen(mosquitoind, scale=FALSE, missing="mean")
> pca.obj <- dudi.pca(obj,cent=FALSE,scale=FALSE,scannf=FALSE,nf=3)
>
> B)
> obj2 <- scaleGen(mosquitoind, missing="mean")
> pca.obj2 <- dudi.pca(obj2,cent=FALSE,scale=FALSE,scannf=FALSE,nf=3)
>
>
> I guess my question is: what is the appropriate way of using scaleGen if I want to scale my missing data to the mean allele frequency?
>
>
> Thanks in advance,
> Danica
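
The difference between the two calls quoted above can be sketched in base R, without adegenet. This is only an illustration under stated assumptions: the toy matrix and all object names are made up, and base scale() may not use exactly the same variance denominator as scaleGen. It shows that replacing missing values by the column mean is a separate step from scaling: the imputed value lands on the origin after centring, whether or not the columns are then standardized.

```r
# Toy allele-frequency matrix with one missing value (values are made up)
x <- cbind(
  a1 = c(1.0, 0.5, 0.0, 0.5, NA),
  a2 = c(0.0, 0.0, 0.5, 0.0, 0.5),
  a3 = c(1.0, 1.0, 1.0, 0.5, 1.0)
)

# Replace each missing value by the mean frequency of its allele (column)
col.means <- colMeans(x, na.rm = TRUE)
filled <- x
for (j in seq_len(ncol(x))) {
  filled[is.na(x[, j]), j] <- col.means[j]
}

# A) centre only: columns keep their original variances
centred <- scale(filled, center = TRUE, scale = FALSE)

# B) centre and scale: every column is forced to variance 1
standardised <- scale(filled, center = TRUE, scale = TRUE)

print(centred[5, "a1"])             # the imputed entry sits at the origin
print(apply(centred, 2, var))       # variances differ between alleles
print(apply(standardised, 2, var))  # variances are all 1
```

A PCA run on `centred` and one run on `standardised` will generally differ, because in the second case low-variance alleles are given as much weight as high-variance ones.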