[adegenet-forum] $li in sPCA analysis

Wed Sep 11 17:58:52 CEST 2013

Hello, 

the values in $li have arbitrary signs. They are simply scores synthesizing the spatial structures in the data (linear combinations of variables optimizing the variance and Moran's I).
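The arbitrary sign can be illustrated with a small sketch. It is not adegenet code: the 2 x 2 matrix, the data and all names below are made up for illustration. It only shows that if an axis is an eigenvector, so is its negative, and that the two resulting sets of scores are mirror images with the same variance.

```python
def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)


# Toy symmetric matrix standing in for the matrix diagonalised by the analysis.
a = [[2.0, 1.0], [1.0, 2.0]]
eigenvalue = 3.0
axis = [1.0, 1.0]                    # an eigenvector of `a` for eigenvalue 3
flipped_axis = [-x for x in axis]    # the same axis with the opposite sign

# Both v and -v satisfy a v = lambda v, so software may return either one.
axis_ok = matvec(a, axis) == [eigenvalue * x for x in axis]
flipped_ok = matvec(a, flipped_axis) == [eigenvalue * x for x in flipped_axis]

# Hypothetical centred data: 4 individuals x 2 variables.
data = [[-1.5, -0.5], [-0.5, -1.0], [0.5, 1.0], [1.5, 0.5]]
scores = matvec(data, axis)                  # plays the role of $li[, 1]
flipped_scores = matvec(data, flipped_axis)  # same structure, signs reversed

print(scores, flipped_scores)
print(variance(scores), variance(flipped_scores))
```

Only the contrast between sites carries information here; which group ends up positive is a matter of convention.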

Cheers
Thibaut

--
######################################
Dr Thibaut JOMBART
MRC Centre for Outbreak Analysis and Modelling
Department of Infectious Disease Epidemiology
Imperial College - School of Public Health
St Mary’s Campus
Norfolk Place
London W2 1PG
United Kingdom
Tel. : 0044 (0)20 7594 3658
t.jombart at imperial.ac.uk
http://sites.google.com/site/thibautjombart/
http://adegenet.r-forge.r-project.org/
________________________________________
From: adegenet-forum-bounces at lists.r-forge.r-project.org [adegenet-forum-bounces at lists.r-forge.r-project.org] on behalf of Nathan Truelove [nathan.truelove at manchester.ac.uk]
Sent: 03 September 2013 13:44
To: adegenet-forum at lists.r-forge.r-project.org
Subject: Re: [adegenet-forum] $li in sPCA analysis

Hi Adegenet Forum,

Thanks in advance to anyone who has some advice to share with the forum on sPCA. If you're in a rush just read the parts in bold.

I've been using sPCA to look at spatial genetic patterns among lobster populations. I found positive local structure with the function local.rtest and no global structure using global.rtest. I've followed Thibaut's advice in his previous sPCA email to the forum and used $li to interpret local structure. I selected the local eigenvalue that had the highest levels of negative spatial autocorrelation and genetic variance for interpretation using the screeplot function. The $li values from this eigenvalue were then used to create an interpolated map.

My question for the forum is: What do the positive and negative $li values associated with the local eigenvalue mean? Do they correspond to levels of local (positive) and global (negative) scores at each location? Or are the $li values associated with the local eigenvalues simply a score for detecting local spatial genetic structure among sites and have nothing to do with global structure?

Best Wishes,

Nate

On Aug 11, 2013, at 4:35 PM, Jombart, Thibaut wrote:

Hello,

I think you attached the wrong file.

Negative values and local structure are not related. Local structure = sharp differences between neighbours. These would be overlooked by the lagged vector.
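To make the distinction concrete, here is a rough sketch using Moran's I with binary neighbour weights on a made-up chain of six sites. The function names, the chain layout and the score vectors are all assumptions for illustration, not adegenet output. A smooth gradient gives positive autocorrelation, an alternating pattern (sharp differences between neighbours) gives negative autocorrelation, and reversing the sign of the scores changes neither.

```python
def morans_i(values, neighbours):
    """Moran's I with binary weights (w_ij = 1 if j is a neighbour of i)."""
    n = len(values)
    mean = sum(values) / n
    dev = [x - mean for x in values]
    # Sum of cross-products over all (site, neighbour) pairs.
    cross = sum(dev[i] * dev[j] for i in range(n) for j in neighbours[i])
    total_weight = sum(len(nb) for nb in neighbours)
    return (n / total_weight) * cross / sum(d * d for d in dev)


def chain_neighbours(n):
    """Sites on a line, each connected to its immediate left/right neighbour."""
    return [[j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)]


nb = chain_neighbours(6)
gradient = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]         # global-type pattern
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]   # local-type pattern
flipped = [-x for x in alternating]               # same pattern, signs reversed

i_gradient = morans_i(gradient, nb)
i_alternating = morans_i(alternating, nb)
i_flipped = morans_i(flipped, nb)

print(i_gradient, i_alternating, i_flipped)
```

The sign of Moran's I separates global from local patterns; the sign of an individual score does not.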

If the structure is clear enough, use $li.

As you have many overlapping points, s.value is suboptimal. You should consider using the colorplot, or interpolated maps. See the tutorial on sPCA for some examples:
http://cran.r-project.org/web/packages/adegenet/vignettes/adegenet-spca.pdf

Best
Thibaut
________________________________________
From: dooshra at gmail.com [dooshra at gmail.com] on behalf of Hanan Sela [hans at tauex.tau.ac.il]
Sent: 11 August 2013 12:19
To: Jombart, Thibaut
Subject: Re: [adegenet-forum] li vs. ls in sPCA analysis

Hello Thibaut,
Thank you for the response.
In the file I have attached I see that with the $li variable there are no negative values in the southern sites, while with the $ls values there are negative values in the south. It seems that I see more local spatial structure with $ls than with $li. When I tested the data with the local test I got significant results. Which variable is better to present in a paper?
Thank you
Hanan
Mr. Hanan Sela Ph.D.
Curator of the Lieberman Cereal Germplasm Bank
The Institute for Cereal Crops Improvement
Tel-Aviv University
P.O. Box 39040
Tel Aviv 69978
Israel

hans at tauex.tau.ac.il
Phone: 972-3-6405773
Cell: 972-50-5727458 , local U.S 17203600603
Fax: 972-3-6407857

On Sun, Aug 11, 2013 at 12:37 PM, Jombart, Thibaut <t.jombart at imperial.ac.uk> wrote:
Hello,

the lagged vector is the spatially weighted average of the original vector. That is, the value of the score at a given location is the weighted average of the neighbouring values. This basically smooths the patterns so that they can be detected / visualized more easily.
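A minimal sketch of that averaging, assuming each neighbour gets equal weight (row-standardised weights) and a site is not its own neighbour. The line of eight sites, the two-step neighbourhood and the score values are invented for illustration. The scores follow a gradient with alternating local noise on top; after lagging, the noise cancels at interior sites and only the gradient remains.

```python
def lagged(values, neighbours):
    """Spatial lag: for each site, the mean of its neighbours' values."""
    return [sum(values[j] for j in nb) / len(nb) for nb in neighbours]


n = 8
# Sites on a line; neighbours are the sites at most two steps away (assumption).
neighbours = [[j for j in range(n) if j != i and abs(i - j) <= 2]
              for i in range(n)]

# Gradient (value = position) plus alternating local noise of +/- 0.5.
scores = [i + (0.5 if i % 2 == 0 else -0.5) for i in range(n)]   # like $li
lag = lagged(scores, neighbours)                                 # like $ls

print(scores)
print(lag)
```

At the interior sites 2 to 5 the lagged values come out as 2.0, 3.0, 4.0 and 5.0, so the smooth trend is easier to see. The same averaging is why sharp differences between neighbours tend to disappear from the lagged scores.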

Cheers
Thibaut.

________________________________________
From: adegenet-forum-bounces at lists.r-forge.r-project.org [adegenet-forum-bounces at lists.r-forge.r-project.org] on behalf of Hanan Sela [hans at tauex.tau.ac.il]
Sent: 11 August 2013 06:21
To: adegenet-forum at lists.r-forge.r-project.org
Subject: [adegenet-forum] li vs. ls in sPCA analysis

Hello
I have plotted the first PC of the sPCA analysis using s.value, once with z=my.pca$li[,1]
and once with z=my.pca$ls[,1]. The patterns seem to differ (see attached file). I do not understand what the lagged PC represents. What is the meaning of "denoisified" in the practical day presentation (Google does not know)? How do I interpret the difference? Please explain.
Thank you

On Thu, Aug 1, 2013 at 7:15 PM, <adegenet-forum-request at lists.r-forge.r-project.org> wrote:

Today's Topics:

  1. Fwd: Question about pre-processing of SNP data for machine
     learning (Daniel Murrell)
  2. Re: Fwd: Question about pre-processing of SNP data for
     machine learning (Jombart, Thibaut)
  3. Re: Fwd: Question about pre-processing of SNP data for
     machine learning (Daniel Murrell)

----------------------------------------------------------------------

Message: 1
Date: Thu, 1 Aug 2013 15:26:00 +0100
From: Daniel Murrell <dsm38 at cam.ac.uk>
To: adegenet-forum at lists.r-forge.r-project.org
Subject: [adegenet-forum] Fwd: Question about pre-processing of SNP
       data for machine learning
Message-ID:
       <CADK=3HwmiEO5v6fCQUYNkHFQ520avQJ9LFOAdu=Yu-Z+8h7BCg at mail.gmail.com>
Content-Type: text/plain; charset="windows-1252"

Hi All

This is my first time using adegenet. I'm trying to perform some
pre-processing on 1.3M SNPs (~800 individuals) so that I can use them for a
machine learning task. My data was stored in a format which had to be
converted to a genlight object. The data was split so that the information
for the SNPs in each chromosome was in a separate file. I've read each file
in, converted that to a genlight object and then concatenated the genlight
objects using cbind. All of that seems to work ok (except the position and
chromosome data went back to NULL during the concatenation and I had to
reset it on the combined genlight object).

So, now I want to do my own processing on each SNP and when I try to access
the information for this SNP over the 800 individuals, it takes ages to
extract. Is this because the encoding is done row-wise, and so the whole
object needs to be decoded for me to get out the information I require? Is
there a way to transpose this genlight object so that I can access the data
for a single SNP across all individuals quickly?

Thank you
Daniel

---------- Forwarded message ----------
From: Jombart, Thibaut <t.jombart at imperial.ac.uk>
Date: Fri, Jul 19, 2013 at 4:27 PM
Subject: RE: Question about pre-processing of SNP data for machine learning
To: Daniel Murrell <dsm38 at cam.ac.uk>

Dear Daniel,

yes, adegenet is designed for that kind of task. Please look at the
tutorial on adegenet-basics where you'll find examples of dimension
reduction for SNP data, to be found on:
http://adegenet.r-forge.r-project.org/

Don't hesitate to use the adegenet-forum for further questions (see
contacts on the website).
Best
Thibaut

________________________________________
From: dsmurrell at gmail.com [dsmurrell at gmail.com] on behalf of Daniel Murrell [dsm38 at cam.ac.uk]
Sent: 19 July 2013 16:23
To: Jombart, Thibaut
Subject: Question about pre-processing of SNP data for machine learning

Dear Thibaut

I'm trying to build a model that uses SNP data as input. The problem I have
is that there is too much of it and I need a way to reduce the number or
the dimensionality of the data points so that I can use them as input to
machine learning algorithms (genome wide, 1.3 million SNPs, 800
individuals). I've done some searching and found this paper:
http://www.ncbi.nlm.nih.gov/pubmed/18076475 (pdf attached).

I also found your adegenet package and wondered if it's designed for doing
something like this? I'm not from this field and I'm having some trouble
working this out. Can you point me to anything that might help?

I'm not sure whether I should be keeping a subset of SNPs and how to find
that subset from the 1.3 million, or whether I should be reducing the
dimensionality.

Thank you
Daniel

------------------------------

Message: 2
Date: Thu, 1 Aug 2013 15:22:27 +0000
From: "Jombart, Thibaut" <t.jombart at imperial.ac.uk>
To: Daniel Murrell <dsm38 at cam.ac.uk>,
       "adegenet-forum at lists.r-forge.r-project.org"
       <adegenet-forum at lists.r-forge.r-project.org>
Subject: Re: [adegenet-forum] Fwd: Question about pre-processing of
       SNP data for machine learning
Message-ID:
       <2CB2DA8E426F3541AB1907F98ABA6570638ABF4F at icexch-m1.ic.ac.uk>
Content-Type: text/plain; charset="Windows-1252"

Dear Daniel,

the loss of attributes after cbind is indeed a glitch. Would you mind creating a ticket about it?
https://sourceforge.net/p/adegenet/tickets/

You're right about the issue. The encoding is indeed done row-wise, so the conversion is done many times over. There's no option for transposing the data, but one solution would be converting your data to integers by blocks so that conversion takes place less often, while still keeping RAM requirements reasonable.
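The block-wise idea could look roughly like the sketch below. It is a toy illustration in Python, not the genlight internals: the one-bit-per-SNP packing, the helper names and the tiny data set are all assumptions. Each individual is stored row-wise in packed form; instead of decoding every individual once per SNP, a block of SNPs is decoded for all individuals in one pass, and the SNPs in that block are then read as plain integer columns.

```python
def pack_row(genotypes):
    """Pack one individual's 0/1 genotypes into bytes, one bit per SNP."""
    packed = bytearray((len(genotypes) + 7) // 8)
    for k, g in enumerate(genotypes):
        packed[k // 8] |= g << (k % 8)
    return bytes(packed)


def decode_bits(packed, start, stop):
    """Unpack SNPs start..stop-1 of one individual into integers."""
    return [(packed[k // 8] >> (k % 8)) & 1 for k in range(start, stop)]


def snp_columns_by_block(packed_rows, n_snps, block_size):
    """Yield (snp_index, genotypes across individuals), one block at a time."""
    for start in range(0, n_snps, block_size):
        stop = min(start + block_size, n_snps)
        # One decoding pass over the individuals per block, not per SNP.
        block = [decode_bits(row, start, stop) for row in packed_rows]
        for offset in range(stop - start):
            yield start + offset, [ind[offset] for ind in block]


# Hypothetical data: 3 individuals x 5 binary SNPs.
genotypes = [
    [1, 0, 1, 1, 0],
    [0, 0, 1, 0, 1],
    [1, 1, 1, 0, 0],
]
packed_rows = [pack_row(g) for g in genotypes]

# Per-SNP processing (here: counting alleles) on decoded integer columns.
allele_counts = {}
for snp, column in snp_columns_by_block(packed_rows, n_snps=5, block_size=2):
    allele_counts[snp] = sum(column)

print(allele_counts)
```

The block size is the trade-off mentioned above: larger blocks mean fewer decoding passes but more integers held in memory at once.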

All the best

Thibaut

------------------------------

Message: 3
Date: Thu, 1 Aug 2013 17:14:37 +0100
From: Daniel Murrell <dsm38 at cam.ac.uk>
To: "Jombart, Thibaut" <t.jombart at imperial.ac.uk>
Cc: "adegenet-forum at lists.r-forge.r-project.org"
       <adegenet-forum at lists.r-forge.r-project.org>
Subject: Re: [adegenet-forum] Fwd: Question about pre-processing of
       SNP data for machine learning
Message-ID:
       <CADK=3Hz=iJSJePuCOSwCkFOQUWHQyAmk+YS=-qWD+EO5vOBihA at mail.gmail.com>
Content-Type: text/plain; charset="windows-1252"

Dear Thibaut

Ok, I could try that. I could also try to use the genlight object in a
transposed manner, just for the purpose of holding the data, so that I can
access individual SNPs easily. I mean that nothing else would work except
the containment.

Thanks for the help
Regards
Daniel

------------------------------

_______________________________________________
adegenet-forum mailing list
adegenet-forum at lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/adegenet-forum

End of adegenet-forum Digest, Vol 60, Issue 2
*********************************************
