Hi all<br><br>This is my first time using adegenet. I'm trying to perform some pre-processing on 1.3M SNPs (~800 individuals) so that I can use them for a machine learning task. My data were stored in a format that had to be converted to a genlight object, and were split so that the SNPs for each chromosome were in a separate file. I read each file in, converted it to a genlight object, and then concatenated the genlight objects using cbind. All of that seems to work OK, except that the position and chromosome data were reset to NULL during the concatenation, so I had to set them again on the combined genlight object.<br>
<br>Now I want to do my own processing on each SNP, but when I try to access the data for a single SNP across the 800 individuals, it takes ages to extract. Is this because the encoding is done row-wise, so that the whole object needs to be decoded to get at the information I require? Is there a way to transpose this genlight object so that I can access the data for a single SNP across all individuals quickly?<br>
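<br>The access pattern in question can be illustrated with a toy sketch. This is plain Python, not adegenet code, and it is not a description of how genlight is implemented in detail: it assumes one bit per genotype and one packed row per individual, and every function name in it is hypothetical. It only shows why a row-wise packed layout makes a single SNP expensive to read, and how a one-off transpose changes that.<br>

```python
# Toy model only: NOT adegenet code. Each individual is one bit-packed row,
# 8 binary genotypes per byte (a simplification of row-wise storage).

def pack_row(genotypes):
    """Pack a list of 0/1 genotypes into bytes, 8 SNPs per byte."""
    packed = bytearray((len(genotypes) + 7) // 8)
    for j, g in enumerate(genotypes):
        byte, bit = divmod(j, 8)
        packed[byte] += g * 2 ** bit
    return bytes(packed)


def get_bit(packed, j):
    """Decode genotype j from one packed row."""
    byte, bit = divmod(j, 8)
    return (packed[byte] // 2 ** bit) % 2


def snp_across_individuals(rows, j):
    """Row-wise layout: one SNP needs a lookup in every individual's row."""
    return [get_bit(row, j) for row in rows]


def transpose(rows, n_snps):
    """Decode everything once into SNP-major lists (one list per SNP)."""
    return [snp_across_individuals(rows, j) for j in range(n_snps)]


individuals = [
    [0, 1, 1, 0, 0, 0, 1, 0, 1, 1],
    [1, 1, 0, 0, 1, 0, 0, 0, 0, 1],
    [0, 0, 1, 1, 0, 1, 0, 1, 1, 0],
]
rows = [pack_row(ind) for ind in individuals]
by_snp = transpose(rows, n_snps=10)

# After the one-off transpose, a single SNP is a direct lookup.
print(by_snp[2])  # prints [1, 0, 1]
```

With 1.3M SNPs a full transpose may not fit comfortably in memory, so the same idea would probably have to be applied to blocks of SNPs at a time.<br>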
<br>Thank you<br>Daniel<br><br><div class="gmail_quote">---------- Forwarded message ----------<br>From: <b class="gmail_sendername">Jombart, Thibaut</b> <span dir="ltr">&lt;<a href="mailto:t.jombart@imperial.ac.uk">t.jombart@imperial.ac.uk</a>&gt;</span><br>
Date: Fri, Jul 19, 2013 at 4:27 PM<br>Subject: RE: Question about pre-processing of SNP data for machine learning<br>To: Daniel Murrell &lt;<a href="mailto:dsm38@cam.ac.uk">dsm38@cam.ac.uk</a>&gt;<br><br><br>Dear Daniel,<br>
<br>
Yes, adegenet is designed for that kind of task. Please look at the adegenet-basics tutorial, where you'll find examples of dimension reduction for SNP data. It is available at:<br>
<a href="http://adegenet.r-forge.r-project.org/" target="_blank">http://adegenet.r-forge.r-project.org/</a><br>
<br>
Don't hesitate to use the adegenet-forum for further questions (see contacts on the website).<br>
Best<br>
Thibaut<br>
<br>
--<br>
######################################<br>
Dr Thibaut JOMBART<br>
MRC Centre for Outbreak Analysis and Modelling<br>
Department of Infectious Disease Epidemiology<br>
Imperial College - School of Public Health<br>
St Mary’s Campus<br>
Norfolk Place<br>
London W2 1PG<br>
United Kingdom<br>
Tel. : <a href="tel:0044%20%280%2920%207594%203658" value="+442075943658">0044 (0)20 7594 3658</a><br>
<a href="mailto:t.jombart@imperial.ac.uk">t.jombart@imperial.ac.uk</a><br>
<a href="http://sites.google.com/site/thibautjombart/" target="_blank">http://sites.google.com/site/thibautjombart/</a><br>
<a href="http://adegenet.r-forge.r-project.org/" target="_blank">http://adegenet.r-forge.r-project.org/</a><br>
________________________________________<br>
From: <a href="mailto:dsmurrell@gmail.com">dsmurrell@gmail.com</a> [<a href="mailto:dsmurrell@gmail.com">dsmurrell@gmail.com</a>] on behalf of Daniel Murrell [<a href="mailto:dsm38@cam.ac.uk">dsm38@cam.ac.uk</a>]<br>
Sent: 19 July 2013 16:23<br>
To: Jombart, Thibaut<br>
Subject: Question about pre-processing of SNP data for machine learning<br>
<div class="HOEnZb"><div class="h5"><br>
Dear Thibaut<br>
<br>
I'm trying to build a model that uses SNP data as input. The problem I have is that there is too much of it and I need a way to reduce the number or the dimensionality of the data points so that I can use them as input to machine learning algorithms (genome wide, 1.3 million SNPs, 800 individuals). I've done some searching and found this paper: <a href="http://www.ncbi.nlm.nih.gov/pubmed/18076475" target="_blank">http://www.ncbi.nlm.nih.gov/pubmed/18076475</a> (pdf attached).<br>
<br>
I also found your adegenet package and wondered if it's designed for doing something like this? I'm not from this field and I'm having some trouble working this out. Can you point me to anything that might help?<br>
<br>
I'm not sure whether I should be keeping a subset of SNPs (and, if so, how to find that subset among the 1.3 million), or whether I should be reducing the dimensionality.<br>
<br>
Thank you<br>
Daniel<br>
</div></div></div><br>