Dear Thibaut<br><br>OK, I could try that. I could also try using the genlight object in a transposed manner, purely to hold the data, so that I can access individual SNPs easily. Of course, nothing else would work on it except the containment.<br>
<br>Thanks for the help<br>Regards<br>Daniel<br><br><div class="gmail_quote">On Thu, Aug 1, 2013 at 4:22 PM, Jombart, Thibaut <span dir="ltr">&lt;<a href="mailto:t.jombart@imperial.ac.uk" target="_blank">t.jombart@imperial.ac.uk</a>&gt;</span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br>
Dear Daniel,<br>
<br>
The loss of attributes after cbind is indeed a glitch. Would you mind creating a ticket about it?<br>
<a href="https://sourceforge.net/p/adegenet/tickets/" target="_blank">https://sourceforge.net/p/adegenet/tickets/</a><br>
<br>
You're right about the issue. The encoding is indeed done row-wise, so the conversion is repeated many times over. There's no option for transposing the data, but one solution would be to convert your data to integers by blocks, so that conversion takes place less often while still keeping RAM requirements reasonable.<br>
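<br>
The block-wise suggestion can be sketched with a toy model. This is only an illustration in Python under a simplified packing scheme, not adegenet's actual genlight encoding, and the names pack_row, unpack_slice and iter_snp_blocks are hypothetical. Each individual is packed row-wise, and a block of SNPs is decoded for all individuals at once, so each decoding pass serves a whole block of per-SNP look-ups rather than a single one.<br>
<pre>
```python
# Toy model only: this is NOT adegenet's real storage format.
# Each individual (row) is packed into bytes, four genotypes (0..3) per byte.

def pack_row(genotypes):
    """Pack a list of small integers (0..3) into bytes, four per byte."""
    padded = list(genotypes) + [0] * (-len(genotypes) % 4)
    out = bytearray()
    for i in range(0, len(padded), 4):
        a, b, c, d = padded[i:i + 4]
        out.append(a + 4 * b + 16 * c + 64 * d)
    return bytes(out)


def unpack_slice(packed, start, stop):
    """Decode genotypes start..stop-1 of one packed row.

    start must be a multiple of 4 so that it falls on a byte boundary.
    """
    values = []
    for byte in packed[start // 4:(stop + 3) // 4]:
        for _ in range(4):
            byte, digit = divmod(byte, 4)
            values.append(digit)
    return values[:stop - start]


def iter_snp_blocks(packed_rows, n_snps, block_size):
    """Yield (start, block); block[j] holds SNP start+j for all individuals.

    block_size must be a multiple of 4. Only one block is held in decoded
    form at a time, which keeps memory use bounded.
    """
    for start in range(0, n_snps, block_size):
        stop = min(start + block_size, n_snps)
        decoded = [unpack_slice(row, start, stop) for row in packed_rows]
        # Transpose the decoded block: one list per SNP instead of per individual.
        yield start, [list(column) for column in zip(*decoded)]


# Three individuals, five SNPs (made-up values for illustration).
individuals = [
    [0, 1, 2, 0, 1],
    [2, 2, 0, 1, 0],
    [1, 0, 0, 2, 2],
]
packed = [pack_row(row) for row in individuals]

snps = {}
for start, block in iter_snp_blocks(packed, n_snps=5, block_size=4):
    for j, column in enumerate(block):
        snps[start + j] = column  # per-SNP processing would go here
```
</pre>
The block size is the trade-off knob in this sketch: larger blocks mean fewer decoding passes but more decoded integers held in memory at once.<br>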
<br>
All the best<br>
<br>
Thibaut<br>
<br>
________________________________________<br>
From: <a href="mailto:adegenet-forum-bounces@lists.r-forge.r-project.org" target="_blank">adegenet-forum-bounces@lists.r-forge.r-project.org</a> [<a href="mailto:adegenet-forum-bounces@lists.r-forge.r-project.org" target="_blank">adegenet-forum-bounces@lists.r-forge.r-project.org</a>] on behalf of Daniel Murrell [<a href="mailto:dsm38@cam.ac.uk" target="_blank">dsm38@cam.ac.uk</a>]<br>
Sent: 01 August 2013 15:26<br>
To: <a href="mailto:adegenet-forum@lists.r-forge.r-project.org" target="_blank">adegenet-forum@lists.r-forge.r-project.org</a><br>
Subject: [adegenet-forum] Fwd: Question about pre-processing of SNP data for machine learning<br>
<div><br>
Hi All<br>
<br>
This is my first time using adegenet. I'm trying to perform some pre-processing on 1.3M SNPs (~800 individuals) so that I can use them for a machine learning task. My data was stored in a format which had to be converted to a genlight object. The data was split so that the information for the SNPs in each chromosome was in a separate file. I've read each file in, converted that to a genlight object and then concatenated the genlight objects using cbind. All of that seems to work ok (except the position and chromosome data went back to NULL during the concatenation and I had to reset it on the combined genlight object).<br>
<br>
So, now I want to do my own processing on each SNP, and when I try to access the information for a given SNP over the 800 individuals, it takes ages to extract. Is this because the encoding is done row-wise, so the whole object needs to be decoded for me to get out the information I require? Is there a way to transpose this genlight object so that I can access the data for a single SNP across all individuals quickly?<br>
<br>
Thank you<br>
Daniel<br>
<br>
---------- Forwarded message ----------<br>
</div><div>From: Jombart, Thibaut &lt;<a href="mailto:t.jombart@imperial.ac.uk" target="_blank">t.jombart@imperial.ac.uk</a>&gt;<br>
Date: Fri, Jul 19, 2013 at 4:27 PM<br>
Subject: RE: Question about pre-processing of SNP data for machine learning<br>
</div><div>To: Daniel Murrell &lt;<a href="mailto:dsm38@cam.ac.uk" target="_blank">dsm38@cam.ac.uk</a>&gt;<br>
<br>
<br>
Dear Daniel,<br>
<br>
yes, adegenet is designed for that kind of task. Please look at the tutorial on adegenet-basics where you'll find examples of dimension reduction for SNP data, to be found on:<br>
<a href="http://adegenet.r-forge.r-project.org/" target="_blank">http://adegenet.r-forge.r-project.org/</a><br>
<br>
Don't hesitate to use the adegenet-forum for further questions (see contacts on the website).<br>
Best<br>
Thibaut<br>
<br>
--<br>
######################################<br>
Dr Thibaut JOMBART<br>
MRC Centre for Outbreak Analysis and Modelling<br>
Department of Infectious Disease Epidemiology<br>
Imperial College - School of Public Health<br>
St Mary’s Campus<br>
Norfolk Place<br>
London W2 1PG<br>
United Kingdom<br>
</div>Tel. : <a href="tel:0044%20%280%2920%207594%203658" value="+442075943658" target="_blank">0044 (0)20 7594 3658</a><br>
<a href="mailto:t.jombart@imperial.ac.uk" target="_blank">t.jombart@imperial.ac.uk</a><br>
<div><a href="http://sites.google.com/site/thibautjombart/" target="_blank">http://sites.google.com/site/thibautjombart/</a><br>
<a href="http://adegenet.r-forge.r-project.org/" target="_blank">http://adegenet.r-forge.r-project.org/</a><br>
________________________________________<br>
</div>From: <a href="mailto:dsmurrell@gmail.com" target="_blank">dsmurrell@gmail.com</a> [<a href="mailto:dsmurrell@gmail.com" target="_blank">dsmurrell@gmail.com</a>] on behalf of Daniel Murrell [<a href="mailto:dsm38@cam.ac.uk" target="_blank">dsm38@cam.ac.uk</a>]<br>
<div><div>Sent: 19 July 2013 16:23<br>
To: Jombart, Thibaut<br>
Subject: Question about pre-processing of SNP data for machine learning<br>
<br>
Dear Thibaut<br>
<br>
I'm trying to build a model that uses SNP data as input. The problem I have is that there is too much of it and I need a way to reduce the number or the dimensionality of the data points so that I can use them as input to machine learning algorithms (genome wide, 1.3 million SNPs, 800 individuals). I've done some searching and found this paper: <a href="http://www.ncbi.nlm.nih.gov/pubmed/18076475" target="_blank">http://www.ncbi.nlm.nih.gov/pubmed/18076475</a> (pdf attached).<br>
<br>
I also found your adegenet package and wondered if it's designed for doing something like this? I'm not from this field and I'm having some trouble working this out. Can you point me to anything that might help?<br>
<br>
I'm not sure whether I should be keeping a subset of SNPs and how to find that subset from the 1.3 million, or whether I should be reducing the dimensionality.<br>
<br>
Thank you<br>
Daniel<br>
</div></div></blockquote></div><br>