[adegenet-forum] Fwd: Question about pre-processing of SNP data for machine learning

Thu Aug 1 16:26:00 CEST 2013

Hi All

This is my first time using adegenet. I'm trying to perform some
pre-processing on 1.3M SNPs (~800 individuals) so that I can use them for a
machine learning task. My data was stored in a format which had to be
converted to a genlight object. The data was split so that the information
for the SNPs in each chromosome was in a separate file. I've read each file
in, converted that to a genlight object and then concatenated the genlight
objects using cbind. All of that seems to work OK, except that the position
and chromosome data were reset to NULL during the concatenation, so I had to
set them again on the combined genlight object.

Now I want to do my own processing on each SNP, but when I try to access
the information for a single SNP over the 800 individuals, it takes ages to
extract. Is this because the encoding is done row-wise, so that the whole
object needs to be decoded for me to get out the information I require? Is
there a way to transpose this genlight object so that I can access the data
for a single SNP across all individuals quickly?
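
To make the access pattern in this question concrete, here is a minimal
sketch in plain Python (not adegenet code). It assumes, as guessed above,
that genotypes are bit-packed one individual at a time. The functions pack,
unpack and snp_columns, the 2-bit encoding and the toy data are all
hypothetical and only illustrate the idea: rather than decoding everything
for each SNP, decode a block of SNPs once per individual and transpose that
small block. In R, the analogous step would presumably be calling
as.matrix() on a subset of loci at a time, though that is an assumption and
not something confirmed in this thread.

```python
def pack(genotypes):
    """Pack one individual's genotypes (0, 1 or 2) at 2 bits per SNP."""
    packed = bytearray((len(genotypes) + 3) // 4)
    for j, g in enumerate(genotypes):
        packed[j // 4] |= g << (2 * (j % 4))
    return bytes(packed)


def unpack(packed, start, stop):
    """Decode SNPs start..stop-1 from one individual's packed row."""
    return [(packed[j // 4] >> (2 * (j % 4))) & 0b11 for j in range(start, stop)]


def snp_columns(rows, n_snps, block_size):
    """Yield (snp_index, genotypes over all individuals), one block at a time."""
    for start in range(0, n_snps, block_size):
        stop = min(start + block_size, n_snps)
        # Decode this block once per individual: individuals x block_size.
        block = [unpack(row, start, stop) for row in rows]
        # Transpose the block so that each SNP becomes one list over individuals.
        for offset, column in enumerate(zip(*block)):
            yield start + offset, list(column)


# Toy data (hypothetical): 3 individuals x 6 SNPs, stored row-wise.
genotypes = [
    [0, 1, 2, 0, 1, 2],
    [2, 2, 0, 0, 1, 1],
    [1, 0, 1, 2, 2, 0],
]
rows = [pack(g) for g in genotypes]
columns = dict(snp_columns(rows, n_snps=6, block_size=4))
```

With this layout, each packed row is visited once per block rather than once
per SNP, and only one block of decoded genotypes is held in memory at a time.
The block size is a trade-off between memory use and the number of passes.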

Thank you
Daniel

---------- Forwarded message ----------
From: Jombart, Thibaut <t.jombart at imperial.ac.uk>
Date: Fri, Jul 19, 2013 at 4:27 PM
Subject: RE: Question about pre-processing of SNP data for machine learning
To: Daniel Murrell <dsm38 at cam.ac.uk>

Dear Daniel,

Yes, adegenet is designed for that kind of task. Please look at the
adegenet-basics tutorial, where you'll find examples of dimension
reduction for SNP data. It is available at:
http://adegenet.r-forge.r-project.org/

Don't hesitate to use the adegenet-forum for further questions (see
contacts on the website).
Best
Thibaut

--
######################################
Dr Thibaut JOMBART
MRC Centre for Outbreak Analysis and Modelling
Department of Infectious Disease Epidemiology
Imperial College - School of Public Health
St Mary’s Campus
Norfolk Place
London W2 1PG
United Kingdom
Tel. : 0044 (0)20 7594 3658
t.jombart at imperial.ac.uk
http://sites.google.com/site/thibautjombart/
http://adegenet.r-forge.r-project.org/
________________________________________
From: dsmurrell at gmail.com [dsmurrell at gmail.com] on behalf of Daniel Murrell
[dsm38 at cam.ac.uk]
Sent: 19 July 2013 16:23
To: Jombart, Thibaut
Subject: Question about pre-processing of SNP data for machine learning

Dear Thibaut

I'm trying to build a model that uses SNP data as input. The problem I have
is that there is too much of it and I need a way to reduce the number or
the dimensionality of the data points so that I can use them as input to
machine learning algorithms (genome wide, 1.3 million SNPs, 800
individuals). I've done some searching and found this paper:
http://www.ncbi.nlm.nih.gov/pubmed/18076475 (pdf attached).

I also found your adegenet package and wondered if it's designed for doing
something like this? I'm not from this field and I'm having some trouble
working this out. Can you point me to anything that might help?

I'm not sure whether I should be keeping a subset of SNPs and how to find
that subset from the 1.3 million, or whether I should be reducing the
dimensionality.

Thank you
Daniel