[adegenet-forum] Fwd: Question about pre-processing of SNP data for machine learning

Jombart, Thibaut t.jombart at imperial.ac.uk
Thu Aug 1 17:22:27 CEST 2013


Dear Daniel, 

The loss of attributes after cbind is indeed a glitch. Would you mind creating a ticket about it?
https://sourceforge.net/p/adegenet/tickets/

You're right about the issue. The encoding is indeed done row-wise (one encoded vector per individual), so extracting SNPs one at a time repeats the conversion many times over. There's no option for transposing the data, but one solution would be to convert your data to integers in blocks of SNPs, so that the conversion takes place less often while still keeping RAM requirements reasonable.
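
As an illustration of the block-wise approach, here is a minimal sketch. It assumes a genlight object `x`, built here from small simulated toy data; `block_size` and the per-SNP summary (a column mean) are hypothetical placeholders for whatever processing is actually needed. Each block of SNPs is decoded once with as.matrix, and the per-SNP work is then done on an ordinary integer matrix.

```r
library(adegenet)

# Toy data standing in for the real dataset (assumption: 20 individuals, 1000 SNPs)
set.seed(1)
geno <- matrix(sample(0:2, 20 * 1000, replace = TRUE), nrow = 20)
x <- new("genlight", geno, parallel = FALSE)

block_size <- 250                       # hypothetical value; tune to available RAM
starts <- seq(1, nLoc(x), by = block_size)

snp_means <- numeric(nLoc(x))           # one result per SNP
for (s in starts) {
  idx <- s:min(s + block_size - 1, nLoc(x))
  # Decode this block of SNPs once: individuals in rows, SNPs in columns
  block <- as.matrix(x[, idx])
  # Per-SNP processing now works on a plain matrix column by column
  snp_means[idx] <- colMeans(block, na.rm = TRUE)
}
```

Larger blocks mean fewer conversions but more memory per block. With 800 individuals, a block of 10,000 SNPs is an 8-million-entry integer matrix, roughly 32 MB.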

All the best

Thibaut

________________________________________
From: adegenet-forum-bounces at lists.r-forge.r-project.org [adegenet-forum-bounces at lists.r-forge.r-project.org] on behalf of Daniel Murrell [dsm38 at cam.ac.uk]
Sent: 01 August 2013 15:26
To: adegenet-forum at lists.r-forge.r-project.org
Subject: [adegenet-forum] Fwd: Question about pre-processing of SNP data for machine learning

Hi All

This is my first time using adegenet. I'm trying to perform some pre-processing on 1.3M SNPs (~800 individuals) so that I can use them for a machine learning task. My data was stored in a format which had to be converted to a genlight object. The data was split so that the information for the SNPs in each chromosome was in a separate file. I've read each file in, converted it to a genlight object and then concatenated the genlight objects using cbind. All of that seems to work OK, except that the position and chromosome data went back to NULL during the concatenation, so I had to reset them on the combined genlight object.

So, now I want to do my own processing on each SNP, but when I try to access the information for a single SNP across the 800 individuals, it takes ages to extract. Is this because the encoding is done row-wise, so the whole object needs to be decoded for me to get the information I require? Is there a way to transpose this genlight object so that I can access the data for a single SNP across all individuals quickly?

Thank you
Daniel

---------- Forwarded message ----------
From: Jombart, Thibaut <t.jombart at imperial.ac.uk>
Date: Fri, Jul 19, 2013 at 4:27 PM
Subject: RE: Question about pre-processing of SNP data for machine learning
To: Daniel Murrell <dsm38 at cam.ac.uk>


Dear Daniel,

Yes, adegenet is designed for that kind of task. Please look at the adegenet-basics tutorial, where you'll find examples of dimension reduction for SNP data. It is available at:
http://adegenet.r-forge.r-project.org/

Don't hesitate to use the adegenet-forum for further questions (see contacts on the website).
Best
Thibaut

--
######################################
Dr Thibaut JOMBART
MRC Centre for Outbreak Analysis and Modelling
Department of Infectious Disease Epidemiology
Imperial College - School of Public Health
St Mary’s Campus
Norfolk Place
London W2 1PG
United Kingdom
Tel.: 0044 (0)20 7594 3658
t.jombart at imperial.ac.uk
http://sites.google.com/site/thibautjombart/
http://adegenet.r-forge.r-project.org/
________________________________________
From: dsmurrell at gmail.com [dsmurrell at gmail.com] on behalf of Daniel Murrell [dsm38 at cam.ac.uk]
Sent: 19 July 2013 16:23
To: Jombart, Thibaut
Subject: Question about pre-processing of SNP data for machine learning

Dear Thibaut

I'm trying to build a model that uses SNP data as input. The problem I have is that there is too much of it and I need a way to reduce the number or the dimensionality of the data points so that I can use them as input to machine learning algorithms (genome wide, 1.3 million SNPs, 800 individuals). I've done some searching and found this paper: http://www.ncbi.nlm.nih.gov/pubmed/18076475 (pdf attached).

I also found your adegenet package and wondered if it's designed for doing something like this? I'm not from this field and I'm having some trouble working this out. Can you point me to anything that might help?

I'm not sure whether I should be keeping a subset of SNPs and how to find that subset from the 1.3 million, or whether I should be reducing the dimensionality.

Thank you
Daniel