[adegenet-forum] Fwd: Question about pre-processing of SNP data for machine learning

Daniel Murrell dsm38 at cam.ac.uk
Thu Aug 1 18:14:37 CEST 2013


Dear Thibaut

Ok, I could try that. I could also try using the genlight object in a
transposed manner, purely to hold the data, so that I can access individual
SNPs easily. I mean that nothing else would work on the transposed object
except the containment.
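
For anyone reading this thread later, the block-wise idea can be sketched
roughly as follows. This is a toy model in Python, not adegenet code and not
how genlight stores data internally: the 2-bit packing and the names
pack_row, unpack_range and iter_snp_blocks are all assumptions made for
illustration. It only shows the general pattern of decoding a row-wise packed
store one block of SNPs at a time, so that the conversion to integers happens
once per block rather than once per SNP, and only one block is held in memory
as integers.

```python
# Toy model (NOT adegenet's implementation): each individual's genotypes are
# stored row-wise, packed at 2 bits per SNP (allele counts 0, 1 or 2).

def pack_row(genotypes):
    """Pack a list of allele counts (0-2) into bytes, 4 SNPs per byte."""
    packed = bytearray((len(genotypes) + 3) // 4)
    for i, g in enumerate(genotypes):
        packed[i // 4] |= (g & 0b11) << (2 * (i % 4))
    return bytes(packed)

def unpack_range(packed, start, stop):
    """Decode SNPs start..stop-1 of one packed row back to integers."""
    return [(packed[i // 4] >> (2 * (i % 4))) & 0b11
            for i in range(start, stop)]

def iter_snp_blocks(rows, n_snps, block_size):
    """Yield (first_snp_index, columns) for consecutive blocks of SNPs.

    columns[j] holds the genotypes of SNP first_snp_index + j across all
    individuals, so per-SNP processing can run on plain integer lists.
    """
    for start in range(0, n_snps, block_size):
        stop = min(start + block_size, n_snps)
        # Decode this block once for every individual (individuals x block).
        decoded = [unpack_range(row, start, stop) for row in rows]
        # Transpose the small decoded block (block x individuals).
        yield start, [list(col) for col in zip(*decoded)]

# Hypothetical example data: 3 individuals, 5 SNPs.
individuals = [[0, 1, 2, 0, 1], [2, 2, 0, 1, 0], [1, 0, 0, 2, 2]]
rows = [pack_row(g) for g in individuals]
for first, columns in iter_snp_blocks(rows, n_snps=5, block_size=2):
    for offset, snp in enumerate(columns):
        print(first + offset, snp)
```

The block size is the knob that trades speed against RAM: larger blocks mean
fewer conversions but more decoded integers held at once.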

Thanks for the help
Regards
Daniel

On Thu, Aug 1, 2013 at 4:22 PM, Jombart, Thibaut
<t.jombart at imperial.ac.uk> wrote:

>
> Dear Daniel,
>
> the loss of attributes after cbind is indeed a glitch. Would you mind
> creating a ticket about it?
> https://sourceforge.net/p/adegenet/tickets/
>
> You're right about the issue. The encoding is indeed done row-wise, so the
> conversion is done many times over. There's no option for transposing the
> data, but one solution would be converting your data to integers by blocks
> so that conversion takes place less often, while still keeping RAM
> requirements reasonable.
>
> All the best
>
> Thibaut
>
> ________________________________________
> From: adegenet-forum-bounces at lists.r-forge.r-project.org [
> adegenet-forum-bounces at lists.r-forge.r-project.org] on behalf of Daniel
> Murrell [dsm38 at cam.ac.uk]
> Sent: 01 August 2013 15:26
> To: adegenet-forum at lists.r-forge.r-project.org
> Subject: [adegenet-forum] Fwd: Question about pre-processing of SNP data
> for machine learning
>
> Hi All
>
> This is my first time using adegenet. I'm trying to perform some
> pre-processing on 1.3M SNPs (~800 individuals) so that I can use them for a
> machine learning task. My data was stored in a format which had to be
> converted to a genlight object. The data was split so that the information
> for the SNPs in each chromosome was in a separate file. I've read each file
> in, converted that to a genlight object and then concatenated the genlight
> objects using cbind. All of that seems to work ok (except the position and
> chromosome data went back to NULL during the concatenation and I had to
> reset it on the combined genlight object).
>
> So, now I want to do my own processing on each SNP, and when I try to
> access the information for a given SNP over the 800 individuals, it takes
> ages to extract. Is this because the encoding is done row-wise, and so the
> whole object needs to be decoded for me to get out the information I
> require? Is there a way to transpose this genlight object so that I can
> access the data for a single SNP across all individuals quickly?
>
> Thank you
> Daniel
>
> ---------- Forwarded message ----------
> From: Jombart, Thibaut <t.jombart at imperial.ac.uk>
> Date: Fri, Jul 19, 2013 at 4:27 PM
> Subject: RE: Question about pre-processing of SNP data for machine learning
> To: Daniel Murrell <dsm38 at cam.ac.uk>
>
>
> Dear Daniel,
>
> yes, adegenet is designed for that kind of task. Please look at the
> adegenet-basics tutorial, where you'll find examples of dimension
> reduction for SNP data. It is available at:
> http://adegenet.r-forge.r-project.org/
>
> Don't hesitate to use the adegenet-forum for further questions (see
> contacts on the website).
> Best
> Thibaut
>
> --
> ######################################
> Dr Thibaut JOMBART
> MRC Centre for Outbreak Analysis and Modelling
> Department of Infectious Disease Epidemiology
> Imperial College - School of Public Health
> St Mary’s Campus
> Norfolk Place
> London W2 1PG
> United Kingdom
> Tel. : 0044 (0)20 7594 3658
> t.jombart at imperial.ac.uk
> http://sites.google.com/site/thibautjombart/
> http://adegenet.r-forge.r-project.org/
> ________________________________________
> From: dsmurrell at gmail.com [dsmurrell at gmail.com] on behalf of Daniel
> Murrell [dsm38 at cam.ac.uk]
> Sent: 19 July 2013 16:23
> To: Jombart, Thibaut
> Subject: Question about pre-processing of SNP data for machine learning
>
> Dear Thibaut
>
> I'm trying to build a model that uses SNP data as input. The problem I
> have is that there is too much of it and I need a way to reduce the number
> or the dimensionality of the data points so that I can use them as input to
> machine learning algorithms (genome wide, 1.3 million SNPs, 800
> individuals). I've done some searching and found this paper:
> http://www.ncbi.nlm.nih.gov/pubmed/18076475 (pdf attached).
>
> I also found your adegenet package and wondered if it's designed for doing
> something like this? I'm not from this field and I'm having some trouble
> working this out. Can you point me to anything that might help?
>
> I'm not sure whether I should be keeping a subset of SNPs and how to find
> that subset from the 1.3 million, or whether I should be reducing the
> dimensionality.
>
> Thank you
> Daniel
>