[GenABEL-dev] function for conversion a plink format file to a GenABEL format file

Maksim Struchalin m.v.struchalin at mail.ru
Mon Nov 25 15:39:21 CET 2013


I checked read.plink from snpMatrix (Nicola) and snpStats (Maarten). 
I see that the code behind them is quite simple (~40 lines of C code 
under snpMatrix read.plink).

The plink bed format is very similar to the GenABEL format 
(http://pngu.mgh.harvard.edu/~purcell/plink/binary.shtml). It looks like 
the main difference between them is that the plink bed file starts with 
3 bytes that have a special meaning (a two-byte magic number and one 
byte giving the SNP-major or individual-major mode). The other bytes 
store genotypes (0, 1, 2 or NA) in 2 bits per genotype (like in GenA).
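
To make the 2-bit packing concrete, here is a minimal sketch (in Python 
for brevity; a C version would use the same shifts and masks). The code 
table follows the plink binary format page linked above. The name 
decode_bed_byte and the choice to report genotypes as the count of the 
second allele (with None for missing) are my own assumptions for 
illustration, not part of plink or GenABEL.

```python
# PLINK .bed 2-bit codes, read from the low-order bits of each byte upward.
# Reporting the count of the second allele is an illustrative convention;
# None marks a missing call.
BED_CODE_TO_GENOTYPE = {
    0b00: 0,     # homozygous for the first allele
    0b01: None,  # missing genotype
    0b10: 1,     # heterozygous
    0b11: 2,     # homozygous for the second allele
}


def decode_bed_byte(byte):
    """Return the four genotypes packed in one .bed data byte (hypothetical helper)."""
    return [BED_CODE_TO_GENOTYPE[(byte >> shift) & 0b11] for shift in (0, 2, 4, 6)]


# 0xE4 is 11 10 01 00 in binary; the lowest pair belongs to the first sample.
print(decode_bed_byte(0xE4))
```

The internal GenABEL codes are not necessarily the same as these, so a 
converter would still need a small lookup table from one coding to the 
other.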

I think it would be easy just to write a C function which converts bed 
to databel format. Also, we can think about making bed a format which 
is natively supported by GenABEL. For this, we only need a function 
which extracts an array from bed, and an iterator that uses this 
function.
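
A rough sketch of what such an extract-plus-iterate function could look 
like, again in Python only to show the logic. It assumes a SNP-major bed 
file and that the number of samples is already known (the bed file does 
not store it; it would come from the accompanying .fam file). The name 
iter_bed_snps and the genotype coding (count of the second allele, None 
for missing) are hypothetical choices of mine, not an existing GenABEL 
or plink API.

```python
import io

BED_MAGIC = b"\x6c\x1b"   # first two bytes of every PLINK .bed file
SNP_MAJOR = 0x01          # third byte: 1 = SNP-major, 0 = individual-major
CODES = (0, None, 1, 2)   # index is the 2-bit code; None marks missing


def iter_bed_snps(stream, n_samples):
    """Yield one list of n_samples genotypes per SNP (hypothetical helper)."""
    header = stream.read(3)
    if len(header) != 3 or header[:2] != BED_MAGIC:
        raise ValueError("not a PLINK .bed file")
    if header[2] != SNP_MAJOR:
        raise ValueError("this sketch only handles SNP-major files")
    # Each SNP starts on a byte boundary, so the last byte may be padded.
    bytes_per_snp = (n_samples + 3) // 4
    while True:
        block = stream.read(bytes_per_snp)
        if not block:
            return
        if len(block) != bytes_per_snp:
            raise ValueError("truncated .bed file")
        yield [CODES[(block[i // 4] >> (2 * (i % 4))) & 0b11]
               for i in range(n_samples)]


# Tiny in-memory example: 3 header bytes, then 2 SNPs for 5 samples.
example = io.BytesIO(b"\x6c\x1b\x01" + b"\xe4\x03" + b"\x00\x02")
for snp in iter_bed_snps(example, 5):
    print(snp)
```

Because each SNP is read as one fixed-size block, only one SNP needs to 
be held in memory at a time, which is the point of exposing it as an 
iterator.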

best,
Maksim


On 22/11/2013 23:51, Yurii Aulchenko wrote:
> Great idea
>
> I know nothing of plink bin format, but many packages make use of it, 
> so it should not be that complicated. Also plink is GNU GPL if I 
> remember correctly so we can use the code if needed
>
> Y
>
> On Friday, November 22, 2013, L.C. Karssen wrote:
>
>     How difficult would it be to import .bed files [1] instead of the text
>     conversion? Given the binary data of both the .bed and the GenABEL
>     format, wouldn't conversion be much quicker?
>
>
>     Lennart.
>
>     [1] http://pngu.mgh.harvard.edu/~purcell/plink/binary.shtml
>
>
>     On 11/22/2013 09:54 AM, Yurii Aulchenko wrote:
>     > Too slow, too difficult for the user, or both? :)
>     >
>     > On Friday, November 22, 2013, Maksim Struchalin wrote:
>     >
>     >     Yes. Looks like it was a bad idea to use plink R-plugin for
>     >     converting plink files to *ABEL format.
>     >     Maksim
>     >
>     >     On 18/11/2013 18:48, Yury Aulchenko wrote:
>     >>     I would say that in principle DatABEL::text2databel is the
>     >>     "natural" way to go from text-files to DatABEL-files
>     >>
>     >>     The problem is that 'regular' text input may be allele by
>     >>     allele, not genotype by genotype... (e.g. data are in format
>     >>     "A G", or "A/G", not "0" or "1" or "2").
>     >>
>     >>     Y
>     >>
>     >>     On Nov 15, 2013, at 17:48 PM, L.C. Karssen
>     >>     <lennart at karssen.org> wrote:
>     >>
>     >>>     Hi Maksim,
>     >>>
>     >>>     On 15-11-13 05:53, Maksim Struchalin wrote:
>     >>>>     An easy way to write a function for converting a plink
>     >>>>     format file to a GenABEL format file:
>     >>>>
>     >>>>     Use plink support of 'plug-in' functions
>     >>>
>     >>>     Nice find. I didn't know that existed.
>     >>>
>     >>>>     (http://pngu.mgh.harvard.edu/~purcell/plink/rfunc.shtml).
>     >>>>     This allows us to write a simple R script (myscript.R) which
>     >>>>     is called by plink (plink --file mydata --R myscript.R).
>     >>>>     plink reads the file mydata (which is in plink format) and
>     >>>>     iteratively, SNP by SNP, transfers all the data to the
>     >>>>     script myscript.R. This script contains a function
>     >>>>     Rplink(PHENO,GENO,CLUSTER,COVAR) which will take every SNP
>     >>>>     (GENO variable) and store it in a *flv format through
>     >>>>     calling DatABEL functions.
>     >>>>
>     >>>>     The whole process of conversion will look like this:
>     >>>>
>     >>>>     1) User asks GenA to convert a plink file to a GenA file
>     >>>>     2) GenA checks whether plink is installed. If it is not
>     >>>>     installed, then GenA goes to the plink site and
>     >>>>     downloads/installs it itself (using the R function
>     >>>>     "download.file" from the "utils" package)
>     >>>>     3) GenA runs a simple line: system('plink --file mydata --R
>     >>>>     myscript.R')
>     >>>>     4) The Rplink function (from myscript.R) gets every SNP and
>     >>>>     stores it in *flv format. This function creates an flv file
>     >>>>     and then opens and closes it to save every single SNP.
>     >>>>     5) Work is done
>     >>>
>     >>>     I'm not sure how portable it is to download and run plink.
>     >>>     Also, the plink page says: Currently, there is only support
>     >>>     for R-plugins for Linux-based and Mac OS PLINK distributions.
>     >>>
>     >>>>
>     >>>>     The only issue is how fast the conversion will run: how much
>     >>>>     time does it take to open a filevector file, store one SNP
>     >>>>     and close it? I cannot find a DatABEL R function for adding
>     >>>>     a SNP to a flv file. Is there a C DatABEL function which
>     >>>>     can do it?
>     >>>
>     >>>     Wouldn't it be easier/possible to use plink to export to text
>     >>>     (.csv) and then use filevector's txt2fvf binary (of course
>     >>>     this could be done from R using system())?
>     >>>
>     >>>     I'm also wondering if going per SNP is really necessary. If I
>     >>>     understand it correctly the R script (myscript.R) has to have
>     >>>     a function called:
>     >>>     Rplink <- function(PHENO,GENO,CLUSTER,COVAR)
>     >>>     where GENO is the matrix of genotypes. So we could write that
>     >>>     into a DatABEL file at once. Of course you may want to do
>     >>>     this per chromosome to reduce memory consumption (not sure
>     >>>     how plink/R would handle large data sets).
>     >>>
>     >
>     >
>     > --
>     > -----------------------------------------------------
>     > Yurii S. Aulchenko
>     >
>     > [ LinkedIn <http://nl.linkedin.com/in/yuriiaulchenko> ] [ Twitter
>     > <http://twitter.com/YuriiAulchenko> ] [ Blog
>     > <http://yurii-aulchenko.blogspot.nl/> ]
>     >
>     >
>     >
>     > _______________________________________________
>     > genabel-devel mailing list
>     > genabel-devel at lists.r-forge.r-project.org
>     > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/genabel-devel
>     >
>
>     --
>     *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
>     L.C. Karssen
>     Utrecht
>     The Netherlands
>
>     lennart at karssen.org
>     http://blog.karssen.org
>     GPG key ID: A88F554A
>     -*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-
>
>
>
