[GenABEL-dev] function for conversion a plink format file to a GenABEL format file

L.C. Karssen lennart at karssen.org
Tue Nov 26 14:37:20 CET 2013


Dear Maksim,

On 11/25/2013 03:39 PM, Maksim Struchalin wrote:
> I checked the read.plink from snpMatrix (Nicola) and snpStats (Maarten).
> I see that the code under them is quite simple (~40 lines of C code
> under snpMatrix read.plink).
> 
> The bed plink format is very similar to the GenABEL format
> (http://pngu.mgh.harvard.edu/~purcell/plink/binary.shtml). It looks like
> the main difference between them is that the plink bed file starts with
> 3 bytes that have a special meaning. The other bytes store genotypes
> (0, 1, 2 or NA) in 2 bits per genotype (like in GenA).

That sounds good. I actually never looked under the hood of the GenABEL
format, and the plink format is indeed quite simple. If it is only the
first few bytes that differ, that sounds promising!
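
To make the byte layout concrete, here is a minimal C sketch of the two pieces described above: checking the three leading bytes and unpacking the 2-bit genotype codes. The magic number and the 2-bit codes follow the PLINK .bed documentation; the function names are hypothetical, and reporting the genotype as the number of copies of the first allele is just one possible convention, not necessarily the coding GenABEL uses internally.

```c
#include <stddef.h>
#include <stdint.h>

/* Return 1 if the buffer starts with the PLINK .bed magic number
 * (0x6c 0x1b) followed by the mode byte 0x01 (SNP-major layout),
 * and 0 otherwise. The name is hypothetical. */
int bed_header_is_snp_major(const uint8_t *buf, size_t len)
{
    return len >= 3 && buf[0] == 0x6c && buf[1] == 0x1b && buf[2] == 0x01;
}

/* Decode the k-th genotype (k = 0..3) packed in one .bed byte.
 * Genotypes are stored starting from the lowest-order bit pair.
 * PLINK 2-bit codes: 0 = homozygous first allele, 1 = missing,
 * 2 = heterozygous, 3 = homozygous second allele.
 * Assumed output convention: number of copies of the first allele
 * (2, 1 or 0), or -1 for a missing genotype. */
int bed_decode_genotype(uint8_t byte, unsigned k)
{
    static const int lookup[4] = { 2, -1, 1, 0 };
    return lookup[(byte >> (2 * k)) & 0x3];
}
```

For the byte 0xDC (binary 11011100) the four genotypes decode, in sample order, to 2, 0, missing and 0.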

> 
> I think it would be easy just to write a C function which converts bed to
> databel format.

That sounds useful! But what does that mean for the GenABEL functions?
Do you propose to let the GenABEL functions (like
merge.snp.data/merge.gwaa.data) work on DatABEL objects as well?

> Also, we can think about making bed a format
> which is natively supported by GenABEL. For this, we only need a
> function which extracts an array from a bed file and an iterator that uses
> this function.

Similar to my question above: what exactly do you mean? Do you want to
change all (relevant) GenABEL functions to work with three backend
formats (GenABEL/DatABEL/.bed)? That sounds like quite a lot of work!

Or do you simply mean to write import/convert functions between these
formats?
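
For reference, the "function which extracts an array from bed plus an iterator" idea quoted above might look roughly like the C sketch below. It assumes a SNP-major .bed file whose genotype bytes (everything after the three header bytes) are already in memory, and it returns the raw 2-bit PLINK codes so that a caller can remap them to whatever coding a GenABEL or DatABEL backend expects. The struct and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Iterator over the SNPs of a SNP-major .bed genotype block. */
typedef struct {
    const uint8_t *data;  /* genotype bytes, header already skipped */
    size_t n_samples;     /* number of individuals (from the .fam file) */
    size_t n_snps;        /* number of SNPs (from the .bim file) */
    size_t next;          /* index of the SNP returned by the next call */
} bed_iter;

void bed_iter_init(bed_iter *it, const uint8_t *data,
                   size_t n_samples, size_t n_snps)
{
    it->data = data;
    it->n_samples = n_samples;
    it->n_snps = n_snps;
    it->next = 0;
}

/* Unpack the next SNP into out[0 .. n_samples-1] as raw 2-bit codes
 * (0 = hom. first allele, 1 = missing, 2 = het., 3 = hom. second allele).
 * Returns 1 on success and 0 when all SNPs have been visited. */
int bed_iter_next(bed_iter *it, uint8_t *out)
{
    if (it->next >= it->n_snps)
        return 0;

    /* Each SNP occupies a whole number of bytes, 4 samples per byte. */
    size_t stride = (it->n_samples + 3) / 4;
    const uint8_t *row = it->data + it->next * stride;

    for (size_t i = 0; i < it->n_samples; ++i)
        out[i] = (uint8_t)((row[i / 4] >> (2 * (i % 4))) & 0x3);

    it->next++;
    return 1;
}
```

Going SNP by SNP like this keeps memory use at one genotype vector at a time; whether the existing GenABEL functions could consume such an iterator directly is exactly the open question above.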


Best,

Lennart.

> 
> best,
> Maksim
> 
> 
> On 22/11/2013 23:51, Yurii Aulchenko wrote:
>> Great idea
>>
>> I know nothing of plink bin format, but many packages make use of it,
>> so it should not be that complicated. Also plink is GNU GPL if I
>> remember correctly so we can use the code if needed
>>
>> Y
>>
>> On Friday, November 22, 2013, L.C. Karssen wrote:
>>
>>     How difficult would it be to import .bed files [1] instead of the text
>>     conversion? Given the binary data of both the .bed and the GenABEL
>>     format, wouldn't conversion be much quicker?
>>
>>
>>     Lennart.
>>
>>     [1] http://pngu.mgh.harvard.edu/~purcell/plink/binary.shtml
>>
>>
>>     On 11/22/2013 09:54 AM, Yurii Aulchenko wrote:
>>     > Too slow, too difficult for the user, or both? :)
>>     >
>>     > On Friday, November 22, 2013, Maksim Struchalin wrote:
>>     >
>>     >     Yes. Looks like it was a bad idea to use plink R-plugin for
>>     >     converting plink files to *ABEL format.
>>     >     Maksim
>>     >
>>     >     On 18/11/2013 18:48, Yury Aulchenko wrote:
>>     >>     I would say that in principle DatABEL::text2databel is the
>>     >>     "natural" way to go from text-files to DatABEL-files
>>     >>
>>     >>     The problem is that 'regular' text input may be allele by allele,
>>     >>     not genotype by genotype... (e.g. data are in format "A G", or
>>     >>     "A/G", not "0" or "1" or "2").
>>     >>
>>     >>     Y
>>     >>
>>     >>     On Nov 15, 2013, at 17:48 PM, L.C. Karssen <lennart at karssen.org>
>>     >>     wrote:
>>     >>
>>     >>>     Hi Maksim,
>>     >>>
>>     >>>     On 15-11-13 05:53, Maksim Struchalin wrote:
>>     >>>>     An easy way to write a function for converting a plink format
>>     >>>>     file to a GenABEL format file:
>>     >>>>
>>     >>>>     Use plink support of 'plug-in' functions
>>     >>>
>>     >>>     Nice find. I didn't know that existed.
>>     >>>
>>     >>>>     (http://pngu.mgh.harvard.edu/~purcell/plink/rfunc.shtml).
>>     >>>>     This allows us to write a simple R script (myscript.R) which
>>     >>>>     is called by plink (plink --file mydata --R myscript.R).
>>     >>>>     plink reads the file mydata (which is in plink format) and
>>     >>>>     iteratively, SNP by SNP, transfers all the data to the
>>     >>>>     script myscript.R. This script contains a function
>>     >>>>     Rplink(PHENO,GENO,CLUSTER,COVAR) which will take every SNP
>>     >>>>     (GENO variable) and store it in *flv format by calling
>>     >>>>     DatABEL functions.
>>     >>>>
>>     >>>>     The whole process of conversion will look like this:
>>     >>>>
>>     >>>>     1) User asks GenA to convert a plink file to a GenA file
>>     >>>>     2) GenA checks whether plink is installed. If it is not
>>     >>>>     installed, then GenA goes to the plink site and
>>     >>>>     downloads/installs it itself (using the R function
>>     >>>>     "download.file" from the "utils" package)
>>     >>>>     3) GenA runs a simple line: system('plink --file mydata --R
>>     >>>>     myscript.R')
>>     >>>>     4) The Rplink function (from myscript.R) gets every SNP and
>>     >>>>     stores it in *flv format. This function creates an flv file
>>     >>>>     and then opens and closes it for saving every single SNP.
>>     >>>>     5) Work is done
>>     >>>
>>     >>>     I'm not sure how portable it is to download and run plink.
>>     >>>     Also, the plink page says: Currently, there is only support
>>     >>>     for R-plugins for Linux-based and Mac OS PLINK distributions.
>>     >>>
>>     >>>>
>>     >>>>     The only issue is how fast the conversion will run: how much
>>     >>>>     time does it take to open a filevector file, store one SNP
>>     >>>>     and close it? I cannot find a DatABEL R function for adding
>>     >>>>     a SNP to a flv file. Is there a C DatABEL function which can
>>     >>>>     do it?
>>     >>>
>>     >>>     Wouldn't it be easier/possible to use plink to export to text
>>     >>>     (.csv) and then use filevector's txt2fvf binary (of course
>>     >>>     this could be done from R using system())?
>>     >>>
>>     >>>     I'm also wondering if going per SNP is really necessary. If I
>>     >>>     understand it correctly the R script (myscript.R) has to have
>>     >>>     a function called:
>>     >>>     Rplink <- function(PHENO,GENO,CLUSTER,COVAR)
>>     >>>     where GENO is the matrix of genotypes. So we could write that
>>     >>>     into a DatABEL file at once. Of course you may want to do
>>     >>>     this per chromosome to reduce memory consumption (not sure
>>     >>>     how plink/R would handle large data sets).
>>     >>>
>>     >
>>     >
>>     > --
>>     > -----------------------------------------------------
>>     > Yurii S. Aulchenko
>>     >
>>     > [ LinkedIn <http://nl.linkedin.com/in/yuriiaulchenko> ] [ Twitter
>>     > <http://twitter.com/YuriiAulchenko> ] [ Blog
>>     > <http://yurii-aulchenko.blogspot.nl/> ]
>>     >
>>     >
>>     >
>>     > _______________________________________________
>>     > genabel-devel mailing list
>>     > genabel-devel at lists.r-forge.r-project.org
>>     > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/genabel-devel
>>     >
>>
>>     --
>>     *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
>>     L.C. Karssen
>>     Utrecht
>>     The Netherlands
>>
>>     lennart at karssen.org
>>     http://blog.karssen.org
>>     GPG key ID: A88F554A
>>     -*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-
>>
>>
>>

-- 
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
L.C. Karssen
Utrecht
The Netherlands

lennart at karssen.org
http://blog.karssen.org
GPG key ID: A88F554A
-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-
