[GenABEL-dev] function for conversion a plink format file to a GenABEL format file
L.C. Karssen
lennart at karssen.org
Tue Nov 26 14:37:20 CET 2013
Dear Maksim,
On 11/25/2013 03:39 PM, Maksim Struchalin wrote:
> I checked read.plink from snpMatrix (Nicola) and snpStats (Maarten).
> The code underneath is quite simple (~40 lines of C code for
> snpMatrix's read.plink).
>
> The plink bed format is very similar to the GenABEL format
> (http://pngu.mgh.harvard.edu/~purcell/plink/binary.shtml). It looks like
> the main difference between them is that the plink bed file starts with
> 3 bytes that have a special meaning. The remaining bytes store genotypes
> (0, 1, 2 or NA) in 2 bits per genotype (like in GenA).
That sounds good. I actually never looked under the hood of the GenABEL
format, and the plink format is indeed quite simple. If it is only the
first few bytes that differ, that sounds promising!
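
For reference, a minimal sketch of what decoding the .bed side could look
like, based on the PLINK binary format description linked above. It is in
Python rather than the C function proposed here, the name decode_bed is
made up for illustration, and it only handles SNP-major files. It says
nothing about how GenABEL lays out its own bytes, so the mapping to the
GenABEL format is still an open step.

```python
# First two bytes of every PLINK .bed file; the third byte is the mode.
BED_MAGIC = bytes([0x6C, 0x1B])
SNP_MAJOR = 0x01

# 2-bit code -> number of copies of the second allele (None = missing).
CODE_TO_DOSAGE = {0b00: 0, 0b01: None, 0b10: 1, 0b11: 2}


def decode_bed(data, n_samples):
    """Decode the raw bytes of a SNP-major .bed file.

    n_samples must come from elsewhere (e.g. the line count of the .fam
    file); the .bed file itself does not store it.
    Returns one list of genotypes per SNP.
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    if len(data) < 3 or data[:2] != BED_MAGIC:
        raise ValueError("not a PLINK .bed file")
    if data[2] != SNP_MAJOR:
        raise ValueError("only SNP-major .bed files are handled here")

    # Each SNP is padded to a whole number of bytes (4 genotypes per byte).
    bytes_per_snp = (n_samples + 3) // 4
    payload = data[3:]
    if len(payload) % bytes_per_snp != 0:
        raise ValueError("payload size does not match n_samples")

    snps = []
    for start in range(0, len(payload), bytes_per_snp):
        block = payload[start:start + bytes_per_snp]
        genotypes = []
        for i in range(n_samples):
            # Genotypes are packed starting from the low-order bits.
            code = (block[i // 4] >> (2 * (i % 4))) & 0b11
            genotypes.append(CODE_TO_DOSAGE[code])
        snps.append(genotypes)
    return snps
```

A real converter would read the sample and SNP counts from the .fam and
.bim files and would stream the data instead of holding it all in memory.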
>
> I think it would be easy just to write a C function which converts bed to
> databel format.
That sounds useful! But what does that mean for the GenABEL functions?
Do you propose to let the GenABEL functions (like
merge.snp.data/merge.gwaa.data) work on DatABEL objects as well?
> Also, we can think about making bed a format
> which is natively supported by GenABEL. For this, we only need a
> function which extracts an array from bed, and to make an iterator
> that uses this function.
Similar to my question above: what exactly do you mean? Do you want to
change all (relevant) GenABEL functions to work with three backend
formats (GenABEL/DatABEL/.bed)? That sounds like quite a lot of work!
Or do you simply mean to write import/convert functions between these
formats?
Best,
Lennart.
>
> best,
> Maksim
>
>
> On 22/11/2013 23:51, Yurii Aulchenko wrote:
>> Great idea
>>
>> I know nothing of the plink bin format, but many packages make use of it,
>> so it should not be that complicated. Also, plink is GNU GPL if I
>> remember correctly, so we can use the code if needed
>>
>> Y
>>
>> On Friday, November 22, 2013, L.C. Karssen wrote:
>>
>> How difficult would it be to import .bed files [1] instead of the text
>> conversion? Given the binary data of both the .bed and the GenABEL
>> format, wouldn't conversion be much quicker?
>>
>>
>> Lennart.
>>
>> [1] http://pngu.mgh.harvard.edu/~purcell/plink/binary.shtml
>>
>>
>> On 11/22/2013 09:54 AM, Yurii Aulchenko wrote:
>> > Too slow, too difficult for the user, or both? :)
>> >
>> > On Friday, November 22, 2013, Maksim Struchalin wrote:
>> >
>> > Yes. Looks like it was a bad idea to use plink R-plugin for
>> > converting plink files to *ABEL format.
>> > Maksim
>> >
>> > On 18/11/2013 18:48, Yury Aulchenko wrote:
>> >> I would say that in principle DatABEL::text2databel is the
>> >> "natural" way to go from text-files to DatABEL-files
>> >>
>> >> The problem is that 'regular' text input may be allele by allele,
>> >> not genotype by genotype... (e.g. data are in format "A G", or
>> >> "A/G", not "0" or "1" or "2").
>> >>
>> >> Y
>> >>
>> >> On Nov 15, 2013, at 17:48 PM, L.C. Karssen <lennart at karssen.org>
>> >> wrote:
>> >>
>> >>> Hi Maksim,
>> >>>
>> >>> On 15-11-13 05:53, Maksim Struchalin wrote:
>> >>>> An easy way to write a function for converting a plink format
>> >>>> file to a GenABEL format file:
>> >>>>
>> >>>> Use plink support of 'plug-in' functions
>> >>>
>> >>> Nice find. I didn't know that existed.
>> >>>
>> >>>> (http://pngu.mgh.harvard.edu/~purcell/plink/rfunc.shtml).
>> >>>> This allows us to write a simple R script (myscript.R) which is
>> >>>> called by plink (plink --file mydata --R myscript.R). plink reads
>> >>>> the file mydata (which is in plink format) and iteratively, SNP by
>> >>>> SNP, transfers all the data to the script myscript.R. This script
>> >>>> contains a function Rplink(PHENO,GENO,CLUSTER,COVAR) which will
>> >>>> take every SNP (GENO variable) and store it in *flv format by
>> >>>> calling DatABEL functions.
>> >>>>
>> >>>> The whole conversion process will look like this:
>> >>>>
>> >>>> 1) The user asks GenA to convert a plink file to a GenA file
>> >>>> 2) GenA checks whether plink is installed. If it is not installed,
>> >>>> GenA goes to the plink site and downloads/installs it itself
>> >>>> (using the R function "download.file" from the "utils" package)
>> >>>> 3) GenA runs a simple line:
>> >>>> system('plink --file mydata --R myscript.R')
>> >>>> 4) The Rplink function (from myscript.R) gets every SNP and stores
>> >>>> it in *flv format. This function creates an flv file and then
>> >>>> opens and closes it to save every single SNP.
>> >>>> 5) Work is done
>> >>>
>> >>> I'm not sure how portable it is to download and run plink. Also, the
>> >>> plink page says: Currently, there is only support for R-plugins for
>> >>> Linux-based and Mac OS PLINK distributions.
>> >>>
>> >>>>
>> >>>> The only issue is how fast the conversion will run: how much time
>> >>>> does it take to open a filevector file, store one SNP and close it?
>> >>>> I cannot find a DatABEL R function for adding a SNP to an flv file.
>> >>>> Is there a C DatABEL function which can do it?
>> >>>
>> >>> Wouldn't it be easier/possible to use plink to export to text (.csv)
>> >>> and then use filevector's txt2fvf binary (of course this could be
>> >>> done from R using system())?
>> >>>
>> >>> I'm also wondering if going per SNP is really necessary. If I
>> >>> understand it correctly the R script (myscript.R) has to have a
>> >>> function called:
>> >>> Rplink <- function(PHENO,GENO,CLUSTER,COVAR)
>> >>> where GENO is the matrix of genotypes. So we could write that into a
>> >>> DatABEL file at once. Of course you may want to do this per
>> >>> chromosome to reduce memory consumption (not sure how plink/R would
>> >>> handle large data sets).
>> >>>
>> >
>> >
>> > --
>> > -----------------------------------------------------
>> > Yurii S. Aulchenko
>> >
>> > [ LinkedIn <http://nl.linkedin.com/in/yuriiaulchenko> ] [ Twitter
>> > <http://twitter.com/YuriiAulchenko> ] [ Blog
>> > <http://yurii-aulchenko.blogspot.nl/> ]
>> >
>> >
>> >
>> > _______________________________________________
>> > genabel-devel mailing list
>> > genabel-devel at lists.r-forge.r-project.org
>> >
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/genabel-devel
>> >
>>
>> --
>> *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
>> L.C. Karssen
>> Utrecht
>> The Netherlands
>>
>> lennart at karssen.org
>> http://blog.karssen.org
>> GPG key ID: A88F554A
>> -*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-
>>
--
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
L.C. Karssen
Utrecht
The Netherlands
lennart at karssen.org
http://blog.karssen.org
GPG key ID: A88F554A
-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-