[GenABEL-dev] probabel big endian support

Thu May 1 19:04:12 CEST 2014

Dear Jurica,

On 29-04-14 17:05, Jurica Stanojkovic wrote:
> Dear Karssen,
> 
>>> What is the best course of action for supporting probabel on big endian?
>>> Should *.fvi, *.fvd files always be in little endian format (then
>>> DatABEL needs to be changed to always create little endian files)?
>>> Or can *.fvd, *.fvi files be replaced with big endian files for big
>>> endian build?
> 
>>I would say that ideally the files need only to be created once and then
>>usable on all systems. Especially since these files are usually large
>>and converting from text format to .fvi/.fvd takes quite a while.
> 
> If I had to change some values in text format, would I have to generate
> the fvd/fvi files again?

Yes. And for that you would either need R + GenABEL and DatABEL, or the
tools in filevector's fvutil directory [1].

> Does one have to change those files often when working with ProbABEL?

No. The workflow is as follows:

1) genetic data (let's say 1e5 to 1e6 data points) are 'imputed' to a
reference set. That means that through statistical inference based on a
reference set the genetic data is 'interpolated' to ~30e6 data points
(SNPs).
These data points are floating point values between 0.0 and 2.0,
so-called 'dosages', usually with ~3 digits after the decimal point.
This process takes several days on a multi-node cluster for a sample
size of, for example, 7000 people.

2) This imputation process results in text files of N_people columns and
N_SNPs rows. In order to parallelise the imputation process
for 30e6 genetic SNPs, the files are usually split into sections of a
few million SNPs. Usually these text files are gzipped. In total these
files are a few hundred GB in size.

3) The purpose of converting to filevector format is that with .fv?
files we don't need to load the text files into RAM, but can quickly
access a given row (or column). For the analysis performed by ProbABEL
we want to read the SNP dosages for all individuals for a given SNP.
Basically ProbABEL is one big for-loop over all 30e6 SNPs.

4) So, in a real life situation a bioinformatician would run the
imputations, and convert the data to filevector format once for the
whole research group (and store them somewhere centrally). For 7000
people and 30e6 SNPs the DatABEL files (which are not compressed) can
reach ~1 TB in size.
That is why I don't think people will transfer these files a lot. They
are stored centrally for all users to use. Transfer to a different
server happens, but not often. Transfer to a machine with a different
architecture will be even rarer.
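
As a rough check on the ~1 TB figure, the arithmetic can be sketched as
below. This is only a back-of-the-envelope estimate: the 4 bytes per
dosage, the SNP-per-row layout, and both function names are assumptions
made for illustration, and the .fvi index overhead is ignored.

```python
# Back-of-the-envelope size estimate for an uncompressed dosage matrix.
# Assumptions (not taken from the filevector code): each dosage is stored
# as a 4-byte float and rows (one per SNP) are stored back to back.


def matrix_bytes(n_people, n_snps, bytes_per_value=4):
    """Total size in bytes of an n_snps x n_people matrix of fixed-width values."""
    return n_people * n_snps * bytes_per_value


def row_offset(snp_index, n_people, bytes_per_value=4):
    """Byte offset of one SNP's row under the back-to-back layout assumed above."""
    return snp_index * n_people * bytes_per_value


if __name__ == "__main__":
    size = matrix_bytes(n_people=7000, n_snps=30_000_000)
    print(f"{size / 1e12:.2f} TB")  # 0.84 TB
```

With fixed-width values, the position of any SNP is a simple
multiplication, which is why a single row can be read without loading
the rest of the file.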

> If we byte-swap every value in the fvd/fvi file on the fly, would
> that also be time consuming?
> I understand that users then do not need to wait for the files to be
> generated again on big endian,
> but would the same task (run) take longer on a big-endian machine than
> on a little-endian one?
> little-endian one?
> 

Do I understand correctly that you are talking about on-the-fly
conversion? So when someone runs ProbABEL and we detect a big-endian
machine, the conversion is done while reading the data?
That may be a better option than the conversion tool I mentioned below
for people who are low on disk space. On the other hand, given that
usually several users use the same filevector files, each of those users
pays the penalty, and currently ProbABEL is already mostly limited by
reading the data from disk.
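
To illustrate what on-the-fly conversion means, here is a minimal
sketch, written in Python for brevity rather than in the C++ of
filevector. It assumes the values on disk are 4-byte little-endian
floats, which I have not verified against the filevector code; the
function name is made up.

```python
import struct
import sys


def decode_le_floats(raw):
    """Decode a buffer of little-endian 4-byte floats on any host."""
    count = len(raw) // 4
    # '<' forces little-endian interpretation regardless of the host,
    # so bytes are only reordered where the host order differs.
    return list(struct.unpack(f"<{count}f", raw[:count * 4]))


if __name__ == "__main__":
    raw = struct.pack("<3f", 0.0, 1.0, 2.0)  # stand-in for one row read from disk
    print(sys.byteorder, decode_le_floats(raw))
```

The point of this design is that the file format fixes the byte order
once, and every reader decodes relative to that fixed order instead of
assuming its own.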

Does anyone have an idea how much time an endianness conversion would
add to the reading of the data?
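
One way to get a first impression would be a small timing experiment
along the following lines. It is only a sketch: it swaps values that are
already in memory, uses Python's array module instead of the C++ code
ProbABEL actually runs, and assumes 4-byte floats, so the numbers would
be indicative at best.

```python
import time
from array import array


def time_byteswap(n_values):
    """Return the seconds spent byte-swapping n_values 4-byte floats in place."""
    data = array("f", [1.5]) * n_values
    start = time.perf_counter()
    data.byteswap()  # reverses the byte order of every item
    return time.perf_counter() - start


if __name__ == "__main__":
    n = 7000 * 1000  # dosages of 1000 SNPs for 7000 people
    print(f"swapped {n} values in {time_byteswap(n):.4f} s")
```

Comparing that figure with the time needed to read the same number of
bytes from disk would show whether the swap matters in practice.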

>>This, however, would require diving into the filevector and the DatABEL
>>code (filevector or libfilevector is the name of the 'backend' code in
>>which the .fvd/.fvi files are 'defined'; both DatABEL and ProbABEL use
>>that code when dealing with .fvi/.fvd files). I don't have very much
>>experience with either code base, but could probably have a look and
>>give you some pointers.
> 
> I tried to work around this and got some results, but I did not manage
> to find every place in the code where an endian swap is needed.
> I am currently busy with other work, but I will soon look at this again.
> 
>>Jurica, can you tell us a bit more about why you are using a MIPS
>>machine for your work with ProbABEL? And do you think it would be a
>>common task to move these files between machines with different
>>architectures at your site?
> 
> I work on supporting mips/mipsel for Debian sid.
> I have access to mips and mipsel boards and can help with big-endian support.
> But I do not use ProbABEL actively.

OK, good to know. Hopefully the explanation of typical usage I gave
above will give you an idea of how ProbABEL is used.

> 
>>Maybe a converter from big to little and vice versa would be the easiest
>>solution? I guess such a conversion can be done rather quickly. The
>>downside would be that it (at least temporarily) requires double the
>>disk space.
>>Such a converter could be part of the fvutils and/or of DatABEL, for
>>example.
> 
> Maybe this could be a good solution, presuming that this would be faster
> than just converting from text to fileVector format?

Good point. I don't know what would be faster, but my feeling is that a
conversion of binary data to binary data is faster than conversion from
ASCII text to binary.
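
Such a converter could look roughly like the sketch below. Please treat
it as an illustration only: it assumes the file is nothing but a flat
sequence of 4-byte values, whereas the real .fvd/.fvi files may contain
headers, other value widths, or text fields that must not be swapped.
The function name and chunk size are made up.

```python
import os
import tempfile
from array import array


def swap_file_endianness(src_path, dst_path, chunk_values=1 << 20):
    """Copy src to dst, reversing the byte order of every 4-byte value."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            raw = src.read(chunk_values * 4)
            if not raw:
                break
            if len(raw) % 4:
                raise ValueError("file size is not a multiple of 4 bytes")
            values = array("f")
            values.frombytes(raw)  # raw copy of the bytes, no parsing
            values.byteswap()
            dst.write(values.tobytes())


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    src = os.path.join(workdir, "little.bin")
    dst = os.path.join(workdir, "big.bin")
    with open(src, "wb") as handle:
        handle.write(array("f", [0.0, 1.0, 2.0]).tobytes())
    swap_file_endianness(src, dst)
    print(os.path.getsize(src) == os.path.getsize(dst))
```

Because the data are processed in fixed-size chunks, memory use stays
small, but the output file does need as much disk space as the input.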

> I will have to look closer at how data is converted and written from text to
> fvd/fvi in order to be able to convert them to a different endianness.
> 
> There is also an option to always create the fvd/fvi files in both endian
> formats, or to create some universal file that has the data in both endians inside.

Of course, if we simply confine ourselves to getting ProbABEL to run on
all Debian architectures, then adding big endian .fv? files is
definitely an option (although we would need some way of determining
which .fv? files to use given an architecture). Then we could instruct
the users on how to deal with this in the manual.
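
Determining which files to use could be as simple as the following
sketch. The '.be' suffix is purely a hypothetical naming convention
invented for this example, as is the function name; nothing like it
exists in ProbABEL or DatABEL at the moment.

```python
import sys


def pick_filevector_base(base, byteorder=sys.byteorder):
    """Return the file base name matching the given byte order.

    Hypothetical convention: big-endian copies carry a '.be' suffix,
    little-endian files keep the plain name.
    """
    if byteorder not in ("little", "big"):
        raise ValueError(f"unknown byte order: {byteorder!r}")
    return base + ".be" if byteorder == "big" else base


if __name__ == "__main__":
    # On a little-endian host this prints the plain name.
    print(pick_filevector_base("test.dose"))
```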

Best,

Lennart.

[1]
https://r-forge.r-project.org/scm/viewvc.php/pkg/filevector/?root=genabel

> 
> Regards,
> Jurica
> 
> -------- Original Message --------
> Subject: Re: [GenABEL-dev] probabel big endian support
> Date: Saturday, April 26, 2014 22:17 CEST
> From: "L.C. Karssen" <lennart at karssen.org>
> To: genabel-devel at lists.r-forge.r-project.org
> References: <896-53591700-f-3be4eec0 at 227853676>
>  
>> Dear Jurica,
>>
>> On 24-04-14 15:52, Jurica Stanojkovic wrote:
>> > Dear list,
>> >
>> > I have tried building package probabel on mips big endian.
>>
>> That is great to hear! As far as I know, none of the current developers
>> have access to such a machine.
>>
>> > It looks like that inputfiles/*.fvd and inputfiles/*.fvi are created on
>> > little endian machine and are not working on big endian ones.
>>
>> That is correct, we found out
>>
>> >
>> > I have tried to create them on big endian mips, and replace ones that
>> > came with source package with the ones that I have created.
>> > The package was built with new files without an error.
>>
>> That is good news. So GenABEL and DatABEL work on big-endian machines.
>>
>> >
>> > I used following command to create files:
>> > library(GenABEL)
>> > library(DatABEL)
>> > fvdose <- mach2databel(imputedg="./checks/inputfiles/test.mldose",
>> > mlinfo="./checks/inputfiles/test.mlinfo",
>> > outfile="./checks/inputfiles/test.dose")
>> > fvprob <- mach2databel(imputedg="./checks/inputfiles/test.mlprob",
>> > mlinfo="./checks/inputfiles/test.mlinfo",
>> > outfile="./checks/inputfiles/test.prob", isprob=TRUE)
>> > mmdose <-
>> > mach2databel(imputedg="./checks/inputfiles/mmscore_gen.mldose",
>> > mlinfo="./checks/inputfiles/mmscore_gen.mlinfo",
>> > outfile="./checks/inputfiles/mmscore_gen.dose")
>> > mmprob <-
>> > mach2databel(imputedg="./checks/inputfiles/mmscore_gen.mlprob",
>> > mlinfo="./checks/inputfiles/mmscore_gen.mlinfo",
>> > outfile="./checks/inputfiles/mmscore_gen.prob", isprob=TRUE)
>> >
>> > I am new to ProbABEL, GenABEL, DatABEL so could someone please help me
>> > with following questions:
>> >
>> > What is the best course of action for supporting probabel on big endian?
>> > Should *.fvi, *.fvd files always be in little endian format (then
>> > DatABEL needs to be changed to always create little endian files)?
>> > Or can *.fvd, *.fvi files be replaced with big endian files for big
>> > endian build?
>>
>> I would say that ideally the files need only to be created once and then
>> usable on all systems. Especially since these files are usually large
>> and converting from text format to .fvi/.fvd takes quite a while.
>>
>> This, however, would require diving into the filevector and the DatABEL
>> code (filevector or libfilevector is the name of the 'backend' code in
>> which the .fvd/.fvi files are 'defined'; both DatABEL and ProbABEL use
>> that code when dealing with .fvi/.fvd files). I don't have very much
>> experience with either code base, but could probably have a look and
>> give you some pointers.
>>
>> >
>> > Is it necessary to be able to use *.fvd *.fvi files created on a
>> > different endian system?
>>
>> On the other hand, how often will people transfer these files to
>> machines of different architectures?
>>
>> Jurica, can you tell us a bit more about why you are using a MIPS
>> machine for your work with ProbABEL? And do you think it would be a
>> common task to move these files between machines with different
>> architectures at your site?
>>
>> Maybe a converter from big to little and vice versa would be the easiest
>> solution? I guess such a conversion can be done rather quickly. The
>> downside would be that it (at least temporarily) requires double the
>> disk space.
>> Such a converter could be part of the fvutils and/or of DatABEL, for
>> example.
>>
>> >
>> > I am willing to work on adding big endian support and I will appreciate
>> > any help in determining the right course of action in resolving this
>> > problem.
>>
>> Thank you for your time and willingness to help! It is very much
>> appreciated. We're a small group of developers, but we'll try to help as
>> much as we can.
>>
>>
>> Best,
>>
>> Lennart.
>>
>> >
>> > Regards,
>> > Jurica
>> >
>> >
>> > _______________________________________________
>> > genabel-devel mailing list
>> > genabel-devel at lists.r-forge.r-project.org
>> >
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/genabel-devel
>> >
>>
>> --
>> *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
>> L.C. Karssen
>> Utrecht
>> The Netherlands
>>
>> lennart at karssen.org
>> http://blog.karssen.org
>> GPG key ID: A88F554A
>> -*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-
>>  
> 
>  

-- 
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
L.C. Karssen
Utrecht
The Netherlands

lennart at karssen.org
http://blog.karssen.org
GPG key ID: A88F554A
-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-
