From lennart at karssen.org Thu May 1 19:04:12 2014 From: lennart at karssen.org (L.C. Karssen) Date: Thu, 01 May 2014 19:04:12 +0200 Subject: [GenABEL-dev] probabel big endian support In-Reply-To: <1897-535fc000-21-6a994800@159572789> References: <1897-535fc000-21-6a994800@159572789> Message-ID: <53627E8C.2040407@karssen.org> Dear Jurica, On 29-04-14 17:05, Jurica Stanojkovic wrote: > Dear Karssen, > >>> What is the best course of action for supporting probabel on big endian? >>> Should *.fvi, *.fvd files allways be in little endian format (than >>> DatABEL needs to be changed to always create little endian files)? >>> Or can *.fvd, *.fvi files be replaced with big endian files for big >>> endian build? > >>I would say that ideally the files need only to be created once and then >>usable on all systems. Especially since these files are usually large >>and converting from text format to .fvi/.fvd takes quite a while. > > If I had to change some values in text format, would I have to generate > again fvd/fvi files? Yes. And for that you would either need R + GenABEL and DatABEL, or the tools in filevector's fvutil directory [1]. > Does one when working with ProbABEL has to change those files often? No. The workflow is as follows: 1) genetic data (let's say 1e5 to 1e6 data points) are 'imputed' to a reference set. That means that through statistical inference based on a reference set the genetic data is 'interpolated' to ~30e6 data points (SNPs). These data points are floating point values between 0.0 and 2.0, so called 'dosages', usually with ~3 digits after the decimal. This process takes several days on a multi-node cluster for, for example, a sample size of 7000 people. 2) This imputation process results in text files of N_people columns and N_SNPs rows. In order to parallelise the imputation process for 30e6 genetic SNPs, the files are usually split into sections of a few million SNPs. Usually these text files are gzipped. 
In total these files are a few hundred GB in size. 3) The purpose of converting to filevector format is that with .fv? files we don't need to load the text files into RAM, but can quickly access a given row (or column). For the analysis performed by ProbABEL we want to read the SNP dosages for all individuals for a given SNP. Basically ProbABEL is one big for-loop over all 30e6 SNPs. 4) So, in a real-life situation a bioinformatician would run the imputations, and convert the data to filevector format once for the whole research group (and store them somewhere centrally). For 7000 people and 30e6 SNPs the DatABEL files (which are not compressed) can get ~ 1TB in size. That is why I don't think people will transfer these files a lot. They are stored centrally for all users to use. Transfer to a different server happens, but not often. Transfer to a machine with a different architecture will be even rarer. > If we do byte-swap on the run for every data in the fvd/fvi file would > that be also time consuming? > I understand that user then do not need to wait files to generate again > on big endian, > but same task (run) will last longer on big-endian machine than on > little-endian one? > Do I understand correctly that you are talking about on-the-fly conversion? So while someone runs ProbABEL and we detect a big-endian machine, conversion is done while reading the data? That may be a better option than the conversion tool I mentioned below for people who are low on disk space. On the other hand, given that usually several users use the same filevector files, each of those users pays the penalty, and currently ProbABEL is already mostly limited by reading the data from disk. Does anyone have an idea how much time an endianness conversion would add to the reading of the data?
>>This, however, would require diving into the filevector and the DatABEL >>code (filevector or libfilevector is the name of the 'backend' code in >>which the .fvd/.fvi files are 'defined'; both DatABEL and ProbABEL use >>that code when dealing with .fvi/.fvd files). I don't have very much >>experience with either code base, but could probably have a look and >>give you some pointers. > > I tried to work around this and got some results, but a I did not manage > to find every place in code where endian swap is needed. > I am currently busy with other work, but i will soon look at this again. > >>Jurica, can you tell us a bit more about why you are using a MIPS >>machine for your work with ProbABEL? And do you think it would be a >>common task to move these files between machines with different >>architectures at your site? > > I work on supporting mips/mipsel for Debian sid. > I have access to mips and mipsel boards and can help with bigendian support. > But I do not use ProbABEL actively. OK, good to know. Hopefully the explanation of typical usage I gave above will give you an idea of how ProbABEL is used. > >>Maybe a converter from big to little and vice versa would be the easiest >>solution? I guess such a conversion can be done rather quick. The >>downside would be that it (at least temporarily) requires double the >>disk space. >>Such a converter could be part of the fvutils and/or of DatABEL, for >>example. > > Maybe this could be a good solution, presuming that this would be faster > then just converting from text to fileVector format? Good point. I don't know what would be faster, but my feeling is that a conversion of binary data to binary data is faster than conversion from ASCII text to binary. > I will have to look closer how data is converted and writen from text to > fvd/fvi in order to be able to convert them to different endian. 
> > There is also a option to always create a fvd/fvi in both endian formats, > or to create some universal file that have data in both endians inside. Of course, if we simply confine ourselves to getting ProbABEL to run on all Debian architectures, then adding big endian .fv? files is definitely an option (although we would need some way of determining which .fv? files to use given an architecture). Then we could instruct the users on how to deal with this in the manual. Best, Lennart. [1] https://r-forge.r-project.org/scm/viewvc.php/pkg/filevector/?root=genabel > > Regards, > Jurica > > -------- Original Message -------- > Subject: Re: [GenABEL-dev] probabel big endian support > Date: Saturday, April 26, 2014 22:17 CEST > From: "L.C. Karssen" > To: genabel-devel at lists.r-forge.r-project.org > References: <896-53591700-f-3be4eec0 at 227853676> > >> Dear Jurica, >> >> On 24-04-14 15:52, Jurica Stanojkovic wrote: >> > Dear list, >> > >> > I have tried building package probabel on mips big endian. >> >> That is great to hear! As far as I know, none of the current developers >> have access to such a machine. >> >> > It looks like that inputfiles/*.fvd and inputfiles/*.fvi are created >> > on little endian machine and are not working on big endian ones. >> >> That is correct, we found out >> >> > >> > I have tried to create them on big endian mips, and replace ones that >> > came with source package with the ones that I have created. >> > The package was built with new files without an error. >> >> That is good news. So GenABEL and DatABEL work on big-endian machines.
>> >> > >> > I used following command to create files: >> > library(GenABEL) >> > library(DatABEL) >> > fvdose <- mach2databel(imputedg="./checks/inputfiles/test.mldose", >> > mlinfo="./checks/inputfiles/test.mlinfo", >> > outfile="./checks/inputfiles/test.dose") >> > fvprob <- mach2databel(imputedg="./checks/inputfiles/test.mlprob", >> > mlinfo="./checks/inputfiles/test.mlinfo", >> > outfile="./checks/inputfiles/test.prob", isprob=TRUE) >> > mmdose <- >> > mach2databel(imputedg="./checks/inputfiles/mmscore_gen.mldose", >> > mlinfo="./checks/inputfiles/mmscore_gen.mlinfo", >> > outfile="./checks/inputfiles/mmscore_gen.dose") >> > mmprob <- >> > mach2databel(imputedg="./checks/inputfiles/mmscore_gen.mlprob", >> > mlinfo="./checks/inputfiles/mmscore_gen.mlinfo", >> > outfile="./checks/inputfiles/mmscore_gen.prob", isprob=TRUE) >> > >> > I am new to ProbABEL, GenABEL, DatABEL so could someone please help me >> > with following questions: >> > >> > What is the best course of action for supporting probabel on big endian? >> > Should *.fvi, *.fvd files allways be in little endian format (than >> > DatABEL needs to be changed to always create little endian files)? >> > Or can *.fvd, *.fvi files be replaced with big endian files for big >> > endian build? >> >> I would say that ideally the files need only to be created once and then >> usable on all systems. Especially since these files are usually large >> and converting from text format to .fvi/.fvd takes quite a while. >> >> This, however, would require diving into the filevector and the DatABEL >> code (filevector or libfilevector is the name of the 'backend' code in >> which the .fvd/.fvi files are 'defined'; both DatABEL and ProbABEL use >> that code when dealing with .fvi/.fvd files). I don't have very much >> experience with either code base, but could probably have a look and >> give you some pointers. >> >> > >> > Is it necessary to be able to use *.fvd *.fvi files created on a >> > different endian system? 
>> >> On the other hand, how often will people transfer these files to >> machines of different architectures? >> >> Jurica, can you tell us a bit more about why you are using a MIPS >> machine for your work with ProbABEL? And do you think it would be a >> common task to move these files between machines with different >> architectures at your site? >> >> Maybe a converter from big to little and vice versa would be the easiest >> solution? I guess such a conversion can be done rather quick. The >> downside would be that it (at least temporarily) requires double the >> disk space. >> Such a converter could be part of the fvutils and/or of DatABEL, for >> example. >> >> > >> > I am willing to work on adding big endian support and I will >> appreciate> any help in determining the right course of action in >> resolving this >> > problem. >> >> Thank you for your time and willingness to help! It is very much >> appreciated. We're a small group of developers, but we'll try to help as >> much as we can. >> >> >> Best, >> >> Lennart. >> >> > >> > Regards, >> > Jurica >> > >> > >> > _______________________________________________ >> > genabel-devel mailing list >> > genabel-devel at lists.r-forge.r-project.org >> > >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/genabel-devel >> > >> >> -- >> *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* >> L.C. Karssen >> Utrecht >> The Netherlands >> >> lennart at karssen.org >> http://blog.karssen.org >> GPG key ID: A88F554A >> -*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*- >> > > -- *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* L.C. Karssen Utrecht The Netherlands lennart at karssen.org http://blog.karssen.org GPG key ID: A88F554A -*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*- -- *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* L.C. 
Karssen Utrecht The Netherlands lennart at karssen.org http://blog.karssen.org GPG key ID: A88F554A -*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*- -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 213 bytes Desc: OpenPGP digital signature URL: From fabregat at aices.rwth-aachen.de Thu May 1 19:25:00 2014 From: fabregat at aices.rwth-aachen.de (Diego Fabregat) Date: Thu, 1 May 2014 19:25:00 +0200 Subject: [GenABEL-dev] Proposal to move to Github In-Reply-To: <535EB58D.6010900@karssen.org> References: <20140428094937.65E8B186FC6@r-forge.r-project.org> <535E2774.6030606@karssen.org> <535E422F.4080402@gmail.com> <535E69D7.1050005@karssen.org> <535E6BDB.3000206@karssen.org> <535EA05E.40201@gmail.com> <535EB58D.6010900@karssen.org> Message-ID: <5362836C.4000105@aices.rwth-aachen.de> I like the idea of moving to git. I have no experience with github, but I'm using git on an almost daily basis (we have our own git server in our group for code and papers). I would have no problem in uploading OmicABEL to a git repo. Does dropping R-forge have a (bad) impact on the visibility of the project or on the user experience (e.g., installation of R packages)? On 04/28/2014 10:09 PM, L.C. Karssen wrote: > Dear Maarten, dear all, > > Moving to github... Hmm... That is quite a decision, so I've renamed the > subject to better reflect the discussion. I've also dropped the older > e-mails from the bottom of the thread. > > First off, are there any people that have experience with git and/or > github? I've got some git experience (still learning), but no real > experience with github. > > I agree with Maarten that SVN is showing its age. As he indicates things > like branching are much easier in git. Moreover, since I'm travelling > regularly being able to work without internet connection is a pro. 
> > On the other hand, moving to git (whether github or elsewhere) means > leaving R-forge, which is our well-known infrastructure. Furthermore, > such a move operation will cost quite some time, I guess. Moving all > bugs, features, etc... If we decide to move we should plan well and not > rush. And then the current developers will need to learn git if they > don't already know how to use it. > > One thing I think we should definitely do is migrate slowly, package by > package. Given that Maarten is positive about such a move and that I am > in a bit of limbo but not fully against, it seems logical that ProbABEL > is the first package to try such a migration. > > > Looking forward to your comments! > > > Lennart. > > > On 28-04-14 20:39, Maarten Kooyman wrote: >> Dear all, >> >> I think it is easier to use for code review github: >> >> Please check to get a impression >> :https://github.com/jquery/jquery/pull/1241/files >> >> I think we should reconsider an other the software version system: the >> current system is not up to date to current usability. Bug tracking and >> branching is quite hard in terms of usability. Please have a look at >> github.com to get a impression what is possible. >> >> Kind regards, >> >> Maarten >> >> > -- > *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* > L.C. Karssen > Utrecht > The Netherlands > > lennart at karssen.org > http://blog.karssen.org > GPG key ID: A88F554A > -*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*- > > > > _______________________________________________ > genabel-devel mailing list > genabel-devel at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/genabel-devel -------------- next part -------------- An HTML attachment was scrubbed... URL: From lennart at karssen.org Fri May 2 09:56:44 2014 From: lennart at karssen.org (L.C. 
Karssen) Date: Fri, 02 May 2014 09:56:44 +0200 Subject: [GenABEL-dev] Proposal to move to Github In-Reply-To: <5362836C.4000105@aices.rwth-aachen.de> References: <20140428094937.65E8B186FC6@r-forge.r-project.org> <535E2774.6030606@karssen.org> <535E422F.4080402@gmail.com> <535E69D7.1050005@karssen.org> <535E6BDB.3000206@karssen.org> <535EA05E.40201@gmail.com> <535EB58D.6010900@karssen.org> <5362836C.4000105@aices.rwth-aachen.de> Message-ID: <53634FBC.6080504@karssen.org> Hi Diego, On 01-05-14 19:25, Diego Fabregat wrote: > I like the idea of moving to git. I have no experience with github, but > I'm using git on an almost daily basis (we have our own git server in > our group for code and papers). I would have no problem in uploading > OmicABEL to a git repo. Thanks. That's good to know. > > Does dropping R-forge have a (bad) impact on the visibility of the > project or on the user experience (e.g., installation of R packages)? It will not affect R package installation. Even though R-forge (tries to) build packages and makes them available, regular users will download packages from CRAN. Uploading a package to CRAN is something we do manually. As to visibility, yes, I think that will be affected. Maybe not directly, but many people/potential developers would assume we are on R-forge and on the R-forge main page we are regularly listed as one of the most active projects of the week (including this week). This list seems to be (partially?) powered by SVN activity. From an infrastructure point of view I would like to keep using R-forge (although minus SVN): things like the mailing lists and bug/feature trackers are now nicely centralised. Although, if we really move to github, we might move the trackers there as well... I don't think github has mailing lists (including archives), right? Does anybody have any feelings about the fact that github is run by a company, whereas R-forge is run by academia? 
"Conventional wisdom" has it that a company could close down (parts of) the site or put them behind a paywall (necessitating moving to another hoster), whereas a site run by academia would be free/open "forever". Personally I don't think it is a big issue here, but others may have different opinions. Lennart. > > On 04/28/2014 10:09 PM, L.C. Karssen wrote: >> Dear Maarten, dear all, >> >> Moving to github... Hmm... That is quite a decision, so I've renamed the >> subject to better reflect the discussion. I've also dropped the older >> e-mails from the bottom of the thread. >> >> First off, are there any people that have experience with git and/or >> github? I've got some git experience (still learning), but no real >> experience with github. >> >> I agree with Maarten that SVN is showing its age. As he indicates things >> like branching are much easier in git. Moreover, since I'm travelling >> regularly being able to work without internet connection is a pro. >> >> On the other hand, moving to git (whether github or elsewhere) means >> leaving R-forge, which is our well-known infrastructure. Furthermore, >> such a move operation will cost quite some time, I guess. Moving all >> bugs, features, etc... If we decide to move we should plan well and not >> rush. And then the current developers will need to learn git if they >> don't already know how to use it. >> >> One thing I think we should definitely do is migrate slowly, package by >> package. Given that Maarten is positive about such a move and that I am >> in a bit of limbo but not fully against, it seems logical that ProbABEL >> is the first package to try such a migration. >> >> >> Looking forward to your comments! >> >> >> Lennart.
>> >> >> On 28-04-14 20:39, Maarten Kooyman wrote: >>> Dear all, >>> >>> I think it is easier to use for code review github: >>> >>> Please check to get a impression >>> :https://github.com/jquery/jquery/pull/1241/files >>> >>> I think we should reconsider an other the software version system: the >>> current system is not up to date to current usability. Bug tracking and >>> branching is quite hard in terms of usability. Please have a look at >>> github.com to get a impression what is possible. >>> >>> Kind regards, >>> >>> Maarten >>> >>> >> -- >> *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* >> L.C. Karssen >> Utrecht >> The Netherlands >> >> lennart at karssen.org >> http://blog.karssen.org >> GPG key ID: A88F554A >> -*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*- >> >> >> >> _______________________________________________ >> genabel-devel mailing list >> genabel-devel at lists.r-forge.r-project.org >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/genabel-devel > > > > _______________________________________________ > genabel-devel mailing list > genabel-devel at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/genabel-devel > -- *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* L.C. Karssen Utrecht The Netherlands lennart at karssen.org http://blog.karssen.org GPG key ID: A88F554A -*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*- -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 213 bytes Desc: OpenPGP digital signature URL: From yurii.aulchenko at gmail.com Fri May 2 10:27:34 2014 From: yurii.aulchenko at gmail.com (Yurii Aulchenko) Date: Fri, 2 May 2014 15:27:34 +0700 Subject: [GenABEL-dev] Proposal to move to Github In-Reply-To: <5362836C.4000105@aices.rwth-aachen.de> References: <20140428094937.65E8B186FC6@r-forge.r-project.org> <535E2774.6030606@karssen.org> <535E422F.4080402@gmail.com> <535E69D7.1050005@karssen.org> <535E6BDB.3000206@karssen.org> <535EA05E.40201@gmail.com> <535EB58D.6010900@karssen.org> <5362836C.4000105@aices.rwth-aachen.de> Message-ID: On Fri, May 2, 2014 at 12:25 AM, Diego Fabregat < fabregat at aices.rwth-aachen.de> wrote: > I like the idea of moving to git. I have no experience with github, but > I'm using git on an almost daily basis (we have our own git server in our > group for code and papers). I would have no problem in uploading OmicABEL > to a git repo. > > Does dropping R-forge have a (bad) impact on the visibility of the project > or on the user experience (e.g., installation of R packages)? > In my opinion - not really (visibility: I do not think we get many users because they've found us at r-forge; also we can keep the account and make links from there; as for installation, the argument partly holds only for R-packages). What we need to think is of course how we keep/move all parts such as a) code b) trackers c) project docs such as code guidelines To me it seems that the idea to migrate few packages first is the most reasonable; few are likely to stay at r-forge for long Yurii > > On 04/28/2014 10:09 PM, L.C. Karssen wrote: > > Dear Maarten, dear all, > > Moving to github... Hmm... That is quite a decision, so I've renamed the > subject to better reflect the discussion. I've also dropped the older > e-mails from the bottom of the thread. > > First off, are there any people that have experience with git and/or > github? 
I've got some git experience (still learning), but no real > experience with github. > > I agree with Maarten that SVN is showing its age. As he indicates things > like branching are much easier in git. Moreover, since I'm travelling > regularly being able to work without internet connection is a pro. > > On the other hand, moving to git (whether github or elsewhere) means > leaving R-forge, which is our well-known infrastructure. Furthermore, > such a move operation will cost quite some time, I guess. Moving all > bugs, features, etc... If we decide to move we should plan well and not > rush. And then the current developers will need to learn git if they > don't already know how to use it. > > One thing I think we should definitely do is migrate slowly, package by > package. Given that Maarten is positive about such a move and that I am > in a bit of limbo but not fully against, it seems logical that ProbABEL > is the first package to try such a migration. > > > Looking forward to your comments! > > > Lennart. > > > On 28-04-14 20:39, Maarten Kooyman wrote: > > Dear all, > > I think it is easier to use for code review github: > > Please check to get a impression > :https://github.com/jquery/jquery/pull/1241/files > > I think we should reconsider an other the software version system: the > current system is not up to date to current usability. Bug tracking and > branching is quite hard in terms of usability. Please have a look atgithub.com to get a impression what is possible. > > Kind regards, > > Maarten > > > > -- > *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* > L.C. 
Karssen > Utrecht > The Netherlands > lennart at karssen.org http://blog.karssen.org > GPG key ID: A88F554A > -*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*- > > > > _______________________________________________ > genabel-devel mailing list genabel-devel at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/genabel-devel > > > > _______________________________________________ > genabel-devel mailing list > genabel-devel at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/genabel-devel > -- ----------------------------------------------------- Yurii S. Aulchenko [ LinkedIn ] [ Twitter ] [ Blog ] -------------- next part -------------- An HTML attachment was scrubbed... URL: From lennart at karssen.org Fri May 9 17:03:56 2014 From: lennart at karssen.org (L.C. Karssen) Date: Fri, 09 May 2014 17:03:56 +0200 Subject: [GenABEL-dev] mmscore_regression() in ProbABEL: code review Message-ID: <536CEE5C.3000207@karssen.org> Dear list (and Maarten in particular), I've written some Doxygen documentation for the mmscore_regression() function that Maarten recently created. While doing so I changed some of the variable names to be a bit more understandable and documented what the function does according to Diego's suggestion on this list some time ago. I also slightly changed the variables that are created in the function to get rid of one transpose() action. My question is: could you have a look at the code (lines 57--62) in the attached file and see if this still preserves the vectorisation potential as mentioned in Maarten's comment in the code? The function can be found in reg1.cpp. Thanks a lot, Lennart. -- *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* L.C. Karssen Utrecht The Netherlands lennart at karssen.org http://blog.karssen.org GPG key ID: A88F554A -*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*- -------------- next part -------------- A non-text attachment was scrubbed...
Name: mmscore.cpp Type: text/x-c++src Size: 2667 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 213 bytes Desc: OpenPGP digital signature URL: From kooyman at gmail.com Sun May 11 11:49:33 2014 From: kooyman at gmail.com (Maarten Kooyman) Date: Sun, 11 May 2014 11:49:33 +0200 Subject: [GenABEL-dev] mmscore_regression() in ProbABEL: code review In-Reply-To: <536CEE5C.3000207@karssen.org> References: <536CEE5C.3000207@karssen.org> Message-ID: <536F47AD.3070603@gmail.com> Hi Lennart, This seems alright to me. The most time-consuming step is multiplication of the variance-covariance matrix, which I optimised. Since this is the same, I do not expect any big change in performance. With every commit done for ProbABEL my personal Jenkins instance benchmarks 3 scenarios: Palinear with DatABEL files (Npeople=3485,Npredictor=1,Nsnp=33815,correct for sex and age) Palinear with mldose files (Npeople=3485,Npredictor=1,Nsnp=33815,correct for sex and age) Palinear with DatABEL files and mmscore (Npeople=500,Npredictor=1,Nsnp=1000). When there is a slowdown (or speed up!) I will inform you. Kind regards, Maarten On 09-05-14 17:03, L.C. Karssen wrote: > Dear list (and Maarten in particular), > > I've written some Doxygen documentation for the mmscore_regression() > function that Maarten recently created. While doing so I changed some of > the variable names to be a bit more understandable and documented what > the function does according to Diego's suggestion on this list some time > ago. I also slightly changed the variables that are created in the > function to get rid of one transpose() action. > > My question is: could you have a look at the code (lines 57--62) in the > attached file and see if this still preservers the vectorisation > potential as mentioned in Maarten's comment in the code? > > The function can be found in reg1.cpp.
> > > Thanks a lot, > > Lennart. > > > _______________________________________________ > genabel-devel mailing list > genabel-devel at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/genabel-devel -------------- next part -------------- An HTML attachment was scrubbed... URL: From lennart at karssen.org Sun May 11 14:12:00 2014 From: lennart at karssen.org (L.C. Karssen) Date: Sun, 11 May 2014 14:12:00 +0200 Subject: [GenABEL-dev] mmscore_regression() in ProbABEL: code review In-Reply-To: <536F47AD.3070603@gmail.com> References: <536CEE5C.3000207@karssen.org> <536F47AD.3070603@gmail.com> Message-ID: <536F6910.2010601@karssen.org> Thanks Maarten! I just committed my changes. Let me know if this changes the benchmark (for better or worse). Best, Lennart. On 11-05-14 11:49, Maarten Kooyman wrote: > Hi Lennart, > > This seems alright to me. The most time consuming step is multiplication > of the variance-covariance matrix, which I optimised. Since this is the > same, I do not expect any big change in performance. With every commit > done for ProbABEL my personal Jenkins instance benchmark 3 scenario's: > > Palinear with DatABEL files > ((Npeople=3485,Npredictor=1,Nsnp=33815,correct for sex and age) > Palinear with mldose files (Npeople=3485,Npredictor=1,Nsnp=33815,correct > for sex and age) > Palinear with DatABEL files and mmscore( > Npeople=500,Npredictor=1,Nsnp=1000). > > When there is a slow down (or speed up!) I will inform you. > > Kind regards, > > Maarten > > > On 09-05-14 17:03, L.C. Karssen wrote: >> Dear list (and Maarten in particular), >> >> I've written some Doxygen documentation for the mmscore_regression() >> function that Maarten recently created. While doing so I changed some of >> the variable names to be a bit more understandable and documented what >> the function does according to Diego's suggestion on this list some time >> ago. 
I also slightly changed the variables that are created in the >> function to get rid of one transpose() action. >> >> My question is: could you have a look at the code (lines 57--62) in the >> attached file and see if this still preservers the vectorisation >> potential as mentioned in Maarten's comment in the code? >> >> The function can be found in reg1.cpp. >> >> >> Thanks a lot, >> >> Lennart. >> >> >> _______________________________________________ >> genabel-devel mailing list >> genabel-devel at lists.r-forge.r-project.org >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/genabel-devel > > > > _______________________________________________ > genabel-devel mailing list > genabel-devel at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/genabel-devel > -- *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* L.C. Karssen Utrecht The Netherlands lennart at karssen.org http://blog.karssen.org GPG key ID: A88F554A -*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*- -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 213 bytes Desc: OpenPGP digital signature URL: From alvaro.frank at rwth-aachen.de Tue May 13 16:03:02 2014 From: alvaro.frank at rwth-aachen.de (Frank, Alvaro Jesus) Date: Tue, 13 May 2014 14:03:02 +0000 Subject: [GenABEL-dev] t-statistic, p-values from source data Message-ID: <244CF001646FF74FB34F372310A332C57B10BF@MBX2.rwth-ad.de> Hi All, I apologize for any noise this may produce. Adding p-values, std errors and by default t-tests/statistics to omicabelnomm has been more than difficult. The reason for this is that I cannot seem to find a unified consensus of what the user wants in terms of statistics. Some do t-stat on X, others on Y, but all expect a p-val from the linear regression that not even they know where it comes from.
I have good handling of all formulas needed, but the final-pvalue requires a t-test from a sample data. Is that sample data the residual of the Y-XB or just the produced factors B or simply the data X or Y? Some of this make sense while others make no sense at all. Another concern of mine is that some of the data used may not have enough significant digits beyond the 3rd digit. IF the p-value is supposed to come from the residual, this residual could be good or even bad depending on the conditioning of X. If the residual is very close to machine precision, using it for a pvalue becomes not at all advisable because of significant digits, or am I wrong about this? I feel I am missing something in terms of workflow or formulas or even purpose/usage of the regression and the p-value. Any help would be appreciated. -Alvaro -------------- next part -------------- An HTML attachment was scrubbed... URL: From erindunn at pngu.mgh.harvard.edu Wed May 14 01:20:15 2014 From: erindunn at pngu.mgh.harvard.edu (Erin C. Dunn) Date: Tue, 13 May 2014 19:20:15 -0400 Subject: [GenABEL-dev] Question about probABLE Message-ID: <427D80A8-DB6F-4D8B-B911-4BD61AB2B845@pngu.mgh.harvard.edu> Good Afternoon, I am interested in possibly using probABLE for some genome-wide GxE analyses I am planning to run. I was wondering if you could tell me whether probABLE would allow me to: (1) run Pete Kraft's d2f joint test (for the main genetic effect and test for GxE); (2) use dosage data; (3) have a continuous outcome; (4) and obtain robust SE. It looks like probABLE would enable me to do all of these things. I was hoping to have some verification of this before I embark down this path, as I haven't used this software before and would imagine the learning curve would be somewhat steep. Any insights you might be able to share would be immensely helpful. Thanks, Erin ____________________________ Erin C. 
Dunn, ScD, MPH Post-Doctoral Research Fellow Psychiatric and Neurodevelopmental Genetics Unit Center for Human Genetic Research Massachusetts General Hospital 185 Cambridge Street Simches, Room 6.252 Boston, MA 02114 erindunn at pngu.mgh.harvard.edu 617-726-9387 (work phone) 617-726-0830 (work fax) To schedule a meeting, please visit my doodle poll: http://doodle.com/erindunn The information in this e-mail is intended only for the person to whom it is addressed. If you believe this e-mail was sent to you in error and the e-mail contains patient information, please contact the Partners Compliance HelpLine at http://www.partners.org/complianceline . If the e-mail was sent to you in error but does not contain patient information, please contact the sender and properly dispose of the e-mail. -------------- next part -------------- An HTML attachment was scrubbed... URL: From yurii.aulchenko at gmail.com Wed May 14 08:52:24 2014 From: yurii.aulchenko at gmail.com (Yurii Aulchenko) Date: Wed, 14 May 2014 08:52:24 +0200 Subject: [GenABEL-dev] Question about probABLE In-Reply-To: <427D80A8-DB6F-4D8B-B911-4BD61AB2B845@pngu.mgh.harvard.edu> References: <427D80A8-DB6F-4D8B-B911-4BD61AB2B845@pngu.mgh.harvard.edu> Message-ID: <6589491268731029869@unknownmsgid> Dear Erin, This is a question for forum.genabel.org rather than this list. It is ProbABEL, not probABLE. Answers in short: yes. Yurii ---------------------- Yurii Aulchenko (sent from mobile device) On May 14, 2014, at 8:46 AM, "Erin C. Dunn" wrote: Good Afternoon, I am interested in possibly using probABLE for some genome-wide GxE analyses I am planning to run. I was wondering if you could tell me whether probABLE would allow me to: (1) run Pete Kraft's d2f joint test (for the main genetic effect and test for GxE); (2) use dosage data; (3) have a continuous outcome; (4) and obtain robust SE. It looks like probABLE would enable me to do all of these things.
I was hoping to have some verification of this before I embark down this path, as I haven't used this software before and would imagine the learning curve would be somewhat steep. Any insights you might be able to share would be immensely helpful. Thanks, Erin ____________________________ Erin C. Dunn, ScD, MPH Post-Doctoral Research Fellow Psychiatric and Neurodevelopmental Genetics Unit Center for Human Genetic Research Massachusetts General Hospital 185 Cambridge Street Simches, Room 6.252 Boston, MA 02114 erindunn at pngu.mgh.harvard.edu 617-726-9387 (work phone) 617-726-0830 (work fax) To schedule a meeting, please visit my doodle poll: http://doodle.com/erindunn The information in this e-mail is intended only for the person to whom it is addressed. If you believe this e-mail was sent to you in error and the e-mail contains patient information, please contact the Partners Compliance HelpLine at http://www.partners.org/complianceline . If the e-mail was sent to you in error but does not contain patient information, please contact the sender and properly dispose of the e-mail. _______________________________________________ genabel-devel mailing list genabel-devel at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/genabel-devel -------------- next part -------------- An HTML attachment was scrubbed... URL: From alvaro.frank at rwth-aachen.de Thu May 15 17:01:12 2014 From: alvaro.frank at rwth-aachen.de (Frank, Alvaro Jesus) Date: Thu, 15 May 2014 15:01:12 +0000 Subject: [GenABEL-dev] automake Message-ID: <244CF001646FF74FB34F372310A332C57B83DC@MBX5.rwth-ad.de> Hi Lennart, I need some help with making the installation of omicabelnomm on Ubuntu and similar systems possible for end users. Apparently some pre-compiled BLAS libraries do not get compiled with support for OpenMP. After compiling an OpenBLAS I wish to keep using the automake path of installation. But I'm having problems getting automake to pick up the new BLAS library.
Anyone aware of how to do this? -Alvaro -------------- next part -------------- An HTML attachment was scrubbed... URL: From lennart at karssen.org Thu May 15 20:03:13 2014 From: lennart at karssen.org (L.C. Karssen) Date: Thu, 15 May 2014 20:03:13 +0200 Subject: [GenABEL-dev] automake In-Reply-To: <244CF001646FF74FB34F372310A332C57B83DC@MBX5.rwth-ad.de> References: <244CF001646FF74FB34F372310A332C57B83DC@MBX5.rwth-ad.de> Message-ID: <53750161.6080408@karssen.org> Hi Alvaro, On 15-05-14 17:01, Frank, Alvaro Jesus wrote: > Hi Lennart, > > I need some help with making the installation of omicabelnomm with > ubuntu and similar systems possible for end users. Aparently some > pre-compiled blas do not get compiled with support for openmp. After > compiling an openblas i wish to keep using the automake path of > installation. I don't think I understand you: what exactly do you mean with "the automake path of installation"? > But I m having problems having automake to grab the new > blas library. Normally you would set the LD_LIBRARY_PATH environment variable if you want to point to a new (shared) library. Would that solve your problem as well? Lennart. > > Anyone aware of how to do this? > > -Alvaro > > > _______________________________________________ > genabel-devel mailing list > genabel-devel at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/genabel-devel > -- *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* L.C. Karssen Utrecht The Netherlands lennart at karssen.org http://blog.karssen.org GPG key ID: A88F554A -*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*- -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 213 bytes Desc: OpenPGP digital signature URL: From lennart at karssen.org Mon May 19 16:06:49 2014 From: lennart at karssen.org (L.C. 
Karssen) Date: Mon, 19 May 2014 16:06:49 +0200 Subject: [GenABEL-dev] Best way to 'round' small floats Message-ID: <537A0FF9.5070803@karssen.org> Dear list, While working on adding p-values to the ProbABEL output I ran into the fact that (at least in the checks) we end up with small negative chi^2 values (e.g. -1.14e-13) for some models (e.g. dominant). Of course this shouldn't happen, but my guess is it's a numerical problem caused by subtracting the two likelihoods for the Likelihood Ratio Test. I'd like to add a bit of code like this to mitigate this issue: if (chi2 < 0 && std::abs(chi2) < EPS) { chi2 = 0; } with EPS set to e.g. 1e-9. This won't harm any analysis, since we're only interested in chi^2 values away from zero, but I was wondering if there is a more appropriate way to do this. Thanks, Lennart. -- *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* L.C. Karssen Utrecht The Netherlands lennart at karssen.org http://blog.karssen.org GPG key ID: A88F554A -*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*- -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 213 bytes Desc: OpenPGP digital signature URL: From lennart at karssen.org Mon May 19 18:38:28 2014 From: lennart at karssen.org (L.C. Karssen) Date: Mon, 19 May 2014 18:38:28 +0200 Subject: [GenABEL-dev] automake In-Reply-To: <53750161.6080408@karssen.org> References: <244CF001646FF74FB34F372310A332C57B83DC@MBX5.rwth-ad.de> <53750161.6080408@karssen.org> Message-ID: <537A3384.6060406@karssen.org> Hi Alvaro, I just realised that my previous answer may not be what you were looking for. The LD_LIBRARY_PATH variable is used to search for shared libraries at run time. Instead you probably mean how to tell the compiler where the libraries are at compile time. For that you should add the CXXFLAGS option -L followed by the path. Best, Lennart. On 15-05-14 20:03, L.C.
Karssen wrote: > Hi Alvaro, > > On 15-05-14 17:01, Frank, Alvaro Jesus wrote: >> Hi Lennart, >> >> I need some help with making the installation of omicabelnomm with >> ubuntu and similar systems possible for end users. Aparently some >> pre-compiled blas do not get compiled with support for openmp. After >> compiling an openblas i wish to keep using the automake path of >> installation. > > I don't think I understand you: what exactly do you mean with "the > automake path of installation"? > >> But I m having problems having automake to grab the new >> blas library. > > Normally you would set the LD_LIBRARY_PATH environment variable if you > want to point to a new (shared) library. Would that solve your problem > as well? > > > Lennart. > >> >> Anyone aware of how to do this? >> >> -Alvaro >> >> >> _______________________________________________ >> genabel-devel mailing list >> genabel-devel at lists.r-forge.r-project.org >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/genabel-devel >> > > > > _______________________________________________ > genabel-devel mailing list > genabel-devel at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/genabel-devel > -- *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* L.C. Karssen Utrecht The Netherlands lennart at karssen.org http://blog.karssen.org GPG key ID: A88F554A -*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*- -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 213 bytes Desc: OpenPGP digital signature URL: From alvaro.frank at rwth-aachen.de Wed May 21 09:26:20 2014 From: alvaro.frank at rwth-aachen.de (Frank, Alvaro Jesus) Date: Wed, 21 May 2014 07:26:20 +0000 Subject: [GenABEL-dev] automake In-Reply-To: <537A3384.6060406@karssen.org> References: <244CF001646FF74FB34F372310A332C57B83DC@MBX5.rwth-ad.de> <53750161.6080408@karssen.org>,<537A3384.6060406@karssen.org> Message-ID: <244CF001646FF74FB34F372310A332C57B936C@MBX5.rwth-ad.de> Thank you for the info. My problems with these custom-compiled libraries seem hard to solve once a normal shared one is installed. BLAS has to be compiled with OPENMP=1 enabled, and the default binaries for Ubuntu seem not to have that. I will try your suggestion and will follow up soon. ________________________________________ From: genabel-devel-bounces at lists.r-forge.r-project.org [genabel-devel-bounces at lists.r-forge.r-project.org] on behalf of L.C. Karssen [lennart at karssen.org] Sent: Monday, May 19, 2014 6:38 PM To: genabel-devel at lists.r-forge.r-project.org Subject: Re: [GenABEL-dev] automake Hi Alvaro, I just realised that my previous answer may not be what you were looking for. The LD_LIBRARY_PATH variable is used to search for shared libraries at run time. Instead you probably mean how to tell where the libraries is at compile time. For that you should add the CXXFLAGS option -L followed by the path. Best, Lennart. On 15-05-14 20:03, L.C. Karssen wrote: > Hi Alvaro, > > On 15-05-14 17:01, Frank, Alvaro Jesus wrote: >> Hi Lennart, >> >> I need some help with making the installation of omicabelnomm with >> ubuntu and similar systems possible for end users. Aparently some >> pre-compiled blas do not get compiled with support for openmp. After >> compiling an openblas i wish to keep using the automake path of >> installation.
> > I don't think I understand you: what exactly do you mean with "the > automake path of installation"? > >> But I m having problems having automake to grab the new >> blas library. > > Normally you would set the LD_LIBRARY_PATH environment variable if you > want to point to a new (shared) library. Would that solve your problem > as well? > > > Lennart. > >> >> Anyone aware of how to do this? >> >> -Alvaro >> >> >> _______________________________________________ >> genabel-devel mailing list >> genabel-devel at lists.r-forge.r-project.org >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/genabel-devel >> > > > > _______________________________________________ > genabel-devel mailing list > genabel-devel at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/genabel-devel > -- *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* L.C. Karssen Utrecht The Netherlands lennart at karssen.org http://blog.karssen.org GPG key ID: A88F554A -*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*- From alvaro.frank at rwth-aachen.de Wed May 21 10:02:25 2014 From: alvaro.frank at rwth-aachen.de (Frank, Alvaro Jesus) Date: Wed, 21 May 2014 08:02:25 +0000 Subject: [GenABEL-dev] Best way to 'round' small floats In-Reply-To: <537A0FF9.5070803@karssen.org> References: <537A0FF9.5070803@karssen.org> Message-ID: <244CF001646FF74FB34F372310A332C57B9399@MBX5.rwth-ad.de> Hi All, does chi^2 refer to the std error used to divide beta to calculate the t-test value? I am not familiar with this kind of t-statistic using the Likelihood Ratio Test; the one I am familiar with is the one presented in the attached paper. It seems weird to me that a squared value can be negative... The dominant model should not differ much from any other in terms of the relation between the values used. I think Paolo and we have hinted several times at possible problems with significant digits and how these may affect results.
E.g.: biomarkers with values exact only up to 10^-3 will not give reliable results in any value directly or indirectly multiplied by or added to them beyond 10^-3. Or perhaps that is a different issue. ________________________________________ From: genabel-devel-bounces at lists.r-forge.r-project.org [genabel-devel-bounces at lists.r-forge.r-project.org] on behalf of L.C. Karssen [lennart at karssen.org] Sent: Monday, May 19, 2014 4:06 PM To: GenABEL Development list Subject: [GenABEL-dev] Best way to 'round' small floats Dear list, While working on adding p-values to the ProbABEL output I ran into the fact that (at least in the checks) we end up with small negative chi^2 values (e.g. -1.14e-13) for some models (e.g. dominant). Of course this shouldn't happen, but my guess is it's a numerical problem caused by subtracting the two likelihoods for the Likelihood Ratio Test. I'd like to add a bit of code like this to mitigate this issue: if (chi2 < 0 && std::abs(chi2) < EPS) { chi2 = 0; } with EPS set to e.g. 1e-9. This won't harm any analysis, since we're only interested in chi^2 values away from zero, but I was wondering if there is a more appropriate way to do this. Thanks, Lennart. -- *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* L.C. Karssen Utrecht The Netherlands lennart at karssen.org http://blog.karssen.org GPG key ID: A88F554A -*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*- -------------- next part -------------- A non-text attachment was scrubbed...
Name: hippokratia-14-23.pdf Type: application/pdf Size: 461655 bytes Desc: hippokratia-14-23.pdf URL: From alvaro.frank at rwth-aachen.de Wed May 21 13:01:45 2014 From: alvaro.frank at rwth-aachen.de (Frank, Alvaro Jesus) Date: Wed, 21 May 2014 11:01:45 +0000 Subject: [GenABEL-dev] compressed dosage files and Big Data issues Message-ID: <244CF001646FF74FB34F372310A332C57B93FB@MBX5.rwth-ad.de> Hi All, It has been brought to my attention that dosage files with imputed data used in regression analysis are usually stored on disk in a compressed manner. For the tool snptest and perhaps others, users seem to just pass the path to the compressed files. Do the other GenABEL tools also decompress the data on the fly before using it in the analysis? It seems to me that the project is meant to handle the "Big Data" problem, but many aspects of input and output data are being ignored. The example dataset that I am talking about requires around 1.4 terabytes in compressed form. The uncompressed form seems to bring this to 10-20 terabytes. If one were to use the computational power of an entire 16-core machine with pigz (parallel gzip) to uncompress the data, 4 hours would be required. Any other tool, whether in Unix/R/C, would take even longer, working sequentially. I am informed that reported times are around 24+ hours of waiting for the total uncompressing of the data when trying to extract a subset of it. I can imagine that great chunks of the runtime of tools that do regression also suffer from having to uncompress on the fly. The total uncompressed data is never kept, only partial subparts of it as temporary files. Drives typically offer 4 TB of storage space, so storing 10-20 TB seems a bit overwhelming if not done or designed correctly. Now consider a regression tool that is supposed to use all the data for a proper GWAS. This tool would have to spend days decompressing the data alone.
If an entire research group is meant to use this data, and each member has to share resources on the same system, then uncompressing on the fly is even less optimal. The only real alternative is to keep the data uncompressed on disk and use the computational resources only to calculate regressions. But then this runs into the problem of how to store it, since drives are small and limited compared to the amount of data. A solution to this aspect of "Big Data" comes from properly designed supercomputing clusters or databases. These systems do not handle the filesystem where files are stored as part of individual drives. They simply use a version of a distributed file system, like HDFS from Apache. The capacity of the entire filesystem can be expanded by simply adding drives to it, normal hard drives for storage and PCIe SSDs for high-speed cache. This is all transparent to the end user, who only sees a unified filesystem. To solve the "Big Data" problem, such aspects of IT infrastructure and systems like HDFS have to be included in the entire workflow process. What is the stance of the GenABEL project with regard to how data is stored and handled? My recommendation would be to at least have a best-practice document, where many other aspects of the workflow are included and discussed, so as to make the usage of the computational tools optimal. It is not feasible to tackle big data with just faster or easier-to-use computational tools, since those tools have to adapt to the data going in and going out. Sorry for the long email. TL;DR: let's encourage uncompressing data and keeping it on disk using distributed file systems, so as to make the computational tools faster and workflows more efficient for end users. http://www-01.ibm.com/software/data/infosphere/hadoop/hdfs/ https://en.wikipedia.org/wiki/Clustered_file_system#Distributed_file_systems -------------- next part -------------- An HTML attachment was scrubbed...
URL: From lennart at karssen.org Tue May 27 14:18:55 2014 From: lennart at karssen.org (L.C. Karssen) Date: Tue, 27 May 2014 14:18:55 +0200 Subject: [GenABEL-dev] [Genabel-commits] r1748 - in pkg/OmicABELnoMM: . src tests In-Reply-To: <20140527120853.AB60C1873C7@r-forge.r-project.org> References: <20140527120853.AB60C1873C7@r-forge.r-project.org> Message-ID: <538482AF.2060507@karssen.org> Hi Alvaro, On 27-05-14 14:08, noreply at r-forge.r-project.org wrote: > Author: afrank > Date: 2014-05-27 14:08:53 +0200 (Tue, 27 May 2014) > New Revision: 1748 > > Modified: > pkg/OmicABELnoMM/Makefile.am > pkg/OmicABELnoMM/configure.ac > pkg/OmicABELnoMM/src/Algorithm.cpp > pkg/OmicABELnoMM/tests/Makefile > pkg/OmicABELnoMM/tests/test.cpp > Log: > Automake integration of tests now runs them using make check. Tests are also compiled along with the normal executable. That sounds good! > > > Modified: pkg/OmicABELnoMM/tests/Makefile > =================================================================== Now that you have a Makefile.am, the Makefile itself can be removed from SVN. Thanks a lot! Lennart. > > To get the complete diff run: > svnlook diff /svnroot/genabel -r 1748 > _______________________________________________ > Genabel-commits mailing list > Genabel-commits at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/genabel-commits > -- *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* L.C. Karssen Utrecht The Netherlands lennart at karssen.org http://blog.karssen.org GPG key ID: A88F554A -*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*- -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 213 bytes Desc: OpenPGP digital signature URL: From alvaro.frank at rwth-aachen.de Tue May 27 14:38:41 2014 From: alvaro.frank at rwth-aachen.de (Frank, Alvaro Jesus) Date: Tue, 27 May 2014 12:38:41 +0000 Subject: [GenABEL-dev] [Genabel-commits] r1748 - in pkg/OmicABELnoMM: . src tests In-Reply-To: <538482AF.2060507@karssen.org> References: <20140527120853.AB60C1873C7@r-forge.r-project.org>, <538482AF.2060507@karssen.org> Message-ID: <244CF001646FF74FB34F372310A332C57B9BCD@MBX5.rwth-ad.de> >> Now that you have a Makefile.am, the Makefile itself can be removed from >> SVN. I will clean this up in my next commit, when adding the missing statistics. I have my version of them already in prototype form, but they are not that similar to those provided in the document you sent long ago. Different sources use different ways to calculate the divisor of the t-statistic, and the degrees of freedom seem to vary too. Users do not help me in this case, because pretty much no user knows what the formula should be; they just care that it displays the resulting p-values. Any thoughts on how to clarify this? ________________________________________ From: genabel-devel-bounces at lists.r-forge.r-project.org [genabel-devel-bounces at lists.r-forge.r-project.org] on behalf of L.C. Karssen [lennart at karssen.org] Sent: Tuesday, May 27, 2014 2:18 PM To: genabel-devel at lists.r-forge.r-project.org Subject: Re: [GenABEL-dev] [Genabel-commits] r1748 - in pkg/OmicABELnoMM: . src tests Hi Alvaro, On 27-05-14 14:08, noreply at r-forge.r-project.org wrote: > Author: afrank > Date: 2014-05-27 14:08:53 +0200 (Tue, 27 May 2014) > New Revision: 1748 > > Modified: > pkg/OmicABELnoMM/Makefile.am > pkg/OmicABELnoMM/configure.ac > pkg/OmicABELnoMM/src/Algorithm.cpp > pkg/OmicABELnoMM/tests/Makefile > pkg/OmicABELnoMM/tests/test.cpp > Log: > Automake integration of tests now runs them using make check.
Tests are also compiled along with the normal executable. That sounds good! > > > Modified: pkg/OmicABELnoMM/tests/Makefile > =================================================================== Now that you have a Makefile.am, the Makefile itself can be removed from SVN. Thanks a lot! Lennart. > > To get the complete diff run: > svnlook diff /svnroot/genabel -r 1748 > _______________________________________________ > Genabel-commits mailing list > Genabel-commits at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/genabel-commits > -- *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* L.C. Karssen Utrecht The Netherlands lennart at karssen.org http://blog.karssen.org GPG key ID: A88F554A -*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*- From alvaro.frank at rwth-aachen.de Tue May 27 14:52:32 2014 From: alvaro.frank at rwth-aachen.de (Frank, Alvaro Jesus) Date: Tue, 27 May 2014 12:52:32 +0000 Subject: [GenABEL-dev] compression of binary data Message-ID: <244CF001646FF74FB34F372310A332C57B9BE1@MBX5.rwth-ad.de> Hi All, regarding compression of data: when ALL the data is in its genotyped form there is a very consistent structure of only 1s and 0s. Since there are three columns per individual representing the observed SNP, 2/3 of the resulting data are zeroes. Has it been considered to offer a variation of compression based on this? Either through sparse matrices, or even by reducing the 3 columns to just 1, using a single digit 1, 2 or 3 to represent the presence of AA, AB or BB; this of course would not work with imputed data. Compressing data using gz or similar is bad practice anyway, with data handling taking HOURS just to uncompress datasets. Sparse matrices already work great with linear equation solvers, and algorithms exist for them. I have already managed to start a cultural change locally here towards uncompressed data.
This requires a lot of infrastructure changes for the cluster used here, but waiting for uncompression is just bad practice when data is used across many institutes with limited computational resources. There seems to be some willingness to consider it. Because of this I don't think pursuing compression of the imputed binary filevector data is worthwhile. Offering some kind of tutorial on how to have a proper, sustainable workflow seems more beneficial. Topics could be: quality control, scalable storage and computational resources, statistical requirements of the data, etc. Problems arise when the workflow is a mess of inconsistencies, and in that case no single isolated tool can help. Just some thoughts. If there is any interest in any of this let me know. -Alvaro -------------- next part -------------- An HTML attachment was scrubbed... URL:
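The 3-columns-to-1 encoding suggested in the last mail can be sketched as follows. This is an illustration only, not existing GenABEL code, and `encode_genotype` is a hypothetical helper; as the mail notes, it applies only to hard 0/1 genotype calls, not to fractional imputed dosages. One byte per individual replaces three doubles, a 24x reduction before any further compression:

```cpp
#include <cstdint>

// Collapse the three 0/1 indicator columns (P(AA), P(AB), P(BB)) of a
// hard genotype call into one byte:
//   1 = AA, 2 = AB, 3 = BB, 0 = missing or not a hard call.
std::uint8_t encode_genotype(double pAA, double pAB, double pBB) {
    if (pAA == 1.0 && pAB == 0.0 && pBB == 0.0) return 1;
    if (pAA == 0.0 && pAB == 1.0 && pBB == 0.0) return 2;
    if (pAA == 0.0 && pAB == 0.0 && pBB == 1.0) return 3;
    return 0;  // fractional (imputed) probabilities cannot be encoded
}
```

Decoding back to the three indicator columns is a trivial lookup, so the transformation is lossless for called genotypes.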