[GenABEL-dev] The GenABEL project fundamentals: post #1

Fri May 13 00:40:54 CEST 2011

As promised, here is the first post on the GenABEL project
fundamentals. This first post describes my general view on
(statistical genomics) methodology development.

As suggested by Lennart, after discussion on this mailing list, this
post is likely to become part of the project's documentation to be
published on www.genabel.org; if there is little discussion, this will
be noted in a footnote.

Yurii

---------------------------

The GenABEL project: methodology to address real world problems

The GenABEL project is dedicated to the development of statistical
genomics methodology with a large impact on the real world. From this
perspective, methodology development includes the statistical
methodology itself, its implementation in usable software, and the
application of this software to real data in order to generate new
knowledge.

Thus, we see methodology development as a three-stage process:
mathematical formulation of the method, formulation and implementation
of an algorithm in software, and, finally, application of the
methodology to real data. In practice, most of the time it is the data
that call for a new method, so application comes before the
mathematical formulation. The presence of all three stages, and
feedback between them, is a key aspect of our approach to statistical
genomics methodology development.

Why are all three stages critical?

For example, you may develop something that looks like a nice and
promising piece of math, but when you try to implement it, you find
out that you did not completely understand the problem, or that you
were operating under some implicit assumptions that are unlikely to
hold. You may also find out that the computational complexity is too
high, and that you need to change the method in order for it to be
practically applicable.

Next, it is important to apply your methodology to simulated data
(for which you know the answer): nonsense results will provide
feedback on implementation (is that a bug?) or even on methodology
(ah, formula 15-3 was wrong!).
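
To make the simulated-data check concrete, here is a minimal Python
sketch. It is an illustration only and not part of any GenABEL
package: the function names (simulate, estimate_slope), the effect
size, the allele frequency and the sample size are all assumptions
made up for this example, and a simple least-squares slope stands in
for "your method". The idea is just to generate data with a known
effect and see whether the estimate comes back close to it.

```python
import random


def simulate(n, beta, maf=0.3, seed=1):
    """Simulate additive genotypes (0/1/2) and a phenotype with a known effect.

    All parameters are hypothetical choices for this illustration.
    """
    rng = random.Random(seed)
    # Each genotype is the number of minor alleles among two independent draws.
    genotypes = [(rng.random() < maf) + (rng.random() < maf) for _ in range(n)]
    # Phenotype = true effect * genotype + standard normal noise.
    phenotypes = [beta * g + rng.gauss(0.0, 1.0) for g in genotypes]
    return genotypes, phenotypes


def estimate_slope(x, y):
    """Ordinary least-squares slope of y on x (stand-in for the method under test)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


TRUE_BETA = 0.5  # the "answer" we know because we simulated the data
g, y = simulate(n=5000, beta=TRUE_BETA)
beta_hat = estimate_slope(g, y)
print(f"true effect = {TRUE_BETA}, estimate = {beta_hat:.3f}")
```

If the estimate lands far from the simulated effect, that points
either to a bug in the implementation or to a problem in the method
itself, which is exactly the feedback described above.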

It is even more important to apply your methodology to REAL data as
early as possible: it will provide feedback on implementation (is it
feasible to run my analysis in reality?) and on methodology. The
situation where a method works on simulated data but fails miserably
on real data is not that uncommon! Also, trying to use real data will
tell you which data formats people are using, and will eventually make
your implementation really usable. There are examples of great methods
implemented in software that requires such a specific data format
that it becomes almost impossible to use them.

To conclude: methodology development should be viewed as an integral
process that includes development of the methodology itself,
development of software, and application of the software to real data.

While such an integral approach is a tall order for an individual
researcher or even a (smaller) group of researchers, it is feasible if
an open source approach (commons-based production through open
exchange of ideas and collaboration) is applied throughout. I will
elaborate on this point in further posts.

Disclaimer. This is my personal position, which is open to
discussion. In this post, I am not speaking on behalf of the GenABEL
project community, but rather seeding the discussion that will
eventually set the project's standards.

I would like to thank Dr. Lennart Karssen for valuable and continuous
discussion through which many of my views on the GenABEL project have
evolved.