[GenABEL-dev] Population Mean and Variance

Frank, Alvaro Jesus alvaro.frank at rwth-aachen.de
Mon Aug 4 16:16:06 CEST 2014


Hi all,

I have the following dilemma and a possible solution; I hope I can get one or two responses.
Dilemma:
There is an overall population of N individuals.
For a set of SNPs X and traits Y, some X or Y values will be missing for some of the N individuals. This is normal/expected.
For a given pair X, Y, depending on the missing values, the effective population shrinks from N to some n. Two different pairs X, Y won't have the same effective population participating in the calculation of beta, i.e. n1 != n2. This is normal/expected.
After doing the regression, when calculating the t-statistic, the MEAN and VARIANCE of X and Y have to be calculated.
OmicABEL does this once during the loading of the data, since it replaces missing values with the average: every missing becomes a valid population sample, so all analyses share the same n1 = n2 = ... = N. This is normal/expected.
With noMM and the way I handle missing data, n1 != n2 != ... != N. I still wish to compute the averages and variances only once, during load time; I do not wish to recalculate the mean/variance of the sample population for every subset n. That is not only expensive in time (it would have to be redone for each pair X, Y), but also bad for the evaluation of the regression. The regression is evaluated using the t-statistic (the p-value has a 1-1 relationship with it, so I will stick to the t-stat for this discussion). The t-stat requires GOOD estimates of avg(X), avg(Y), var(X), var(Y). The theory of best linear unbiased estimation (BLUE) prefers bigger sample populations for the calculation of the avg/var. But if I only take n1 samples into account instead of N, my estimate will be less accurate, and so will be the evaluation of the regression through the t-stat, which requires those averages and variances.
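
To make that dependence explicit: in the simple linear regression case, the slope t-stat can be written in the standard textbook form (the exact expression used in the code may differ) as

    \hat\beta = \frac{\mathrm{cov}(X,Y)}{\mathrm{var}(X)}, \qquad
    t = \frac{\hat\beta}{\mathrm{SE}(\hat\beta)}, \qquad
    \mathrm{SE}(\hat\beta)^2 = \frac{\mathrm{var}(Y)/\mathrm{var}(X) - \hat\beta^2}{n - 2}

so the means (through cov and var), the variances, and n are its only inputs; any error in the avg/var estimates propagates directly into the t-stat.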
Solution:
I can save a lot of computation by calculating these statistics ONCE, with the bigger population N, and at the same time obtain better estimates (population size N > n). This might sound controversial at first, but it is already what OmicABEL does. The fact that for a specific pair X, Y we have n << N won't invalidate a t-stat that uses N instead of n. If I can better estimate avg(X) using all available X data, then the resulting t-stat evaluation will be better too. This of course holds as long as the user understands that any data not excluded by means of the exclusion list will be considered valid, i.e. part of the population of interest of size N.
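
A minimal sketch of what I mean by computing the statistics once at load time (C++; the names are mine and hypothetical, not actual OmicABEL code): missing values are skipped rather than imputed, and each column is summarized in a single pass.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Per-column summary, computed once at load time over all non-missing
    // entries, then reused for every (X, Y) pair instead of being
    // recomputed for each subset n.
    struct ColumnStats {
        double mean = 0.0;
        double var = 0.0;        // unbiased sample variance
        std::size_t count = 0;   // number of non-missing entries used
    };

    // Single pass over a column; missing values are encoded as NaN and are
    // skipped (not imputed). Welford's algorithm keeps the computation
    // numerically stable for large N.
    ColumnStats summarize(const std::vector<double>& column) {
        ColumnStats s;
        double m2 = 0.0;  // running sum of squared deviations from the mean
        for (double v : column) {
            if (std::isnan(v)) continue;
            ++s.count;
            const double delta = v - s.mean;
            s.mean += delta / static_cast<double>(s.count);
            m2 += delta * (v - s.mean);
        }
        if (s.count > 1) s.var = m2 / static_cast<double>(s.count - 1);
        return s;
    }

summarize() would run once per SNP/trait column during loading; each per-pair regression then only needs these shared means/variances plus its own effective n.
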
For example, take a dataset containing men and women where a trait Y has to be correlated with age, but only the women are of interest. The men then have to be removed via the exclusion list; the user must not set their trait Y to NaN to simulate exclusion. Since the population of interest is the women, even if there are a few missings in Y among them, avg(AGE_WOMEN) will be calculated from all available data of the N women, and not from the subset n present for the particular pair X, Y. The regression of the slope beta still handles the missing data in the standard way, but during the evaluation the t-stat will have a better estimate of avg(Y) at its disposal. The men had to be excluded via the exclusion list, not by forcing missing data.
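
As a sketch of that last point (reusing the hypothetical summarize() from above): exclusion happens once, before any statistics are computed, so N really is the size of the population of interest.

    #include <cstddef>
    #include <vector>

    // Rows on the exclusion list are dropped from the data set entirely,
    // instead of being turned into missing values.
    std::vector<double> apply_exclusions(const std::vector<double>& column,
                                         const std::vector<bool>& excluded) {
        std::vector<double> kept;
        for (std::size_t i = 0; i < column.size(); ++i)
            if (!excluded[i]) kept.push_back(column[i]);
        return kept;
    }

    // Age of the women only: the men are removed via the exclusion list, so
    // N is the number of women; the few NaNs among the women are skipped by
    // the load-time summary:
    //   ColumnStats age_stats = summarize(apply_exclusions(age, is_man));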

I need to underline how important it is that the user knows that any data in the analysis will be considered part of the population of interest; only then does the assumption hold that an avg computed over N is better than one computed over n. Note also that this is crucial to avoid having to recompute BAD estimates of avg and var for every pair X, Y.

Is this reasonable? Are there any theoretical or practical objections?

If there are any questions, let me know!

Alvaro Frank


