[GenABEL-dev] Population Mean and Variance
L.C. Karssen
lennart at karssen.org
Wed Aug 6 17:14:56 CEST 2014
Hi Alvaro,
On 04-08-14 16:16, Frank, Alvaro Jesus wrote:
> Hi all,
>
> I have the following dilemma and possible solution, hope I can get 1 or
> 2 responses.
> *Dilemma*:
>
(some text removed)
> With noMM and the way I handle missing data, n1 != n2 != ni != N. I
> still wish to compute the averages and variances only once, at load
> time. I do not wish to calculate the mean/variance of the sample
> population once for every subset n. That is not only expensive (it
> would have to be recalculated for each pair of X,Y), but also bad for
> the evaluation of the regression. The regression is evaluated using
> the t-statistic (the p-value has a 1-1 relationship with it, so I will
> stick to the t-stat for this discussion). The t-stat requires GOOD
> estimates of avg(X), avg(Y), var(X), var(Y). The theory of best linear
> unbiased estimation (BLUE) prefers bigger samples for the calculation
> of the avg/var. But if I only take n1 into account instead of N, my
> estimate will be less accurate, and so will be the evaluation of the
> regression through the t-stat, which requires the avg and variance.
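(For reference, and to make sure we mean the same quantities: in the
simple-regression case the t-stat can be written entirely in terms of
these moments. Taking var and cov as the 1/n "population" moments, so
that cov(X,Y) = avg(XY) - avg(X)*avg(Y), the textbook formulation is

    beta       = cov(X, Y) / var(X)
    SE(beta)^2 = (var(Y) / var(X) - beta^2) / (n - 2)
    t          = beta / SE(beta)

-- not necessarily OmicABEL's internal formulation, but it shows where
the avg and var estimates enter.)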
My first thought was that usually n1 is not << N (i.e. there is not
much missing phenotype data). However, with the increasing number of
omics measured, usually not all samples are measured on every platform.
Let's assume that N is the number of samples for which genetic
(imputed) data is present. I'd say that this is always the largest
number of samples; newer omics data (Y) may only be present for a
subset n_i of N. I can easily imagine that only 1/3 of N has another
omics measurement (n_i/N = 0.33).
This is (almost) what you mean, right? I say almost, because you allow
for missing X as well, but I don't think there will be missing X after
imputation. Let's say that a study has basic phenotype data (e.g.
height, BMI) on 8000 people. If they have (imputed) genetic data on
7500 people, then that is the number we care about, right? Genomics is
our X data, so missing X should not occur. Of course, a missing Y for a
given X is very much possible.
In principle you know n_i and N at data load time. Maybe this is the
place to add a warning if n_i/N dives below a certain threshold?
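Something along these lines, for example (a Python/NumPy sketch just to
illustrate the idea; the function name and the 50% default are made up,
this is not actual OmicABEL code):

    import numpy as np

    def warn_low_overlap(y, n_total, threshold=0.5):
        """Warn at load time if trait y is only measured on a small
        fraction of the n_total genotyped samples."""
        n_i = int(np.count_nonzero(~np.isnan(y)))
        if n_i < threshold * n_total:
            print("WARNING: trait measured on %d of %d samples (%.0f%%); "
                  "avg/var estimates may be poor."
                  % (n_i, n_total, 100.0 * n_i / n_total))
        return n_i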
> *Solution*:
> I can save a lot of computation by calculating them ONCE with the
> bigger population N, which also gives better estimates (population
> size N > n). This might sound controversial at first, but it is
> already being done by omicabel. The fact that n << N for a specific
> pair of X,Y won't invalidate a t-stat that uses N instead of n. If I
> can better estimate avg(X) using all available X data, then the
> resulting evaluation of the t-stat will be better. This of course
> holds as long as the user understands that any data not excluded by
> means of the exclusion list will be considered valid, as part of the
> population of interest with size N.
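If I read this correctly, the proposal in NumPy-style pseudocode would
be something like this (my sketch, not the actual implementation;
pheno is a toy N x m trait matrix with NaN marking missing values):

    import numpy as np

    # Toy N x m trait matrix; NaN marks a missing measurement.
    pheno = np.array([[1.70, 22.0],
                      [1.80, np.nan],
                      [1.65, 25.0]])

    # Computed ONCE at load time, over all non-excluded samples:
    avg = np.nanmean(pheno, axis=0)  # per-trait mean, NaNs ignored
    var = np.nanvar(pheno, axis=0)   # per-trait variance, NaNs ignored

    # Every (X, Y) pair later reuses avg and var instead of recomputing
    # them on only the n_i samples where both X and Y are present.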
> For example, take a dataset where men and women are present and a
> trait Y has to be correlated with their age. If only women are of
> interest for the correlation, the men have to be excluded via the
> exclusion list; the user should not set their trait Y to NaN to
> simulate exclusion.
This is important information that you should put in the manual. If it
is in the manual then we can assume people read it (if they don't, I
don't feel responsible for their negligence).
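To make the pitfall concrete for such a manual section, a toy NumPy
example (made-up numbers):

    import numpy as np

    age = np.array([30.0, 40.0, 50.0, 60.0])  # X; rows 0-1 are men
    y   = np.array([1.0, 2.0, 3.0, 4.0])      # trait Y

    # Wrong: "excluding" men by setting only their Y to NaN; the age
    # column is untouched, so the men still enter avg(X).
    y_bad = y.copy()
    y_bad[:2] = np.nan
    print(np.nanmean(age))       # 45.0 -- polluted by the men's ages

    # Right: the exclusion list drops those rows entirely at load time.
    women = np.array([False, False, True, True])
    print(np.mean(age[women]))   # 55.0 -- avg(X) of the women only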
> The
> population of interest is women in this example, so even if there are
> a few missing values in Y for the women, avg(AGE_WOMEN) will be
> calculated from all available data N, and not from the subset n for
> which both X and Y of the relation are present. The slope beta will
> still be estimated in the standard missing-data way, but the
> evaluation using the t-stat will have a better estimate of avg(Y) at
> its disposal. Men have to be excluded using the exclusion list, not
> by forcing missing data.
>
> I need to underline how important it is that the user knows that any
> data in the analysis will be considered part of the population of
> interest, so that the assumption holds that an avg based on N is
> better than one based on n.
Again, this is something to put in the documentation. Maybe there should
be a chapter/section containing a list of these important requirements
(so they are not (only) buried in the main text).
> Also, note
> that this is crucial: it avoids recomputing (worse) estimates of avg
> and var for every pair of X and Y.
>
> Is this reasonable? Are there any theoretical or practical
> objections?
>
> If there are any questions, let me know!
>
> Alvaro Frank
Best,
Lennart.
--
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
L.C. Karssen
Utrecht
The Netherlands
lennart at karssen.org
http://blog.karssen.org
GPG key ID: A88F554A
-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-