[Genabel-commits] r1869 - pkg/OmicABELnoMM/doc

Tue Oct 28 16:59:18 CET 2014

Author: afrank
Date: 2014-10-28 16:59:18 +0100 (Tue, 28 Oct 2014)
New Revision: 1869

Modified:
   pkg/OmicABELnoMM/doc/UserGuide.tex
Log:
Extended .tex external documentation. Needs a revision for the  proper installation procedure.

Modified: pkg/OmicABELnoMM/doc/UserGuide.tex
===================================================================

--- pkg/OmicABELnoMM/doc/UserGuide.tex	2014-10-28 15:55:48 UTC (rev 1868)
+++ pkg/OmicABELnoMM/doc/UserGuide.tex	2014-10-28 15:59:18 UTC (rev 1869)
@@ -18,7 +18,7 @@
 \begin{document}
 
 \title{OmicabelNoMM User's Guide}
-\author{Alvaro Frank}
+\author{Alvaro Frank, NAME,NAME}
 \date{October 2014}
 \maketitle
 
@@ -97,76 +97,407 @@
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \chapter{Setting OmicabelNoMM up}
 
-\section{Your Machine}
+\section{Setting up a project}
+\begin{lstlisting}[style=BASH,escapechar=\%]
+#projects location
+mkdir GWAS_PROJECT
 
-\subsection{Clusters vs personal Computers}
+cd GWAS_PROJECT
+%
+\end{lstlisting}
 
-\section{Source Files}
+\section{Library and program Requirements}
 
+\subsection{autoconf, autotools}
+
+Make sure you have autoconf/autotools installed:
 \begin{lstlisting}[style=BASH,escapechar=\%]
 
-user@ubuntu:~$ svn checkout svn://svn.r-forge.r-project.org/svnroot/genabel/pkg/OmicABELnoMM
-Checked out revision 1838.
-user@ubuntu:~$ cd OmicABELnoMM/
-user@ubuntu:~/OmicABELnoMM$
-$%
+sudo apt-get install autoconf
+autoreconf -fi
+autoconf
+%
 \end{lstlisting}
 
-\section{Compilers}
+\subsection{Compilers}
 
+You will need a recent gcc compiler (g++ 4.8 or higher) to run OmicABELnoMM on a single multi-core computer.
+
 \begin{lstlisting}[style=BASH,escapechar=\%]
 
-TODO:Install Compilers cmds
-$%
+sudo apt-get install gcc-4.9
+%
 \end{lstlisting}
 
-\section{3rd Party Libraries}
+For compute clusters you will also need MPI support.
 
 \begin{lstlisting}[style=BASH,escapechar=\%]
+sudo apt-get install openmpi-bin
+sudo apt-get install openmpi-common
+sudo apt-get install libopenmpi
+sudo apt-get install libopenmpi-dbg 
+sudo apt-get install libopenmpi-dev
+\end{lstlisting}
 
-TODO:Install BOOST and BLAS LIBRARIES cmds
+\subsection{BLAS and LAPACK}
+
+You will need a linear algebra library for high-performance matrix computations.
+The standard choice is OpenBLAS together with LAPACK.
+
+\begin{lstlisting}[style=BASH,escapechar=\%]
+
+sudo apt-get install libopenblas-dev
+sudo apt-get install libopenblas-base
+sudo apt-get install liblapack3gf
+sudo apt-get install liblapack-doc
+sudo apt-get install liblapack-dev
+sudo apt-get install liblapacke
+sudo apt-get install liblapacke-dev
+%
+\end{lstlisting}
+
+Alternatively, you can download the OpenBLAS source code and compile it on your own machine, so that the build is tuned to your hardware. Distribution packages are sometimes built without USE\_OPENMP=1.
+Remember to replace path\_to\_ with your own path to the specified folder.
+
+\begin{lstlisting}[style=BASH,escapechar=\%]
+
+git clone https://github.com/xianyi/OpenBLAS
+
+cd OpenBLAS
+
+# make sure you use g++ 4.8 or higher
+make all HOSTCC=g++ FC=gfortran USE_OPENMP=1
+
+#install the libraries relative to OmicABELnoMM
+make install PREFIX="path_to_/OmicABELnoMM/libs/"
+%
+\end{lstlisting}
+(Status: support currently broken.)
+You can use AMD's ACML (AMD's BLAS implementation) by downloading it from:\\
+http://developer.amd.com/tools-and-sdks/cpu-development/amd-core-math-library-acml/acml-downloads-resources/ \\
+and copying the supplied binary libraries to ``/OmicABELnoMM/libs/''. If both libraries are present (OpenBLAS and ACML), the system will use ACML.
+
+Tell OmicABELnoMM where BLAS is located with:
+
+
+\begin{lstlisting}[style=BASH,escapechar=\%]
+
+export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:path_to_/OmicABELnoMM/libs/lib
+autoreconf -fi
+
+./configure
 $%
 \end{lstlisting}
 
+\section{Source Files}
+
+\begin{lstlisting}[style=BASH,escapechar=\%]
+
+#get the source files
+svn checkout svn://svn.r-forge.r-project.org/svnroot/genabel/pkg/OmicABELnoMM/
+
+cd OmicABELnoMM
+%
+\end{lstlisting}
+
 \section{Compiling}
 
 For compiling the final executable binary use:
 \begin{lstlisting}[style=BASH,escapechar=\%]
 
-user@ubuntu:~/OmicABELnoMM$ make
-$%
+#in /OmicABELnoMM/
+make
+%
 \end{lstlisting}
 
 For compiling the test binary use:
 \begin{lstlisting}[style=BASH,escapechar=\%]
 
-user@ubuntu:~/OmicABELnoMM$ make check
-$%
+#in /OmicABELnoMM/
+make check
+%
 \end{lstlisting}
 
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \chapter{Preparing Source Data}
 
 \section{Overview}
+
+OmicABELnoMM reads its source data in the DatABEL format. DatABEL files take less storage space and allow the computations to run faster.
+
+The original source files can be in any format, as long as they can be loaded into R as a table (matrix). Once in matrix form, they can be converted to the DatABEL format for use by OmicABELnoMM.
+
+
 \section{Databel}
+For documentation, start R and run library(DatABEL); help("DatABEL-package").\\
+More information: http://www.genabel.org/packages/DatABEL\\
+To begin, start R and load DatABEL:
+
+\begin{lstlisting}[style=BASH,escapechar=\%]
+
+library(DatABEL)
+%
+\end{lstlisting}
+
 \section{Covariates}
+
+This example shows how to create artificial covariates:
+
+\begin{lstlisting}[style=BASH,escapechar=\%]
+
+#START_FAKE_DATA
+n = 2000		 # number of individuals
+l = 3   		 # number of covariates (intercept column not counted)
+r = 2 			 # how many columns per SNP
+m = r*100000		 # number of snps
+t = 10000  		 # number of traits
+set.seed(1001)
+runif(3)
+XL <- matrix(rnorm((l+1)*n),ncol=(l+1)) # first column should be ones (intercept)
+for(i in 1:(n*(l+1))){ if(sample(1:100,1) > 95){XL[i]=0/0} } # set about 5% of the entries to NaN (missing)
+#END_FAKE_DATA
+
+# From here on, the steps are the same for real data stored in the matrix XL.
+# How to get your data into XL depends on your original files and how they were stored.
+
+# The first column of the covariates must contain only 1's: it is the intercept.
+# Make sure you add this column of ones and that you have room for it
+# without losing your own data.
+XL[,1] <- 1
+
+#add your own idnames!
+colnames(XL) <- c("intercept", paste("cov",1:l,sep=""))
+rownames(XL) <- paste("ind",1:n,sep="")
+
+#transform to databel (store it)
+XL_db <- matrix2databel(XL,filename="XL",type="FLOAT")
+
+#XL[1:n,1:(l+1)]
+#XL
+%
+\end{lstlisting}
+
+\section{Independent Variables: SNPs, CpG Sites, Measurements Used to Explain Other Measurements}
+\begin{lstlisting}[style=BASH,escapechar=\%]
+#START_FAKE_DATA
+n = 2000		 # number of individuals
+l = 3   		 # number of covariates (intercept column not counted)
+r = 2 			 # how many columns per SNP
+m = r*100000		 # number of SNP columns (r columns per SNP)
+t = 10000  		 # number of traits
+#r=2
+XR <- matrix(rnorm(m*n),ncol=m)
+
+# Assumes the phenotype matrix Y is still in memory; this makes XR linearly dependent on Y
+for(i in 1 + r*(0:((m-2)/r)) )
+{ 
+	#print(i)
+	yIdx=ceiling(i/r)
+	#print(i)
+	#print(yIdx)
+	for(j in 1:n)
+	{ 
+		XR[j,i]=Y[j,yIdx]
+		for(k in 1:l)
+		{
+			XR[j,i]=XR[j,i]-XL[j,k]*0.01
+		}
+		for(k in 1:(r-1))
+		{
+			XR[j,i]=XR[j,i]-XR[j,i+k]*0.01
+		}
+		#XR[j,i]=XR[j,i]/2.8888
+		#XR[j,i] = XR[j,i]*runif(1, 1.0-var, 1.0)
+		
+	}
+}
+
+#add missing data
+for(i in 1:(n*m)){ if(sample(1:100,1) > 90) XR[i]=0/0}
+#END_FAKE_DATA
+
+# From here on, the steps are the same for real data stored in the matrix XR.
+# How to get your data into XR depends on your original files and how they were stored.
+
+# Note: the intercept column of ones belongs in the covariate matrix XL,
+# not in XR; XR holds only the independent variables
+# (r columns per SNP).
+
+#add your own idnames!
+colnames(XR) <- paste("miss",1:m,sep="")
+for(i in 1:(m/r))
+{
+	for(j in 1:r) 
+	{
+		colnames(XR)[(i-1)*r+(j)] = paste0("snp",paste(i,j,sep="_") )
+	}
+}
+
+#add your own idnames!
+rownames(XR) <- paste("ind",1:n,sep="")
+
+#transform to databel (store it)
+XR_db <- matrix2databel(XR,filename="XR",type="FLOAT")
+%
+\end{lstlisting}
+
+\section{Dependent Variable: Phenotypes, Measurements to Be Explained}
 
+\begin{lstlisting}[style=BASH,escapechar=\%]
+
+
+%
+\end{lstlisting}
+
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \chapter{Running Analysis}
 
+\section{Getting help from the program}
+
+\begin{lstlisting}[style=BASH,escapechar=\%]
+
+./omicabelnomm -h
+usage: omicabelnomm -c <path/fname> --geno <path/fname> -p <path/fname> -o <path/fname> 
+                -x <path/fname> -n <#SNPcols> -t <#CPUs>
+                         -d <0.0~1.0> -r <-10.0~1.0> -b -s <0.0~1.0>  -e <-10.0~1.0> -i -f
+omicabelnomm Version 0.96b 
+	Required: 
+	-p --phe    	 <path/filename> to the inputs containing phenotypes. 
+	-g --geno   	 <path/filename> to the inputs containing genotypes. 
+	-c --cov    	 <path/filename> to the inputs containing covariates. 
+	-o --out    	 <path/filename> to store the output to (used for all .txt and .ibin & .dbin). 
+
+Optional: 
+	-n --ngpred 	 <#SNPcols> Number of columns in the geno file that represent a single SNP. 
+	-t --thr    	 <#CPUs> Number of computing threads to use to speed computations.
+			 Recommended is 4-8 per node (see MPI). 
+	-x --excl   	 <path/filename> file containing list of individuals to exclude 
+			 from input files, (see example file). 
+	-d --pdisp  	 <0.0~1.0> Value to use as maximum threshold for significance.
+			 Results with P-values UNDER this threshold will be 
+			 displayed in the putput .txt file. 
+	-r --rdisp  	 <-10.0~1.0> Value to use as minimum threshold for R2. 
+			 Results with R2-values ABOVE this threshold will be displayed
+			 in the putput .txt file. 
+	-b --stobin 	 Flag that forces to ALSO store results in a
+			 smaller binary format (*.ibin & *.dbin). 
+	-s --psto   	 <0.0~1.0>  Results with P-values UNDER this threshold will be 
+			 displayed in the putput binary files. 
+	-e --rsto   	 <-10.0~1.0> Results with R2-values ABOVE this threshold will be 
+			 stored in the putput binary files. 
+	-i --fdcov  	 Flag that forces to include covariates (when its genotype is significant) 
+			 as part of the results stored 
+	-f --fdgen  	 Flag that forces to consider all included results 
+			 (causes the analisis to ignores ALL threshold values). 
+	-j --additive  	 Flag that runs the analisis with an Additive Model with 
+			 (2*AA,1*AB,0*BB) effects. 
+	-k --dominant  	 Flag that runs the analisis with an Dominant Model with 
+			 (1*AA,1*AB,0*BB) effects. 
+	-l --recessive 	 Flag that runs the analisis with an Recessive Model with 
+			 (1*AA,0*AB,0*BB) effects. 
+	-z --mylinear 	 <path/filename> to read Factors 'f_i' for a Custom Linear Model with
+			 f1*X1,f2*X2,f3*X3...fn*X_ngpred as effects,
+			 each column of each independent variable will be multiplied with
+			 the specified factors. 
+			 Formula: y~alpha*cov + beta_1*f1*X1 + beta_2*f2*X2 +...+ beta_n*fn*Xn, 
+			 (see example files!). 
+%
+\end{lstlisting}
+\pagebreak
+\begin{lstlisting}[style=BASH,escapechar=\%]
+
+	-y --myaddit  	 <path/filename> to read Factors 'f_i' for a Custom Additive Model with
+			 (f1*X1,f2*X2,f3*X3...fn*X_ngpred) as effects,
+			 each column of each independent variable will be multiplied with the 
+			 specified factors and then added together. 
+			 Formula: y~alpha*cov + beta*(f1*X1 + f2*X2 +...+ fn*Xn), (see example files!).
+	-v --simpleinter <path/filename> to read the interactions from;
+			 for single analysis using multile interactions. 
+	-w --multinter 	 <path/filename> to read the interactions from;
+			 for multiple analysis using single interaction per analysis. 
+	-u --keepinter 	 Flag that sets if the interaction analysis chose is to too keep the dependent 
+			 variable X. If set, Formula: y~alpha*cov + beta_1*INT*X + beta_2*X, 
+			 (see example files!).  Default not set, 
+			 Formula: y~alpha*cov + beta_1*INT*X, (see example files!). 
+
+			 Support for MPI is available. 
+			 Simply use mpirun -np <#nodes> omicabelnomm <params> 
+			 on an Open-MPI enabled computer/cluster.
+			 Recommended is to use MPI when dealing with problems with over 2000 genotypes,
+			 at a rate of 1 node per 2000 genotypes.
+	
+
+%
+\end{lstlisting}
+
 \section{WARNING: Theoretical Caveats}
 
 \section{Simple Linear Regression}
 
-\section{Cluster usage for Simple Linear Regression}
+A simple linear regression analysis with 4 threads can be run as follows (the long and short option forms are equivalent).
+By default, one column of XR is assumed per SNP (-n 1), so each column gets its own regression coefficient.
+\begin{lstlisting}[style=BASH,escapechar=\%]
 
-\section{Covariates in Linear Regression}
+./omicabelnomm --cov examples/XL --geno examples/XR --phe examples/Y --out examples/B --thr 4
 
+./omicabelnomm -c examples/XL -g examples/XR -p examples/Y -o examples/B -t 4
+%
+\end{lstlisting}
+
+When a SNP spans more than one column, specify the number of columns with -n (here -n 3); each column of XR still gets its own regression coefficient, e.g.:
+
+\begin{lstlisting}[style=BASH,escapechar=\%]
+
+./omicabelnomm -c examples/XL -g examples/XR -p examples/Y -o examples/B -t 4 -n 3
+%
+\end{lstlisting}
+
+For analyses involving SNPs and dosage models, the following common genetic models are available:
+
+\begin{lstlisting}[style=BASH,escapechar=\%]
+
+./omicabelnomm -c examples/XL -g examples/XR -p examples/Y -o examples/B -t 4 --additive
+
+./omicabelnomm -c examples/XL -g examples/XR -p examples/Y -o examples/B -t 4 --recessive
+
+./omicabelnomm -c examples/XL -g examples/XR -p examples/Y -o examples/B -t 4 --dominant
+%
+\end{lstlisting}
+
+\section{Custom Dosage Analysis}
+
+When using custom dosages, you need to specify how many columns per SNP you are using, as well as the file from which the dosage factors will be read. The file has to contain one factor per column of the SNP.
+With --myaddit, all columns of a SNP are multiplied by their respective factors and then added together. The resulting vector (one per SNP) obtains a single, collective regression coefficient.\\
+With --mylinear, each of the -n columns obtains its own regression coefficient after being multiplied by its respective dosage factor.
+
+\begin{lstlisting}[style=BASH,escapechar=\%]
+
+./omicabelnomm -c examples/XL -g examples/XR -p examples/Y -o examples/B -t 4 \
+						-n 2 --myaddit examples/dosages_2.txt
+
+./omicabelnomm -c examples/XL -g examples/XR -p examples/Y -o examples/B -t 4 \
+						-n 1 --mylinear examples/dosages_1.txt
+%
+\end{lstlisting}
+
+
+\section{MPI and Cluster usage for Simple Linear Regression}
+
+Compute clusters offer multiple compute nodes (computers), each with multithreading capabilities. If OmicABELnoMM was compiled with MPI support, you can use mpirun to run on several nodes at once. For example, with 10 nodes using 8 threads each:
+
+\begin{lstlisting}[style=BASH,escapechar=\%]
+
+mpirun -np 10 ./omicabelnomm -c examples/XL -g examples/XR -p examples/Y -o examples/B -t 8
+%
+\end{lstlisting}
+
+In this case each process (one per node, as specified with -np, for a total of 10 in the example) writes its own output file, named B\_mpi1\_dis.txt through B\_mpi10\_dis.txt.
+
+
 \section{Simple interactions of non linear terms, Enviromental Effects}
 
 
+
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \chapter{FAQ}