[GenABEL-dev] [Genabel-commits] r929 - pkg/ProbABEL/src

L.C. Karssen l.karssen at erasmusmc.nl
Thu Jul 12 17:13:28 CEST 2012


Hi Yurii,

Thanks for the nice comments. And a good question indeed. As you say,
this can be done in DatABEL too.
The context is the following. At present we have a set of scripts (part
perl, part R/DatABEL) that extract this kind of information (together
with data from the imputation .info file). These scripts were written
rather ad hoc and needed constant maintenance, so I'm in the process of
rewriting them in a maintainable way.

I choose to do it in this way for a couple of reasons:
- I think more people will benefit from such an extraction
script/program, so I wanted to distribute it as part of ProbABEL. Since
ProbABEL has no (direct) dependence on R I didn't want to use DatABEL
for it.

- Learning how to deal with filevector data at the C/C++ level will make
it easier to incorporate it into other programs I/we may write. For
example, some time ago I've toyed with the idea of letting the C-code of
the program we used for our SNP-SNP interaction work talk directly with
.fv? files. At that time I decided against it because I didn't have the
time to dive into filevector the way I did now.
This may also help me in figuring out why pacoxph doesn't play well with
.fv? data.

- I need to keep my C/C++ skills somewhat up to date :-)


At first I wondered whether I should put this program in ProbABEL or as
part of the fvfutil directory, since that directory already contains
binaries that convert between filevector and plain text data. Since my
program is specifically designed for this extraction of SNP data (in
terms of its output messages and input parameters, I decided to put it
in ProbABEL.

My next plan is to add a wrapper shell script that reads the probABEL
config file for the locations of the files with imputed data. Then it
should use grep to extract info from the .info files (and find out in
which .fvf file the SNP is) and send that file name to the extract-snp
C-code.
Using a shell script will make the ProbABEL package less portable to
Windows, but implementing all of that functionality in C costs too much
time. I also don't think many people use probABEL on Windows anyway.

Any suggestions w.r.t. this plan are welcome, of course.


Best,

Lennart.


On 07/12/2012 04:26 PM, Yury Aulchenko wrote:
> Hi Lennart,
> 
> nice little utility! good style, very readable code.
> 
> I am curious why you got into trouble to program this in C++ (e.g.
> this can be done using DatABEL functionality). Suppose I miss the
> context :)
> 
> best wishes, Yurii 
> ------------------------------------------------------- Yurii S.
> Aulchenko, PhD Director, YuriiA consulting Rotterdam, The
> Netherlands yurii [dot] aulchenko [at] gmail [dot] com
> 
> On Jul 12, 2012, at 4:16 PM, noreply at r-forge.r-project.org wrote:
> 
>> Author: lckarssen Date: 2012-07-12 14:16:00 +0200 (Thu, 12 Jul
>> 2012) New Revision: 929
>> 
>> Added: pkg/ProbABEL/src/extract-snp.cpp Modified: 
>> pkg/ProbABEL/src/Makefile.am Log: ProbABEL: added the extract-snp
>> binary that extracts SNP dosages from filevector files (from the
>> command line). It can do this for all IDs or just one.
>> 
>> Of course, this program is not necessarily looking at SNPs and IDs.
>> In filevector parlance SNPs are Variables (rows) and IDs are
>> Observations (rows).
>> 
>> At the moment it can only extract one SNP at a time (given as
>> command line option). This may change in the future. For example,
>> we may want the program to read SNP names from stdin too. The -d
>> option for debug info prints out all SNP names and IDs in the
>> filevector file, which may be too much for your screen.
>> 
>> 
>> Modified: pkg/ProbABEL/src/Makefile.am 
>> ===================================================================
>>
>> 
--- pkg/ProbABEL/src/Makefile.am	2012-07-06 05:14:23 UTC (rev 928)
>> +++ pkg/ProbABEL/src/Makefile.am	2012-07-12 12:16:00 UTC (rev 929) 
>> @@ -24,7 +24,7 @@ include/R_ext/Print.h include/R_ext/Random.h
>> include/R_ext/Utils.h	\ include/R_ext/RS.h
>> 
>> -bin_PROGRAMS = palinear palogist +bin_PROGRAMS = palinear palogist
>> extract-snp if BUILD_pacoxph ## Extra files for pacoxph COXBASE =
>> coxfit2 chinv2 cholesky2 chsolve2 dmatrix @@ -47,6 +47,8 @@ 
>> pacoxph_CPPFLAGS = $(AM_CPPFLAGS) pacoxph_CFLAGS = -DCOXPH -I
>> $(top_srcdir)/src/include $(AM_CFLAGS)
>> 
>> +extract_snp_SOURCES = extract-snp.cpp $(FVSRC) $(FVHEADERS) + ##
>> Install these scripts in the bin directory as well: 
>> dist_bin_SCRIPTS = probabel.pl extIDS.pl
>> 
>> 
>> Added: pkg/ProbABEL/src/extract-snp.cpp 
>> ===================================================================
>>
>> 
--- pkg/ProbABEL/src/extract-snp.cpp	                        (rev 0)
>> +++ pkg/ProbABEL/src/extract-snp.cpp	2012-07-12 12:16:00 UTC (rev
>> 929) @@ -0,0 +1,209 @@ 
>> +/*******************************************************************************
>>
>> 
+ *
>> + * This program extracts a given row (dosages for all individuals
>> for a given + * SNP) from filevector files. + * + * (C) 2012 L.C.
>> Karssen + * + * This program is free software: you can redistribute
>> it and/or modify + * it under the terms of the GNU General Public
>> License as published by + * the Free Software Foundation, either
>> version 3 of the License, or + * (at your option) any later
>> version. + * + * This program is distributed in the hope that it
>> will be useful, + * but WITHOUT ANY WARRANTY; without even the
>> implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR
>> PURPOSE.  See the + * GNU General Public License for more details. 
>> + * + * You should have received a copy of the GNU General Public
>> License + * along with this program.  If not, see
>> <http://www.gnu.org/licenses/>. + * +
>> ******************************************************************************/
>>
>> 
+
>> +#include <stdio.h> +#include <string> +#include <iostream> 
>> +#include <fstream> +#include <map> +#include <getopt.h> + +using
>> namespace std; + +#include "fvlib/Logger.h" +#include
>> "fvlib/FileVector.h" +#include "fvlib/CastUtils.h" + + +void
>> info(char *program_name) +{ +    cout << program_name +	 << "
>> extracts the dosage for a SNP for one or more individuals from a" +
>> << "file in filevector format (.fvi/.fvd)." << endl; +    cout <<
>> endl; +    cout << "Usage: " << program_name << " --file <fv file>
>> --snp <snpname>" +	 << endl; +    cout << "   or: " << program_name
>> << " -f <fv file> -s <snpname>" << endl; +    cout << endl; +
>> cout << "Additional options:" << endl; +    cout << "\t--id (-i)
>> <idname>: only print the dosage for the given SNP and ID" +	 <<
>> endl; +    cout << "\t--debug (-d): show debugging output" <<
>> endl; +    cout << "\t--help (-h): show this information" << endl; 
>> +} + + +int main(int argc, char* argv[]) +{ +    if (argc < 2) +
>> { +	info(argv[0]); +	exit(3); +    } + +    int next_option; +
>> const char * const short_options = "df:hi:s:"; +    const struct
>> option long_options [] = +	{ +	    {"debug",  0, NULL, 'd'}, +
>> {"file",  1, NULL, 'f'}, +	    {"help",  0, NULL, 'h'}, +
>> {"id",   1, NULL, 'i'}, +	    {"snp",  1, NULL, 's'}, +	    {NULL
>> ,   0, NULL, 0  } +	}; +    char *program_name = argv[0]; +    bool
>> debug = false; +    string inputFileName = ""; +    string idname =
>> ""; +    string snpname = ""; +    do +    { +	next_option =
>> getopt_long(argc, argv, short_options, long_options, NULL); +
>> switch (next_option) +	{ +	case 'h': +	    info(program_name); +
>> exit(0); +	case 'd': +	    debug = true; +	    break; +	case 'f': +
>> inputFileName = string(optarg); +	    break; +	case 'i': +
>> idname = string(optarg); +	    break; +	case 's': +	    snpname =
>> string(optarg); +	    break; +	case '?': break; +	case -1 : break; 
>> +	} +    } +    while (next_option != -1); + +    if
>> (snpname.compare("") == 0) +    { +	cerr << "Error: No SNP name
>> given" << endl << endl; +	info(program_name); +	exit(2); +    } + +
>> unsigned long int snprow = 0; +    bool snpfound = false; +    bool
>> idfound  = false; + +    if (debug) cout << "Input file is '" <<
>> inputFileName << "'." << endl; + +    FileVector fv(inputFileName,
>> 64, true); + +    if (debug) +    { +	for (unsigned long int col=0;
>> col<fv.getNumObservations(); col++) +	{ +	    cout <<
>> fv.readObservationName(col).name << " "; +	} + +	cout << endl; +
>> cout << "----------------------" << endl; +    } + +    // Look at
>> the SNPs (rows) first +    for (unsigned long int row=0;
>> row<fv.getNumVariables(); row++) +    { +	string current_snpname =
>> fv.readVariableName(row).name; +	if
>> (current_snpname.compare(snpname) == 0) +	{ +	    snpfound = true; 
>> +	    snprow = row; +	    if (debug) +	    { +		cout << "*" <<
>> current_snpname << "* "; +	    } +	} else { +	    if (debug) +
>> { +		cout << current_snpname << " "; +	    } +	} +    } +    if
>> (debug) +    { +	cout << endl; +	cout << "----------------------"
>> << endl; + +	cout << "N_obs = " << fv.getNumObservations() <<
>> "\telement size: " +	     <<  fv.getElementSize() << "\tDataType:
>> " +	     << dataTypeToString(fv.getElementType()) << endl; +    } 
>> + +    if(!snpfound) +    { +	cerr << "SNP name '" << snpname << "'
>> not found in data file " +	     << inputFileName << endl; +
>> exit(1); +    } + +    char * data = new (nothrow)
>> char[fv.getNumObservations() * +					 fv.getElementSize()]; +    if
>> (!data) +    { +	cerr << "Cannot allocate memory for data vector.
>> Exiting..." << endl; +	exit(2); +    } + +
>> fv.readVariable(snprow, data); +    for (unsigned long int col=0;
>> col<fv.getNumObservations(); col++) +    { +	if (idname.compare("")
>> != 0) +	{ +	    // An ID name has been set, only print the dosage
>> for that ID +	    // and SNP combination +
>> if(idname.compare(fv.readObservationName(col).name) == 0) +	    { +
>> idfound = true; +		cout << fv.readObservationName(col).name <<
>> "\t" +		     << bufToString(fv.getElementType(), +
>> &data[col*fv.getElementSize()], +				    string("NA")) +		     <<
>> endl; +	    } +	} else { +	    // Print the dosages for all IDs +
>> cout << fv.readObservationName(col).name << "\t" +		 <<
>> bufToString(fv.getElementType(), +
>> &data[col*fv.getElementSize()], +				string("NA")) +		 << endl; +
>> } +    } + +    if (idfound == false && idname.compare("") != 0) +
>> { +	cerr << "Id '" << idname << "' not found in data file " +
>> << inputFileName << endl; +	exit(1); +    } + +    delete data; +}
>> 
>> _______________________________________________ Genabel-commits
>> mailing list Genabel-commits at lists.r-forge.r-project.org 
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/genabel-commits
>
>> 
> 
> 
> 
> _______________________________________________ genabel-devel mailing
> list genabel-devel at lists.r-forge.r-project.org 
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/genabel-devel
>
> 
-- 
-----------------------------------------------
dr. L.C. Karssen
Erasmus MC
Department of Epidemiology
Room Ee2224

Postbus 2040
3000 CA Rotterdam
The Netherlands

phone: +31-10-7044217
fax: +31-10-7044657
email: l.karssen at erasmusmc.nl
GPG key ID: 0E1D39E3
-----------------------------------------------



-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 198 bytes
Desc: OpenPGP digital signature
URL: <http://lists.r-forge.r-project.org/pipermail/genabel-devel/attachments/20120712/b743c00c/attachment.sig>


More information about the genabel-devel mailing list