Hi All,

Regarding compression of data: when ALL the data is in its genotyped form, there is a very consistent structure of only 1s and 0s. Since there are three columns per individual representing the observed SNP, 2/3 of the resulting data are zeroes. Has it been considered to offer a variation of compression based on this? Either through sparse matrices, or by reducing the three columns to just one, using a single digit (1, 2, 3) to represent the presence of AA, AB or BB? This, of course, would not work with imputed data.
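
As a rough illustration of the single-digit idea, here is a small sketch in Python. The layout is an assumption on my part (one row per SNP, and per individual three 0/1 indicator columns for AA, AB, BB); numpy is only used for convenience:

import numpy as np

# Assumed layout: rows are SNPs; each group of three columns holds one
# individual's 0/1 indicators for AA, AB, BB.
indicators = np.array([
    [1, 0, 0,   0, 1, 0],   # SNP 1: individual 1 = AA, individual 2 = AB
    [0, 0, 1,   1, 0, 0],   # SNP 2: individual 1 = BB, individual 2 = AA
], dtype=np.uint8)

n_snps, n_cols = indicators.shape
triples = indicators.reshape(n_snps, n_cols // 3, 3)

# Collapse each indicator triple into a single digit: 1 = AA, 2 = AB, 3 = BB.
genotypes = triples.argmax(axis=2).astype(np.uint8) + 1
print(genotypes)   # [[1 2]
                   #  [3 1]]

That is one byte (or even two bits) per genotype instead of three columns, with no gz round trip needed to read it.
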
Compressing data using gz or similar is bad practice anyway; data handling takes HOURS just to uncompress datasets. Sparse matrices, on the other hand, already work well with linear equation solvers, and efficient algorithms exist for them.
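
To make the sparse-matrix point concrete, a minimal sketch (assuming SciPy is available; the matrix and right-hand side are made up purely for illustration):

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import spsolve

# Toy system: a mostly-zero matrix kept in compressed sparse row form,
# so only the four non-zero entries are actually stored.
A = csr_matrix(np.array([
    [4.0, 0.0, 0.0],
    [0.0, 3.0, 0.0],
    [1.0, 0.0, 2.0],
]))
b = np.array([8.0, 6.0, 5.0])

x = spsolve(A, b)   # sparse direct solve, no need to densify A
print(x)            # [2.  2.  1.5]

The same idea would apply to genotype matrices: with 2/3 of the entries being zero, a sparse representation saves storage without any decompression wait.
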
I have already managed to start a cultural change locally towards uncompressed data. This requires a lot of infrastructure changes for the cluster used here, but waiting for decompression is simply bad practice when data is used across many institutes with limited computational resources. There seems to be some willingness to consider it.

Because of this, I don't think pursuing compression of the imputed binary from filevector is worthwhile. Offering some tutorials on how to set up a proper, sustainable workflow seems more beneficial. Topics could include quality control, scalable storage and computational resources, statistical requirements of the data, etc.

Problems arise when the workflow is a mess of inconsistencies, and in that case no single isolated tool can help.

Just some thoughts. If there is any interest in any of this, let me know.

-Alvaro