[GenABEL-dev] DatABEL

L.C. Karssen lennart at karssen.org
Fri May 1 16:57:55 CEST 2015


Hi Benjamin,

Thanks for your interest in DatABEL. Because most of DatABEL was
developed before I took over maintenance of the package, I have put our
development mailing list in CC, just in case one of the other developers
wants to chime in.

On 28-04-15 14:29, Benjamin Hofner wrote:
> Hi Lennart,
> 
> we are currently trying to use your package DatABEL to store the data
> for complex GWAS analysis. We are not using your standard tool sets
> implemented in GenABEL and co but are trying to implement a novel method
> ourselves. Currently, we are facing several problems which are most
> likely related to the fact that you store the data on the HDD and use
> pointers (?) to access the data.

The DatABEL package is basically an R interface to a lower-level library
written in C++, which we call filevector [1]. Maybe it's worth looking
at that as well. In the source code repo at [1] you will also find a few
utilities written in C++ to convert text files to and from fvi/fvd files.

When you create a DatABEL object in R it is indeed basically a pointer
to the data in the backing file. The .fvi file contains index data which
is then used to quickly read the actual data from the .fvd file.
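
To illustrate the idea, here is a toy sketch in base R. It is not the
actual filevector format: the .idx/.dat file names, the layout and the
read_column() helper are made up for this example. It only shows how a
small index lets you read one column from a data file without loading
the rest.

```r
# Toy layout for illustration only; NOT the real filevector .fvi/.fvd format.
base <- tempfile("toy")                 # hypothetical base file name
m <- matrix(as.double(1:12), nrow = 3)  # 3 rows, 4 columns

# "Index" file: just the dimensions. "Data" file: the columns back to back.
writeBin(dim(m), paste0(base, ".idx"), size = 4L)
writeBin(as.vector(m), paste0(base, ".dat"), size = 8L)

# Read column j from disk without loading the rest of the data.
read_column <- function(base, j) {
  dims <- readBin(paste0(base, ".idx"), what = "integer", n = 2L, size = 4L)
  con <- file(paste0(base, ".dat"), open = "rb")
  on.exit(close(con))
  # Each column occupies dims[1] doubles of 8 bytes each.
  seek(con, where = (j - 1) * dims[1] * 8, origin = "start")
  readBin(con, what = "double", n = dims[1], size = 8L)
}

col3 <- read_column(base, 3)  # same values as m[, 3]
```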

> 
> 1) How can one store and share databel objects? I.e. is it possible to
> store a databel object using save("objectname", file = "data.Rda")? On
> one system it works fine. 

So you say you basically create a DatABEL object using databel() and
then want to save that object. Interesting, I never tried that.

> It seems to be transferable if one moves the
> Rda file together with the fvd and fvi files (and does not rename these).

Yes, that's what I expect. Because of the large amount of data, the
data itself is not copied from the .fv{i,d} files into the actual object
(and therefore not into your .Rda file) when creating an object. As you
surmised, the object is only a pointer to the data (with some associated
information like the buffer size).
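
The consequence for save() can be mimicked with a toy handle in base R.
This is only a sketch under assumptions: the "handle" list and its
fields are hypothetical stand-ins, not how a DatABEL object is actually
stored. The point is that the saved file holds the path, not the data,
so it stops working as soon as the backing file moves, unless the path
is updated.

```r
# Toy backing file with 1e5 doubles (hypothetical, not a real .fvd file).
dat_file <- tempfile(fileext = ".dat")
writeBin(as.double(seq_len(1e5)), dat_file, size = 8L)

# Toy handle: only the location and length of the data, not the data.
handle <- list(path = dat_file, n = 1e5)

rda_file <- tempfile(fileext = ".Rda")
save(handle, file = rda_file)

# The saved object is far smaller than the data it points to.
rda_is_small <- file.size(rda_file) < file.size(dat_file)

# Simulate a new session in which the backing file has been moved.
rm(handle)
new_path <- tempfile(fileext = ".dat")
file.rename(dat_file, new_path)
load(rda_file)
stale <- !file.exists(handle$path)  # the stored path no longer resolves

# Repointing the handle by hand makes it usable again.
handle$path <- new_path
first_values <- readBin(handle$path, what = "double", n = 3L, size = 8L)
```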

> Couldn't one include this file in the Rda file and/or allow altering the
> path via
> 
> backingfilename() <- "newpath/filename"
> 
> 2) We are trying to use multicore aka mclapply techniques to speed up
> computations. 

If I understand it correctly, you would like to share a (saved) DatABEL
object among several processes where each process works on a subset of
the data in that object. Is that correct?

My first reaction is to say that (imputed) genetic data is usually
already split into several hundred files (assuming 1kG imputed data), so
you could simply use those for data parallelism.
But I can see that parallel access to a subset of a DatABEL object has
its use.

> However, this does not work as the forked processes seem
> to have lost the pointer to the databel file. Sequentially, i.e., using
> lapply, everything works fine. Do you have any experiences here? 

Unfortunately not.
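
One thing that might be worth trying, although I have not tested it
with DatABEL, is to let every worker open the backing file itself
instead of inheriting an already opened object from the parent process.
The sketch below shows that pattern with a toy binary file in base R;
the file layout and the column_sum() helper are made up for the example.
With DatABEL the analogous step would presumably be to call databel()
inside the function that is passed to mclapply().

```r
library(parallel)

# Toy stand-in for a backing file: 4 columns of 3 doubles, stored column
# after column. NOT the real filevector format.
dat_file <- tempfile(fileext = ".dat")  # hypothetical file name
n_row <- 3L
n_col <- 4L
writeBin(as.double(seq_len(n_row * n_col)), dat_file, size = 8L)

# Every call opens and closes its own connection, so no file handle or
# file position is shared between the forked workers.
column_sum <- function(j) {
  con <- file(dat_file, open = "rb")
  on.exit(close(con))
  seek(con, where = (j - 1) * n_row * 8, origin = "start")
  sum(readBin(con, what = "double", n = n_row, size = 8L))
}

# mclapply() relies on forking, which is not available on Windows.
cores <- if (.Platform$OS.type == "windows") 1L else 2L
sums <- unlist(mclapply(seq_len(n_col), column_sum, mc.cores = cores))
```

Whether this solves the problem you see depends on what exactly gets
lost in the forked processes, so a minimal example would still help.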


Best regards,

Lennart.


> Can you
> provide any help? If necessary, I can try to provide a minimal example
> that reproduces this problem/error.
> 
> Best regards,
> Benjamin

-- 
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
L.C. Karssen
Utrecht
The Netherlands

lennart at karssen.org
http://blog.karssen.org
GPG key ID: A88F554A
-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-
