[GenABEL-dev] new approach for data storage in GenABEL package

Maksim Struchalin m.v.struchalin at mail.ru
Fri Nov 29 08:43:50 CET 2013


I looked at how other developers deal with the dependency between a 
package and its data package. I checked two random packages from 
CRAN: GANPA (GANPAdata) and gamlss (gamlss.data). Both of them (GANPA 
and gamlss) depend on their data packages, meaning that their 
DESCRIPTION files list the data package in the "Depends:" field. In 
the other direction, only GANPAdata suggests GANPA (gamlss.data 
neither depends on nor suggests gamlss).
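
To make the difference between the two fields concrete, here is a rough 
sketch that writes a made-up DESCRIPTION file to a temporary location and 
reads it back with read.dcf(). The field values are assumptions for 
illustration only; they are not the real DESCRIPTION of GenABEL, GANPA 
or gamlss.

```r
# Illustrative DESCRIPTION contents (hypothetical, not GenABEL's real file).
desc_lines <- c(
  "Package: GenABEL",
  "Version: 0.0-1",
  "Depends: R, methods",
  "Suggests: GenABEL.data"
)
desc_file <- tempfile("DESCRIPTION")
writeLines(desc_lines, desc_file)

# read.dcf() returns a character matrix with one column per requested field.
desc <- read.dcf(desc_file, fields = c("Package", "Depends", "Suggests"))

# Packages in "Depends:" are installed by default and attached together with
# the package; packages in "Suggests:" are optional and are typically needed
# only for examples, tests or vignettes.
cat("Depends: ", desc[1, "Depends"], "\n")
cat("Suggests:", desc[1, "Suggests"], "\n")
```

With a layout like this, install.packages("GenABEL") would not pull in 
GenABEL.data unless the user asks for suggested packages as well.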

When I made GenABEL depend on GenABEL.data, I had in mind the same 
idea that Nicola describes below: in this case, GenABEL.data is 
installed automatically when users run install.packages("GenABEL"). 
This is convenient for users who install GenABEL from CRAN and is in 
line with GANPA and gamlss, but it probably does not fully reflect the 
GenABEL reality. The dependency between GenABEL and GenABEL.data is 
weak: GenABEL will mostly be used without GenABEL.data. So I support 
Yurii's idea of making GenABEL.data 'suggested' and including 
'require(...)'.
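
A minimal sketch of what such a guarded example could look like, assuming 
GenABEL.data is listed under "Suggests:" and still provides the srdta data 
set. The helper run_if_available() is hypothetical and not part of 
GenABEL; it only illustrates the pattern.

```r
# Hypothetical helper (not part of GenABEL): run `code` only when the
# suggested package `pkg` is installed; otherwise skip with a message.
run_if_available <- function(pkg, code) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    code()
  } else {
    message("Package '", pkg, "' is not installed; skipping this example.")
    invisible(NULL)
  }
}

# An example that needs the data package is wrapped in the guard.
# Assumption: the data set is still called "srdta" in GenABEL.data.
res <- run_if_available("GenABEL.data", function() {
  data("srdta", package = "GenABEL.data", envir = environment())
  exists("srdta", inherits = FALSE)
})
```

In a help page, the same effect can be had directly with 
if (require("GenABEL.data")) { ... } around the lines that use the data, 
so the example is simply skipped when the data package is missing.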

About the dot: personally, I like GenABEL.data. The name makes it clear 
that this package is a kind of 'subpackage' of GenABEL and not a 
standalone one.

best,
Maksim

On 28/11/2013 18:24, L.C. Karssen wrote:
>
> On 11/28/2013 12:12 PM, Yury Aulchenko wrote:
>> I would think that GenABEL(.)data is "suggested" and then any
>> examples using the data from this packages start with something like
>>
>> if (require("GenABEL(.)data") ...
> This sounds like a good solution.
>
>> How do other packages which lean on data-packages solve this?
>>
>> As for the "dot" - I do not have any strong opinion - both options
>> seem ok to me :)
> Great :-). Then I propose (of course) to stick with the dot, also
> because that's already used now.
>
>
> Best,
>
> Lennart.
>
>
>> best, Yurii
>>
>>
>> On Nov 28, 2013, at 12:06 PM, Nicola Pirastu
>> <nicola.pirastu at burlo.trieste.it> wrote:
>>
>>> Hi all,
>>>
>>> I've been following this conversation with much interest although
>>> I'm sorry I can't contribute much.
>>>
>>> I was just wondering, could GenABEL.data not simply be a dependency
>>> of GenABEL? That way, installing GenABEL through install.packages
>>> would also install GenABEL.data without the user having to do it
>>> himself.
>>>
>>> Best.
>>>
>>> Nicola
>>>
>>>
>>> Dr. Nicola Pirastu PhD Research Fellow Medical Sciences,
>>> Chirurgical and Health Department University of Trieste Medical
>>> Genetics IRCCS Burlo Garofolo Via dell'Istria 65/1 34137 Italy tel.
>>> +390403785539
>>>
>>> On 28 Nov 2013, at 11:59, "L.C. Karssen"
>>> <lennart at karssen.org> wrote:
>>>
>>>> Hi Maksim,
>>>>
>>>> First of all, thanks for the good work!
>>>>
>>>> On 11/27/2013 07:58 PM, Maksim Struchalin wrote:
>>>>> Hi All,
>>>>>
>>>>> I created a GenABEL.data package, into which I moved the
>>>>> following data: GenABEL/data/*, inst/exdata/srgenos.dat and
>>>>> inst/exdata/srphenos.dat. All the corresponding files have been
>>>>> deleted from GenABEL. GenABEL.data also contains an R directory
>>>>> with three files (ge03d2c.R, ge03d2ex.R and srdta.R). These
>>>>> scripts do not go into the final distribution and are kept only
>>>>> for possible future use. Only the GenABEL.data/data/* files go
>>>>> into GenABEL.data_1.0.tar.gz after running "R CMD build
>>>>> GenABEL.data". The directories "R" and "inst" are removed by
>>>>> running GenABEL/data/clean.R during the "build" process. Maybe
>>>>> it is not a good idea to do it this way but, at least, it is
>>>>> convenient and has no effect on end users (please suggest a
>>>>> better way).
>>>>>
>>>>> The way GenABEL.data works now is not what we discussed
>>>>> below. It is impossible to generate files during "R CMD
>>>>> INSTALL" and undesirable during "R CMD build". The best option
>>>>> was just to move all the data from GenABEL to GenABEL.data
>>>>> (as the CRAN people suggested). In this case, we can install
>>>>> GenABEL.data without having GenABEL installed. After this, we
>>>>> install GenABEL.
>>>> This sounds very strange to me. Does the user first need to
>>>> install the GenABEL.data package and then the 'main' GenABEL
>>>> package? Or do I misunderstand you? What happens if the user
>>>> installs them in a different order? I guess that shouldn't
>>>> matter, right, as the package contains only data?
>>>>
>>>>> When we run library(GenABEL), it automatically attaches
>>>>> GenABEL.data. Thus, the only change for users is that they need
>>>>> to install two packages now (GenABEL.data and GenABEL).
>>>> And GenABEL.data is only needed if they actually want to use the
>>>> examples, right? Or do we simply put GenABEL.data in the list of
>>>> required packages in the DESCRIPTION file?
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Lennart.
>>>>
>>>>> Both packages are now much smaller: 469K for GenABEL and 2.4M
>>>>> for GenABEL.data.
>>>>>
>>>>> It should work now, but if you experience some problems, let me
>>>>> know.
>>>>>
>>>>> best, Maksim
>>>>>
>>>>>
>>>>> On 26/11/2013 20:48, L.C. Karssen wrote:
>>>>>> Hi Maksim,
>>>>>>
>>>>>> On 11/26/2013 12:11 PM, Maksim Struchalin wrote:
>>>>>>> I am still working on compressing the GenABEL data. To
>>>>>>> remind you: the idea is to compress the original data text
>>>>>>> files and use them later to generate the RData files (e.g.
>>>>>>> srdta).
>>>>>>>
>>>>>>> Yurii proposed generating the RData files in the examples
>>>>>>> that use them. I currently see only one way to implement
>>>>>>> this idea: we replace the "data(srdta)" line in every file
>>>>>>> where it is used with a call to a function, e.g.
>>>>>>> "generate_srdt()", which generates the srdta object, and do
>>>>>>> the same for the other five *.RData files from GenABEL/data.
>>>>>>> If we go this way, we have to change 71 files in the man
>>>>>>> directory and, in addition, the GenABEL manual. Also, users
>>>>>>> will no longer be able to load the srdta set (and others) by
>>>>>>> typing "data(srdta)" on the command line (as they are used
>>>>>>> to) and will have to know that generate_srdt() now serves
>>>>>>> this purpose. This all sounds nasty :-).
>>>>>> I'm not sure how many users actually type data(srdta), but I
>>>>>> see your point.
>>>>>>
>>>>>>> Generating the data at package installation time is also a
>>>>>>> bad idea, as Yurii noted below. Actually, it is impossible,
>>>>>>> because generating the GenABEL data requires GenABEL
>>>>>>> functions, which are not available at installation time
>>>>>>> (they are available only after GenABEL is installed).
>>>>>> Good point!
>>>>>>
>>>>>>> I see only one good solution now: move all the GenABEL data
>>>>>>> to a new package, e.g. GenABELdata, as the CRAN people
>>>>>>> proposed from the beginning. In this case, it is possible
>>>>>>> to generate the RData files at installation time using
>>>>>>> GenABEL functions (which are installed by that time). I think
>>>>>>> this solution is platform independent because the R rules
>>>>>>> permit running *.R scripts to generate data at installation
>>>>>>> time.
>>>>>>>
>>>>>>> What do you think about making a data package for GenABEL?
>>>>>>> Do you think the name GenABELdata is ok? Maybe we could move
>>>>>>> all the *ABEL data into the DatABEL package instead of making
>>>>>>> *ABELdata data packages?
>>>>>> Sounds like this is the best solution. Thanks for digging
>>>>>> into this. As for the package name, either GenABELdata or
>>>>>> GenABEL.data sounds fine to me (the latter being a bit
>>>>>> clearer in my opinion).
>>>>>>
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Lennart
>>>>>>
>>>>>>> best, Maksim
>>>>>>>
>>>>>>> On 18/11/2013 18:54, Yury Aulchenko wrote:
>>>>>>>> On Nov 15, 2013, at 17:21 PM, L.C. Karssen
>>>>>>>> <lennart at karssen.org> wrote:
>>>>>>>>
>>>>>>>>> Hi Maksim,
>>>>>>>>>
>>>>>>>>> On 14-11-13 22:38, Maksim Struchalin wrote:
>>>>>>>>>> In this email, I propose a new approach that reduces the
>>>>>>>>>> total size of the data from 8Mb to 2Mb, which reduces
>>>>>>>>>> the size of the entire GenABEL package from 12Mb to 6Mb.
>>>>>>>>> I guess you mean B (bytes) instead of b (bits) here
>>>>>>>>> :-).
>>>>>>>>>
>>>>>>>>>> "R CMD check --as-cran" reports that the following
>>>>>>>>>> sub-directories are too large: data (2.3Mb),
>>>>>>>>>> exdata (5.7Mb) and libs (2.6Mb). After the last
>>>>>>>>>> GenABEL submission to CRAN, the maintainers suggested
>>>>>>>>>> creating a new package called GenABELdata and moving
>>>>>>>>>> all the data there. I went through the data and found
>>>>>>>>>> that:
>>>>>>>>>>
>>>>>>>>>> 1) The "exdata" directory can be compressed with gzip,
>>>>>>>>>> reducing it from 5.8Mb to 1.1Mb.
>>>>>>>>>> - The function gunzip() from the R.utils package can
>>>>>>>>>>   decompress the files. It works on any OS.
>>>>>>>>>> - Moreover, the native R function read.table() can read
>>>>>>>>>>   gzip files without decompressing them first.
>>>>>>>>>> - Furthermore, it looks like the biggest file,
>>>>>>>>>>   "srgenos.dat", was used only once, a long time ago, to
>>>>>>>>>>   generate "srdta.RData", and now it is just sitting
>>>>>>>>>>   there and taking up space needlessly.
>>>>>>>>> Sounds like a waste of space!
>>>>>>>>>
>>>>>>>>>> 2) We can delete some files from the "data"
>>>>>>>>>> directory. The deleted files will be generated on the
>>>>>>>>>> user's computer from the files in exdata. This can
>>>>>>>>>> be done during INSTALLATION (a line in a Makefile?) or
>>>>>>>>>> on first load (by running the function .onAttach()
>>>>>>>>>> in R/zzz.R).
>>>>>>>>> This sounds like a perfectly acceptable option.
>>>>>>>> I suggest this is done in the "examples" which make use of
>>>>>>>> this data, NOT in the INSTALL etc. - we should make
>>>>>>>> things as "robust" as possible and interfere as little as
>>>>>>>> possible with the usual workflow (which is very much
>>>>>>>> system-specific, in that we will need to test on all
>>>>>>>> platforms)
>>>>>>>>
>>>>>>>>
>>>>>>>>>> This will reduce the total size of the "data" directory
>>>>>>>>>> from 2.3Mb to 800Kb.
>>>>>>>>> Fantastic! If no one has other objections I say: go
>>>>>>>>> ahead.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>>
>>>>>>>>> Lennart.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Any objections/suggestions?
>>>>>>>>>>
>>>>>>>>>> best, Maksim
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> genabel-devel mailing list
>>>>>>>>>> genabel-devel at lists.r-forge.r-project.org
>>>>>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/genabel-devel
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> L.C. Karssen
>>>>>>>>> Utrecht, The Netherlands
>>>>>>>>>
>>>>>>>>> lennart at karssen.org http://blog.karssen.org
>>>>>>>>>
>>>>>>>>> Please do not send me Word or PowerPoint files! See
>>>>>>>>> http://www.gnu.org/philosophy/no-word-attachments.nl.html
>>>>>>>>>
>>>>
>>>> --
>>>> L.C. Karssen
>>>> Utrecht, The Netherlands
>>>>
>>>> lennart at karssen.org http://blog.karssen.org GPG key ID: A88F554A
>>>>