[GenABEL-dev] new approach for data storage in GenABEL package
Yurii Aulchenko
yurii.aulchenko at gmail.com
Fri Nov 29 10:43:34 CET 2013
Lennart,
Good point about "depends"!
Again, my question would be how other people do it?
Y
----------------------
Yurii Aulchenko
(sent from mobile device)
> On Nov 29, 2013, at 10:36, "L.C. Karssen" <lennart at karssen.org> wrote:
>
> Hi Maksim,
>
>> On 11/29/2013 08:43 AM, Maksim Struchalin wrote:
>> I looked at how other developers deal with the issue of dependency
>> between a package and its data package. I checked two random packages
>> from CRAN: GANPA (GANPAdata) and gamlss (gamlss.data). Both of them
>> (GANPA and gamlss) depend on their data packages, meaning that their
>> DESCRIPTION files reference their data packages in the "Depends:"
>> field. Only GANPAdata suggests GANPA (gamlss.data neither depends on
>> nor suggests gamlss).
>>
>> When I made GenABEL depend on GenABEL.data, I had in mind the same
>> idea that Nicola expressed below: in this case, GenABEL.data is
>> installed automatically when users run 'install.packages("GenABEL")'.
>> This is convenient for users who install GenABEL from CRAN and it is in
>> line with GANPA and gamlss, but it probably does not fully reflect the
>> GenABEL reality. The dependency between GenABEL and GenABEL.data is
>> weak: GenABEL is going to be used mostly without GenABEL.data. So I
>> support Yurii's idea of making GenABEL.data 'suggested' and including
>> 'require(...)'.
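A minimal sketch of the 'suggested' variant, assuming the data package keeps the name GenABEL.data and that srdta is one of the datasets moved into it. The DESCRIPTION line is shown as a comment, and requireNamespace() is used in place of require() so that the check itself attaches nothing; that substitution is an assumption of this sketch, not something agreed in this thread.

```r
# In GenABEL's DESCRIPTION (sketch; other fields omitted):
#   Suggests: GenABEL.data

# TRUE only if the optional data package is installed; nothing is attached.
has_genabel_data <- requireNamespace("GenABEL.data", quietly = TRUE)

if (has_genabel_data) {
  # Load the example dataset into a private environment
  # (assumes srdta is shipped in GenABEL.data).
  example_env <- new.env()
  data("srdta", package = "GenABEL.data", envir = example_env)
  message("Loaded: ", paste(ls(example_env), collapse = ", "))
} else {
  message("GenABEL.data is not installed; skipping the example.")
}
```

With a guard like this an example still runs (and simply skips the data-dependent part) on a machine where only GenABEL is installed.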
>
> I agree with you that the dependence between GA and GA.data is rather
> weak. On the other hand, why not keep GA.data in Depends? That gives the
> same behaviour as before (install everything by default). Sounds
> convenient to me.
> With modern internet bandwidth the few MB of the data package are not a
> problem.
>
>> About the dot: personally, I like GenABEL.data. From this name it is
>> clear that this package is a kind of 'subpackage' of the GenABEL
>> package and not a standalone one.
>
> Good point!
>
>
> Best regards,
>
> Lennart.
>
>>
>> best,
>> Maksim
>>
>>> On 28/11/2013 18:24, L.C. Karssen wrote:
>>>
>>>> On 11/28/2013 12:12 PM, Yury Aulchenko wrote:
>>>> I would think that GenABEL(.)data is "suggested" and then any
>>>> examples using the data from this package start with something like
>>>>
>>>> if (require("GenABEL(.)data")) ...
>>> This sounds like a good solution.
>>>
>>>> How do other packages which lean on data-packages solve this?
>>>>
>>>> As for the "dot" - I do not have any strong opinion - both options
>>>> seem ok to me :)
>>> Great :-). Then I propose (of course) to stick with the dot, also
>>> because that's already used now.
>>>
>>>
>>> Best,
>>>
>>> Lennart.
>>>
>>>
>>>> best, Yurii
>>>>
>>>>
>>>> On Nov 28, 2013, at 12:06 PM, Nicola Pirastu
>>>> <nicola.pirastu at burlo.trieste.it> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I've been following this conversation with much interest although
>>>>> I'm sorry I can't contribute much.
>>>>>
>>>>> I was just wondering, could GenABEL.data not simply be a dependency
>>>>> of GenABEL? This way, installing GenABEL through install.packages
>>>>> would also install GenABEL.data, without the user having to do it
>>>>> themselves.
>>>>>
>>>>> Best.
>>>>>
>>>>> Nicola
>>>>>
>>>>>
>>>>> Dr. Nicola Pirastu PhD Research Fellow Medical Sciences,
>>>>> Chirurgical and Health Department University of Trieste Medical
>>>>> Genetics IRCCS Burlo Garofolo Via dell'Istria 65/1 34137 Italy tel.
>>>>> +390403785539
>>>>>
>>>>> On 28 Nov 2013, at 11:59, "L.C. Karssen"
>>>>> <lennart at karssen.org> wrote:
>>>>>
>>>>>> Hi Maksim,
>>>>>>
>>>>>> First of all, thanks for the good work!
>>>>>>
>>>>>>> On 11/27/2013 07:58 PM, Maksim Struchalin wrote:
>>>>>>> Hi All,
>>>>>>>
>>>>>>> I created a GenABEL.data package, to which I moved the following
>>>>>>> data: GenABEL/data/* , inst/exdata/srgenos.dat and
>>>>>>> inst/exdata/srphenos.dat. All the corresponding files have been
>>>>>>> deleted from GenABEL. GenABEL.data also contains an R directory
>>>>>>> with three files (ge03d2c.R, ge03d2ex.R and srdta.R). These
>>>>>>> scripts do not go into the final distribution and are kept only
>>>>>>> for possible future use. Only the GenABEL.data/data/* files go
>>>>>>> into GenABEL.data_1.0.tar.gz after running "R CMD build
>>>>>>> GenABEL.data". The directories "R" and "inst" are removed by
>>>>>>> running GenABEL/data/clean.R in the "build" process. Maybe it is
>>>>>>> not a good idea to do it this way but, at least, it is
>>>>>>> convenient and has no effect on end users (please suggest a
>>>>>>> better way).
>>>>>>>
>>>>>>> The way GenABEL.data works now is not what we discussed
>>>>>>> below. It is impossible to generate the files during "R CMD
>>>>>>> INSTALL" and undesirable during "R CMD build". The best option
>>>>>>> was simply to move all the data from GenABEL to GenABEL.data
>>>>>>> (as the CRAN people suggested). In this case, we can install
>>>>>>> GenABEL.data without having GenABEL installed. After this, we
>>>>>>> install GenABEL.
>>>>>> This sounds very strange to me. Does the user first need to
>>>>>> install the GenABEL.data package and then the 'main' GenABEL
>>>>>> package? Or do I misunderstand you? What happens if the user
>>>>>> installs them in a different order? I guess that shouldn't
>>>>>> matter, right, as the package contains only data?
>>>>>>
>>>>>>> When we run library(GenABEL), it automatically attaches
>>>>>>> GenABEL.data. Thus, the only change for users is that they now
>>>>>>> need to install two packages (GenABEL.data and GenABEL).
>>>>>> And GenABEL.data is only needed if they actually want to use the
>>>>>> examples, right? Or do we simply put GenABEL.data in the list of
>>>>>> required packages in the DESCRIPTION file?
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Lennart.
>>>>>>
>>>>>>> Now both packages are much smaller: 469K for
>>>>>>> GenABEL and 2.4M for GenABEL.data.
>>>>>>>
>>>>>>> It should work now, but if you experience some problems, let me
>>>>>>> know.
>>>>>>>
>>>>>>> best, Maksim
>>>>>>>
>>>>>>>
>>>>>>>> On 26/11/2013 20:48, L.C. Karssen wrote:
>>>>>>>> Hi Maksim,
>>>>>>>>
>>>>>>>>> On 11/26/2013 12:11 PM, Maksim Struchalin wrote:
>>>>>>>>> I am still working on compressing the GenABEL data. To
>>>>>>>>> remind you: the idea is to compress the original
>>>>>>>>> data text files and use them later to generate the RData
>>>>>>>>> files (e.g. srdta).
>>>>>>>>>
>>>>>>>>> Yurii proposed making the RData files in the examples which
>>>>>>>>> use them. I currently see only one way to implement this
>>>>>>>>> idea: we replace the "data(srdta)" line in every file
>>>>>>>>> where it is used with a function, e.g. "generate_srdt()",
>>>>>>>>> which generates the srdta object, and do the same for the
>>>>>>>>> other five *.RData files from GenABEL/data. If we go this
>>>>>>>>> way, we have to change 71 files in the man directory and, in
>>>>>>>>> addition, the GenABEL manual. Also, users will no longer be
>>>>>>>>> able to load the srdta set (and the others) by typing
>>>>>>>>> "data(srdta)" on the command line (as they are used to) and
>>>>>>>>> will have to know that the function generate_srdt() now
>>>>>>>>> serves this purpose. This all sounds nasty :-).
>>>>>>>> I'm not sure how many users actually type data(srdta), but I
>>>>>>>> see your point.
>>>>>>>>
>>>>>>>>> Making the data at package installation time is also a
>>>>>>>>> bad idea, as Yurii noted below. Actually, it is impossible,
>>>>>>>>> because making the GenABEL data requires GenABEL
>>>>>>>>> functions, which are not available at installation time
>>>>>>>>> (they are available only after GenABEL is installed).
>>>>>>>> Good point!
>>>>>>>>
>>>>>>>>> I see only one good solution now: move all the GenABEL data
>>>>>>>>> to a new package, e.g. GenABELdata, as the CRAN people
>>>>>>>>> proposed from the beginning. In this case, it is possible
>>>>>>>>> to generate the RData files at installation time using
>>>>>>>>> GenABEL functions (which are installed by that time). I think
>>>>>>>>> this solution is platform independent because the R rules
>>>>>>>>> permit running *.R scripts to generate data at installation
>>>>>>>>> time.
>>>>>>>>>
>>>>>>>>> What do you think about making a data package for GenABEL?
>>>>>>>>> Do you think the name GenABELdata is ok? Maybe we can move
>>>>>>>>> all the *ABEL data into the DatABEL package instead of making
>>>>>>>>> *ABELdata data packages?
>>>>>>>> Sounds like this is the best solution. Thanks for digging
>>>>>>>> into this. As for the package name, either GenABELdata or
>>>>>>>> GenABEL.data sounds fine to me (the latter being a bit
>>>>>>>> clearer in my opinion).
>>>>>>>>
>>>>>>>>
>>>>>>>> Best,
>>>>>>>>
>>>>>>>> Lennart
>>>>>>>>
>>>>>>>>> best, Maksim
>>>>>>>>>
>>>>>>>>>> On 18/11/2013 18:54, Yury Aulchenko wrote:
>>>>>>>>>> On Nov 15, 2013, at 17:21 PM, L.C. Karssen
>>>>>>>>>> <lennart at karssen.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Maksim,
>>>>>>>>>>>
>>>>>>>>>>>> On 14-11-13 22:38, Maksim Struchalin wrote:
>>>>>>>>>>>> In this email, I propose a new approach which reduces
>>>>>>>>>>>> the total size of the data from 8Mb to 2Mb, which in
>>>>>>>>>>>> turn reduces the entire GenABEL size from 12Mb to 6Mb.
>>>>>>>>>>> I guess you mean B (bytes) instead of b (bits) here
>>>>>>>>>>> :-).
>>>>>>>>>>>
>>>>>>>>>>>> "R CMD check --as-cran" reports that the following
>>>>>>>>>>>> sub-directories are too big: data (2.3Mb),
>>>>>>>>>>>> exdata (5.7Mb) and libs (2.6Mb). After the last
>>>>>>>>>>>> GenABEL submission to CRAN, the maintainers suggested
>>>>>>>>>>>> creating a new package called GenABELdata and moving
>>>>>>>>>>>> all the data there. I went through the data and found
>>>>>>>>>>>> that:
>>>>>>>>>>>> 1) The "exdata" directory can be compressed by gzip
>>>>>>>>>>>> and reduced from 5.8Mb -> 1.1Mb.
>>>>>>>>>>>> - There is a function gunzip() in the R.utils package
>>>>>>>>>>>> which can decompress the files. It works on any OS.
>>>>>>>>>>>> - Moreover: the native R function read.table() can
>>>>>>>>>>>> read gzip files without prior decompression.
>>>>>>>>>>>> - Even more: it looks like the biggest file,
>>>>>>>>>>>> "srgenos.dat", was used only once, a long time ago,
>>>>>>>>>>>> for generating "srdta.RData", and now it is just
>>>>>>>>>>>> sitting there and eating space needlessly.
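The read.table() point can be checked with base R alone. The sketch below writes a small gzip-compressed table to a temporary file and reads it back directly; the column names and values are invented for illustration and are not the layout of srgenos.dat or srphenos.dat.

```r
# Write a small whitespace-separated table to a gzip-compressed temp file.
# The columns are made up for this example.
toy <- data.frame(id = c("p1", "p2", "p3"), snp1 = c(0L, 1L, 2L))
gz_path <- tempfile(fileext = ".dat.gz")
con <- gzfile(gz_path, open = "wt")
write.table(toy, con, quote = FALSE, row.names = FALSE)
close(con)

# read.table() opens the path with file(), which detects gzip compression,
# so no separate decompression step is needed.
toy_back <- read.table(gz_path, header = TRUE, stringsAsFactors = FALSE)
print(toy_back)

# Remove the temporary file.
unlink(gz_path)
```

The same call should work on a gzip-compressed file shipped in a package's exdata directory, located with system.file().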
>>>>>>>>>>> Sounds like a waste of space!
>>>>>>>>>>>
>>>>>>>>>>>> 2) We can delete some files from the "data"
>>>>>>>>>>>> directory. The deleted files will be generated on the
>>>>>>>>>>>> user's computer from the files in exdata. This can
>>>>>>>>>>>> be done during INSTALLATION (a line in a Makefile?) or
>>>>>>>>>>>> on the first load (by running the function .onAttach()
>>>>>>>>>>>> in R/zzz.R).
>>>>>>>>>>> This sounds like a perfectly acceptable option.
>>>>>>>>>> I suggest this is done in the examples which make use of
>>>>>>>>>> this data, NOT in the INSTALL etc. - we should make
>>>>>>>>>> things as "robust" as possible and interfere as little as
>>>>>>>>>> possible with the usual workflow (which is very much
>>>>>>>>>> system-specific, in that we will need to test on all
>>>>>>>>>> platforms)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>> It will reduce the total size of the "data" directory
>>>>>>>>>>>> from 2.3Mb to 800Kb.
>>>>>>>>>>> Fantastic! If no one has other objections I say: go
>>>>>>>>>>> ahead.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>>
>>>>>>>>>>> Lennart.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> Any objections/suggestions?
>>>>>>>>>>>>
>>>>>>>>>>>> best, Maksim
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> genabel-devel mailing list
>>>>>>>>>>>> genabel-devel at lists.r-forge.r-project.org
>>>>>>>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/genabel-devel
>>>>>>>>>>> --
>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>> L.C. Karssen
>>>>>>>>>>> Utrecht The Netherlands
>>>>>>>>>>>
>>>>>>>>>>> lennart at karssen.org http://blog.karssen.org
>>>>>>>>>>>
>>>>>>>>>>> Please do not send me Word or PowerPoint files! See
>>>>>>>>>>> http://www.gnu.org/philosophy/no-word-attachments.nl.html
>>>>>>>>>>> ------------------------------------------------------------------
>
> --
> *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
> L.C. Karssen
> Utrecht
> The Netherlands
>
> lennart at karssen.org
> http://blog.karssen.org
> GPG key ID: A88F554A
> -*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-