[GenABEL-dev] new approach for data storage in GenABEL package
Maksim Struchalin
m.v.struchalin at mail.ru
Fri Nov 29 11:42:42 CET 2013
Hi Yurii & Lennart,
Yesterday you supported the idea of making GenABEL.data 'suggested':
________________________________________________________________
On 28/11/2013 18:24, L.C. Karssen wrote:
> On 11/28/2013 12:12 PM, Yury Aulchenko wrote:
> I would think that GenABEL(.)data is "suggested" and then any
> examples using the data from this packages start with something like
>
> if (require("GenABEL(.)data") ...
This sounds like a good solution.
________________________________________________________________
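For reference, a minimal sketch of the guarded-example pattern quoted above, assuming GenABEL.data is only 'suggested' and provides the srdta data set. The helper name with_optional_data() is hypothetical and not part of GenABEL; it uses requireNamespace() rather than require() so that the check itself does not attach the package.

```r
# Hypothetical helper (not part of GenABEL): run `fun` on a data set from an
# optional package, or skip quietly when that package is not installed.
with_optional_data <- function(pkg, dataset, fun) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    message("Package '", pkg, "' is not installed; skipping this example.")
    return(invisible(NULL))
  }
  env <- new.env()
  # Load the data set into a private environment instead of the workspace.
  utils::data(list = dataset, package = pkg, envir = env)
  fun(env[[dataset]])
}

# Example use: only prints a message if GenABEL.data is missing.
with_optional_data("GenABEL.data", "srdta", function(d) print(class(d)))
```

With this shape, an example in a help page still passes "R CMD check" on a machine where the data package is absent.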
Today you propose to put it under 'Depends' instead, or am I misunderstanding something?
About how other people do it: I looked at the GANPA and gamlss packages.
They list their data packages (GANPAdata and gamlss.data) under 'Depends'
(see my message below).
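To make the two options concrete, here is a small sketch that parses two DESCRIPTION fragments with read.dcf() and reports the field in which GenABEL.data is declared. The fragments and the helper name declared_in() are illustrative assumptions, not the actual GenABEL DESCRIPTION file.

```r
# Illustrative DESCRIPTION fragments (assumptions, not the real GenABEL file).
desc_depends  <- c("Package: GenABEL",
                   "Depends: methods, GenABEL.data")
desc_suggests <- c("Package: GenABEL",
                   "Depends: methods",
                   "Suggests: GenABEL.data")

# Hypothetical helper: return the dependency fields that mention `pkg`.
declared_in <- function(desc_lines, pkg) {
  con <- textConnection(desc_lines)
  on.exit(close(con))
  fields <- c("Depends", "Imports", "Suggests")
  d <- read.dcf(con, fields = fields)
  hits <- vapply(fields, function(f) {
    value <- d[1, f]
    if (is.na(value)) return(FALSE)
    # Split on commas, then strip version requirements and whitespace.
    entries <- trimws(sub("\\(.*\\)", "", strsplit(value, ",")[[1]]))
    pkg %in% entries
  }, logical(1))
  names(hits)[hits]
}

declared_in(desc_depends, "GenABEL.data")
declared_in(desc_suggests, "GenABEL.data")
```

With 'Depends', install.packages("GenABEL") also installs the data package and library(GenABEL) attaches it. With 'Suggests', it is installed only on request, e.g. install.packages("GenABEL", dependencies = TRUE).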
best,
Maksim
On 29/11/2013 16:43, Yurii Aulchenko wrote:
> Lennart,
>
> Good point about "depends"!
>
> Again, my question would be how other people do it?
>
> Y
>
> ----------------------
> Yurii Aulchenko
> (sent from mobile device)
>
>> On Nov 29, 2013, at 10:36, "L.C. Karssen" <lennart at karssen.org> wrote:
>>
>> Hi Maksim,
>>
>>> On 11/29/2013 08:43 AM, Maksim Struchalin wrote:
>>> I looked at how other developers deal with the dependency between a
>>> package and its data package. I checked two random packages from
>>> CRAN: GANPA (GANPAdata) and gamlss (gamlss.data). Both of them (GANPA
>>> and gamlss) depend on their data packages, meaning that their
>>> DESCRIPTION files list the data package in the "Depends:" field.
>>> Only GANPAdata suggests GANPA (gamlss.data neither depends on nor
>>> suggests gamlss).
>>>
>>> When I made GenABEL depend on GenABEL.data, I had in mind the same
>>> idea that Nicola expressed below: in this case, GenABEL.data is
>>> installed automatically when users run install.packages("GenABEL").
>>> This is convenient for users who install GenABEL from CRAN, and it is
>>> in line with GANPA and gamlss, but it probably does not fully reflect
>>> the GenABEL reality. The dependency between GenABEL and GenABEL.data
>>> is weak: GenABEL will mostly be used without GenABEL.data. So I
>>> support Yurii's idea of making GenABEL.data 'suggested' and including
>>> 'require(...'.
>> I agree with you that the dependence between GA and GA.data is rather
>> weak. On the other hand, why not keep GA.data in Depends? That gives the
>> same behaviour as before (install everything by default). Sounds
>> convenient to me.
>> With modern internet bandwidth the few MB of the data package are not a
>> problem.
>>
>>> About the dot: personally, I like GenABEL.data. From this name, it is
>>> clear that this package is a kind of 'subpackage' of GenABEL and not
>>> a standalone one.
>> Good point!
>>
>>
>> Best regards,
>>
>> Lennart.
>>
>>> best,
>>> Maksim
>>>
>>>> On 28/11/2013 18:24, L.C. Karssen wrote:
>>>>
>>>>> On 11/28/2013 12:12 PM, Yury Aulchenko wrote:
>>>>> I would think that GenABEL(.)data is "suggested" and then any
>>>>> examples using the data from this packages start with something like
>>>>>
>>>>> if (require("GenABEL(.)data") ...
>>>> This sounds like a good solution.
>>>>
>>>>> How do other packages which lean on data-packages solve this?
>>>>>
>>>>> As for the "dot" - I do not have any strong opinion - both options
>>>>> seem ok to me :)
>>>> Great :-). Then I propose (of course) to stick with the dot, also
>>>> because that's already used now.
>>>>
>>>>
>>>> Best,
>>>>
>>>> Lennart.
>>>>
>>>>
>>>>> best, Yurii
>>>>>
>>>>>
>>>>> On Nov 28, 2013, at 12:06 PM, Nicola Pirastu
>>>>> <nicola.pirastu at burlo.trieste.it> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I've been following this conversation with much interest although
>>>>>> I'm sorry I can't contribute much.
>>>>>>
>>>>>> I was just wondering, could GenABEL.data not simply be a dependency
>>>>>> of GenABEL? This way, installing GenABEL through install.packages
>>>>>> would also install GenABEL.data, without the user having to do it
>>>>>> separately.
>>>>>>
>>>>>> Best.
>>>>>>
>>>>>> Nicola
>>>>>>
>>>>>>
>>>>>> Dr. Nicola Pirastu PhD Research Fellow Medical Sciences,
>>>>>> Chirurgical and Health Department University of Trieste Medical
>>>>>> Genetics IRCCS Burlo Garofolo Via dell'Istria 65/1 34137 Italy tel.
>>>>>> +390403785539
>>>>>>
>>>>>> On 28 Nov 2013, at 11:59, "L.C. Karssen"
>>>>>> <lennart at karssen.org> wrote:
>>>>>>
>>>>>>> Hi Maksim,
>>>>>>>
>>>>>>> First of all, thanks for the good work!
>>>>>>>
>>>>>>>> On 11/27/2013 07:58 PM, Maksim Struchalin wrote:
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> I created a GenABEL.data package, into which I moved the
>>>>>>>> following data: GenABEL/data/*, inst/exdata/srgenos.dat and
>>>>>>>> inst/exdata/srphenos.dat. All the corresponding files have been
>>>>>>>> deleted from GenABEL. GenABEL.data also contains an R directory
>>>>>>>> with three files (ge03d2c.R, ge03d2ex.R and srdta.R). These
>>>>>>>> scripts do not go into the final distribution and are kept only
>>>>>>>> for possible future use. Only the GenABEL.data/data/* files go
>>>>>>>> into GenABEL.data_1.0.tar.gz after running "R CMD build
>>>>>>>> GenABEL.data". The directories "R" and "inst" are removed by
>>>>>>>> running GenABEL/data/clean.R during the "build" process. Maybe
>>>>>>>> this is not a good way to do it, but at least it is convenient
>>>>>>>> and has no effect on end users (please suggest a better way).
>>>>>>>>
>>>>>>>> The way GenABEL.data works now is not what we discussed below.
>>>>>>>> It is impossible to generate files during "R CMD INSTALL" and
>>>>>>>> undesirable during "R CMD build". The best option was simply to
>>>>>>>> move all the data from GenABEL to GenABEL.data (as the CRAN
>>>>>>>> people suggested). In this case, we can install GenABEL.data
>>>>>>>> without having GenABEL installed. After this, we install
>>>>>>>> GenABEL.
>>>>>>> This sounds very strange to me. Does the user first need to
>>>>>>> install the GenABEL.data package and then the 'main' GenABEL
>>>>>>> package? Or do I misunderstand you? What happens if the user
>>>>>>> installs them in a different order? I guess that shouldn't
>>>>>>> matter, right, as the package contains only data?
>>>>>>>
>>>>>>>> When we run library(GenABEL), it automatically attaches
>>>>>>>> GenABEL.data. Thus, the only change for users is that they now
>>>>>>>> need to install two packages (GenABEL.data and GenABEL).
>>>>>>> And GenABEL.data is only needed if they actually want to use the
>>>>>>> examples, right? Or do we simply put GenABEL.data in the list of
>>>>>>> required packages in the DESCRIPTION file?
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Lennart.
>>>>>>>
>>>>>>>> Now we have sizes of both packages much smaller: 469K for
>>>>>>>> GenABEL and 2.4M for GenABEL.data.
>>>>>>>>
>>>>>>>> It should work now, but if you experience some problems, let me
>>>>>>>> know.
>>>>>>>>
>>>>>>>> best, Maksim
>>>>>>>>
>>>>>>>>
>>>>>>>>> On 26/11/2013 20:48, L.C. Karssen wrote:
>>>>>>>>> Hi Maksim,
>>>>>>>>>
>>>>>>>>>> On 11/26/2013 12:11 PM, Maksim Struchalin wrote:
>>>>>>>>>> I am still working on compressing the GenABEL data. To
>>>>>>>>>> remind you: the idea is to compress the original data text
>>>>>>>>>> files and use them later to generate the RData files
>>>>>>>>>> (e.g. srdta).
>>>>>>>>>>
>>>>>>>>>> Yurii proposed to generate the RData files in the examples
>>>>>>>>>> that use them. I currently see only one way to implement
>>>>>>>>>> this idea: we replace the "data(srdta)" line in every file
>>>>>>>>>> where it is used with a function, e.g. "generate_srdt()",
>>>>>>>>>> which generates the srdta object. The same goes for the other
>>>>>>>>>> five *.RData files from GenABEL/data. If we go this way, we
>>>>>>>>>> have to change 71 files in the man directory and, in
>>>>>>>>>> addition, the GenABEL manual. Also, users will no longer be
>>>>>>>>>> able to load the srdta set (and others) by typing
>>>>>>>>>> "data(srdta)" on the command line (as they are used to) and
>>>>>>>>>> will have to know that the function generate_srdt() now
>>>>>>>>>> serves this purpose. This all sounds nasty :-).
>>>>>>>>> I'm not sure how many users actually type data(srdta), but I
>>>>>>>>> see your point.
>>>>>>>>>
>>>>>>>>>> Generating the data at package installation time is also a
>>>>>>>>>> bad idea, as Yurii noted below. Actually, it is impossible,
>>>>>>>>>> because generating the GenABEL data requires GenABEL
>>>>>>>>>> functions, which are not available at installation time
>>>>>>>>>> (they are available only after GenABEL is installed).
>>>>>>>>> Good point!
>>>>>>>>>
>>>>>>>>>> I see only one good solution now: move all the GenABEL data
>>>>>>>>>> to a new package, e.g. GenABELdata, as the CRAN people
>>>>>>>>>> proposed from the beginning. In this case, it is possible
>>>>>>>>>> to generate the RData files at installation time using
>>>>>>>>>> GenABEL functions (which are installed by that time). I think
>>>>>>>>>> this solution is platform independent, because the R rules
>>>>>>>>>> permit running *.R scripts to generate data at installation
>>>>>>>>>> time.
>>>>>>>>>>
>>>>>>>>>> What do you think about making a data package for GenABEL?
>>>>>>>>>> Do you think the name GenABELdata is ok? Or maybe we could
>>>>>>>>>> move all the *ABEL data into the DatABEL package instead of
>>>>>>>>>> making *ABELdata data packages?
>>>>>>>>> Sounds like this is the best solution. Thanks for digging
>>>>>>>>> into this. As for the package name, either GenABELdata or
>>>>>>>>> GenABEL.data sounds fine to me (the latter being a bit
>>>>>>>>> clearer in my opinion).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>>
>>>>>>>>> Lennart
>>>>>>>>>
>>>>>>>>>> best, Maksim
>>>>>>>>>>
>>>>>>>>>>> On 18/11/2013 18:54, Yury Aulchenko wrote:
>>>>>>>>>>> On Nov 15, 2013, at 17:21 PM, L.C. Karssen
>>>>>>>>>>> <lennart at karssen.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Maksim,
>>>>>>>>>>>>
>>>>>>>>>>>>> On 14-11-13 22:38, Maksim Struchalin wrote:
>>>>>>>>>>>>> In this email, I propose a new approach that reduces the
>>>>>>>>>>>>> total size of the data from 8Mb to 2Mb, which reduces
>>>>>>>>>>>>> the entire GenABEL size from 12Mb to 6Mb.
>>>>>>>>>>>> I guess you mean B (bytes) instead of b (bits) here
>>>>>>>>>>>> :-).
>>>>>>>>>>>>
>>>>>>>>>>>>> "R CMD check --as-cran" reports that the following
>>>>>>>>>>>>> sub-directories are too large: data (2.3Mb),
>>>>>>>>>>>>> exdata (5.7Mb) and libs (2.6Mb). After the last
>>>>>>>>>>>>> GenABEL submission to CRAN, the maintainers suggested
>>>>>>>>>>>>> creating a new package called GenABELdata and moving
>>>>>>>>>>>>> all the data there. I went through the data and found
>>>>>>>>>>>>> that:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1) The "exdata" directory can be compressed with gzip
>>>>>>>>>>>>> and reduced from 5.8Mb -> 1.1Mb.
>>>>>>>>>>>>> - There is a function gunzip() in the R.utils package
>>>>>>>>>>>>> which can decompress the files. It works on any OS.
>>>>>>>>>>>>> - Moreover, the native R function read.table() can read
>>>>>>>>>>>>> gzip files without prior decompression.
>>>>>>>>>>>>> - Even more: it looks like the biggest file,
>>>>>>>>>>>>> "srgenos.dat", was used only once, a long time ago, to
>>>>>>>>>>>>> generate "srdta.RData", and now it is just sitting
>>>>>>>>>>>>> there and eating space needlessly.
>>>>>>>>>>>> Sounds like a waste of space!
>>>>>>>>>>>>
>>>>>>>>>>>>> 2) We can delete some files from the "data"
>>>>>>>>>>>>> directory. The deleted files will be generated on the
>>>>>>>>>>>>> user's computer from the files in exdata. This can
>>>>>>>>>>>>> be done during INSTALLATION (a line in a Makefile?) or
>>>>>>>>>>>>> on the first load (by running the function .onAttach()
>>>>>>>>>>>>> in R/zzz.R).
>>>>>>>>>>>> This sounds like a perfectly acceptable option.
>>>>>>>>>>> I suggest this is done in the "example" which makes use of
>>>>>>>>>>> this data, NOT in the INSTALL etc. - we should make
>>>>>>>>>>> things as "robust" as possible and interfere as little as
>>>>>>>>>>> possible with the usual workflow (which is very much
>>>>>>>>>>> system-specific, in that we will need to test on all
>>>>>>>>>>> platforms).
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>> This will reduce the total size of the "data" directory
>>>>>>>>>>>>> from 2.3Mb to 800Kb.
>>>>>>>>>>>> Fantastic! If no one has other objections I say: go
>>>>>>>>>>>> ahead.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>>
>>>>>>>>>>>> Lennart.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> Any objections/suggestions?
>>>>>>>>>>>>>
>>>>>>>>>>>>> best, Maksim
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>> genabel-devel mailing list
>>>>>>>>>>>>> genabel-devel at lists.r-forge.r-project.org
>>>>>>>>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/genabel-devel
>>>>>>>>>>>> --
>>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>>> L.C. Karssen
>>>>>>>>>>>> Utrecht The Netherlands
>>>>>>>>>>>>
>>>>>>>>>>>> lennart at karssen.org http://blog.karssen.org
>>>>>>>>>>>>
>>>>>>>>>>>> Please do not send me Word or PowerPoint files! See
>>>>>>>>>>>> http://www.gnu.org/philosophy/no-word-attachments.nl.html
>>>>>>>>>>>> -----------------------------------------------------------------