[GenABEL-dev] new approach for data storage in GenABEL package
L.C. Karssen
lennart at karssen.org
Fri Nov 29 12:16:49 CET 2013
Hi Maksim,
Good that you raise this again. I've been thinking about it a bit longer.
What is the point of separating the data into a separate package if the
user still downloads it automatically ('depends')? The idea behind the
data package is of course to download only what is necessary (even if a
few MB is not very much). So that would point to using 'suggests'.
When I worked in Africa (very limited bandwidth) it was actually really
good to have this kind of 'suggests', because then you download only
what is really necessary.
Something I haven't tested: what happens if we use 'suggests' and the
user wants to run an example (and GA.data is not installed)? Will (s)he
get an error/warning message? I guess so if each example in GenABEL has
a 'require(GenABEL.data)' line at the start. If the message to the user
is very clear then "suggests" is fine. Otherwise I would go with the old
behavior (have everything installed): 'depends'.
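To illustrate the guard discussed above, here is a minimal sketch of how an example could start if GenABEL.data is only listed under 'Suggests'. The helper name load_example_data() is hypothetical and not part of GenABEL. It uses requireNamespace() instead of require() so that a missing package gives the user an explicit message. This is untested against the real packages.

```r
# Hypothetical helper (not part of GenABEL): load a dataset from an optional
# ('Suggests') data package, or tell the user clearly why the example is skipped.
load_example_data <- function(dataset, package = "GenABEL.data") {
  if (!requireNamespace(package, quietly = TRUE)) {
    message("Package '", package, "' is not installed; skipping this example.\n",
            "Install it with install.packages(\"", package, "\") to run it.")
    return(invisible(FALSE))
  }
  # Load the dataset into the caller's environment, as data(srdta) would.
  utils::data(list = dataset, package = package, envir = parent.frame())
  invisible(TRUE)
}

# An example would then start like this:
if (load_example_data("srdta")) {
  message("srdta loaded; the example code would run here")
}
```

With a guard like this the example is skipped with a readable message when the data package is absent, instead of failing halfway through.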
Lennart.
On 11/29/2013 11:42 AM, Maksim Struchalin wrote:
> Hi Yurii & Lennart,
>
> Yesterday, you supported the idea of making GenABEL.data 'suggested':
>
> ________________________________________________________________
> On 28/11/2013 18:24, L.C. Karssen wrote:
>
>> On 11/28/2013 12:12 PM, Yury Aulchenko wrote:
>> I would think that GenABEL(.)data is "suggested" and then any
>> examples using the data from this packages start with something like
>>
>> if (require("GenABEL(.)data") ...
>
> This sounds like a good solution.
> ________________________________________________________________
>
>
> Today you propose to make it 'depends', or do I misunderstand something here?
>
> About how other people do it: I looked at the GANPAdata and gamlss.data
> packages. GANPA and gamlss 'depend' on them (see my message below).
>
> best,
> Maksim
>
>
> On 29/11/2013 16:43, Yurii Aulchenko wrote:
>> Lennart,
>>
>> Good point about "depends"!
>>
>> Again, my question would be how other people do it?
>>
>> Y
>>
>> ----------------------
>> Yurii Aulchenko
>> (sent from mobile device)
>>
>>> On Nov 29, 2013, at 10:36, "L.C. Karssen" <lennart at karssen.org> wrote:
>>>
>>> Hi Maksim,
>>>
>>>> On 11/29/2013 08:43 AM, Maksim Struchalin wrote:
>>>> I looked at how other developers deal with the issue of dependency
>>>> between a package and its data package. I checked out two random
>>>> packages from CRAN: GANPA (GANPAdata) and gamlss (gamlss.data). Both
>>>> of them (GANPA and gamlss) depend on their data packages; that is,
>>>> their DESCRIPTION files contain a reference to their data packages in
>>>> the "Depends:" field. Only GANPAdata suggests GANPA (gamlss.data does
>>>> not Depend on or Suggest gamlss).
>>>>
>>>> When I made GenABEL depend on GenABEL.data, I had in mind the same
>>>> idea that Nicola expressed below: in this case, GenABEL.data is
>>>> installed automatically when users run install.packages("GenABEL").
>>>> This is convenient for users who install GenABEL from CRAN and it is
>>>> in line with GANPA and gamlss, but it probably does not fully reflect
>>>> the GenABEL reality. The dependency between GenABEL and GenABEL.data is
>>>> weak: GenABEL will mostly be used without GenABEL.data. So I support
>>>> Yurii's idea of making GenABEL.data 'suggested' and including
>>>> 'require(...'.
>>> I agree with you that the dependence between GA and GA.data is rather
>>> weak. On the other hand, why not keep GA.data in Depends? That gives the
>>> same behaviour as before (install everything by default). Sounds
>>> convenient to me.
>>> With modern internet bandwidth the few MB of the data package are not a
>>> problem.
>>>
>>>> About the dot: personally, I like GenABEL.data. From this name it is
>>>> clear that this package is a kind of 'subpackage' of the GenABEL
>>>> package and not a standalone one.
>>> Good point!
>>>
>>>
>>> Best regards,
>>>
>>> Lennart.
>>>
>>>> best,
>>>> Maksim
>>>>
>>>>> On 28/11/2013 18:24, L.C. Karssen wrote:
>>>>>
>>>>>> On 11/28/2013 12:12 PM, Yury Aulchenko wrote:
>>>>>> I would think that GenABEL(.)data is "suggested" and then any
>>>>>> examples using the data from this packages start with something like
>>>>>>
>>>>>> if (require("GenABEL(.)data") ...
>>>>> This sounds like a good solution.
>>>>>
>>>>>> How do other packages which lean on data-packages solve this?
>>>>>>
>>>>>> As for the "dot" - I do not have any strong opinion - both options
>>>>>> seem ok to me :)
>>>>> Great :-). Then I propose (of course) to stick with the dot, also
>>>>> because that's already used now.
>>>>>
>>>>>
>>>>> Best,
>>>>>
>>>>> Lennart.
>>>>>
>>>>>
>>>>>> best, Yurii
>>>>>>
>>>>>>
>>>>>> On Nov 28, 2013, at 12:06 PM, Nicola Pirastu
>>>>>> <nicola.pirastu at burlo.trieste.it> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I've been following this conversation with much interest although
>>>>>>> I'm sorry I can't contribute much.
>>>>>>>
>>>>>>> I was just wondering, could GenABEL.data not simply be a dependency
>>>>>>> of GenABEL? This way installing GenABEL through install.packages
>>>>>>> would also install GenABEL.data without the user actually having
>>>>>>> to do it himself.
>>>>>>>
>>>>>>> Best.
>>>>>>>
>>>>>>> Nicola
>>>>>>>
>>>>>>>
>>>>>>> Dr. Nicola Pirastu PhD Research Fellow Medical Sciences,
>>>>>>> Chirurgical and Health Department University of Trieste Medical
>>>>>>> Genetics IRCCS Burlo Garofolo Via dell'Istria 65/1 34137 Italy tel.
>>>>>>> +390403785539
>>>>>>>
>>>>>>> On 28 Nov 2013, at 11:59, "L.C. Karssen"
>>>>>>> <lennart at karssen.org> wrote:
>>>>>>>
>>>>>>>> Hi Maksim,
>>>>>>>>
>>>>>>>> First of all, thanks for the good work!
>>>>>>>>
>>>>>>>>> On 11/27/2013 07:58 PM, Maksim Struchalin wrote:
>>>>>>>>> Hi All,
>>>>>>>>>
>>>>>>>>> I created a GenABEL.data package into which I moved the following
>>>>>>>>> data: GenABEL/data/*, inst/exdata/srgenos.dat and
>>>>>>>>> inst/exdata/srphenos.dat. All the corresponding files have been
>>>>>>>>> deleted from GenABEL. GenABEL.data also contains an R directory
>>>>>>>>> with three files (ge03d2c.R, ge03d2ex.R and srdta.R). These
>>>>>>>>> scripts do not go into the final distribution and are kept only
>>>>>>>>> for possible future use. Only the GenABEL.data/data/* files go
>>>>>>>>> into GenABEL.data_1.0.tar.gz after running "R CMD build
>>>>>>>>> GenABEL.data". The directories "R" and "inst" are removed by
>>>>>>>>> running GenABEL/data/clean.R in the "build" process. Maybe it is
>>>>>>>>> not a good idea to do it this way but, at least, it is
>>>>>>>>> convenient and has no effect on end users (please suggest a
>>>>>>>>> better way).
>>>>>>>>>
>>>>>>>>> The way GenABEL.data works now is not how we discussed it
>>>>>>>>> below. It is impossible to generate files during "R CMD
>>>>>>>>> INSTALL" and undesirable during "R CMD build". The best option
>>>>>>>>> was just to move all the data from GenABEL to GenABEL.data
>>>>>>>>> (as the CRAN people suggested). In this case, we can install
>>>>>>>>> GenABEL.data without having GenABEL installed. After this, we
>>>>>>>>> install GenABEL.
>>>>>>>> This sounds very strange to me. Does the user first need to
>>>>>>>> install the GenABEL.data package and then the 'main' GenABEL
>>>>>>>> package? Or do I misunderstand you? What happens if the user
>>>>>>>> installs them in a different order? I guess that shouldn't
>>>>>>>> matter, right, as the package contains only data?
>>>>>>>>
>>>>>>>>> When we run library(GenABEL), it automatically attaches
>>>>>>>>> GenABEL.data. Thus, the only change for users is that they need
>>>>>>>>> to install two packages now (GenABEL.data and GenABEL).
>>>>>>>> And GenABEL.data is only needed if they actually want to use the
>>>>>>>> examples, right? Or do we simply put GenABEL.data in the list of
>>>>>>>> required packages in the DESCRIPTION file?
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Lennart.
>>>>>>>>
>>>>>>>>> Now we have sizes of both packages much smaller: 469K for
>>>>>>>>> GenABEL and 2.4M for GenABEL.data.
>>>>>>>>>
>>>>>>>>> It should work now, but if you experience some problems, let me
>>>>>>>>> know.
>>>>>>>>>
>>>>>>>>> best, Maksim
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> On 26/11/2013 20:48, L.C. Karssen wrote:
>>>>>>>>>> Hi Maksim,
>>>>>>>>>>
>>>>>>>>>>> On 11/26/2013 12:11 PM, Maksim Struchalin wrote:
>>>>>>>>>>> I am still working on compressing the GenABEL data. To
>>>>>>>>>>> remind you: the idea consists of compressing the original
>>>>>>>>>>> data text files and using them later to generate RData
>>>>>>>>>>> files (e.g. srdta).
>>>>>>>>>>>
>>>>>>>>>>> Yurii proposed to make the RData files in the examples which
>>>>>>>>>>> use them. I now see only one way this idea can be
>>>>>>>>>>> implemented. We replace the "data(srdta)" line in every file
>>>>>>>>>>> where it is used by a function, e.g. "generate_srdt()", which
>>>>>>>>>>> generates the srdta object. The same goes for the other five
>>>>>>>>>>> *.RData files from GenABEL/data. If we go this way, we
>>>>>>>>>>> have to change 71 files in the man directory and, in addition,
>>>>>>>>>>> the GenABEL manual. Also, users will not be able
>>>>>>>>>>> to load the srdta set (and others) by typing "data(srdta)"
>>>>>>>>>>> on the command line (as they are used to) and will have to know
>>>>>>>>>>> that the function generate_srdt() now serves this
>>>>>>>>>>> purpose. This all sounds nasty :-).
>>>>>>>>>> I'm not sure how many users actually type data(srdta), but I
>>>>>>>>>> see your point.
>>>>>>>>>>
>>>>>>>>>>> Making the data at package installation time is also a
>>>>>>>>>>> bad idea, as Yurii noted below. Actually, it is impossible
>>>>>>>>>>> because the process of making the GenABEL data requires GenABEL
>>>>>>>>>>> functions which are not available at installation time
>>>>>>>>>>> (they are available only after GenABEL is installed).
>>>>>>>>>> Good point!
>>>>>>>>>>
>>>>>>>>>>> I see only one good solution now: move all the GenABEL data
>>>>>>>>>>> to a new package, e.g. GenABELdata, as was proposed by the
>>>>>>>>>>> CRAN people from the beginning. In this case, it is possible
>>>>>>>>>>> to generate RData at installation time using GenABEL
>>>>>>>>>>> functions (which are installed by that time). I think this
>>>>>>>>>>> solution is platform independent because R rules permit
>>>>>>>>>>> running *.R scripts to generate data at installation
>>>>>>>>>>> time.
>>>>>>>>>>>
>>>>>>>>>>> What do you think about making a data package for GenABEL?
>>>>>>>>>>> Do you think the name GenABELdata is ok? Maybe we can move
>>>>>>>>>>> all the *ABEL data into the DatABEL package instead of making
>>>>>>>>>>> *ABELdata data packages?
>>>>>>>>>> Sounds like this is the best solution. Thanks for digging
>>>>>>>>>> into this. As for the package name, either GenABELdata or
>>>>>>>>>> GenABEL.data sounds fine to me (the latter being a bit
>>>>>>>>>> clearer in my opinion).
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>>
>>>>>>>>>> Lennart
>>>>>>>>>>
>>>>>>>>>>> best, Maksim
>>>>>>>>>>>
>>>>>>>>>>>> On 18/11/2013 18:54, Yury Aulchenko wrote:
>>>>>>>>>>>> On Nov 15, 2013, at 17:21 PM, L.C. Karssen
>>>>>>>>>>>> <lennart at karssen.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Maksim,
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 14-11-13 22:38, Maksim Struchalin wrote:
>>>>>>>>>>>>>> In this email, I propose a new approach which reduces
>>>>>>>>>>>>>> the total size of the data from 8Mb to 2Mb, which in turn
>>>>>>>>>>>>>> reduces the entire GenABEL size from 12Mb to 6Mb.
>>>>>>>>>>>>> I guess you mean B (bytes) instead of b (bits) here
>>>>>>>>>>>>> :-).
>>>>>>>>>>>>>
>>>>>>>>>>>>>> "R CMD check --as-cran" reports that the following
>>>>>>>>>>>>>> sub-directories are too big: data (2.3Mb),
>>>>>>>>>>>>>> exdata (5.7Mb) and libs (2.6Mb). After the last
>>>>>>>>>>>>>> GenABEL submission to CRAN, the maintainers suggested
>>>>>>>>>>>>>> creating a new package called GenABELdata and moving
>>>>>>>>>>>>>> all the data there. I went through the data and found
>>>>>>>>>>>>>> that:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1) The "exdata" directory can be compressed with gzip
>>>>>>>>>>>>>> and reduced from 5.8Mb -> 1.1Mb.
>>>>>>>>>>>>>> - There is a function gunzip() in the R.utils package
>>>>>>>>>>>>>> which can decompress the files. It works on any OS.
>>>>>>>>>>>>>> - Moreover: the native R function read.table() can read
>>>>>>>>>>>>>> gzip files without a separate decompression step.
>>>>>>>>>>>>>> - Even more: it looks like the biggest file,
>>>>>>>>>>>>>> "srgenos.dat", was used only once, a long time ago, for
>>>>>>>>>>>>>> generating "srdta.RData", and now it is just sitting
>>>>>>>>>>>>>> there eating space needlessly.
>>>>>>>>>>>>> Sounds like a waste of space!
>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2) We can delete some files from the "data"
>>>>>>>>>>>>>> directory. The deleted files will be generated on the
>>>>>>>>>>>>>> user's computer from the files in exdata. This can
>>>>>>>>>>>>>> be done during INSTALLATION (a line in a Makefile?) or
>>>>>>>>>>>>>> on first load (by running the function .onAttach()
>>>>>>>>>>>>>> in R/zzz.R).
>>>>>>>>>>>>> This sounds like a perfectly acceptable option.
>>>>>>>>>>>> I suggest this is done in the "examples" which make use of
>>>>>>>>>>>> this data, NOT in the INSTALL etc. - we should make
>>>>>>>>>>>> things as "robust" as possible and interfere as little as
>>>>>>>>>>>> possible with the usual workflow (which is very much
>>>>>>>>>>>> system-specific, in that we would need to test on all
>>>>>>>>>>>> platforms)
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>> It will reduce total size of "data" directory from
>>>>>>>>>>>>>> 2.3Mb to 800Kb.
>>>>>>>>>>>>> Fantastic! If no one has other objections I say: go
>>>>>>>>>>>>> ahead.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Lennart.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Any objections/suggestions?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> best, Maksim
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>> genabel-devel mailing list
>>>>>>>>>>>>>> genabel-devel at lists.r-forge.r-project.org
>>>>>>>>>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/genabel-devel
>>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>>>> L.C. Karssen
>>>>>>>>>>>>> Utrecht The Netherlands
>>>>>>>>>>>>>
>>>>>>>>>>>>> lennart at karssen.org http://blog.karssen.org
>>>>>>>>>>>>>
>>>>>>>>>>>>> Please do not send me Word or PowerPoint files! See
>>>>>>>>>>>>> http://www.gnu.org/philosophy/no-word-attachments.nl.html
>>>>>>>>>>>>> ------------------------------------------------------------------
--
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
L.C. Karssen
Utrecht
The Netherlands
lennart at karssen.org
http://blog.karssen.org
GPG key ID: A88F554A
-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-