[GenABEL-dev] new approach for data storage in GenABEL package

L.C. Karssen lennart at karssen.org
Thu Nov 28 11:59:46 CET 2013


Dear all,

On 11/27/2013 08:45 PM, Maksim Struchalin wrote:
> About package names: There are 5056 packages on CRAN. 42 of them are
> data packages and 6 of them have names like packagename.data
> ("cluster.datasets", "data.table", "gamlss.data", "g.data",
> "survJamda.data" and "TH.data"). Thus, GenABELdata would be more in line
> with the majority than GenABEL.data.

I proposed GenABEL.data because I saw it appear multiple times on CRAN
(although I didn't do the counting Maksim did). I still think having the
dot in the name is better because it improves the readability of the
package name: it gives a better visual cue as to what the package contains.
The fact that most of the other data packages don't have a dot doesn't
carry much weight in my opinion. I think the improvement in readability
of the package name (with dot) outweighs the 'conform to the majority'
argument.
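
For anyone who wants to redo the count, something along these lines
should work. It is only a sketch: the function name is made up, and I am
guessing that the criterion was "a dot plus the string 'data' somewhere in
the name", which at least matches the six names Maksim listed.

```r
# Hypothetical helper: return the package names that contain both a dot
# and the string "data". The matching rule is an assumption, not
# necessarily the one used for the original count.
find_dotted_data_names <- function(pkg_names) {
  has_dot  <- grepl(".", pkg_names, fixed = TRUE)
  has_data <- grepl("data", pkg_names, fixed = TRUE)
  pkg_names[has_dot & has_data]
}

# Offline example: the six names from the mail plus two unrelated ones.
example_names <- c("cluster.datasets", "data.table", "gamlss.data", "g.data",
                   "survJamda.data", "TH.data", "GenABEL", "MASS")
dotted <- find_dotted_data_names(example_names)
print(dotted)

# Against a live CRAN mirror (needs network access, so left commented out):
# pkg_names <- rownames(available.packages(repos = "https://cloud.r-project.org"))
# length(find_dotted_data_names(pkg_names))
```

Run against a current mirror, the numbers will of course differ from the
ones quoted above.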

> 
> About submission to CRAN: I still see small warnings in the "R CMD check"
> output and at least one FAILURE in test.polylik from the RUnit tests. I
> think we can make the next submission to CRAN by the end of next week:
> one week for testing our new data package + fixing small errors.
> 
> p.s. We should submit GenABEL.data to CRAN as well.


I agree with Yurii that a jump in minor version number is warranted.
Let's go for 1.8-0!


Best,

Lennart.

> 
> best,
> Maksim
> 
> 
> 
> On 28/11/2013 02:07, Yury Aulchenko wrote:
>> Wow, very impressive, Maksim!
>>
>> Can you please check whether GenABEL.data complies with the naming
>> conventions? (I do not recall seeing names with dots as package
>> names; what do other data packages use as names?)
>>
>> If the naming is OK, do you think we are close to submitting to CRAN?
>> With so many changes, I think we should "jump" on the version number
>> (say, to 1.8-0?)
>>
>> best,
>> Yurii
>>
>> On Nov 27, 2013, at 19:58 PM, Maksim Struchalin
>> <m.v.struchalin at mail.ru> wrote:
>>
>>> Hi All,
>>>
>>> I created a GenABEL.data package and moved the following data into it:
>>> GenABEL/data/* , inst/exdata/srgenos.dat and
>>> inst/exdata/srphenos.dat. All the corresponding files have been
>>> deleted from GenABEL.
>>> GenABEL.data also contains an R directory with three files (ge03d2c.R,
>>> ge03d2ex.R and srdta.R). These scripts do not go into the final
>>> distribution and are kept only for possible future use.
>>> Only the GenABEL.data/data/* files go into GenABEL.data_1.0.tar.gz
>>> after running "R CMD build GenABEL.data". The directories "R" and
>>> "inst" are removed by running GenABEL/data/clean.R during the "build"
>>> process. Maybe it is not a good idea to do it this way but, at least,
>>> it is convenient and has no effect on end users (please suggest a
>>> better way).
>>>
>>> The way GenABEL.data works now is not how we discussed it
>>> below. It is impossible to generate the files during "R CMD INSTALL"
>>> and undesirable during "R CMD build". The best option was simply to
>>> move all the data from GenABEL to GenABEL.data (as the CRAN people
>>> suggested). In this case, we can install GenABEL.data without having
>>> GenABEL installed. After this, we install GenABEL. When we run
>>> library(GenABEL), it automatically attaches GenABEL.data. Thus, the
>>> only change for users is that they now need to install two packages
>>> (GenABEL.data and GenABEL).
>>>
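
Inline note on the attach behaviour described above: I assume it works
through the Depends field of GenABEL's DESCRIPTION file, since packages
listed there are attached when library() is called. The mail does not say
so explicitly, and the field values below are invented for illustration.
A rough sketch of how one could check for such a dependency:

```r
# Hypothetical DESCRIPTION fragment; the version numbers and the use of
# Depends are assumptions for illustration only.
desc_lines <- c(
  "Package: GenABEL",
  "Version: 1.8-0",
  "Depends: R (>= 2.15.0), GenABEL.data"
)
desc_file <- tempfile("DESCRIPTION")
writeLines(desc_lines, desc_file)

# read.dcf() parses the "Field: value" format used by DESCRIPTION files.
desc <- read.dcf(desc_file)

# Split the comma-separated Depends field and strip surrounding whitespace.
depends <- trimws(strsplit(desc[1, "Depends"], ",")[[1]])
attaches_data <- "GenABEL.data" %in% depends
print(attaches_data)
```

With Imports or Suggests instead of Depends, the data package would be
installed or recommended but not attached, so data(srdta) would need an
explicit package argument.
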
>>> Both packages are now much smaller: 469K for GenABEL and
>>> 2.4M for GenABEL.data.
>>>
>>> It should work now, but if you experience some problems, let me know.
>>>
>>> best,
>>> Maksim
>>>
>>>
>>> On 26/11/2013 20:48, L.C. Karssen wrote:
>>>> Hi Maksim,
>>>>
>>>> On 11/26/2013 12:11 PM, Maksim Struchalin wrote:
>>>>> I am still working on compressing the GenABEL data.
>>>>> To remind you: the idea is to compress the original data text
>>>>> files and use them later for generating the RData files (e.g. srdta).
>>>>>
>>>>> Yurii proposed generating the RData files in the examples which use
>>>>> them. I see only one way to implement this idea. We replace the
>>>>> "data(srdta)" line in every file where it is used by a function, e.g.
>>>>> "generate_srdt()", which generates the srdta object. The same
>>>>> procedure applies to the other five *.RData files from GenABEL/data.
>>>>> If we go this way, we have to change 71 files in the man directory
>>>>> and, in addition, the GenABEL manual. Also, users will no longer be
>>>>> able to load the srdta set (and the others) by typing "data(srdta)"
>>>>> on the command line (as they are used to) and will have to know that
>>>>> the function generate_srdt() now serves this purpose.
>>>>> This all sounds nasty :-).
>>>> I'm not sure how many users actually type data(srdta), but I see your point.
>>>>
>>>>> Generating the data at package installation time is also a bad idea, as
>>>>> Yurii noted below. Actually, it is impossible, because the process of
>>>>> making the GenABEL data requires GenABEL functions, which are not
>>>>> available at installation time (they are available only after GenABEL
>>>>> is installed).
>>>> Good point!
>>>>
>>>>> I see only one good solution now: move all the GenABEL data to a new
>>>>> package, e.g. GenABELdata, as was proposed by the CRAN people from the
>>>>> beginning. In this case, it is possible to generate the RData files
>>>>> during installation using GenABEL functions (which are installed by
>>>>> that time). I think this solution is platform independent because the
>>>>> R rules permit running *.R scripts to generate data during installation.
>>>>>
>>>>> What do you think about making a data package for GenABEL? Do you think
>>>>> the name GenABELdata is OK? Maybe we can move all the *ABEL data into
>>>>> the DatABEL package instead of making *ABELdata data packages?
>>>> Sounds like this is the best solution. Thanks for digging into this. As
>>>> for the package name, either GenABELdata or GenABEL.data sounds fine
>>>> to me (the latter being a bit clearer in my opinion).
>>>>
>>>>
>>>> Best,
>>>>
>>>> Lennart
>>>>
>>>>> best,
>>>>> Maksim
>>>>>
>>>>> On 18/11/2013 18:54, Yury Aulchenko wrote:
>>>>>> On Nov 15, 2013, at 17:21 PM, L.C. Karssen <lennart at karssen.org> wrote:
>>>>>>
>>>>>>> Hi Maksim,
>>>>>>>
>>>>>>> On 14-11-13 22:38, Maksim Struchalin wrote:
>>>>>>>> In this email, I propose a new approach which reduces the total
>>>>>>>> size of the data from 8Mb to 2Mb, which in turn reduces the entire
>>>>>>>> GenABEL size from 12Mb to 6Mb.
>>>>>>> I guess you mean B (bytes) instead of b (bits) here :-).
>>>>>>>
>>>>>>>> "R CMD check --as-cran" reports that the following sub-directories are
>>>>>>>> too big: data (2.3Mb), exdata (5.7Mb) and libs (2.6Mb). After the
>>>>>>>> last GenABEL submission to CRAN, the maintainers suggested creating a
>>>>>>>> new package called GenABELdata and moving all the data there. I went
>>>>>>>> through the data and found that:
>>>>>>>> 1) The "exdata" directory can be compressed with gzip, reducing it
>>>>>>>> from 5.8Mb to 1.1Mb.
>>>>>>>>     - There is a function gunzip() in the R.utils package which can
>>>>>>>> decompress the files. It works on any OS.
>>>>>>>>     - Moreover, the native R function read.table() can read gzip files
>>>>>>>> without decompressing them first.
>>>>>>>>     - Even more: it looks like the biggest file, "srgenos.dat", was
>>>>>>>> used only once, a long time ago, for generating "srdta.RData", and
>>>>>>>> now it is just sitting there eating space needlessly.
>>>>>>> Sounds like a waste of space!
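
Inline note on the read.table() point above: a quick way to convince
yourself that it reads gzip files directly. This is a toy sketch; the
column names and values are made up and do not reflect the real layout
of srphenos.dat.

```r
# Toy data standing in for a phenotype file (contents are invented).
toy <- data.frame(id  = c("p1", "p2", "p3"),
                  sex = c(1, 0, 1),
                  age = c(43.4, 48.2, 37.9))

# Write the table through a gzip-compressing connection.
gz_path <- tempfile(fileext = ".dat.gz")
con <- gzfile(gz_path, open = "wt")
write.table(toy, con, quote = FALSE, row.names = FALSE)
close(con)

# read.table() opens the path with file(), which detects gzip compression,
# so no explicit gunzip step is needed.
toy_back <- read.table(gz_path, header = TRUE, stringsAsFactors = FALSE)
print(toy_back)
```

So compressed copies of the exdata files could be read as they are,
without the R.utils dependency.
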
>>>>>>>
>>>>>>>> 2) We can delete some files from the "data" directory. The deleted
>>>>>>>> files will be generated on the user's computer from the files in
>>>>>>>> exdata. This can be done during INSTALLATION (a line in a Makefile?)
>>>>>>>> or on the first load (by running the function .onAttach() in
>>>>>>>> R/zzz.R).
>>>>>>> This sounds like a perfectly acceptable option.
>>>>>> I suggest this is done in the "examples" which make use of this data,
>>>>>> NOT in INSTALL etc. - we should make things as "robust" as
>>>>>> possible and interfere as little as possible with the usual workflow
>>>>>> (which is very much system-specific, so we would need to test
>>>>>> on all platforms)
>>>>>>
>>>>>>
>>>>>>>> It will reduce
>>>>>>>> total size of "data" directory from 2.3Mb to 800Kb.
>>>>>>> Fantastic! If no one has other objections I say: go ahead.
>>>>>>>
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Lennart.
>>>>>>>
>>>>>>>
>>>>>>>> Any objections/suggestions?
>>>>>>>>
>>>>>>>> best,
>>>>>>>> Maksim
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> genabel-devel mailing list
>>>>>>>> genabel-devel at lists.r-forge.r-project.org
>>>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/genabel-devel
>>>>>>>>
>>>>>>>>
>>>>>>> -- 
>>>>>>> -----------------------------------------------------------------
>>>>>>> L.C. Karssen
>>>>>>> Utrecht
>>>>>>> The Netherlands
>>>>>>>
>>>>>>> lennart at karssen.org
>>>>>>> http://blog.karssen.org
>>>>>>>
>>>>>>> Please do not send me Word or PowerPoint files!
>>>>>>> See http://www.gnu.org/philosophy/no-word-attachments.nl.html
>>>>>>> ------------------------------------------------------------------

-- 
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
L.C. Karssen
Utrecht
The Netherlands

lennart at karssen.org
http://blog.karssen.org
GPG key ID: A88F554A
-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-
