[datatable-help] Variable labels suggestion

Steve Lianoglou mailinglist.honeypot at gmail.com
Fri Jul 29 16:50:08 CEST 2011


It seems that Hadley also raised this issue as one of the things he
wishes R supported better. See his post on SO here:
http://stackoverflow.com/questions/6796490/tools-for-professional-r-developers

(he has some good tips on packages to check out there, too).

FWIW, I'm also of the mind that this is a bit out of data.table's
scope. For some reason I haven't run up against it much, so it hasn't
been a huge issue for me.

-steve

On Fri, Jul 29, 2011 at 10:25 AM, Joseph Voelkel <jgvcqa at rit.edu> wrote:
> This seems to be outside the scope of data.table. It is really a global R issue, and one that should be addressed at that level (for example, natural addition of these attributes to data frames (and of course data tables :) ), with easy usage in functions such as plot).
>
> -----Original Message-----
> From: datatable-help-bounces at r-forge.wu-wien.ac.at [mailto:datatable-help-bounces at r-forge.wu-wien.ac.at] On Behalf Of Bacou, Melanie
> Sent: Friday, July 29, 2011 12:16 AM
> To: 'Griffith Rees'; mdowle at mdowle.plus.com
> Cc: datatable-help at r-forge.wu-wien.ac.at
> Subject: Re: [datatable-help] Variable labels suggestion
>
> Griff, Matt,
>
> I agree that codebook support, or more generally support for maintaining meta-data, is very poor in R. I also use Hmisc and end up maintaining my codebook in separate files. Oftentimes I need to carry over not just variable labels, but also units, type, category, etc.
>
> I'm forced to use inefficient and wordy procedures, the likes of:
>
> ## Add variable labels and units from codebook file (usually some dump from STATA)
> ## (assumes codebook rows are in the same order as names(df))
> i <- 1
> for (x in names(df)) {
>  label(df[, x]) <- codebook[i, "varName"]
>  units(df[, x]) <- codebook[i, "varUnit"]
>  type(df[, x]) <- codebook[i, "varType"]
>  i <- i + 1
> }
>
> [...some variable recoding...]
>
> ## Save codebook to CSV
> codebook <- data.frame(names(df), label(df), sapply(df, units), sapply(df, type))
> names(codebook) <- c("varCode", "varName", "varUnit", "varType")
> write.csv(codebook, file="codebook.csv")
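
For comparison, a minimal sketch of the same round trip using only base R attributes, matching codebook rows to columns by name rather than by position. The helpers apply_codebook() and extract_codebook() are hypothetical (they are not part of Hmisc or data.table), and the toy data is made up for illustration.

```r
# Hypothetical helpers: store metadata as plain column attributes.
# Column names in `fields` follow the codebook layout quoted above.
apply_codebook <- function(df, codebook,
                           fields = c("varName", "varUnit", "varType")) {
  codes <- as.character(codebook$varCode)
  for (col in intersect(names(df), codes)) {
    row <- match(col, codes)            # look up the codebook row by name
    for (f in fields) {
      attr(df[[col]], f) <- as.character(codebook[row, f])
    }
  }
  df
}

# Rebuild a codebook data.frame from the attributes on each column.
extract_codebook <- function(df,
                             fields = c("varName", "varUnit", "varType")) {
  get_field <- function(f) {
    vapply(df, function(x) {
      v <- attr(x, f, exact = TRUE)
      if (is.null(v)) NA_character_ else as.character(v)
    }, character(1), USE.NAMES = FALSE)
  }
  out <- data.frame(varCode = names(df), stringsAsFactors = FALSE)
  for (f in fields) out[[f]] <- get_field(f)
  out
}

# Toy data (made up); note the codebook rows are not in column order.
df <- data.frame(hhid = 1:3, inc = c(10.5, 20, 7.25))
codebook <- data.frame(
  varCode = c("inc", "hhid"),
  varName = c("Household income", "Household ID"),
  varUnit = c("USD/month", NA),
  varType = c("continuous", "identifier"),
  stringsAsFactors = FALSE
)

df <- apply_codebook(df, codebook)
roundtrip <- extract_codebook(df)
```

The result of extract_codebook() could then be passed to write.csv(..., row.names = FALSE) as in the snippet above. Matching by name avoids the manual counter, at the cost of silently skipping columns that have no codebook entry.
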
>
> Any optimization for data.table that would facilitate read/write of meta-data would make a lot of sense.
>
> --Mel.
>
>
>
>
> -----Original Message-----
> From: datatable-help-bounces at r-forge.wu-wien.ac.at [mailto:datatable-help-bounces at r-forge.wu-wien.ac.at] On Behalf Of Griffith Rees
> Sent: Thursday, July 28, 2011 6:56 PM
> To: mdowle at mdowle.plus.com
> Cc: datatable-help at r-forge.wu-wien.ac.at
> Subject: Re: [datatable-help] Variable labels suggestion
>
> Indeed, making such labels useful is highly dependent on their
> ability to be used with functions like toLatex. I think the first step
> would be to provide a way of adding labels and then consider functions
> that could help use them in formatting contexts, but kind of leave the
> last mile up to users for the time being. If it catches on, people
> will start to write wrappers that do the extra work.
>
> For example: the mtable function, which is what I primarily use to
> format tables for latex, can be used with the relabel function (also
> from the memisc package) to replace variable names in tables (see the
> relabel example in:
> http://www.oga-lab.net/RGM2/func.php?rd_id=memisc:mtable). A method
> which returns those labels appropriately could be called directly when
> mtable is used. It's not the prettiest solution, but it's a start.
>
> Obviously there's a mindshare aspect to this: the more people who use
> data.table and find variable labels useful, the more likely they are
> to alter other packages to allow them to take advantage of those
> labels. The way to accrue that advantage is to make it simple but
> useful initially, and then wrappers can be added to make better use of
> it. Obviously, the prior art in the Hmisc package failed to garner
> enough mindshare for it to be used in other contexts, and data.table
> succeeds here by retaining interoperability with everything else.
>
> I know the first thing I would probably do: write a wrapper around
> read.dta which would read a stata file and return a data.table with
> the stata labels.
>
> Just an idea. Oh, and an optimized data.table save format as well, but
> that's icing ;)
>
> -griff
>
> On Thu, Jul 28, 2011 at 8:11 PM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>>
>> The toLatex aspect struck a chord. I sometimes embed the string 'PCT'
>> into the column name and then gsub("PCT","\%") just before output to
>> latex. Maybe a label would be more robust and could allow more complex
>> latex expressions in the column heading.  Long column names with spaces
>> are ok, but that may make it cumbersome to follow the advice to use
>> names not positions in j expressions.  But how would the latex output
>> command know to use the labels rather than the names? And would
>> data.table need to know about column labels to carry them through
>> subsets and joins etc?
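
On the question of carrying labels through subsets, a minimal sketch of what a plain data.frame does with a custom "label" attribute. This shows base R behaviour only; it makes no claim about what data.table does in subsets or joins. The helper restore_labels() is hypothetical and the data is made up.

```r
df <- data.frame(pct_change = c(1.5, -0.3, 2.1),
                 grp = c("a", "b", "a"),
                 stringsAsFactors = FALSE)
attr(df$pct_change, "label") <- "Change (\\%)"   # LaTeX-ready heading

# Selecting whole columns keeps each column object, attributes included.
by_col <- df["pct_change"]
attr(by_col$pct_change, "label")

# Subsetting rows rebuilds each column with `[`, which drops custom
# attributes on a plain vector, so the label is gone here.
by_row <- df[1:2, ]
is.null(attr(by_row$pct_change, "label"))

# One workaround: copy labels back from the source after subsetting.
restore_labels <- function(to, from) {
  for (col in intersect(names(to), names(from))) {
    attr(to[[col]], "label") <- attr(from[[col]], "label", exact = TRUE)
  }
  to
}
by_row <- restore_labels(by_row, df)
```

So without some cooperation from the container (or a classed column with its own `[` method), labels stored this way need to be re-applied after row operations.
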
>>
>> Matthew
>>
>>
>> On Thu, 2011-07-28 at 13:51 -0400, Chris Neff wrote:
>>> I think this is definitely out of the scope of data.table.
>>>
>>> On 28 July 2011 13:43, Tom Short <tshort.rlists at gmail.com> wrote:
>>>         On Thu, Jul 28, 2011 at 8:26 AM, Griffith Rees
>>>         <griffith.rees at sociology.ox.ac.uk> wrote:
>>>         > I think this page quite succinctly describes this issue:
>>>         > http://www.statmethods.net/input/variablelables.html
>>>
>>>
>>>         It would be easy to add to data.table. You could also add support
>>>         outside of data.table by writing label.data.table and similar
>>>         functions. Actually using the labels for useful things is more
>>>         difficult. I often find it useful just to use more verbose variable
>>>         names that include spaces as follows:
>>>
>>>         > dt <- data.table(`My first column` = 1:3,
>>>         +                  `A character column` = letters[1:3],
>>>         +                  check.names = FALSE)
>>>         > str(dt)
>>>         Classes 'data.table' and 'data.frame':  3 obs. of  2 variables:
>>>          $ My first column   : int  1 2 3
>>>          $ A character column: Factor w/ 3 levels "a","b","c": 1 2 3
>>>
>>>         That way, columns look better with automatic plotting and with
>>>         lattice or ggplot legends.
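
A middle ground between the two approaches, sketched with base R only: keep short names for code and swap in the verbose labels just before output. Both display_names() and with_display_names() are hypothetical helpers, and the "label" attribute name is an assumption, not an existing data.table feature.

```r
# Return the label of each column, falling back to the column name
# when no usable single-string label is set.
display_names <- function(df) {
  vapply(names(df), function(col) {
    lab <- attr(df[[col]], "label", exact = TRUE)
    if (length(lab) == 1L && !is.na(lab) && nzchar(lab)) {
      as.character(lab)
    } else {
      col
    }
  }, character(1), USE.NAMES = FALSE)
}

# Copy of df whose names are the display names; intended for the final
# print/plot/export step only.
with_display_names <- function(df) {
  names(df) <- display_names(df)
  df
}

df <- data.frame(x1 = 1:3, x2 = letters[1:3], stringsAsFactors = FALSE)
attr(df$x1, "label") <- "My first column"

out <- with_display_names(df)
names(out)
```

Code that computes on the data keeps using x1 and x2; only the object handed to the plotting or table-formatting function carries the long names.
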
>>>
>>>         - Tom
>>>
>>>         _______________________________________________
>>>         datatable-help mailing list
>>>         datatable-help at lists.r-forge.r-project.org
>>>         https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>>
>>>
>>
>>
>>
>
>
>
> --
> Griffith Rees
> Sociology DPhil Candidate
> Oxford University
> CABDyN Complexity Centre
> http://www.cabdyn.ox.ac.uk
>



-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

