[R-gregmisc-commits] r2157 - pkg/gdata/inst/doc

Tue Jun 6 19:18:49 CEST 2017

Author: warnes
Date: 2017-06-06 19:18:49 +0200 (Tue, 06 Jun 2017)
New Revision: 2157

Added:
   pkg/gdata/inst/doc/mapLevels.Rnw
   pkg/gdata/inst/doc/unknown.Rnw
Log:
Add vignette files

Added: pkg/gdata/inst/doc/mapLevels.Rnw
===================================================================

--- pkg/gdata/inst/doc/mapLevels.Rnw	                        (rev 0)
+++ pkg/gdata/inst/doc/mapLevels.Rnw	2017-06-06 17:18:49 UTC (rev 2157)
@@ -0,0 +1,230 @@
+
+%\VignetteIndexEntry{Mapping levels of a factor}
+%\VignettePackage{gdata}
+%\VignetteKeywords{levels, factor, manip}
+
+\documentclass[a4paper]{report}
+\usepackage{Rnews}
+\usepackage[round]{natbib}
+\bibliographystyle{abbrvnat}
+
+\usepackage{Sweave}
+\SweaveOpts{strip.white=all, keep.source=TRUE}
+
+\begin{document}
+\SweaveOpts{concordance=TRUE}
+
+\begin{article}
+
+\title{Mapping levels of a factor}
+\subtitle{The \pkg{gdata} package}
+\author{by Gregor Gorjanc}
+
+\maketitle
+
+\section{Introduction}
+
+Factors use the \code{levels} attribute to store the mapping between
+internal integer codes and character values, i.e., levels. The first level
+is mapped to internal integer code 1 and so on. Although some users do not
+like factors, they are more efficient in terms of storage than character
+vectors. Additionally, many functions in base \R{} provide additional
+value for factors. Sometimes users need to work with the internal integer
+codes and map them back to a factor, especially when interfacing with
+external programs. Mapping information is also of interest if there are
+many factors that should have the same set of levels. This note describes
+the \code{mapLevels} function, a utility function for mapping the levels
+of a factor in the \pkg{gdata} package\footnote{from version 2.3.1}
+\citep{WarnesGdata}.
+
+\section{Description with examples}
+
+Function \code{mapLevels()} is an (S3) generic function and works on the
+\code{factor} and \code{character} atomic classes. It also works on
+\code{list} and \code{data.frame} objects whose components or columns are
+of these atomic classes. Function \code{mapLevels} produces a so-called
+``map'' with names and values. Names are levels, while values can be
+internal integer codes or (possibly other) levels. This will be clarified
+later on. The class of this ``map'' is \code{levelsMap} if \code{x} in
+\code{mapLevels()} was atomic, or \code{listLevelsMap} otherwise, i.e., for
+the \code{list} and \code{data.frame} classes. The following example shows
+the creation and printout of such a ``map''.
+
+<<ex01>>=
+library(gdata)
+(fac <- factor(c("B", "A", "Z", "D")))
+(map <- mapLevels(x=fac))
+@
+
+If we have to work with internal integer codes, we can transform the factor
+to integer and still get the original factor back by passing the ``map'' to
+the \code{mapLevels<-} function, as shown below. \code{mapLevels<-} is also
+an (S3) generic function and works on the same classes as \code{mapLevels}
+plus the \code{integer} atomic class.
+
+<<ex02>>=
+(int <- as.integer(fac))
+mapLevels(x=int) <- map
+int
+identical(fac, int)
+@
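+
+The same round trip from integer codes back to a factor can be sketched in
+base \R{} alone. This is only an illustration under the assumption that the
+levels are kept alongside the codes; it is not the implementation of
+\code{mapLevels<-}, and the names \code{facSketch}, \code{levSketch},
+\code{codesSketch} and \code{backSketch} are hypothetical.
+
+<<sketchCodes>>=
+facSketch <- factor(c("B", "A", "Z", "D"))
+## keep the levels (the "map") and the bare integer codes separately
+levSketch <- levels(facSketch)
+codesSketch <- as.integer(facSketch)
+## rebuild the factor: code i is labelled with the i-th level
+backSketch <- factor(codesSketch,
+                     levels=seq_along(levSketch),
+                     labels=levSketch)
+as.character(backSketch)
+levels(backSketch)
+@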
+
+Internally, a ``map'' (\code{levelsMap} class) is a \code{list} (see below),
+but its print method unlists it for ease of inspection. The ``map'' from the
+example has all components of length 1. This is not mandatory, as the
+\code{mapLevels<-} function is only a wrapper around the workhorse function
+\code{levels<-}, and the latter can accept a \code{list} with components of
+various lengths.
+
+<<ex03>>=
+str(map)
+@
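+
+The behaviour of base \R{}'s \code{levels<-} with a \code{list} whose
+components have different lengths can be sketched as follows. The object
+name \code{fSketch} and the level names are hypothetical and serve only as
+an illustration.
+
+<<sketchLevelsList>>=
+fSketch <- factor(c("a", "b", "c", "b"))
+## each list name becomes a new level; its component holds the
+## old levels that are merged into it
+levels(fSketch) <- list(ab=c("a", "b"), c="c")
+fSketch
+@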
+
+Although not of primary importance, this ``map'' can also be used to remap
+factor levels, as shown below. Components ``later'' in the map take
+precedence over ``previous'' ones. Since this is not optimal, I would
+rather recommend other approaches for ``remapping'' the levels of a
+\code{factor}, say \code{recode} in the \pkg{car} package \citep{FoxCar}.
+
+<<ex04>>=
+map[[2]] <- as.integer(c(1, 2))
+map
+int <- as.integer(fac)
+mapLevels(x=int) <- map
+int
+@
+
+Up to now, the examples showed a ``map'' with internal integer codes for
+values and levels for names. I call this an integer ``map''. On the other
+hand, a character ``map'' uses levels for values and (possibly other)
+levels for names. This feature is a bit odd at first sight, but can be used
+to easily unify levels and internal integer codes across several factors.
+Imagine you have a factor that is for some reason split into two factors
+\code{f1} and \code{f2} and that neither factor has all the levels. This is
+not an uncommon situation.
+
+<<ex05>>=
+(f1 <- factor(c("A", "D", "C")))
+(f2 <- factor(c("B", "D", "C")))
+@
+
+If we work with these factors, we need to be careful as they do not have
+the same set of levels. This can be solved by appropriately specifying the
+\code{levels} argument when creating the factors, i.e., \code{levels=c("A",
+  "B", "C", "D")}, or with proper use of the \code{levels<-} function. I say
+proper as it is very tempting to use:
+
+<<ex06>>=
+fTest <- f1
+levels(fTest) <- c("A", "B", "C", "D")
+fTest
+@
+
+The above example extends the set of levels, but also changes the levels of
+the 2nd and 3rd elements in \code{fTest}! Proper use of \code{levels<-} (as
+shown in the \code{levels} help page) would be:
+
+<<ex07>>=
+fTest <- f1
+levels(fTest) <- list(A="A", B="B",
+                      C="C", D="D")
+fTest
+@
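+
+The other option mentioned above, specifying the \code{levels} argument
+when the factors are created, can be sketched in base \R{} as follows. The
+names \code{allLevels}, \code{g1} and \code{g2} are hypothetical and chosen
+so that \code{f1} and \code{f2} stay untouched.
+
+<<sketchLevelsArg>>=
+allLevels <- c("A", "B", "C", "D")
+## both factors share the full set of levels from the start
+g1 <- factor(c("A", "D", "C"), levels=allLevels)
+g2 <- factor(c("B", "D", "C"), levels=allLevels)
+identical(levels(g1), levels(g2))
+## the same level now has the same integer code in both factors
+cbind(as.integer(g1), as.integer(g2))
+@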
+
+Function \code{mapLevels} with a character ``map'' can help us in such
+scenarios to unify levels and internal integer codes across several
+factors. Again, the workhorse behind this process is the \code{levels<-}
+function from base \R{}! Function \code{mapLevels<-} just controls the
+assignment of an (integer or character) ``map'' to \code{x}. Levels in
+\code{x} that match ``map'' values (internal integer codes or levels) are
+changed to ``map'' names (possibly other levels), as shown in the
+\code{levels} help page. Levels that do not match are converted to
+\code{NA}. An integer ``map'' can be applied to an \code{integer} or a
+\code{factor}, while a character ``map'' can be applied to a
+\code{character} or a \code{factor}. The result of \code{mapLevels<-} is
+always a \code{factor} with possibly ``remapped'' levels.
+
+To get one joint character ``map'' for several factors, we need to put the
+factors in a \code{list} or \code{data.frame} and use the arguments
+\code{codes=FALSE} and \code{combine=TRUE}. Such a map can then be used to
+unify levels and internal integer codes.
+
+<<ex08>>=
+(bigMap <- mapLevels(x=list(f1, f2),
+                     codes=FALSE,
+                     combine=TRUE))
+mapLevels(f1) <- bigMap
+mapLevels(f2) <- bigMap
+f1
+f2
+cbind(as.character(f1), as.integer(f1),
+      as.character(f2), as.integer(f2))
+@
+
+If we do not specify \code{combine=TRUE} (i.e., we keep the default
+behaviour) and \code{x} is a \code{list} or \code{data.frame},
+\code{mapLevels} returns a ``map'' of class \code{listLevelsMap}. This is
+internally a \code{list} of ``maps'' (\code{levelsMap} objects). Both
+\code{listLevelsMap} and \code{levelsMap} objects can be passed to
+\code{mapLevels<-} for a \code{list}/\code{data.frame}. Recycling occurs
+when the length of the \code{listLevelsMap} is not the same as the number
+of components/columns of the \code{list}/\code{data.frame}.
+
+Additional convenience methods are also implemented to ease the work with
+``maps'':
+
+\begin{itemize}
+
+\item \code{is.levelsMap}, \code{is.listLevelsMap}, \code{as.levelsMap} and
+  \code{as.listLevelsMap} for testing and coercion of user defined
+  ``maps'',
+
+\item \code{"["} for subsetting,
+
+\item \code{c} for combining \code{levelsMap} or \code{listLevelsMap}
+  objects; argument \code{recursive=TRUE} can be used to coerce
+  \code{listLevelsMap} to \code{levelsMap}, for example \code{c(llm1, llm2,
+    recursive=TRUE)} and
+
+\item \code{unique} and \code{sort} for \code{levelsMap}.
+
+\end{itemize}
+
+\section{Summary}
+
+Functions \code{mapLevels} and \code{mapLevels<-} can help users to map
+internal integer codes to factor levels and unify levels as well as
+internal integer codes among several factors. I welcome any comments or
+suggestions.
+
+% \bibliography{refs}
+\begin{thebibliography}{1}
+\providecommand{\natexlab}[1]{#1}
+\providecommand{\url}[1]{\texttt{#1}}
+\expandafter\ifx\csname urlstyle\endcsname\relax
+  \providecommand{\doi}[1]{doi: #1}\else
+  \providecommand{\doi}{doi: \begingroup \urlstyle{rm}\Url}\fi
+
+\bibitem[Fox(2006)]{FoxCar}
+J.~Fox.
+\newblock \emph{car: Companion to Applied Regression}, 2006.
+\newblock URL \url{http://socserv.socsci.mcmaster.ca/jfox/}.
+\newblock R package version 1.1-1.
+
+\bibitem[Warnes(2006)]{WarnesGdata}
+G.~R. Warnes.
+\newblock \emph{gdata: Various R programming tools for data manipulation},
+  2006.
+\newblock URL
+  \url{http://cran.r-project.org/src/contrib/Descriptions/gdata.html}.
+\newblock R package version 2.3.1. Includes R source code and/or documentation
+  contributed by Ben Bolker, Gregor Gorjanc and Thomas Lumley.
+
+\end{thebibliography}
+
+\address{Gregor Gorjanc\\
+  University of Ljubljana, Slovenia\\
+\email{gregor.gorjanc at bfro.uni-lj.si}}
+
+\end{article}
+
+\end{document}

Added: pkg/gdata/inst/doc/unknown.Rnw
===================================================================
--- pkg/gdata/inst/doc/unknown.Rnw	                        (rev 0)
+++ pkg/gdata/inst/doc/unknown.Rnw	2017-06-06 17:18:49 UTC (rev 2157)
@@ -0,0 +1,272 @@
+
+%\VignetteIndexEntry{Working with Unknown Values}
+%\VignettePackage{gdata}
+%\VignetteKeywords{unknown, missing, manip}
+
+\documentclass[a4paper]{report}
+\usepackage{Rnews}
+\usepackage[round]{natbib}
+\bibliographystyle{abbrvnat}
+
+\usepackage{Sweave}
+\SweaveOpts{strip.white=all, keep.source=TRUE}
+
+\begin{document}
+
+\begin{article}
+
+\title{Working with Unknown Values}
+\subtitle{The \pkg{gdata} package}
+\author{by Gregor Gorjanc}
+
+\maketitle
+
+This vignette has been published as \cite{Gorjanc}.
+
+\section{Introduction}
+
+Unknown or missing values can be represented in various ways. For example,
+SAS uses \code{.}~(dot), while \R{} uses \code{NA}, which we can read as
+Not Available. When we import data into \R{}, say via \code{read.table} or
+its derivatives, conversion of blank fields to \code{NA} (according to
+\code{read.table} help) is done for \code{logical}, \code{integer},
+\code{numeric} and \code{complex} classes. Additionally, the
+\code{na.strings} argument can be used to specify values that should also
+be converted to \code{NA}. Inversely, there is an argument \code{na} in
+\code{write.table} and its derivatives to define the value that will
+replace \code{NA} in exported data. There are also other ways to
+import/export data into \R{} as described in the \emph{R Data
+  Import/Export} manual \citep{RImportExportManual}. However, all
+approaches lack the possibility to define unknown value(s) for some
+particular column. It is possible that an unknown value in one column is a
+valid value in another column. For example, I have seen many datasets where
+values such as 0, -9, 999 and specific dates are used as column-specific
+unknown values.
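+
+The limitation of \code{na.strings} can be sketched with a small,
+hypothetical dataset in base \R{}. Here \code{-9} is assumed to mean
+``unknown'' in column \code{score} but to be a valid value in column
+\code{id}; the names \code{txt}, \code{datAll} and \code{datCol} are used
+only for illustration.
+
+<<sketchNaStrings>>=
+txt <- "id score\n1 0\n2 -9\n-9 5"
+## na.strings applies to every column, so the valid id -9 is lost too
+datAll <- read.table(text=txt, header=TRUE, na.strings="-9")
+datAll
+## column-specific alternative: import as is, then recode one column
+datCol <- read.table(text=txt, header=TRUE)
+datCol$score[datCol$score %in% -9] <- NA
+datCol
+@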
+
+This note describes a set of functions in package \pkg{gdata}\footnote{
+  package version 2.3.1} \citep{WarnesGdata}: \code{isUnknown},
+\code{unknownToNA} and \code{NAToUnknown}, which can help with testing for
+unknown values and conversions between unknown values and \code{NA}. All
+three functions are generic (S3) and were tested (at the time of writing)
+to work with: \code{integer}, \code{numeric}, \code{character},
+\code{factor}, \code{Date}, \code{POSIXct}, \code{POSIXlt}, \code{list},
+\code{data.frame} and \code{matrix} classes.
+
+\section{Description with examples}
+
+The following examples show simple usage of these functions on
+\code{numeric} and \code{factor} classes, where value \code{0} (besides
+\code{NA}) should be treated as an unknown value:
+
+<<ex01>>=
+library("gdata")
+xNum <- c(0, 6, 0, 7, 8, 9, NA)
+isUnknown(x=xNum)
+@
+
+The default unknown value in \code{isUnknown} is \code{NA}, which means
+that output is the same as \code{is.na} --- at least for atomic
+classes. However, we can pass the argument \code{unknown} to define which
+values should be treated as unknown:
+
+<<ex02>>=
+isUnknown(x=xNum, unknown=0)
+@
+
+This skipped \code{NA}, but we can get the expected answer after
+appropriately adding \code{NA} into the argument \code{unknown}:
+
+<<ex03>>=
+isUnknown(x=xNum, unknown=c(0, NA))
+@
+
+Now, we can change all unknown values to \code{NA} with \code{unknownToNA}.
+There is clearly no need to add \code{NA} here. This step is very handy
+after importing data from an external source, where many different unknown
+values might be used. The argument \code{warning=TRUE} can be used if there
+is a need to be warned about ``original'' \code{NA}s:
+
+<<ex04>>=
+(xNum2 <- unknownToNA(x=xNum, unknown=0))
+@
+
+Prior to export from \R{}, we might want to change unknown values
+(\code{NA} in \R{}) to some other value. Function \code{NAToUnknown} can be
+used for this:
+
+<<ex05>>=
+NAToUnknown(x=xNum2, unknown=999)
+@
+
+Converting \code{NA} to a value that already exists in \code{x} issues an
+error, but \code{force=TRUE} can be used to overcome this if needed. But be
+warned that there is no way back from this step:
+
+<<ex06>>=
+NAToUnknown(x=xNum2, unknown=7, force=TRUE)
+@
+
+The examples below show the peculiarities of class \code{factor}.
+\code{unknownToNA} removes the \code{unknown} value from the levels and,
+inversely, \code{NAToUnknown} adds it with a warning. Additionally,
+\code{"NA"} is properly distinguished from \code{NA}. It can also be seen
+that the argument \code{unknown} in functions \code{isUnknown} and
+\code{unknownToNA} need not match the class of \code{x} (otherwise a factor
+would have to be supplied), as the test is internally done with
+\code{\%in\%}, which nicely resolves coercion issues.
+
+<<ex07>>=
+(xFac <- factor(c(0, "BA", "RA", "BA", NA, "NA")))
+isUnknown(x=xFac)
+isUnknown(x=xFac, unknown=0)
+isUnknown(x=xFac, unknown=c(0, NA))
+isUnknown(x=xFac, unknown=c(0, "NA"))
+isUnknown(x=xFac, unknown=c(0, "NA", NA))
+
+(xFac <- unknownToNA(x=xFac, unknown=0))
+(xFac <- NAToUnknown(x=xFac, unknown=0))
+@
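+
+How \code{\%in\%} from base \R{} handles a factor on the left-hand side can
+be sketched without \pkg{gdata}. This only illustrates the coercion
+behaviour mentioned above and is not the package code; the name
+\code{xSketch} is hypothetical.
+
+<<sketchIn>>=
+xSketch <- factor(c(0, "BA", NA, "NA"))
+## the numeric 0 is matched against the level "0"
+xSketch %in% 0
+## a real NA matches the missing element, but not the level "NA"
+xSketch %in% c(0, NA)
+## the string "NA" matches the level "NA", but not the missing element
+xSketch %in% c(0, "NA")
+@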
+
+These two examples with classes \code{numeric} and \code{factor} are fairly
+simple and we could get the same results with one or two lines of \R{}
+code. The real benefit of the set of functions presented here is in
+\code{list} and \code{data.frame} methods, where \code{data.frame} methods
+are merely wrappers for \code{list} methods.
+
+We need additional flexibility for \code{list}/\code{data.frame} methods,
+due to possibly having multiple unknown values that can be different among
+\code{list} components or \code{data.frame} columns. For these two methods,
+the argument \code{unknown} can be either a \code{vector} or \code{list},
+both possibly named. Of course, greater flexibility (defining multiple
+unknown values per component/column) can be achieved with a \code{list}.
+
+When a \code{vector}/\code{list} object passed to the argument
+\code{unknown} is not named, the first value/component of the
+\code{vector}/\code{list} matches the first component/column of the
+\code{list}/\code{data.frame}. This can be quite error prone, especially
+with vectors. Therefore, I encourage the use of a \code{list}. If the
+\code{vector}/\code{list} passed to the argument \code{unknown} is named,
+its names are matched to the names of the \code{list} or \code{data.frame}.
+If the lengths of \code{unknown} and the \code{list} or \code{data.frame}
+do not match, recycling occurs.
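+
+The idea of matching unknown values to columns by name can be sketched in
+base \R{} with a simple loop. This is a swapped-in illustration under my
+own assumptions, not the \pkg{gdata} implementation; the names
+\code{dfSketch}, \code{unkSketch} and \code{nm} are hypothetical.
+
+<<sketchNamed>>=
+dfSketch <- data.frame(col1=c(0, 1, 999),
+                       col2=c("a", "unknown", "c"),
+                       stringsAsFactors=FALSE)
+## the order of the list does not matter, only its names do
+unkSketch <- list(col2="unknown", col1=999)
+for (nm in names(unkSketch)) {
+  ## set values listed for this column to NA
+  dfSketch[[nm]][dfSketch[[nm]] %in% unkSketch[[nm]]] <- NA
+}
+dfSketch
+@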
+
+The example below illustrates the application of the described functions to
+a list which is composed of the previously defined and modified numeric
+(\code{xNum}) and factor (\code{xFac}) objects. First, function
+\code{isUnknown} is used with \code{0} as an unknown value. Note that we
+get \code{FALSE} for \code{NA}s, as was the case earlier when only \code{0}
+was passed as the unknown value.
+
+<<ex08>>=
+(xList <- list(a=xNum, b=xFac))
+isUnknown(x=xList, unknown=0)
+@
+
+We need to add \code{NA} as an unknown value. However, we do not get the
+expected result this way!
+
+<<ex09>>=
+isUnknown(x=xList, unknown=c(0, NA))
+@
+
+This is due to matching of values in the argument \code{unknown} and
+components in a \code{list}; i.e., \code{0} is used for component \code{a}
+and \code{NA} for component \code{b}.  Therefore, it is less error prone
+and more flexible to pass a \code{list} (preferably a named list) to the
+argument \code{unknown}, as shown below.
+
+<<ex10>>=
+(xList1 <- unknownToNA(x=xList,
+                       unknown=list(b=c(0, "NA"),
+                                    a=0)))
+@
+
+Changing \code{NA}s to some other value (only one per component/column) can
+be accomplished as follows:
+
+<<ex11>>=
+NAToUnknown(x=xList1,
+            unknown=list(b="no", a=0))
+@
+
+A named component \code{.default} of a \code{list} passed to argument
+\code{unknown} has a special meaning as it will match a component/column
+with that name and any other not defined in \code{unknown}. As such it is
+very useful if the number of components/columns with the same unknown
+value(s) is large. Consider a wide \code{data.frame} named \code{df}. Now
+\code{.default} can be used to define unknown value for several columns:
+
+<<ex12, echo=FALSE>>=
+df <- data.frame(col1=c(0, 1, 999, 2),
+                 col2=c("a", "b", "c", "unknown"),
+                 col3=c(0, 1, 2, 3),
+                 col4=c(0, 1, 2, 2))
+@
+
+<<ex13>>=
+tmp <- list(.default=0,
+            col1=999,
+            col2="unknown")
+(df2 <- unknownToNA(x=df,
+                    unknown=tmp))
+@
+
+If there is a need to work only on some components/columns you can of
+course ``skip'' columns with standard \R{} mechanisms, i.e.,
+by subsetting \code{list} or \code{data.frame} objects:
+
+<<ex14>>=
+df2 <- df
+cols <- c("col1", "col2")
+tmp <- list(col1=999,
+            col2="unknown")
+df2[, cols] <- unknownToNA(x=df[, cols],
+                           unknown=tmp)
+df2
+@
+
+\section{Summary}
+
+Functions \code{isUnknown}, \code{unknownToNA} and \code{NAToUnknown}
+provide a useful interface to work with various representations of
+unknown/missing values. Their use is meant primarily for shaping the data
+after importing to or before exporting from \R{}. I welcome any comments or
+suggestions.
+
+% \bibliography{refs}
+
+\begin{thebibliography}{1}
+\providecommand{\natexlab}[1]{#1}
+\providecommand{\url}[1]{\texttt{#1}}
+\expandafter\ifx\csname urlstyle\endcsname\relax
+  \providecommand{\doi}[1]{doi: #1}\else
+  \providecommand{\doi}{doi: \begingroup \urlstyle{rm}\Url}\fi
+
+\bibitem[Gorjanc(2007)]{Gorjanc}
+G.~Gorjanc.
+\newblock Working with unknown values: the gdata package.
+\newblock \emph{R News}, 7\penalty0 (1):\penalty0 24--26, 2007.
+\newblock URL \url{http://CRAN.R-project.org/doc/Rnews/Rnews_2007-1.pdf}.
+
+\bibitem[{R Development Core Team}(2006)]{RImportExportManual}
+{R Development Core Team}.
+\newblock \emph{R Data Import/Export}, 2006.
+\newblock URL \url{http://cran.r-project.org/manuals.html}.
+\newblock ISBN 3-900051-10-0.
+
+\bibitem[Warnes(2006)]{WarnesGdata}
+G.~R. Warnes.
+\newblock \emph{gdata: Various R programming tools for data manipulation},
+  2006.
+\newblock URL
+  \url{http://cran.r-project.org/src/contrib/Descriptions/gdata.html}.
+\newblock R package version 2.3.1. Includes R source code and/or documentation
+  contributed by Ben Bolker, Gregor Gorjanc and Thomas Lumley.
+
+\end{thebibliography}
+
+\address{Gregor Gorjanc\\
+  University of Ljubljana, Slovenia\\
+\email{gregor.gorjanc at bfro.uni-lj.si}}
+
+\end{article}
+
+\end{document}