[Returnanalytics-commits] r3555 - in pkg/FactorAnalytics: R man vignettes

noreply at r-forge.r-project.org
Fri Nov 21 12:25:12 CET 2014


Author: pragnya
Date: 2014-11-21 12:25:12 +0100 (Fri, 21 Nov 2014)
New Revision: 3555

Modified:
   pkg/FactorAnalytics/R/fitTsfm.R
   pkg/FactorAnalytics/R/fitTsfm.control.R
   pkg/FactorAnalytics/man/fitTsfm.Rd
   pkg/FactorAnalytics/man/fitTsfm.control.Rd
   pkg/FactorAnalytics/vignettes/fitTsfm_vignette.Rnw
   pkg/FactorAnalytics/vignettes/fitTsfm_vignette.pdf
Log:
Added option: best model from a range of subset sizes using subsets regression

Modified: pkg/FactorAnalytics/R/fitTsfm.R
===================================================================
--- pkg/FactorAnalytics/R/fitTsfm.R	2014-11-20 04:38:47 UTC (rev 3554)
+++ pkg/FactorAnalytics/R/fitTsfm.R	2014-11-21 11:25:12 UTC (rev 3555)
@@ -25,11 +25,12 @@
 #' Bayesian Information Criterion (BIC) or Akaike Information Criterion (AIC), 
 #' improves. And, "subsets" enables subsets selection using 
 #' \code{\link[leaps]{regsubsets}}; chooses the best performing subset of any 
-#' given size. See \code{\link{fitTsfm.control}} for more details on the 
-#' control arguments. \code{variable.selection="lars"} corresponds to least 
-#' angle regression using \code{\link[lars]{lars}} with variants "lasso", 
-#' "lar", "forward.stagewise" or "stepwise". Note: If 
-#' \code{variable.selection="lars"}, \code{fit.method} will be ignored.
+#' given size or within a range of subset sizes. See 
+#' \code{\link{fitTsfm.control}} for more details on the control arguments. 
+#' \code{variable.selection="lars"} corresponds to least angle regression 
+#' using \code{\link[lars]{lars}} with variants "lasso", "lar", "stepwise" or 
+#' "forward.stagewise". Note: If \code{variable.selection="lars"}, 
+#' \code{fit.method} will be ignored.
 #' 
 #' Arguments \code{mkt.name} and \code{mkt.timing} allow for market-timing 
 #' factors to be added to any of the above methods. Market timing accounts for 
@@ -197,6 +198,7 @@
   
   # extract arguments to pass to different fit and variable selection functions
   decay <- control$decay
+  nvmin <- control$nvmin
   subset.size <- control$subset.size
   lars.criterion <- control$lars.criterion
   m1 <- match(c("weights","model","x","y","qr"), 
@@ -208,7 +210,7 @@
   m3 <-  match(c("scope","scale","direction","trace","steps","k"), 
                names(control), 0L)
   step.args <- control[m3, drop=TRUE]
-  m4 <-  match(c("weights","nbest","nvmax","force.in","force.out","method",
+  m4 <-  match(c("weights","nvmax","force.in","force.out","method",
                  "really.big"), names(control), 0L)
   regsubsets.args <- control[m4, drop=TRUE]
   m5 <-  match(c("type","normalize","eps","max.steps","trace"), 
@@ -268,7 +270,7 @@
   } else if (variable.selection == "subsets") {
     reg.list <- SelectAllSubsets(dat.xts, asset.names, factor.names, fit.method, 
                                  lm.args, lmRob.args, regsubsets.args, 
-                                 subset.size, decay)
+                                 nvmin, subset.size, decay)
   } else if (variable.selection == "lars") {
     result.lars <- SelectLars(dat.xts, asset.names, factor.names, lars.args, 
                               cv.lars.args, lars.criterion)
@@ -365,8 +367,8 @@
 ### method variable.selection = "subsets"
 #
 SelectAllSubsets <- function(dat.xts, asset.names, factor.names, fit.method, 
-                             lm.args, lmRob.args, regsubsets.args, subset.size, 
-                             decay) {
+                             lm.args, lmRob.args, regsubsets.args, nvmin, 
+                             subset.size, decay) {
   
   # initialize list object to hold the fitted objects
   reg.list <- list()
@@ -387,7 +389,19 @@
     fm.subsets <- do.call(regsubsets, c(list(fm.formula,data=reg.xts), 
                                         regsubsets.args))
     sum.sub <- summary(fm.subsets)
-    names.sub <- names(which(sum.sub$which[as.character(subset.size),-1]==TRUE))
+    
+    # choose best model of a given subset.size (or) 
+    # best model amongst subset sizes in [nvmin, nvmax]
+    if (!is.null(subset.size)) { 
+      names.sub <- names(which(sum.sub$which[subset.size,-1]==TRUE))
+      bic <- sum.sub$bic[subset.size]
+    } else { 
+      best.size <- which.min(sum.sub$bic[nvmin:length(sum.sub$bic)]) + nvmin -1
+      names.sub <- names(which(sum.sub$which[best.size,-1]==TRUE))
+      bic <- min(sum.sub$bic[nvmin:length(sum.sub$bic)])
+    }
+    
+    # completely remove NA cases for chosen subset
     reg.xts <- na.omit(dat.xts[,c(i,names.sub)])
     
     # fit based on time series regression method chosen
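
The selection rule added in the hunk above can be tried in isolation. The following is a minimal base-R sketch, not the package's code: `pick_subset` is a hypothetical helper, the `toy` object is a hand-built stand-in for `summary(regsubsets(...))`, and the factor names and BIC values are made up. It assumes `nbest = 1` and no forced-in factors, so that row s of `which` and element s of `bic` both describe the best model of size s.

```r
# Hypothetical helper (not part of the package): pick a subset from an
# object shaped like summary(regsubsets(...)), assuming nbest = 1 so that
# row s of `which` and element s of `bic` both describe the best size-s model.
pick_subset <- function(sum.sub, nvmin = 1, subset.size = NULL) {
  sizes <- seq_along(sum.sub$bic)
  if (is.null(subset.size)) {
    # search only sizes >= nvmin; index through `candidates` so the
    # result is an absolute subset size, not a position in a slice
    candidates <- sizes[sizes >= nvmin]
    best.size <- candidates[which.min(sum.sub$bic[candidates])]
  } else {
    best.size <- subset.size
  }
  # drop the intercept column, keep the names of the selected factors
  chosen <- sum.sub$which[best.size, -1]
  list(size = best.size,
       factors = names(chosen)[chosen],
       bic = sum.sub$bic[best.size])
}

# Toy stand-in for summary(regsubsets(...)) with three candidate factors
# (factor names and BIC values are made up for illustration)
toy <- list(
  which = matrix(c(TRUE, TRUE, FALSE, FALSE,
                   TRUE, TRUE, TRUE,  FALSE,
                   TRUE, TRUE, TRUE,  TRUE),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("1", "2", "3"),
                                 c("(Intercept)", "MKT", "SMB", "HML"))),
  bic = c(-30, -25, -28)
)

res.all   <- pick_subset(toy)                   # best over sizes 1..3
res.min2  <- pick_subset(toy, nvmin = 2)        # best over sizes 2..3
res.fixed <- pick_subset(toy, subset.size = 2)  # forced size 2
```

Indexing both `which` and `bic` by the absolute subset size keeps the two lookups consistent whatever value `nvmin` takes.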

Modified: pkg/FactorAnalytics/R/fitTsfm.control.R
===================================================================
--- pkg/FactorAnalytics/R/fitTsfm.control.R	2014-11-20 04:38:47 UTC (rev 3554)
+++ pkg/FactorAnalytics/R/fitTsfm.control.R	2014-11-21 11:25:12 UTC (rev 3555)
@@ -63,7 +63,7 @@
 #' @param k the multiple of the number of degrees of freedom used for the 
 #' penalty in \code{"stepwise"}. Only \code{k = 2} gives the genuine AIC. 
 #' \code{k = log(n)} is sometimes referred to as BIC or SBC. Default is 2.
-#' @param nbest number of subsets of each size to record for \code{"subsets"}. 
+#' @param nvmin minimum size of subsets to examine for \code{"subsets"}. 
 #' Default is 1.
 #' @param nvmax maximum size of subsets to examine for \code{"subsets"}. 
 #' Default is 8.
@@ -77,9 +77,10 @@
 #' "exhaustive".
 #' @param really.big option for \code{"subsets"}; Must be \code{TRUE} to 
 #' perform exhaustive search on more than 50 variables.
-#' @param subset.size number of factors required in the factor model; 
-#' an option for \code{"subsets"} variable selection. Default is 1. 
-#' Note: \code{nvmax >= subset.size >= length(force.in)}.
+#' @param subset.size number of factors required in the factor model; an 
+#' option for \code{"subsets"} variable selection. \code{NULL} selects the 
+#' best model (BIC) from amongst subset sizes in [\code{nvmin},\code{nvmax}]. 
+#' Default is \code{NULL}.
 #' @param type option for \code{"lars"}. One of "lasso", "lar", 
 #' "forward.stagewise" or "stepwise". The names can be abbreviated to any 
 #' unique substring. Default is "lasso".
@@ -133,11 +134,11 @@
 
 fitTsfm.control <- function(decay=0.95, weights, model=TRUE, x=FALSE, y=FALSE, 
                             qr=TRUE, nrep=NULL, scope, scale, direction, 
-                            trace=FALSE, steps=1000, k=2, nbest=1, nvmax=8, 
+                            trace=FALSE, steps=1000, k=2, nvmin=1, nvmax=8, 
                             force.in=NULL, force.out=NULL, method, 
-                            really.big=FALSE, subset.size=1, type, 
+                            really.big=FALSE, subset.size=NULL, type, 
                             normalize=TRUE, eps=.Machine$double.eps, max.steps, 
-                            lars.criterion="Cp", K = 10) {
+                            lars.criterion="Cp", K=10) {
   
   # get the user-specified arguments (that have no defaults)
   # this part of the code was adapted from stats::lm
@@ -171,13 +172,15 @@
   if (!is.logical(really.big) || length(really.big) != 1) {
     stop("Invalid argument: control parameter 'really.big' must be logical")
   }
-  if (subset.size <= 0 || round(subset.size) != subset.size) {
-    stop("control parameter 'subset.size' must be a positive integer")
+  if (!is.null(subset.size)) {
+    if (subset.size <= 0 || round(subset.size) != subset.size) {
+      stop("Control parameter 'subset.size' must be a positive integer or NULL")
+    }
+    if (nvmax < subset.size || subset.size < length(force.in)) {
+      stop("Invalid argument: nvmax should be >= subset.size and ", 
+           "subset.size should be >= length(force.in)")
+    }
   }
-  if (nvmax < subset.size || subset.size < length(force.in)) {
-    stop("Invaid Argument: nvmax should be >= subset.size and subset.size 
-         should be >= length(force.in)")
-  }
   if (!is.logical(normalize) || length(normalize) != 1) {
     stop("Invalid argument: control parameter 'normalize' must be logical")
   }
@@ -187,7 +190,7 @@
   
   # return list of arguments with defaults if they are unspecified
   result <- c(args, list(decay=decay, model=model, x=x, y=y, qr=qr, nrep=nrep, 
-                         trace=trace, steps=steps, k=k, nbest=nbest, 
+                         trace=trace, steps=steps, k=k, nvmin=nvmin, 
                          nvmax=nvmax, force.in=force.in, force.out=force.out, 
                          really.big=really.big, subset.size=subset.size, 
                          normalize=normalize, eps=eps, 
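
The NULL-aware argument checks in this file can be exercised on their own. The sketch below is a hypothetical stand-alone validator, not the package's code; `check_subset_args` and its messages are illustrative, and the `nvmin` checks are an added assumption, since the commit itself only validates `subset.size`.

```r
# Hypothetical stand-alone validator (not the package's code). The nvmin
# checks are an added assumption; the commit only validates subset.size.
check_subset_args <- function(nvmin = 1, nvmax = 8, subset.size = NULL,
                              force.in = NULL) {
  # TRUE only for a single, non-missing, whole-valued number
  is.whole <- function(v) {
    is.numeric(v) && length(v) == 1 && !is.na(v) && v == round(v)
  }
  if (!is.whole(nvmin) || nvmin < 1 || nvmin > nvmax) {
    stop("'nvmin' must be a positive integer no larger than 'nvmax'")
  }
  # subset.size = NULL means "search the whole range", so skip its checks
  if (!is.null(subset.size)) {
    if (!is.whole(subset.size) || subset.size <= 0) {
      stop("'subset.size' must be a positive integer or NULL")
    }
    if (nvmax < subset.size || subset.size < length(force.in)) {
      stop("need nvmax >= subset.size >= length(force.in)")
    }
  }
  invisible(TRUE)
}

ok.null  <- check_subset_args(nvmin = 2, nvmax = 4)
ok.fixed <- check_subset_args(subset.size = 3, force.in = c("MKT", "SMB"))
# tryCatch() turns the error into a value so the example keeps running
bad <- tryCatch(check_subset_args(nvmax = 2, subset.size = 3),
                error = function(e) conditionMessage(e))
```

Wrapping the `subset.size` checks in `!is.null()` matters because a comparison such as `NULL <= 0` yields a zero-length logical, which `if` rejects with an error.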

Modified: pkg/FactorAnalytics/man/fitTsfm.Rd
===================================================================
--- pkg/FactorAnalytics/man/fitTsfm.Rd	2014-11-20 04:38:47 UTC (rev 3554)
+++ pkg/FactorAnalytics/man/fitTsfm.Rd	2014-11-21 11:25:12 UTC (rev 3555)
@@ -108,11 +108,12 @@
 Bayesian Information Criterion (BIC) or Akaike Information Criterion (AIC),
 improves. And, "subsets" enables subsets selection using
 \code{\link[leaps]{regsubsets}}; chooses the best performing subset of any
-given size. See \code{\link{fitTsfm.control}} for more details on the
-control arguments. \code{variable.selection="lars"} corresponds to least
-angle regression using \code{\link[lars]{lars}} with variants "lasso",
-"lar", "forward.stagewise" or "stepwise". Note: If
-\code{variable.selection="lars"}, \code{fit.method} will be ignored.
+given size or within a range of subset sizes. See
+\code{\link{fitTsfm.control}} for more details on the control arguments.
+\code{variable.selection="lars"} corresponds to least angle regression
+using \code{\link[lars]{lars}} with variants "lasso", "lar", "stepwise" or
+"forward.stagewise". Note: If \code{variable.selection="lars"},
+\code{fit.method} will be ignored.
 
 Arguments \code{mkt.name} and \code{mkt.timing} allow for market-timing
 factors to be added to any of the above methods. Market timing accounts for

Modified: pkg/FactorAnalytics/man/fitTsfm.control.Rd
===================================================================
--- pkg/FactorAnalytics/man/fitTsfm.control.Rd	2014-11-20 04:38:47 UTC (rev 3554)
+++ pkg/FactorAnalytics/man/fitTsfm.control.Rd	2014-11-21 11:25:12 UTC (rev 3555)
@@ -5,9 +5,9 @@
 \usage{
 fitTsfm.control(decay = 0.95, weights, model = TRUE, x = FALSE,
   y = FALSE, qr = TRUE, nrep = NULL, scope, scale, direction,
-  trace = FALSE, steps = 1000, k = 2, nbest = 1, nvmax = 8,
+  trace = FALSE, steps = 1000, k = 2, nvmin = 1, nvmax = 8,
   force.in = NULL, force.out = NULL, method, really.big = FALSE,
-  subset.size = 1, type, normalize = TRUE, eps = .Machine$double.eps,
+  subset.size = NULL, type, normalize = TRUE, eps = .Machine$double.eps,
   max.steps, lars.criterion = "Cp", K = 10)
 }
 \arguments{
@@ -56,7 +56,7 @@
 penalty in \code{"stepwise"}. Only \code{k = 2} gives the genuine AIC.
 \code{k = log(n)} is sometimes referred to as BIC or SBC. Default is 2.}
 
-\item{nbest}{number of subsets of each size to record for \code{"subsets"}.
+\item{nvmin}{minimum size of subsets to examine for \code{"subsets"}.
 Default is 1.}
 
 \item{nvmax}{maximum size of subsets to examine for \code{"subsets"}.
@@ -76,9 +76,10 @@
 \item{really.big}{option for \code{"subsets"}; Must be \code{TRUE} to
 perform exhaustive search on more than 50 variables.}
 
-\item{subset.size}{number of factors required in the factor model;
-an option for \code{"subsets"} variable selection. Default is 1.
-Note: \code{nvmax >= subset.size >= length(force.in)}.}
+\item{subset.size}{number of factors required in the factor model; an
+option for \code{"subsets"} variable selection. \code{NULL} selects the
+best model (BIC) from amongst subset sizes in [\code{nvmin},\code{nvmax}].
+Default is \code{NULL}.}
 
 \item{type}{option for \code{"lars"}. One of "lasso", "lar",
 "forward.stagewise" or "stepwise". The names can be abbreviated to any

Modified: pkg/FactorAnalytics/vignettes/fitTsfm_vignette.Rnw
===================================================================
--- pkg/FactorAnalytics/vignettes/fitTsfm_vignette.Rnw	2014-11-20 04:38:47 UTC (rev 3554)
+++ pkg/FactorAnalytics/vignettes/fitTsfm_vignette.Rnw	2014-11-21 11:25:12 UTC (rev 3555)
@@ -158,7 +158,7 @@
 plot(fit2, which.plot.group=5, loop=FALSE, xlim=c(0,0.043))
 @
 
-By adding more factors in fit1 and fit2, though the R-squared values have improved (when compared to Sharpe's single index model), one might prefer to employ variable selection methods such as \verb"stepwise", \verb"subsets" or \verb"lars" to avoid over-fitting. The method can be selected via the \code{variable.selection} argument. The default \verb"none", uses all the factors and performs no variable selection. \verb"stepwise" performs traditional forward or backward stepwise OLS regression, starting from an initial (given) set of factors and adds factors only if the regression fit, as measured by the Bayesian Information Criterion (BIC) or Akaike Information Criterion (AIC), improves. \verb"subsets" enables subsets selection using \code{regsubsets}; chooses the best performing subset of any given size. \verb"lars" corresponds to least angle regression using \code{lars} with variants "lasso", "lar", "forward.stagewise" or "stepwise".  
+Although adding more factors in fit1 and fit2 improves the R-squared values (compared to Sharpe's single index model), one might prefer to employ variable selection methods such as \verb"stepwise", \verb"subsets" or \verb"lars" to avoid over-fitting. The method can be selected via the \code{variable.selection} argument. The default, \verb"none", uses all the factors and performs no variable selection. \verb"stepwise" performs traditional forward or backward stepwise OLS regression, starting from an initial (given) set of factors and adding factors only if the regression fit, as measured by the Bayesian Information Criterion (BIC) or Akaike Information Criterion (AIC), improves. \verb"subsets" enables subsets selection using \code{regsubsets}; it chooses the best performing subset of any given size or within a range of subset sizes. \verb"lars" corresponds to least angle regression using \code{lars} with variants "lasso", "lar", "forward.stagewise" or "stepwise".
 
 Remarks:
 \begin{itemize}
@@ -205,15 +205,16 @@
 \item \verb"lm": "weights","model","x","y","qr"
 \item \verb"lmRob": "weights","model","x","y","nrep"
 \item \verb"step": "scope","scale","direction","trace","steps","k"
-\item \verb"regsubsets": "weights","nbest","nvmax","force.in","force.out","method","really.big"
+\item \verb"regsubsets": "weights","nvmax","force.in","force.out","method","really.big"
 \item \verb"lars": "type","normalize","eps","max.steps","trace"
 \item \verb"cv.lars": "K","type","normalize","eps","max.steps","trace"
 \end{itemize}
 
-There are 3 other important arguments passed to \code{fitTsfm.control} that determine the type of factor model fit chosen.
+There are 4 other arguments passed to \code{fitTsfm.control} that determine the type of factor model fit chosen.
 \begin{itemize}
 \item \verb"decay": Determines the decay factor for \code{fit.method="DLS"}, which performs exponentially weighted least squares, with weights adding to unity.
-\item \verb"subset.size": Number of factors required in the factor model when performing \verb"subsets" selection. This might be meaningful when looking for the best model of a certain size (perhaps for parsimony, perhaps to compare with a different model of the same size, perhaps to avoid over-fitting/ data dredging etc.) 
+\item \verb"nvmin": The lower limit for the range of subset sizes from which the best model (by BIC) is found when performing \verb"subsets" selection. Note that the upper limit, \verb"nvmax", is already passed to the \verb"regsubsets" function.
+\item \verb"subset.size": Number of factors required in the factor model when performing \verb"subsets" selection. This might be meaningful when looking for the best model of a certain size (perhaps for parsimony, to compare with a different model of the same size, or to avoid over-fitting or data dredging). Alternatively, users can specify \code{NULL} to get the best model (by BIC) from amongst subset sizes in the range \code{[nvmin,nvmax]}.
 \item \verb"lars.criterion": An option (one of "Cp" or "cv") to assess model selection for the \code{"lars"} variable selection method. "Cp" is Mallows' Cp statistic and "cv" is K-fold cross-validated mean squared prediction error.
 \end{itemize}
 

Modified: pkg/FactorAnalytics/vignettes/fitTsfm_vignette.pdf
===================================================================
(Binary files differ)


