Model Selection

Introduction



We are looking at the same fits as the last module: best, forward, and backward subset selection. In this notebook, however, we determine which model is best using a variety of approaches, including a training/validation split and cross-validation.

Validation Set



Load the packages and the prepped data:

library(ISLR)
library(leaps)
Hitters = na.omit(Hitters)
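The Hitters data contains players with missing Salary values, which na.omit() drops. A quick check of the cleaned data:

```r
sum(is.na(Hitters))  # 0 after na.omit()
dim(Hitters)         # 263 observations, 20 variables remain
```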



Make a training set and a validation set so that we can choose a good subset model:

We split the observations into a training set and a test set by creating a random logical vector with elements TRUE or FALSE. Indexing the data with this vector returns a random subset of the rows for each set. Then, we fit the models on the training set using regsubsets().

set.seed(1)
train = sample(c(TRUE, FALSE), nrow(Hitters), replace = TRUE)
test = !train
regfit.best = regsubsets(Salary~., data = Hitters[train,],nvmax = 19) 



Using the model.matrix() function, we build an “X” matrix of the kind used in many regression packages. Then we run a simple loop that, for each model size i, extracts the coefficients, multiplies them into the appropriate columns of the test model matrix to form predictions, and computes the test MSE.

test.mat = model.matrix(Salary~., data = Hitters[test,])
errors = rep(NA, 19)
for(i in 1:19){
  coefi = coef(regfit.best, id = i)
  pred = test.mat[,names(coefi)] %*% coefi
  errors[i] = mean((Hitters$Salary[test] - pred)^2)
}



Which model size gives the lowest validation error? which.min() tells us quickly: with this split the minimum falls at the 7-variable model. Below we also print the coefficients of the 10-variable model, which we carry forward in the next section:

which.min(errors)
[1] 7
coef(regfit.best,10)
 (Intercept)        AtBat         Hits        HmRun        Walks       CAtBat 
  71.8074075   -1.5038124    5.9130470  -11.5241809    8.4349759   -0.1654850 
       CRuns         CRBI       CWalks    DivisionW      PutOuts 
   1.7064330    0.7903694   -0.9107515 -109.5616997    0.2426078 
plot(sqrt(errors), ylab = "Root MSE", ylim = c(200,430), pch = 19, type = 'b')
# Training RMSE: divide each model's RSS by the number of training observations
# (rss[-1] drops the intercept-only model).
points(sqrt(regfit.best$rss[-1]/sum(train)), col = 'blue', pch = 19, type = 'b')
legend('topright', legend = c('Training', 'Validation'), col = c('blue', 'black'), pch = 19)



Above, we made predictions on the observations that were not used to train the model, initializing an empty vector to record the errors for each of the 19 model sizes. Because the leaps package provides no predict() method for regsubsets objects, that prediction had to be done by hand. Since we will need predictions again for cross-validation, we wrap the steps in a predict method:

predict.regsubsets = function(object, newdata, id, ...){
  form = as.formula(object$call[[2]])  # recover the formula from the regsubsets() call
  mat = model.matrix(form, newdata)    # build the design matrix for the new data
  coefi = coef(object, id = id)        # coefficients of the id-variable model
  xvars = names(coefi)
  mat[, xvars] %*% coefi               # predictions: X[, selected] %*% beta
}
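As a quick sanity check (using the regfit.best and test.mat objects from above), the helper should reproduce the manual prediction for any model size, e.g. the 7-variable model:

```r
# S3 dispatch: predict() on a regsubsets object now finds our helper.
pred7 = predict(regfit.best, Hitters[test, ], id = 7)
coefi7 = coef(regfit.best, id = 7)
manual7 = test.mat[, names(coefi7)] %*% coefi7
all.equal(as.numeric(pred7), as.numeric(manual7))  # TRUE if they match
```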



Now let's perform best subset selection on the full data set and examine the 10-variable model. Note that the variables selected on the full data may differ from those selected on the training set alone (here, Assists replaces HmRun).

regfit.best = regsubsets(Salary~., data = Hitters, nvmax = 19)
coef(regfit.best,10)
 (Intercept)        AtBat         Hits        Walks       CAtBat        CRuns 
 162.5354420   -2.1686501    6.9180175    5.7732246   -0.1300798    1.4082490 
        CRBI       CWalks    DivisionW      PutOuts      Assists 
   0.7743122   -0.8308264 -112.3800575    0.2973726    0.2831680 

10-Fold Cross-Validation



Here we write a loop that performs k-fold cross-validation by hand. In the jth fold, the elements of folds that equal j form the test set, and the remainder form the training set. For each model size i, we make predictions using the predict method we just created, compute the test errors, and store them in cv.errors.

k = 10
folds = sample(rep(1:k, length = nrow(Hitters)), replace = TRUE)
cv.errors = matrix(NA, k, 19, dimnames = list(NULL, paste(1:19)))
for (j in 1:k){
  best.fit = regsubsets(Salary~., data = Hitters[folds != j,], nvmax = 19)
  for (i in 1:19){
    pred = predict(best.fit, Hitters[folds == j,], id = i)
    cv.errors[j,i] = mean((Hitters$Salary[folds == j] - pred)^2)
  }
}
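Because the fold labels are drawn with replacement, the folds are only roughly equal in size. A quick check (using the folds vector from above):

```r
table(folds)  # number of observations assigned to each of the 10 folds
```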



This has given us a 10×19 matrix whose (j,i)th element is the test MSE for the jth cross-validation fold of the best i-variable model. We use the apply() function to average over the columns of this matrix, obtaining a vector whose ith element is the cross-validation error for the i-variable model.

mean.cv.errors = apply(cv.errors, 2, mean)
mean.cv.errors
       1        2        3        4        5        6        7        8 
148035.8 127232.2 139513.0 135696.2 139339.6 130314.8 131366.7 120977.3 
       9       10       11       12       13       14       15       16 
122370.2 117907.8 123961.7 124218.7 126008.0 128049.7 128552.4 128669.6 
      17       18       19 
128166.8 128516.0 128399.2 

plot(mean.cv.errors, type = 'b')
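Rather than reading the minimum off the plot, we can locate it directly and mark it (using mean.cv.errors from above; for the errors printed above the minimum is at 10):

```r
best.size = which.min(mean.cv.errors)  # model size with the lowest mean CV error
points(best.size, mean.cv.errors[best.size], col = 'red', pch = 19)
```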



Cross-validation points to a model of around 10–11 variables (the printed errors bottom out at 10, with 11 a close second), so let's perform best subset selection on the full data in order to obtain the 11-variable model.


reg.best = regsubsets(Salary~., data = Hitters, nvmax = 19)
coef(reg.best, 11)
 (Intercept)        AtBat         Hits        Walks       CAtBat        CRuns 
 135.7512195   -2.1277482    6.9236994    5.6202755   -0.1389914    1.4553310 
        CRBI       CWalks      LeagueN    DivisionW      PutOuts      Assists 
   0.7852528   -0.8228559   43.1116152 -111.1460252    0.2894087    0.2688277