We are looking at the same fits as in the last module: best subset, forward, and backward stepwise selection. In this notebook, however, we determine which models are best using a variety of approaches, including a training/validation split and cross-validation.
Load the packages and prep the data:
library(ISLR)
library(leaps)
Hitters = na.omit(Hitters)
Make a training and a validation set so that we can choose a good subset model:
We split the observations into a training set and a test set by creating a random vector whose elements are TRUE or FALSE. We then index the data frame with this vector and its negation, so each set receives a random subset of the rows. Finally, we fit models on the training set using regsubsets().
set.seed(1)
train = sample(c(TRUE, FALSE), nrow(Hitters), replace = TRUE)
test = !train
regfit.best = regsubsets(Salary~., data = Hitters[train,],nvmax = 19)
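As a quick sanity check on the logical-indexing idea, here is a toy version on synthetic data (not Hitters): a random TRUE/FALSE vector and its negation place every row in exactly one of the two sets.

```r
# Synthetic split: a random TRUE/FALSE vector and its negation
# partition the rows into disjoint training and test sets.
toy <- data.frame(x = rnorm(20), y = rnorm(20))
toy_train <- sample(c(TRUE, FALSE), nrow(toy), replace = TRUE)
toy_test  <- !toy_train
stopifnot(all(xor(toy_train, toy_test)))        # disjoint and exhaustive
nrow(toy[toy_train, ]) + nrow(toy[toy_test, ])  # 20
```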
Using the model.matrix() function, we build an "X" matrix of the kind used in many regression packages. We then run a simple loop that, for each model size i, extracts the coefficients and multiplies them into the corresponding columns of the test model matrix to form predictions and compute the test MSE.
test.mat = model.matrix(Salary~., data = Hitters[test,])
errors = rep(NA, 19)
for(i in 1:19){
coefi = coef(regfit.best, id = i)
pred = test.mat[,names(coefi)] %*% coefi
errors[i] = mean((Hitters$Salary[test] - pred)^2)
}
Which model size has the lowest validation error? We can quickly find out:
which.min(errors)
[1] 7
coef(regfit.best,10)
(Intercept) AtBat Hits HmRun Walks CAtBat
71.8074075 -1.5038124 5.9130470 -11.5241809 8.4349759 -0.1654850
CRuns CRBI CWalks DivisionW PutOuts
1.7064330 0.7903694 -0.9107515 -109.5616997 0.2426078
plot(sqrt(errors),ylab = "Root MSE", ylim = c(200,430), pch = 19, type = 'b')
points(sqrt(regfit.best$rss[-1]/180), col = 'blue', pch = 19, type = 'b')
legend('topright', legend = c('Training', 'Validation'), col = c('blue', 'black'), pch = 19)
We will need to repeat this step of making predictions from observations that were not used to train a model, for each of the 19 model sizes. Since the leaps package provides no predict() method for regsubsets objects, we write the prediction function ourselves:
predict.regsubsets = function(object, newdata, id, ...){
  form = as.formula(object$call[[2]])  # recover the formula from the original call
  mat = model.matrix(form, newdata)    # model matrix for the new data
  coefi = coef(object, id = id)        # coefficients of the id-variable model
  xvars = names(coefi)                 # variables included in that model
  mat[,xvars] %*% coefi                # predictions
}
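The name predict.regsubsets is what lets R's S3 dispatch route a plain predict() call to this function for any object of class "regsubsets". A minimal illustration with a toy class (not part of the lab):

```r
# Toy S3 class: calling the generic predict() on an object of class
# "toymodel" dispatches to predict.toymodel, just as predict() on a
# regsubsets fit dispatches to the function defined above.
predict.toymodel <- function(object, newdata, ...) {
  object$slope * newdata
}
fit_toy <- structure(list(slope = 2), class = "toymodel")
predict(fit_toy, newdata = 3)  # returns 6 via predict.toymodel
```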
Now let's perform best subset selection on the full data set and extract the coefficients of the 10-variable model.
regfit.best = regsubsets(Salary~., data = Hitters, nvmax = 19)
coef(regfit.best,10)
(Intercept) AtBat Hits Walks CAtBat CRuns
162.5354420 -2.1686501 6.9180175 5.7732246 -0.1300798 1.4082490
CRBI CWalks DivisionW PutOuts Assists
0.7743122 -0.8308264 -112.3800575 0.2973726 0.2831680
Here we write a loop that performs k-fold cross-validation manually. In the jth fold, the elements of folds that equal j form the test set, and the remainder form the training set. For each model size i, we make predictions using the predict.regsubsets() function we just created, compute the test errors, and store them in cv.errors.
k = 10
folds = sample(rep(1:k, length = nrow(Hitters)), replace = TRUE)
cv.errors = matrix(NA, k, 19, dimnames = list(NULL, paste(1:19)))
for (j in 1:k){
best.fit = regsubsets(Salary~., data = Hitters[folds != j,], nvmax = 19)
for (i in 1:19){
pred = predict(best.fit, Hitters[folds == j,], id = i)
cv.errors[j,i] = mean((Hitters$Salary[folds == j] - pred)^2)
}
}
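One detail worth noting: folds is drawn with replacement, so the fold sizes are only balanced on average. A sketch of a fully balanced alternative (our own variation, not the code used above) shuffles a repeated 1:k vector without replacement:

```r
# Balanced fold assignment: repeat 1:k out to length n, then shuffle.
# Every fold ends up with floor(n/k) or ceiling(n/k) observations.
k <- 10
n <- 263  # nrow(Hitters) after na.omit()
folds_bal <- sample(rep(1:k, length.out = n))
table(folds_bal)  # fold sizes differ by at most 1
```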
This gives us a 10×19 matrix, in which the (i,j)th element is the test MSE for the ith cross-validation fold and the best j-variable model. We use the apply() function to average over the columns of this matrix, obtaining a vector whose jth element is the cross-validation error for the j-variable model.
mean.cv.errors = apply(cv.errors, 2, mean)
mean.cv.errors
1 2 3 4 5 6 7 8
148035.8 127232.2 139513.0 135696.2 139339.6 130314.8 131366.7 120977.3
9 10 11 12 13 14 15 16
122370.2 117907.8 123961.7 124218.7 126008.0 128049.7 128552.4 128669.6
17 18 19
128166.8 128516.0 128399.2
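To see what apply() is doing here, a small synthetic check (made-up numbers, not the Hitters errors): apply(m, 2, mean) returns one mean per column, matching colMeans(), and which.min() then picks the column (model size) with the lowest average error.

```r
# 2 "folds" (rows) x 3 "model sizes" (columns) of made-up errors
m <- matrix(c(4, 2, 9, 1, 3, 5), nrow = 2)
avg <- apply(m, 2, mean)   # one mean per column: 3, 5, 4
stopifnot(identical(avg, colMeans(m)))
which.min(avg)  # 1: the first "model size" has the lowest mean error
```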
plot(mean.cv.errors, type = 'b')
Cross-validation selects an 11-variable model, so let's perform best subset selection on the full data set in order to obtain this 11-variable model.
reg.best = regsubsets(Salary~., data = Hitters, nvmax = 19)
coef(reg.best, 11)
(Intercept) AtBat Hits Walks CAtBat CRuns
135.7512195 -2.1277482 6.9236994 5.6202755 -0.1389914 1.4553310
CRBI CWalks LeagueN DivisionW PutOuts Assists
0.7852528 -0.8228559 43.1116152 -111.1460252 0.2894087 0.2688277