Hello! In this notebook we are going to cover Best Subset Selection. In best subset selection we do exactly what the name suggests: select the subset of predictors that gives our model the best outcome. This is an exhaustive process and is computationally expensive compared to forward and backward stepwise selection, but it is an excellent fundamental method for evaluating predictors and can provide real insight. In best subset selection we fit a separate least squares regression for each possible combination of the predictors in our data, so with p predictors we fit 2^p models. We can also apply best subset selection to logistic regression; instead of ranking subsets by the usual RSS, we look at the deviance.
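To get a feel for the 2^p growth and for how the logistic version works, here is a small illustrative sketch. It is not part of the Hitters analysis below: the toy data frame and variable names are invented for illustration, and regsubsets() will do the least-squares version of this search for us later. As in least squares, the deviance is only used to pick the best subset of each size; sizes are then compared with a penalized criterion like AIC or BIC.
p = c(5, 10, 19)
2^p  # 32, 1024, 524288 candidate models
# Toy sketch of the logistic case: for each model size k, keep the subset
# with the lowest deviance (the role RSS plays in least squares).
set.seed(1)
toy = data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
toy$y = rbinom(100, 1, plogis(toy$x1 - toy$x2))
vars = c("x1", "x2", "x3")
best_by_size = lapply(seq_along(vars), function(k) {
  cands = combn(vars, k, simplify = FALSE)   # all subsets of size k
  devs = sapply(cands, function(s) {
    deviance(glm(reformulate(s, response = "y"),
                 data = toy, family = binomial))
  })
  cands[[which.min(devs)]]                   # best subset of size k
})
best_by_size  # compare across sizes with AIC/BIC, not raw deviance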
Here we begin by loading the ISLR library and the Hitters dataset, which contains baseball statistics. Let's take a look at the summary below.
library(ISLR)
summary(Hitters) #59 missing values
AtBat Hits HmRun Runs
Min. : 16.0 Min. : 1 Min. : 0.00 Min. : 0.00
1st Qu.:255.2 1st Qu.: 64 1st Qu.: 4.00 1st Qu.: 30.25
Median :379.5 Median : 96 Median : 8.00 Median : 48.00
Mean :380.9 Mean :101 Mean :10.77 Mean : 50.91
3rd Qu.:512.0 3rd Qu.:137 3rd Qu.:16.00 3rd Qu.: 69.00
Max. :687.0 Max. :238 Max. :40.00 Max. :130.00
RBI Walks Years CAtBat
Min. : 0.00 Min. : 0.00 Min. : 1.000 Min. : 19.0
1st Qu.: 28.00 1st Qu.: 22.00 1st Qu.: 4.000 1st Qu.: 816.8
Median : 44.00 Median : 35.00 Median : 6.000 Median : 1928.0
Mean : 48.03 Mean : 38.74 Mean : 7.444 Mean : 2648.7
3rd Qu.: 64.75 3rd Qu.: 53.00 3rd Qu.:11.000 3rd Qu.: 3924.2
Max. :121.00 Max. :105.00 Max. :24.000 Max. :14053.0
CHits CHmRun CRuns CRBI
Min. : 4.0 Min. : 0.00 Min. : 1.0 Min. : 0.00
1st Qu.: 209.0 1st Qu.: 14.00 1st Qu.: 100.2 1st Qu.: 88.75
Median : 508.0 Median : 37.50 Median : 247.0 Median : 220.50
Mean : 717.6 Mean : 69.49 Mean : 358.8 Mean : 330.12
3rd Qu.:1059.2 3rd Qu.: 90.00 3rd Qu.: 526.2 3rd Qu.: 426.25
Max. :4256.0 Max. :548.00 Max. :2165.0 Max. :1659.00
CWalks League Division PutOuts Assists
Min. : 0.00 A:175 E:157 Min. : 0.0 Min. : 0.0
1st Qu.: 67.25 N:147 W:165 1st Qu.: 109.2 1st Qu.: 7.0
Median : 170.50 Median : 212.0 Median : 39.5
Mean : 260.24 Mean : 288.9 Mean :106.9
3rd Qu.: 339.25 3rd Qu.: 325.0 3rd Qu.:166.0
Max. :1566.00 Max. :1378.0 Max. :492.0
Errors Salary NewLeague
Min. : 0.00 Min. : 67.5 A:176
1st Qu.: 3.00 1st Qu.: 190.0 N:146
Median : 6.00 Median : 425.0
Mean : 8.04 Mean : 535.9
3rd Qu.:11.00 3rd Qu.: 750.0
Max. :32.00 Max. :2460.0
NA's :59
Thanks to our summary, we have found some missing values (NA fields). Below we remove them easily with the na.omit() function, leaving a clean dataset to practice subset selection on.
Hitters = na.omit(Hitters)
# with(Hitters, sum(is.na(Salary))) #no NAs
names(Hitters)
[1] "AtBat" "Hits" "HmRun" "Runs" "RBI" "Walks"
[7] "Years" "CAtBat" "CHits" "CHmRun" "CRuns" "CRBI"
[13] "CWalks" "League" "Division" "PutOuts" "Assists" "Errors"
[19] "Salary" "NewLeague"
dim(Hitters)
[1] 263 20
Now we will use the leaps package to perform and evaluate a best subset model. The regsubsets() function performs the subset selection, identifying the best model of each size as measured by residual sum of squares (RSS). Compare the syntax to the lm() function and you will find them similar.
library(leaps)
regfit.full = regsubsets(Salary~., data = Hitters)
summary(regfit.full)
Subset selection object
Call: regsubsets.formula(Salary ~ ., data = Hitters)
19 Variables (and intercept)
Forced in Forced out
AtBat FALSE FALSE
Hits FALSE FALSE
HmRun FALSE FALSE
Runs FALSE FALSE
RBI FALSE FALSE
Walks FALSE FALSE
Years FALSE FALSE
CAtBat FALSE FALSE
CHits FALSE FALSE
CHmRun FALSE FALSE
CRuns FALSE FALSE
CRBI FALSE FALSE
CWalks FALSE FALSE
LeagueN FALSE FALSE
DivisionW FALSE FALSE
PutOuts FALSE FALSE
Assists FALSE FALSE
Errors FALSE FALSE
NewLeagueN FALSE FALSE
1 subsets of each size up to 8
Selection Algorithm: exhaustive
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " "*"
2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
4 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
5 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " " " "*"
6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " " "*"
7 ( 1 ) " " "*" " " " " " " "*" " " "*" "*" "*" " " " "
8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " "*" "*" " "
CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
1 ( 1 ) " " " " " " " " " " " " " "
2 ( 1 ) " " " " " " " " " " " " " "
3 ( 1 ) " " " " " " "*" " " " " " "
4 ( 1 ) " " " " "*" "*" " " " " " "
5 ( 1 ) " " " " "*" "*" " " " " " "
6 ( 1 ) " " " " "*" "*" " " " " " "
7 ( 1 ) " " " " "*" "*" " " " " " "
8 ( 1 ) "*" " " "*" "*" " " " " " "
An asterisk indicates that a given variable is included in the corresponding model. The row labeled 2 is the two-variable model, and its asterisks tell us that the best two-variable model contains only Hits and CRBI. By default the regsubsets() function only looks at models up to a maximum of 8 variables. To reiterate, what this output has told us is that the best pair of predictors for predicting Salary is Hits and CRBI. This is wonderful! Look at the insight that we can now take and drill deeper into. (Why is it Hits and CRBI?)
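One quick way to start drilling in is to refit that best two-variable model directly with lm() and inspect it; its R-squared should match the two-variable entry stored in the subset object (fit2 here is just a throwaway name for this check).
fit2 = lm(Salary ~ Hits + CRBI, data = Hitters)
coef(fit2)
summary(fit2)$r.squared  # should match summary(regfit.full)$rsq[2]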
In the code below we expand the best subset selection model by using the nvmax parameter, setting the number of variables evaluated to 19 (i.e., all of them). We also assign the summary of this 19-variable model to an object for review later.
regfit.full = regsubsets(Salary~., data = Hitters, nvmax = 19)
reg.summary = summary(regfit.full)
names(reg.summary)
[1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
Below we see some results from the latest subset selection model. The R-squared statistic increases from 32% with one variable to about 55% when all the variables are included. This is expected, as R-squared never decreases when more variables are added.
reg.summary$rsq
[1] 0.3214501 0.4252237 0.4514294 0.4754067 0.4908036 0.5087146 0.5141227
[8] 0.5285569 0.5346124 0.5404950 0.5426153 0.5436302 0.5444570 0.5452164
[15] 0.5454692 0.5457656 0.5459518 0.5460945 0.5461159
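As a quick sanity check on that claim, we can confirm the sequence never decreases:
all(diff(reg.summary$rsq) >= 0)  # TRUE: R-squared never drops as variables are added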
Let's plot the 19-variable model from just above and see how RSS and adjusted R^2 react to the increasing number of variables. We can see that adjusted R^2 rises and then levels off (peaking at 11 variables), while RSS falls and levels.
plot(reg.summary$rss, xlab = "Number of Variables", ylab = "RSS", type = "l")
plot(reg.summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted RSq", type = "l")
which.max(reg.summary$adjr2)
[1] 11
points(11, reg.summary$adjr2[11], col = "red", cex = 2, pch = 20)
Mallows' Cp and BIC are also excellent statistics for finding the optimal model. Below we graph both across all model sizes: Cp is minimized at 10 variables, while BIC is minimized at 6.
plot(reg.summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
which.min(reg.summary$cp)
[1] 10
points(10, reg.summary$cp[10], col = "red", cex = 2, pch = 20)
which.min(reg.summary$bic)
[1] 6
plot(reg.summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
points(6, reg.summary$bic[6], col = "red", cex = 2, pch = 20)
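If you would rather see all of these curves at once, one option (a convenience sketch, not required for the analysis) is to draw RSS, adjusted R^2, Cp, and BIC in a single 2x2 panel and mark each optimum; the helper variable best is introduced here just for this sketch.
par(mfrow = c(2, 2))
plot(reg.summary$rss, xlab = "Number of Variables", ylab = "RSS", type = "l")
plot(reg.summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted RSq", type = "l")
best = which.max(reg.summary$adjr2)  # adjusted R^2: bigger is better
points(best, reg.summary$adjr2[best], col = "red", cex = 2, pch = 20)
plot(reg.summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
best = which.min(reg.summary$cp)     # Cp: smaller is better
points(best, reg.summary$cp[best], col = "red", cex = 2, pch = 20)
plot(reg.summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
best = which.min(reg.summary$bic)    # BIC: smaller is better
points(best, reg.summary$bic[best], col = "red", cex = 2, pch = 20)
par(mfrow = c(1, 1))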
regsubsets() Graphs
Each graph below is just a visual representation of the summary output matrix we saw earlier with the asterisks: a black square is filled in for each variable selected in the best model at the given level of that statistic. So, for each model, and the associated statistic of that model, there are selected or omitted variables, and this is just a nice way to visualize it.
plot(regfit.full, scale = "r2")
plot(regfit.full, scale = "adjr2")
plot(regfit.full, scale = "Cp")
plot(regfit.full, scale = "bic")
This is how to output the coefficients associated with one of the best models shown visually in the graphs above. In particular, we are looking at the model with the lowest BIC, which is the 6-variable model.
coef(regfit.full, 6)
(Intercept) AtBat Hits Walks CRBI DivisionW
91.5117981 -1.8685892 7.6043976 3.6976468 0.6430169 -122.9515338
PutOuts
0.2643076
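Note that coef() gives point estimates only. If you also want standard errors and p-values, one option is to refit the chosen 6-variable model with lm(); Division enters as a factor, which produces the DivisionW dummy seen above (fit6 is just an illustrative name). Keep in mind that the resulting p-values do not account for the fact that these six variables were chosen by the selection procedure.
# Refit the BIC-selected 6-variable model to get standard inference output
fit6 = lm(Salary ~ AtBat + Hits + Walks + CRBI + Division + PutOuts,
          data = Hitters)
summary(fit6)  # coefficients should match coef(regfit.full, 6)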