Best Subset Selection

Introduction



Hello! In this notebook we are going to cover Best Subset Selection. In Best Subset Selection we do exactly what the name suggests - select the subset of predictors that gives our model the best outcome. This is an exhaustive search, and it is far more computationally demanding than forward or backward stepwise selection. Still, it is an excellent fundamental technique for evaluating predictors and may provide insight.

In best subset selection, we fit a separate least squares regression for each possible combination of the predictors in our data, which means fitting 2^p models for p predictors. We can also apply best subset selection to logistic regression models; instead of judging the best subset by the usual RSS, we look at deviance.
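
To make the 2^p growth concrete, here is a quick back-of-the-envelope check in R (an illustrative aside of my own, not part of the walkthrough below):

p = 19 # predictors in Hitters once Salary (the response) is set aside
choose(p, 2) # 171 distinct two-variable models alone
2^p # 524288 candidate models in total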

Data



Here we begin by loading the ISLR library and its Hitters dataset, which contains baseball player statistics. Let's take a look at the summary below.

library(ISLR)
summary(Hitters) #59 missing values
     AtBat            Hits         HmRun            Runs       
 Min.   : 16.0   Min.   :  1   Min.   : 0.00   Min.   :  0.00  
 1st Qu.:255.2   1st Qu.: 64   1st Qu.: 4.00   1st Qu.: 30.25  
 Median :379.5   Median : 96   Median : 8.00   Median : 48.00  
 Mean   :380.9   Mean   :101   Mean   :10.77   Mean   : 50.91  
 3rd Qu.:512.0   3rd Qu.:137   3rd Qu.:16.00   3rd Qu.: 69.00  
 Max.   :687.0   Max.   :238   Max.   :40.00   Max.   :130.00  
                                                               
      RBI             Walks            Years            CAtBat       
 Min.   :  0.00   Min.   :  0.00   Min.   : 1.000   Min.   :   19.0  
 1st Qu.: 28.00   1st Qu.: 22.00   1st Qu.: 4.000   1st Qu.:  816.8  
 Median : 44.00   Median : 35.00   Median : 6.000   Median : 1928.0  
 Mean   : 48.03   Mean   : 38.74   Mean   : 7.444   Mean   : 2648.7  
 3rd Qu.: 64.75   3rd Qu.: 53.00   3rd Qu.:11.000   3rd Qu.: 3924.2  
 Max.   :121.00   Max.   :105.00   Max.   :24.000   Max.   :14053.0  
                                                                     
     CHits            CHmRun           CRuns             CRBI        
 Min.   :   4.0   Min.   :  0.00   Min.   :   1.0   Min.   :   0.00  
 1st Qu.: 209.0   1st Qu.: 14.00   1st Qu.: 100.2   1st Qu.:  88.75  
 Median : 508.0   Median : 37.50   Median : 247.0   Median : 220.50  
 Mean   : 717.6   Mean   : 69.49   Mean   : 358.8   Mean   : 330.12  
 3rd Qu.:1059.2   3rd Qu.: 90.00   3rd Qu.: 526.2   3rd Qu.: 426.25  
 Max.   :4256.0   Max.   :548.00   Max.   :2165.0   Max.   :1659.00  
                                                                     
     CWalks        League  Division    PutOuts          Assists     
 Min.   :   0.00   A:175   E:157    Min.   :   0.0   Min.   :  0.0  
 1st Qu.:  67.25   N:147   W:165    1st Qu.: 109.2   1st Qu.:  7.0  
 Median : 170.50                    Median : 212.0   Median : 39.5  
 Mean   : 260.24                    Mean   : 288.9   Mean   :106.9  
 3rd Qu.: 339.25                    3rd Qu.: 325.0   3rd Qu.:166.0  
 Max.   :1566.00                    Max.   :1378.0   Max.   :492.0  
                                                                    
     Errors          Salary       NewLeague
 Min.   : 0.00   Min.   :  67.5   A:176    
 1st Qu.: 3.00   1st Qu.: 190.0   N:146    
 Median : 6.00   Median : 425.0            
 Mean   : 8.04   Mean   : 535.9            
 3rd Qu.:11.00   3rd Qu.: 750.0            
 Max.   :32.00   Max.   :2460.0            
                 NA's   :59                



Thanks to our summary, we have found some missing values, or NA fields - 59 of them, all in Salary. Below we remove them easily with the na.omit() function, which leaves us a clean dataset to practice subset selection on.

Hitters = na.omit(Hitters) 
# with(Hitters, sum(is.na(Salary))) #no NAs
names(Hitters)
 [1] "AtBat"     "Hits"      "HmRun"     "Runs"      "RBI"       "Walks"    
 [7] "Years"     "CAtBat"    "CHits"     "CHmRun"    "CRuns"     "CRBI"     
[13] "CWalks"    "League"    "Division"  "PutOuts"   "Assists"   "Errors"   
[19] "Salary"    "NewLeague"
dim(Hitters)
[1] 263  20



Best Subset Regression



Now we will use the leaps package to perform and evaluate a best subset model. The regsubsets() function carries out the subset selection, identifying the best model of each size by residual sum of squares (RSS). Compare the syntax to the lm() function and you will find them similar.

library(leaps)
regfit.full = regsubsets(Salary~., data = Hitters)
summary(regfit.full) 
Subset selection object
Call: regsubsets.formula(Salary ~ ., data = Hitters)
19 Variables  (and intercept)
           Forced in Forced out
AtBat          FALSE      FALSE
Hits           FALSE      FALSE
HmRun          FALSE      FALSE
Runs           FALSE      FALSE
RBI            FALSE      FALSE
Walks          FALSE      FALSE
Years          FALSE      FALSE
CAtBat         FALSE      FALSE
CHits          FALSE      FALSE
CHmRun         FALSE      FALSE
CRuns          FALSE      FALSE
CRBI           FALSE      FALSE
CWalks         FALSE      FALSE
LeagueN        FALSE      FALSE
DivisionW      FALSE      FALSE
PutOuts        FALSE      FALSE
Assists        FALSE      FALSE
Errors         FALSE      FALSE
NewLeagueN     FALSE      FALSE
1 subsets of each size up to 8
Selection Algorithm: exhaustive
         AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
1  ( 1 ) " "   " "  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
2  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
3  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
4  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
5  ( 1 ) "*"   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
6  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    " "   "*" 
7  ( 1 ) " "   "*"  " "   " "  " " "*"   " "   "*"    "*"   "*"    " "   " " 
8  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   " "    " "   "*"    "*"   " " 
         CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
1  ( 1 ) " "    " "     " "       " "     " "     " "    " "       
2  ( 1 ) " "    " "     " "       " "     " "     " "    " "       
3  ( 1 ) " "    " "     " "       "*"     " "     " "    " "       
4  ( 1 ) " "    " "     "*"       "*"     " "     " "    " "       
5  ( 1 ) " "    " "     "*"       "*"     " "     " "    " "       
6  ( 1 ) " "    " "     "*"       "*"     " "     " "    " "       
7  ( 1 ) " "    " "     "*"       "*"     " "     " "    " "       
8  ( 1 ) "*"    " "     "*"       "*"     " "     " "    " "       

An asterisk indicates that a given variable is included in the corresponding model. For example, row 2 of the matrix describes the best two-variable model, and its asterisks show that this model contains only Hits and CRBI. By default, the regsubsets() function only looks at models of up to 8 variables. To reiterate, what this output has now told us is that the best two-variable subset of predictors for Salary is Hits and CRBI. This is wonderful! Look at the insight that we can now take and drill deeper into. (Why Hits and CRBI?)
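
One way to drill deeper is to refit that two-variable model with lm() and inspect it directly (a quick follow-up sketch of my own; the name fit2 is arbitrary):

fit2 = lm(Salary ~ Hits + CRBI, data = Hitters) # refit the best two-variable subset
summary(fit2) # coefficients, standard errors, and R-squared for this subset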



In the code below we expand the best subset selection search with the nvmax parameter, setting the maximum number of variables evaluated to 19 - that is, all of our predictors. We also assign the summary of this 19-variable search to an object for review later.

regfit.full = regsubsets(Salary~., data = Hitters, nvmax = 19)
reg.summary = summary(regfit.full)
names(reg.summary)
[1] "which"  "rsq"    "rss"    "adjr2"  "cp"     "bic"    "outmat" "obj"   



Below we see some results from the latest subset selection model. The R-squared statistic increases from 32% with 1 variable to roughly 55% when all 19 variables are included. This is expected, since R-squared never decreases as more variables are added.

reg.summary$rsq
 [1] 0.3214501 0.4252237 0.4514294 0.4754067 0.4908036 0.5087146 0.5141227
 [8] 0.5285569 0.5346124 0.5404950 0.5426153 0.5436302 0.5444570 0.5452164
[15] 0.5454692 0.5457656 0.5459518 0.5460945 0.5461159
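
We can confirm that monotone increase directly (a one-line sanity check of my own, not part of the walkthrough):

all(diff(reg.summary$rsq) >= 0) # TRUE: R-squared never drops as variables are added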

RSS and Adjusted RSq Graphs



Let's plot the 19-variable search from just above and see how RSS and adjusted R-squared react to the increasing number of variables. We can see that adjusted R-squared rises as expected while RSS falls and levels off. The which.max() call shows that adjusted R-squared peaks at 11 variables, and the points() call marks that peak in red.


plot(reg.summary$rss, xlab = "Number of Variables", ylab = "RSS", type = "l")

plot(reg.summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted RSq", type = "l")

which.max(reg.summary$adjr2) # model size with the highest adjusted R-squared
[1] 11
points(11, reg.summary$adjr2[11], col = "red", cex = 2, pch = 20) # mark the peak in red
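
If you prefer to see the two curves side by side, here is a small sketch using base graphics (the two-panel layout is my own choice):

par(mfrow = c(1, 2)) # two panels in one row
plot(reg.summary$rss, xlab = "Number of Variables", ylab = "RSS", type = "l")
plot(reg.summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted RSq", type = "l")
points(which.max(reg.summary$adjr2), max(reg.summary$adjr2), col = "red", cex = 2, pch = 20)
par(mfrow = c(1, 1)) # reset the plotting layout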

Cp and BIC Graphs



BIC is an excellent statistic for finding the optimal model, as it places a heavier penalty on model size than Cp or adjusted R-squared. Below we plot Mallows' Cp, which is minimized at 10 variables, and then BIC, which is minimized at 6, marking each minimum in red.

plot(reg.summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
which.min(reg.summary$cp) # model size with the lowest Cp
[1] 10
points(10, reg.summary$cp[10], col = "red", cex = 2, pch = 20)

which.min(reg.summary$bic) # model size with the lowest BIC
[1] 6

plot(reg.summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
points(6, reg.summary$bic[6], col = "red", cex = 2, pch = 20)
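
It can be handy to collect the model size chosen by each criterion in one place (a small convenience snippet of my own):

c(adjr2 = which.max(reg.summary$adjr2), # 11
  cp = which.min(reg.summary$cp), # 10
  bic = which.min(reg.summary$bic)) # 6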

regsubsets() Graphs



Each graph is just a visual representation of the summary output matrix we saw earlier with the asterisks. A black square is filled in for each variable included in the model achieving the given value of the statistic, and the best model according to that statistic sits in the top row. So, for each model and its associated statistic, we can see at a glance which variables are selected or omitted - this is just a nice way to visualize it.

plot(regfit.full, scale = "r2")

plot(regfit.full, scale = "adjr2")

plot(regfit.full, scale = "Cp")

plot(regfit.full, scale = "bic")

Optimal Model



Finally, this is how to output the coefficients associated with one of the best models shown visually in the graphs above. In particular, we want the model with the lowest BIC - the six-variable model - so we pass 6 to coef().


coef(regfit.full, 6)
 (Intercept)        AtBat         Hits        Walks         CRBI    DivisionW 
  91.5117981   -1.8685892    7.6043976    3.6976468    0.6430169 -122.9515338 
     PutOuts 
   0.2643076
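
The same call works for the model sizes favored by the other criteria; for instance (output omitted here):

coef(regfit.full, 10) # the Cp-optimal model
coef(regfit.full, 11) # the adjusted R-squared-optimal model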