Ridge Regression and Lasso Regression

Introduction

Neither ridge regression nor the lasso will universally dominate the other. In general, one might expect the lasso to perform better in a setting where a relatively small number of predictors have substantial coefficients, and the remaining predictors have coefficients that are very small or that equal zero. Ridge regression will perform better when the response is a function of many predictors, all with coefficients of roughly equal size. However, the number of predictors that is related to the response is never known a priori for real data sets. A technique such as cross-validation can be used in order to determine which approach is better on a particular data set. As with ridge regression, when the least squares estimates have excessively high variance, the lasso solution can yield a reduction in variance at the expense of a small increase in bias, and consequently can generate more accurate predictions. Unlike ridge regression, the lasso performs variable selection, and hence results in models that are easier to interpret. There are very efficient algorithms for fitting both ridge and lasso models; in both cases the entire coefficient paths can be computed with about the same amount of work as a single least squares fit.

We use ridge regression because it often outperforms both ordinary least squares and a null (intercept-only) model, especially when the least squares estimates have high variance. This is an important consideration when deciding which models to try for a given problem, and knowing it can narrow the search for the best candidate model.

We use the lasso for a similar reason; however, the lasso can either outperform ridge or, at comparable accuracy, provide a more interpretable solution, because it sets some coefficients exactly to zero. This matters when explaining results to decision makers, or when making the decision yourself, because a sparse, understandable model lends authority and trust to the recommendations being made.

Reference: ISLR, p. 215 onward.

Data



Below we clean up the data and assign x and y variables for convenience. The x variable contains a matrix of all the columns except Salary; the y variable contains the Salary column. We set these up because glmnet() requires a specific form of input: a matrix for x and a vector for y. The model.matrix() function is useful not only for creating x, but also because it automatically transforms any qualitative variables into dummy variables for us. This is important because glmnet() only accepts numerical, quantitative inputs.

library(glmnet)
Loading required package: Matrix
Loaded glmnet 3.0-2
library(ISLR)
Hitters = na.omit(Hitters)
x = model.matrix(Salary ~ ., data = Hitters)  # dummy-codes the factors; the intercept column is kept in x
head(x)
                  (Intercept) AtBat Hits HmRun Runs RBI Walks Years CAtBat
-Alan Ashby                 1   315   81     7   24  38    39    14   3449
-Alvin Davis                1   479  130    18   66  72    76     3   1624
-Andre Dawson               1   496  141    20   65  78    37    11   5628
-Andres Galarraga           1   321   87    10   39  42    30     2    396
-Alfredo Griffin            1   594  169     4   74  51    35    11   4408
-Al Newman                  1   185   37     1   23   8    21     2    214
                  CHits CHmRun CRuns CRBI CWalks LeagueN DivisionW PutOuts
-Alan Ashby         835     69   321  414    375       1         1     632
-Alvin Davis        457     63   224  266    263       0         1     880
-Andre Dawson      1575    225   828  838    354       1         0     200
-Andres Galarraga   101     12    48   46     33       1         0     805
-Alfredo Griffin   1133     19   501  336    194       0         1     282
-Al Newman           42      1    30    9     24       1         0      76
                  Assists Errors NewLeagueN
-Alan Ashby            43     10          1
-Alvin Davis           82     14          0
-Andre Dawson          11      3          1
-Andres Galarraga      40      4          1
-Alfredo Griffin      421     25          0
-Al Newman            127      7          0
y = Hitters$Salary
head(y)
[1] 475.0 480.0 500.0  91.5 750.0  70.0
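
As a quick sanity check (a short sketch using only the objects defined above), we can confirm that x is a fully numeric matrix and see which columns model.matrix() created from the factors League, Division, and NewLeague:

dim(x)                                # should be 263 rows by 20 columns (19 predictors plus the intercept column)
is.numeric(x)                         # TRUE - glmnet() requires a numeric matrix
setdiff(colnames(x), names(Hitters))  # the columns model.matrix() added: the intercept and the dummy variables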

Simple Ridge



The first task is to fit a ridge regression model. In order to fit a ridge regression we need to call glmnet() with the argument alpha = 0. There is also a cv.glmnet() function that will perform the cross-validation for us.

Ridge regression is a good model for reducing the influence that individual variables have on the outcome. Rather than selecting variables, it penalizes the size of every coefficient through the tuning parameter lambda: the larger the lambda, the larger the penalty, and the smaller the coefficients become, which dampens variables that carry little information about the response, Salary. By default, glmnet() performs a ridge regression for an automatically selected range of lambda values. The plot of the fit below shows how the coefficients shrink toward zero (without ever reaching it exactly) as lambda increases, and the short sketch after the plot shows the same effect numerically.

fit.ridge = glmnet(x, y, alpha = 0)
plot(fit.ridge, xvar = 'lambda') 
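
To see the shrinkage numerically, here is a quick sketch building on fit.ridge from above; the two lambda values are arbitrary choices for illustration:

coef(fit.ridge, s = 10)      # small penalty: coefficients stay close to the least squares fit
coef(fit.ridge, s = 10000)   # large penalty: every coefficient is heavily shrunken, but none is exactly zero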



Now we use cross-validation, which evaluates the ridge fit over a grid of lambda values and identifies the best one for us. The plot shows how the cross-validated MSE changes as lambda increases.

cv.ridge = cv.glmnet(x,y,alpha = 0) #k-fold cross validation
plot(cv.ridge)
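
The two vertical dotted lines in the plot mark lambda.min (the lambda with the smallest cross-validated MSE) and lambda.1se (the largest lambda whose error is within one standard error of that minimum). A small sketch pulling these values out of the cv.ridge object:

cv.ridge$lambda.min   # lambda giving the lowest cross-validated MSE
cv.ridge$lambda.1se   # a more conservative (more regularized) choice
min(cv.ridge$cvm)     # the cross-validated MSE achieved at lambda.min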

Finally, we extract the coefficients at the best lambda found by cross-validation (lambda.min) to get the full model. Take a close look to see which variables have been shrunk close to zero. The second (Intercept) row, shown as a dot, is the intercept column that model.matrix() added to x; glmnet() fits its own intercept and leaves that column at zero.

predict(cv.ridge, s = cv.ridge$lambda.min, type = 'coefficients')
21 x 1 sparse Matrix of class "dgCMatrix"
                        1
(Intercept)  8.112693e+01
(Intercept)  .           
AtBat       -6.815959e-01
Hits         2.772312e+00
HmRun       -1.365680e+00
Runs         1.014826e+00
RBI          7.130225e-01
Walks        3.378558e+00
Years       -9.066800e+00
CAtBat      -1.199478e-03
CHits        1.361029e-01
CHmRun       6.979958e-01
CRuns        2.958896e-01
CRBI         2.570711e-01
CWalks      -2.789666e-01
LeagueN      5.321272e+01
DivisionW   -1.228345e+02
PutOuts      2.638876e-01
Assists      1.698796e-01
Errors      -3.685645e+00
NewLeagueN  -1.810510e+01
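
For comparison, here is a sketch of the same model at the more heavily penalized lambda.1se; the coefficients are shrunken further, though still not exactly zero:

predict(cv.ridge, s = 'lambda.1se', type = 'coefficients')   # more shrinkage than at lambda.min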

Simple Lasso

Lasso model: next, we use alpha = 1. Since alpha = 1 is the default in glmnet(), we do not have to specify it in the function call. We follow the same steps with the lasso below as we did with ridge previously.

fit.lasso = glmnet(x,y)
plot(fit.lasso, xvar = 'lambda')
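
Unlike ridge, the lasso drops variables entirely as lambda grows. A quick sketch of how many coefficients remain nonzero along the fitted path, using the df component stored in the glmnet object:

cbind(lambda = fit.lasso$lambda, nonzero = fit.lasso$df)   # the nonzero coefficient count shrinks as lambda increases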

Now, just as previously, we fit a cross-validated model for the best lambda.

cv.lasso = cv.glmnet(x,y)
plot(cv.lasso)
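
As with ridge, the chosen lambda values can be read directly off the cross-validation object; a small sketch:

cv.lasso$lambda.min                # lambda with the lowest cross-validated MSE
cv.lasso$lambda.1se                # the more regularized 1-standard-error choice
coef(cv.lasso, s = 'lambda.1se')   # a sparser model than at lambda.min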

What if we want to use a train and validation set approach to select the lambda for the lasso method? The answer is below, and it is good practice whenever we want an honest estimate of test error. We select lambda by cross-validation on the training rows, make predictions on the held-out rows using that lambda, and then show the coefficients picked as a result. The dots represent the variables that our model reduced to exactly zero.

set.seed(1)
train = sample(seq(263), 180, replace = F)    # 180 training rows; the remaining 83 form the validation set
lasso.cv.tr = cv.glmnet(x[train,], y[train])  # choose lambda by cross-validation on the training rows only
pred = predict(fit.lasso, s = lasso.cv.tr$lambda.min, newx = x[-train,])  # note: fit.lasso was fit on all rows; a stricter split would refit on x[train,] only

mean((pred - y[-train])^2)
[1] 114318.9
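
To put that validation MSE of roughly 114,000 in context, here is a sketch comparing it against a null model that simply predicts the mean training salary for every held-out player:

mean((mean(y[train]) - y[-train])^2)   # validation MSE of the intercept-only (null) baseline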


predcoef = predict(fit.lasso, s = lasso.cv.tr$lambda.min, type = 'coefficients')  # newx is not needed when requesting coefficients
predcoef
21 x 1 sparse Matrix of class "dgCMatrix"
                       1
(Intercept)   17.0661446
(Intercept)    .        
AtBat          .        
Hits           1.8794128
HmRun          .        
Runs           .        
RBI            .        
Walks          2.2261427
Years          .        
CAtBat         .        
CHits          .        
CHmRun         .        
CRuns          0.2081632
CRBI           0.4136293
CWalks         .        
LeagueN        3.3287982
DivisionW   -104.7448513
PutOuts        0.2222622
Assists        .        
Errors         .        
NewLeagueN     .
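
Finally, as the introduction notes, cross-validation is how we decide between ridge and the lasso on a particular data set. A closing sketch comparing the minimum cross-validated MSE achieved by each of the fits above; the lower value points to the better-performing method here, though the exact numbers vary with the random folds:

min(cv.ridge$cvm)   # best cross-validated MSE for ridge
min(cv.lasso$cvm)   # best cross-validated MSE for the lasso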