Neither ridge regression nor the lasso will universally dominate the other. In general, one might expect the lasso to perform better in a setting where a relatively small number of predictors have substantial coefficients, and the remaining predictors have coefficients that are very small or that equal zero. Ridge regression will perform better when the response is a function of many predictors, all with coefficients of roughly equal size. However, the number of predictors that is related to the response is never known a priori for real data sets. A technique such as cross-validation can be used in order to determine which approach is better on a particular data set. As with ridge regression, when the least squares estimates have excessively high variance, the lasso solution can yield a reduction in variance at the expense of a small increase in bias, and consequently can generate more accurate predictions. Unlike ridge regression, the lasso performs variable selection, and hence results in models that are easier to interpret. There are very efficient algorithms for fitting both ridge and lasso models; in both cases the entire coefficient paths can be computed with about the same amount of work as a single least squares fit.
We use ridge regression because it often outperforms both least squares and the null model on many data sets. This is an important consideration when deciding which models to try for a given problem, and knowing it can narrow the search for the best combination of models.
We usually use the lasso for a similar reason; however, the lasso can either outperform ridge regression or provide a more interpretable solution. This is important when explaining results to decision makers, or when making the decision yourself, because it lends authority and trust to the recommendations being made.
Reference: ISLR, page 215 onward.
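For reference, the standard formulation from ISLR is restated here: both methods minimize the residual sum of squares plus a penalty controlled by lambda, and only the form of the penalty differs.

$$\sum_{i=1}^{n}\Big(y_i-\beta_0-\sum_{j=1}^{p}\beta_j x_{ij}\Big)^2+\lambda\sum_{j=1}^{p}\beta_j^2 \;\;\text{(ridge)}, \qquad \sum_{i=1}^{n}\Big(y_i-\beta_0-\sum_{j=1}^{p}\beta_j x_{ij}\Big)^2+\lambda\sum_{j=1}^{p}|\beta_j| \;\;\text{(lasso)}.$$

The squared penalty shrinks every coefficient toward zero but never exactly to zero, while the absolute-value penalty can set coefficients exactly to zero, which is where the lasso's variable selection comes from.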
Below we clean up the data quickly and assign x and y variables for ease. The x variable contains a matrix of all the columns except Salary, and the y variable contains the Salary column. We use these variables because the syntax that glmnet() takes is specific: a matrix for x and a vector for y. The model.matrix() function is useful not only for creating x, but also because it automatically transforms any qualitative variables into dummy variables for us. This is important because glmnet() only takes numerical, quantitative inputs.
library(glmnet)
Loading required package: Matrix
Loaded glmnet 3.0-2
library(ISLR)
Hitters = na.omit(Hitters)  # drop players with a missing Salary
x = model.matrix(Salary ~ ., data = Hitters)  # design matrix; qualitative variables become dummies
head(x)
(Intercept) AtBat Hits HmRun Runs RBI Walks Years CAtBat
-Alan Ashby 1 315 81 7 24 38 39 14 3449
-Alvin Davis 1 479 130 18 66 72 76 3 1624
-Andre Dawson 1 496 141 20 65 78 37 11 5628
-Andres Galarraga 1 321 87 10 39 42 30 2 396
-Alfredo Griffin 1 594 169 4 74 51 35 11 4408
-Al Newman 1 185 37 1 23 8 21 2 214
CHits CHmRun CRuns CRBI CWalks LeagueN DivisionW PutOuts
-Alan Ashby 835 69 321 414 375 1 1 632
-Alvin Davis 457 63 224 266 263 0 1 880
-Andre Dawson 1575 225 828 838 354 1 0 200
-Andres Galarraga 101 12 48 46 33 1 0 805
-Alfredo Griffin 1133 19 501 336 194 0 1 282
-Al Newman 42 1 30 9 24 1 0 76
Assists Errors NewLeagueN
-Alan Ashby 43 10 1
-Alvin Davis 82 14 0
-Andre Dawson 11 3 1
-Andres Galarraga 40 4 1
-Alfredo Griffin 421 25 0
-Al Newman 127 7 0
y = Hitters$Salary
head(y)
[1] 475.0 480.0 500.0 91.5 750.0 70.0
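As a quick sanity check of the requirements described above, here is a small sketch that only prints the classes and dimensions of the objects just created:
class(x)    # should be a numeric matrix
dim(x)
class(y)    # should be a numeric vector
length(y)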
The first task is to fit a ridge regression model. In order to fit a ridge regression we need to call glmnet() with the parameter alpha = 0. There is also a cv.glmnet() function that will perform the cross-validation for us.
Ridge regression is a good model for reducing the influence that individual variables have on the outcome. The penalty is controlled by lambda: the larger the lambda, the larger the penalty and the smaller the coefficients become, which shrinks all of the coefficients toward zero and particularly dampens variables with only weak relationships to the response. By default, glmnet() performs ridge regression over an automatically selected range of lambda values. The plot of the fit below shows how the coefficients shrink toward zero as lambda increases.
fit.ridge = glmnet(x, y, alpha = 0)  # alpha = 0 gives the ridge penalty
plot(fit.ridge, xvar = 'lambda')
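To see the shrinkage numerically rather than only in the plot, we can extract coefficients from the fitted path at two lambda values; the values 10000 and 10 below are arbitrary choices for illustration, not values used elsewhere in this analysis.
# Coefficients at a large lambda (heavy shrinkage) next to a small lambda (mild shrinkage)
cbind(coef(fit.ridge, s = 10000), coef(fit.ridge, s = 10))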
Now we use cross-validation: cv.glmnet() chooses a grid of lambda values, fits the ridge regression for each, and estimates the test MSE by k-fold cross-validation. The plot shows the cross-validated MSE as lambda increases.
cv.ridge = cv.glmnet(x,y,alpha = 0) #k-fold cross validation
plot(cv.ridge)
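The cross-validation object stores the selected lambda values directly; a small sketch for inspecting them (lambda.min minimizes the CV error, lambda.1se is the largest lambda within one standard error of that minimum):
cv.ridge$lambda.min   # lambda with the lowest cross-validated MSE
cv.ridge$lambda.1se   # a more heavily penalized, simpler alternative
min(cv.ridge$cvm)     # the corresponding minimum CV estimate of the MSE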
Finally, we extract the coefficients at the lambda that minimizes the cross-validated MSE (lambda.min). Take a close look to see which coefficients have been shrunk close to zero. Note that the second (Intercept) row below corresponds to the redundant intercept column created by model.matrix(); glmnet() fits its own intercept, so that column receives a coefficient of zero (shown as a dot).
predict(cv.ridge, s = cv.ridge$lambda.min, type = 'coefficients')
21 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 8.112693e+01
(Intercept) .
AtBat -6.815959e-01
Hits 2.772312e+00
HmRun -1.365680e+00
Runs 1.014826e+00
RBI 7.130225e-01
Walks 3.378558e+00
Years -9.066800e+00
CAtBat -1.199478e-03
CHits 1.361029e-01
CHmRun 6.979958e-01
CRuns 2.958896e-01
CRBI 2.570711e-01
CWalks -2.789666e-01
LeagueN 5.321272e+01
DivisionW -1.228345e+02
PutOuts 2.638876e-01
Assists 1.698796e-01
Errors -3.685645e+00
NewLeagueN -1.810510e+01
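Before moving on, it is worth confirming the point that ridge regression does not perform variable selection. This sketch simply counts the nonzero coefficients in the output above (only the redundant second intercept row is zero):
# Every predictor keeps a nonzero coefficient under ridge
sum(predict(cv.ridge, s = cv.ridge$lambda.min, type = 'coefficients') != 0)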
Lasso Model

Next, we use alpha = 1. Since alpha = 1 is the default in glmnet(), we do not have to specify it in the function call. We then follow the same steps with the lasso as we did with ridge.
fit.lasso = glmnet(x, y)  # alpha = 1 (the lasso) is the default
plot(fit.lasso, xvar = 'lambda')
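One way to see the lasso's variable selection along the path is to plot how many coefficients are nonzero at each lambda; this sketch uses the df component that glmnet stores with the fit.
# Number of nonzero coefficients at each lambda along the lasso path
plot(log(fit.lasso$lambda), fit.lasso$df, type = 's',
     xlab = 'log(lambda)', ylab = 'number of nonzero coefficients')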
Now, just as before, we fit a cross-validated model to find the best lambda.
cv.lasso = cv.glmnet(x,y)
plot(cv.lasso)
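As noted at the top of this section, cross-validation can also be used to decide which approach is better on this particular data set. A minimal sketch is to compare the minimum cross-validated MSE of the ridge fit with that of the lasso fit (both objects already exist at this point):
min(cv.ridge$cvm)   # best cross-validated MSE for ridge
min(cv.lasso$cvm)   # best cross-validated MSE for the lasso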
What if we want to use a train and validation set approach to select the lambda for the lasso? The answer is below. We hold out part of the data, run the cross-validation on the training rows only, and then predict on the held-out rows using the best lambda from that cross-validation, reporting the validation MSE. Finally, we show the coefficients of the lasso fit at that lambda; the dots represent the variables that the lasso shrank exactly to zero.
set.seed(1)
train = sample(seq(263), 180, replace = F)  # 263 complete cases in Hitters; the rest are held out for validation
lasso.cv.tr = cv.glmnet(x[train,], y[train])  # cross-validate on the training rows only
pred = predict(fit.lasso, s = lasso.cv.tr$lambda.min, newx = x[-train,])  # predict on the held-out rows
mean((pred - y[-train])^2)
[1] 114318.9
predcoef = predict(fit.lasso, s = lasso.cv.tr$lambda.min, type = 'coefficients')  # newx is not needed when requesting coefficients
predcoef
21 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 17.0661446
(Intercept) .
AtBat .
Hits 1.8794128
HmRun .
Runs .
RBI .
Walks 2.2261427
Years .
CAtBat .
CHits .
CHmRun .
CRuns 0.2081632
CRBI 0.4136293
CWalks .
LeagueN 3.3287982
DivisionW -104.7448513
PutOuts 0.2222622
Assists .
Errors .
NewLeagueN .
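Finally, it can be instructive to compare the lambda selected on the training subset with the one selected by the full-data cross-validation earlier; the two values will generally differ somewhat because they are estimated from different subsets and fold assignments.
lasso.cv.tr$lambda.min   # chosen using only the training rows
cv.lasso$lambda.min      # chosen using all of the data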