Set the seed and load our packages. We need the boot package for cross-validation.
set.seed(1)
library(ISLR)
library(boot)
Perform 10-fold cross-validation. We perform this step to determine the optimal degree d for the polynomial.
all.deltas = rep(NA, 10)
for (i in 1:10) {
glm.fit = glm(wage~poly(age, i), data=Wage)
all.deltas[i] = cv.glm(Wage, glm.fit, K=10)$delta[2]
}
Now let's look at the index with the minimum error. We find that the lowest error occurs at degree 9 - this is the degree at which the cross-validated polynomial regression performs best.
which.min(all.deltas)
[1] 9
Now let's look at a plot of the errors.
plot(1:10, all.deltas, xlab="Degree", ylab="CV error", type="l", pch=20, lwd=2, ylim=c(1590, 1700))
min.point = min(all.deltas)
sd.points = sd(all.deltas)
abline(h=min.point + 0.2 * sd.points, col="red", lty="dashed")
abline(h=min.point - 0.2 * sd.points, col="red", lty="dashed")
legend("topright", "0.2-standard deviation lines", lty="dashed", col="red")
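The dashed lines suggest a simple selection rule in the spirit of the one-standard-error rule. As a small sketch (not part of the original analysis), we can pick the smallest degree whose CV error falls within 0.2 standard deviations of the minimum:
# Smallest degree whose CV error is within 0.2 SD of the minimum,
# mirroring the dashed lines drawn above.
which(all.deltas <= min.point + 0.2 * sd.points)[1]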
Another way to choose the best degree is to perform an anova() to test which degree d is best - the results broadly agree with the CV approach. Each p-value indicates whether a given polynomial degree improves significantly on the immediately preceding one. For example, there is a significant improvement from degree 1 to degree 2; however, between degree 4 and degree 5 there is no indication of a significant improvement in the fit. In our case the CV suggests a degree of 9, but it is important to note that this is not an absolute answer. In terms of parsimony, the anova() suggests that a degree-3 or degree-4 fit and a degree-9 fit are not that different, so we should really prefer a degree-3 or degree-4 polynomial regression over degree 9. The simpler model is better here and gives us roughly the same results.
fit.1 = lm(wage~poly(age, 1), data=Wage)
fit.2 = lm(wage~poly(age, 2), data=Wage)
fit.3 = lm(wage~poly(age, 3), data=Wage)
fit.4 = lm(wage~poly(age, 4), data=Wage)
fit.5 = lm(wage~poly(age, 5), data=Wage)
fit.6 = lm(wage~poly(age, 6), data=Wage)
fit.7 = lm(wage~poly(age, 7), data=Wage)
fit.8 = lm(wage~poly(age, 8), data=Wage)
fit.9 = lm(wage~poly(age, 9), data=Wage)
fit.10 = lm(wage~poly(age, 10), data=Wage)
anova(fit.1, fit.2, fit.3, fit.4, fit.5, fit.6, fit.7, fit.8, fit.9, fit.10)
Analysis of Variance Table
Model 1: wage ~ poly(age, 1)
Model 2: wage ~ poly(age, 2)
Model 3: wage ~ poly(age, 3)
Model 4: wage ~ poly(age, 4)
Model 5: wage ~ poly(age, 5)
Model 6: wage ~ poly(age, 6)
Model 7: wage ~ poly(age, 7)
Model 8: wage ~ poly(age, 8)
Model 9: wage ~ poly(age, 9)
Model 10: wage ~ poly(age, 10)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 2998 5022216
2 2997 4793430 1 228786 143.7638 < 2.2e-16 ***
3 2996 4777674 1 15756 9.9005 0.001669 **
4 2995 4771604 1 6070 3.8143 0.050909 .
5 2994 4770322 1 1283 0.8059 0.369398
6 2993 4766389 1 3932 2.4709 0.116074
7 2992 4763834 1 2555 1.6057 0.205199
8 2991 4763707 1 127 0.0796 0.777865
9 2990 4756703 1 7004 4.4014 0.035994 *
10 2989 4756701 1 3 0.0017 0.967529
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
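As a side check, because poly() uses orthogonal polynomials by default, the t-statistics in the summary of the largest model reproduce these sequential F-tests (each F above is the square of the corresponding t). A quick sketch to confirm:
# The t-statistics for the orthogonal polynomial terms in fit.10 satisfy t^2 = F
# for the corresponding rows of the ANOVA table above.
coef(summary(fit.10))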
The code below produces a graph of the data with the degree-3 and degree-4 fits suggested by the anova(): degree 3 is the solid blue line and degree 4 is the solid red line. (The degree-9 fit is plotted separately below as a solid black line.)
plot(wage~age, data=Wage, col="darkgrey")
agelims = range(Wage$age)
age.grid = seq(from=agelims[1], to=agelims[2])
lm.fitd3 = lm(wage~poly(age, 3), data=Wage)
lm.fitd4 = lm(wage~poly(age, 4), data=Wage)
lm.predd3 = predict(lm.fitd3, data.frame(age=age.grid))
lm.predd4 = predict(lm.fitd4, data.frame(age=age.grid))
lines(age.grid, lm.predd3, col="blue", lwd=2)
lines(age.grid, lm.predd4, col="red", lwd=2)
And now we see why we really shouldn't use degree 9: it adds little insight, if any, and only complicates our model.
plot(wage~age, data=Wage, col="darkgrey")
agelims = range(Wage$age)
age.grid = seq(from=agelims[1], to=agelims[2])
lm.fitd9 = lm(wage~poly(age, 9), data=Wage)
lm.predd9 = predict(lm.fitd9, data.frame(age=age.grid))
lines(age.grid, lm.predd9, col="black", lwd=2)
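As a quick numeric check of this point, we can reuse the all.deltas vector from the cross-validation above (a small sketch, not in the original write-up) and compare the CV errors directly; per the discussion above, the degree-3/4 and degree-9 errors should be close:
# CV errors for degrees 3, 4, and 9 from the earlier 10-fold CV.
all.deltas[c(3, 4, 9)]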
Again, we perform 10-fold CV, this time to choose the optimal number of cuts for a step function.
all.cvs = rep(NA, 10)
for (i in 2:10) {
Wage$age.cut = cut(Wage$age, i)
lm.fit = glm(wage~age.cut, data=Wage)
all.cvs[i] = cv.glm(Wage, lm.fit, K=10)$delta[2]
}
Let's look at the plot and try to determine the best number of cuts.
plot(2:10, all.cvs[-1], xlab="Number of cuts", ylab="CV error", type="l", pch=20, lwd=2)
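Rather than reading the minimum off the plot, we can also check it programmatically (a small sketch reusing the all.cvs vector above; which.min skips the NA in position 1, and the returned index corresponds directly to the number of cuts):
# Index of the smallest CV error, i.e. the chosen number of cuts.
which.min(all.cvs)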
The best number of cuts is 8, where we have the lowest CV error. Now let's fit the step function with 8 cuts (we use glm here instead of lm; it is functionally the same for this fit).
lm.fit = glm(wage~cut(age, 8), data=Wage)
summary(lm.fit)$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 76.28175 2.629812 29.006542 3.110596e-163
cut(age, 8)(25.8,33.5] 25.83329 3.161343 8.171618 4.440913e-16
cut(age, 8)(33.5,41.2] 40.22568 3.049065 13.192791 1.136044e-38
cut(age, 8)(41.2,49] 43.50112 3.018341 14.412262 1.406253e-45
cut(age, 8)(49,56.8] 40.13583 3.176792 12.634076 1.098741e-35
cut(age, 8)(56.8,64.5] 44.10243 3.564299 12.373380 2.481643e-34
cut(age, 8)(64.5,72.2] 28.94825 6.041576 4.791505 1.736008e-06
cut(age, 8)(72.2,80.1] 15.22418 9.781110 1.556488 1.196978e-01
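To read this table: the intercept is the estimated mean wage in the first age bin, and each cut coefficient is an offset from that baseline. As a small illustration (a sketch using the coefficient names printed above), the estimated mean wage for ages in (41.2,49] is roughly 76.3 + 43.5, i.e. about 119.8:
# Estimated mean wage for the (41.2,49] age bin: intercept plus its offset.
coef(lm.fit)["(Intercept)"] + coef(lm.fit)["cut(age, 8)(41.2,49]"]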
Now we predict using our step-function model and plot the fit.
agelims = range(Wage$age)
age.grid = seq(from=agelims[1], to=agelims[2])
lm.pred = predict(lm.fit, data.frame(age=age.grid))
plot(wage~age, data=Wage, col="darkgrey")
lines(age.grid, lm.pred, col="red", lwd=2)