Set the seed and load our packages. We need the boot package for cross-validation.
set.seed(1)
library(ISLR)
library(boot)
Now we perform a 10-fold cross-validation to determine the optimal degree d for the polynomial.
all.deltas = rep(NA, 10)
for (i in 1:10) {
  glm.fit = glm(wage~poly(age, i), data=Wage)
  all.deltas[i] = cv.glm(Wage, glm.fit, K=10)$delta[2]
}
Now let's look at the index where we have the minimum error. We find that the lowest error is at degree 9 - this is the degree the CV suggests for the polynomial regression.
which.min(all.deltas)
[1] 9
Now let's look at a plot of the errors.
plot(1:10, all.deltas, xlab="Degree", ylab="CV error", type="l", pch=20, lwd=2, ylim=c(1590, 1700))
min.point = min(all.deltas)
sd.points = sd(all.deltas)
abline(h=min.point + 0.2 * sd.points, col="red", lty="dashed")
abline(h=min.point - 0.2 * sd.points, col="red", lty="dashed")
legend("topright", "0.2-standard deviation lines", lty="dashed", col="red") 
Another way to choose the degree is to compare the nested fits with anova(). Each p-value indicates whether a given polynomial degree improves significantly on the immediately preceding degree. For example, there is a significant improvement from degree 1 to degree 2, but between degree 4 and degree 5 there is no indication of a significant improvement in the fit. This leaves us with three candidate degrees - 3, 4, and 9. The CV suggests degree 9, but that is not an absolute answer: in terms of parsimony, the anova() suggests that degree 3 or 4 and degree 9 are not that different, so we should really prefer a degree 3 or 4 polynomial over degree 9. A simpler model is better here and gives us roughly the same results.
fit.1 = lm(wage~poly(age, 1), data=Wage)
fit.2 = lm(wage~poly(age, 2), data=Wage)
fit.3 = lm(wage~poly(age, 3), data=Wage)
fit.4 = lm(wage~poly(age, 4), data=Wage)
fit.5 = lm(wage~poly(age, 5), data=Wage)
fit.6 = lm(wage~poly(age, 6), data=Wage)
fit.7 = lm(wage~poly(age, 7), data=Wage)
fit.8 = lm(wage~poly(age, 8), data=Wage)
fit.9 = lm(wage~poly(age, 9), data=Wage)
fit.10 = lm(wage~poly(age, 10), data=Wage)
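As an aside, the same ten models can be fit more compactly in a loop (a minimal equivalent sketch; the explicit fits above are the ones compared below):
# Sketch: fit all ten polynomial degrees in one pass (equivalent to fit.1 ... fit.10)
fits = lapply(1:10, function(d) lm(wage ~ poly(age, d), data = Wage))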
anova(fit.1, fit.2, fit.3, fit.4, fit.5, fit.6, fit.7, fit.8, fit.9, fit.10)
Analysis of Variance Table
Model  1: wage ~ poly(age, 1)
Model  2: wage ~ poly(age, 2)
Model  3: wage ~ poly(age, 3)
Model  4: wage ~ poly(age, 4)
Model  5: wage ~ poly(age, 5)
Model  6: wage ~ poly(age, 6)
Model  7: wage ~ poly(age, 7)
Model  8: wage ~ poly(age, 8)
Model  9: wage ~ poly(age, 9)
Model 10: wage ~ poly(age, 10)
   Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
1    2998 5022216                                    
2    2997 4793430  1    228786 143.7638 < 2.2e-16 ***
3    2996 4777674  1     15756   9.9005  0.001669 ** 
4    2995 4771604  1      6070   3.8143  0.050909 .  
5    2994 4770322  1      1283   0.8059  0.369398    
6    2993 4766389  1      3932   2.4709  0.116074    
7    2992 4763834  1      2555   1.6057  0.205199    
8    2991 4763707  1       127   0.0796  0.777865    
9    2990 4756703  1      7004   4.4014  0.035994 *  
10   2989 4756701  1         3   0.0017  0.967529    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The code below produces a graph of the data using the degree 3 and degree 4 fits suggested by the anova(). Degree 3 is the solid blue line, and degree 4 is the solid red line. (The degree 9 fit is plotted separately further down as a solid black line.)
plot(wage~age, data=Wage, col="darkgrey")
agelims = range(Wage$age)
age.grid = seq(from=agelims[1], to=agelims[2])
lm.fitd3 = lm(wage~poly(age, 3), data=Wage)
lm.fitd4 = lm(wage~poly(age, 4), data=Wage)
lm.predd3 = predict(lm.fitd3, data.frame(age=age.grid))
lm.predd4 = predict(lm.fitd4, data.frame(age=age.grid))
lines(age.grid, lm.predd3, col="blue", lwd=2)
lines(age.grid, lm.predd4, col="red", lwd=2) 
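If we want the plot to be self-explanatory, we can also add a legend (a small optional addition, matching the colors used above):
legend("topright", legend=c("Degree 3", "Degree 4"), col=c("blue", "red"), lwd=2)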
And now we see why we shouldn't use degree 9 - it does not improve our insight much, if at all, and only complicates the model.
plot(wage~age, data=Wage, col="darkgrey")
agelims = range(Wage$age)
age.grid = seq(from=agelims[1], to=agelims[2])
lm.fitd9 = lm(wage~poly(age, 9), data=Wage)
lm.predd9 = predict(lm.fitd9, data.frame(age=age.grid))
lines(age.grid, lm.predd9, col="black", lwd=2) 
Again, we perform a 10-fold CV, this time to choose the optimal number of cuts for a step function.
all.cvs = rep(NA, 10)
for (i in 2:10) {
  Wage$age.cut = cut(Wage$age, i)
  lm.fit = glm(wage~age.cut, data=Wage)
  all.cvs[i] = cv.glm(Wage, lm.fit, K=10)$delta[2]
}
Let's look at the plot and try to determine the best number of cuts.
plot(2:10, all.cvs[-1], xlab="Number of cuts", ylab="CV error", type="l", pch=20, lwd=2)
The lowest CV error clearly occurs at 8 cuts, so that is the number we will use (the small check below confirms it programmatically). Now let's fit the step function - we use glm() here instead of lm(), which is functionally the same for our purposes.
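As mentioned, a minimal programmatic check of the minimum, reusing all.cvs from the loop above (which.min ignores the NA in position 1):
which.min(all.cvs)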
lm.fit = glm(wage~cut(age, 8), data=Wage)
summary(lm.fit)$coef
##                        Estimate Std. Error   t value      Pr(>|t|)
## (Intercept)            76.28175   2.629812 29.006542 3.110596e-163
## cut(age, 8)(25.8,33.5] 25.83329   3.161343  8.171618  4.440913e-16
## cut(age, 8)(33.5,41.2] 40.22568   3.049065 13.192791  1.136044e-38
## cut(age, 8)(41.2,49]   43.50112   3.018341 14.412262  1.406253e-45
## cut(age, 8)(49,56.8]   40.13583   3.176792 12.634076  1.098741e-35
## cut(age, 8)(56.8,64.5] 44.10243   3.564299 12.373380  2.481643e-34
## cut(age, 8)(64.5,72.2] 28.94825   6.041576  4.791505  1.736008e-06
## cut(age, 8)(72.2,80.1] 15.22418   9.781110  1.556488  1.196978e-01
Now we just predict using our step-function model and plot the result.
agelims = range(Wage$age)
age.grid = seq(from=agelims[1], to=agelims[2])
lm.pred = predict(lm.fit, data.frame(age=age.grid))
plot(wage~age, data=Wage, col="darkgrey")
lines(age.grid, lm.pred, col="red", lwd=2)
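If we also want a rough sense of the uncertainty of the step-function fit, one option (an optional sketch, not part of the original plot) is to request standard errors from predict() and draw approximate bands:
# Optional sketch: approximate 95% confidence bands around the step-function fit
lm.pred.se = predict(lm.fit, data.frame(age=age.grid), se.fit=TRUE)
lines(age.grid, lm.pred.se$fit + 2*lm.pred.se$se.fit, col="red", lty="dashed")
lines(age.grid, lm.pred.se$fit - 2*lm.pred.se$se.fit, col="red", lty="dashed")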