Polynomial and step functions are two of the simplest extensions of the linear model. We generally use them as building blocks for more flexible methods and as a first approach when a straight-line fit is clearly too rigid.
Page 267 of ISLR gives a straightforward explanation of how and why we use polynomial regression, and the Wikipedia article on polynomial regression is a useful reference for framing the idea.
Initialize our data
rm(list = ls())   # clear the workspace
library(ISLR)     # the Wage data set lives in the ISLR package
attach(Wage)      # make the Wage columns available by name
head(Wage)
year age maritl race education region
231655 2006 18 1. Never Married 1. White 1. < HS Grad 2. Middle Atlantic
86582 2004 24 1. Never Married 1. White 4. College Grad 2. Middle Atlantic
161300 2003 45 2. Married 1. White 3. Some College 2. Middle Atlantic
155159 2003 43 2. Married 3. Asian 4. College Grad 2. Middle Atlantic
11443 2005 50 4. Divorced 1. White 2. HS Grad 2. Middle Atlantic
376662 2008 54 2. Married 1. White 4. College Grad 2. Middle Atlantic
jobclass health health_ins logwage wage
231655 1. Industrial 1. <=Good 2. No 4.318063 75.04315
86582 2. Information 2. >=Very Good 2. No 4.255273 70.47602
161300 1. Industrial 1. <=Good 1. Yes 4.875061 130.98218
155159 2. Information 2. >=Very Good 1. Yes 5.041393 154.68529
11443 2. Information 1. <=Good 1. Yes 4.318063 75.04315
376662 2. Information 2. >=Very Good 1. Yes 4.845098 127.11574
dim(Wage)
[1] 3000 11
Polynomial regression extends the linear model by adding extra predictors, obtained by raising each of the original predictors to a power. For example, a cubic regression uses three variables, \(X\), \(X^2\), and \(X^3\), as predictors. This approach provides a simple way to fit the data non-linearly.
Below we focus on a single predictor, age. We fit a degree-4 polynomial regression of wage on age by calling the poly() function inside lm(). The poly() function generates a basis of orthogonal polynomials.
fit = lm(wage ~ poly(age, 4), data = Wage)
summary(fit)$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 111.70361 0.7287409 153.283015 0.000000e+00
poly(age, 4)1 447.06785 39.9147851 11.200558 1.484604e-28
poly(age, 4)2 -478.31581 39.9147851 -11.983424 2.355831e-32
poly(age, 4)3 125.52169 39.9147851 3.144742 1.678622e-03
poly(age, 4)4 -77.91118 39.9147851 -1.951938 5.103865e-02
Next, we refit the model with raw = T, which makes poly() use the raw powers of age rather than an orthogonal basis. The coefficient estimates change because the basis is different, but the fitted model itself is unaffected, so there is no practical benefit (or harm) in using non-orthogonal polynomials here. Orthogonal polynomials are still useful in general: the raw powers of age are highly correlated with one another, and the orthogonal basis avoids that multicollinearity, which matters most when the degree is high.
fit2 = lm(wage ~ poly(age, 4, raw = T), data = Wage) # raw = T: use raw powers of age instead of orthogonal polynomials
summary(fit2)$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.841542e+02 6.004038e+01 -3.067172 0.0021802539
poly(age, 4, raw = T)1 2.124552e+01 5.886748e+00 3.609042 0.0003123618
poly(age, 4, raw = T)2 -5.638593e-01 2.061083e-01 -2.735743 0.0062606446
poly(age, 4, raw = T)3 6.810688e-03 3.065931e-03 2.221409 0.0263977518
poly(age, 4, raw = T)4 -3.203830e-05 1.641359e-05 -1.951938 0.0510386498
Mathematical basis for orthogonal polynomials: page 289 of ISLR covers the use of orthogonal polynomials in practice.
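As a quick numerical check (not part of the original lab), we can verify that the columns produced by poly() are orthonormal, while the raw powers of age are strongly correlated with one another:
round(crossprod(poly(age, 4)), 10)     # cross-product of the orthogonal basis is the identity matrix
cor(poly(age, 4, raw = TRUE))[1, 2]    # the raw terms age and age^2 are nearly collinear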
Next, we plot the fitted function along with its standard errors. The standard errors give us confidence bands that we can use to gauge the uncertainty of the fitted model. The dotted lines are the bands, while the solid blue line is the degree-4 fit of wage on age. The shape of the fit summarizes the relationship between age and wage: wage increases with age, levels off roughly between 30 and 60, and then declines after 60.
agelims = range(age)
age.grid = seq(from = agelims[1], to = agelims[2])
preds = predict(fit, newdata = list(age = age.grid), se = TRUE)
se.bands = cbind(preds$fit + 2*preds$se.fit, preds$fit - 2*preds$se.fit)
par(mfrow = c(1, 1), mar = c(4.5, 4.5, 1, 1), oma = c(0, 0, 4, 0))
plot(age, wage, xlim = agelims, cex = .5, col = "darkgrey")
title("Degree-4 Polynomial", outer = T)
lines(age.grid, preds$fit, lwd = 2, col = "blue")
matlines(age.grid, se.bands, lwd = 1, col = "blue", lty = 3)
Whether or not an orthogonal set of basis functions is produced by the poly() function does not affect the fitted model. Here we convince ourselves of this by comparing the fitted values obtained in either case: they are identical, with the maximum difference essentially equal to zero.
preds2 = predict(fit2, newdata = list(age = age.grid), se = TRUE)
max(abs(preds$fit - preds2$fit ))
[1] 7.81597e-11
Picking the polynomial degree: here we use the anova() function to decide which of these nested models we actually need. anova() performs an F-test comparing each model with the one before it, which tells us the simplest model that is sufficient to explain the relationship between age and wage. Essentially, this is an easy way to determine the degree we should use. We compare the p-values from model to model, with a significant p-value indicating that the current degree improves on the immediately preceding one. For example, the comparison of degree 3 with degree 4 tells us whether the quartic term is worth keeping.
fit.1 = lm(wage ~ age, data = Wage)
fit.2 = lm(wage ~ poly(age, 2),data = Wage)
fit.3 = lm(wage ~ poly(age, 3),data = Wage)
fit.4 = lm(wage ~ poly(age, 4),data = Wage)
fit.5 = lm(wage ~ poly(age, 5),data = Wage)
anova(fit.1, fit.2, fit.3, fit.4, fit.5)
Analysis of Variance Table
Model 1: wage ~ age
Model 2: wage ~ poly(age, 2)
Model 3: wage ~ poly(age, 3)
Model 4: wage ~ poly(age, 4)
Model 5: wage ~ poly(age, 5)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 2998 5022216
2 2997 4793430 1 228786 143.5931 < 2.2e-16 ***
3 2996 4777674 1 15756 9.8888 0.001679 **
4 2995 4771604 1 6070 3.8098 0.051046 .
5 2994 4770322 1 1283 0.8050 0.369682
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Notice that we can get the same information from the anova() table and from the poly() coefficient summary by looking at the p-values, as we do here with fit.5. Comparing the t-statistics from the polynomial fit with the F-statistics from the anova(), we find that the square of each t-statistic equals the corresponding F-statistic, so the two approaches agree.
coef(summary(fit.5))
Estimate Std. Error t value Pr(>|t|)
(Intercept) 111.70361 0.7287647 153.2780243 0.000000e+00
poly(age, 5)1 447.06785 39.9160847 11.2001930 1.491111e-28
poly(age, 5)2 -478.31581 39.9160847 -11.9830341 2.367734e-32
poly(age, 5)3 125.52169 39.9160847 3.1446392 1.679213e-03
poly(age, 5)4 -77.91118 39.9160847 -1.9518743 5.104623e-02
poly(age, 5)5 -35.81289 39.9160847 -0.8972045 3.696820e-01
(-11.983)^2
[1] 143.5923
The advantage of anova(), however, is that it works whether or not we used orthogonal polynomials, and it also works when other terms are present in the model. Reading significance straight off the poly() coefficients only works for a single orthogonal-polynomial predictor; once other terms are in the model we fall back on anova(), as in the example below:
fit.1 = lm(wage ~ education + age, data = Wage)
fit.2 = lm(wage ~ education + poly(age, 2), data = Wage)
fit.3 = lm(wage ~ education + poly(age, 3), data = Wage)
anova(fit.1, fit.2, fit.3)
Analysis of Variance Table
Model 1: wage ~ education + age
Model 2: wage ~ education + poly(age, 2)
Model 3: wage ~ education + poly(age, 3)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 2994 3867992
2 2993 3725395 1 142597 114.6969 <2e-16 ***
3 2992 3719809 1 5587 4.4936 0.0341 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Importantly, we could also choose the polynomial degree with cross-validation, which is discussed in chapter 5 of ISLR.
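A minimal sketch of that procedure, assuming the boot package is available for its cv.glm() function, would pick the degree with the lowest 10-fold cross-validation error:
library(boot)                                             # provides cv.glm()
set.seed(1)
cv.errors = rep(NA, 5)
for (d in 1:5) {
  glm.fit = glm(wage ~ poly(age, d), data = Wage)         # gaussian glm() is equivalent to lm()
  cv.errors[d] = cv.glm(Wage, glm.fit, K = 10)$delta[1]   # estimated test MSE for degree d
}
cv.errors
which.min(cv.errors)                                      # degree with the lowest estimated test error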
Alternatives to poly(): the I() wrapper and cbind(). In R, we can fit a polynomial regression in other ways as well. We use the I() wrapper function because it keeps the expression exactly as we write it: R normally reads the caret, ^, as part of the formula syntax, so wrapping a term as I(age^2) tells R to treat it as arithmetic (age squared) rather than as a formula operator.
By using orthogonal polynomials, we can test each coefficient separately. Looking at the coefficient summary, we see that the linear, quadratic, and cubic terms are significant, but the quartic term is not. This shortcut only works when there is a single term in the model; otherwise, we use the more general anova() approach.
The plot below is a crude check of the similarity between the I() and poly() fits: we plot the fitted values of one model against the fitted values of the other. A straight line means they are essentially the same!
fit2a = lm(wage ~ age + I(age ^2) + I(age ^3) + I(age ^4), data = Wage)
coef(fit2a)
(Intercept) age I(age^2) I(age^3) I(age^4)
-1.841542e+02 2.124552e+01 -5.638593e-01 6.810688e-03 -3.203830e-05
plot(fitted(fit2a), fitted(fit))
The cbind() function is just another way of doing the I() wrapper steps, but more compactly.
fit2b = lm(wage~cbind(age, age^2, age^3, age^4), data = Wage)
summary(fit2b)$coef
Estimate Std. Error t value
(Intercept) -1.841542e+02 6.004038e+01 -3.067172
cbind(age, age^2, age^3, age^4)age 2.124552e+01 5.886748e+00 3.609042
cbind(age, age^2, age^3, age^4) -5.638593e-01 2.061083e-01 -2.735743
cbind(age, age^2, age^3, age^4) 6.810688e-03 3.065931e-03 2.221409
cbind(age, age^2, age^3, age^4) -3.203830e-05 1.641359e-05 -1.951938
Pr(>|t|)
(Intercept) 0.0021802539
cbind(age, age^2, age^3, age^4)age 0.0003123618
cbind(age, age^2, age^3, age^4) 0.0062606446
cbind(age, age^2, age^3, age^4) 0.0263977518
cbind(age, age^2, age^3, age^4) 0.0510386498
Now we fit a logistic regression model to a binary response constructed from wage. Anyone who is a 'big earner', meaning above the threshold of $250K a year, is encoded as 1, and everyone else as 0; only a relatively small group of observations end up as 1s. The expression wage > 250 is turned into a TRUE/FALSE variable with I(), and glm() then coerces it to 0/1, which we can later use for plotting and prediction. Note: because this is a GLM, even though the polynomial basis is orthogonal, it is no longer strictly orthogonal in the fit, since the GLM applies weights to the observations.
fit = glm(I(wage >250) ~ poly(age, 4), data = Wage, family = binomial)
Calculating the confidence intervals for this fit is outlined below. We use the default type = 'link' in the predict() method for glm objects, which means the predictions are on the logit scale. We could instead specify type = 'response' and compute probabilities directly.
preds = predict(fit, newdata = list(age = age.grid), se = T)
pfit = exp(preds$fit)/(1 + exp(preds$fit))
se.bands.logit = cbind(preds$fit + 2*preds$se.fit, preds$fit -2*preds$se.fit)
se.bands = exp(se.bands.logit)/(1 + exp(se.bands.logit))
This gives us the standard error bands, but the computations are on the logit scale. To transform we have to apply the inverse logit mapping: \[p=\frac{e^\eta}{1+e^\eta}.\]
If we instead used type = "response" and built the standard-error bands directly on the probability scale, the resulting confidence intervals could include negative probabilities, so we don't actually use that approach; the call below is just to show it.
preds = predict(fit, newdata = list(age = age.grid), type = 'response', se = T)
We have drawn the age values corresponding to the observations with wage values above 250
as gray marks on the top of the plot, and those with wage values below 250
are shown as gray marks on the bottom of the plot. We used the jitter()
function to jitter the age values a bit so that observations with the same age value do not cover each other up. This is often called a rug plot.
plot(age, I(wage >250), xlim = agelims, type = "n", ylim = c(0, .2))
points(jitter(age), I((wage >250)/5), cex = .5, pch = "|", col = "darkgrey")
lines(age.grid, pfit ,lwd = 2, col = "blue")
matlines(age.grid, se.bands, lwd = 1, col = "red", lty = 3)
Step Functions cut the range of a variable into K distinct regions in order to produce a qualitative variable. This has the effect of fitting a piecewise constant function.
Let's look at what happens when we cut age into 4 bins; the table shows the number of observations that fall in each bin. The cut() function returns a categorical variable, and lm() then creates a set of dummy variables from it for use in the regression.
table(cut(age, 4))
(17.9,33.5] (33.5,49] (49,64.5] (64.5,80.1]
750 1399 779 72
Now we fit the regression on the cut variable and look at the results.
fit = lm(wage ~ cut(age, 4), data = Wage)
summary(fit)$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 94.158392 1.476069 63.789970 0.000000e+00
cut(age, 4)(33.5,49] 24.053491 1.829431 13.148074 1.982315e-38
cut(age, 4)(49,64.5] 23.664559 2.067958 11.443444 1.040750e-29
cut(age, 4)(64.5,80.1] 7.640592 4.987424 1.531972 1.256350e-01
The age < 33.5 category is left out, so the intercept coefficient of $94,160 can be interpreted as the average salary for those under 33.5 years of age, and the other coefficients can be interpreted as the average additional salary for those in the other age groups.
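As a quick sanity check (not part of the original lab), we can confirm that these coefficients are just differences of group means, since a regression on a single categorical variable fits each bin with its sample mean:
mean(wage[age <= 33.5])                                          # matches the intercept, ~94.16
mean(wage[age > 33.5 & age <= 49]) - mean(wage[age <= 33.5])     # matches the (33.5,49] coefficient, ~24.05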
Now we do the same as above, but specify our own bin boundaries using the breaks argument of cut().
table(cut(age, breaks = c(18,25,35,45,55,65,85)))
(18,25] (25,35] (35,45] (45,55] (55,65] (65,85]
220 670 894 795 346 64
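One detail worth noting (a quick check, not part of the original lab): because the breaks start at 18 and cut() excludes the lower boundary by default, workers aged exactly 18 fall outside every interval and are coded NA, so they are silently dropped from the table above and from the regression below.
sum(is.na(cut(age, breaks = c(18, 25, 35, 45, 55, 65, 85))))   # number of observations excluded by these breaks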
Regressing on these bins returns a different set of coefficients than before! Here the (18,25] bin is the omitted baseline category, so the intercept coefficient of $76,871 can be interpreted as the average salary for workers over 18 and up to 25 years of age, and the other coefficients as the average additional salary for the older age groups. With these narrower bins the baseline group no longer extends up to age 33.5, which is why its average is lower than the intercept from the previous fit.
fit = lm(wage~cut(age, breaks = c(18,25,35,45,55,65,85)), data = Wage)
summary(fit)$coef
Estimate Std. Error
(Intercept) 76.87119 2.701950
cut(age, breaks = c(18, 25, 35, 45, 55, 65, 85))(25,35] 27.29279 3.114117
cut(age, breaks = c(18, 25, 35, 45, 55, 65, 85))(35,45] 42.19588 3.016138
cut(age, breaks = c(18, 25, 35, 45, 55, 65, 85))(45,55] 40.75438 3.053000
cut(age, breaks = c(18, 25, 35, 45, 55, 65, 85))(55,65] 42.14078 3.455791
cut(age, breaks = c(18, 25, 35, 45, 55, 65, 85))(65,85] 25.68270 5.691759
t value Pr(>|t|)
(Intercept) 28.450264 9.801999e-158
cut(age, breaks = c(18, 25, 35, 45, 55, 65, 85))(25,35] 8.764214 3.097201e-18
cut(age, breaks = c(18, 25, 35, 45, 55, 65, 85))(35,45] 13.990039 4.013382e-43
cut(age, breaks = c(18, 25, 35, 45, 55, 65, 85))(45,55] 13.348962 1.598943e-39
cut(age, breaks = c(18, 25, 35, 45, 55, 65, 85))(55,65] 12.194250 2.054603e-33
cut(age, breaks = c(18, 25, 35, 45, 55, 65, 85))(65,85] 4.512262 6.661936e-06
agelims = range(Wage$age)
age.grid = seq(from=agelims[1], to=agelims[2])
lm.pred = predict(fit, data.frame(age=age.grid))
plot(wage~age, data=Wage, col="darkgrey")
lines(age.grid, lm.pred, col="red", lwd=2)