Polynomial Regression and Step Functions


Team 13: Jasmine Yu, Jianqi chai, David Kersey, Griffin Salyer

Introduction



Polynomial regression and step functions are two of the simplest ways to move beyond the standard linear model. We generally use them as fundamental starting points for more flexible models and as a point of reference when we want to dig into the structure of a specific non-linear relationship.


Page 267 of ISLR has a straightforward explanation of how and why we use polynomial regression.
Additionally, the Wikipedia article on polynomial regression is an excellent reference for framing the idea.

Data



Initialize our data

rm(list=ls())
library(ISLR)
attach(Wage)
head(Wage)
       year age           maritl     race       education             region
231655 2006  18 1. Never Married 1. White    1. < HS Grad 2. Middle Atlantic
86582  2004  24 1. Never Married 1. White 4. College Grad 2. Middle Atlantic
161300 2003  45       2. Married 1. White 3. Some College 2. Middle Atlantic
155159 2003  43       2. Married 3. Asian 4. College Grad 2. Middle Atlantic
11443  2005  50      4. Divorced 1. White      2. HS Grad 2. Middle Atlantic
376662 2008  54       2. Married 1. White 4. College Grad 2. Middle Atlantic
             jobclass         health health_ins  logwage      wage
231655  1. Industrial      1. <=Good      2. No 4.318063  75.04315
86582  2. Information 2. >=Very Good      2. No 4.255273  70.47602
161300  1. Industrial      1. <=Good     1. Yes 4.875061 130.98218
155159 2. Information 2. >=Very Good     1. Yes 5.041393 154.68529
11443  2. Information      1. <=Good     1. Yes 4.318063  75.04315
376662 2. Information 2. >=Very Good     1. Yes 4.845098 127.11574
dim(Wage)
[1] 3000   11

Polynomial Regression



Polynomial regression extends the linear model by adding extra predictors, obtained by raising each of the original predictors to a power. For example, a cubic regression uses three variables, \(X\), \(X^2\), and \(X^3\), as predictors. This approach provides a simple way to fit the data non-linearly.
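
In equation form (following the notation of ISLR Chapter 7), a degree-\(d\) polynomial regression replaces the standard linear model with

\[y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \cdots + \beta_d x_i^d + \epsilon_i,\]

which is still linear in the coefficients and can therefore be fit by ordinary least squares.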



Below we focus on a single predictor: age. We fit a degree-4 polynomial regression of wage on age with lm(), calling the poly() function inside the formula. The poly() function generates a basis of orthogonal polynomials.

fit = lm(wage ~ poly(age, 4), data = Wage)
summary(fit)$coef
                Estimate Std. Error    t value     Pr(>|t|)
(Intercept)    111.70361  0.7287409 153.283015 0.000000e+00
poly(age, 4)1  447.06785 39.9147851  11.200558 1.484604e-28
poly(age, 4)2 -478.31581 39.9147851 -11.983424 2.355831e-32
poly(age, 4)3  125.52169 39.9147851   3.144742 1.678622e-03
poly(age, 4)4  -77.91118 39.9147851  -1.951938 5.103865e-02



Then, we refit using the argument raw = T and find that, although the coefficient estimates change, the fitted model does not. Orthogonal polynomials overcome many issues presented by multicollinearity: the raw powers of a variable (age, age^2, age^3, ...) are highly correlated with one another, which becomes a real problem when the degree is high. Essentially, orthogonal polynomials give us a numerically stable basis for an otherwise complex polynomial model. Using raw = T does not affect the model in any meaningful way - that is, there's no benefit from using non-orthogonal polynomials here.

fit2 = lm(wage ~ poly(age, 4, raw = T), data = Wage) # raw = T means don't use orthogonal polynomials
summary(fit2)$coef
                            Estimate   Std. Error   t value     Pr(>|t|)
(Intercept)            -1.841542e+02 6.004038e+01 -3.067172 0.0021802539
poly(age, 4, raw = T)1  2.124552e+01 5.886748e+00  3.609042 0.0003123618
poly(age, 4, raw = T)2 -5.638593e-01 2.061083e-01 -2.735743 0.0062606446
poly(age, 4, raw = T)3  6.810688e-03 3.065931e-03  2.221409 0.0263977518
poly(age, 4, raw = T)4 -3.203830e-05 1.641359e-05 -1.951938 0.0510386498

Mathematical Basis for Orthogonal Polynomials
ISLR page 289 covers the use of orthogonal polynomials in practice.
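
As a quick sanity check (not part of the original lab), we can verify both claims directly in R: the columns of the raw polynomial basis are highly correlated with one another, while the columns produced by poly() are orthonormal.

raw.basis = poly(age, 4, raw = TRUE)   # columns are age, age^2, age^3, age^4
orth.basis = poly(age, 4)              # orthogonal polynomial basis
round(cor(raw.basis), 2)               # raw powers are nearly collinear
round(crossprod(orth.basis), 10)       # t(X) %*% X is the identity matrix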



Next, we make a plot of the fitted function, along with the standard errors from that fit. The standard errors give us confidence bands that we use to gauge the uncertainty of our fitted model. The dotted lines are the bands, while the solid blue line is the degree-4 fit to the age data. Notice the shape of the fit: it tells us the relationship between age and wage - wage increases with age, levels off roughly between 30 and 60, and then decreases after 60.

agelims = range(age)
age.grid = seq(from = agelims[1], to = agelims[2])
preds = predict(fit, newdata = list(age = age.grid), se = TRUE)
se.bands = cbind(preds$fit + 2*preds$se.fit, preds$fit - 2*preds$se.fit) # approximate 95% bands: fit +/- 2 standard errors

par(mfrow = c(1,1), mar = c(4.5 ,4.5 ,1 ,1), oma = c(0,0,4,0))
plot(age, wage, xlim = agelims, cex = .5, col = "darkgrey")
title("Degree -4 Polynomial", outer = T)
lines(age.grid, preds$fit, lwd = 2, col = "blue")
matlines(age.grid, se.bands, lwd = 1, col = "blue", lty = 3)



Whether or not an orthogonal set of basis functions is produced by the poly() function does not affect the model. Here we convince ourselves of this by looking at the fitted values obtained in either case and find that they are identical: the maximum difference is essentially zero.

preds2 = predict(fit2, newdata = list(age = age.grid), se = TRUE)
max(abs(preds$fit - preds2$fit))
[1] 7.81597e-11





Picking Polynomial Degree

Here we use the anova() function to decide which polynomial terms we actually need. We use anova() because it performs an F-test comparing each model in the sequence to the one before it, which tells us the simplest model that is sufficient to explain the relationship between age and wage. Essentially, this is an easy way to determine the degree we should be using in our model.
Remember, we are comparing the p-values from model to model - a significant p-value indicates that the current degree is an improvement over the immediately preceding one. For example, the row for the degree-4 model tests whether moving from degree 3 to degree 4 improves the fit.

fit.1 = lm(wage ~ age, data = Wage)
fit.2 = lm(wage ~ poly(age, 2),data = Wage)
fit.3 = lm(wage ~ poly(age, 3),data = Wage)
fit.4 = lm(wage ~ poly(age, 4),data = Wage)
fit.5 = lm(wage ~ poly(age, 5),data = Wage)
anova(fit.1, fit.2, fit.3, fit.4, fit.5)
Analysis of Variance Table

Model 1: wage ~ age
Model 2: wage ~ poly(age, 2)
Model 3: wage ~ poly(age, 3)
Model 4: wage ~ poly(age, 4)
Model 5: wage ~ poly(age, 5)
  Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
1   2998 5022216                                    
2   2997 4793430  1    228786 143.5931 < 2.2e-16 ***
3   2996 4777674  1     15756   9.8888  0.001679 ** 
4   2995 4771604  1      6070   3.8098  0.051046 .  
5   2994 4770322  1      1283   0.8050  0.369682    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1



Notice that we can obtain the same information from the poly() fit as from anova() by looking at the p-values, as we do here with fit.5: because the polynomials are orthogonal, the p-values for the quadratic term and above match the anova() rows exactly. Comparing the t-statistics from poly() with the F-statistics from anova(), we find that each F is the square of the corresponding t (an F-test on one degree of freedom is a squared t-test) - convincing us that poly() tells the same story. For example, squaring the t-statistic of the quadratic term reproduces the F value in the second row of the anova table:

coef(summary(fit.5))
                Estimate Std. Error     t value     Pr(>|t|)
(Intercept)    111.70361  0.7287647 153.2780243 0.000000e+00
poly(age, 5)1  447.06785 39.9160847  11.2001930 1.491111e-28
poly(age, 5)2 -478.31581 39.9160847 -11.9830341 2.367734e-32
poly(age, 5)3  125.52169 39.9160847   3.1446392 1.679213e-03
poly(age, 5)4  -77.91118 39.9160847  -1.9518743 5.104623e-02
poly(age, 5)5  -35.81289 39.9160847  -0.8972045 3.696820e-01
(-11.983)^2
[1] 143.5923



The advantage of anova(), however, is that it works whether or not we used orthogonal polynomials - and it also works when other terms are present in the model. In that case we cannot simply read the answer off the poly() coefficients, so we compare the nested models directly with anova(). An example of this, with education added to the model, is below:

fit.1 = lm(wage ~ education + age, data = Wage)
fit.2 = lm(wage ~ education + poly(age, 2), data = Wage)
fit.3 = lm(wage ~ education + poly(age, 3), data = Wage)
anova(fit.1, fit.2, fit.3)
Analysis of Variance Table

Model 1: wage ~ education + age
Model 2: wage ~ education + poly(age, 2)
Model 3: wage ~ education + poly(age, 3)
  Res.Df     RSS Df Sum of Sq        F Pr(>F)    
1   2994 3867992                                 
2   2993 3725395  1    142597 114.6969 <2e-16 ***
3   2992 3719809  1      5587   4.4936 0.0341 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1



Importantly, we could carry out the same model-selection procedure with cross-validation, which is discussed in Chapter 5 of ISLR.
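
A minimal sketch of that approach (not part of the original lab; it assumes the boot package and an arbitrary seed) uses cv.glm() with 10-fold cross-validation and picks the degree with the smallest estimated test error:

library(boot)
set.seed(1)                                             # arbitrary seed for reproducibility
cv.errors = rep(NA, 5)
for (d in 1:5) {
  glm.fit = glm(wage ~ poly(age, d), data = Wage)       # gaussian glm() is equivalent to lm()
  cv.errors[d] = cv.glm(Wage, glm.fit, K = 10)$delta[1] # 10-fold CV estimate of test MSE
}
cv.errors
which.min(cv.errors)                                    # degree with the lowest CV error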



poly() Alternatives


I() Wrapper and cbind():


In R, we can do polynomial regression in a different way. We use the I() wrapper function because it maintains the term the way we want it represented. R normally reads the caret symbol, ^, as part of the formula syntax, so here we are telling R not to do that: writing I(age^2) makes R treat ^ as ordinary exponentiation rather than giving it a formula meaning.

By using orthogonal polynomials, we can test each coefficient separately. When we look at the summary we see that the linear, quadratic, and cubic terms are significant, but the quartic term is not. This shortcut only works when the polynomial is the only term in the model; otherwise, we use the more general anova() approach.


This graph crudely checks the similarity between the I() and poly() approaches by plotting the fitted values from the two models against each other. When we see a straight line, it means they are essentially the same!

fit2a = lm(wage ~ age + I(age ^2) + I(age ^3) + I(age ^4), data = Wage)
coef(fit2a)
  (Intercept)           age      I(age^2)      I(age^3)      I(age^4) 
-1.841542e+02  2.124552e+01 -5.638593e-01  6.810688e-03 -3.203830e-05 
plot(fitted(fit2a), fitted(fit))



The cbind() function is just another way of doing the I() wrapper steps, but more compactly: building the powers of age inside cbind() also protects them from being interpreted as formula syntax.

fit2b = lm(wage~cbind(age, age^2, age^3, age^4), data = Wage)
summary(fit2b)$coef
                                        Estimate   Std. Error   t value
(Intercept)                        -1.841542e+02 6.004038e+01 -3.067172
cbind(age, age^2, age^3, age^4)age  2.124552e+01 5.886748e+00  3.609042
cbind(age, age^2, age^3, age^4)    -5.638593e-01 2.061083e-01 -2.735743
cbind(age, age^2, age^3, age^4)     6.810688e-03 3.065931e-03  2.221409
cbind(age, age^2, age^3, age^4)    -3.203830e-05 1.641359e-05 -1.951938
                                       Pr(>|t|)
(Intercept)                        0.0021802539
cbind(age, age^2, age^3, age^4)age 0.0003123618
cbind(age, age^2, age^3, age^4)    0.0062606446
cbind(age, age^2, age^3, age^4)    0.0263977518
cbind(age, age^2, age^3, age^4)    0.0510386498





Polynomial Logistic Regression



Now we fit a logistic regression model to a binary response variable constructed from Wage. Essentially, anyone who is a 'big earner' - above the threshold of $250,000 a year - is encoded as a 1, and everyone else is a 0, leaving a relatively small group of 1s as the positive class. The expression I(wage > 250) creates a TRUE/FALSE variable on the fly, which glm() coerces to 0/1; we can later use it in plotting and for prediction. Note: because this is a GLM, the observations are weighted during fitting, so even though the polynomial is built in an orthogonal basis, that basis is no longer strictly orthogonal with respect to those weights.

fit = glm(I(wage >250) ~ poly(age, 4), data = Wage, family = binomial) 


Calculating the confidence intervals for this fit is outlined below. We use predict()'s default type = "link", which means we get predictions for the logit (the linear predictor). We could also specify type = "response" and compute probabilities directly.

preds = predict(fit, newdata = list(age = age.grid), se = T)
pfit = exp(preds$fit)/(1 + exp(preds$fit))                                     # inverse logit of the fitted logits
se.bands.logit = cbind(preds$fit + 2*preds$se.fit, preds$fit - 2*preds$se.fit) # +/- 2 SE bands on the logit scale
se.bands = exp(se.bands.logit)/(1 + exp(se.bands.logit))                       # bands transformed to the probability scale

This gives us the standard error bands, but the computations are on the logit scale. To transform we have to apply the inverse logit mapping:
\[p=\frac{e^\eta}{1+e^\eta}.\]


Using type = "response" for this set of predictions will give us negative probabilities, so we dont actually use it! This is just to show.

preds = predict(fit, newdata = list(age = age.grid), type = 'response', se = T) 
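
As a quick check (not in the original lab), these response-scale predictions should agree with the probabilities we computed by hand with the inverse logit transform:

max(abs(pfit - preds$fit)) # should be essentially zero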


We have drawn the age values corresponding to the observations with wage values above 250 as gray marks on the top of the plot, and those with wage values below 250 are shown as gray marks on the bottom of the plot. We used the jitter() function to jitter the age values a bit so that observations with the same age value do not cover each other up. This is often called a rug plot.

plot(age, I(wage >250), xlim = agelims, type = "n", ylim = c(0, .2))
points(jitter(age), I((wage >250)/5), cex = .5, pch = "|", col = "darkgrey") # divide by 5 so the 0/1 marks sit at the bottom and top of the 0-0.2 y-range
lines(age.grid, pfit ,lwd = 2, col = "blue")
matlines(age.grid, se.bands, lwd = 1, col = "red", lty = 3)


Step Functions



Step Functions cut the range of a variable into K distinct regions in order to produce a qualitative variable. This has the effect of fitting a piecewise constant function.
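
In equation form (following the notation of ISLR Chapter 7), we create dummy variables \(C_1(X), \ldots, C_K(X)\) for the intervals defined by the cutpoints \(c_1 < c_2 < \cdots < c_K\) and fit a linear model on them:

\[y_i = \beta_0 + \beta_1 C_1(x_i) + \beta_2 C_2(x_i) + \cdots + \beta_K C_K(x_i) + \epsilon_i,\]

where each \(C_k(x)\) is an indicator that equals 1 when \(x\) falls in the \(k\)th interval and 0 otherwise.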

Let's look at what happens when we cut age into 4 intervals; the table shows the number of observations in each bin. The cut() function returns a categorical variable (a factor), from which lm() then creates a set of dummy variables for use in the regression.

table(cut(age, 4))

(17.9,33.5]   (33.5,49]   (49,64.5] (64.5,80.1] 
        750        1399         779          72 


Now we fit a linear model on the cut variable and look at the regression results.

fit = lm(wage ~ cut(age, 4), data = Wage)
summary(fit)$coef
                        Estimate Std. Error   t value     Pr(>|t|)
(Intercept)            94.158392   1.476069 63.789970 0.000000e+00
cut(age, 4)(33.5,49]   24.053491   1.829431 13.148074 1.982315e-38
cut(age, 4)(49,64.5]   23.664559   2.067958 11.443444 1.040750e-29
cut(age, 4)(64.5,80.1]  7.640592   4.987424  1.531972 1.256350e-01

The age<33.5 category is left out, so the intercept coefficient of $94,160 can be interpreted as the average salary for those under 33.5 years of age, and the other coefficients can be interpreted as the average additional salary for those in the other age groups



Now we do the same as above, but specify our own bin boundaries using the breaks = c() argument of cut().

table(cut(age, breaks = c(18,25,35,45,55,65,85)))

(18,25] (25,35] (35,45] (45,55] (55,65] (65,85] 
    220     670     894     795     346      64 

Regressing wage on these bins returns a different set of coefficients than before!

The age<18 is also left out in this step function, and the intercept coefficient of $76,871 can be interpreted as the average salary for those under 18 years of age, and the other coefficients can be interpreted as the average additional salary for those in the other age groups. This seems really high for that age group, and that comes from the nature of this function - there are very few data points for the 18 and under age group and so a regression in this area may not provide the most accurate results.

fit = lm(wage~cut(age, breaks = c(18,25,35,45,55,65,85)), data = Wage)
summary(fit)$coef
                                                        Estimate Std. Error
(Intercept)                                             76.87119   2.701950
cut(age, breaks = c(18, 25, 35, 45, 55, 65, 85))(25,35] 27.29279   3.114117
cut(age, breaks = c(18, 25, 35, 45, 55, 65, 85))(35,45] 42.19588   3.016138
cut(age, breaks = c(18, 25, 35, 45, 55, 65, 85))(45,55] 40.75438   3.053000
cut(age, breaks = c(18, 25, 35, 45, 55, 65, 85))(55,65] 42.14078   3.455791
cut(age, breaks = c(18, 25, 35, 45, 55, 65, 85))(65,85] 25.68270   5.691759
                                                          t value      Pr(>|t|)
(Intercept)                                             28.450264 9.801999e-158
cut(age, breaks = c(18, 25, 35, 45, 55, 65, 85))(25,35]  8.764214  3.097201e-18
cut(age, breaks = c(18, 25, 35, 45, 55, 65, 85))(35,45] 13.990039  4.013382e-43
cut(age, breaks = c(18, 25, 35, 45, 55, 65, 85))(45,55] 13.348962  1.598943e-39
cut(age, breaks = c(18, 25, 35, 45, 55, 65, 85))(55,65] 12.194250  2.054603e-33
cut(age, breaks = c(18, 25, 35, 45, 55, 65, 85))(65,85]  4.512262  6.661936e-06

agelims = range(Wage$age)
age.grid = seq(from = agelims[1], to = agelims[2])
lm.pred = predict(fit, data.frame(age = age.grid)) # predicted wage for each age in the grid

plot(wage ~ age, data = Wage, col = "darkgrey")
lines(age.grid, lm.pred, col = "red", lwd = 2)     # the piecewise-constant (step function) fit