Regression Splines, Natural Splines, Smoothing Splines, and Local Regression



The most important concept for this class of non-linear modeling methods is the basis function. Instead of fitting a simple linear model in a variable X directly, we choose a fixed family of functions or transformations b1(X), b2(X), ..., bK(X) and fit a linear model in those transformed values. In other words, the basis functions subdivide the modeling problem: each bk captures one piece of the non-linear structure, and together they produce a fit that is more flexible than a straight line. Because the model is still linear in its coefficients, it is just a special case of the linear model, and we can use the same machinery - least squares, coefficient estimates, standard errors, and so on - to provide context for our predictions.
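Concretely, the basis function approach replaces the simple linear model with

    y_i = beta_0 + beta_1*b1(x_i) + beta_2*b2(x_i) + ... + beta_K*bK(x_i) + epsilon_i

where the bk are fixed, known functions chosen ahead of time. Polynomial regression is the special case bk(x) = x^k; the splines below simply make different choices of bk.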

Regression splines are a common use of basis functions, and the next step beyond polynomial regression. With a regression spline we keep the polynomial degree constant while changing the regions over which the polynomials are fit. What this means is that we can keep something like a cubic polynomial, while fitting several different regions of the data to create a combination of curves joined at points called knots. That is, while polynomial regressions produce flexible fits by using high degrees, regression splines produce flexibility by introducing knots and keeping the degree the same. Regression splines provide a more stable model, and generally better results, than polynomial regression.
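One standard way to write down the basis for a cubic spline with knots xi_1, ..., xi_K is the truncated power basis:

    b1(x) = x,   b2(x) = x^2,   b3(x) = x^3,
    b(3+k)(x) = (x - xi_k)^3 when x > xi_k, and 0 otherwise, for k = 1, ..., K

so a cubic spline with K knots uses K + 3 basis functions (plus an intercept). In practice, R's bs() function uses an equivalent but numerically more stable B-spline basis.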

Smoothing splines are the next step up, where the focus is on making the overall curve smooth. Fitting separate polynomials in each region would, by itself, produce a fit that is discontinuous at the knots; splines combat this by requiring the function and its first and second derivatives to be continuous there, smoothing out the overall fit. Smoothing splines go further and add a penalty, lambda, so that roughness - measured through the second derivative - is penalized across the whole fit. When lambda is 0 the penalty term has no effect and the spline will jump all over the place, interpolating the data; as the penalty approaches infinity, that is - gets bigger, the spline becomes very smooth, essentially a straight line.
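Formally, the smoothing spline is the function g that minimizes

    sum_i (y_i - g(x_i))^2 + lambda * integral( g''(t)^2 dt )

The first term rewards tracking the data; the second penalizes curvature. It turns out the minimizer is always a natural cubic spline with a knot at every unique value of x, shrunk toward smoothness by lambda.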

Local regression is another non-linear approach, and is similar in spirit to k-nearest-neighbors methods. To make a prediction at a given point, local regression computes a weighted least-squares fit using only the nearest neighboring data points. We specify the span - the fraction of the observations used in each local fit. Additionally, weights are assigned to every point in the neighborhood so that points at the edge of the neighborhood get weights near zero and the points closest to the target point get the highest weights; everything outside the specified span has a weight of exactly zero. It is important to note that local regression suffers from the curse of dimensionality, and is generally only useful and interpretable when the number of predictors is small, around p = 3 or 4 at most.
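The default weighting used by R's loess() is the tricube kernel: if d_i is the distance from the target point to observation i, rescaled so that the farthest point inside the span is at distance 1, then

    w_i = (1 - d_i^3)^3 for d_i < 1, and w_i = 0 otherwise

so the weights fall smoothly from 1 at the target point to 0 at the edge of the neighborhood.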



The data we are using comes from the Wage dataset in ISLR. This is a standard dataset containing variables about the factors that key into a person's wage - things like age, education, and so on. We also load the splines package, which provides the functions for building splines, along with ggplot2 and ggformula for plotting.

library(splines)
library(ISLR)
attach(Wage)
library(ggplot2)
library(ggformula)


First, a grid is created from the range of possible ages: a sequence covering every age between the lowest and the highest. This grid is what we hand to predict() so that the fitted spline can be evaluated across the full range of ages for plotting.

The fit uses ordinary lm() syntax, with bs() inside the formula to indicate that the model should be built on a spline basis. The bs() term is the most important piece here: it constructs the basis matrix that lm() then fits by least squares. Also important is the knots argument of bs(), especially when you want to control where the knots fall for a particular dataset. In this case we place knots at ages 25, 40, and 60, as these might be natural cuts between wage brackets. Finally we plot the spline and its error bands.

agelims = range(age)
age.grid = seq(from = agelims[1], to = agelims[2])
fit = lm(wage ~ bs(age, knots = c(25, 40, 60)), data = Wage)
pred = predict(fit, newdata = list(age = age.grid), se = T)


gg = ggplot() + geom_point(data = Wage, aes(age, wage), color = 'gray')
gg = gg + geom_line(aes(age.grid, pred$fit), size = .5)
gg = gg + geom_line(aes(age.grid, pred$fit + 1.96*pred$se), linetype = 'dashed', color = 'blue', size = .5)
gg = gg + geom_line(aes(age.grid, pred$fit - 1.96*pred$se), linetype = 'dashed', color = 'blue', size = .5)
gg



In the graph, it is important to note that the error bands flare out and become wild near the ends of the age range. There are fewer observations in these regions, so the fit is more uncertain there, and splines generally produce large error bands at the boundaries. This is a consequence of the flexible, non-linear fit combined with sparse data - and it is exactly the problem that natural splines, covered below, are designed to tame.



In this code, we show that the basis matrix has the same dimensions whether we specify the knot locations or let R pick them. The first line uses the manually entered knot locations, while the second line lets R place knots at the 25th, 50th, and 75th percentiles of the data. A cubic spline with three knots uses 3 + 3 = 6 basis functions, so in both cases dim() returns 3000 rows and df = 6 columns. Finally, the knots attribute shows the percentile-based knots R picked: ages 33.75, 42, and 51.

dim(bs(age, knots = c(25, 40, 60)))
[1] 3000    6
dim(bs(age, df = 6))
[1] 3000    6
attr(bs(age, df = 6), "knots")
  25%   50%   75% 
33.75 42.00 51.00 

This concludes regression splines, which are pretty straightforward in that we build a linear model in R with a bs() term specifying the regression spline basis. Next up we have natural splines.





Natural splines are regression splines with one extra constraint: the fit is required to be linear beyond the boundary knots, which reins in the wild behavior and wide error bands that ordinary splines show at the ends of the data. Here we fit the same data, but using the ns() function to specify that we want a natural spline. We pass the argument df = 4 to request 4 degrees of freedom, letting R place the interior knots at quantiles of age.

fit2 = lm(wage ~ ns(age, df = 4), data = Wage)
pred2 = predict(fit2, newdata = list(age = age.grid), se = T)


gg = ggplot() + geom_point(data = Wage, aes(age, wage), color = 'gray') + ggtitle('Natural Spline')
gg = gg + geom_line(aes(age.grid, pred2$fit), size = .5)
gg = gg + geom_line(aes(age.grid, pred2$fit + 1.96*pred2$se), linetype = 'dashed', color = 'blue', size = .5)
gg = gg + geom_line(aes(age.grid, pred2$fit - 1.96*pred2$se), linetype = 'dashed', color = 'blue', size = .5)
gg
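As with bs(), we can check where ns() placed its interior knots - with df = 4 it defaults to three interior knots at the quartiles of age:

attr(ns(age, df = 4), "knots")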



Smoothing splines can be fit using the smooth.spline() function, which lives in base R's stats package rather than splines. The first fit produces a smoothing spline with 16 degrees of freedom, while the second fit uses cross-validation to find the best amount of smoothing; we can view the CV-selected degrees of freedom with fit2$df. You will notice that the degrees of freedom seem a little wonky, and that's because they are: the lambda penalty shrinks the fit rather than adding or dropping terms, so the effective degrees of freedom tend not to be whole numbers. The red line has df = 16, while the blue line has df = 6.8 from the CV approach.

# Base-R equivalent of the ggformula plot below, kept for reference:
# plot(age, wage, xlim = agelims, cex = .5, col = "darkgrey")
# title("Smoothing Spline")
# fit = smooth.spline(age, wage, df = 16)
# fit2 = smooth.spline(age, wage, cv = TRUE)
# fit2$df
# lines(fit, col = "red", lwd = 2)
# lines(fit2, col = "blue", lwd = 2)
# legend("topright", legend = c("16 DF", "6.8 DF"), col = c("red", "blue"), lty = 1, lwd = 2, cex = .8)

gg = ggplot(Wage, aes(age, wage)) + geom_point(color = 'gray')
gg = gg + geom_spline(aes(age,wage),color = 'red', df = 16, all.knots = T)
gg = gg + geom_spline(aes(age,wage),color = 'blue', all.knots = T, cv = T)
gg = gg + ggtitle('Smoothing Spline')
gg
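Since the base-R block above is commented out, the two lines worth running on their own to see the CV-selected degrees of freedom are:

fit2 = smooth.spline(age, wage, cv = TRUE)  # leave-one-out CV; R may warn about tied age values
fit2$df                                     # effective degrees of freedom chosen by CV (about 6.8 here)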



Finally, we show a way to fit local regressions with the loess() function. We perform a local linear regression using spans of .2 and .5; in other words, each local fit uses the neighborhood consisting of the 20% or 50% of observations nearest the target point. The larger the span, the smoother the fit.

plot(age, wage, xlim = agelims, cex = .5, col = "darkgrey")
title("Local Regression")
fit = loess(wage ~ age, span = .2, data = Wage)
fit2 = loess(wage ~ age, span = .5, data = Wage)
lines(age.grid, predict(fit, data.frame(age = age.grid)), col = "red", lwd = 2)
lines(age.grid, predict(fit2, data.frame(age = age.grid)), col = "blue", lwd = 2)
legend("topright", legend = c("Span=0.2", "Span=0.5"), col = c("red", "blue"), lty = 1, lwd = 2, cex = .8)
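For consistency with the earlier plots, the same comparison can be sketched in ggplot2 using geom_smooth(), which calls loess() internally when method = 'loess' (the span and color choices here simply mirror the base-R legend above):

gg = ggplot(Wage, aes(age, wage)) + geom_point(color = 'gray')
gg = gg + geom_smooth(method = 'loess', span = .2, se = FALSE, color = 'red')
gg = gg + geom_smooth(method = 'loess', span = .5, se = FALSE, color = 'blue')
gg = gg + ggtitle('Local Regression')
gg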