In this fourth lesson of the Engineering Computations module on deep learning, we play with a different model-building method: polynomial regression. We already saw one case in which the observational data could not be fit by a line, and that was the classification problem that we tackled with logistic regression (Lesson 3). Now imagine that your data wiggles about on the x,y plane—where x is the independent variable or feature, and y is the dependent variable. You look at it and think: a curvilinear relationship might work. Good idea.
It may surprise you to learn that fitting a polynomial to the data is a special case of multiple linear regression. Yes: linear. When we talk about linear models, we mean linear with respect to the parameters!
Let's generate some synthetic data using a polynomial function of fourth order, $y = x^4 + x^3 - 4x^2$, with a bit of added noise.
As usual, we start by loading some needed Python libraries and functions, including the useful tools from autograd that you've learned about already.
from matplotlib import pyplot
from autograd import grad
from autograd import numpy
numpy.random.seed(0) # fix seed for reproducibility
x = numpy.linspace(-3, 3, 20)
y = x**4 + x**3 - 4*x**2 + 8*numpy.random.normal(size=len(x))
pyplot.scatter(x, y);
Suppose we were only given the data (not the function that generated them), and our goal is to fit a curve to these data. The nonlinear relationship between x and y suggests that a straight-line fit will fail. Intuitively, polynomial functions may be the first idea that comes to mind.
Let's write the model as a $d$th-order polynomial in $x$, the only feature:

$$\hat{y} = w_0 + w_1 x + w_2 x^2 + \cdots + w_d x^d,$$

where $w$ denotes the weights. Keep in mind that in the model-fitting context, the objective is always to find the optimal values of these weights given $x$ and $y$. Viewed from a different perspective, the model above is just a linear combination of the weights. In fact, by creating polynomial features of $x$, namely, letting $x_i = x^i$, the model becomes:

$$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_d x_d.$$

As you can see, the polynomial regression model is identical to multiple linear regression, with the matrix form also being $\hat{y} = Xw$; the only extra step is forming the matrix $X$ from the powers of $x$.
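To make this concrete, here is a sketch of the design matrix for our single-feature case: for $N$ observations (written $x^{(1)}, \dots, x^{(N)}$ to avoid clashing with the feature subscripts above), each row holds the powers of one data point:

$$X = \begin{bmatrix} 1 & x^{(1)} & \left(x^{(1)}\right)^2 & \cdots & \left(x^{(1)}\right)^d \\ 1 & x^{(2)} & \left(x^{(2)}\right)^2 & \cdots & \left(x^{(2)}\right)^d \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x^{(N)} & \left(x^{(N)}\right)^2 & \cdots & \left(x^{(N)}\right)^d \end{bmatrix}$$

This is exactly the matrix that the function below builds.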
Suppose we want to fit our data with a 3rd degree polynomial function; let's write a function to create these polynomial features first.
degree = 3
def polynomial_features(x, degree):
    """Generate polynomial features for x."""
    X = numpy.empty((len(x), degree+1))
    for i in range(degree+1):
        X[:, i] = x**i   # column i holds x raised to the power i
    return X
X = polynomial_features(x, degree)
print(X.shape)
(20, 4)
Unsurprisingly, scikit-learn offers a counterpart: PolynomialFeatures().
When your original data come with multiple features (e.g., x1 and x2), the polynomial features for regression will involve interaction terms (e.g., x1x2 for 2nd-order polynomials). In that case, it is handy to use this function to generate all terms.
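For instance, here is a minimal sketch of PolynomialFeatures() at work on a single made-up sample with two features (the values 2 and 3 are arbitrary, chosen only to show which column is which; we reuse the numpy already imported above):

from sklearn.preprocessing import PolynomialFeatures

two_features = numpy.array([[2.0, 3.0]])     # one sample: x1 = 2, x2 = 3 (made-up values)
poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(two_features))      # columns: 1, x1, x2, x1^2, x1*x2, x2^2

Notice that the interaction column x1*x2 appears automatically, which is what makes this function handy for multi-feature data.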
Recall that for multiple linear regression, we should normalize each feature to the same scale. As in the previous lesson, let's use MinMaxScaler() to scale all features to $[0,1]$, except for the first column: $x_0$ is set to 1 for all entries, since $w_0$ represents the intercept.
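For reference, min-max scaling transforms each feature column using that column's own minimum and maximum:

$$x_{\text{scaled}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

The first column is constant (all ones), so this formula is degenerate there; that is why the code below simply overwrites that column with 1 after scaling.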
from sklearn.preprocessing import MinMaxScaler
min_max_scaler = MinMaxScaler()
X_scaled = min_max_scaler.fit_transform(X)
X_scaled[:,0] = 1 # the column for intercept
When scaling the matrix of polynomial features X above, we used:

X_scaled = min_max_scaler.fit_transform(X)

The function fit_transform() is actually a combination of two steps:

1. fit(): compute the $x_{\min}$ and $x_{\max}$ values of each feature and save them in the variable min_max_scaler. You can access them via min_max_scaler.data_min_ and min_max_scaler.data_max_.
2. transform(): use the saved min_max_scaler to scale X.

This will be helpful to remember a bit later.
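Here is a minimal sketch of that two-step equivalence, reusing the X and X_scaled arrays defined above:

scaler_two_step = MinMaxScaler()
scaler_two_step.fit(X)                       # step 1: learn each column's min and max
print(scaler_two_step.data_min_)             # per-feature minimum values
print(scaler_two_step.data_max_)             # per-feature maximum values
X_two_step = scaler_two_step.transform(X)    # step 2: apply the saved min and max
# matches fit_transform(), apart from the intercept column we overwrote above
print(numpy.allclose(X_two_step[:, 1:], X_scaled[:, 1:]))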
We can reuse the same model and loss function from Lesson 3:
def linear_regression(params, X):
    '''
    The linear regression model in matrix form.

    Arguments:
      params: 1D array of weights for the linear model
      X     : 2D array of input values

    Returns:
      1D array of predicted values
    '''
    return numpy.dot(X, params)

def mse_loss(params, model, X, y):
    '''
    The mean squared error loss function.

    Arguments:
      params: 1D array of weights for the linear model
      model : function for the linear regression model
      X     : 2D array of input values
      y     : 1D array of observed output values

    Returns:
      float, mean squared error
    '''
    y_pred = model(params, X)
    return numpy.mean( numpy.sum((y-y_pred)**2) )
gradient = grad(mse_loss)
Remember, "training" a model simply means finding the best parameters by minimizing a loss function. We'll choose both a maximum number of iterations in the optimization loop, and an exit criterion based on the norm of the gradient at the current iteration. (If this is quite small, when multiplied by the also-small learning rate, the parameters will change very little.)
max_iter = 3000
alpha = 0.01
params = numpy.zeros(X_scaled.shape[1])
descent = numpy.ones(X_scaled.shape[1])
i = 0
from sklearn.metrics import mean_absolute_error
while numpy.linalg.norm(descent) > 0.01 and i < max_iter:
    descent = gradient(params, linear_regression, X_scaled, y)
    params = params - descent * alpha
    loss = mse_loss(params, linear_regression, X_scaled, y)
    mae = mean_absolute_error(y, X_scaled@params)
    if i%100 == 0:
        print(f"iteration {i:4}, {loss = :.3f}, {mae = :.3f}")
    i += 1
iteration 0, loss = 5434.768, mae = 11.057
iteration 100, loss = 1300.477, mae = 6.885
iteration 200, loss = 1281.309, mae = 6.864
iteration 300, loss = 1272.990, mae = 6.808
iteration 400, loss = 1267.448, mae = 6.760
iteration 500, loss = 1263.750, mae = 6.722
iteration 600, loss = 1261.282, mae = 6.690
iteration 700, loss = 1259.636, mae = 6.664
iteration 800, loss = 1258.537, mae = 6.643
iteration 900, loss = 1257.804, mae = 6.625
iteration 1000, loss = 1257.314, mae = 6.611
iteration 1100, loss = 1256.988, mae = 6.600
iteration 1200, loss = 1256.770, mae = 6.590
iteration 1300, loss = 1256.625, mae = 6.583
iteration 1400, loss = 1256.528, mae = 6.576
iteration 1500, loss = 1256.463, mae = 6.571
iteration 1600, loss = 1256.420, mae = 6.567
iteration 1700, loss = 1256.391, mae = 6.564
iteration 1800, loss = 1256.372, mae = 6.561
iteration 1900, loss = 1256.359, mae = 6.558
iteration 2000, loss = 1256.350, mae = 6.557
iteration 2100, loss = 1256.345, mae = 6.555
iteration 2200, loss = 1256.341, mae = 6.554
iteration 2300, loss = 1256.338, mae = 6.553
iteration 2400, loss = 1256.337, mae = 6.552
iteration 2500, loss = 1256.336, mae = 6.551
iteration 2600, loss = 1256.335, mae = 6.551
iteration 2700, loss = 1256.334, mae = 6.550
iteration 2800, loss = 1256.334, mae = 6.550
iteration 2900, loss = 1256.334, mae = 6.550
Let's print out the weights.
params
array([-22.51572398, 6.75930601, 41.30788709, 30.0105898 ])
The first term is the intercept, and the rest are the weights of each scaled polynomial feature.
Although our model has multiple weights, there's only one feature in our original data. Therefore, if we are given new values of x to predict y, we need to first create the polynomial features, scale them to $[0,1]$ using the same scaling function, and then multiply by the weights. Since we used min-max scaling, it is important to use the same $x_{\min}$ and $x_{\max}$ of the training data to scale the new data.
Recall the explanation above for the scaling step. We now only need to call min_max_scaler.transform() to scale new data. This design pattern occurs very often in scikit-learn.
With that in mind, let's plot the fitted curve together with the data. We generate some new values of x, xgrid, and predict y at these locations. Don't forget to repeat the procedure of creating polynomial features and scaling them. Ponder over this code for a moment to wrap your head around it.
xgrid = numpy.linspace(x.min(), x.max(), 30)
Xgrid_poly_feat = polynomial_features(xgrid, degree)
Xgrid_scaled = min_max_scaler.transform(Xgrid_poly_feat)
Xgrid_scaled[:,0] = 1
pyplot.scatter(x, y, c='r', label='true')
pyplot.plot(xgrid, Xgrid_scaled@params, label='predicted')
pyplot.legend();
In the model above, we just randomly picked a polynomial degree of 3 for the fitted curve. Is it good enough to model our dataset? Should we try higher-order polynomials?
We can repeat our study with different polynomial degrees varying from 1 to 15, and see what happens.
To facilitate this task, we provide you with a script to train the model and plot these fitted curves interactively using ipywidgets. Run the cell below and drag the slider to see how the curve changes.
import sys
sys.path.append('../scripts/')
from plot_helpers import interact_polyreg
max_degree = 15
interact_polyreg(max_degree, x, y)
[interactive output: slider widget to vary the polynomial degree from 1 to 15]
When degree is 1, the straight line clearly fails to capture the underlying relationship between x and y. Specifically, a line is too simple to explain how far the data are spread out, namely, the variance in the data. We say that the linear model underfits the data in this case.
Underfitting happens when the model is too naive or the weights need to be trained with more iterations. It is often easy to detect, since a large training error is a good indicator.
Would having more training data help resolve underfitting?
As we increase the polynomial degree, the training error (MAE from the figure title) keeps decreasing. But does it mean that the 15th-order polynomial gives the best fit? Probably not.
Drag the slider to the right, and you will find that the curve passes exactly through many points and looks very odd. If we are given new data, this model may not predict well because it fits too closely to the old data. Real-world data tend to be noisy (due to missing and erroneous values), which is why our synthetic data also have noise added. Models with high polynomial degrees are so flexible that they fit the noise rather than the true relationship! In this case, these models are overfitting, or have high variance. Overfitting usually happens when the model is overly complicated. The high-order polynomials in our example have too many degrees of freedom (weights) for our data.
Would having more training data help resolve overfitting?
Compared to underfitting, overfitting is in general harder to detect because the training error tends to be small. We will discuss how to identify overfitting later in this module. Now let's focus on how to prevent overfitting using regularization.
Regularization is used to reduce or avoid overfitting. The idea is to introduce a term in the loss function that penalizes complicated models. In our polynomial regression model:
$$\hat{y} = w_0 + w_1 x + w_2 x^2 + \cdots + w_d x^d,$$

every polynomial term contributes to the overall complexity of the model. One idea is to add constraints to the magnitudes of these weights.

A common approach is to add a regularization term $\lambda \sum_{j=1}^{d} w_j^2$ to the mean-squared error loss:

$$L(w) = \frac{1}{N} \left\| y - Xw \right\|^2 + \lambda \sum_{j=1}^{d} w_j^2$$

This new loss function favors smaller weights, because larger weights increase the term on the right. With smaller weights, the model is less likely to overfit through large-amplitude higher-order polynomial terms.
Notice that we don't penalize the intercept w0 in the regularization term. This can be justified mathematically, but one way to think about it is that a constant term won't affect the overall model complexity.
Above, $\lambda$ denotes the regularization parameter, or the strength of the penalty. It controls the tradeoff between fitting the data well (minimizing the first term) and keeping the model simple to avoid overfitting (minimizing the second term). When $\lambda \to 0$, the loss function falls back to the standard mean-squared error loss. When $\lambda$ is large, all penalized weights will be close to 0 after training, and our model approaches the constant value $w_0$.
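To see the shrinkage mechanism explicitly, here is the gradient of the loss above with respect to a penalized weight $w_j$ (for $j \geq 1$), written as a small sketch with $x_n$ denoting the $n$-th data point:

$$\frac{\partial L}{\partial w_j} = -\frac{2}{N} \sum_{n=1}^{N} \left( y_n - \hat{y}_n \right) x_n^{\,j} + 2 \lambda w_j$$

The extra $2\lambda w_j$ term pulls that weight toward zero at every gradient-descent step, and its pull grows with $\lambda$.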
This regularization is also known as the L2 penalty, or Tikhonov regularization, and the resulting model-building method is called ridge regression.

Let's code the regularized mean-squared error loss and set $\lambda = 1$.
def regularized_loss(params, model, X, y, _lambda=1.0):
    '''
    The mean squared error loss function with an L2 penalty.

    Arguments:
      params : 1D array of weights for the linear model
      model  : function for the linear regression model
      X      : 2D array of input values
      y      : 1D array of observed output values
      _lambda: regularization parameter, default 1.0

    Returns:
      float, regularized mean squared error
    '''
    y_pred = model(params, X)
    # note: the intercept params[0] is not penalized
    return numpy.mean( numpy.sum((y-y_pred)**2) ) + _lambda * numpy.sum( params[1:]**2 )
gradient = grad(regularized_loss)
no_regularization_params = params.copy()
And train the 3rd-degree polynomial model using gradient descent.
max_iter = 3000
alpha = 0.01
params = numpy.zeros(X_scaled.shape[1])
descent = numpy.ones(X_scaled.shape[1])
i = 0
from sklearn.metrics import mean_absolute_error
while numpy.linalg.norm(descent) > 0.01 and i < max_iter:
    descent = gradient(params, linear_regression, X_scaled, y)
    params = params - descent * alpha
    loss = mse_loss(params, linear_regression, X_scaled, y)
    mae = mean_absolute_error(y, X_scaled@params)
    if i%100 == 0:
        print(f"iteration {i:4}, {loss = :.3f}, {mae = }")
    i += 1
iteration 0, loss = 5434.768, mae = 11.05718775676392
iteration 100, loss = 1785.985, mae = 6.983886996350374
iteration 200, loss = 1764.260, mae = 6.966518785947395
iteration 300, loss = 1763.570, mae = 6.965728405272657
Let's compare the optimal weights before and after regularization. We can reuse the xgrid to plot both curves.
print("weights without regularization")
print(no_regularization_params)
print("weights with regularization")
print(params)
pyplot.scatter(x, y, c='r', label='true')
pyplot.plot(xgrid, Xgrid_scaled@no_regularization_params, label='w/o regularization')
pyplot.plot(xgrid, Xgrid_scaled@params, label='with regularization')
pyplot.legend();
weights without regularization
[-22.51572398   6.75930601  41.30788709  30.0105898 ]
weights with regularization
[-11.13750882  12.48522096  28.26626633  11.09211867]
Let's use our helper script again to display both models with varying polynomial degree in an ipywidgets slider, and interact with it to explore the results. We set regularized=True this time.
interact_polyreg(max_degree, x, y, regularized=True)
[interactive output: slider widget to vary the polynomial degree from 1 to 15, now with regularization]
Thanks to the regularization term, you won't see the wiggling curves even with a high-degree polynomial, since the magnitudes of the weights (the coefficients of each polynomial term) are much smaller.
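You can also see this numerically for the 3rd-degree model we trained by hand; here is a quick sketch comparing the size of the penalized weights, reusing the arrays already in memory:

# L2 norm of the weights, excluding the intercept
print(numpy.linalg.norm(no_regularization_params[1:]))   # without regularization
print(numpy.linalg.norm(params[1:]))                      # with regularization: noticeably smaller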
You won't be surprised by now that scikit-learn offers a method to obtain a linear model with regularization: Ridge().
Look at and ponder the code cell below, where we call Ridge() and its .fit() method to fit the ridge regression model with our scaled matrix of features (also called the design matrix) and the y data. The default value of the penalization parameter is 1.0, so we don't specify it, but if you want to try a different value, you pass it as alpha=value in the argument list. Do explore the documentation page.
from sklearn.linear_model import Ridge
model = Ridge().fit(X_scaled[:, 1:], y) # Ridge() by default fits an intercept
y_pred_sklearn = model.predict(Xgrid_scaled[:,1:])
print(model.coef_)
print(model.intercept_)
pyplot.scatter(x, y, c='r', label='true')
pyplot.plot(xgrid, y_pred_sklearn, label='sklearn ridge regression')
pyplot.legend();
[12.48291164 28.26642615 11.09583654]
-11.138315887456029
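As noted above, you can pass a different penalty strength through alpha. A minimal sketch (the value 10.0 is arbitrary, purely for illustration; Ridge is already imported above):

# a stronger penalty shrinks the polynomial weights further toward zero
strong_model = Ridge(alpha=10.0).fit(X_scaled[:, 1:], y)
print(strong_model.coef_)        # expect smaller magnitudes than with the default alpha=1.0
print(strong_model.intercept_)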
The model variable we created above using Ridge() stores the model weights and the intercept in the .coef_ and .intercept_ attributes.
Compare the values with those we obtained above using the regularization and 3rd-order polynomial. Pretty close!
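If you want to double-check the correspondence yourself, here is a small sketch that rebuilds the scikit-learn predictions by hand from the stored attributes and compares them with our gradient-descent fit (reusing Xgrid_scaled, model, and params from above):

manual_pred = Xgrid_scaled[:, 1:] @ model.coef_ + model.intercept_
print(numpy.allclose(manual_pred, model.predict(Xgrid_scaled[:, 1:])))   # True: same model
print(numpy.max(numpy.abs(manual_pred - Xgrid_scaled @ params)))         # small gap vs. our fit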
Use scikit-learn's Ridge() to fit a linear model with regularization.