Linear Functions and Errors¶
You might remember a linear function of the form
$$f(x) = mx + n$$
from school. It has one argument, $x$, and two constants, $m$ for the slope and $n$ for the intersection with the $y$-axis.
This is what it looks like when we plot it:
Source
################################
# RUN, THEN COLLAPSE THIS CELL #
################################
import matplotlib.pyplot as plt
m = 2
n = 1
f = lambda x: m*x+n
plt.plot([0, 5], [f(0), f(5)])
plt.xlim(0, 5)
plt.ylim(0, f(5))
plt.xlabel("x")
plt.ylabel("y")
plt.title(f"y=mx+n (m={m}, n={n})")
plt.show()
A linear regression model for a single independent variable is actually the same function; we just write it a little differently as
$$\hat{y} = \beta_0 + \beta_1 x.$$
As a linear regression model, this function does not stand alone: it comes with a dataset $\{(x_i, y_i)\}_{i=1}^N$ that it is fitted to. Fitting here means selecting the constants $\beta_0$ and $\beta_1$ such that $\hat{y}_i = \beta_0 + \beta_1 x_i$ is as close to $y_i$ as possible for all $i$.
In this setting, we call $x$ the independent variable and $y$ the dependent variable, because we assume that its values depend on the values of $x$. We say that $\hat{y}$ is an estimate of $y$ for any given value of $x$. Note that the linear regression model computes this estimate using only linear operations on $x$.
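Before turning to sklearn, it may help to see what "as close as possible" means concretely: the usual criterion is the sum of squared errors, and for a single independent variable the best constants have a simple closed form (the standard least-squares estimates). A minimal sketch on hypothetical toy arrays:
import numpy as np
# hypothetical toy data roughly following y = 2x + 1
x_toy = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_toy = np.array([1.1, 2.9, 5.2, 7.1, 8.8])
# closed-form least-squares estimates for slope and intercept
slope = np.sum((x_toy - x_toy.mean()) * (y_toy - y_toy.mean())) / np.sum((x_toy - x_toy.mean()) ** 2)
intercept = y_toy.mean() - slope * x_toy.mean()
print(slope, intercept)  # close to 2 and 1 for this toy data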
The following code (skip the plotting function) demonstrates how easy it is to obtain a fitted linear regression model using sklearn given some data.
You can see that the resulting model is really just the linear function you know from school.
Feel free to tamper with the values of model.coef_ and model.intercept_ after fitting to see how the plot and the RSS change.
Source
################################
# RUN, THEN COLLAPSE THIS CELL #
################################
import numpy as np
def plot_iris_model(model, X, y, y_pred, ax):
# sort X, y and y_pred by X
sort_idx = np.argsort(X[:, 0])
X = X[sort_idx, :]
y = y[sort_idx]
y_pred = y_pred[sort_idx]
# Plot data and fitted line
ax.plot(X, y_pred, color="blue", alpha=0.7, label=f"$y={model.coef_[0]:.2f}x+{model.intercept_:.2f}$")
ax.scatter(X, y, label="Data", color="red", alpha=0.5, zorder=10)
# Plot residuals
for i in range(len(X)):
ax.vlines(X[i], y_pred[i], y[i], color="gray", alpha=0.5, zorder=-10)
ax.set_xlabel("Sepal width (cm)")
ax.set_ylabel("Petal length (cm)")
ax.set_title("Linear Fit: Petal Length vs Sepal Width")
    ax.legend()
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
def RSS(y, y_pred):
return np.sum(np.square(y-y_pred))
np.random.seed(0)
# Load iris data and select values for species Iris-Versicolour and Iris-Virginica
iris = load_iris()
X = iris.data[50:, 1].reshape(-1, 1) # sepal width
X += np.random.random(X.shape)*0.1 # add some noise for better plotting
y = iris.data[50:, 2] # petal length
# Fit linear model
model = LinearRegression()
model.fit(X, y)
# Try some other values here and see how that changes the model and the RSS below.
# model.coef_ = np.array([2.])
# model.intercept_ = 1.
y_pred = model.predict(X)
print(f"RSS = {RSS(y, y_pred):.2f}")
fig, ax = plt.subplots(1, 1)
plot_iris_model(model, X, y, y_pred, ax)
plt.show()
RSS = 49.76
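For instance, here is a minimal sketch of such an experiment that leaves the fitted model untouched and just evaluates a hand-picked slope and intercept (the values 2 and 1 are arbitrary; X, y and RSS come from the cell above):
# evaluate a hand-picked line instead of the fitted one
slope, intercept = 2.0, 1.0
tampered_pred = slope * X[:, 0] + intercept
print(f"tampered RSS = {RSS(y, tampered_pred):.2f}")  # at least as large as the fitted RSS, since OLS minimizes it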

We can preprocess the independent variable in any way we want before passing it to the linear regression model. This includes applying a mathematical function, which can make the predictions seem nonlinear. They are still linear w.r.t. the model input (since we still fit a linear model), but the model input might not be linear w.r.t. the originally observed variable anymore. Note that the nonlinearity is fixed, though, and cannot be learned.
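For instance, a minimal sketch that fits the same kind of model on log-transformed inputs (the choice of the logarithm is arbitrary here; X, y and RSS are reused from above):
# fit on log(x) instead of x; the model is still linear in its input
log_model = LinearRegression().fit(np.log(X), y)
print(f"RSS = {RSS(y, log_model.predict(np.log(X))):.2f}")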
For example, we could try to fit a linear model on different powers of $x$.
fig, axs = plt.subplots(1, 4, figsize=(20, 4))
for power in range(1, 5):
power_model = LinearRegression()
power_model.fit(X**power, y)
power_y_pred = power_model.predict(X**power)
print(f"power = {power}, RSS = {RSS(y, power_y_pred):.2f}")
plot_iris_model(power_model, X, y, power_y_pred, axs[power-1])
plt.show()
To gain some more flexibility w.r.t. the nonlinearity, we can simply add the results of applying many different nonlinear functions to our original $x$ as additional input features. This is called a basis expansion, and it allows our linear model to select from different nonlinearly transformed versions of $x$.
If we apply this idea to our power example from before, we allow the model to create its predictions from a linear combination of different powers of $x$. Thereby we obtain a linear model that effectively learns a polynomial function of fixed degree.
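As an aside, scikit-learn also ships a ready-made transformer for this power expansion, so you don't have to build it by hand; a minimal sketch (degree 3 is an arbitrary choice; X, y and RSS are reused from above, and the next cells construct the same kind of expansion explicitly):
from sklearn.preprocessing import PolynomialFeatures
# expand x into the columns x, x^2, x^3; the intercept is fitted by the model itself
expansion = PolynomialFeatures(degree=3, include_bias=False)
X_poly = expansion.fit_transform(X)
poly3_model = LinearRegression().fit(X_poly, y)
print(f"RSS = {RSS(y, poly3_model.predict(X_poly)):.2f}")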
See for yourself:
Source
################################
# RUN, THEN COLLAPSE THIS CELL #
################################
def plot_polynomial_iris_model(model, X, y, y_pred, ax):
# sort X, y and y_pred by X
sort_idx = np.argsort(X[:, 0])
X = X[sort_idx, :]
y = y[sort_idx]
y_pred = y_pred[sort_idx]
X_samples = np.linspace(X.min(), X.max(), 1000).reshape(-1, 1)
X_samples_powers = np.hstack([X_samples**power for power in range(1, polynomial_degree+1)])
y_samples = model.predict(X_samples_powers)
# Plot data and fitted line
ax.plot(X_samples, y_samples, color="blue", alpha=0.7, label="Fitted model")
ax.scatter(X, y, label="Data", color="red", alpha=0.5, zorder=10)
# Plot residuals
for i in range(len(X)):
ax.vlines(X[i], y_pred[i], y[i], color="gray", alpha=0.5, zorder=-10)
ax.set_ylim(y.min()*0.9, y.max()*1.1)
ax.set_xlabel("Sepal width (cm)")
ax.set_ylabel("Petal length (cm)")
ax.set_title("Linear Fit: Petal Length vs Sepal Width")
    ax.legend()
polynomial_degree = 10
X_powers = np.hstack([X**power for power in range(1, polynomial_degree+1)])
polynomial_model = LinearRegression()
polynomial_model.fit(X_powers, y)
print(f"intercept = {polynomial_model.intercept_}, coefficients = {polynomial_model.coef_}")
polynomial_y_pred = polynomial_model.predict(X_powers)
print(f"polynomial degree = {polynomial_degree}, RSS = {RSS(y, polynomial_y_pred):.2f}")
fig, ax = plt.subplots(1, 1)
plot_polynomial_iris_model(polynomial_model, X, y, polynomial_y_pred, ax)
If you played around with the degree of the polynomial in the last code cell, you might have noticed that the fit of this polynomial model is not as good as one might expect. This is because the exponents in the expanded features increase only one step at a time, which makes features based on high powers very similar to each other. They get so similar that they are almost linearly dependent. This creates numerical problems when computing the optimal coefficients, limiting them to small values.
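A quick way to see the problem (a minimal check reusing X_powers from the cell above): neighbouring high-power features are almost perfectly correlated, and the expanded design matrix is severely ill-conditioned.
# correlation between the x^9 and x^10 columns, very close to 1
print(np.corrcoef(X_powers[:, -2], X_powers[:, -1])[0, 1])
# condition number of the expanded design matrix, very large
print(np.linalg.cond(X_powers))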
This problem can be avoided by choosing functions that result in less similar features, such as radial basis functions.
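For illustration, here is a minimal sketch of a Gaussian radial basis expansion (the number of centers and the width gamma are arbitrary choices for this example, not prescribed above; X, y and RSS are reused from earlier cells):
# place a few Gaussian bumps along the observed range of x
centers = np.linspace(X.min(), X.max(), 5)
gamma = 10.0  # controls the width of each bump
def rbf_expansion(X):
    # one feature per center: exp(-gamma * (x - center)^2)
    return np.exp(-gamma * (X - centers.reshape(1, -1)) ** 2)
rbf_model = LinearRegression().fit(rbf_expansion(X), y)
print(f"RSS = {RSS(y, rbf_model.predict(rbf_expansion(X))):.2f}")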
A very simple choice of other functions is the set of indicator functions for a set of distinct intervals (also called bins, which you might know from histograms). An indicator function returns 1 if the input lies in a predefined set (the interval in this case), and 0 otherwise. So for a given input, the linear model just knows in which interval $x$ lies. For any single interval, the linear model therefore simply learns to predict the mean of $y$ over all training points in that interval.
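To convince yourself of this, here is a minimal, self-contained sketch on hypothetical toy data (independent of the Iris cells below). The intercept is switched off so that each coefficient is directly readable as a bin mean; with an intercept, the predictions are the same but the individual coefficients are no longer unique.
import numpy as np
from sklearn.linear_model import LinearRegression
rng = np.random.default_rng(0)
x_demo = rng.uniform(0.0, 1.0, size=50)
y_demo = np.sin(2 * np.pi * x_demo) + rng.normal(0.0, 0.1, size=50)
# two bins: x < 0.5 and x >= 0.5, encoded as one indicator column each
bins_demo = np.stack([x_demo < 0.5, x_demo >= 0.5], axis=1).astype(float)
# without an intercept, each coefficient is exactly the mean of y in its bin
demo_model = LinearRegression(fit_intercept=False).fit(bins_demo, y_demo)
print(demo_model.coef_)
print(y_demo[x_demo < 0.5].mean(), y_demo[x_demo >= 0.5].mean())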
This is what such a model would look like on our Iris data:
Source
################################
# RUN, THEN COLLAPSE THIS CELL #
################################
def one_hot_binning(x: np.ndarray, b: int) -> np.ndarray:
x = x.ravel()
bins = np.linspace(x.min(), x.max(), b + 1)
indices = np.digitize(x, bins, right=False) - 1
indices = np.clip(indices, 0, b - 1)
one_hot = np.eye(b)[indices]
return one_hot
def plot_binned_iris_model(model, X, y, y_pred, bins, ax):
# sort X, y and y_pred by X
sort_idx = np.argsort(X[:, 0])
X = X[sort_idx, :]
y = y[sort_idx]
y_pred = y_pred[sort_idx]
X_samples = np.linspace(X.min(), X.max(), 1000).reshape(-1, 1)
X_samples_binned = one_hot_binning(X_samples, bins)
y_samples = model.predict(X_samples_binned)
# Plot data and fitted line
ax.plot(X_samples, y_samples, color="blue", alpha=0.7, label="Fitted model")
ax.scatter(X, y, label="Data", color="red", alpha=0.5, zorder=10)
# Plot residuals
for i in range(len(X)):
ax.vlines(X[i], y_pred[i], y[i], color="gray", alpha=0.5, zorder=-10)
ax.set_ylim(y.min()*0.9, y.max()*1.1)
ax.set_xlabel("Sepal width (cm)")
ax.set_ylabel("Petal length (cm)")
ax.set_title("Linear Fit: Petal Length vs Sepal Width")
    ax.legend()
bins = 10
X_binned = one_hot_binning(X, bins)
bins_model = LinearRegression()
bins_model.fit(X_binned, y)
bins_y_pred = bins_model.predict(X_binned)
print(f"#bins = {bins}, RSS = {np.linalg.norm(y-bins_y_pred)**2:.2f}")
fig, ax = plt.subplots(1, 1)
plot_binned_iris_model(bins_model, X, y, bins_y_pred, bins, ax)
How to Find $\hat{\boldsymbol{\beta}}$¶
We can write the RSS in matrix notation as
$$\mathrm{RSS}(\boldsymbol{\beta}) = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}).$$
The minimum of this function is the $\boldsymbol{\beta}$ that results in the smallest overall distance of the model predictions to the actual target values.
The gradient of this function is
$$\frac{\partial\,\mathrm{RSS}}{\partial \boldsymbol{\beta}} = -2\,\mathbf{X}^\top (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}).$$
The RSS is optimal where its gradient is zero (Hastie et al., 2009):
$$\mathbf{X}^\top (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) = \mathbf{0}.$$
With a quick rearrangement for $\boldsymbol{\beta}$, we find the ordinary least squares (OLS) solution to the optimization problem of minimizing the RSS:
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}.$$
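As a quick sanity check, the closed form can be compared against the sklearn fit from the Iris example above (a minimal sketch; adding an explicit column of ones to stand in for the intercept is an assumption of this check, not part of the derivation):
# design matrix with an explicit column of ones for the intercept
X_design = np.hstack([np.ones_like(X), X])
beta_hat = np.linalg.inv(X_design.T @ X_design) @ X_design.T @ y
print(beta_hat)                       # [intercept, slope]
print(model.intercept_, model.coef_)  # matches the sklearn fit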
Projections in $\mathbb{R}^N$¶
Let’s put the prediction task aside and just think about the underlying geometry of the OLS solution for a moment. (In the prediction setting, this only corresponds to the training/fitting process.)
The lecture has presented a special orthogonal projection of a vector $\mathbf{y}$:
$$\hat{\mathbf{y}} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}.$$
We will now dissect this projection to reveal the intuition behind it.
A Single Dimension¶
The full projection can be hard to comprehend since there are so many numbers involved, arranged into a varying number of rows and columns across different matrices and vectors.
To reduce complexity, let’s first have a look at the projection when $\mathbf{X}$ consists of a single column $\mathbf{x}_1$. This allows us to simplify the projection to
$$\hat{\mathbf{y}} = \mathbf{x}_1 \frac{\mathbf{x}_1^\top\mathbf{y}}{\mathbf{x}_1^\top\mathbf{x}_1}.$$
Seems kind of familiar, right? You have implemented the $\frac{\mathbf{x}_1^\top\mathbf{y}}{\mathbf{x}_1^\top\mathbf{x}_1}$ part of this formula in the PCA exercise as the orthogonal projection of a point onto a (potential) principal component! (There we avoided the division by the squared length of the projection vector by restricting that vector to length 1. The final multiplication by the projection vector was hidden in the supplied plotting code.)
Let’s visualize this projection for some example values.
Source
################################
# RUN, THEN COLLAPSE THIS CELL #
################################
import numpy as np
import matplotlib.pyplot as plt
def orthogonal_projection(x_1, y):
return x_1 * (np.dot(x_1, y) / (np.linalg.norm(x_1)**2))
def plot_vector(v, name, ax, shift=[0, 0], color="black"):
perpendicular = np.array([-v[1], v[0]]) / np.linalg.norm(v)
ax.annotate("", xytext=(shift[0], shift[1]), xy=(shift[0]+v[0], shift[1]+v[1]), arrowprops=dict(arrowstyle="->", color=color), zorder=30)
ax.text(shift[0]+(v[0]/2) + perpendicular[0]*0.02, shift[1]+(v[1]/2) + perpendicular[1]*0.02, name, color=color,
horizontalalignment="center", verticalalignment="center")
def plot_projection_2d(x_1, y, ax):
# get projection
y_proj = orthogonal_projection(x_1, y)
# show gridlines
ax.grid(zorder=0)
# show vectors
plot_vector(y, r"$\mathbf{y}$", ax, color="red")
plot_vector(y_proj, r"$\hat{\mathbf{y}}$", ax, shift=[0., 0.02], color="red")
plot_vector(y-y_proj, r"$\mathbf{e}$", ax, shift=y_proj, color="gray")
plot_vector(x_1, r"$\mathbf{x}_1$", ax, color="blue")
# set axis limits
all_content = np.zeros((2*x_1.shape[0] + 1, 2))
all_content[:x_1.shape[0]] = x_1
all_content[x_1.shape[0]:2*x_1.shape[0]] = y_proj
all_content[-1] = y
xlim = [all_content[:, 0].min(), all_content[:, 0].max()]
ylim = [all_content[:, 1].min(), all_content[:, 1].max()]
padding = min([0.5, 0.2*(xlim[1]-xlim[0]), 0.2*(ylim[1]-ylim[0])])
xlim[0] -= padding
xlim[1] += padding
ylim[0] -= padding
ylim[1] += padding
xlim[0] = min(xlim[0], ylim[0], 0)
xlim[1] = max(xlim[1], ylim[1])
ylim[0] = min(xlim[0], ylim[0], 0)
ylim[1] = max(xlim[1], ylim[1])
ax.set_xlim(xlim)
ax.set_ylim(ylim)
# define annotations
ax.set_title(r"Projection of $\mathbf{y}$ onto $\mathbf{x}_1$")
ax.set_xlabel("Observation 1")
ax.set_ylabel("Observation 2")x_1 = np.array([0.3, 0.8])
y = np.array([0.2, 0.2])
fig, ax = plt.subplots(1, 1, figsize=(4, 4))
plot_projection_2d(x_1, y, ax)
plt.show()
Reintroducing Complexity¶
Let’s have a look at how this setup changes when we add or remove an observation. The next plot shows what the observation space looks like with a third observation.
Source
import numpy as np
import matplotlib.pyplot as plt
from io import BytesIO
import ipywidgets as widgets
from IPython.display import display
def orthogonal_projection(X: np.ndarray, y: np.ndarray):
try:
return X @ np.linalg.inv(X.T @ X) @ X.T @ y
except np.linalg.LinAlgError as e:
print(repr(e))
return None
def interactive_vector_rotation(X, y, n_frames=90, elev=20, start_angle=0):
X = np.array(X)
y = np.array(y)
angles = np.linspace(0, 360, n_frames)
frames = []
y_proj = orthogonal_projection(X, y)
if y_proj is None:
return
if X.shape[1] == 2:
# plane grid in parametric form: p(s,t) = s*x1 + t*x2
span = 1
s = np.linspace(-span, span, 10)
t = np.linspace(-span, span, 10)
S, T = np.meshgrid(s, t)
P = np.outer(S, X[:, 0]) + np.outer(T, X[:, 1])
X_surf, Y_surf, Z_surf = P[:, 0].reshape(S.shape), P[:, 1].reshape(S.shape), P[:, 2].reshape(S.shape)
# Pre-render frames into memory
for angle in angles:
fig = plt.figure(figsize=(4, 4))
ax = fig.add_subplot(111, projection="3d")
if X.shape[1] == 2:
ax.plot_surface(X_surf, Y_surf, Z_surf, alpha=0.25, color="cyan", edgecolor="none", zorder=0)
lim = 1.2 * np.max(np.abs(np.vstack([*X.T, y])))
ax.set_xlim(-lim, lim)
ax.set_ylim(-lim, lim)
ax.set_zlim(-lim, lim)
for i in range(X.shape[1]):
ax.quiver(0, 0, 0, *X[:, i], color="b", arrow_length_ratio=0.15, linewidth=1)
ax.quiver(0, 0, 0, *y, color="r", arrow_length_ratio=0.15, linewidth=1)
ax.quiver(0, 0, 0, *y_proj, color="r", arrow_length_ratio=0.15, linewidth=1, linestyle="dashed")
ax.quiver(*y_proj, *(y-y_proj), color="gray", arrow_length_ratio=0.15, linewidth=1)
ax.view_init(elev=elev, azim=angle)
ax.set_box_aspect([1, 1, 1])
ax.set_axis_off()
# custom coordinate lines
for axis in np.eye(3):
ax.plot([0, axis[0]], [0, axis[1]], [0, axis[2]], color="black", lw=1)
ax.text(1.1, 0, 0, "o1")
ax.text(0, 1.1, 0, "o2")
ax.text(0, 0, 1.1, "o3")
plt.tight_layout()
buf = BytesIO()
plt.savefig(buf, format="png")
buf.seek(0)
frames.append(buf.read())
plt.close(fig)
# Interactive display
img_widget = widgets.Image(value=frames[(start_angle*n_frames)//360], format="png", width=600, height=600)
slider = widgets.IntSlider(min=0, max=360 - 1, step=1, value=start_angle, description="View angle")
def update(change):
img_widget.value = frames[(change["new"]*n_frames)//360]
slider.observe(update, names="value")
    display(widgets.VBox([slider, img_widget]))
interactive_vector_rotation(np.array([[0.3, 0.8, 0.5]]).T, [0.2, 0.2, 0.8], start_angle=280)
We add a second vector of observations $\mathbf{x}_2$ for a new independent variable. It is linearly independent from $\mathbf{x}_1$.
interactive_vector_rotation(np.array([[0.8, 0.3, 0.5], [0.3, 0.8, 0.5]]).T, [0.2, 0.2, 0.8], start_angle=280)
The next code cell tries to create the same plot for two different vectors $\mathbf{x}_1$ and $\mathbf{x}_2$, but fails.
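The failure comes from linear dependence: the second column $[1, 1, 1]^\top$ is just $2 \cdot [0.5, 0.5, 0.5]^\top$, so $\mathbf{X}^\top\mathbf{X}$ is singular and np.linalg.inv raises the LinAlgError that the plotting helper reports. A minimal check (reusing numpy as imported above):
X_dep = np.array([[0.5, 0.5, 0.5], [1, 1, 1]]).T
print(np.linalg.matrix_rank(X_dep))    # 1 instead of 2
print(np.linalg.det(X_dep.T @ X_dep))  # 0 (up to floating point)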
interactive_vector_rotation(np.array([[0.5, 0.5, 0.5], [1, 1, 1]]).T, [0.2, 0.2, 0.8], start_angle=280)
We add a third vector of observations $\mathbf{x}_3$ for a new independent variable. It is linearly independent from $\mathbf{x}_1$ and $\mathbf{x}_2$.
interactive_vector_rotation(np.array([[0.8, 0.3, 0.5], [0.3, 0.8, 0.5], [-0.3, -0.7, 0.5]]).T, [0.2, 0.2, 0.8], start_angle=280)
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning. Springer Series in Statistics. Springer New York. https://doi.org/10.1007/978-0-387-84858-7