06: Regularization and Linear Classification

Regularization

In last week’s exercise we saw that the closed-form solution for $\hat{\beta}$ fails when variables are perfectly correlated, because the inverse we are trying to compute does not exist. Let’s look at what happens when we have two variables that are not exactly linearly dependent, but very highly correlated.
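To see the core of the problem in isolation, here is a small standalone sketch (independent of the exercise code below): if one column of the design matrix is an exact copy of another, $\mathbf{X}^T\mathbf{X}$ loses rank and the inverse in the closed-form solution does not exist.

# Standalone sketch: a duplicated column makes X^T X singular, so the OLS inverse fails.
import numpy as np

x1 = np.linspace(-10, 10, 101)
X_perfect = np.column_stack((np.ones_like(x1), x1, x1))  # "x2" is an exact copy of x1

print("rank of X^T X:", np.linalg.matrix_rank(X_perfect.T @ X_perfect))  # 2 instead of 3
try:
    np.linalg.inv(X_perfect.T @ X_perfect)
except np.linalg.LinAlgError as err:
    print("inverse failed:", err)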

We first generate some data like in the lecture. This data has two independent variables $x_1, x_2$ and one dependent variable $y$. The independent variables are sampled on a straight line $x_1 = x_2$ plus a small amount of noise. The dependent variable is computed for these sampled points from a linear model plus a small amount of noise.
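Written out, the generating model used below is $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i$ with $\beta = (5, 1, 1)^T$ (see model1_beta in the code), $x_{i2} \approx x_{i1}$, and a small Gaussian noise term $\varepsilon_i$.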

# This cell contains all imports for the entire notebook.
# Take this as an overview of the libraries we use today.

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet, LogisticRegression
from sklearn.metrics import zero_one_loss
from statsmodels.regression.linear_model import OLS

import numpy as np
np.random.seed(0)

import matplotlib
import matplotlib.pyplot as plt

from io import BytesIO  # needed by fig_to_img() further below
from IPython.display import display, HTML
from ipywidgets import interactive
def linear_model(X, beta):
    return np.stack((np.ones(X.shape[0]), *(X.T)), axis=-1) @ beta


def generate_data(orig_beta, num_points):
    x_noise_std = 1e-3
    y_noise_std = 1e-2
    
    xs = np.stack((np.linspace(-10, 10, num_points), np.linspace(-10, 10, num_points)), axis=-1)
    xs += np.stack((np.random.normal(0, x_noise_std, num_points), np.random.normal(0, x_noise_std, num_points)), axis=-1)
    
    ys = linear_model(xs, orig_beta)
    ys += np.random.normal(0, y_noise_std, (num_points,1))

    scaler = StandardScaler()
    scaler.fit(xs)
    xs = scaler.transform(xs)
    
    return xs, ys

def plot_data_2d(xs, ys):
    fig, axs = plt.subplots(1, 2, figsize=(10, 4))
    
    scatter = axs[0].scatter(xs[:, 0], xs[:, 1], c=ys)
    plt.colorbar(scatter, ax=axs[0], label="$y$" )
    axs[0].set_xlabel("$x_1$")
    axs[0].set_ylabel("$x_2$")
    axs[0].set_title("$y$ vs. ($x_1$, $x_2$)")
    
    axs[1].scatter(xs[:, 1], ys)
    axs[1].set_xlabel("$x_1$")
    axs[1].set_ylabel("$y$")
    axs[1].set_title("$y$ vs. $x_1$")
    
    plt.tight_layout()
    plt.show()
# generate data
model1_beta = np.array([[5],  # beta0
                        [1],  # beta1
                        [1]]) # beta2
num_points = 101

xs, ys = generate_data(model1_beta, num_points)
plot_data_2d(xs, ys)

Do you have a solution to the task? Then let’s have a look at what linear regression has to say about it.

def plot_predictions_grid(model, title, fig, ax):
    grid_min = -10
    grid_max = 10
    grid_size = 20
    grid_vals = np.linspace(grid_min, grid_max, grid_size)
    xx, yy = np.meshgrid(grid_vals, grid_vals)
    grid_points = np.column_stack([xx.ravel(), yy.ravel()])
    
    preds = model.predict(grid_points)
    preds = np.reshape(preds, (grid_size, grid_size))
    
    im = ax.imshow(preds, extent=(grid_min, grid_max, grid_max, grid_min))
    ax.set_xlim((grid_min, grid_max))
    ax.set_ylim((grid_min, grid_max))
    ax.set_xlabel("$x_1$")
    ax.set_ylabel("$x_2$")
    ax.set_title(title)


def draw_colorbar(cmap, fig, ax):
    # dummy scalar mappable for the colorbar
    norm = matplotlib.colors.Normalize(vmin=0, vmax=1)
    sm = matplotlib.cm.ScalarMappable(cmap=cmap, norm=norm)
    sm.set_array([])
    
    # draw colorbar on the given axis
    cbar = fig.colorbar(sm, cax=ax)
    cbar.set_ticks([0, 1])
    cbar.set_ticklabels(["low", "high"])
    cbar.set_label("predictions")
fig = plt.figure(figsize=(10, 5))

for i in range(10):
    # resample data
    xs, ys = generate_data(model1_beta, num_points)
    
    # fit model
    ls_fit = LinearRegression().fit(xs, ys)
    
    # plot model predictions
    ax = fig.add_subplot(3, 5, i+1)
    plot_predictions_grid(ls_fit, f"run {i+1}", fig, ax)

# plot colorbar
ax_cbar = fig.add_subplot(3, 5, 11)
ax_cbar.set_title("colorbar")
draw_colorbar("viridis", fig, ax_cbar)

# finalize plot
plt.tight_layout()
plt.show()

As you can see, OLS regression is not very certain about that question either: the learned models vary wildly from one resampled dataset to the next.
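If you prefer numbers over pictures, a short sketch like the following (reusing generate_data and LinearRegression from above) prints the fitted coefficients for a few resampled datasets. Expect $\hat\beta_1$ and $\hat\beta_2$ to swing in opposite directions while their sum stays roughly constant, because only the direction $x_1 + x_2$ is well determined by the data.

# Sketch: print the fitted coefficients over a few resampled datasets to quantify the instability.
for run in range(5):
    xs_run, ys_run = generate_data(model1_beta, num_points)
    fit_run = LinearRegression().fit(xs_run, ys_run)
    b1, b2 = fit_run.coef_[0]
    print(f"run {run + 1}: beta1 = {b1:10.2f}, beta2 = {b2:10.2f}, beta1 + beta2 = {b1 + b2:.2f}")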

In the lecture, you have looked at some statistical tools for linear regression diagnostics. Your task will now be to use a diagnostic model summary to find the correlated variables.

To prepare the challenge, the next code cell adds a couple of completely uncorrelated variables with no connection to $y$. Don’t read that cell too closely, because the solution is easy to spot from the code.

# generate new, uncorrelated x data with 10 variables
num_variables = 10
xs_large = np.random.multivariate_normal(np.zeros(num_variables+1), np.eye(num_variables+1), num_points)
xs_large[:, 0] = np.ones((num_points,))  # add a constant to allow statsmodels to fit an intercept

# insert correlated variables into the uncorrelated data
xs_large[:, 3] = xs[:, 0]
xs_large[:, 6] = xs[:, 1]
# xs_large[:, _] = np.random.normal(0, 1, num_points)

# fit a model to the new data and show diagnostics
sm_model = OLS(ys, xs_large)
sm_ls_fit = sm_model.fit()
display(HTML(sm_ls_fit.summary(alpha=0.05).as_html()))
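Once you have read your answer off the summary, you can double-check it with a different diagnostic. The following hedged sketch computes variance inflation factors with statsmodels on the xs_large matrix defined above; variables with a VIF far above roughly 10 are largely explained by the other variables.

# Sketch: variance inflation factors as an alternative collinearity diagnostic.
from statsmodels.stats.outliers_influence import variance_inflation_factor

for j in range(1, xs_large.shape[1]):  # column 0 is the constant, skip it
    print(f"x_{j}: VIF = {variance_inflation_factor(xs_large, j):.1f}")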

Let’s go back to our two-dimensional data from the beginning and have a look at what the loss looks like when we continuously vary $\hat\beta$.

Run the following code cells up to the next tasks box. They will give you visualizations of the least squares loss on our 2d data as well as the Euclidean and Manhattan distance metrics. The last one will print out the least squares loss at some specific points in $(\hat\beta_1, \hat\beta_2)$ space.
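For reference, the loss functions defined in the next cell correspond to the following objectives (up to scaling factors; sklearn uses slightly different scaling conventions, as noted in a TODO comment further below):

$$L_{\text{OLS}}(\hat\beta) = \sum_i (y_i - \hat y_i)^2, \qquad L_{\text{Ridge}}(\hat\beta) = L_{\text{OLS}}(\hat\beta) + \alpha \lVert \hat\beta \rVert_2^2, \qquad L_{\text{Lasso}}(\hat\beta) = L_{\text{OLS}}(\hat\beta) + \alpha \lVert \hat\beta \rVert_1$$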

def plot_loss_surface(X, y, loss_func, loss_func_name, title, ax):
    beta0 = 5.0
    
    # Parameter grids
    xymin = -20
    xymax = 20
    beta1_vals = np.linspace(xymin, xymax, 200)
    beta2_vals = np.linspace(xymin, xymax, 200)
    
    # Allocate loss matrix
    loss_surface = np.zeros((len(beta2_vals), len(beta1_vals)))
    
    # Compute MSE over the grid
    for i, b1 in enumerate(beta1_vals):
        for j, b2 in enumerate(beta2_vals):
            beta_vec = np.array([[beta0], [b1], [b2]])
            y_pred = linear_model(X, beta_vec)
            loss_surface[j, i] = loss_func(y, y_pred, beta_vec)
    
    # Plot
    im = ax.imshow(loss_surface, origin='lower',
               extent=[beta1_vals[0], beta1_vals[-1],
                       beta2_vals[0], beta2_vals[-1]],
               aspect='equal')
    plt.colorbar(im, ax=ax, label=loss_func_name)
    ax.contour(beta1_vals, beta2_vals, loss_surface, levels=np.linspace(loss_surface.min()+1, loss_surface.max(), 10), colors="white")
    ax.axhline(0, beta1_vals[0], beta1_vals[-1], color="red")
    ax.axvline(0, beta2_vals[0], beta2_vals[-1], color="red")
    
    ax.set_xlabel(r'$\hat\beta_1$')
    ax.set_ylabel(r'$\hat\beta_2$')
    ax.set_title(title)


def least_squares_loss(y, y_pred, beta=None):
    return np.sum((y - y_pred)**2)


def ridge_term(y, y_pred, beta, alpha):
    return alpha*np.sum(np.square(beta))


def ridge_loss(y, y_pred, beta, alpha):
    return least_squares_loss(y, y_pred) + ridge_term(y, y_pred, beta, alpha)


def lasso_term(y, y_pred, beta, alpha):
    return alpha*np.sum(np.abs(beta))


def lasso_loss(y, y_pred, beta, alpha):
    return least_squares_loss(y, y_pred) + lasso_term(y, y_pred, beta, alpha)
fig = plt.figure(figsize=(15, 4))
ax1 = fig.add_subplot(131)
ax2 = fig.add_subplot(132)
ax3 = fig.add_subplot(133)

plot_loss_surface(xs, ys, least_squares_loss, "Least Squares Loss", "Least Squares Loss Surface", ax1)
plot_loss_surface(xs, ys, lambda y, y_pred, beta: ridge_term(y, y_pred, beta, 1), "Distance", "Euclidean Distance to $(0,0)$", ax2)
plot_loss_surface(xs, ys, lambda y, y_pred, beta: lasso_term(y, y_pred, beta, 1), "Distance", "Manhattan Distance to $(0,0)$", ax3)

plt.tight_layout()
plt.show()
beta_vec1 = np.array([[5.0], [12.0], [0.0]])
y_pred1 = linear_model(xs, beta_vec1)
print("Least Squares loss at (12, 0):", least_squares_loss(ys, y_pred1, beta_vec1))

beta_vec2 = np.array([[5.0], [0.0], [12.0]])
y_pred2 = linear_model(xs, beta_vec2)
print("Least Squares loss at (0, 12):", least_squares_loss(ys, y_pred2, beta_vec2))

Make sure to have a close look at the point losses printed by the last code cell for answering tasks 1-3!

The following code cell gives you the computed solutions to tasks 1-3, once you know how to match the tasks to the regularization methods.

The code cell after that shows the predictions of the Ridge and Lasso models in $(x_1, x_2)$ space. Note that these are far more stable than the OLS models you saw at the beginning.

# fit models
ls_fit = LinearRegression().fit(xs, ys)
ls_beta = np.array([ls_fit.intercept_[0], *ls_fit.coef_[0]])
ridge_fit = Ridge(alpha=1).fit(xs, ys)
ridge_beta = np.array([ridge_fit.intercept_[0], *ridge_fit.coef_[0]])
lasso_fit = Lasso(alpha=1).fit(xs, ys)
lasso_beta = np.array([lasso_fit.intercept_[0], *lasso_fit.coef_])

print("Least Squares:", ls_beta)
print("Ridge:", ridge_beta)
print("Lasso:", lasso_beta)
# plot model predictions
fig = plt.figure(figsize=(10, 2.5))

ax2 = fig.add_subplot(131)
plot_predictions_grid(ridge_fit, "Ridge", fig, ax2)
ax3 = fig.add_subplot(132)
plot_predictions_grid(lasso_fit, "Lasso", fig, ax3)

# plot colorbar
ax_cbar = fig.add_subplot(133)
ax_cbar.set_title("colorbar")
draw_colorbar("viridis", fig, ax_cbar)

plt.tight_layout()
plt.show()

The next two code cells give you an interactive view of two $x$ variables, the loss surface, and the parameter path for different regularizations. Feel free to play around with the sliders and see how everything reacts!

For example, you can use the x1_x2_angle slider to control the angle between the variables in observation space. This controls correlation.
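The connection between angle and correlation is easy to verify: for centered variables, the sample correlation equals the cosine of the angle between them in observation space. Here is a small standalone sketch (independent of the widget code) that checks this numerically.

# Sketch: for centered variables, correlation equals the cosine of the angle in observation space.
rng = np.random.default_rng(0)

e1 = rng.normal(size=50)
e1 -= e1.mean()
e1 /= np.linalg.norm(e1)

e2 = rng.normal(size=50)
e2 -= e2.mean()
e2 -= (e2 @ e1) * e1          # make e2 orthogonal to e1 (both stay zero-mean)
e2 /= np.linalg.norm(e2)

for angle_deg in (10, 45, 80):
    theta = np.deg2rad(angle_deg)
    x2 = np.cos(theta) * e1 + np.sin(theta) * e2
    print(f"{angle_deg:2d} deg: cos(angle) = {np.cos(theta):.3f}, corr = {np.corrcoef(e1, x2)[0, 1]:.3f}")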

def orthogonal_projection(X, y):
    try:
        return X @ np.linalg.inv(X.T @ X) @ X.T @ y
    except np.linalg.LinAlgError as e:
        print(repr(e))
        return None


def fig_to_img(fig):
    plt.tight_layout()

    buf = BytesIO()
    plt.savefig(buf, format="png")
    buf.seek(0)
    frame = buf.read()
    plt.close(fig)
    return frame


# only the 3d vector projection plot without interactivity
def render3d(X, y, y_proj, X_surf, Y_surf, Z_surf, elev, angle, ax):
    if X.shape[1] == 2:
        ax.plot_surface(X_surf, Y_surf, Z_surf, alpha=0.25, color="cyan", edgecolor="none", zorder=0)

    lim = 1.2 * np.max(np.abs(np.vstack([*X.T, y])))
    ax.set_xlim(-lim, lim)
    ax.set_ylim(-lim, lim)
    ax.set_zlim(-lim, lim)

    for i in range(X.shape[1]):
        ax.quiver(0, 0, 0, *X[:, i], color="b", arrow_length_ratio=0.15, linewidth=1)
    ax.quiver(0, 0, 0, *y, color="r", arrow_length_ratio=0.15, linewidth=1)
    ax.quiver(0, 0, 0, *y_proj, color="r", arrow_length_ratio=0.15, linewidth=1, linestyle="dashed")
    ax.quiver(*y_proj, *(y-y_proj), color="gray", arrow_length_ratio=0.15, linewidth=1)
    ax.view_init(elev=elev, azim=angle)
    ax.set_box_aspect([1, 1, 1])

    ax.text(X[0, 0], X[1, 0], X[2, 0], "x1")
    if X.shape[1] > 1:
        ax.text(X[0, 1], X[1, 1], X[2, 1], "x2")

    ax.set_axis_off()
    # custom coordinate lines
    for axis in np.eye(3):
        ax.plot([0, axis[0]], [0, axis[1]], [0, axis[2]], color="black", lw=1)
    ax.text(1.1, 0, 0, "o1")
    ax.text(0, 1.1, 0, "o2")
    ax.text(0, 0, 1.1, "o3")
    
    ax.set_title("Variables in Observation Space")


# only the 2d regression line plot without interactivity
def plot_colspace_preds_1d(X, y, ax):
    X = np.array(X)
    y = np.array(y)

    # set axes limits
    xlim = (-1, 1)
    ylim = (-1, 1)
    ax.set_xlim(xlim)
    ax.set_ylim(ylim)
    ax.set_aspect("equal")

    # show gridlines
    ax.grid()

    # add lines for axes
    ax.axvline(0, xlim[0], xlim[1], color="black", zorder=0)
    ax.axhline(0, ylim[0], ylim[1], color="black", zorder=0)

    ax.scatter(X[:, 0], y)
    
    beta_intermediate = X.T @ y
    beta = np.linalg.inv(X.T @ X) @ X.T @ y

    print(f"X^Ty = {beta_intermediate}")
    print(f"(X^TX)^-1X^Ty = {beta}")

    xs_samples = np.zeros_like(X[:2])
    xs_samples[0, 0] = -1
    xs_samples[1, 0] = 1
    ys_samples1 = colspace_to_rowspace(xs_samples, beta_intermediate)
    ax.plot(xs_samples[:, 0], ys_samples1, label=r"$\hat{\beta}=\mathbf{X}^T\mathbf{y}$")

    ys_samples2 = colspace_to_rowspace(xs_samples, beta)
    ax.plot(xs_samples[:, 0], ys_samples2, label=r"$\hat{\beta}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$")

    ax.legend()

    # define annotations
    ax.set_title("Predictions")
    ax.set_xlabel(r"$\mathbf{x}_1$")
    ax.set_ylabel(r"$\hat{\mathbf{y}}$")


def get_regularized_model(fit_intercept, regularization_lambda, regularization_alpha, regularization):
    model = None
    
    if regularization == "Lasso":
        model = Lasso(fit_intercept=fit_intercept, alpha=regularization_lambda)
    elif regularization == "Ridge":
        model = Ridge(fit_intercept=fit_intercept, alpha=regularization_lambda)
    elif regularization == "ElasticNet":
        model = ElasticNet(fit_intercept=fit_intercept, alpha=regularization_lambda, l1_ratio=1-regularization_alpha)

    return model


def lasso_terms(n_samples, regularization_lambda, betas):
    return 2 * n_samples * regularization_lambda * np.sum(np.abs(betas), axis=0)
    

def ridge_terms(regularization_lambda, betas):
    return regularization_lambda * np.sum(np.square(betas), axis=0)


def elasticnet_terms(n_samples, regularization_lambda, regularization_alpha, betas):
    return regularization_lambda * (n_samples * regularization_alpha * ridge_terms(1, betas) + (1 - regularization_alpha) * lasso_terms(n_samples, 1, betas))


# TODO: clean up loss functions to match sklearn (move n_samples factor for Lasso and ElasticNet to match sklearn losses)
# currently they have the same minimum but different scaling than what sklearn uses
def get_regularization_terms(n_samples, regularization_lambda, regularization_alpha, betas, regularization):
    terms = None

    if regularization == "Lasso":
        terms = lasso_terms(n_samples, regularization_lambda, betas)
    elif regularization == "Ridge":
        terms = ridge_terms(regularization_lambda, betas)
    elif regularization == "ElasticNet":
        terms = elasticnet_terms(n_samples, regularization_lambda, regularization_alpha, betas)

    return terms


def plot_loss_2d(X, y, fit_intercept, regularization_lambda, regularization_alpha, regularization, resolution, print_values, fig, ax):
    X = np.array(X)
    y = np.array(y)

    # set axes limits
    axlim = (-2, 2)
    ax.set_xlim(axlim)
    ax.set_ylim(axlim)
    ax.set_aspect("equal")
    
    # add lines for axes
    ax.axvline(0, axlim[0], axlim[1], color="black", zorder=0)
    ax.axhline(0, axlim[0], axlim[1], color="black", zorder=0)

    coords = np.linspace(axlim[0], axlim[1], resolution)
    pts_x, pts_y = np.meshgrid(coords, coords)
    betas = np.array([np.ravel(pts_x), np.ravel(pts_y)])

    losses = np.sum((np.repeat(np.array([y]).T, resolution**2, axis=1) - (X @ betas))**2, axis=0)
    losses = np.reshape(losses, (resolution, resolution))
    im = ax.imshow(losses, extent=(axlim[0], axlim[1], axlim[1], axlim[0]))
    fig.colorbar(im, ax=ax, label="RSS")
    ax.contour(pts_x, pts_y, losses, levels=np.linspace(losses.min(), losses.max(), 10), colors="white")
    
    ax.axhline(0, axlim[0], axlim[1], color="white")
    ax.axvline(0, axlim[0], axlim[1], color="white")

    # opt_idx = np.unravel_index(np.argmin(losses), losses.shape)
    # opt = (pts_x[opt_idx], pts_y[opt_idx])
    model = LinearRegression(fit_intercept=fit_intercept)
    model = model.fit(X, y)
    opt = model.coef_
    if print_values:
        print(f"OLS optimum = [{opt[0]:.2f}, {opt[1]:.2f}]")
    ax.scatter(opt[0], opt[1], color="red", marker="*", label="OLS optimum", zorder=10)

    regularized_opts = np.zeros((100, 2))
    max_lambda = 1 if regularization == "Lasso" else 10
    for i, lambda_i in enumerate(np.linspace(0, max_lambda, 100)):
        if i == 0:
            regularized_opts[i] = model.coef_
            continue
        model_i = get_regularized_model(fit_intercept, lambda_i, regularization_alpha, regularization)
        model_i = model_i.fit(X, y)
        regularized_opts[i] = model_i.coef_
    ax.plot(regularized_opts[:, 0], regularized_opts[:, 1])

    model = get_regularized_model(fit_intercept, regularization_lambda, regularization_alpha, regularization)
    model = model.fit(X, y)
    regularized_opt = model.coef_
    if print_values:
        print(f"regularized optimum = [{regularized_opt[0]:.2f}, {regularized_opt[1]:.2f}]")
    ax.scatter(regularized_opt[0], regularized_opt[1], color="red", label=f"{regularization} optimum", zorder=10)

    regularization_terms_func = lambda x: get_regularization_terms(len(X), regularization_lambda, regularization_alpha, x, regularization)
    regularization_terms = regularization_terms_func(betas)
    regularization_terms = np.reshape(regularization_terms, (resolution, resolution))
    regularization_losses = losses + regularization_terms
    
    ax.contour(pts_x, pts_y, losses, levels=[np.sum((y - (X @ regularized_opt))**2)], colors="red")
    ax.contour(pts_x, pts_y, regularization_terms, levels=[regularization_terms_func(regularized_opt)], colors="red")

    ax.legend()
    ax.set_title("Loss landscape")
    ax.set_xlabel(r"$\beta_1$")
    ax.set_ylabel(r"$\beta_2$")


def plot_loss_on_optima_path(X, y, fit_intercept, regularization_lambda, regularization_alpha, regularization, ax):
    lambdas_resolution = 100
    losses = np.zeros((lambdas_resolution,))
    max_lambda = 1 if regularization == "Lasso" else 10
    lambdas = np.linspace(0, max_lambda, lambdas_resolution+1)[1:]

    for i, lambda_i in enumerate(lambdas):
        model_i = get_regularized_model(fit_intercept, lambda_i, regularization_alpha, regularization)
        model_i = model_i.fit(X, y)

        # betas = [model_i.intercept_, *model_i.coef_]
        betas = model_i.coef_
        losses[i] = np.sum(np.square(y - model_i.predict(X))) + get_regularization_terms(len(X), regularization_lambda, regularization_alpha, betas, regularization)

    model_conf = get_regularized_model(fit_intercept, regularization_lambda, regularization_alpha, regularization)
    model_conf = model_conf.fit(X, y)
    
    # betas = [model_conf.intercept_, *model_conf.coef_]
    betas = model_conf.coef_
    loss_conf = np.sum(np.square(y - model_conf.predict(X))) + get_regularization_terms(len(X), regularization_lambda, regularization_alpha, betas, regularization)
    
    ax.plot(lambdas, losses)
    
    ax.scatter([0], [losses[0]], marker="*", color="red", label="OLS optimum", zorder=10)
    ax.scatter([regularization_lambda], [loss_conf], marker="o", color="red", label=f"{regularization} optimum", zorder=10)

    ax.legend()
    ax.set_ylim((0, np.max(losses)*1.2))
    ax.set_ylabel("Regularized loss")
    ax.set_xlabel(r"$\lambda$")
    ax.set_title("Regularized loss along regularization optima path")


def plot_betas_per_lambda(X, y, fit_intercept, regularization_lambda, regularization_alpha, regularization, ax):
    lambdas_resolution = 100
    betas = np.zeros((lambdas_resolution, 3))
    max_lambda = 1 if regularization == "Lasso" else 10
    lambdas = np.linspace(0, max_lambda, lambdas_resolution+1)[1:]

    for i, lambda_i in enumerate(lambdas):
        model_i = get_regularized_model(fit_intercept, lambda_i, regularization_alpha, regularization)
        model_i = model_i.fit(X, y)

        betas[i] = [model_i.intercept_, *model_i.coef_]
        # betas[i] = model_i.coef_

    for i in range(betas.shape[1]):
        ax.plot(lambdas, betas[:, i], label=r"$\beta_"+str(i)+r"$")

    ax.axvline(regularization_lambda, np.min(betas), np.max(betas), linestyle="dashed", color="red", zorder=-10)

    ax.legend()
    ax.set_ylabel("Learned value")
    ax.set_xlabel(r"$\lambda$")
    ax.set_title("Regularization influence on model parameters")


def _normalize(v):
    n = np.linalg.norm(v)
    if n == 0:
        raise ValueError("Zero vector cannot be normalized.")
    return v / n

def _axis_angle_rotate(v, axis, angle):
    axis = _normalize(axis)
    v_par = np.dot(v, axis) * axis
    v_perp = v - v_par
    w = np.cross(axis, v_perp)
    return v_par + v_perp * np.cos(angle) + w * np.sin(angle)

def generate_vectors(y, theta_hy, phi_h, s1, phi_out, theta_out, s2):
    """
    y: base vector
    theta_hy: angle between h and y
    phi_h: rotation of h around y
    s1: scale of h relative to ||y||
    phi_out: rotation of output pair around h
    theta_out: angle between each output vector and h
    s2: relative scale factor of the two outputs
    """
    y = np.asarray(y, dtype=float)
    yn = _normalize(y)

    # Step 1: construct h
    # pick any vector not parallel to y
    tmp = np.array([1.0, 0.0, 0.0])
    if np.allclose(np.cross(yn, tmp), 0):
        tmp = np.array([0.0, 1.0, 0.0])

    # perpendicular direction
    b = _normalize(np.cross(yn, tmp))

    # initial h direction with angle theta_hy
    h_dir = np.cos(theta_hy) * yn + np.sin(theta_hy) * b

    # spin around y by phi_h
    h_dir = _axis_angle_rotate(h_dir, yn, phi_h)
    h = h_dir * (s1 * np.linalg.norm(y))

    # Step 2: construct basis around h
    hn = _normalize(h)
    # pick perpendicular vector
    tmp2 = np.array([1.0, 0.0, 0.0])
    if np.allclose(np.cross(hn, tmp2), 0):
        tmp2 = np.array([0.0, 1.0, 0.0])
    p = _normalize(np.cross(hn, tmp2))   # perpendicular to h
    q = np.cross(hn, p)                  # completes right-handed basis

    # Step 3: construct raw u and v directions at angle theta_out from h
    # before rotating around h
    u_dir = np.cos(theta_out) * hn + np.sin(theta_out) * p
    v_dir = np.cos(theta_out) * hn - np.sin(theta_out) * p  # opposite side

    # Step 4: spin both around h by phi_out
    u_dir = _axis_angle_rotate(u_dir, hn, phi_out)
    v_dir = _axis_angle_rotate(v_dir, hn, phi_out)

    # Step 5: apply length scalings
    h_norm = np.linalg.norm(h)
    u = u_dir * ((s2 if s2 <= 1.0 else (1.0 / (2-s2))) * h_norm)
    v = v_dir * (((1.0/s2) if s2 <= 1.0 else (2-s2)) * h_norm)

    return h, u, v


def rotate_vectors_about_plane_normal(v1, v2, angle):
    # Normal of the plane spanned by v1 and v2
    n = np.cross(v1, v2)
    n_norm = np.linalg.norm(n)
    if n_norm == 0:
        raise ValueError("Input vectors are collinear; plane normal undefined.")
    k = n / n_norm  # unit normal

    # Rodrigues' rotation matrix
    K = np.array([[0, -k[2], k[1]],
                  [k[2], 0, -k[0]],
                  [-k[1], k[0], 0]])
    I = np.eye(3)
    R = I + np.sin(angle) * K + (1 - np.cos(angle)) * (K @ K)

    return R @ v1, R @ v2


# interactive 3d vector projection, loss surface and parameter trajectories
def interactive_vector_rotation(y, x_labels=None, n_frames=90, elev=20, proj_func=orthogonal_projection):
    if x_labels is None:
        x_labels = [""]*2
    
    def plot(viewangle=315., x1_x2_yangle=180., x1_x2_angle=45., x1_x2_relative_scale=1., centering=False, unit_variance=False, fit_intercept=False, lambda_regularization=0.5, alpha_regularization=0.5, regularization="Lasso", print_values=False):
        theta_hy = np.deg2rad(30.)
        phi_h = np.deg2rad(135.)
        s1=1
        phi_out = np.deg2rad(165.)
        x1_x2_angle = np.deg2rad(x1_x2_angle-180.)
        h, u, v = generate_vectors(y, theta_hy, phi_h, s1, phi_out, x1_x2_angle, x1_x2_relative_scale)
        ur, vr = rotate_vectors_about_plane_normal(u, v, np.deg2rad(x1_x2_yangle))
        X = np.stack([ur, vr], axis=1)
        resolution = 100
        
        if regularization != "Lasso":
            lambda_regularization *= 10

        if print_values:
            print(f"X = \n{X}")
            print(f"y = {y}")

        scaler = StandardScaler(with_mean=centering, with_std=unit_variance)
        scaler.fit(X)
        X_scaled = scaler.transform(X)
        X = X_scaled

        if print_values:
            print(f"Preprocessed X = \n{X}")
            print(f"beta = {np.linalg.inv(X.T @ X) @ X.T @ y}")

        ####################################
        ############## PLANE ###############
        X_surf = None
        Y_surf = None
        Z_surf = None
        if X.shape[1] == 2:
            # plane grid in parametric form: p(s,t) = s*x1 + t*x2
            span = 1
            s = np.linspace(-span, span, 10)
            t = np.linspace(-span, span, 10)
            S, T = np.meshgrid(s, t)
            P = np.outer(S, X[:, 0]) + np.outer(T, X[:, 1])
            X_surf, Y_surf, Z_surf = P[:, 0].reshape(S.shape), P[:, 1].reshape(S.shape), P[:, 2].reshape(S.shape)
        ####################################
        
        y_proj = proj_func(X, y)
        if print_values:
            print(f"y_proj = {y_proj}")

        fig1 = plt.figure(figsize=(25, 6))
        ax1 = fig1.add_subplot(141, projection="3d")
        render3d(X, y, y_proj, X_surf, Y_surf, Z_surf, elev, viewangle, ax1)

        ax2 = fig1.add_subplot(142)
        # plot_colspace_preds_1d(X, y, ax2)
        plot_loss_2d(X, y, fit_intercept, lambda_regularization, alpha_regularization, regularization, resolution, print_values, fig1, ax2)

        ax3 = fig1.add_subplot(143)
        plot_loss_on_optima_path(X, y, fit_intercept, lambda_regularization, alpha_regularization, regularization, ax3)
        
        ax4 = fig1.add_subplot(144)
        plot_betas_per_lambda(X, y, fit_intercept, lambda_regularization, alpha_regularization, regularization, ax4)
        
        plt.show()
    
    interactive_plot = interactive(plot,
                                   viewangle=(0., 359., 5.),
                                   x1_x2_yangle=(0., 365., 1.),
                                   x1_x2_angle=(1., 89., 1.),
                                   x1_x2_relative_scale=(0.01, 1.99, 0.01),
                                   centering=False,
                                   unit_variance=False,
                                   fit_intercept=False,
                                   lambda_regularization=(0.01, 1., 0.01),
                                   alpha_regularization=(0.01, 0.99, 0.01),
                                   regularization=["Lasso", "Ridge", "ElasticNet"],
                                   print_values=False)
    output = interactive_plot.children[-1]
    output.layout.height = '600px'
    return interactive_plot
ys_interactive = np.array([0.4, 0.4, 1.6])
interactive_vector_rotation(ys_interactive)

Linear Classification

Skip the following two code cells. They just define the plotting helpers used later and generate the data.

def plot_ground_truth(X, Y):
    fig = plt.figure(figsize=(12, 6))

    ax2 = fig.add_subplot(1, 2, 1)
    ax2.scatter(X[Y == 1][:, 0], X[Y == 1][:, 1], s=10, c="Red")
    ax2.scatter(X[Y == 0][:, 0], X[Y == 0][:, 1], s=10, c="Blue")
    # plt.plot([-1, 2], [-1, 2], c="Gray")
    ax2.set_xlabel("x1")
    ax2.set_ylabel("x2")
    ax2.set_xlim(-0.25, 1.25)
    ax2.set_ylim(-0.25, 1.25)
    ax2.set_aspect("equal")
    ax2.set_title("Noisy Points with Classes - 2D View")
    ax2.legend(["Noisy Points of Class 1", "Noisy Points of Class 0"])

    ax1 = fig.add_subplot(1, 2, 2, projection="3d", computed_zorder=False)
    ax1.scatter(X[Y == 1][:, 0], X[Y == 1][:, 1], 1, s=10, c="Red", zorder=1)
    ax1.scatter(X[Y == 0][:, 0], X[Y == 0][:, 1], 0, s=10, c="Blue", zorder=-1)
    ax1.set_xlabel("x1")
    ax1.set_ylabel("x2")
    ax1.set_zlabel("y")
    ax1.set_xlim(-0.25, 1.25)
    ax1.set_ylim(-0.25, 1.25)
    ax1.set_aspect("equal")
    ax1.set_title("Noisy Points with Classes - 3D View")
    ax1.legend(["Noisy Points of Class 1", "Noisy Points of Class 0"])
    ax1.view_init(elev=20, azim=-115)

    plt.show()


def plot_linear_model_preds(X, Y_pred_in, xlim_left=-0.25, xlim_right=1.25, ylim_bottom=-0.25, ylim_top=1.25):
    plane_x1 = [xlim_left, xlim_right]
    plane_x2 = [ylim_bottom, ylim_top]
    plane_x1, plane_x2 = np.meshgrid(plane_x1, plane_x2)
    plane_y = 0.5

    fig = plt.figure(figsize=(18, 6))

    ax2 = fig.add_subplot(1, 3, 1)
    ax2.scatter(X[:, 0], X[:, 1], s=10, c=Y_pred_in)
    ax2.set_xlabel("x1")
    ax2.set_ylabel("x2")
    ax2.set_xlim(xlim_left, xlim_right)
    ax2.set_ylim(ylim_bottom, ylim_top)
    ax2.set_aspect("equal")
    ax2.set_title("Points with Regression Prediction - 2D View")

    ax1 = fig.add_subplot(1, 3, 2, projection="3d", computed_zorder=False)
    ax1.scatter(X[Y_pred_in > plane_y][:, 0], X[Y_pred_in > plane_y][:, 1], Y_pred_in[Y_pred_in > plane_y], s=10, c=Y_pred_in[Y_pred_in > plane_y], zorder=1, vmin=np.min(Y_pred_in))
    ax1.scatter(X[Y_pred_in <= plane_y][:, 0], X[Y_pred_in <= plane_y][:, 1], Y_pred_in[Y_pred_in <= plane_y], s=10, c=Y_pred_in[Y_pred_in <= plane_y], zorder=-1, vmax=np.max(Y_pred_in))
    ax1.plot_surface(plane_x1, plane_x2, np.zeros_like(plane_x1) + plane_y, color=[0.5, 0.5, 0.5], alpha=0.75, zorder=0)
    ax1.set_xlabel("x1")
    ax1.set_ylabel("x2")
    ax1.set_zlabel("y-hat")
    ax1.set_xlim(xlim_left, xlim_right)
    ax1.set_ylim(ylim_bottom, ylim_top)
    ax1.set_aspect("equal")
    ax1.set_title("Points with Regression Prediction - 3D View")
    ax1.view_init(elev=20, azim=-115)

    ax = fig.add_subplot(1, 3, 3)
    ax.scatter(X[Y_pred_in >= plane_y][:, 0], X[Y_pred_in >= plane_y][:, 1], s=10, c="Red")
    ax.scatter(X[Y_pred_in < plane_y][:, 0], X[Y_pred_in < plane_y][:, 1], s=10, c="Blue")
    ax.set_xlabel("x1")
    ax.set_ylabel("x2")
    ax.set_xlim(xlim_left, xlim_right)
    ax.set_ylim(ylim_bottom, ylim_top)
    ax.set_aspect("equal")
    ax.set_title("Points with Classification Prediction - 2D View")
    ax.legend(["Points Predicted in Class 1", "Points Predicted in Class 0"])

    plt.show()


def plot_logistic_model_preds(X, Y_pred_in, xlim_left=-0.25, xlim_right=1.25, ylim_bottom=-0.25, ylim_top=1.25):
    plane_x1 = [xlim_left, xlim_right]
    plane_x2 = [ylim_bottom, ylim_top]
    plane_x1, plane_x2 = np.meshgrid(plane_x1, plane_x2)
    plane_y = 0.5

    fig = plt.figure(figsize=(18, 6))

    ax2 = fig.add_subplot(1, 3, 1)
    ax2.scatter(X[:, 0], X[:, 1], s=10, c=Y_pred_in)
    ax2.set_xlabel("x1")
    ax2.set_ylabel("x2")
    ax2.set_xlim(xlim_left, xlim_right)
    ax2.set_ylim(ylim_bottom, ylim_top)
    ax2.set_aspect("equal")
    ax2.set_title("Points with Regression Prediction - 2D View")

    ax1 = fig.add_subplot(1, 3, 2, projection="3d", computed_zorder=False)
    ax1.scatter(X[Y_pred_in > plane_y][:, 0], X[Y_pred_in > plane_y][:, 1], Y_pred_in[Y_pred_in > plane_y], s=10, c=Y_pred_in[Y_pred_in > plane_y], zorder=1, vmin=np.min(Y_pred_in))
    ax1.scatter(X[Y_pred_in <= plane_y][:, 0], X[Y_pred_in <= plane_y][:, 1], Y_pred_in[Y_pred_in <= plane_y], s=10, c=Y_pred_in[Y_pred_in <= plane_y], zorder=-1, vmax=np.max(Y_pred_in))
    ax1.plot_surface(plane_x1, plane_x2, np.zeros_like(plane_x1) + plane_y, color=[0.5, 0.5, 0.5], alpha=0.75, zorder=0)
    ax1.set_xlabel("x1")
    ax1.set_ylabel("x2")
    ax1.set_zlabel("y-hat")
    ax1.set_xlim(xlim_left, xlim_right)
    ax1.set_ylim(ylim_bottom, ylim_top)
    ax1.set_aspect("equal")
    ax1.set_title("Points with Regression Prediction - 3D View")
    ax1.view_init(elev=20, azim=-115)

    ax = fig.add_subplot(1, 3, 3)
    ax.scatter(X[Y_pred_in >= plane_y][:, 0], X[Y_pred_in >= plane_y][:, 1], s=10, c="Red")
    ax.scatter(X[Y_pred_in < plane_y][:, 0], X[Y_pred_in < plane_y][:, 1], s=10, c="Blue")
    ax.set_xlabel("x1")
    ax.set_ylabel("x2")
    ax.set_xlim(xlim_left, xlim_right)
    ax.set_ylim(ylim_bottom, ylim_top)
    ax.set_aspect("equal")
    ax.set_title("Points with Regression Prediction - 2D View")
    ax.legend(["Points Predicted in Class 1", "Points Predicted in Class 0"])

    plt.show()


def are_linear_predictions_equal(X, model, custom_func):
    model_Y_pred = model.predict(X)
    custom_Y_pred = np.apply_along_axis(custom_func, 1, X)

    epsilon = 1e-6
    if np.all(np.abs(custom_Y_pred - model_Y_pred) < epsilon):
        print("You got it right!")
    else:
        print("Not quite there yet. Try again!")


def are_logistic_predictions_equal(X, model, custom_func):
    model_Y_pred = model.predict_proba(X)[:, 1]
    custom_Y_pred = np.apply_along_axis(custom_func, 1, X)

    epsilon = 1e-6
    if np.all(np.abs(custom_Y_pred - model_Y_pred) < epsilon):
        print("You got it right!")
    else:
        print("Not quite there yet. Try again!")


def model_to_str(m):
    return f"y_pred = {m.intercept_:.4f}" + "".join([f" + {coef:.4f}*x_{i}" for i, coef in enumerate(m.coef_)])
beta = [0, -1, 1]
X_raw = np.random.uniform(0, 1, (1000, 2))
X = X_raw + np.random.normal(0, 0.1, (1000, 2))

Y_soft = np.array([np.dot(beta, [1, x1, x2]) for x1, x2 in X_raw])
Y = np.array([0 if y_soft < 0 else 1 for y_soft in Y_soft])

Let’s look at a dataset with two independent variables $x_{i1}$ and $x_{i2}$. The dependent variable $y_i$ takes on either the value 0 or 1 (i.e. $y_i \in \{0,1\}$).

The plot on the left side highlights the categorical nature of this data, ready to be classified. The 3D plot shows that the points can also be represented three-dimensionally, with the values of $y_i$ as the third coordinate.

plot_ground_truth(X, Y)

Let’s try this idea using sklearn.linear_model.LinearRegression. This model is fitted according to the ordinary least squares solution of the linear regression model on the values of $x_{i1}$ and $x_{i2}$.

You can see the prediction results of the resulting model on the training data in the first plot of the code cell after the next. The middle plot shows the points in 3D space, where the height of the points is taken from the regression prediction $\hat{y}$. You should be able to see that these points lie on the regression plane that you have described in the task above.

Side note:

This is similar to the actual implementation of sklearn.linear_model.RidgeClassifier, about which its documentation says:

This classifier first converts the target values into {-1, 1} and then treats the problem as a regression task (multi-output regression in the multiclass case).
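As a hedged sketch of what that quote means (the alpha value is arbitrary here), you can reproduce RidgeClassifier’s decisions with a plain Ridge regression on targets recoded to -1/+1 followed by a sign threshold; the two sets of predictions should agree except for points lying exactly on the boundary.

# Sketch: RidgeClassifier behaves like Ridge regression on {-1, +1} targets plus a sign threshold.
from sklearn.linear_model import RidgeClassifier

Y_pm = 2 * Y - 1                                           # recode {0, 1} -> {-1, +1}

ridge_clf = RidgeClassifier(alpha=1.0).fit(X, Y)
ridge_reg = Ridge(alpha=1.0).fit(X, Y_pm)

manual_classes = (ridge_reg.predict(X) > 0).astype(int)    # sign threshold back to {0, 1}
print("predictions agree:", np.all(manual_classes == ridge_clf.predict(X)))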

model = LinearRegression()
model.fit(X, Y)

Y_pred = model.predict(X)

print("0-1 loss:", zero_one_loss(Y, Y_pred > 0.5))
plot_linear_model_preds(X, Y_pred)
def linear_regressor(x):
    beta0 = model.intercept_
    beta1_n = model.coef_
    beta_vec = np.array([beta0, *beta1_n])

    # TODO: implement the linear regression model function for the parameters beta
    return None


# checking and plotting the solution
custom_Y_pred = np.apply_along_axis(linear_regressor, 1, X)

if not all(custom_Y_pred == None):
    plot_linear_model_preds(X, custom_Y_pred)
    are_linear_predictions_equal(X, model, linear_regressor)

... and that is how you can use a linear regression model for classification!

In the lecture you have heard about a problem with this approach when there is an outlier in the training data. Have a look at that case in the next three code blocks.

This is what happens here on a high level:

  1. Adding a single outlier point to the data.

  2. Fitting a new linear regression model to that data.

  3. Plotting the prediction results of that new model.

X_outlier = np.concatenate([X, [[10, 0]]])
Y_outlier = np.concatenate([Y, [1]])
model = LinearRegression()
model.fit(X_outlier, Y_outlier)

Y_pred = model.predict(X_outlier)

print("0-1 loss:", zero_one_loss(Y_outlier, Y_pred > 0.5))
plot_linear_model_preds(X_outlier, Y_pred, xlim_left=-1, xlim_right=11, ylim_top=2, ylim_bottom=-1)

In the plot on the right, we can clearly see that the decision boundary is now tilted towards the right and that the 0-1 loss is larger than before.
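You can also read the tilt off the fitted coefficients directly. The following hedged sketch (using the model fitted on the outlier data above) rearranges the threshold condition $\hat y = 0.5$ into an explicit equation for the decision boundary in the $(x_1, x_2)$ plane; running the same lines for the model without the outlier shows how much the slope changes.

# Sketch: the decision boundary of the regression-based classifier is the line where
# the prediction crosses the 0.5 threshold, i.e. beta0 + beta1*x1 + beta2*x2 = 0.5.
b0 = model.intercept_
b1, b2 = model.coef_
print(f"decision boundary: x2 = {(0.5 - b0) / b2:.3f} {-b1 / b2:+.3f} * x1")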

In the lecture it was claimed that an outlier does not have such a large effect on a logistic regression model. See for yourself in the following two code blocks, where the model is trained and the results are plotted again.

model = LogisticRegression(penalty=None)
model.fit(X_outlier, Y_outlier)

Y_pred = model.predict_proba(X_outlier)[:, 1]

print("0-1 loss:", zero_one_loss(Y_outlier, Y_pred > 0.5))
plot_logistic_model_preds(X_outlier, Y_pred, xlim_left=-1, xlim_right=11, ylim_top=2, ylim_bottom=-1)

This is caused by the different loss function that logistic regression optimizes. You are encouraged to think about how this loss changes the influence of individual points on the fitting process. The basic idea is that badly mispredicted outliers are not penalized as heavily as under the sum of squared errors used in the usual linear regression setting. (Take it as a bonus task to compare the influence of an outlier on the logistic regression loss vs. the linear regression loss.)
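As a starting point for that bonus task, here is a hedged sketch (the grid of raw model outputs is arbitrary) comparing the per-sample squared-error loss with the logistic log-loss for a single point with true label $y = 1$. For very confident correct outputs the squared error keeps growing while the log-loss goes to zero, and for badly wrong outputs the squared error grows quadratically while the log-loss grows only roughly linearly.

# Sketch: per-sample loss for a point with true label y = 1 as a function of the raw model output.
z = np.linspace(-5, 5, 201)               # raw linear output beta^T x

squared_error = (1 - z) ** 2              # linear regression treats the label 1 as a target value
logistic_loss = np.log1p(np.exp(-z))      # -log(sigmoid(z)), the log-loss for y = 1

plt.plot(z, squared_error, label="squared error ($y=1$)")
plt.plot(z, logistic_loss, label="logistic log-loss ($y=1$)")
plt.xlabel("raw model output")
plt.ylabel("per-sample loss")
plt.legend()
plt.show()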

Let’s get an intuitive understanding of how this works by looking at the plots above.

Enough with the comparison between linear and logistic regression for now. Let’s have a look at just the logistic regression model!

For that, a logistic regression model is trained and visualized on the data without the outlier in the following two code cells. (Just like the linear regression model in the beginning.)

model = LogisticRegression(penalty=None)
model.fit(X, Y)

Y_pred = model.predict_proba(X)[:, 1]

print("0-1 loss:", zero_one_loss(Y, Y_pred > 0.5))
plot_logistic_model_preds(X, Y_pred)
def logistic_regressor(x):
    beta0 = model.intercept_
    beta1_n = model.coef_
    beta_vec = np.array([beta0[0], *beta1_n[0]])

    # TODO: implement the logistic regression model function for the parameters beta
    return None


# checking and plotting the solution
custom_Y_pred = np.apply_along_axis(logistic_regressor, 1, X)

if not all(custom_Y_pred == None):
    plot_logistic_model_preds(X, custom_Y_pred)
    are_logistic_predictions_equal(X, model, logistic_regressor)

In the lecture it was also mentioned that logistic regression models can struggle with data that is actually linearly separable.

Side note:

sklearn’s logistic regression optimizer really has to be forced into showing this behaviour, because it is implemented quite robustly against this case. Check out the parameters passed to LogisticRegression() below to see what has to be done in order to disable sklearn’s safeguards.

X_sep = []
Y_sep = []

if len(X_sep) > 0 and len(Y_sep) > 0:
    model = LogisticRegression(penalty=None, max_iter=1000, tol=0)
    model.fit(X_sep, Y_sep)

    print(model.coef_)
    print(model.intercept_)
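Once you have built your own X_sep and Y_sep, you can compare with the following hedged sketch on a deliberately simple, separable 1-D toy dataset: with the penalty disabled, the learned coefficient keeps growing with the iteration budget (expect convergence warnings for the small budgets) instead of settling on a finite value.

# Sketch: on perfectly separable toy data the unpenalized coefficient grows with the iteration budget.
X_toy = np.array([[-2.0], [-1.0], [1.0], [2.0]])
Y_toy = np.array([0, 0, 1, 1])

for n_iter in (5, 20, 100, 1000):
    toy_model = LogisticRegression(penalty=None, max_iter=n_iter, tol=0)
    toy_model.fit(X_toy, Y_toy)
    print(f"max_iter = {n_iter:>4}: coef = {toy_model.coef_[0, 0]:.2f}")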