
04: Model Evaluation

Regression vs. Classification

Two very simple datasets are plotted below. Let’s say we have to predict y from x in each of them.

  1. One of the datasets defines a classification problem and the other one defines a regression problem. Which is which?

  2. Explain your answer!

import numpy as np
import matplotlib.pyplot as plt

dataset_a = np.random.normal(0.0, 1.0, 100)

plt.scatter(dataset_a, dataset_a + np.random.uniform(-0.3, 0.3, 100))
plt.xlabel("x")
plt.ylabel("y")
plt.title("Dataset A")
plt.show()

dataset_b = np.concatenate([np.random.normal(i, 0.1, 33) for i in range(3)])

plt.scatter(dataset_b, np.concatenate([[value]*33 for value in ["A", "B", "C"]]))
plt.title("Dataset B")
plt.xlabel("x")
plt.ylabel("y")
plt.show()
[Figure: scatter plot of Dataset A]
[Figure: scatter plot of Dataset B]

Regression

MAE vs MSE

Let $y_i$ be the actual value and $\hat{y}_i$ be the predicted value, then:

$$MAE = \frac{1}{n} \sum_i |y_i - \hat{y}_i|$$

and

$$MSE = \frac{1}{n} \sum_i (y_i - \hat{y}_i)^2.$$
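As a quick sanity check, both definitions can be evaluated directly with NumPy and compared against the corresponding scikit-learn functions. The values below are made up purely for illustration:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([1.0, 2.0, 3.0, 4.0])   # toy actual values (illustrative only)
y_pred = np.array([1.5, 1.5, 2.0, 6.0])   # toy predicted values

mae_manual = np.mean(np.abs(y_true - y_pred))   # 1/n * sum_i |y_i - y_hat_i|
mse_manual = np.mean((y_true - y_pred) ** 2)    # 1/n * sum_i (y_i - y_hat_i)^2

print(mae_manual, mean_absolute_error(y_true, y_pred))  # both values should match
print(mse_manual, mean_squared_error(y_true, y_pred))   # both values should match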

The MAE and MSE are applied to the example data below to evaluate a mean predictor model.

  1. What pros/cons do you see in this example for the MAE and the MSE respectively?

from sklearn.metrics import mean_absolute_error, mean_squared_error

mae_vs_mse_X = np.random.uniform(0, 10, 10)
mae_vs_mse_dataset = np.array([[x, 1] for x in mae_vs_mse_X])
mae_vs_mse_dataset[-1, 1] = 9

mean_predictor = np.mean(mae_vs_mse_dataset[:, 1])

print("mean =", mean_predictor)

mae = mean_absolute_error(mae_vs_mse_dataset[:, 1], [mean_predictor]*10)
mse = mean_squared_error(mae_vs_mse_dataset[:, 1], [mean_predictor]*10)

print(f"Mean Predictor MAE = {mae:.4f}")
print(f"Mean Predictor MSE = {mse:.4f}")

plt.scatter(mae_vs_mse_dataset[:, 0], mae_vs_mse_dataset[:, 1], s=10)
plt.axhline(mean_predictor)

plt.xlim([0, 10])
plt.ylim([0, 10])

plt.xlabel("x")
plt.ylabel("y")
plt.legend(["data", "mean predictor"])

plt.show()
mean = 1.8
Mean Predictor MAE = 1.4400
Mean Predictor MSE = 5.7600
[Figure: data points and the mean predictor line]

$R^2$

Let $y_i$ be the actual value, $\hat{y}_i$ the predicted value and $\bar{y} = \frac{1}{n} \sum_i y_i$ the mean of all actual values. Then,

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_j (y_j - \bar{y})^2}.$$

  1. What is the $R^2$ of the mean predictor?

  2. What is the best possible $R^2$ a model could get?

  3. What is the worst possible $R^2$ a model could get?

  4. Bonus: How is $R^2$ related to the variance in the data? (Hint: Have a close look at the denominator (bottom of the fraction).)
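For reference, here is a minimal sketch (again with made-up toy values) of evaluating the formula above directly and comparing it with scikit-learn's r2_score:

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])   # toy actual values (illustrative only)
y_pred = np.array([1.5, 1.5, 2.0, 6.0])   # toy predicted values

ss_res = np.sum((y_true - y_pred) ** 2)            # sum_i (y_i - y_hat_i)^2
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)   # sum_j (y_j - y_bar)^2

print(1 - ss_res / ss_tot)       # R^2 computed from the definition
print(r2_score(y_true, y_pred))  # should give the same value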

Classification

Confusion Matrix

  1. Fill in the cell names in the cells marked with an x in the following confusion matrix of a binary classification problem.

  2. Briefly state what each of the four non-bold cells represent.

  3. Turn the following multi-class confusion matrix into a binary classification confusion matrix for C2 as the positive class.

Multi-class confusion matrix (for question 3):

|    | C1 | C2 | C3 | C4 |
|----|----|----|----|----|
| C1 | 1  | 2  | 3  | 4  |
| C2 | 5  | 6  | 7  | 8  |
| C3 | 9  | 10 | 11 | 12 |
| C4 | 13 | 14 | 15 | 16 |

Binary confusion matrix to fill in (for questions 1 and 2):

|   | x | x |
|---|---|---|
| x | x | x |
| x | x | x |

Binary confusion matrix to fill in with C2 as the positive class (for question 3):

|   | x | x |
|---|---|---|
| x | x | x |
| x | x | x |
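As a hint for working with confusion matrices in code, the sketch below (using made-up labels, not the matrix above) shows how scikit-learn computes a multi-class confusion matrix and how the same labels can be binarized for a chosen positive class:

from sklearn.metrics import confusion_matrix

# Made-up multi-class labels (illustrative only)
y_true = ["C1", "C2", "C2", "C3", "C1", "C3", "C2"]
y_pred = ["C1", "C2", "C3", "C3", "C2", "C3", "C2"]

# Rows correspond to the actual class, columns to the predicted class
print(confusion_matrix(y_true, y_pred, labels=["C1", "C2", "C3"]))

# The same labels viewed as a binary problem with C2 as the positive class
y_true_bin = [1 if label == "C2" else 0 for label in y_true]
y_pred_bin = [1 if label == "C2" else 0 for label in y_pred]
print(confusion_matrix(y_true_bin, y_pred_bin, labels=[1, 0]))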

Properties of precision and recall

What unwanted (trivial) strategies could a classifier that predicts either class 0 or class 1 employ to...

  1. always get a precision of 1?

  2. always get a recall of 1?

Choosing between precision and recall

  1. In what settings is precision more important than recall, and in which is recall more important than precision? Hint: As an example, think of a spam filter that has to differentiate between spam and important emails from your boss.

Combining precision and recall

  1. What is a measure that combines precision and recall?

  2. Define it.

  3. Why do we use the harmonic rather than the arithmetic mean?

ROC AUC

ROC AUC is a metric for binary classifiers that predict class scores instead of class memberships.

  1. How can class scores be converted to predictions of class membership?

  2. Assume you have an evaluation set of $n$ data points and a unique class score prediction for each of them from your model. How many unique sets of class membership predictions can be defined using the method from your answer to the previous question?

  3. Define the axes of ROC space based on the values in a confusion matrix.

$x =$
$y =$
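The sketch below (toy scores and a few hypothetical thresholds, purely illustrative) shows one common way of turning class scores into class membership predictions; note that the ROC AUC itself is computed from the scores directly:

import numpy as np
from sklearn.metrics import roc_auc_score

y_true   = np.array([0, 0, 1, 1, 0, 1])                 # toy labels (illustrative only)
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])    # class scores from some model

for threshold in (0.3, 0.5, 0.7):                  # each threshold yields one set of predictions
    y_pred = (y_scores >= threshold).astype(int)   # scores -> class memberships
    print(f"threshold={threshold}: predictions={y_pred}")

print("ROC AUC:", roc_auc_score(y_true, y_scores))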

Reference values

Consider a label set y with roughly the same number of cases (1) and controls (0).

  1. For a classifier that returns random labels: what do you expect to observe?

  2. For a classifier that always returns 1: what do you expect to observe?

  3. Check your answers by running the code cells below to calculate the accuracy, precision, recall, and ROC AUC.

np.random.seed(0)
y = np.random.choice([0,1], p=(0.5, 0.5), size=1000)
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

# random classifier evaluation
y_pred_random = np.random.choice([0,1], p=(0.5, 0.5), size=1000)
print("Accuracy: ", accuracy_score (y, y_pred_random))
print("Precision:", precision_score(y, y_pred_random))
print("Recall:   ", recall_score   (y, y_pred_random))
print("ROC AUC:  ", roc_auc_score  (y, y_pred_random))
# majority classifier evaluation
y_pred_1 = np.repeat(1, y.size)
print("Accuracy: ", accuracy_score (y, y_pred_1))
print("Precision:", precision_score(y, y_pred_1))
print("Recall:   ", recall_score   (y, y_pred_1))
print("ROC AUC:  ", roc_auc_score  (y, y_pred_1))

Imbalanced data

Consider a label set y_imbalance with only a small number of cases (1) compared to controls (0).

  1. For a classifier that returns random labels: what do you expect to observe?

  2. For a classifier that always predicts the majority class (0): what do you expect to observe, and why could this be an issue in practice (particularly if we only consider accuracy)?

  3. Check your answers by running the code cells below to calculate the accuracy, precision, recall, and ROC AUC.

np.random.seed(0)
y_imbalance = np.random.choice([0,1], p=(0.9, 0.1), size=1000)  # imbalanced labels, i.e., small sets of `cases` (label=1)
# random classifier evaluation
y_pred_random = np.random.choice([0,1], p=(0.5, 0.5), size=1000)
print("Accuracy: ", accuracy_score (y_imbalance, y_pred_random))
print("Precision:", precision_score(y_imbalance, y_pred_random))
print("Recall:   ", recall_score   (y_imbalance, y_pred_random))
print("ROC AUC:  ", roc_auc_score  (y_imbalance, y_pred_random))
# majority classifier evaluation
y_pred_majority = np.repeat(0, y_imbalance.size)
print("Accuracy: ", accuracy_score (y_imbalance, y_pred_majority))
print("Precision:", precision_score(y_imbalance, y_pred_majority))
print("Recall:   ", recall_score   (y_imbalance, y_pred_majority))
print("ROC AUC:  ", roc_auc_score  (y_imbalance, y_pred_majority))

Holdout Data Splits

Train / Test

The following code uses train_test_split to shuffle and split the given data (X, y) into a train and a test dataset with an 80:20 ratio. It then trains a DecisionTreeClassifier on the training dataset and evaluates it using the ROC AUC score on both the training and the test dataset.

  1. What do you observe?

  2. What could be an issue with publications about a new model where no test set is used?

  3. What could go wrong if we keep on improving our model until it performs well on our fixed test set?

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

np.random.seed(42)

n_samples = 1000
n_features = 100
X = np.random.random((n_samples, n_features))
y = np.random.choice([0, 1], p=(0.5, 0.5), size=n_samples)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)

model = DecisionTreeClassifier()
model.fit(X_train, y_train)

y_pred = model.predict_proba(X_train)
print(f"Train ROC AUC = {roc_auc_score(y_train, y_pred[:,1])}")

y_pred = model.predict_proba(X_test)
print(f"Test ROC AUC = {roc_auc_score(y_test, y_pred[:,1])}")

Cross Validation

Consider the following dataset (X, y). We already split it into a train set (X_train, y_train), a validation set (X_val, y_val), and a test set (X_test, y_test).

  1. Why would we split the dataset into 3 parts like this, instead of just train and test?

  2. The ROC AUC score is computed on the train, validation, and test datasets. What do you observe?

  3. How does the application of cross validation improve the evaluation compared to just evaluating the model on a fixed evaluation set?

  4. Cross validation is computed here and the mean and standard deviation of the ROC AUC scores are reported. What does this tell you?

  5. In addition to the ROC AUC score on the test set, why would you also always report the mean and standard deviation of the cross-validation scores?

  6. Why would you be a bit suspicious of the current ROC AUC score on the test set?

  7. BONUS: Why are we observing these results based on the data we use?

  8. Suppose the dataset contains medical data of patients and the same patient was examined multiple times, leading to multiple data instances. Why would the current implementation of cross validation be considered cheating and how should it be changed in order to avoid cheating?

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

np.random.seed(0)

n_samples = 200 * 6
X, y = make_classification(n_samples=n_samples, flip_y=0.01, n_redundant=2, n_informative=2)
X[-2*int(n_samples / 6):-int(n_samples / 6),:] = np.random.random((int(n_samples / 6), X.shape[1]))

# define the test set
X_intermediate, X_test, y_intermediate, y_test = train_test_split(X, y, test_size=1/6, shuffle=False)

# use the remaining data to define the train and validation set
X_train, X_val, y_train, y_val = train_test_split(X_intermediate, y_intermediate, test_size=0.2, shuffle=False)

model = DecisionTreeClassifier()
model.fit(X_train, y_train)

y_pred = model.predict_proba(X_train)
print(f"Train ROC AUC = {roc_auc_score(y_train, y_pred[:,1])}")

y_pred = model.predict_proba(X_val)
print(f"Validation ROC AUC = {roc_auc_score(y_val, y_pred[:,1])}")

y_pred = model.predict_proba(X_test)
print(f"Test ROC AUC = {roc_auc_score(y_test, y_pred[:,1])}")

from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X_intermediate, y_intermediate, scoring="roc_auc")
print("CV ROC AUC mean:", np.mean(scores))
print("CV ROC AUC std:", np.std(scores))