Scaling, Model Introspection and Reproducibility#
Scaling#
Run the following cell.
%pip install pysubgroup==0.7.6
What is scaling?
Why is scaling important?
What are two common examples of scaling? What do they do?
These two scaling methods are applied to the following data and the results are plotted. What do you observe (look at the axes)?
Afterwards an outlier is added to the data and the scaling results are plotted again. What do you observe and why is this a problem?
Also have a look at the other scaling procedures provided by scikit-learn. Which scalers would help alleviate the outlier issue?
# generate the data
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42)
X = np.random.multivariate_normal((0, 0), [[1, 0], [0, 1]], 1000)
X[:,0] = X[:,0] - 3   # shift the first feature away from zero
X[:,1] = X[:,1] / 10  # shrink the second feature to a much smaller scale
# plot the data
plt.scatter(X[:,0], X[:,1], s=1)
plt.show()

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
X_scaled = StandardScaler().fit_transform(X)
print(f"Mean: {X_scaled.mean():.3f}, Std: {X_scaled.std():.3f}")
print(f"Min: {X_scaled.min():.3f}, Max: {X_scaled.max():.3f}")
plt.scatter(X_scaled[:,0], X_scaled[:,1], s=1)
plt.show()
Mean: -0.000, Std: 1.000
Min: -3.407, Max: 3.742

X_scaled = MinMaxScaler().fit_transform(X)
print(f"Mean: {X_scaled.mean():.3f}, Std: {X_scaled.std():.3f}")
print(f"Min: {X_scaled.min():.3f}, Max: {X_scaled.max():.3f}")
plt.scatter(X_scaled[:,0], X_scaled[:,1], s=1)
plt.show()
Mean: 0.480, Std: 0.156
Min: 0.000, Max: 1.000

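The same transforms can be written out by hand, which makes explicit what the two scalers compute per column. A minimal NumPy sketch on the X generated above (before the outlier is added):
Xs = (X - X.mean(axis=0)) / X.std(axis=0)                   # standardization: (x - mean) / std, per column
Xm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))  # min-max scaling: (x - min) / (max - min), per column
print(Xs.mean(axis=0).round(3), Xs.std(axis=0).round(3))    # per-column mean ~0, std ~1
print(Xm.min(axis=0).round(3), Xm.max(axis=0).round(3))     # per-column min 0, max 1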
X = np.concatenate([X, [[10, 10]]]) # add outlier
X_scaled = StandardScaler().fit_transform(X)
print(f"Mean: {X_scaled.mean():.3f}, Std: {X_scaled.std():.3f}")
print(f"Min: {X_scaled.min():.3f}, Max: {X_scaled.max():.3f}")
plt.scatter(X_scaled[:,0], X_scaled[:,1], s=1)
plt.show()
Mean: -0.000, Std: 1.000
Min: -3.148, Max: 30.108

X_scaled = MinMaxScaler().fit_transform(X)
print(f"Mean: {X_scaled.mean():.3f}, Std: {X_scaled.std():.3f}")
print(f"Min: {X_scaled.min():.3f}, Max: {X_scaled.max():.3f}")
plt.scatter(X_scaled[:,0], X_scaled[:,1], s=1)
plt.show()
Mean: 0.116, Std: 0.100
Min: 0.000, Max: 1.000

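One scaler worth trying on the last question (not the only option) is RobustScaler, which is already imported above: it centres each column on its median and scales by the interquartile range, so a single outlier barely shifts its statistics. A minimal sketch on the same outlier-containing X:
X_scaled = RobustScaler().fit_transform(X)
print(f"Per-column medians: {np.median(X_scaled, axis=0)}")  # roughly 0
print(f"Min: {X_scaled.min():.3f}, Max: {X_scaled.max():.3f}")
plt.scatter(X_scaled[:,0], X_scaled[:,1], s=1)
plt.show()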
Model Introspection#
The following code plots the coefficients of a linear model. Try to interpret the coefficients as feature importances in the following tasks.
Try different data splits. What do you observe?
Try adding and removing individual features. What do you observe?
Try scaling individual features. What do you observe?
Are the features that are considered important according to a model the only possibly important features to make predictions?
# imports
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
import seaborn as sns
# load data
data_titanic = pd.read_csv("exercise_01_titanic.csv", index_col="PassengerId")
def extract_features(data):
    """Extract features from existing variables."""
    data_extract = data.copy()
    # Name: remove any parenthesised part, split off the surname, then take the title (e.g. "Mr")
    name_only = data_extract["Name"].str.replace(r"\(.*\)", "", regex=True)
    first_name = name_only.str.split(", ", expand=True).iloc[:,1]
    title = first_name.str.split(".", expand=True).iloc[:,0]
    data_extract["Title"] = title
    # extract others (optional)
    return data_extract
data_extract = extract_features(data_titanic)
def preprocess(data):
    """Convert features into numeric variables readable by our models."""
    data_preprocessed = data.copy()
    # Sex
    data_preprocessed = pd.get_dummies(data_preprocessed, columns=["Sex"], drop_first=True)
    # Embarked
    data_preprocessed = pd.get_dummies(data_preprocessed, columns=["Embarked"], dummy_na=True)
    # Title: group titles that occur fewer than 50 times under the single label "higher"
    title = data_preprocessed["Title"]
    title_counts = title.value_counts()
    higher_titles = title_counts[title_counts < 50]
    title_groups = ["higher" if t in higher_titles else t for t in title]
    data_preprocessed["Title"] = title_groups
    data_preprocessed = pd.get_dummies(data_preprocessed, columns=["Title"])
    # drop the rest
    data_preprocessed.drop(columns=["Name", "Cabin", "Ticket"], inplace=True)
    return data_preprocessed
data_preprocessed = preprocess(data_extract)
# before inspecting the data, selecting and building models, etc.
# FIRST split data into train and test data (we set the test data size to 30%)
X = data_preprocessed.drop(columns="Survived")
y = data_preprocessed["Survived"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y)
imputer = SimpleImputer(strategy="mean")
scaler = StandardScaler()
preprocessing_pipeline = make_pipeline(scaler, imputer)
X_train_example = X_train.copy()
# TODO: drop features here
# ...
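# For example (illustrative only, assuming the standard Titanic columns):
# X_train_example = X_train_example.drop(columns=["Fare"])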
X_train_processed = preprocessing_pipeline.fit_transform(X_train_example)
# TODO: change `X_train_processed` for example here
# ...
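# For example (illustrative only): rescale one column of the transformed array
# and watch how its coefficient changes in the plot below:
# X_train_processed[:, 0] = X_train_processed[:, 0] * 100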
# set model
model = LogisticRegression()
# fit for coefficient extraction
model.fit(X_train_processed, y_train)
feature_weights = model.coef_.flatten()
# plot feature importances
fig, ax = plt.subplots(1, 1, figsize=(6 * 2, 6), sharey=True)
sns.barplot(x=X_train_example.columns, y=feature_weights, ax=ax, hue=np.linspace(0, 1, len(X_train_example.columns)), palette="hsv")
ax.tick_params(axis="x", labelrotation=270)
plt.show()

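The last question above asks whether the features a linear model weights heavily are the only ones that could matter for prediction. One way to probe this, sketched below under the assumption that X_train_example and X_test still contain the same columns, is permutation importance: shuffle one feature at a time on held-out data and measure how much the score drops.
from sklearn.inspection import permutation_importance
# preprocess the test data with the pipeline fitted on the training data
X_test_processed = preprocessing_pipeline.transform(X_test)
result = permutation_importance(model, X_test_processed, y_test, n_repeats=10, random_state=0)
for name, importance in zip(X_train_example.columns, result.importances_mean):
    print(f"{name}: {importance:.3f}")
Correlated features are one reason the two views can disagree: a feature with a small coefficient can still carry predictive information if the model can get much of it from another, correlated column.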
Reproducibility#
Reproduce an experiment with these steps:
Run the following code, it gives you a number.
Now save the my_data DataFrame to a file (it is your very own dataset) and send it to your neighbour.
On your neighbour's computer (with their consent), create a new notebook, read the data into a DataFrame your_data, and run the Python line print(int(version("pysubgroup").split(".")[2]) + np.mean(your_data["numbers"]) + np.random.random(1)[0]).
Get it to give you the same number as on your computer. (You may have to change your code and redo the previous steps to make that possible.)
How could a version control system have helped you here?
How could an environment have helped you here? (e.g. using Conda)
BONUS: How could Docker have helped you here?
from importlib.metadata import version
import pysubgroup
rng1 = np.random.default_rng()
rng2 = np.random.default_rng()
my_data = pd.DataFrame({"numbers": rng1.random(100)})
print(int(version("pysubgroup").split(".")[2]) + np.mean(my_data["numbers"]) + rng2.random(1)[0])
7.112655896366465
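A minimal sketch of one possible fix (not the only one), assuming both machines have pysubgroup pinned as in the install cell above and the dataset is exchanged as a CSV file: seed every random number generator and ship the exact data instead of regenerating it.
# your machine: seed the generators and save the exact dataset that was used
rng1 = np.random.default_rng(42)
rng2 = np.random.default_rng(42)
my_data = pd.DataFrame({"numbers": rng1.random(100)})
my_data.to_csv("my_data.csv", index=False)  # send this file (and your code) along
print(int(version("pysubgroup").split(".")[2]) + np.mean(my_data["numbers"]) + rng2.random(1)[0])
# neighbour's machine: same package version, same file, same seeded generator
your_data = pd.read_csv("my_data.csv")
print(int(version("pysubgroup").split(".")[2]) + np.mean(your_data["numbers"]) + np.random.default_rng(42).random(1)[0])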