# General Terms and First Regression Example

## General Terms

a. Define Knowledge Discovery in Databases!

b. What is the difference between data mining and knowledge discovery (based on the definitions used in the lecture slides)?

c. List four typical tasks in the context of the Modeling phase in CRISP-DM and briefly describe them based on a quick web search!

d. List and briefly explain the three properties of data with which [De Mauro et al. 2016] defines Big Data!

### CRISP-DM

a. List the six phases of the CRISP-DM model and describe them in bullet points!

b. How is the Data Preparation phase linked to the other phases conceptually?

c. What happens in the Evaluation phase?

## Regression: House Prices

In this set of exercises, we will work on a task that aims to predict housing prices from variables describing homes.
The task is based on a [Kaggle Competition](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques). 
Here is a more detailed description of the task:

> Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.
>
> With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

A [detailed description of the data](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data) can be found on [Kaggle](https://www.kaggle.com/). Additional information is included in the file `exercise_02_data_description.txt`, which you can find in the StudIP course material together with the dataset in `exercise_02_train.csv`.

a. Take a look at `exercise_02_data_description.txt`! Which data property of Big Data does this indicate?

b. Now also skim the dataset in `exercise_02_train.csv`! Do you conclude that the task of predicting housing prices can be considered Big Data?
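
When skimming a dataset, two quick measurements are its number of rows and columns and its size in memory. The following is a minimal sketch on a small made-up table; the helper name `summarize_size` and the toy values are assumptions for illustration and are not part of the exercise material. The same helper could be applied to the table loaded from `exercise_02_train.csv`.

```python
import pandas as pd


def summarize_size(df: pd.DataFrame) -> dict:
    """Return row count, column count, and in-memory size in bytes."""
    n_rows, n_cols = df.shape
    # deep=True also counts the contents of string (object) columns.
    n_bytes = int(df.memory_usage(deep=True).sum())
    return {"rows": n_rows, "columns": n_cols, "bytes": n_bytes}


# Hypothetical toy data standing in for the real CSV file.
toy = pd.DataFrame(
    {
        "LotArea": [8450, 9600, 11250],
        "Street": ["Pave", "Pave", "Grvl"],
    }
)

summary = summarize_size(toy)
print(f"{summary['rows']} rows, {summary['columns']} columns, {summary['bytes']} bytes")
```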

### A minimal Data Science workflow

Suppose you started working on the above problem of predicting housing prices from features of houses, but you did not finish.
The project was on hold for a couple of weeks, and now you are picking it up again.
The following code is what you have so far.

Let's first try to understand the code in its current state.
Afterwards we can figure out how to continue.

1. Read through the code in the next cell and briefly answer the following questions to make sure you understand what is going on!
   1. What are `X` and `y`?
   2. What is stored in `numeric_variables`?
   3. What is the shape of a `pandas.DataFrame`? (See, e.g., line `19`.)
   4. What are "NAs"? (Hint: You may look up the documentation of `pandas.DataFrame.isna`.)
   5. What is stored in `n_na`? (Hint: This one may be more tricky than it looks at first glance. Have a look at line `23`!)
   6. What do the Python expressions do that are written in curly braces in some strings? (Hint: There is an `f` in front of those strings. Weird `f-strings`...)
   7. What does the "list comprehension" in line `36` do?

2. You know what the code does on a step-by-step level now. Let's "zoom out" to get a bigger picture. Answer the following questions!
   1. Map the phases of CRISP-DM onto the code below! Give the start and end lines of each phase!
   2. Some phases are missing. Is there maybe a phase that we have already completed, but that is not included in the code? (Hint: We do not worry about Deployment in this exercise.)

3. By understanding what you left behind when you last approached this problem, we have taken on the mindset of a Data Scientist. Now let's do some Data Science!
   1. Run the code below!
   2. Inspect the output and ask yourself the following questions! (You may ignore the error message for now.)
      1. How many variables (in percent) do we lose if we only use numeric variables?
      2. How many variables (in percent) do we lose if we drop all variables that have NAs?
      3. How many samples (in percent) do we lose if we drop all samples that contain NAs?
      4. How many variables do we have left if we only use numeric variables and drop all variables with NAs?
   3. Inspect the error message and answer the following questions! (You are encouraged to use an LLM for this step.)
      1. Is the error related to any other output?
      2. Check out the documentation of the model used [here](https://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares)! Do you see a reason why the data might not be suitable for this model?
      3. What needs to be done conceptually in order to solve the error? (Hint: Consider the model type fixed, so it's not the problem. Think about how to adapt the data in order to fit the model.)
      4. Implement that fix and run the code again!
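
Several of the comprehension questions above concern general pandas and Python idioms. The following minimal sketch illustrates them on a made-up three-row table (the column names and values are assumptions chosen for illustration and are unrelated to the exercise data): the `shape` tuple, per-column NA counts, format specifications inside f-strings, and a list comprehension that builds a boolean mask.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table; not taken from the exercise dataset.
toy = pd.DataFrame(
    {
        "area": [50.0, np.nan, 80.0],
        "rooms": [2, 3, 4],
        "street": ["A", None, "B"],
    }
)

# shape is a tuple: (number of rows, number of columns).
n_rows, n_cols = toy.shape

# isna() marks missing entries with True; sum() then counts them per column,
# so the result is a Series indexed by column name.
na_per_column = toy.isna().sum()

# Expressions in curly braces are evaluated and formatted inside f-strings;
# ".02f" formats a float with two decimal places.
share = na_per_column["area"] / n_rows * 100
message = f"{share:.02f}% of the 'area' values are missing."

# A list comprehension builds a list with one entry per column,
# here True if the column name is in the given collection.
wanted = ["area", "rooms"]
mask = [c in wanted for c in toy.columns]

print(n_rows, n_cols)
print(na_per_column.to_dict())
print(message)
print(mask)
```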

```python
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

data_houses = pd.read_csv(
    "exercise_02_train.csv",
    index_col="Id")

X = data_houses.drop(columns="SalePrice")
y = data_houses["SalePrice"]

numeric_variables = data_houses.select_dtypes(np.number).columns

print(f"The following variables are numeric:")
for i, c in enumerate(numeric_variables):
    print(f"  {i:02d}.", c)

percent_numeric = numeric_variables.size / X.shape[1] * 100
print(f"{percent_numeric:.02f}% of all variables are numeric.")
print(f"This means, we lose {100 - percent_numeric:.02f}% if we only keep numeric variables.")

n_na = X.isna().sum()
mask_na = n_na > 0
print(n_na[mask_na])

percent_na = mask_na.sum() / X.shape[1] * 100
print(f"{percent_na:.02f}% of all variables contain NAs.")
print(f"This means, we lose {percent_na:.02f}% of all variables if we only keep variables without NAs.")

n_na_samples = X.shape[0] - X.dropna().shape[0]
percent_na_samples = n_na_samples / X.shape[0] * 100
print(f"{percent_na_samples:.02f}% of all samples contain NAs.")
print(f"This means, we lose {percent_na_samples:.02f}% if we drop all samples with NAs.")

mask_number = [c in numeric_variables for c in X.columns]
mask_selected = mask_number & ~mask_na
print(X.loc[:,mask_selected].shape[1])

model = LinearRegression()
model.fit(X, y)

y_pred = model.predict(X)
mean_absolute_error(y, y_pred)
```

```
The following variables are numeric:
  00. MSSubClass
  01. LotFrontage
  02. LotArea
  03. OverallQual
  04. OverallCond
  05. YearBuilt
  06. YearRemodAdd
  07. MasVnrArea
  08. BsmtFinSF1
  09. BsmtFinSF2
  10. BsmtUnfSF
  11. TotalBsmtSF
  12. 1stFlrSF
  13. 2ndFlrSF
  14. LowQualFinSF
  15. GrLivArea
  16. BsmtFullBath
  17. BsmtHalfBath
  18. FullBath
  19. HalfBath
  20. BedroomAbvGr
  21. KitchenAbvGr
  22. TotRmsAbvGrd
  23. Fireplaces
  24. GarageYrBlt
  25. GarageCars
  26. GarageArea
  27. WoodDeckSF
  28. OpenPorchSF
  29. EnclosedPorch
  30. 3SsnPorch
  31. ScreenPorch
  32. PoolArea
  33. MiscVal
  34. MoSold
  35. YrSold
  36. SalePrice
46.84% of all variables are numeric.
This means, we lose 53.16% if we only keep numeric variables.
LotFrontage      259
Alley           1369
MasVnrType       872
MasVnrArea         8
BsmtQual          37
BsmtCond          37
BsmtExposure      38
BsmtFinType1      37
BsmtFinType2      38
Electrical         1
FireplaceQu      690
GarageTyp

  mask_selected = mask_number & ~mask_na

ValueError: could not convert string to float: 'RL'
```
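
An error message of this form can be traced back to a conversion step: estimators such as `LinearRegression` work on arrays of floating-point numbers, so the input table is converted to such an array first. The following minimal sketch performs that kind of conversion by hand on a made-up two-column table (the column names and values are assumptions for illustration). It suggests where the message comes from and why a purely numeric selection of columns converts without problems.

```python
import pandas as pd

# Hypothetical toy table with one numeric and one string column.
toy = pd.DataFrame(
    {
        "LotArea": [8450, 9600],
        "Zone": ["RL", "RM"],
    }
)

# Converting the whole table to floats fails at the first string value.
try:
    toy.to_numpy(dtype=float)
    error_message = None
except ValueError as err:
    error_message = str(err)

# Restricting the table to numeric columns makes the conversion succeed.
numeric_only = toy.select_dtypes("number")
as_floats = numeric_only.to_numpy(dtype=float)

print(error_message)
print(as_floats.shape)
```

Note that this sketch only covers the string problem; missing values in numeric columns are a separate issue that the conversion itself does not report.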