General Terms and First Regression Example#
General Terms#
a. Define Knowledge Discovery in Databases!
b. What is the difference between data mining and knowledge discovery (based on the definitions used in the lecture slides)?
c. List four typical tasks in the context of the Modeling phase in CRISP-DM and briefly describe them based on a quick web search!
d. List and briefly explain the three properties of data with which [De Mauro et al. 2016] define Big Data!
CRISP-DM#
a. List the six phases of the CRISP-DM model and describe them in bullet points!
b. How is the Data Preparation phase linked to the other phases conceptually?
c. What happens in the Evaluation phase?
Regression: House Prices#
In this set of exercises, we will work on a task that aims to predict housing prices from variables describing homes. The task is based on a Kaggle Competition. Here is a more detailed description of the task:
Ask a home buyer to describe their dream house, and they probably won’t begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition’s dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.
With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.
A detailed description of the data can be found on Kaggle. Additional information is included in the file exercise_02_data_description.txt, which you can find in the StudIP course material together with the dataset in exercise_02_train.csv.
a. Take a look at exercise_02_data_description.txt! Which data property of Big Data does this indicate?
b. Now also skim the dataset in exercise_02_train.csv! Do you conclude that the task of predicting housing prices can be considered Big Data?
A minimal Data Science workflow#
Suppose you have tried to solve the above problem of predicting housing prices from features of houses, but you didn’t finish. The project was on hold for a couple of weeks, but now you are trying to reapproach it. The following code is what you have so far.
Let’s first try to understand the code in its current state. Afterwards we can figure out how to continue.
Read through the code in the next cell and briefly answer the following questions to make sure you understand what is going on!
What are X and y?
What is stored in numeric_variables?
What is the shape of a pandas.DataFrame? (e.g. line 19)
What are “NAs”? (Hint: You may look up the documentation of pandas.DataFrame.isna.)
What is stored in n_na? (Hint: This one may be more tricky than it looks at first glance. Have a look at line 23!)
What do the Python expressions do that are written in curly braces in some strings? (Hint: There is an f in front of those strings. Weird f-strings…)
What does the “list comprehension” in line 36 do? (If f-strings or list comprehensions are new to you, a short illustration follows this list.)
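In case f-strings or list comprehensions are unfamiliar, here is a minimal standalone sketch of both features. The values and names are made up for illustration and have nothing to do with the housing data.

    price = 1234.5678
    # Inside an f-string, the expression in curly braces is evaluated and
    # formatted; ":.02f" rounds the result to two decimal places.
    print(f"The price is {price:.02f} Euro.")

    columns = ["LotArea", "Street", "SalePrice"]
    numeric = ["LotArea", "SalePrice"]
    # A list comprehension builds a new list with one entry per element of
    # the iterated sequence; here, a boolean mask over the columns.
    mask = [c in numeric for c in columns]
    print(mask)  # [True, False, True]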
You know what the code does on a step-by-step level now. Let’s “zoom out” to get a bigger picture. Answer the following questions!
Map the phases of CRISP-DM onto the code below! Give the start and end lines of each phase!
Some phases are missing. Is there perhaps a phase that we have already completed that is not reflected in the code? (Hint: We do not worry about Deployment in this exercise.)
By understanding what you left behind when you last approached this problem, we have adopted the mindset of a Data Scientist. Now let’s do some Data Science!
Run the code below!
Inspect the output and ask yourself the following questions! (You may ignore the error message for now.)
How many variables (in percent) do we lose if we only use numeric variables?
How many variables (in percent) do we lose if we drop all variables that have NAs?
How many samples (in percent) do we lose if we drop all samples that contain NAs?
How many variables do we have left if we only use numeric variables and drop all variables with NAs? (If the difference between dropping variables and dropping samples is unclear, see the toy sketch after this list.)
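To make the difference between dropping variables (columns) and dropping samples (rows) concrete, here is a toy sketch; the frame and its values are hypothetical and not taken from the housing data.

    import pandas as pd
    import numpy as np

    toy = pd.DataFrame({
        "a": [1.0, 2.0, 3.0],
        "b": [1.0, np.nan, 3.0],
        "c": [np.nan, 2.0, 3.0],
    })
    print(toy.isna().sum())          # NAs per column: a 0, b 1, c 1
    print(toy.dropna(axis=1).shape)  # drop columns with NAs -> (3, 1)
    print(toy.dropna(axis=0).shape)  # drop rows with NAs    -> (1, 3)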
Inspect the error message and answer the following questions! (You are encouraged to use an LLM for this step.)
Is the error related to any other output?
Check out the documentation of the used model here! Do you see a reason why the data might not be suitable for this model?
What needs to be done conceptually in order to solve the error? (Hint: Consider the model type fixed, so it’s not the problem. Think about how to adapt the data in order to fit the model.)
Implement that fix and run the code again! (A sketch of one possible fix follows the error output below, but try it yourself first.)
 1  import pandas as pd
 2  import numpy as np
 3  from sklearn.linear_model import LinearRegression
 4  from sklearn.metrics import mean_absolute_error
 5
 6  data_houses = pd.read_csv(
 7      "exercise_02_train.csv",
 8      index_col="Id")
 9
10  X = data_houses.drop(columns="SalePrice")
11  y = data_houses["SalePrice"]
12
13  numeric_variables = data_houses.select_dtypes(np.number).columns
14
15  print(f"The following variables are numeric:")
16  for i, c in enumerate(numeric_variables):
17      print(f" {i:02d}.", c)
18
19  percent_numeric = numeric_variables.size / X.shape[1] * 100
20  print(f"{percent_numeric:.02f}% of all variables are numeric.")
21  print(f"This means, we lose {100 - percent_numeric:.02f}% if we only keep numeric variables.")
22
23  n_na = X.isna().sum()
24  mask_na = n_na > 0
25  print(n_na[mask_na])
26
27  percent_na = mask_na.sum() / X.shape[1] * 100
28  print(f"{percent_na:.02f}% of all variables contain NAs.")
29  print(f"This means, we lose {percent_na:.02f}% of all variables if we only keep variables without NAs.")
30
31  n_na_samples = X.shape[0] - X.dropna().shape[0]
32  percent_na_samples = n_na_samples / X.shape[0] * 100
33  print(f"{percent_na_samples:.02f}% of all samples contain NAs.")
34  print(f"This means, we lose {percent_na_samples:.02f}% if we drop all samples with NAs.")
35
36  mask_number = [c in numeric_variables for c in X.columns]
37  mask_selected = mask_number & ~mask_na
38  print(X.loc[:,mask_selected].shape[1])
39
40  model = LinearRegression()
41  model.fit(X, y)
42
43  y_pred = model.predict(X)
44  mean_absolute_error(y, y_pred)
The following variables are numeric:
00. MSSubClass
01. LotFrontage
02. LotArea
03. OverallQual
04. OverallCond
05. YearBuilt
06. YearRemodAdd
07. MasVnrArea
08. BsmtFinSF1
09. BsmtFinSF2
10. BsmtUnfSF
11. TotalBsmtSF
12. 1stFlrSF
13. 2ndFlrSF
14. LowQualFinSF
15. GrLivArea
16. BsmtFullBath
17. BsmtHalfBath
18. FullBath
19. HalfBath
20. BedroomAbvGr
21. KitchenAbvGr
22. TotRmsAbvGrd
23. Fireplaces
24. GarageYrBlt
25. GarageCars
26. GarageArea
27. WoodDeckSF
28. OpenPorchSF
29. EnclosedPorch
30. 3SsnPorch
31. ScreenPorch
32. PoolArea
33. MiscVal
34. MoSold
35. YrSold
36. SalePrice
46.84% of all variables are numeric.
This means, we lose 53.16% if we only keep numeric variables.
LotFrontage 259
Alley 1369
MasVnrType 872
MasVnrArea 8
BsmtQual 37
BsmtCond 37
BsmtExposure 38
BsmtFinType1 37
BsmtFinType2 38
Electrical 1
FireplaceQu 690
GarageType 81
GarageYrBlt 81
GarageFinish 81
GarageQual 81
GarageCond 81
PoolQC 1453
Fence 1179
MiscFeature 1406
dtype: int64
24.05% of all variables contain NAs.
This means, we lose 24.05% of all variables if we only keep variables without NAs.
100.00% of all samples contain NAs.
This means, we lose 100.00% if we drop all samples with NAs.
33
/tmp/ipykernel_150157/1370243486.py:37: FutureWarning: Logical ops (and, or, xor) between Pandas objects and dtype-less sequences (e.g. list, tuple) are deprecated and will raise in a future version. Wrap the object in a Series, Index, or np.array before operating instead.
mask_selected = mask_number & ~mask_na
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/tmp/ipykernel_150157/1370243486.py in ?()
37 mask_selected = mask_number & ~mask_na
38 print(X.loc[:,mask_selected].shape[1])
39
40 model = LinearRegression()
---> 41 model.fit(X, y)
42
43 y_pred = model.predict(X)
44 mean_absolute_error(y, y_pred)
~/miniconda3/envs/data-science-ws-24-25/lib/python3.13/site-packages/sklearn/base.py in ?(estimator, *args, **kwargs)
1469 skip_parameter_validation=(
1470 prefer_skip_nested_validation or global_skip_validation
1471 )
1472 ):
-> 1473 return fit_method(estimator, *args, **kwargs)
~/miniconda3/envs/data-science-ws-24-25/lib/python3.13/site-packages/sklearn/linear_model/_base.py in ?(self, X, y, sample_weight)
605 n_jobs_ = self.n_jobs
606
607 accept_sparse = False if self.positive else ["csr", "csc", "coo"]
608
--> 609 X, y = self._validate_data(
610 X,
611 y,
612 accept_sparse=accept_sparse,
~/miniconda3/envs/data-science-ws-24-25/lib/python3.13/site-packages/sklearn/base.py in ?(self, X, y, reset, validate_separately, cast_to_ndarray, **check_params)
646 if "estimator" not in check_y_params:
647 check_y_params = {**default_check_params, **check_y_params}
648 y = check_array(y, input_name="y", **check_y_params)
649 else:
--> 650 X, y = check_X_y(X, y, **check_params)
651 out = X, y
652
653 if not no_val_X and check_params.get("ensure_2d", True):
~/miniconda3/envs/data-science-ws-24-25/lib/python3.13/site-packages/sklearn/utils/validation.py in ?(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_writeable, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
1297 raise ValueError(
1298 f"{estimator_name} requires y to be passed, but the target y is None"
1299 )
1300
-> 1301 X = check_array(
1302 X,
1303 accept_sparse=accept_sparse,
1304 accept_large_sparse=accept_large_sparse,
~/miniconda3/envs/data-science-ws-24-25/lib/python3.13/site-packages/sklearn/utils/validation.py in ?(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_writeable, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
1009 )
1010 array = xp.astype(array, dtype, copy=False)
1011 else:
1012 array = _asarray_with_order(array, order=order, dtype=dtype, xp=xp)
-> 1013 except ComplexWarning as complex_warning:
1014 raise ValueError(
1015 "Complex data not supported\n{}\n".format(array)
1016 ) from complex_warning
~/miniconda3/envs/data-science-ws-24-25/lib/python3.13/site-packages/sklearn/utils/_array_api.py in ?(array, dtype, order, copy, xp, device)
741 # Use NumPy API to support order
742 if copy is True:
743 array = numpy.array(array, order=order, dtype=dtype)
744 else:
--> 745 array = numpy.asarray(array, order=order, dtype=dtype)
746
747 # At this point array is a NumPy ndarray. We convert it to an array
748 # container that is consistent with the input's namespace.
~/miniconda3/envs/data-science-ws-24-25/lib/python3.13/site-packages/pandas/core/generic.py in ?(self, dtype, copy)
2149 def __array__(
2150 self, dtype: npt.DTypeLike | None = None, copy: bool_t | None = None
2151 ) -> np.ndarray:
2152 values = self._values
-> 2153 arr = np.asarray(values, dtype=dtype)
2154 if (
2155 astype_is_view(values.dtype, arr.dtype)
2156 and using_copy_on_write()
ValueError: could not convert string to float: 'RL'
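For reference, here is one possible sketch of the fix, reusing the variables mask_number, mask_na, X, and y from the cell above: restrict the data to the numeric, NA-free columns selected in lines 36-38 before fitting. Wrapping the list in np.array also resolves the FutureWarning about combining a list with a Series. This is only one option; encoding the categorical variables and imputing the NAs would be valid alternatives.

    # Combine the masks as array/Series to avoid the FutureWarning, then
    # keep only the numeric columns that contain no NAs.
    mask_selected = np.array(mask_number) & ~mask_na
    X_selected = X.loc[:, mask_selected]

    model = LinearRegression()
    model.fit(X_selected, y)

    y_pred = model.predict(X_selected)
    print(mean_absolute_error(y, y_pred))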