import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams["figure.figsize"] = (10, 6)
In this exercise, we will try different regression models on points drawn from three different analytical functions. We will see that the overall quality of the models heavily depends on the shape of the functions.
With that in mind, let's define our three analytical functions.
def f1(x):
    return x * np.sin(x) + 2*x

def f2(x):
    return 10 * np.sin(x) + x**2

def f3(x):
    return np.sign(x) * (300 + x**2) + 20 * np.sin(x)
As usual, we define a couple of convenience functions to generate and plot our data.
def generate_X_y(f):
    tr = 20
    n_samples = 100
    X = np.linspace(-tr, tr, n_samples)
    y = f(X)
    return X, y
def plot_f(X, y, title):
    LW = 4
    fig, ax = plt.subplots()
    ax.plot(X, y, color='cornflowerblue', linewidth=.5*LW, label="ground truth")
    fig.suptitle(title)
    #ax.scatter(X_train, y_train, color='navy', s=30, marker='o', label="training points")
for f in [f1, f2, f3]:
    X, y = generate_X_y(f)
    plot_f(X, y, f)
f1 has two main components: a sine wave whose amplitude grows with |x| (due to the factor x), plus an additive linear term that gives a non-zero slope to the curve.
f2 is a sine wave with fixed amplitude superimposed on a parabolic trend.
f3 also combines a sine wave with a quadratic term, but it differs from f2. The quadratic part changes concavity (i.e. there is an inflection point) at zero, due to the sign factor. Moreover, the term sign(x) * 300 produces a jump discontinuity at zero.
The shape of the functions tells us that f2 and f3 diverge for both large positive and large negative values of x, so they lend themselves to approximation with polynomial regressors. However, f3 has a discontinuity at 0, which can make the approximation harder. Even though an ordinary least squares regressor would capture the linear trend present in f1, none of the three functions can be approximated satisfactorily by a plain linear regressor for large values of x.
Back to programming, notice how functions behave as plain objects in Python. It is not unusual to assign them to variables or, as in this case, collect them in a list. Here, at each loop iteration, the f variable points to a different function in memory and is invoked as a callable object. Also note that the printed representation of f contains the name of the assigned function and the memory address where its code is stored.
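As a minimal, self-contained illustration of this behavior (the list name here is chosen just for the example):
# functions are first-class objects: they can be stored in variables and lists,
# passed as arguments, and called later like any other callable
my_funcs = [f1, f2, f3]
for g in my_funcs:
    print(g)        # e.g. "<function f1 at 0x...>": name + memory address
    print(g(2.0))   # the object is invoked as a regular function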
Let's now fit our regression models. To do so, we define a function to create the training and test points given an analytical function; a scikit-learn Pipeline (or model) will then be applied to them. For the sake of readability, we inspect one analytical function at a time.
from sklearn.model_selection import train_test_split
def generate_train_test(f, X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                         train_size=30,
                                                         random_state=42,
                                                         shuffle=True)
    # sort the test points by x to ease plotting
    y_test = y_test[X_test.argsort()]
    X_test.sort()
    return X_train, X_test, y_train, y_test
Note that here the cardinalities of the training and test sets are reversed with respect to the usual setting. Typically, 70% of the dataset is kept as training set and the remaining 30% is used to test the model.
In the following cells, we test different regression algorithms and evaluate the regression error through two metrics: the MSE and the R2 score. To this end, we define a convenience function that, given an analytical function, a dataset, and a model, returns the values of the two metrics. To better understand the quality of each regressor, the function also plots the predicted values against the real values drawn from the original curve.
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
def evaluate_model(f, X, y, model, model_name):
    X_train, X_test, y_train, y_test = generate_train_test(f, X, y)
    # plot the real function and the training points
    LW = 2
    fig, ax = plt.subplots()
    ax.plot(X, y, color='cornflowerblue', linewidth=.5*LW, label="ground truth")
    ax.scatter(X_train, y_train, color='navy', s=30, marker='o', label="training points")
    # predict the test points and plot them onto the chart
    model.fit(X_train.reshape(-1, 1), y_train)
    y_hat = model.predict(X_test.reshape(-1, 1))
    ax.plot(X_test, y_hat, linewidth=LW, label=model_name, color='r')
    fig.suptitle(f"{f} approximated by {model_name}")
    fig.legend()
    return mean_squared_error(y_test, y_hat), r2_score(y_test, y_hat)
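As a quick reminder of what the two metrics measure, here is a minimal sketch (on made-up numbers, just for illustration) that computes them by hand and checks the results against scikit-learn:
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# MSE: average of the squared residuals
mse_manual = np.mean((y_true - y_pred) ** 2)
# R2: 1 - (residual sum of squares) / (total sum of squares)
r2_manual = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mse_manual, mean_squared_error(y_true, y_pred))  # the two values should match
print(r2_manual, r2_score(y_true, y_pred))             # same here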
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.preprocessing import FunctionTransformer, PolynomialFeatures
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
The performance of each regressor is collected and displayed with the PrettyTable library.
from prettytable import PrettyTable
degree = 5
models = [
    LinearRegression(),
    Ridge(random_state=42),
    MLPRegressor(hidden_layer_sizes=(10,), random_state=42, max_iter=10000),
    MLPRegressor(hidden_layer_sizes=(10, 10), activation='tanh', solver='lbfgs',
                 alpha=0.000, batch_size='auto', learning_rate='constant',
                 learning_rate_init=0.01, power_t=0.5, max_iter=10000, shuffle=True,
                 random_state=42, tol=0.0001, verbose=True, warm_start=False,
                 momentum=0.0, nesterovs_momentum=False, early_stopping=False,
                 validation_fraction=0.0, beta_1=0.9, beta_2=0.999, epsilon=1e-08),
    SVR(gamma='scale'),
    RandomForestRegressor(n_estimators=300),
    make_pipeline(
        make_column_transformer(
            (FunctionTransformer(np.sin), [0]),
            (PolynomialFeatures(degree), [0])
        ),
        LinearRegression()
    ),
    make_pipeline(
        make_column_transformer(
            (FunctionTransformer(np.sin), [0]),
            (PolynomialFeatures(degree), [0])
        ),
        Ridge(alpha=1)
    )
]
names = [
    'linreg',
    'ridge',
    'mlp_standard',
    'mlp_tuned',
    'svr',
    'rf',
    f'sin+poly{degree}+linreg',
    f'sin+poly{degree}+ridge'
]
Before the actual simulation, let's spend a few comments on the code above. We generated a list of models to be tested, along with the respective textual names used in the title of each figure. The single-stage models are: an ordinary least squares regressor (LinearRegression), a Ridge regressor, two multi-layer perceptrons (a standard one and a manually tuned one), a support vector regressor (SVR), and a Random Forest regressor.
We also adopted two composite pipelines. Each of them has two steps: a column transformer that augments the input with sin(x) and with polynomial features up to the fifth degree, followed by a linear model (LinearRegression in one pipeline, Ridge in the other).
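To see what this preprocessing stage actually feeds to the final linear model, here is a minimal sketch (the degree is lowered to 2 only for readability; this cell is illustrative and not part of the pipelines above) that applies the same kind of column transformer to two sample points:
# the transformer maps x -> [sin(x), 1, x, x^2]: the sine feature plus the polynomial features
demo_ct = make_column_transformer(
    (FunctionTransformer(np.sin), [0]),
    (PolynomialFeatures(2), [0])
)
demo_X = np.array([[0.0], [np.pi / 2]])
print(demo_ct.fit_transform(demo_X))
# expected rows (approximately): [0, 1, 0, 0] and [1, 1, 1.57, 2.47]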
t = PrettyTable()
t.field_names = ['model', 'MSE', 'R2']
X, y = generate_X_y(f1)
for model, name in zip(models, names):
    mse, r2 = evaluate_model(f1, X, y, model, name)
    t.add_row([name, mse, r2])
print(t)
We see that LinearRegression and Ridge achieve comparable results. SVR fails at modeling the correct shape of the curve, the standard MLP converges to an approximation of the linear trend, while the tuned MLP shows a more complex pattern, although an abnormal spike near zero probably hurts its performance. The Random Forest regressor achieves the best performance in terms of both MSE and R2. Finally, it is worth noting that adding sin(x) and the polynomial features up to the fifth degree slightly worsens the performance of LinearRegression and Ridge. Knowing the real function shape, we can find a graphical explanation for this. For both regressors, the predictions (the red curve) are able to follow a sinusoidal pattern (which we might have expected given the introduction of sin(x)). Nonetheless, for negative values of x, this curve is out of phase with the real one, increasing the overall error.
In many real tasks, we typically assume that the measurements of the predictive variables carry some sort of noise. To reflect this aspect in our synthetic data, we can inject it manually.
def inject_noise(x):
    """Add random noise drawn from a normal distribution."""
    return x + np.random.normal(0, 50, size=x.size)
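Note that inject_noise, as written, produces different noise at every run. If you prefer reproducible experiments, a seeded variant is possible (a sketch, not required for the rest of the exercise):
def inject_noise_seeded(x, seed=42):
    """Same as inject_noise, but with a fixed seed for reproducibility."""
    rng = np.random.default_rng(seed)
    return x + rng.normal(0, 50, size=x.size)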
X, y = generate_X_y(f1)
y = inject_noise(y)
t = PrettyTable()
t.field_names = ['model', 'MSE', 'R2']
for model, name in zip(models, names):
    mse, r2 = evaluate_model(f1, X, y, model, name)
    t.add_row([name, mse, r2])
print(t)
This kind of noise (normal, standard deviation 50) makes the problem much harder. The initial shape of the function is lost and the performance worsens for every regressor.
Note: the reported warning concerns the MLP. Even with 10000 iterations, it is not able to converge to stable values for the weights of the network.
t = PrettyTable()
t.field_names = ['model', 'MSE', 'R2']
X, y = generate_X_y(f2)
for model, name in zip(models, names):
    mse, r2 = evaluate_model(f2, X, y, model, name)
    t.add_row([name, mse, r2])
print(t)
The crucial result here concerns the LinearRegression and Ridge models. Given the highly non-linear behavior of f2, without proper preprocessing they perform the worst. However, the extra expressiveness obtained by including the sinusoidal component, along with the polynomial features, drastically improves the performance. The best performing model here is sin+poly5+ridge, whose predictions follow the true values almost exactly. This is a clear example of how accurate preprocessing becomes crucial, especially for linear models.
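As a side note, the same preprocessing can be wrapped around any of the other regressors; a possible starting point for the exercise suggested below could be the following sketch (SVR is used here only as an example):
# wrap another regressor in the same sin + polynomial preprocessing
svr_preproc = make_pipeline(
    make_column_transformer(
        (FunctionTransformer(np.sin), [0]),
        (PolynomialFeatures(degree), [0])
    ),
    SVR(gamma='scale')
)
# it can then be evaluated exactly like the other models, e.g.:
# evaluate_model(f2, X, y, svr_preproc, f'sin+poly{degree}+svr')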
As a further exercise, we suggest testing the performance of the other regressors with the same preprocessing used to improve the linear models, starting for instance from the sketch above. Let's now see what happens when we add Gaussian noise.
X, y = generate_X_y(f2)
y = inject_noise(y)
t = PrettyTable()
t.field_names = ['model', 'MSE', 'R2']
for model, name in zip(models, names):
    mse, r2 = evaluate_model(f2, X, y, model, name)
    t.add_row([name, mse, r2])
print(t)
The overall shape of the function is retained by the standard MLP and by sin+poly5+[linreg|ridge]. This leads to the lowest overall errors.
t = PrettyTable()
t.field_names = ['model', 'MSE', 'R2']
X, y = generate_X_y(f3)
for model, name in zip(models, names):
    mse, r2 = evaluate_model(f3, X, y, model, name)
    t.add_row([name, mse, r2])
print(t)
The top performing algorithm is again Random Forest. We can see that it is the only model able to handle the discontinuity at 0: predictions for values approaching 0 from the left (small negative values) stay close to the real values of the left-most branch of the function, and the same applies to small positive values.
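This is consistent with how tree-based models work: their predictions are piecewise constant, so a jump in the target is just another split point. A minimal sketch with a single decision tree (not part of the model list above) makes this visible:
from sklearn.tree import DecisionTreeRegressor

X3, y3 = generate_X_y(f3)
tree = DecisionTreeRegressor(max_depth=4, random_state=42)
tree.fit(X3.reshape(-1, 1), y3)
# predictions just left and right of zero follow the jump of the function
print(tree.predict([[-0.5], [0.5]]))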
X, y = generate_X_y(f3)
y = inject_noise(y)
t = PrettyTable()
t.field_names = ['model', 'MSE', 'R2']
for model, name in zip(models, names):
    mse, r2 = evaluate_model(f3, X, y, model, name)
    t.add_row([name, mse, r2])
print(t)
The most robust to noise is Random Forest in this case.
In this exercise, you will carry out a multivariate regression analysis. Technically speaking, the preprocessing step added to the pipeline in Exercise 1 already led to a multivariate analysis, since the newly generated features were also considered. Now, we generate a synthetic, multi-dimensional dataset using scikit-learn. The nature and the importance of each feature can be fine-tuned through the parameters of the make_regression function.
from sklearn.datasets import make_regression
We can use the make_regression function to generate a synthetic dataset with 2000 points. You should spend enough time inspecting the function parameters. For now, we recall that, by default, it generates 100 features, of which only 10 are informative (i.e. actually used to build the target through a linear combination), and that no noise is added to the target (noise=0.0).
X, y = make_regression(n_samples=2000, random_state=42)
X.shape, y.shape
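If you want to inspect the ground-truth linear model used to generate the target, make_regression can also return its coefficients through the coef=True parameter; a quick sketch:
# with coef=True the function also returns the coefficients of the underlying linear model;
# only the informative features get a non-zero coefficient
X_demo, y_demo, coef = make_regression(n_samples=2000, random_state=42, coef=True)
print((coef != 0).sum(), "informative features")  # 10 by default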
We can now run again the regression simulation developed in Exercise 1.
Note: here we get back to the usual setting where 70% of the dataset is used as training set and the remaining 30% as test set. This can be achieved by simply changing the value of the train_size parameter.
t = PrettyTable()
t.field_names = ['model', 'MSE', 'R2']
for model, name in zip(models, names):
    X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                         train_size=.7,
                                                         random_state=42,
                                                         shuffle=True)
    model.fit(X_train, y_train)
    y_hat = model.predict(X_test)
    mse = mean_squared_error(y_test, y_hat)
    r2 = r2_score(y_test, y_hat)
    t.add_row([name, mse, r2])
print(t)
The tested models behave differently. Since the target label is generated by make_regression from a random linear model, we find that linear models, and models that can closely approximate a linear model (e.g. the standard MLP with one hidden layer), perform better. The R2 of the unregularized linear model is 1, meaning that it captures all the information carried by the 10 informative features (and the MSE is approximately 0).
Even though these results suggest that the problem, as defined, is simple, we notice that more complex models fail to capture this simplicity, leading to high errors. Over-parameterized models strongly suffer from the redundancy introduced by the 90 non-informative features.
Let's now inspect how the performance of our models changes when Gaussian noise is injected into the target and, afterwards, when the number of informative features increases.
X, y = make_regression(n_samples=2000, random_state=42, noise=10)
t = PrettyTable()
t.field_names = ['model', 'MSE', 'R2']
for model, name in zip(models, names):
    X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                         train_size=.7,
                                                         random_state=42,
                                                         shuffle=True)
    model.fit(X_train, y_train)
    y_hat = model.predict(X_test)
    mse = mean_squared_error(y_test, y_hat)
    r2 = r2_score(y_test, y_hat)
    t.add_row([name, mse, r2])
print(t)
As we might have expected, the introduction of random Gaussian noise with a standard deviation of 10 decreases the performance of every model.
Let's now introduce more informative features. Keep in mind, however, that these features are still combined linearly to generate the target value: we can expect the close-to-linear models to be the top performers again.
X, y = make_regression(n_samples=2000, random_state=42, noise=10,
                       n_informative=70)
t = PrettyTable()
t.field_names = ['model', 'MSE', 'R2']
for model, name in zip(models, names):
    X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                         train_size=.7,
                                                         random_state=42,
                                                         shuffle=True)
    model.fit(X_train, y_train)
    y_hat = model.predict(X_test)
    mse = mean_squared_error(y_test, y_hat)
    r2 = r2_score(y_test, y_hat)
    t.add_row([name, mse, r2])
print(t)
Comparing this table with the previous one (with 10 informative features), we can notice two different changes. While LinearRegression, Ridge, and the standard MLP improved their performance (in terms of both MSE and R2), the other, more complex regressors performed significantly worse. We can conclude that increasing the number of informative features not only brings more information (and lowers the redundancy), but also makes the problem harder, and the latter factor has the strongest impact on the tested models.
In the second part of the laboratory, we will cover the topic of time series forecasting, working on a dataset of temperatures collected during the Second World War. Specifically, we will focus on forecasting the temperature one or more days ahead of the current time, based on the history of available temperatures.
To quickly load and inspect the dataset, we can use pandas.
import pandas as pd
df = pd.read_csv('weatherww2/SummaryofWeather.csv')
We can inspect the content of the dataset with the info method.
df.info()
Please refer to the laboratory text to discover more on the attributes.
The dataset is composed of 119040 rows. From the output of info, it is clear that most of the 31 attributes have missing values. 9 of them are completely null (e.g. FT, FB, etc.). This is commonplace in real-life tasks, where different sampling procedures are used over time, measurements can fail, and so on.
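A quick way to quantify the missing values per column (and to spot the completely empty ones) is the following sketch, not strictly needed for the rest of the analysis:
null_counts = df.isnull().sum().sort_values(ascending=False)
print(null_counts.head(10))                               # attributes with the most missing values
print((null_counts == len(df)).sum(), "columns are completely null")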
The information about each sensor is reported in a second file. Let's read it.
sensors = pd.read_csv("weatherww2/WeatherStationLocations.csv")
sensors.info()
The latter file contains useful information to characterize our sensors. Specifically, the Latitude and Longitude attributes can be used to group sensors that are close on the map. As we can see, many attributes in both files are numerical, floating point. The normalization of these attributes is a matter specific to the task at hand; let's skip it for now, we will discuss it later.
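For example, assuming the Latitude and Longitude columns contain decimal degrees, sensors could be coarsely grouped by rounding their coordinates; this is only a sketch of one possible approach and is not used in the rest of the laboratory:
# coarse spatial grouping: sensors falling in the same 1-degree cell end up together
sensors["lat_cell"] = sensors["Latitude"].round()
sensors["lon_cell"] = sensors["Longitude"].round()
print(sensors.groupby(["lat_cell", "lon_cell"]).size().sort_values(ascending=False).head())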
To find the sensors with the most temperature readings, we can use pandas. Note: this is not mandatory, every step in the following cells can be implemented in plain Python.
df.groupby("STA").size().sort_values(0, ascending=False).head(10)
Grouping the rows by sensor id (STA), counting the size of each group, and, finally, sorting them gives us the list of STAs that show up the most.
Transforming features containing dates into Datetime objects is usually a good idea. In the meantime, we can also set a new index for our dataset.
df["Date"] = pd.to_datetime(df["Date"])
df = df.set_index("Date")
We can now keep only the readings of Sensor 22508 and plot the series of MeanTemps.
mtemps = df[df["STA"] == 22508]["MeanTemp"]
mtemps.head(5)
fig, ax = plt.subplots()
_ = ax.plot(mtemps.index.values, mtemps.values)
# or equivalently
# mtemps.plot(ax=ax)
This plot tells us that:
There are many strategies to address the task of forecasting a time series. Transforming the series into a structured representation (i.e. a set of records sharing some predictive features) enables the use of classical machine learning models. We also know that these models are trained to learn the function mapping the predictive features to a desired value. Thus, to complete the design of a forecasting framework, we need to carefully select a target variable.
ML practitioners often use the following rolling-window strategy: each record contains W consecutive values of the series as predictive features, and the value immediately following the window as target.
W = 3
X = list()
y = list()
for i in range(mtemps.size - W):  # range: [0, mtemps.size - W - 1]
    X.append(mtemps.iloc[i:i + W].values)  # the W past values form one feature row
    y.append(mtemps.iloc[i + W])           # the value right after the window is the target
# transform the structured representation into numpy arrays
X = np.array(X)
y = np.array(y)
X.shape, y.shape
We can check whether the rolling window has worked properly by inspecting a few samples.
mtemps[:5]
X[:3,:]
y[:3]
Given the record associated with position t, X[t] contains the values of the series at positions t, ..., t+W-1, and y[t] is the value at position t+W (i.e. mtemps[t+W]), the day right after the window, which is what we want.
Let's now use the values from 1940 to 1944 as training data and test the model on the remaining year. We can leverage pandas to filter data based on time strings. To do so, we convert our arrays into a pandas DataFrame and Series, reusing the DatetimeIndex from mtemps.
X_df = pd.DataFrame(X, index=mtemps.index[:mtemps.size - W],
columns=["t0", "t1", "t2"])
X_df.head()
y_s = pd.Series(y, index=mtemps.index[:mtemps.size - W])
y_s.head()
X_train, y_train = X_df.loc["1940":"1944"], y_s.loc["1940":"1944"]
X_train.index
Keep in mind that we used a forward-looking convention for our windows: the record indexed by day t contains the values of the series between t and t+W-1. Under these conditions, the last records of each year contain information from the first days of the next year. This is not a problem as long as we slightly shift the beginning of our test set. Specifically, we should start the test set from the Wth day of the year.
Including information from what is considered the test set in the training procedure is a bad habit known as data leakage. You should always stop and take a moment to check whether any data leakage is happening within your pipeline.
from datetime import date
initial_day = date(1944, 12, 31) + pd.Timedelta(f"{W} days")
initial_day
X_test, y_test = X_df.loc[initial_day:], y_s.loc[initial_day:]
X_test.shape
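As a quick sanity check against leakage, we can verify that the last training window ends before the first test day (a small check based on the variables defined above):
# the last training window starts at X_train.index.max() and spans W days,
# so its last value falls W-1 days later; it must precede the first test day
last_covered_day = X_train.index.max() + pd.Timedelta(days=W - 1)
assert last_covered_day < pd.Timestamp(initial_day)
last_covered_day, initial_day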
Note how neat the indexing of pandas objects with Datetime indices is. You are allowed to specify a numpy-like interval with dates expressed as datetime.date objects or strings. Of course, more complex and useful indexing is possible, but it goes beyond the scope of this laboratory.
Since we still have the definitions of models and names in memory (this is one of the beauties and curses of Jupyter notebooks), we can recycle them for our current task.
t = PrettyTable()
t.field_names = ['model', 'MSE', 'R2']
for model, name in zip(models, names):
    model.fit(X_train, y_train)
    y_hat = model.predict(X_test)
    mse = mean_squared_error(y_test, y_hat)
    r2 = r2_score(y_test, y_hat)
    t.add_row([name, mse, r2])
print(t)
Even in this case, the models whose behavior is closer to linear perform better. This could imply that some sort of linear relationship exists between the past W values of the series and the target.
Additional comments can be made:
Thanks to matplotlib, we can plot our predictions on top of the real series for the test year. For simplicity, we inspect the best performing model, the Ridge regressor.
model = Ridge(random_state=42)
model.fit(X_train, y_train)
y_hat = model.predict(X_test)
y_hat = pd.Series(y_hat, index=y_test.index)
error = y_test - y_hat
fig, ax = plt.subplots(nrows=2, ncols=1, sharex=True, gridspec_kw={'height_ratios': [3, 1]})
ax[0].plot(y_test, label="Ground truth")
ax[0].plot(y_hat, label="Predicted")
ax[0].set_ylabel("Temperature (C)")
ax[0].legend()
ax[0].grid()
ax[1].plot(error.abs())
ax[1].set_ylabel("|error|")
ax[1].axhline(error.abs().mean(), color="purple", label = "Average |error|")
ax[1].legend()
ax[1].grid()
f"The average |error| is: {error.abs().mean():.2f} +- {error.abs().std():.2f} degrees Celsius"
The figure shows that Ridge provides a good approximation, but its predictions are somewhat lagged: the model takes at least one day to react to the current trend, which induces a non-zero error. In the bottom panel of the figure, the plot of the residuals shows that the larger the day-to-day change in temperature (i.e. the higher the volatility among consecutive days), the larger the error due to this lag.
Even though these results suggest that the values of the series alone are not sufficient to build a robust forecasting model (e.g. we may think that other, more expressive features are needed), the average absolute error is 0.65 and its standard deviation is 0.59 degrees Celsius, which may be acceptable values for our domain after all.
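To put these numbers in perspective, a useful reference is the naive persistence baseline that simply predicts yesterday's temperature (the last column of the window, t2); a quick sketch:
# persistence baseline: tomorrow's temperature = today's temperature
naive_hat = X_test["t2"]
naive_error = (y_test - naive_hat).abs()
print(f"Naive |error|: {naive_error.mean():.2f} +- {naive_error.std():.2f} degrees Celsius")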
Predicting further than one day ahead is not possible with the current setup, but many strategies exist to extend the forecast horizon. For example, one can train different models on different target variables: one encoding the value one day ahead (i.e. what we have done here), one with the value two days ahead as target, and so on. In this experimental setting, you end up with one model per horizon step, which adds some computational cost.
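A sketch of how the targets could be built for a generic horizon of h days follows (h and the variable names are illustrative):
h = 3                                             # forecast horizon in days (illustrative)
X_h, y_h = [], []
for i in range(mtemps.size - W - h + 1):
    X_h.append(mtemps.iloc[i:i + W].values)       # same W-day window as before
    y_h.append(mtemps.iloc[i + W + h - 1])        # value h days after the end of the window
X_h, y_h = np.array(X_h), np.array(y_h)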
As a further exercise, we leave it to you to tweak this pipeline to test and visually assess the performance of the remaining regressors.