Semester Project - Machine Learning

Project made by Nicolas Gregori SUPSI - 2021

Dataset description

The current dataset contains parameters about weather in Szeged, a city located in Hungary, between 2006 and 2016. It contains the following columns:

  • Formatted Date
  • Summary (weather forecast by hour)
  • Precip Type
  • Temperature (in Celsius Degrees)
  • Apparent Temperature (in Celsius Degrees) - It is the temperature perceived by humans
  • Humidity
  • Wind Speed (in km/h)
  • Wind Bearing (degrees)
  • Visibility (km)
  • Loud Cover
  • Pressure (in millibars)
  • Daily Summer (weather forecast by day)

All the data contained into the dataset were taken every hour. Also, there are some columns which summarize how was the weather at a precise hour.

Available at: Weather in Szeged.

Preliminar operations

Making sure not losing the job done

In [1]:
%autosave 25
Autosaving every 25 seconds

Load the cell below to import all dependencies needed to run the notebook

In [80]:
#Various imports
import numpy as np
import pandas as pd

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet

from sklearn.neighbors import KNeighborsClassifier

from sklearn.svm import SVC

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

from sklearn.metrics import mean_squared_error
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve
from sklearn.metrics import auc

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [81]:
#Defining portion for test pahse
testPortion = 0.4

First step: we have to import the dataset into the notebook and save it into a Dataframe. Then, we show the first rows just to get a first view of it.

In [82]:
#Load the dataset from memory 
dfPath = "./resources/weatherHistory.csv"
resPath = "./results/"
df = pd.read_csv(dfPath)
In [83]:
#Print dataset's head      
df.head()
Out[83]:
Formatted Date Summary Precip Type Temperature (C) Apparent Temperature (C) Humidity Wind Speed (km/h) Wind Bearing (degrees) Visibility (km) Loud Cover Pressure (millibars) Daily Summary
0 2006-04-01 00:00:00.000 +0200 Partly Cloudy rain 9.472222 7.388889 0.89 14.1197 251.0 15.8263 0.0 1015.13 Partly cloudy throughout the day.
1 2006-04-01 01:00:00.000 +0200 Partly Cloudy rain 9.355556 7.227778 0.86 14.2646 259.0 15.8263 0.0 1015.63 Partly cloudy throughout the day.
2 2006-04-01 02:00:00.000 +0200 Mostly Cloudy rain 9.377778 9.377778 0.89 3.9284 204.0 14.9569 0.0 1015.94 Partly cloudy throughout the day.
3 2006-04-01 03:00:00.000 +0200 Partly Cloudy rain 8.288889 5.944444 0.83 14.1036 269.0 15.8263 0.0 1016.41 Partly cloudy throughout the day.
4 2006-04-01 04:00:00.000 +0200 Mostly Cloudy rain 8.755556 6.977778 0.83 11.0446 259.0 15.8263 0.0 1016.51 Partly cloudy throughout the day.
In [84]:
#Printing features' information
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96453 entries, 0 to 96452
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Formatted Date            96453 non-null  object 
 1   Summary                   96453 non-null  object 
 2   Precip Type               95936 non-null  object 
 3   Temperature (C)           96453 non-null  float64
 4   Apparent Temperature (C)  96453 non-null  float64
 5   Humidity                  96453 non-null  float64
 6   Wind Speed (km/h)         96453 non-null  float64
 7   Wind Bearing (degrees)    96453 non-null  float64
 8   Visibility (km)           96453 non-null  float64
 9   Loud Cover                96453 non-null  float64
 10  Pressure (millibars)      96453 non-null  float64
 11  Daily Summary             96453 non-null  object 
dtypes: float64(8), object(4)
memory usage: 8.8+ MB
In [85]:
#Getting rows and columns' number
df.shape
Out[85]:
(96453, 12)
In [86]:
#Printing some feature's information
df.describe()
Out[86]:
Temperature (C) Apparent Temperature (C) Humidity Wind Speed (km/h) Wind Bearing (degrees) Visibility (km) Loud Cover Pressure (millibars)
count 96453.000000 96453.000000 96453.000000 96453.000000 96453.000000 96453.000000 96453.0 96453.000000
mean 11.932678 10.855029 0.734899 10.810640 187.509232 10.347325 0.0 1003.235956
std 9.551546 10.696847 0.195473 6.913571 107.383428 4.192123 0.0 116.969906
min -21.822222 -27.716667 0.000000 0.000000 0.000000 0.000000 0.0 0.000000
25% 4.688889 2.311111 0.600000 5.828200 116.000000 8.339800 0.0 1011.900000
50% 12.000000 12.000000 0.780000 9.965900 180.000000 10.046400 0.0 1016.450000
75% 18.838889 18.838889 0.890000 14.135800 290.000000 14.812000 0.0 1021.090000
max 39.905556 39.344444 1.000000 63.852600 359.000000 16.100000 0.0 1046.380000

Prepocessing phase

Before starting doing some analyses, it is important to prepare the dataset during the preprocessing phase. For example, we can delete irrlevant columns (more details below) and transform discrete features into a more adaptable format:

There are some columns which are particulary useless. For instance, Loud Cover data are full of 0's and that makes this column so irrelevant. The same principle is valid for the string(Daily Summary and Summary) that tells how it's the weather every hour. \

In [87]:
df = df.drop(['Daily Summary', 'Summary', 'Loud Cover'], axis = 1)
df = df.rename(columns = {"Temperature (C)":"Temperature",
                          "Wind Speed (km/h)":"Wind Speed",
                           "Apparent Temperature (C)":"Apparent Temperature",
                            "Visibility (km)":"Visibility",
                            "Wind Bearing (degrees)":"Wind Bearing",
                            "Pressure (millibars)":"Pressure"})
In [88]:
df.columns
Out[88]:
Index(['Formatted Date', 'Precip Type', 'Temperature', 'Apparent Temperature',
       'Humidity', 'Wind Speed', 'Wind Bearing', 'Visibility', 'Pressure'],
      dtype='object')

Now, we are going to take a quick look about data distribution and how features are correlated with each other. Seeing the table below we can start thinking what kinds of analyses could be interesting to make.

In [89]:
#Print data distribution
df.hist(bins = 50, figsize = (10,25))
plt.show()
In [90]:
df.corr()
Out[90]:
Temperature Apparent Temperature Humidity Wind Speed Wind Bearing Visibility Pressure
Temperature 1.000000 0.992629 -0.632255 0.008957 0.029988 0.392847 -0.005447
Apparent Temperature 0.992629 1.000000 -0.602571 -0.056650 0.029031 0.381718 -0.000219
Humidity -0.632255 -0.602571 1.000000 -0.224951 0.000735 -0.369173 0.005454
Wind Speed 0.008957 -0.056650 -0.224951 1.000000 0.103822 0.100749 -0.049263
Wind Bearing 0.029988 0.029031 0.000735 0.103822 1.000000 0.047594 -0.011651
Visibility 0.392847 0.381718 -0.369173 0.100749 0.047594 1.000000 0.059818
Pressure -0.005447 -0.000219 0.005454 -0.049263 -0.011651 0.059818 1.000000

In order to avoid to have nan, we convert this values in column Precip Type into other

In [91]:
df = df.replace(np.nan, 'other', regex=True)
set(df["Precip Type"].values)
Out[91]:
{'other', 'rain', 'snow'}

For further utility we print the box plot of the precitapion types taken hourly

In [92]:
sns.boxplot(x =  df["Humidity"],y = df["Precip Type"])
plt.title("PrecipatioN Type Box Plot")
plt.show()
In [93]:
#Optional: parse the date into a proper format and extract the respective Year,Month,Day and HouR
df['Formatted Date'] = pd.to_datetime(df['Formatted Date'], format='%Y-%m-%d %H:%M:%S.%f %z') 
df['Year'] = df['Formatted Date'].apply(lambda x: x.year)
df['Month'] = df['Formatted Date'].apply(lambda x: x.month)
df['Day'] = df['Formatted Date'].apply(lambda x: x.day)
df['Hour'] = df['Formatted Date'].apply(lambda x: x.hour)
In [94]:
df
Out[94]:
Formatted Date Precip Type Temperature Apparent Temperature Humidity Wind Speed Wind Bearing Visibility Pressure Year Month Day Hour
0 2006-04-01 00:00:00+02:00 rain 9.472222 7.388889 0.89 14.1197 251.0 15.8263 1015.13 2006 4 1 0
1 2006-04-01 01:00:00+02:00 rain 9.355556 7.227778 0.86 14.2646 259.0 15.8263 1015.63 2006 4 1 1
2 2006-04-01 02:00:00+02:00 rain 9.377778 9.377778 0.89 3.9284 204.0 14.9569 1015.94 2006 4 1 2
3 2006-04-01 03:00:00+02:00 rain 8.288889 5.944444 0.83 14.1036 269.0 15.8263 1016.41 2006 4 1 3
4 2006-04-01 04:00:00+02:00 rain 8.755556 6.977778 0.83 11.0446 259.0 15.8263 1016.51 2006 4 1 4
... ... ... ... ... ... ... ... ... ... ... ... ... ...
96448 2016-09-09 19:00:00+02:00 rain 26.016667 26.016667 0.43 10.9963 31.0 16.1000 1014.36 2016 9 9 19
96449 2016-09-09 20:00:00+02:00 rain 24.583333 24.583333 0.48 10.0947 20.0 15.5526 1015.16 2016 9 9 20
96450 2016-09-09 21:00:00+02:00 rain 22.038889 22.038889 0.56 8.9838 30.0 16.1000 1015.66 2016 9 9 21
96451 2016-09-09 22:00:00+02:00 rain 21.522222 21.522222 0.60 10.5294 20.0 16.1000 1015.95 2016 9 9 22
96452 2016-09-09 23:00:00+02:00 rain 20.438889 20.438889 0.61 5.8765 39.0 15.5204 1016.16 2016 9 9 23

96453 rows × 13 columns

The regressions and classificatons done below were done using scikit-learn, a powerful library in Python used for Machine Learnin and Statistics. Before see the results that I have obtained, it's necessary to split the dataset into two differents:

  • Train set: used to train all classifiers and regressors;
  • Test set: used to tests all classifers and regressors prepared;
In [95]:
#Splitting the dataset into train, and test dataset
[dfTrain,dfTest] = train_test_split(df.drop(['Formatted Date','Year','Month','Day'],axis=1),random_state=1234,test_size=testPortion) 

Predicting Temperature basing on Humidity

First of all let's start doing a simple Linear Regression. Our aim is to minimize as soon as possible the RMSE (Residual Minimum Squared Error) computed as the mean of the differences between real and predicted values. Let's find out if Temperature and Humidity are correllated each other.

In [96]:
xTrain = dfTrain["Humidity"].values
xTest  = dfTest["Humidity"].values
yTrain = dfTrain["Temperature"].values
yTest  = dfTest["Temperature"].values

xTrain = np.reshape(xTrain,(-1,1))
xTest = np.reshape(xTest,(-1,1))
In [97]:
linReg = LinearRegression()
linReg.fit(xTrain,yTrain)
Out[97]:
LinearRegression()
In [98]:
#Print reg params
print(f"Intercepts:  {linReg.intercept_}")
print(f"Coefficients:  {linReg.coef_}")
Intercepts:  34.671419460194286
Coefficients:  [-30.96347094]
In [99]:
yTestPredicted = linReg.predict(xTest)
plt.scatter(xTrain,yTrain)
plt.xlabel("Humidity")
plt.ylabel("Temperature")
plt.title("Correlation between Temperature And Humidity")
plt.plot(xTest,yTestPredicted,color="orange")
plt.show()
In [100]:
print("RMSE test set: ",np.sqrt(mean_squared_error(yTest,yTestPredicted)))
print("R^2 score: ",linReg.score(xTest,yTest))
RMSE test set:  7.393057970127397
R^2 score:  0.3990707303279296

The $R^2$, is the proportion of the variance in the dependent variable that is predictable from the independent variable(s). It is a statist that tell us the quality of a regressor

The chart above shows the best "line" that approxes all the points. Unfortunatley we have a score of 40%, so it is not a good regression. It seems that Humidity pass perfectly trough the points, but this is an illusion because the chart is fitted with a lot of points.

Predicting Apparent Temperature every hour

The apparent temperature is the temperature perceived by humans, caused by the combined effect of air temperature, relative temperature and wind speed. Mostly, it is applied on outdoor perceived temperature. Is it that these features have a great correlation? \ Let's find out.

In [101]:
#Take data needed
xTrain = dfTrain[["Temperature","Humidity","Wind Speed"]].values
yTrain = dfTrain["Apparent Temperature"].values

xTest = dfTest[["Temperature","Humidity","Wind Speed"]].values
yTest = dfTest["Apparent Temperature"].values
In [102]:
#Prepare Linear Regressior - Printig the best values found
linReg = LinearRegression()
linReg.fit(xTrain,yTrain)
print(f"Intercepts:  {linReg.intercept_}")
print(f"Coefficients:  {linReg.coef_}")
Intercepts:  -2.2940583182423957
Coefficients:  [ 1.12583139  1.01884332 -0.09549998]
In [103]:
#Checking how regressior behaviours with dfTest
yTrainPredicted = linReg.predict(xTrain)
RMSETrain = np.sqrt(mean_squared_error(yTrain,yTrainPredicted))
print(f"RMSE train set: {RMSETrain}")
RMSE train set: 1.0793621432981388
In [104]:
yTestPredicted = linReg.predict(xTest)
In [105]:
#Show errors distribution
errors = np.abs(yTestPredicted - yTest)
plt.figure()
plt.title("Error distribution - Apparent Temperature")
plt.hist(x = errors, bins = 50)
plt.show()
In [106]:
#Show trend between real and predicted (based on a sample of 1500 units)
plt.figure(figsize=(14, 4))
plt.title("Comparison between real and estimate data - Apparent Temperature")
plt.plot(yTest[0:1500], label='Real')
plt.plot(yTestPredicted[0:1500], label='Prediction')
plt.legend(bbox_to_anchor=(1.02, 1), loc='upper left', borderaxespad=0.)
plt.show()
In [107]:
#Try to predict "real" data
RMSETest = np.sqrt(mean_squared_error(yTest,yTestPredicted))
R2Test = linReg.score(xTest,yTest)
print(f"RMSE score test: {RMSETest}")
print(f"R2 score test: {R2Test}")
RMSE score test: 1.080129952197552
R2 score test: 0.989762960917948

In conclusion, it's a evidence that they have a great correlation. The regression is abled to explain the 98% of all instances. So, there is no reason to try improving its quality through other algorithms. We can also save result in a txt file executing the cell below.

In [108]:
#Save prediction's data
np.savetxt(resPath + "Apparent Temperature - Predictions",yTestPredicted)

Predicting Atmospheric Pressure - Ridge - Lasso - Elastic Net Regression

Let's try predicting Pressure using all relevant columns. This predictions will be done using Ridge and Lasso regression.

In [109]:
targetColumn = ["Pressure"]
predictors = list(set(list(df.drop(['Formatted Date','Month','Day','Year','Hour','Precip Type'],axis = 1))) 
                  - set(targetColumn))
alpha = [1e-15, 1e-10, 1e-8, 1e-4, 1e-3,1e-2, 1, 5, 10]
params = {'alpha': alpha}

Xs = dfTrain[predictors].values
Ys = dfTrain[targetColumn].values

xTest = dfTest[predictors].values
yTest = dfTest["Pressure"].values

We define the number of folds used for GridSearchCV . In thsi case we define cv as 5.

In [110]:
cv = 5

Ridge Regression

In [111]:
ridge = Ridge()
ridgeReg = GridSearchCV(ridge,params,scoring='neg_mean_squared_error', cv = cv, verbose=1)
ridgeReg.fit(Xs,Ys)
Fitting 5 folds for each of 9 candidates, totalling 45 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  45 out of  45 | elapsed:    0.3s finished
Out[111]:
GridSearchCV(cv=5, estimator=Ridge(),
             param_grid={'alpha': [1e-15, 1e-10, 1e-08, 0.0001, 0.001, 0.01, 1,
                                   5, 10]},
             scoring='neg_mean_squared_error', verbose=1)
In [112]:
ridgeReg.best_params_
Out[112]:
{'alpha': 10}
In [113]:
ridgeReg.best_score_
Out[113]:
-13568.295301085522
In [114]:
yTestRidgePredicted = ridgeReg.predict(xTest)
In [115]:
ridgeError = np.sqrt(mean_squared_error(yTest,yTestRidgePredicted))
print(f"RMSE score: {ridgeError}")
RMSE score: 116.52281301076783

Lasso Regression

In [116]:
lasso = Lasso()
lassoReg = GridSearchCV(lasso,params,scoring='neg_mean_squared_error', cv= cv, verbose=1)
lassoReg.fit(Xs,Ys)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
Fitting 5 folds for each of 9 candidates, totalling 45 fits
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 313693457.65529686, tolerance: 63213.63056903937
  model = cd_fast.enet_coordinate_descent(
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 305260137.2216757, tolerance: 61499.71079985987
  model = cd_fast.enet_coordinate_descent(
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 305807027.0577796, tolerance: 61608.30682170687
  model = cd_fast.enet_coordinate_descent(
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 322570573.0321096, tolerance: 65023.634564212574
  model = cd_fast.enet_coordinate_descent(
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 322542149.860265, tolerance: 65029.05605171954
  model = cd_fast.enet_coordinate_descent(
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 313692937.5171009, tolerance: 63213.63056903937
  model = cd_fast.enet_coordinate_descent(
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 305259378.6380357, tolerance: 61499.71079985987
  model = cd_fast.enet_coordinate_descent(
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 305806541.47857183, tolerance: 61608.30682170687
  model = cd_fast.enet_coordinate_descent(
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 322570020.2077561, tolerance: 65023.634564212574
  model = cd_fast.enet_coordinate_descent(
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 322541640.02402234, tolerance: 65029.05605171954
  model = cd_fast.enet_coordinate_descent(
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 313641449.7231718, tolerance: 63213.63056903937
  model = cd_fast.enet_coordinate_descent(
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 305184292.09993696, tolerance: 61499.71079985987
  model = cd_fast.enet_coordinate_descent(
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 305758474.3800298, tolerance: 61608.30682170687
  model = cd_fast.enet_coordinate_descent(
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 322515297.0827149, tolerance: 65023.634564212574
  model = cd_fast.enet_coordinate_descent(
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 322491171.71375155, tolerance: 65029.05605171954
  model = cd_fast.enet_coordinate_descent(
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 93754922.98975086, tolerance: 63213.63056903937
  model = cd_fast.enet_coordinate_descent(
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 60685320.313750684, tolerance: 61499.71079985987
  model = cd_fast.enet_coordinate_descent(
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 95012053.87698549, tolerance: 61608.30682170687
  model = cd_fast.enet_coordinate_descent(
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 93537005.57787573, tolerance: 65023.634564212574
  model = cd_fast.enet_coordinate_descent(
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 100613177.09292424, tolerance: 65029.05605171954
  model = cd_fast.enet_coordinate_descent(
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 3625481.6959115267, tolerance: 63213.63056903937
  model = cd_fast.enet_coordinate_descent(
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 1686625.6875605583, tolerance: 61499.71079985987
  model = cd_fast.enet_coordinate_descent(
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 3817420.8826272488, tolerance: 61608.30682170687
  model = cd_fast.enet_coordinate_descent(
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 3513698.7626856565, tolerance: 65023.634564212574
  model = cd_fast.enet_coordinate_descent(
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 4058286.691865444, tolerance: 65029.05605171954
  model = cd_fast.enet_coordinate_descent(
[Parallel(n_jobs=1)]: Done  45 out of  45 | elapsed:    5.3s finished
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 392482046.6146549, tolerance: 79094.23538738962
  model = cd_fast.enet_coordinate_descent(
Out[116]:
GridSearchCV(cv=5, estimator=Lasso(),
             param_grid={'alpha': [1e-15, 1e-10, 1e-08, 0.0001, 0.001, 0.01, 1,
                                   5, 10]},
             scoring='neg_mean_squared_error', verbose=1)
In [117]:
lassoReg.best_params_
Out[117]:
{'alpha': 1e-15}
In [118]:
lassoReg.best_score_
Out[118]:
-13568.297158463954
In [119]:
yTestLassoPredicted = lassoReg.predict(xTest)
In [120]:
lassoError = np.sqrt(mean_squared_error(yTest,yTestLassoPredicted))
print(f"RMSE score: {lassoError}")
RMSE score: 116.52288548521467

We can observe that Ridge and Lasso via GridSearchCV obtain a sligthly different RMSE score (Lasso regression has an alpha less than Ridge Regression)

Elastic Net Regression

The Elastic Net is a regularized regression method that linearly combines the L1 and L2 penalties of the lasso and ridge methods

In [121]:
elasticNet = ElasticNet()
elasticReg = GridSearchCV(elasticNet,params,scoring="neg_mean_squared_error",cv = cv, verbose = 1)
elasticReg.fit(Xs,Ys)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
Fitting 5 folds for each of 9 candidates, totalling 45 fits
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 313693457.65789765, tolerance: 63213.63056903937
  model = cd_fast.enet_coordinate_descent(
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 305260137.2254687, tolerance: 61499.71079985987
  model = cd_fast.enet_coordinate_descent(
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 305807027.0602075, tolerance: 61608.30682170687
  model = cd_fast.enet_coordinate_descent(
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 322570573.0348738, tolerance: 65023.634564212574
  model = cd_fast.enet_coordinate_descent(
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 322542149.8628141, tolerance: 65029.05605171954
  model = cd_fast.enet_coordinate_descent(
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 313693197.58870393, tolerance: 63213.63056903937
  model = cd_fast.enet_coordinate_descent(
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 305259757.933345, tolerance: 61499.71079985987
  model = cd_fast.enet_coordinate_descent(
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 305806784.2705388, tolerance: 61608.30682170687
  model = cd_fast.enet_coordinate_descent(
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 322570296.6225694, tolerance: 65023.634564212574
  model = cd_fast.enet_coordinate_descent(
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 322541894.9446226, tolerance: 65029.05605171954
  model = cd_fast.enet_coordinate_descent(
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 313667452.07725215, tolerance: 63213.63056903937
  model = cd_fast.enet_coordinate_descent(
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 305222211.1293457, tolerance: 61499.71079985987
  model = cd_fast.enet_coordinate_descent(
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 305782749.2796595, tolerance: 61608.30682170687
  model = cd_fast.enet_coordinate_descent(
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 322542933.28453505, tolerance: 65023.634564212574
  model = cd_fast.enet_coordinate_descent(
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 322516659.2824716, tolerance: 65029.05605171954
  model = cd_fast.enet_coordinate_descent(
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 156609997.02134788, tolerance: 63213.63056903937
  model = cd_fast.enet_coordinate_descent(
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 115961821.24015948, tolerance: 61499.71079985987
  model = cd_fast.enet_coordinate_descent(
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 156535685.60830584, tolerance: 61608.30682170687
  model = cd_fast.enet_coordinate_descent(
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 157916152.493072, tolerance: 65023.634564212574
  model = cd_fast.enet_coordinate_descent(
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 165528876.38652667, tolerance: 65029.05605171954
  model = cd_fast.enet_coordinate_descent(
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 11511458.52150929, tolerance: 63213.63056903937
  model = cd_fast.enet_coordinate_descent(
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 5676645.8526751995, tolerance: 61499.71079985987
  model = cd_fast.enet_coordinate_descent(
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 12021615.496261, tolerance: 61608.30682170687
  model = cd_fast.enet_coordinate_descent(
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 11224531.624630332, tolerance: 65023.634564212574
  model = cd_fast.enet_coordinate_descent(
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 12774393.30235976, tolerance: 65029.05605171954
  model = cd_fast.enet_coordinate_descent(
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 125946.169013381, tolerance: 63213.63056903937
  model = cd_fast.enet_coordinate_descent(
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 132452.03838336468, tolerance: 61608.30682170687
  model = cd_fast.enet_coordinate_descent(
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 121895.25952863693, tolerance: 65023.634564212574
  model = cd_fast.enet_coordinate_descent(
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 141215.33273756504, tolerance: 65029.05605171954
  model = cd_fast.enet_coordinate_descent(
[Parallel(n_jobs=1)]: Done  45 out of  45 | elapsed:    4.9s finished
C:\Users\Notebook\anaconda3\envs\Data Science\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 13136456.808402538, tolerance: 79094.23538738962
  model = cd_fast.enet_coordinate_descent(
Out[121]:
GridSearchCV(cv=5, estimator=ElasticNet(),
             param_grid={'alpha': [1e-15, 1e-10, 1e-08, 0.0001, 0.001, 0.01, 1,
                                   5, 10]},
             scoring='neg_mean_squared_error', verbose=1)
In [122]:
elasticReg.best_params_
Out[122]:
{'alpha': 0.001}
In [123]:
elasticReg.best_score_
Out[123]:
-13568.293241403746
In [124]:
yTestElasticNetPredicted = elasticReg.predict(xTest)
In [125]:
#Print ElasticNet error
elasticError = np.sqrt(mean_squared_error(yTest,yTestElasticNetPredicted))
print(f"RMSE score: {elasticError}")
RMSE score: 116.52265910153984

Via ElasticNet regolarization we obtain a small alpha than Lasso, but bigger than Ridge

Predicting Precip type basing on Humidity - K NeighborsClassifier

We start with classification analysis. Classification is a supervised learning concept which basically categorizes a set of data into classes.

In [126]:
#Drop Precip Type columns with other value cause we have few samples tho predict
df = df.drop(df[df["Precip Type"] == "other"].index)
set(df["Precip Type"])
Out[126]:
{'rain', 'snow'}
In [127]:
#Divide again the dataset int train and test
dfTrain,dfTest = train_test_split(df,random_state=1234,test_size= testPortion)
In [128]:
xTrain = dfTrain["Humidity"].values
yTrain = dfTrain["Precip Type"].values
xTest = dfTest["Humidity"].values
yTest = dfTest["Precip Type"].values
In [129]:
#Normalize Humidity values
scaler = StandardScaler()

xTrain = np.reshape(xTrain,(-1,1))
xTest = np.reshape(xTest,(-1,1))

scaler.fit(xTrain)
xTrainScaled = scaler.transform(xTrain)
xTestScaled = scaler.transform(xTest)
In [130]:
weatherClassifier =  KNeighborsClassifier(n_neighbors = 3)
weatherClassifier.fit(xTrainScaled,yTrain)
Out[130]:
KNeighborsClassifier(n_neighbors=3)
In [131]:
yWeatherPredicted = weatherClassifier.predict(xTest)

Looking the confusion matrix we can see that we have an high prediction about rain category. Instead snow category is not predicted so good cause we haven't got enough samples.

Now, we want to see another statistic: the ROC curve. An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters:

True Positive Rate
False Positive Rate

But before, via LabelEncoder we have to trasform the categoric variable(Precip Type) into discrete values.

In [132]:
#Printing some statistic like Confusion Matrix and a Classification Report
cf = confusion_matrix(yTest,yWeatherPredicted)
print("Confusion matrix: ")
print(cf)
print(classification_report(yTest,yWeatherPredicted))
print(f"Accuracy: {weatherClassifier.score(xTestScaled,yTest)}")
Confusion matrix: 
[[25616  8435]
 [ 3358   966]]
              precision    recall  f1-score   support

        rain       0.88      0.75      0.81     34051
        snow       0.10      0.22      0.14      4324

    accuracy                           0.69     38375
   macro avg       0.49      0.49      0.48     38375
weighted avg       0.80      0.69      0.74     38375

Accuracy: 0.8252508143322476
In [133]:
sns.heatmap(cf, annot = True)
Out[133]:
<AxesSubplot:>
In [134]:
le = LabelEncoder()
yTestDiscreteValues = le.fit_transform(yTest)
yTestDiscreteValues
Out[134]:
array([0, 0, 0, ..., 1, 0, 0])

After the conversion done above, the variabile becomes binary with the following meanings:

  • 0 stands for rain
  • 1 stands for snow

Referring also to ROC curve it is alsto interesting look the AUC. Instead, AUC stands for "Area under the ROC Curve." That is, AUC measures the entire two-dimensional area underneath the entire ROC curve (think integral calculus) from (0,0) to (1,1).

In [135]:
#Compute ROC and AUC
probs =  weatherClassifier.predict_proba(xTest)
preeds = probs[:,1]
fpr,tpr,threshold = roc_curve(yTestDiscreteValues,preeds)

rocAuc = auc(fpr,tpr)
In [136]:
#Plot ROC curve
plt.title('Receiver Operating Characteristic with K =3')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % rocAuc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

The chart shows that the relaiabilty of the classifier is around 50%.\ Now we try to predict the best case, unless if there would be another one. We start computing the classifier error for each K until 34

In [137]:
#CELL A BIT SLOW TO RUN
errorsClassifiers = list()
for i in range (1,35):
    weatherClassifier =  KNeighborsClassifier(n_neighbors=i)
    weatherClassifier.fit(xTrainScaled,yTrain)
    yPredict = weatherClassifier.predict(xTest)
    errorsClassifiers.append(np.mean(yTest != yPredict))
In [138]:
#Plot error classifier  for each K
plt.figure(figsize =(14,6))
plt.plot(range(1,35), errorsClassifiers, color = "green", linestyle="dashed", marker="o", 
         markerfacecolor='blue', markersize=9)
plt.title("Error Rate K Value")
plt.xlabel("K value")
plt.ylabel("K mean error")
Out[138]:
Text(0, 0.5, 'K mean error')

Observing the chart we can figure out that the best K is 12, but anyway it is not a good classifier. So, now we want trying improve our classificator using SVM.

Predicting Precip type basing on Humidity - SVM

At last, we want to improve the analyse done before using SVM (support vector machine). They are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis.

Before, we define a functions that compute the bet iperparamters using GridSearchCV. To make a comparison we want to use two different kernels:

  • Linear
  • RBF

Instead, an other function plot a confusion matrix for each SVM that we use.

.Execute the cell to load the two functions

In [139]:
def compute_SVM(classfier,params,n_folds,Xs,Ys):
    gscv = GridSearchCV(classfier,params,cv = n_folds)
    gscv.fit(Xs,Ys)
    print("Params combination: \n",gscv.cv_results_['params'])
    print('Avg accuracy per combination:\n', gscv.cv_results_['mean_test_score'])
    print('Best combination:\n', gscv.best_params_)
    print('Avg accuracy of best combination: {}'.format(gscv.best_score_))
    return gscv
In [140]:
def print_confusion_matrix(classifier,yTest,yTestPredicted):
    cf = confusion_matrix(yTest,yTestPredicted)
    print(f"Confusion matrix:")
    print(cf)
    sns.heatmap(cf,annot = True)
In [141]:
#Prepare SVC's 
n_folds = 3
clsLinear = SVC()
clsRBF = SVC()

paramGridLinearSVC = [{'kernel': ['linear'], 'C': [1, 5, 10]}]
paramGridRBFSVC =  [{'kernel': ['rbf'], 'C': [1, 5, 10], 'gamma': [0.1, 0.01]}]

Linear Clustering

In [142]:
#CELL A BIT SLOW TO RUN
gscvLinear = compute_SVM(clsLinear,paramGridLinearSVC,n_folds,xTrain,yTrain)
Params combination: 
 [{'C': 1, 'kernel': 'linear'}, {'C': 5, 'kernel': 'linear'}, {'C': 10, 'kernel': 'linear'}]
Avg accuracy per combination:
 [0.88902208 0.88902208 0.88902208]
Best combination:
 {'C': 1, 'kernel': 'linear'}
Avg accuracy of best combination: 0.889022080922847
In [143]:
yTestDiscreteValues = np.reshape(yTestDiscreteValues,(-1,1))
yLinearPredicted = gscvLinear.predict(yTestDiscreteValues)
In [144]:
print_confusion_matrix(gscvLinear,yTest,yLinearPredicted)
Confusion matrix:
[[34051     0]
 [ 4324     0]]

RBF Clustering

In [145]:
#CELL A BIT SLOW TO RUN 
gscvRBF = compute_SVM(clsRBF,paramGridRBFSVC,n_folds,xTrain,yTrain)
Params combination: 
 [{'C': 1, 'gamma': 0.1, 'kernel': 'rbf'}, {'C': 1, 'gamma': 0.01, 'kernel': 'rbf'}, {'C': 5, 'gamma': 0.1, 'kernel': 'rbf'}, {'C': 5, 'gamma': 0.01, 'kernel': 'rbf'}, {'C': 10, 'gamma': 0.1, 'kernel': 'rbf'}, {'C': 10, 'gamma': 0.01, 'kernel': 'rbf'}]
Avg accuracy per combination:
 [0.88902208 0.88902208 0.88902208 0.88902208 0.88902208 0.88902208]
Best combination:
 {'C': 1, 'gamma': 0.1, 'kernel': 'rbf'}
Avg accuracy of best combination: 0.889022080922847
In [146]:
yRBFPredicted = gscvRBF.predict(xTest)
In [147]:
print_confusion_matrix(gscvLinear,yTest,yRBFPredicted)
Confusion matrix:
[[34051     0]
 [ 4324     0]]

SVM is ony able to predict rain Precip Type (not snow).

In conclusion, we can say that using SVM we are able to increase the accuracy of our regressor of 89%, instead 82% (value obtained with KNeighbors), but obvsiously accuracy is not a reliable statistic in the best case because it's much better to consider how many values of a determined class will be predicted correctly.