Project made by Nicolas Gregori SUPSI - 2021
The dataset contains weather parameters for Szeged, a city in Hungary, recorded between 2006 and 2016. It contains the following columns:
All the data in the dataset were recorded hourly. There are also some columns that summarize what the weather was like at a given hour.
Available at: Weather in Szeged.
Making sure not to lose the work done
%autosave 25
Run the cell below to import all the dependencies needed to run the notebook
#Various imports
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve
from sklearn.metrics import auc
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
#Define the portion of the dataset reserved for the test phase
testPortion = 0.4
First step: import the dataset into the notebook and save it into a DataFrame. Then we show the first rows just to get an initial view of it.
#Load the dataset from memory
dfPath = "./resources/weatherHistory.csv"
resPath = "./results/"
df = pd.read_csv(dfPath)
#Print dataset's head
df.head()
#Printing features' information
df.info()
#Get the number of rows and columns
df.shape
#Print descriptive statistics of the numeric features
df.describe()
Before starting any analysis, it is important to prepare the dataset in a preprocessing phase. For example, we can delete irrelevant columns (more details below) and transform discrete features into a more suitable format:
Some columns are particularly useless. For instance, Loud Cover is full of 0's, which makes the column irrelevant. The same principle holds for the string columns (Daily Summary and Summary) that describe what the weather is like every hour.
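As a quick, optional check (not part of the original flow), we can look at the value counts of Loud Cover before dropping it; the column is expected to contain essentially a single value:
#Optional check: Loud Cover should contain (almost) only 0's
df["Loud Cover"].value_counts()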
df = df.drop(['Daily Summary', 'Summary', 'Loud Cover'], axis = 1)
df = df.rename(columns = {"Temperature (C)":"Temperature",
"Wind Speed (km/h)":"Wind Speed",
"Apparent Temperature (C)":"Apparent Temperature",
"Visibility (km)":"Visibility",
"Wind Bearing (degrees)":"Wind Bearing",
"Pressure (millibars)":"Pressure"})
df.columns
Now we take a quick look at the data distribution and at how the features are correlated with each other. From the table below we can start thinking about which analyses could be interesting to carry out.
#Print data distribution
df.hist(bins = 50, figsize = (10,25))
plt.show()
df.corr()
To avoid NaN values, we convert the missing values in the Precip Type column to 'other'.
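As an optional check (not in the original notebook), we can count how many Precip Type values are missing before the replacement:
#Optional check: count missing Precip Type values before replacing them
df["Precip Type"].isna().sum()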
df = df.replace(np.nan, 'other', regex=True)
set(df["Precip Type"].values)
For further reference, we plot a box plot of the hourly precipitation types against humidity.
sns.boxplot(x = df["Humidity"],y = df["Precip Type"])
plt.title("PrecipatioN Type Box Plot")
plt.show()
#Optional: parse the date into a proper format and extract the respective Year, Month, Day and Hour
df['Formatted Date'] = pd.to_datetime(df['Formatted Date'], format='%Y-%m-%d %H:%M:%S.%f %z')
df['Year'] = df['Formatted Date'].apply(lambda x: x.year)
df['Month'] = df['Formatted Date'].apply(lambda x: x.month)
df['Day'] = df['Formatted Date'].apply(lambda x: x.day)
df['Hour'] = df['Formatted Date'].apply(lambda x: x.hour)
df
The regressions and classifications below were done using scikit-learn, a powerful Python library for machine learning and statistics. Before seeing the results I obtained, it is necessary to split the dataset into two different sets:
#Split the dataset into train and test sets
[dfTrain,dfTest] = train_test_split(df.drop(['Formatted Date','Year','Month','Day'],axis=1),random_state=1234,test_size=testPortion)
First of all, let's start with a simple linear regression. Our aim is to minimize the RMSE (Root Mean Squared Error), computed as the square root of the mean of the squared differences between real and predicted values. Let's find out whether Temperature and Humidity are correlated with each other.
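To make the definition concrete, here is a minimal, illustrative sketch of the RMSE computed by hand on toy arrays (these values are made up and are not part of the dataset):
#Illustrative only: RMSE on toy data, following the definition above
yTrue = np.array([10.0, 12.0, 15.0])
yPred = np.array([11.0, 11.5, 14.0])
print(np.sqrt(np.mean((yTrue - yPred) ** 2)))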
xTrain = dfTrain["Humidity"].values
xTest = dfTest["Humidity"].values
yTrain = dfTrain["Temperature"].values
yTest = dfTest["Temperature"].values
xTrain = np.reshape(xTrain,(-1,1))
xTest = np.reshape(xTest,(-1,1))
linReg = LinearRegression()
linReg.fit(xTrain,yTrain)
#Print reg params
print(f"Intercepts: {linReg.intercept_}")
print(f"Coefficients: {linReg.coef_}")
yTestPredicted = linReg.predict(xTest)
plt.scatter(xTrain,yTrain)
plt.xlabel("Humidity")
plt.ylabel("Temperature")
plt.title("Correlation between Temperature And Humidity")
plt.plot(xTest,yTestPredicted,color="orange")
plt.show()
print("RMSE test set: ",np.sqrt(mean_squared_error(yTest,yTestPredicted)))
print("R^2 score: ",linReg.score(xTest,yTest))
The $R^2$ is the proportion of the variance in the dependent variable that is predictable from the independent variable(s). It is a statistic that tells us the quality of a regressor.
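As a redundant check (not part of the original analysis), the same $R^2$ can be recomputed by hand from the test predictions above as $1 - SS_{res}/SS_{tot}$:
#Illustrative: recompute R^2 by hand, it should match linReg.score(xTest, yTest)
ssRes = np.sum((yTest - yTestPredicted) ** 2)
ssTot = np.sum((yTest - np.mean(yTest)) ** 2)
print(1 - ssRes / ssTot)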
The chart above shows the best "line" approximating all the points. Unfortunately we have a score of about 40%, so it is not a good regression. The line may seem to pass nicely through the points, but this is an illusion due to the large number of points plotted.
The apparent temperature is the temperature perceived by humans, resulting from the combined effect of air temperature, relative humidity and wind speed. It is mostly applied to the outdoor perceived temperature. Do these features have a strong correlation? Let's find out.
#Take data needed
xTrain = dfTrain[["Temperature","Humidity","Wind Speed"]].values
yTrain = dfTrain["Apparent Temperature"].values
xTest = dfTest[["Temperature","Humidity","Wind Speed"]].values
yTest = dfTest["Apparent Temperature"].values
#Prepare the linear regressor - print the fitted parameters
linReg = LinearRegression()
linReg.fit(xTrain,yTrain)
print(f"Intercepts: {linReg.intercept_}")
print(f"Coefficients: {linReg.coef_}")
#Check how the regressor behaves on the test set
yTrainPredicted = linReg.predict(xTrain)
RMSETrain = np.sqrt(mean_squared_error(yTrain,yTrainPredicted))
print(f"RMSE train set: {RMSETrain}")
yTestPredicted = linReg.predict(xTest)
#Show errors distribution
errors = np.abs(yTestPredicted - yTest)
plt.figure()
plt.title("Error distribution - Apparent Temperature")
plt.hist(x = errors, bins = 50)
plt.show()
#Show trend between real and predicted (based on a sample of 1500 units)
plt.figure(figsize=(14, 4))
plt.title("Comparison between real and estimate data - Apparent Temperature")
plt.plot(yTest[0:1500], label='Real')
plt.plot(yTestPredicted[0:1500], label='Prediction')
plt.legend(bbox_to_anchor=(1.02, 1), loc='upper left', borderaxespad=0.)
plt.show()
#Evaluate the predictions on the test set
RMSETest = np.sqrt(mean_squared_error(yTest,yTestPredicted))
R2Test = linReg.score(xTest,yTest)
print(f"RMSE score test: {RMSETest}")
print(f"R2 score test: {R2Test}")
In conclusion, it is evident that these features are strongly correlated. The regression is able to explain about 98% of the variance, so there is no reason to try improving it with other algorithms. We can also save the results to a text file by executing the cell below.
#Save prediction's data
np.savetxt(resPath + "Apparent Temperature - Predictions",yTestPredicted)
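If needed later, the saved predictions can be loaded back; a minimal sketch (assuming the file written by the cell above):
#Illustrative: reload the predictions saved above
reloaded = np.loadtxt(resPath + "Apparent Temperature - Predictions")
reloaded[:5]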
Let's try predicting Pressure using all the relevant columns. These predictions will be made using Ridge and Lasso regression.
targetColumn = ["Pressure"]
predictors = list(set(df.drop(['Formatted Date','Month','Day','Year','Hour','Precip Type'], axis = 1).columns)
                  - set(targetColumn))
alpha = [1e-15, 1e-10, 1e-8, 1e-4, 1e-3,1e-2, 1, 5, 10]
params = {'alpha': alpha}
Xs = dfTrain[predictors].values
Ys = dfTrain[targetColumn].values
xTest = dfTest[predictors].values
yTest = dfTest["Pressure"].values
We define the number of folds used by GridSearchCV. In this case we set cv to 5.
cv = 5
ridge = Ridge()
ridgeReg = GridSearchCV(ridge,params,scoring='neg_mean_squared_error', cv = cv, verbose=1)
ridgeReg.fit(Xs,Ys)
ridgeReg.best_params_
ridgeReg.best_score_
yTestRidgePredicted = ridgeReg.predict(xTest)
ridgeError = np.sqrt(mean_squared_error(yTest,yTestRidgePredicted))
print(f"RMSE score: {ridgeError}")
lasso = Lasso()
lassoReg = GridSearchCV(lasso,params,scoring='neg_mean_squared_error', cv= cv, verbose=1)
lassoReg.fit(Xs,Ys)
lassoReg.best_params_
lassoReg.best_score_
yTestLassoPredicted = lassoReg.predict(xTest)
lassoError = np.sqrt(mean_squared_error(yTest,yTestLassoPredicted))
print(f"RMSE score: {lassoError}")
We can observe that Ridge and Lasso via GridSearchCV obtain slightly different RMSE scores (the Lasso regression selects a smaller alpha than the Ridge regression).
The Elastic Net is a regularized regression method that linearly combines the L1 and L2 penalties of the Lasso and Ridge methods.
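For reference, ElasticNet also exposes an l1_ratio parameter controlling the mix between the L1 and L2 penalties; a sketch of a grid that also searched over it (not used in the cells below) could look like this:
#Illustrative only (not used below): a grid that also searches the L1/L2 mix
paramsElasticSketch = {'alpha': alpha, 'l1_ratio': [0.2, 0.5, 0.8]}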
elasticNet = ElasticNet()
elasticReg = GridSearchCV(elasticNet,params,scoring="neg_mean_squared_error",cv = cv, verbose = 1)
elasticReg.fit(Xs,Ys)
elasticReg.best_params_
elasticReg.best_score_
yTestElasticNetPredicted = elasticReg.predict(xTest)
#Print ElasticNet error
elasticError = np.sqrt(mean_squared_error(yTest,yTestElasticNetPredicted))
print(f"RMSE score: {elasticError}")
With ElasticNet regularization we obtain a smaller alpha than Lasso, but a bigger one than Ridge.
Now we move on to classification analysis. Classification is a supervised learning task that categorizes a set of data into classes.
#Drop rows whose Precip Type is 'other', since there are too few samples to predict that class
df = df.drop(df[df["Precip Type"] == "other"].index)
set(df["Precip Type"])
#Split the dataset again into train and test sets
dfTrain,dfTest = train_test_split(df,random_state=1234,test_size= testPortion)
xTrain = dfTrain["Humidity"].values
yTrain = dfTrain["Precip Type"].values
xTest = dfTest["Humidity"].values
yTest = dfTest["Precip Type"].values
#Standardize the Humidity values (zero mean, unit variance)
scaler = StandardScaler()
xTrain = np.reshape(xTrain,(-1,1))
xTest = np.reshape(xTest,(-1,1))
scaler.fit(xTrain)
xTrainScaled = scaler.transform(xTrain)
xTestScaled = scaler.transform(xTest)
weatherClassifier = KNeighborsClassifier(n_neighbors = 3)
weatherClassifier.fit(xTrainScaled,yTrain)
yWeatherPredicted = weatherClassifier.predict(xTestScaled)
Looking at the confusion matrix, we can see that the rain category is predicted well, while the snow category is not predicted as well because we do not have enough samples.
Now we want to look at another statistic: the ROC curve. An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters (see the sketch right after the list below):
True Positive Rate
False Positive Rate
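As a minimal illustration (with hypothetical counts, not taken from this dataset), the two rates can be computed directly from a binary confusion matrix [[TN, FP], [FN, TP]]:
#Illustrative only: TPR and FPR from hypothetical confusion-matrix counts
tn, fp, fn, tp = 50, 10, 5, 35
print("TPR:", tp / (tp + fn))
print("FPR:", fp / (fp + tn))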
But first, we have to transform the categorical variable (Precip Type) into discrete values via LabelEncoder.
#Print some statistics such as the confusion matrix and a classification report
cf = confusion_matrix(yTest,yWeatherPredicted)
print("Confusion matrix: ")
print(cf)
print(classification_report(yTest,yWeatherPredicted))
print(f"Accuracy: {weatherClassifier.score(xTestScaled,yTest)}")
sns.heatmap(cf, annot = True)
le = LabelEncoder()
yTestDiscreteValues = le.fit_transform(yTest)
yTestDiscreteValues
After the conversion above, the variable becomes binary: LabelEncoder sorts the labels alphabetically, so 'rain' is mapped to 0 and 'snow' to 1.
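To confirm the mapping, we can inspect the fitted encoder (this check was not in the original flow):
#Check the label -> integer mapping produced by LabelEncoder
dict(zip(le.classes_, le.transform(le.classes_)))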
Related to the ROC curve, it is also interesting to look at the AUC. AUC stands for "Area Under the ROC Curve": it measures the entire two-dimensional area underneath the ROC curve (think integral calculus) from (0,0) to (1,1).
#Compute ROC and AUC
probs = weatherClassifier.predict_proba(xTestScaled)
preds = probs[:,1]
fpr,tpr,threshold = roc_curve(yTestDiscreteValues,preds)
rocAuc = auc(fpr,tpr)
#Plot ROC curve
plt.title('Receiver Operating Characteristic with K =3')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % rocAuc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
The chart shows that the reliability of the classifier is around 50%. Now we try to find the best value of K, in case a better one exists. We start by computing the classifier error for each K up to 34.
#CELL A BIT SLOW TO RUN
errorsClassifiers = list()
for i in range(1, 35):
    weatherClassifier = KNeighborsClassifier(n_neighbors = i)
    weatherClassifier.fit(xTrainScaled, yTrain)
    yPredict = weatherClassifier.predict(xTestScaled)
    errorsClassifiers.append(np.mean(yTest != yPredict))
#Plot error classifier for each K
plt.figure(figsize =(14,6))
plt.plot(range(1,35), errorsClassifiers, color = "green", linestyle="dashed", marker="o",
markerfacecolor='blue', markersize=9)
plt.title("Error Rate K Value")
plt.xlabel("K value")
plt.ylabel("K mean error")
Observing the chart we can see that the best K is 12, but it is still not a good classifier. So now we try to improve our classifier using an SVM.
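For completeness, a short sketch (not in the original flow) that refits the classifier with the K suggested by the chart and reports its accuracy on the scaled test set:
#Illustrative follow-up: refit KNN with the chosen K and check test accuracy
bestK = 12
bestClassifier = KNeighborsClassifier(n_neighbors = bestK)
bestClassifier.fit(xTrainScaled, yTrain)
print(f"Accuracy with K = {bestK}: {bestClassifier.score(xTestScaled, yTest)}")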
Finally, we want to improve the previous analysis using SVMs (support vector machines). These are supervised learning models with associated learning algorithms that analyze data for classification and regression.
First, we define a function that computes the best hyperparameters using GridSearchCV. To make a comparison, we want to use two different kernels: linear and RBF.
Another function plots the confusion matrix for each SVM that we use.
Execute the cell below to load the two functions.
def compute_SVM(classifier, params, n_folds, Xs, Ys):
    #Run a grid search over the given parameter grid and report the results
    gscv = GridSearchCV(classifier, params, cv = n_folds)
    gscv.fit(Xs, Ys)
    print("Params combination: \n", gscv.cv_results_['params'])
    print('Avg accuracy per combination:\n', gscv.cv_results_['mean_test_score'])
    print('Best combination:\n', gscv.best_params_)
    print('Avg accuracy of best combination: {}'.format(gscv.best_score_))
    return gscv
def print_confusion_matrix(classifier, yTest, yTestPredicted):
    #Print and plot the confusion matrix for the given predictions
    cf = confusion_matrix(yTest, yTestPredicted)
    print("Confusion matrix:")
    print(cf)
    sns.heatmap(cf, annot = True)
#Prepare SVC's
n_folds = 3
clsLinear = SVC()
clsRBF = SVC()
paramGridLinearSVC = [{'kernel': ['linear'], 'C': [1, 5, 10]}]
paramGridRBFSVC = [{'kernel': ['rbf'], 'C': [1, 5, 10], 'gamma': [0.1, 0.01]}]
#CELL A BIT SLOW TO RUN
gscvLinear = compute_SVM(clsLinear,paramGridLinearSVC,n_folds,xTrain,yTrain)
yLinearPredicted = gscvLinear.predict(xTest)
print_confusion_matrix(gscvLinear,yTest,yLinearPredicted)
#CELL A BIT SLOW TO RUN
gscvRBF = compute_SVM(clsRBF,paramGridRBFSVC,n_folds,xTrain,yTrain)
yRBFPredicted = gscvRBF.predict(xTest)
print_confusion_matrix(gscvRBF,yTest,yRBFPredicted)
The SVM is only able to predict the rain Precip Type (not snow).
In conclusion, we can say that with the SVM we increase the accuracy of our classifier to 89%, compared with the 82% obtained with KNeighbors. However, accuracy alone is not a reliable statistic here: it is much better to consider how many values of a given class are predicted correctly.
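To back this up, a small illustrative follow-up (not in the original notebook) prints the per-class precision and recall for the RBF predictions instead of the plain accuracy:
#Illustrative: per-class precision and recall for the RBF SVM predictions
print(classification_report(yTest, yRBFPredicted))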