Predicting Next Day Rain In Australia

(1) What Is The Business Use Case ?

Predicting next day rain using a dataset containing 10 years of daily weather observations from different location across Australia.

(2) Python Implementation.

(1) Importing Required Library

import matplotlib.pyplot as plt 
import seaborn as sns
import datetime
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from keras.layers import Dense, BatchNormalization, Dropout, LSTM
from keras.model import Sequential
from keras.utils import to_categorical
from keras.optimizer import Adam
from tensorflow.keras import regularizers
from keras import callbacks
from sklearn.metrics import precision_score, recall_score, confusion_matrix, classification_report, accuracy_score, f1_score
np.random.seed(0)

(2) Data Loading

data = pd.read_csv('weatherAUS.csv')
data.head()

The dataset contains about 10 years of daily weather observations from different locations across Australia.
Observations were drawn from numerous weather stations.
In this project, I will use this data to predict whether or not it will rain the next day.
There are 23 attributes including the target variable “RainTomorrow”, indicating whether or not it will rain the next day or not.

(3) Data Information

data.info()

There are missing values in the dataset
Dataset includes numeric and categorical values.

(4) Data Visualization & Cleaning

Steps involved in this section.
Count plot of target column
Correlation amongst numeric attributes
Parse Dates into datetime
Encoding days and months as continuous cyclic features.

First of all let us evaluate the target and find out if our data is imbalanced or not?

cols = ['blue','red']
sns.countplot(x=data['RainTomorrow'], palette=cols)

Correlation amongst numeric attributes

corrmat = data.corr()
cmap = sns.diverging_palette(260,-10,s=50, l=75, n=6, as_cmap=True)
plt.subplots(figsize=(18,18))
sns.heatmap(corrmat,cmap=cmap, annot=True, square=True)

Parsing Dates Into DateTime Object.

My goal is to build an artificial neural network(ANN). I will encode dates appropriately, i.e. I prefer the months and days in a cyclic continuous feature.
A date and time are inherently cyclical.
To let the ANN model know that a feature is cyclical I split it into periodic subsections.
Namely, years, months and days. Now for each subsection, I create two new features, deriving a sine and cosine transformations of the subsection feature.

Exploring the length of date objects

lengths = data['Date'].str.len()
lengths.value_counts()

There doesn’t seem to be any missing values in dates so parsing values into datetime

Converting The Date Object To The DateTime Format:

data['Date'] = pd.to_datetime(date['Date'])

Creating A New Column For Year:

data['year'] = data.Date.dt.year

Function to encode datetime into cyclic parameters.

def encode(data, col, max_val):
    data[col + '_sin'] = np.sin(2 * np.pi * data[col]/max_val)
    data[col + '_cos'] = np.cos(2 * np.pi * data[col]/max_val)
    return data

data['month'] = data.Date.dt.month
data = encode(data, 'month', 12)

data['day'] = data.Date.dt.day
data = encode(data, 'day', 31)

data.head()

Plotting Distribution Of Days Over Years:

section = data[:360] 
tm = section["day"].plot(color="#C2C4E2")
tm.set_title("Distribution Of Days Over Year")
tm.set_ylabel("Days In month")
tm.set_xlabel("Days In Year")

As expected, the “year” attribute of data repeats.
However, in this, the true cyclic nature is not presented in a continuous manner.
Splitting months and days into Sine and cosine combinations provides the cyclical continuous feature.
This can be used as input features to ANN.

cyclic_month = sns.scatterplot(x='month_sin', y='month_cos', data=data, color="#C2C4E2")
cyclic_month.set_title("Cyclic Encoding of Month")
cyclic_month.set_ylabel("Cosine Encoded Months")
cyclic_month.set_xlabel("Sine Encoded Months")

To implement the true nature of the date column we need to split the date column into sine and cosine terms.
We need to combine both the sine and cosine terms to implement the cyclic nature of date and month column.

cyclic_day = sns.scatterplot(x='day_sin',y='day_cos',data=data, color="#C2C4E2")
cyclic_day.set_title("Cyclic Encoding of Day")
cyclic_day.set_ylabel("Cosine Encoded Day")
cyclic_day.set_xlabel("Sine Encoded Day")

Next, I will deal with missing values in categorical and numeric attributes separately

Categorical Variable:

Filling missing value with the mode of the column.
Get the list of categorical variables.

s = (data.dtypes == 'object')
object_cols = list(s[s].index)
print('Categorical Variables:')
print(object_cols)

Missing Values In Categorical Columns:

for i in object_cols:
    print(i, data[i].isnull().sum())

Filling missing values with mode of the column in value:

for i in object_cols:
    data[i].fillna(data[i].mode()[0], inplace=True)

Numerical Variables:

Filling missing value with the median of the column.
Get the list of numeric variables.

t = (data.dtypes == 'float64')
num_cols = list(t[t].index)

print('Numeric Variables:')
print(num_cols)

Missing Values In Numeric Columns:

for i in num_cols:
    print(i, data[i].isnull().sum)

Filling missing values with median of the column in value:

for i in object_cols:
    data[i].fillna(data[i].median()[0], inplace=True)

data.info()

(5) Data Preprocessing

Label encoding columns with categorical data.
Perform the scaling of the features.
Detecting outliers.
Dropping the outliers based on data analysis.

Label Encoding The Categorical Variable:

label_encoder = LabelEncoder()

for i in object_cols:
    data[i] = label_encoder.fit_transform(data[i])
    
data[object_cols]

Standardizing The Numerical Variables:

# Dropping Target & Extra Columns
features = data.drop(['RainTomorrow', 'Date', 'day', 'month'], axis=1)
# Creating The Target Column
target = data['RainTomorrow']

col_names = list(features.columns)
s_scalar = preprocessing.StandardScalar()
features = s_scalar.fit_transform(features)
features = pd.DataFrame(features, columns=col_names)
features.head()

Outlier Detection:

colours = ["#D0DBEE", "#C2C4E2", "#EED4E5", "#D1E6DC", "#BDE2E2"]
plt.figure(figsize=(20,10))

sns.boxenplot(data = features, palette = colours)
plt.xticks(rotations=90)
plt.show()

# Adding The Target Variable
features["RainTomorrow"] = target

# Dropping The Outlier values
features = features[(features["MinTemp"]<2.3)&(features["MinTemp"]>-2.3)]
features = features[(features["MaxTemp"]<2.3)&(features["MaxTemp"]>-2)]
features = features[(features["Rainfall"]<4.5)]
features = features[(features["Evaporation"]<2.8)]
features = features[(features["Sunshine"]<2.1)]
features = features[(features["WindGustSpeed"]<4)&(features["WindGustSpeed"]>-4)]
features = features[(features["WindSpeed9am"]<4)]
features = features[(features["WindSpeed3pm"]<2.5)]
features = features[(features["Humidity9am"]>-3)]
features = features[(features["Humidity3pm"]>-2.2)]
features = features[(features["Pressure9am"]< 2)&(features["Pressure9am"]>-2.7)]
features = features[(features["Pressure3pm"]< 2)&(features["Pressure3pm"]>-2.7)]
features = features[(features["Cloud9am"]<1.8)]
features = features[(features["Cloud3pm"]<2)]
features = features[(features["Temp9am"]<2.3)&(features["Temp9am"]>-2)]
features = features[(features["Temp3pm"]<2.3)&(features["Temp3pm"]>-2)]

features.shape

Looking at the scaled features without outliers

plt.figure(figsize=(20,10))
sns.boxenplot(data=features, palette=colours)
plt.xticks(rotation=90)
plt.show()

Looks Good.
Up next is building an Artificial Neural Network.

(6) Model Building

Assigning ‘X’ and ‘y’ the status of attributes and tags
Splitting test and training sets
Initialising the neural network
Defining by adding layers
Compiling the neural network
Train the neural network

Assigning X and y the status of attributes and tags:

X = features.drop(['RainTomorrow'], axis=1)
y = features['RainTomorrow']

Splitting Test And Training Sets:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 2)

print('X Train:', X_train.shape)
print('X Test:',  X_test.shape)
print('y Train:', y_train.shape)
print('X Test:', y_test.shape)

Defining Early Stopping Method:

early_stopping = callbacks.EarlyStopping(
    min_delta=0.001, # minimium amount of change to count as an improvement
    patience=20, # how many epochs to wait before stopping
    restore_best_weights=True,
)

Initializing the Neural Networks:

model = Sequential()

Defining The Neural Network Layers:

model.add(Dense(units = 32, kernel_initializer = 'uniform', activation = 'relu', input_dim = 26))

model.add(Dense(units = 32, kernel_initializer = 'uniform', activation = 'relu'))

model.add(Dense(units=16, kernel_initializer = 'uniform', activation = 'relu'))

model.add(Dropout(0.25))

model.add(Dense(units = 8, kernel_initializer = 'uniform', activation = 'relu'))

model.add(Dropout(0.5))

model.add(Dense(units=1, kernel_initializer = 'uniform', activation = 'relu'))

Compiling The ANN:

opt = Adam(learning_rate = 0.00009)
model.compile(optimizer = opt, loss = 'binary_crossentropy', metrics = ['accuracy'])

Training ANN Model:

history = model.fit(X_train, y_train, batch_size = 32, epochs = 150, callbacks = [early_stopping], validation_split = 0.2)

Plotting Training & Validation Loss Over Epochs:

history_df = pd.DataFrame(history.history)

plt.plot(history_df.loc[:, ['loss']], "#BDE2E2", label='Training loss')
plt.plot(history_df.loc[:, ['val_loss']],"#C2C4E2", label='Validation loss')
plt.title('Training and Validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend(loc="best")

plt.show()

Plotting Training And Validation Accuracy Over Epochs:

history_df = pd.DataFrame(history.history)

plt.plot(history_df.loc[:, ['accuracy']], "#BDE2E2", label='Training accuracy')
plt.plot(history_df.loc[:, ['val_accuracy']], "#C2C4E2", label='Validation accuracy')

plt.title('Training and Validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

(7) Conclusion:

Testing on the test set.
Evaluating the confusion matrix.
Evaluating the classification report.

Predicting The Test Set Result:

y_pred = model.predict(X_test)
y_pred = (y_pred > 0.5)

Predicting Confusion Matrix:

cmap1 = sns.diverging_palette(260,-10,s=50, l=75, n=5, as_cmap=True)
plt.subplots(figsize=(12,8))
cf_matrix = confusion_matrix(y_test, y_pred)
sns.heatmap(cf_matrix/np.sum(cf_matrix), cmap = cmap1, annot = True, annot_kws = {'size':15})

Predicting Classification Report

print(classification_report(y_test, y_pred))

Australian Rain Prediction.

Predicting Next Day Rain In Australia

Table Of Contents:

(1) What Is The Business Use Case ?

(2) Python Implementation.

(1) Importing Required Library

(2) Data Loading

(3) Data Information

(4) Data Visualization & Cleaning

First of all let us evaluate the target and find out if our data is imbalanced or not?

Correlation amongst numeric attributes

Parsing Dates Into DateTime Object.

Exploring the length of date objects

Converting The Date Object To The DateTime Format:

Creating A New Column For Year:

Function to encode datetime into cyclic parameters.

Plotting Distribution Of Days Over Years:

Next, I will deal with missing values in categorical and numeric attributes separately

Categorical Variable:

Missing Values In Categorical Columns:

Filling missing values with mode of the column in value:

Numerical Variables:

Missing Values In Numeric Columns:

Filling missing values with median of the column in value:

(5) Data Preprocessing

Label Encoding The Categorical Variable:

Standardizing The Numerical Variables:

Outlier Detection:

Looking at the scaled features without outliers

(6) Model Building

Assigning X and y the status of attributes and tags:

Splitting Test And Training Sets:

Defining Early Stopping Method:

Initializing the Neural Networks:

Defining The Neural Network Layers:

Compiling The ANN:

Training ANN Model:

Plotting Training & Validation Loss Over Epochs:

Plotting Training And Validation Accuracy Over Epochs:

(7) Conclusion:

Predicting The Test Set Result:

Predicting Confusion Matrix:

Predicting Classification Report

Leave a Reply Cancel reply