Exploratory Data Analysis

Exploratory data analysis (EDA) is an approach to analyzing data using visual techniques. It is used to discover trends and patterns, or to check assumptions, with the help of statistical summaries and graphical representations. Here we will check the class imbalance and the skewness of the data.

Python3




# Count the rows in each target class once and reuse the result.
counts = df['rainfall'].value_counts()
plt.pie(counts.values,
        labels=counts.index,
        autopct='%1.1f%%')
plt.show()


Output:

Pie chart showing the share of samples in each target class
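The imbalance shown in the pie chart can also be read off as numbers. The snippet below is a minimal sketch, assuming a small made-up DataFrame named `toy` in place of the article's `df`; only the `rainfall` column name comes from the dataset, and the values are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the article's df; the values are made up.
toy = pd.DataFrame(
    {'rainfall': ['yes', 'yes', 'no', 'yes', 'no', 'yes', 'yes', 'no']}
)

# Absolute count and relative share of each target class.
counts = toy['rainfall'].value_counts()
shares = toy['rainfall'].value_counts(normalize=True)

# Majority-to-minority ratio; 1.0 would mean perfectly balanced classes.
imbalance_ratio = counts.max() / counts.min()

print(counts)
print(shares.round(3))
print(f"majority/minority ratio: {imbalance_ratio:.2f}")
```

A ratio well above 1 suggests that accuracy alone may be a misleading metric later on, since a model could score well by always predicting the majority class.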

Python3




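# numeric_only=True keeps the call from failing if any text column is still present.
df.groupby('rainfall').mean(numeric_only=True)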


Output:

From these group means we can draw some observations:

  • maxtemp is relatively lower on days of rainfall.
  • dewpoint is higher on days of rainfall.
  • humidity is higher on days of rainfall.
  • cloud is higher on days of rainfall, as expected, since rain requires clouds.
  • sunshine is lower on days of rainfall.
  • windspeed is higher on days of rainfall.

These observations closely match what we see in real life.

Python3




features = list(df.select_dtypes(include = np.number).columns)
features.remove('day')
print(features)


Output:

['pressure', 'maxtemp', 'temperature', 'mintemp', 'dewpoint', 'humidity', 'cloud', 'sunshine', 'winddirection', 'windspeed']

Let’s check the distribution of the continuous features given in the dataset.

Python3




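plt.figure(figsize=(15,8))

for i, col in enumerate(features):
  plt.subplot(3,4, i + 1)
  # distplot is deprecated in recent seaborn releases; histplot with
  # kde=True draws a comparable histogram with a density curve.
  sb.histplot(df[col], kde=True)
plt.tight_layout()
plt.show()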


Output:

Distribution plot for the columns with continuous data
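Skewness can also be quantified rather than judged by eye from the plots. The following is a minimal sketch, assuming a synthetic DataFrame `toy` whose two columns (`symmetric` and `right_skewed`) are invented for illustration and are not part of the rainfall dataset.

```python
import numpy as np
import pandas as pd

# Synthetic, hypothetical data: one roughly symmetric column and one
# column with a long right tail.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'symmetric': rng.normal(loc=0.0, scale=1.0, size=1000),
    'right_skewed': rng.exponential(scale=1.0, size=1000),
})

# DataFrame.skew() returns the sample skewness of each numeric column.
# Values near 0 indicate a roughly symmetric distribution, positive
# values a longer right tail, negative values a longer left tail.
skewness = toy.skew()
print(skewness.round(2))
```

On the article's data, the equivalent call would presumably be `df[features].skew()`, which gives one number per feature to compare against the distribution plots.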

Let’s draw boxplots for the continuous variables to detect outliers in the data.

Python3




plt.figure(figsize=(15,8))

for i, col in enumerate(features):
  plt.subplot(3,4, i + 1)
  # Pass the column by keyword so the call behaves the same across seaborn versions.
  sb.boxplot(x=df[col])
plt.tight_layout()
plt.show()


Output:

Box plots for the columns with continuous data 

There are outliers in the data, but because the dataset is small, we cannot afford to remove them.
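To see how many rows an outlier filter would actually cost, the points can be counted with the 1.5 × IQR rule that box plots conventionally use for their whiskers. This is a minimal sketch: the helper `iqr_outlier_mask` and the `toy` values are hypothetical and only illustrate the rule.

```python
import pandas as pd

# Hypothetical values; only the column name is borrowed from the dataset.
toy = pd.DataFrame({'windspeed': [10, 12, 11, 13, 12, 11, 10, 14, 12, 45]})

def iqr_outlier_mask(series, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

mask = iqr_outlier_mask(toy['windspeed'])
print(int(mask.sum()), "outlier(s):", toy.loc[mask, 'windspeed'].tolist())
```

Counting flagged rows per feature before dropping anything makes the trade-off concrete: on a small dataset, removing every flagged row can discard a noticeable share of the samples.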

Python3




# Encode the target labels as integers so that rainfall can take part in numeric computations.
df.replace({'yes':1, 'no':0}, inplace=True)


Sometimes highly correlated features only increase the dimensionality of the feature space without improving the model’s performance. So we must check whether this dataset contains highly correlated features.

Python3




plt.figure(figsize=(10,10))
sb.heatmap(df.corr() > 0.8,
           annot=True,
           cbar=False)
plt.show()


Output:

Heat map to detect highly correlated features
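The heat map can be complemented by listing the correlated pairs explicitly. The snippet below is a minimal sketch on synthetic data: the DataFrame `toy` is hypothetical, with two columns deliberately built from a shared signal, and it reuses the same 0.8 threshold as the heat map.

```python
import numpy as np
import pandas as pd

# Synthetic, hypothetical data: two columns share a common signal,
# the third is independent noise.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    'maxtemp': base + rng.normal(scale=0.1, size=200),
    'mintemp': base + rng.normal(scale=0.1, size=200),
    'windspeed': rng.normal(size=200),
})

corr = toy.corr()

# Keep only the cells above the diagonal so each pair appears once
# and self-correlations are excluded; the other cells become NaN.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to a Series indexed by (feature_a, feature_b); NaN cells
# never pass the comparison, so only real pairs above 0.8 remain.
pairs = upper.stack()
high = pairs[pairs > 0.8]
print(high)
```

Comparing `corr.abs()` against the threshold instead would also catch strongly negative correlations, which a plain `> 0.8` test does not flag.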

Now we will remove the highly correlated features ‘maxtemp’ and ‘mintemp’. But why not temperature or dewpoint? Because temperature and dewpoint provide distinct information about the weather and atmospheric conditions.

Python3




df.drop(['maxtemp', 'mintemp'], axis=1, inplace=True)


Rainfall Prediction using Machine Learning – Python

No method can predict with certainty whether it will rain on a given day; even the meteorological department’s forecasts sometimes fail. In this article, we will learn how to build a machine-learning model that predicts whether there will be rainfall today based on some atmospheric factors. Machine learning suits this problem because such models tend to perform well on tasks that previously required highly skilled individuals.
