Handling Outliers
An outlier is a data item or object that deviates significantly from the rest of the (so-called normal) objects. Outliers can be caused by measurement or execution errors. The analysis performed to detect them is referred to as outlier mining. There are many ways to detect outliers, and removing them from a DataFrame works the same way as removing any other row from a pandas DataFrame.
To handle outliers effectively, we need to identify them in key numerical variables that could significantly impact our analysis. For this dataset, we’ll focus on ‘Salary’ and ‘Bonus %’ as these are critical financial metrics.
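We’ll use the Interquartile Range (IQR) method to identify outliers in these variables. The IQR method is robust because quartiles are largely unaffected by extreme values: a point is flagged as an outlier if it lies more than 1.5 × IQR below the first quartile (Q1) or above the third quartile (Q3), where IQR = Q3 - Q1.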
import seaborn as sns
import matplotlib.pyplot as plt
# Calculate IQR for Salary and Bonus %
Q1_salary = df['Salary'].quantile(0.25)
Q3_salary = df['Salary'].quantile(0.75)
IQR_salary = Q3_salary - Q1_salary
Q1_bonus = df['Bonus %'].quantile(0.25)
Q3_bonus = df['Bonus %'].quantile(0.75)
IQR_bonus = Q3_bonus - Q1_bonus
# Flag rows more than 1.5 * IQR below Q1 or above Q3 as outliers
outliers_salary = df[(df['Salary'] < (Q1_salary - 1.5 * IQR_salary)) |
                     (df['Salary'] > (Q3_salary + 1.5 * IQR_salary))]
outliers_bonus = df[(df['Bonus %'] < (Q1_bonus - 1.5 * IQR_bonus)) |
                    (df['Bonus %'] > (Q3_bonus + 1.5 * IQR_bonus))]
# Plotting boxplots
fig, axes = plt.subplots(1, 2, figsize=(18, 6))
sns.boxplot(x=df['Salary'], ax=axes[0])
axes[0].set_title('Boxplot of Salary')
sns.boxplot(x=df['Bonus %'], ax=axes[1])
axes[1].set_title('Boxplot of Bonus %')
# Show the plots
plt.show()
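# Display the number of outliers detected in each column
# (print is needed because the bare expression only shows up in a notebook cell)
print(outliers_salary.shape[0], outliers_bonus.shape[0])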
Output: [boxplots of Salary and Bonus %, followed by the number of outliers detected in each column]
To remove the outliers, follow the same process as removing any other entry from the dataset. Every detection method above ultimately yields the set of rows that satisfy its outlier definition, so those rows can be dropped using their exact position (index) in the dataset.
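As a minimal sketch of that removal step, the example below applies the IQR rule to a small, made-up DataFrame and drops the flagged rows by index. The sample values, the helper name `iqr_bounds`, and the variable `df_clean` are assumptions for illustration only and are not part of the article's dataset.

```python
import pandas as pd

# Hypothetical sample data; the column name mirrors the article's 'Salary' column.
df = pd.DataFrame({"Salary": [42000, 45000, 47000, 50000, 52000, 55000, 250000]})


def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences of the IQR rule for a numeric Series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


lower, upper = iqr_bounds(df["Salary"])

# Rows outside the fences, selected the same way as in the detection step.
outliers_salary = df[(df["Salary"] < lower) | (df["Salary"] > upper)]

# Drop the flagged rows by index label; drop() returns a new DataFrame,
# so the original df is left unchanged.
df_clean = df.drop(index=outliers_salary.index)

print(len(df), len(df_clean))  # 7 6
```

Dropping by index keeps the detection and removal steps separate, so the flagged rows can be inspected before they are discarded. Whether removal is appropriate at all depends on the data: a genuine but extreme salary may be better kept or capped than deleted.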
For more information, refer to Detect and Remove the Outliers using Python.
Steps for Mastering Exploratory Data Analysis
Mastering exploratory data analysis (EDA) is crucial for understanding your data, identifying patterns, and generating insights that can inform further analysis or decision-making. Data is the lifeblood of modern organizations, and the ability to extract insights from it has become a crucial skill in today’s data-driven world. EDA is a powerful approach that allows analysts, scientists, and researchers to gain a thorough understanding of their data before undertaking formal modeling or hypothesis testing.
It is an iterative process that involves summarizing, visualizing, and exploring data to uncover patterns, anomalies, and relationships that might not be immediately apparent. In this article, we will walk through and implement the key steps for performing Exploratory Data Analysis. Here are the steps to help you master EDA:
Steps for Mastering Exploratory Data Analysis
- Step 1: Understand the Problem and the Data
- Step 2: Import and Inspect the Data
- Step 3: Handling Missing Values
- Step 4: Explore Data Characteristics
- Step 5: Perform Data Transformation
- Step 6: Visualize Data Relationships
- Step 7: Handling Outliers
- Step 8: Communicate Findings and Insights