How To Calculate Summary Statistics In Pandas
Pandas, a data manipulation library for Python, provides several ways to compute summary statistics, which give a quick overview of a dataset's main characteristics. In this article, we will explore three methods for calculating summary statistics using Pandas, each with a code example.
Calculate Summary Statistics In Pandas
Below are examples of how to calculate summary statistics in Pandas.
- Descriptive Statistics with describe()
- Mean, Median, and Mode with mean(), median(), and mode()
- Correlation with corr() Method
Creating a Sample DataFrame
Let’s create a sample Pandas DataFrame.
Python3
import pandas as pd

# Creating a sample DataFrame
data = {'A': [1, 2, 3, 4, 5],
        'B': [10, 20, 15, 25, 30]}
df = pd.DataFrame(data)
print(df)
Output :
A B
0 1 10
1 2 20
2 3 15
3 4 25
4 5 30
Calculate Summary Statistics Using Descriptive Statistics with describe()
The describe() method generates descriptive statistics for each numeric column of a DataFrame: count, mean, sample standard deviation, minimum, 25th percentile, median (50th percentile), 75th percentile, and maximum.
Python3
# Using describe() to calculate summary statistics
summary_stats = df.describe()
print("Summary Statistics:")
print(summary_stats)
Output :
Summary Statistics:
A B
count 5.000000 5.000000
mean 3.000000 20.000000
std 1.581139 7.905694
min 1.000000 10.000000
25% 2.000000 15.000000
50% 3.000000 20.000000
75% 4.000000 25.000000
max 5.000000 30.000000
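To see where each row of this table comes from, the sketch below rebuilds it from the individual DataFrame methods. It assumes the same sample data as above; the names `manual` and `matches` are only illustrative, and the comparison uses a small tolerance rather than exact equality.

```python
import pandas as pd

# Same sample data as in the article
df = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": [10, 20, 15, 25, 30]})

summary = df.describe()

# Rebuild the same table, one statistic at a time
manual = pd.DataFrame({
    "count": df.count(),
    "mean": df.mean(),
    "std": df.std(),            # sample standard deviation (ddof=1)
    "min": df.min(),
    "25%": df.quantile(0.25),   # linear interpolation by default
    "50%": df.quantile(0.50),   # same value as df.median()
    "75%": df.quantile(0.75),
    "max": df.max(),
}).T  # transpose so the statistics become rows, as in describe()

print(manual)

# Compare both tables cell by cell within a small tolerance
matches = bool(((summary - manual).abs() < 1e-12).all().all())
print(matches)
```

Calling the individual methods is handy when only one or two statistics are needed, while describe() is the quicker choice for a first look at the data.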
Mean, Median, and Mode with mean(), median(), and mode()
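Pandas provides dedicated methods to calculate the mean, median, and mode of each column in a DataFrame. Note that mode() returns every most frequent value, sorted in ascending order. In the sample data each value occurs exactly once, so every value is a mode, and taking the first row simply returns the smallest value in each column.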
Python3
# Calculating mean, median, and mode
mean_values = df.mean()
median_values = df.median()
# mode() returns a DataFrame with one row per modal value;
# iloc[0] keeps only the first (smallest) mode of each column
mode_values = df.mode().iloc[0]

print("Mean values:\n", mean_values)
print("\nMedian values:\n", median_values)
print("\nMode values:\n", mode_values)
Output :
Mean values:
A 3.0
B 20.0
dtype: float64
Median values:
A 3.0
B 20.0
dtype: float64
Mode values:
A 1
B 10
Name: 0, dtype: int64
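Because mode() can return more than one value per column, it may help to look at data that actually contains repeats. The following is a minimal sketch with a hypothetical DataFrame, `df_ties`, which is not part of the article's sample data; it shows how ties appear as extra rows and how shorter columns are padded with NaN.

```python
import pandas as pd

# Hypothetical data with repeated values (not the article's sample)
df_ties = pd.DataFrame({
    "A": [1, 2, 2, 3, 3],       # 2 and 3 both appear twice: two modes
    "B": [10, 10, 20, 30, 40],  # 10 appears twice: a single mode
})

# One row per modal value; columns with fewer modes are padded with NaN
modes = df_ties.mode()
print(modes)

# Number of modal values found in each column
n_modes = modes.notna().sum()
print(n_modes)
```

Here column A has two modes (2 and 3) and column B has one (10), so `.iloc[0]` alone would hide the tie in column A.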
Calculate Summary Statistics Using Correlation with corr() Method
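Correlation measures the strength and direction of a linear relationship between two variables. The corr() method in Pandas computes the pairwise correlation between the columns of a DataFrame, using the Pearson coefficient by default. Values range from -1 (perfect negative relationship) to 1 (perfect positive relationship), which makes the resulting matrix a convenient way to scan many columns at once.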
Python3
# Calculating correlation between columns
correlation_matrix = df.corr()
print("\nCorrelation Matrix:\n", correlation_matrix)
Output:
Correlation Matrix:
A B
A 1.0 0.9
B 0.9 1.0
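The value 0.9 can be checked by hand, since the Pearson coefficient is the covariance of the two columns divided by the product of their standard deviations. The sketch below assumes the same sample data; `cov_ab`, `r_manual`, and `r_builtin` are illustrative names.

```python
import pandas as pd

# Same sample data as in the article
df = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": [10, 20, 15, 25, 30]})

# Pearson r = cov(A, B) / (std(A) * std(B)), using sample statistics (ddof=1)
cov_ab = df["A"].cov(df["B"])
r_manual = cov_ab / (df["A"].std() * df["B"].std())

# Built-in correlation between two individual columns (Pearson by default)
r_builtin = df["A"].corr(df["B"])

print(round(r_manual, 4), round(r_builtin, 4))  # both round to 0.9
```

Series.corr() is useful when only one pair of columns matters, whereas DataFrame.corr() returns the full matrix.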
Conclusion
In conclusion, calculating summary statistics is a basic step in data analysis with Pandas. The describe() method gives a full overview in one call, mean(), median(), and mode() report individual measures of central tendency, and corr() shows how columns relate to one another. With these methods, analysts can quickly inspect a dataset, spot patterns, and make decisions based on a clearer understanding of their data.