Data Structures in Pandas

Pandas provides two main data structures: Series and DataFrame.

  • Series: A one-dimensional labelled array capable of holding any data type.
  • DataFrame: A two-dimensional tabular data structure with labelled axes (rows and columns).

Command / Execution

import pandas as pd
Load the Pandas library under the custom alias pd
pd.__version__
Check the installed Pandas version

Pandas Read and Write (CSV, Excel, JSON, SQL, HTML)

Command / Execution

pd.read_csv('xyz.csv')
Read the .csv file into a DataFrame
df.to_csv('xyz.csv')
Save the DataFrame as 'xyz.csv' in the current folder
xls = pd.ExcelFile('xyz.xls')
pd.read_excel(xls, 'Sheet1')
Read Sheet1 of the Excel file 'xyz.xls'
df.to_excel('xyz.xlsx', sheet_name='Sheet1')
Save the DataFrame to xyz.xlsx as Sheet1
pd.read_json('xyz.json')
Read the xyz.json file
pd.read_sql('SELECT * FROM xyz', conn)
Run a SQL query through an open database connection conn and read the result (read_sql takes a query or table name plus a connection, not a .sql file path)
pd.read_html('xyz.html')
Read the tables in xyz.html (returns a list of DataFrames)
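A minimal round-trip sketch of the CSV pair above (it writes a throwaway 'xyz.csv' to the current folder):

```python
import pandas as pd

# Write a small frame to CSV, then read it back.
df = pd.DataFrame({'Fruits': ['Mango', 'Apple'], 'Price': [80, 100]})
df.to_csv('xyz.csv', index=False)   # index=False keeps the row index out of the file
df2 = pd.read_csv('xyz.csv')
print(df2.equals(df))               # the round trip preserves the data
```

Without index=False, read_csv would bring the saved index back as an extra 'Unnamed: 0' column.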

Create Pandas Series and DataFrame

Command / Execution

pd.Series(data=Data)
Create a Pandas Series from Data, e.g. {10: 'DSA', 20: 'ML', 30: 'DS'} (the dict keys become the index)
pd.Series(data=['Geeks', 'for', 'geeks'],
index=['A', 'B', 'C'])
Create a Pandas Series with a custom index
pd.DataFrame(data)
Create a Pandas DataFrame from data, e.g.
{'Fruits': ['Mango', 'Apple', 'Banana', 'Orange'],
 'Quantity': [40, 20, 25, 10],
 'Price': [80, 100, 50, 70]}

df.dtypes
Return the data type of each column
df.shape
Return the shape of the data as (rows, columns)
df['Column_Name'].astype('int32')
Cast the column to 32-bit integer
df['Column_Name'].astype('str')
Cast the column to string
df['Column_Name'].astype('float')
Cast the column to float
df.info()
Print a summary of the DataFrame (index, columns, dtypes, memory usage)
df.values
Return the underlying data as a NumPy array
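A short sketch of the constructors and inspection calls above, using the fruit data from this sheet:

```python
import pandas as pd

# A Series built from a dict uses the keys as its index.
s = pd.Series(data={10: 'DSA', 20: 'ML', 30: 'DS'})
print(s.loc[20])    # 'ML'

# A DataFrame built from a dict of equal-length lists.
df = pd.DataFrame({'Fruits': ['Mango', 'Apple', 'Banana', 'Orange'],
                   'Quantity': [40, 20, 25, 10],
                   'Price': [80, 100, 50, 70]})
print(df.shape)                              # (4, 3)
print(df['Price'].astype('float').dtype)     # float64
```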

Pandas DataFrame

   Fruits  Quantity  Price
0   Mango        40     80
1   Apple        20    100
2  Banana        25     50
3  Orange        10     70

Pandas Sorting, Reindexing, Renaming, Reshaping, Dropping

Sorting by values

df.sort_values('Price', ascending=True)
Sort the data frame df by the 'Price' column in ascending order
df.sort_values('Price', ascending=False)
Sort the data frame df by the 'Price' column in descending order

Sorting by Index

df.sort_index(ascending=False)
Sort the data frame df by its index in descending order
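Both sorts can be sketched on the fruit data (value sort first, then an index sort):

```python
import pandas as pd

df = pd.DataFrame({'Fruits': ['Mango', 'Apple', 'Banana', 'Orange'],
                   'Price': [80, 100, 50, 70]})

# Ascending sort by the 'Price' column; rows keep their original index labels.
by_price = df.sort_values('Price', ascending=True)
print(by_price['Fruits'].tolist())   # ['Banana', 'Orange', 'Mango', 'Apple']

# Descending sort by index reverses the default row order.
print(df.sort_index(ascending=False).index.tolist())   # [3, 2, 1, 0]
```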

Reindexing 

df.reset_index(drop=True, inplace=True)

Reset the index to the default integer index

  • inplace=True # modify the original data frame in place
  • drop=True # discard the old index; if False, the previous index is kept as a new column

Renaming

df.rename(columns={'Fruits': 'FRUITS',
'Quantity': 'QUANTITY',
'Price': 'PRICE'},
inplace=True)

Rename columns via a mapping of old name to new name:

In the given code, 'Fruits' is replaced by 'FRUITS', 'Quantity' by 'QUANTITY', and 'Price' by 'PRICE'.
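Renaming and reindexing can be sketched together on a hypothetical two-row frame:

```python
import pandas as pd

df = pd.DataFrame({'Fruits': ['Mango', 'Apple'], 'Price': [80, 100]})
df.rename(columns={'Fruits': 'FRUITS', 'Price': 'PRICE'}, inplace=True)
print(df.columns.tolist())   # ['FRUITS', 'PRICE']

# After dropping a row the index has a gap; reset_index renumbers from zero.
df = df.drop([0]).reset_index(drop=True)
print(df.index.tolist())     # [0]
```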

Reshaping

pd.melt(df)
Gather columns into rows (wide to long format)
pivot = df.pivot(columns='FRUITS',
values=['PRICE', 'QUANTITY'])

Create a pivot table (long to wide format)
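A minimal sketch of melt, which stacks every column into (variable, value) pairs:

```python
import pandas as pd

df = pd.DataFrame({'FRUITS': ['Mango', 'Apple'],
                   'PRICE': [80, 100]})

# Each original column contributes one block of rows to the long frame.
long_df = pd.melt(df)
print(long_df.columns.tolist())      # ['variable', 'value']
print(long_df['variable'].tolist())  # ['FRUITS', 'FRUITS', 'PRICE', 'PRICE']
```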

Dropping

df1 = df.drop(columns=['QUANTITY'])

Drop a column

  • Drop 'QUANTITY' from the data frame df (equivalently df.drop(['QUANTITY'], axis=1), where axis=1 refers to columns)
df2 = df.drop([1, 3], axis=0)

Drop rows

  • Drop the rows with index labels 1 and 3 (the 2nd and 4th rows) of data frame df; axis=0 refers to rows
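Both drops, sketched on the fruit data:

```python
import pandas as pd

df = pd.DataFrame({'FRUITS': ['Mango', 'Apple', 'Banana', 'Orange'],
                   'QUANTITY': [40, 20, 25, 10]})

# Rows are dropped by index label, columns by name.
rows_dropped = df.drop([1, 3], axis=0)
print(rows_dropped['FRUITS'].tolist())   # ['Mango', 'Banana']

col_dropped = df.drop(columns=['QUANTITY'])
print(col_dropped.columns.tolist())      # ['FRUITS']
```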

Retrieving Series/DataFrame Information and Slicing

Observation

df.head()
Return the first 5 rows
df.tail()
Return the last 5 rows
df.sample(n)
Return n randomly selected rows from the data frame df
df.nlargest(2, 'QUANTITY')
Return the 2 rows with the largest values in the numerical column 'QUANTITY'
df.nsmallest(2, 'QUANTITY')
Return the 2 rows with the smallest values in the numerical column 'QUANTITY'
df[df.PRICE > 50]
Select the rows where 'PRICE' > 50
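A few of the observation calls above, sketched on the fruit data:

```python
import pandas as pd

df = pd.DataFrame({'FRUITS': ['Mango', 'Apple', 'Banana', 'Orange'],
                   'QUANTITY': [40, 20, 25, 10],
                   'PRICE': [80, 100, 50, 70]})

print(df.head(2)['FRUITS'].tolist())                  # ['Mango', 'Apple']
print(df.nlargest(2, 'QUANTITY')['FRUITS'].tolist())  # ['Mango', 'Banana']
print(df[df.PRICE > 50]['FRUITS'].tolist())           # ['Mango', 'Apple', 'Orange']
```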

Selecting column data

df['FRUITS']
Select a single column by name, i.e. 'FRUITS'
df[['FRUITS', 'PRICE']]
Select multiple columns by name
df.filter(regex='F|Q')

Select the columns whose names match the regular expression, i.e. 'FRUITS' and 'QUANTITY'

Getting Subsets of rows or columns

df.loc[:, 'FRUITS':'PRICE']
Select all the columns from 'FRUITS' through 'PRICE' (label slices are inclusive)
df.loc[df['PRICE'] < 70, ['FRUITS', 'PRICE']]
Select 'FRUITS' and 'PRICE' for the rows having PRICE < 70
df.iloc[2:5]
Select the rows at positions 2 through 4 (position slices exclude the end)
df.iloc[:, [0, 2]]
Select the columns at positions 0 and 2
df.at[1, 'PRICE']
Select the single 'PRICE' value at row label 1 (the 2nd row)
df.iat[1, 2]
Select a single value by position, i.e. the 2nd row and 3rd column
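The label/position distinction above can be sketched as:

```python
import pandas as pd

df = pd.DataFrame({'FRUITS': ['Mango', 'Apple', 'Banana', 'Orange'],
                   'QUANTITY': [40, 20, 25, 10],
                   'PRICE': [80, 100, 50, 70]})

# loc slices by label (end inclusive); iloc slices by position (end exclusive).
print(df.loc[:, 'FRUITS':'PRICE'].columns.tolist())  # all three columns
print(df.iloc[2:4]['FRUITS'].tolist())               # ['Banana', 'Orange']
print(df.at[1, 'PRICE'])                             # 100
print(df.iat[1, 2])                                  # also 100: row 1, column 2
```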

Filter

df.filter(items=['FRUITS', 'PRICE'])

Filter by column name

  • Select the 'FRUITS' and 'PRICE' columns of the data frame
df.filter(items=[3], axis=0)

Filter by row index

  • Select the row with index label 3 (the 4th row) of the data frame
  • Here axis=0 refers to rows
df['PRICE'].where(df['PRICE'] > 50)

Return a new Series with the same length as the original 'PRICE' column, in which values failing the condition are replaced with NaN (or another specified value)

df.query('PRICE > 70')

Filter the DataFrame with a boolean expression

  • Return the rows having PRICE > 70
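The difference between query and where in one sketch:

```python
import pandas as pd

df = pd.DataFrame({'FRUITS': ['Mango', 'Apple', 'Banana', 'Orange'],
                   'PRICE': [80, 100, 50, 70]})

# query drops non-matching rows; where keeps the shape and masks with NaN.
print(df.query('PRICE > 70')['FRUITS'].tolist())   # ['Mango', 'Apple']
masked = df['PRICE'].where(df['PRICE'] > 50)
print(int(masked.isna().sum()))                    # 1 value failed the condition
```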

Combine Two Data Sets

Merge two data frames

pd.merge(df1, df2, how='left', on='Fruits')

Left join

  • Merge the two data frames df1 and df2, keeping every 'Fruits' value of the left data frame, i.e. df1
pd.merge(df1, df2, how='right', on='Fruits')

Right join

  • Merge the two data frames df1 and df2, keeping every 'Fruits' value of the right data frame, i.e. df2
pd.merge(df1, df2, how='inner', on='Fruits')

Inner join

  • Merge the two data frames df1 and df2, keeping only the 'Fruits' values common to both
pd.merge(df1, df2, how='outer', on='Fruits')

Outer join

  • Merge the two data frames df1 and df2, keeping every 'Fruits' value from either data frame
  • Rows with no match in the other data frame are filled with NaN
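A small sketch of the inner and outer joins; the 'Origin' column of df2 is hypothetical:

```python
import pandas as pd

df1 = pd.DataFrame({'Fruits': ['Mango', 'Apple', 'Banana'],
                    'Price': [80, 100, 50]})
df2 = pd.DataFrame({'Fruits': ['Mango', 'Orange'],
                    'Origin': ['India', 'Spain']})   # hypothetical second table

# Inner join keeps only keys present in both; outer join keeps the union.
inner = pd.merge(df1, df2, how='inner', on='Fruits')
outer = pd.merge(df1, df2, how='outer', on='Fruits')
print(inner['Fruits'].tolist())   # ['Mango']
print(len(outer))                 # 4: Mango, Apple, Banana, Orange
```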

Concatenation

concat_df = pd.concat([df, df1], axis=0, ignore_index=True)

Row-Wise Concatenation

  • axis=0 : the data frames df and df1 are joined vertically
  • ignore_index=True : the resulting DataFrame gets a new index, starting from zero and incrementing sequentially
  • concat_df has the rows of df followed by those of df1
concat_df = pd.concat([df, df2], axis=1)

Column-Wise Concatenation

  • axis=1 : the data frames df and df2 are joined horizontally
  • concat_df has the columns of df followed by those of df2
  • If the lengths of the DataFrames don't match, NaN values are assigned to the missing elements
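Row-wise concatenation can be sketched as:

```python
import pandas as pd

df = pd.DataFrame({'Fruits': ['Mango', 'Apple'], 'Price': [80, 100]})
df1 = pd.DataFrame({'Fruits': ['Banana'], 'Price': [50]})

# axis=0 stacks rows; ignore_index renumbers the result from zero.
concat_df = pd.concat([df, df1], axis=0, ignore_index=True)
print(concat_df['Fruits'].tolist())   # ['Mango', 'Apple', 'Banana']
print(concat_df.index.tolist())       # [0, 1, 2]
```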

Data Analysis

Describe dataset

df.describe()

Descriptive statistics of a data frame. Returns:

  • count: Number of non-null rows in each numerical column
  • mean: Average value of each numerical column
  • std: Standard deviation of each numerical column
  • min: Minimum value of each numerical column
  • 25%, 50%, 75%: 25th, 50th & 75th percentiles of each numerical column
  • max: Maximum value of each numerical column
df.describe(include=['O'])

Descriptive statistics of the object-dtype columns of the data frame

  • include=['O'] : select the object-dtype columns
  • count: Number of non-null rows in each object column
  • unique: Count of unique values in each object column
  • top: Most frequent value in each object column
  • freq: Frequency of that most frequent value
df.FRUITS.unique()
Return the unique values of the 'FRUITS' column
df.FRUITS.value_counts()
Count the frequency of each unique value in the 'FRUITS' column
df['PRICE'].sum()
Return the sum of ‘PRICE’
df['PRICE'].cumsum()
Return the cumulative sum of ‘PRICE’ values
df['PRICE'].min()
Return the minimum value of ‘PRICE’ column
df['PRICE'].max()
Return the maximum value of ‘PRICE’ column
df['PRICE'].mean()
Return the mean value of ‘PRICE’ column
df['PRICE'].median()
Return the median value of ‘PRICE’ column
df['PRICE'].var()
Return the variance value of ‘PRICE’ column
df['PRICE'].std()
Return the standard deviation value of ‘PRICE’ column
df['PRICE'].quantile([0.25, 0.75])
Return the 25th and 75th percentile values of the 'PRICE' column
df.apply(summation)

Apply a custom function column by column; apply passes each column to the function as a Series:

def summation(col):
    if col.dtype != 'int64':
        return col.count()
    else:
        return col.sum()
df.cov(numeric_only=True)
Compute the Covariance for numerical columns
df.corr(numeric_only=True)
Compute the Correlation for numerical columns
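The column statistics and df.apply above can be sketched together (using the summation function defined in the table):

```python
import pandas as pd

df = pd.DataFrame({'FRUITS': ['Mango', 'Apple', 'Banana', 'Orange'],
                   'PRICE': [80, 100, 50, 70]})

print(df['PRICE'].sum())    # 300
print(df['PRICE'].mean())   # 75.0

# apply passes each column to the function as a Series.
def summation(col):
    if col.dtype != 'int64':
        return col.count()   # row count for non-integer columns
    return col.sum()         # sum for integer columns

print(df.apply(summation).tolist())   # [4, 300]
```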

Missing Values

df.isnull()

Check for null values

  • Returns True or False for each cell; True means the value is missing
df.isnull().sum()
Return the count of null values in each column
df['DISCOUNT'] = df['DISCOUNT'].fillna(value=VALUE)
Fill the null values of 'DISCOUNT' with the specified VALUE; VALUE can be the mean, median, mode or any other chosen value
df1 = df.dropna()
Drop the rows that contain null values
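A short sketch of counting and filling missing values; the 'DISCOUNT' column with one NaN is hypothetical:

```python
import pandas as pd
import numpy as np

# 'DISCOUNT' is a hypothetical column with one missing value.
df = pd.DataFrame({'FRUITS': ['Mango', 'Apple', 'Banana'],
                   'DISCOUNT': [5.0, np.nan, 10.0]})

print(int(df.isnull().sum()['DISCOUNT']))   # 1 null in the column
df['DISCOUNT'] = df['DISCOUNT'].fillna(value=df['DISCOUNT'].mean())
print(df['DISCOUNT'].tolist())              # [5.0, 7.5, 10.0]
```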

Add a new column to the Data frame

df['COL_NAME'] = COL_DATA

Add a column to the existing dataset

Note: The length of COL_DATA must equal the number of rows of the existing dataset

df = df.assign(Paid_Price=lambda df:
(df.QUANTITY * df.PRICE))
Add a column using the existing columns values
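Both ways of adding a column, sketched on a two-row frame (the 'DISCOUNT' values are made up):

```python
import pandas as pd

df = pd.DataFrame({'FRUITS': ['Mango', 'Apple'],
                   'QUANTITY': [40, 20],
                   'PRICE': [80, 100]})

# An equal-length list becomes a new column; assign derives one from existing columns.
df['DISCOUNT'] = [5, 10]
df = df.assign(Paid_Price=lambda d: d.QUANTITY * d.PRICE)
print(df['Paid_Price'].tolist())   # [3200, 2000]
```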

Group By

grouped = df.groupby(by='COL_NAME')
Group the dataframe by the unique values of the specified column 'COL_NAME'
grouped.agg(['count', 'sum', 'mean'])
Return the count, sum and mean of each group
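A groupby-and-aggregate sketch; the 'Origin' grouping column is hypothetical (it also appears in the plotting section below):

```python
import pandas as pd

# 'Origin' is a hypothetical grouping column.
df = pd.DataFrame({'Origin': ['India', 'India', 'Spain'],
                   'PRICE': [80, 100, 50]})

grouped = df.groupby(by='Origin')
summary = grouped['PRICE'].agg(['count', 'sum', 'mean'])
print(summary.loc['India', 'sum'])    # 180
print(summary.loc['India', 'mean'])   # 90.0
```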

Graph with Pandas

grouped = df.groupby(['Origin'])
grouped.sum().plot.pie(y='Paid_Price', subplots=True)

Pie Chart

  • Plot a pie chart of the group-wise sum of 'Paid_Price' for each 'Origin'
df.plot.scatter(x='PRICE', y='DISCOUNT')

Scatter Plot

  • Scatter Plot between ‘PRICE’ and ‘DISCOUNT’
df.plot.bar(x='FRUITS', y=['QUANTITY', 'PRICE', 'DISCOUNT'])

Bar Chart

  • Bar chart with fruit names on the horizontal axis and the respective 'QUANTITY', 'PRICE' and 'DISCOUNT' values
df['QUANTITY'].plot.hist(bins=3)

Histogram Plot

  • Histogram of the 'QUANTITY' column with the specified number of bins, i.e. 3 here
df.boxplot(column='PRICE', grid=False)

Box Plot

  • Box plot of ‘PRICE’ column 
  • It is used for outlier detection

