Data Structures in Pandas
Pandas provides two main data structures: Series and DataFrame.
- Series: A one-dimensional labelled array capable of holding any data type.
- DataFrame: A two-dimensional tabular data structure with labelled axes (rows and columns).
| Command | Execution |
| --- | --- |
| `import pandas as pd` | Load the Pandas library under the conventional alias `pd` |
| `pd.__version__` | Check the installed Pandas version |
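As a quick sketch, importing the library and checking its version looks like:

```python
import pandas as pd

# `pd` is the conventional alias used throughout this cheat sheet
print(pd.__version__)  # version string, e.g. '2.2.1', depending on your installation
```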
Pandas Read and Write Data Files
| Command | Execution |
| --- | --- |
| `pd.read_csv('xyz.csv')` | Read the file `xyz.csv` into a DataFrame |
| `df.to_csv('xyz.csv')` | Save the DataFrame `df` as `xyz.csv` in the current folder |
| `pd.read_excel('xyz.xls', sheet_name='Sheet1')` | Read Sheet1 of the Excel file `xyz.xls` |
| `df.to_excel('xyz.xlsx', sheet_name='Sheet1')` | Save the DataFrame to `xyz.xlsx` as Sheet1 |
| `pd.read_json('xyz.json')` | Read the file `xyz.json` |
| `pd.read_sql('SELECT * FROM table_name', con)` | Run a SQL query against the database connection `con` and return the result as a DataFrame |
| `pd.read_html('xyz.html')` | Parse the tables in `xyz.html` into a list of DataFrames |
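A minimal round-trip sketch of reading and writing CSV; `StringIO` stands in for a real file path such as `'xyz.csv'` so the snippet is self-contained:

```python
import pandas as pd
from io import StringIO

# In practice you would pass a filename like 'xyz.csv';
# StringIO acts as an in-memory file here.
csv_text = "Fruits,Quantity,Price\nMango,40,80\nApple,20,100\n"
df = pd.read_csv(StringIO(csv_text))

# to_csv with no path returns the CSV text instead of writing a file
round_trip = df.to_csv(index=False)
print(df.shape)  # (2, 3)
```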
Create Pandas Series and DataFrame
| Command | Execution |
| --- | --- |
| `pd.Series(data=Data)` | Create a Pandas Series from `Data`, e.g. `{10: 'DSA', 20: 'ML', 30: 'DS'}` |
| `pd.Series(data=['Geeks', 'for', 'geeks'], index=[10, 20, 30])` | Create a Pandas Series with a custom index |
| `pd.DataFrame(data)` | Create a Pandas DataFrame from `data`, e.g. `{'Fruits': ['Mango', 'Apple', 'Banana', 'Orange'], 'Quantity': [40, 20, 25, 10], 'Price': [80, 100, 50, 70]}` |
| `df.dtypes` | Return the data type of each column |
| `df.shape` | Return the shape of the data as (rows, columns) |
| `df['Column_Name'].astype('int32')` | Cast the column to 32-bit integer (returns a new Series) |
| `df['Column_Name'].astype('str')` | Cast the column to string |
| `df['Column_Name'].astype('float')` | Cast the column to float |
| `df.info()` | Print a concise summary of the DataFrame |
| `df.values` | Return the data as a NumPy array |
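A short sketch of both constructors, using the fruits table that the rest of this sheet works with:

```python
import pandas as pd

# Series from a dict: the keys become the index
s = pd.Series(data={10: 'DSA', 20: 'ML', 30: 'DS'})

# DataFrame from a dict of equal-length lists
df = pd.DataFrame({
    'Fruits': ['Mango', 'Apple', 'Banana', 'Orange'],
    'Quantity': [40, 20, 25, 10],
    'Price': [80, 100, 50, 70],
})

print(df.shape)                           # (4, 3)
print(df['Price'].astype('float').dtype)  # float64
```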
Pandas DataFrame
|   | Fruits | Quantity | Price |
| --- | --- | --- | --- |
| 0 | Mango | 40 | 80 |
| 1 | Apple | 20 | 100 |
| 2 | Banana | 25 | 50 |
| 3 | Orange | 10 | 70 |
Pandas Sorting, Reindexing, Renaming, Reshaping, Dropping
| Command | Execution |
| --- | --- |
| Sorting by values | |
| `df.sort_values('Price', ascending=True)` | Sort the DataFrame `df` by the values of `'Price'` in ascending order |
| `df.sort_values('Price', ascending=False)` | Sort the DataFrame `df` by the values of `'Price'` in descending order |
| Sorting by index | |
| `df.sort_index(ascending=False)` | Sort the index of `df` in descending order |
| Reindexing | |
| `df.reset_index(drop=True, inplace=True)` | Reset the index to the default integer index |
| Renaming | |
| `df.rename(columns={'Fruits': 'FRUITS', 'Quantity': 'QUANTITY', 'Price': 'PRICE'})` | Rename columns: `'Fruits'` becomes `'FRUITS'`, `'Quantity'` becomes `'QUANTITY'`, and `'Price'` becomes `'PRICE'` |
| Reshaping | |
| `pd.melt(df)` | Gather columns into rows |
| `pivot = df.pivot(columns='FRUITS', values=['QUANTITY', 'PRICE'])` | Create a pivot table |
| Dropping | |
| `df1 = df.drop(columns=['QUANTITY'])` | Drop a column |
| `df2 = df.drop([1, 3], axis=0)` | Drop rows by index label |
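A sketch tying these operations together on the fruits DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    'Fruits': ['Mango', 'Apple', 'Banana', 'Orange'],
    'Quantity': [40, 20, 25, 10],
    'Price': [80, 100, 50, 70],
})

# Sort by Price, ascending: Banana (50) comes first
by_price = df.sort_values('Price', ascending=True)

# Rename columns to upper case
df = df.rename(columns={'Fruits': 'FRUITS', 'Quantity': 'QUANTITY', 'Price': 'PRICE'})

# Drop a column, and drop the rows with index labels 1 and 3
df1 = df.drop(columns=['QUANTITY'])
df2 = df.drop([1, 3], axis=0)

print(by_price.iloc[0]['Fruits'])  # Banana
print(list(df1.columns))           # ['FRUITS', 'PRICE']
print(list(df2.index))             # [0, 2]
```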
Retrieving Series/DataFrame Information and Slicing
| Command | Execution |
| --- | --- |
| Observation | |
| `df.head()` | Print the first 5 rows |
| `df.tail()` | Print the last 5 rows |
| `df.sample(n)` | Randomly select `n` rows from the DataFrame `df` |
| `df.nlargest(2, 'QUANTITY')` | Select the 2 rows with the largest values in the numerical column `'QUANTITY'` |
| `df.nsmallest(2, 'QUANTITY')` | Select the 2 rows with the smallest values in the numerical column `'QUANTITY'` |
| `df[df.PRICE > 50]` | Select the rows where `'PRICE'` is greater than 50 |
| Selecting column data | |
| `df['FRUITS']` | Select a single column by name, i.e. `'FRUITS'` |
| `df[['FRUITS', 'PRICE']]` | Select more than one column by name |
| `df.filter(regex='F\|Q')` | Select the columns whose names match the regular expression, i.e. `'FRUITS'` and `'QUANTITY'` |
| Getting subsets of rows or columns | |
| `df.loc[:, 'FRUITS':'PRICE']` | Select all columns between `'FRUITS'` and `'PRICE'` (inclusive) |
| `df.loc[df['PRICE'] < 70, ['FRUITS', 'PRICE']]` | Select `'FRUITS'` and `'PRICE'` for the rows where `'PRICE'` < 70 |
| `df.iloc[2:5]` | Select the rows at positions 2 through 4 |
| `df.iloc[:, [0, 2]]` | Select the columns at positions 0 and 2 |
| `df.at[1, 'PRICE']` | Select the single `'PRICE'` value at row label 1 |
| `df.iat[1, 2]` | Select a single value by position, i.e. the 2nd row and 3rd column |
| Filter | |
| `df.filter(items=['FRUITS', 'PRICE'])` | Filter by column name |
| `df.filter(items=[3], axis=0)` | Filter by row index |
| `df['PRICE'].where(df['PRICE'] > 50)` | Return a new Series with the same length as the `'PRICE'` column, replacing values where the condition is False with NaN (or another specified value) |
| `df.query('PRICE > 70')` | Filter the DataFrame by a condition given as a string expression |
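A sketch of the main selection idioms, label-based (`loc`/`at`) versus position-based (`iloc`/`iat`), on the same fruits table:

```python
import pandas as pd

df = pd.DataFrame({
    'FRUITS': ['Mango', 'Apple', 'Banana', 'Orange'],
    'QUANTITY': [40, 20, 25, 10],
    'PRICE': [80, 100, 50, 70],
})

# Label-based: FRUITS and PRICE for rows where PRICE < 70
cheap = df.loc[df['PRICE'] < 70, ['FRUITS', 'PRICE']]

# Position-based: columns at positions 0 and 2
subset = df.iloc[:, [0, 2]]

# Single values by label and by position
price_at_1 = df.at[1, 'PRICE']  # row label 1, column 'PRICE'
value_at_1_2 = df.iat[1, 2]     # 2nd row, 3rd column

# String-expression filter
expensive = df.query('PRICE > 70')

print(list(cheap['FRUITS']))  # ['Banana']
print(len(expensive))         # 2
```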
Combine Two Datasets
| Command | Execution |
| --- | --- |
| Merge two DataFrames | |
| `pd.merge(df1, df2, how='left', on='Fruits')` | Left join |
| `pd.merge(df1, df2, how='right', on='Fruits')` | Right join |
| `pd.merge(df1, df2, how='inner', on='Fruits')` | Inner join |
| `pd.merge(df1, df2, how='outer', on='Fruits')` | Outer join |
| Concatenation | |
| `concat_df = pd.concat([df, df1], axis=0, ignore_index=True)` | Row-wise concatenation |
| `concat_df = pd.concat([df, df2], axis=1)` | Column-wise concatenation |
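A sketch with two small illustrative frames sharing a `'Fruits'` key (the frame contents are assumptions for this example):

```python
import pandas as pd

df1 = pd.DataFrame({'Fruits': ['Mango', 'Apple', 'Banana'], 'Price': [80, 100, 50]})
df2 = pd.DataFrame({'Fruits': ['Apple', 'Banana', 'Orange'], 'Quantity': [20, 25, 10]})

inner = pd.merge(df1, df2, how='inner', on='Fruits')  # only keys in both: Apple, Banana
outer = pd.merge(df1, df2, how='outer', on='Fruits')  # union of keys, NaN where missing

# Row-wise concatenation stacks frames; ignore_index renumbers the rows
stacked = pd.concat([df1, df1], axis=0, ignore_index=True)

print(len(inner))    # 2
print(len(outer))    # 4
print(len(stacked))  # 6
```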
Data Analysis
| Command | Execution |
| --- | --- |
| Describe dataset | |
| `df.describe()` | Descriptive statistics of the DataFrame's numerical columns |
| `df.describe(include=['O'])` | Descriptive statistics of the DataFrame's object (string) columns |
| `df.FRUITS.unique()` | Return the unique values of the `'FRUITS'` column |
| `df.FRUITS.value_counts()` | Count the frequency of each unique value in the `'FRUITS'` column |
| `df['PRICE'].sum()` | Return the sum of the `'PRICE'` values |
| `df['PRICE'].cumsum()` | Return the cumulative sum of the `'PRICE'` values |
| `df['PRICE'].min()` | Return the minimum value of the `'PRICE'` column |
| `df['PRICE'].max()` | Return the maximum value of the `'PRICE'` column |
| `df['PRICE'].mean()` | Return the mean of the `'PRICE'` column |
| `df['PRICE'].median()` | Return the median of the `'PRICE'` column |
| `df['PRICE'].var()` | Return the variance of the `'PRICE'` column |
| `df['PRICE'].std()` | Return the standard deviation of the `'PRICE'` column |
| `df['PRICE'].quantile([0.25, 0.75])` | Return the 25th and 75th percentiles of the `'PRICE'` column |
| `df.apply(summation)` | Apply a custom function to each column, e.g. `def summation(col): return col.sum()` |
| `df.cov(numeric_only=True)` | Compute the covariance of the numerical columns |
| `df.corr(numeric_only=True)` | Compute the correlation of the numerical columns |
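A sketch of the summary statistics and a custom `apply` on the fruits table:

```python
import pandas as pd

df = pd.DataFrame({
    'FRUITS': ['Mango', 'Apple', 'Banana', 'Orange'],
    'QUANTITY': [40, 20, 25, 10],
    'PRICE': [80, 100, 50, 70],
})

print(df['PRICE'].sum())     # 300
print(df['PRICE'].mean())    # 75.0
print(df['PRICE'].median())  # 75.0

# Custom function applied column-wise
def summation(col):
    return col.sum()

totals = df[['QUANTITY', 'PRICE']].apply(summation)
print(totals['QUANTITY'])  # 95
```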
| Command | Execution |
| --- | --- |
| `df.isnull()` | Check for null values element-wise |
| `df.isnull().sum()` | Return the count of null values in each column |
| `df['DISCOUNT'] = df['DISCOUNT'].fillna(value=VALUE)` | Fill the null values with the specified `VALUE`, which can be the mean, median, mode, or any other value |
| `df1 = df.dropna()` | Drop the rows containing null values |
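A sketch of counting and filling missing values; the `'DISCOUNT'` column and its values are assumptions mirroring the `fillna` example above:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'FRUITS': ['Mango', 'Apple', 'Banana', 'Orange'],
    'PRICE': [80, 100, 50, 70],
    'DISCOUNT': [5.0, np.nan, 10.0, np.nan],
})

null_counts = df.isnull().sum()  # per-column counts: DISCOUNT -> 2

# Fill the missing discounts with the column mean (7.5 here)
df['DISCOUNT'] = df['DISCOUNT'].fillna(value=df['DISCOUNT'].mean())

# Alternatively, drop any rows that still contain nulls
df1 = df.dropna()

print(null_counts['DISCOUNT'])  # 2
print(df['DISCOUNT'].tolist())  # [5.0, 7.5, 10.0, 7.5]
```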
| Command | Execution |
| --- | --- |
| `df['COL_NAME'] = COL_DATA` | Add a column to the existing DataFrame. Note: the length of `COL_DATA` must equal the number of rows |
| `df = df.assign(Paid_Price=lambda df: df['PRICE'] * df['QUANTITY'])` | Add a column computed from the values of existing columns |
| Command | Execution |
| --- | --- |
| `grouped = df.groupby(by='COL_NAME')` | Group the DataFrame by the unique values of the column `'COL_NAME'` |
| `grouped.agg(['count', 'sum', 'mean'])` | Return the count, sum, and mean of each group |
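A sketch of group-wise aggregation; the duplicated fruit names are an assumption so that the groups have more than one row:

```python
import pandas as pd

df = pd.DataFrame({
    'FRUITS': ['Mango', 'Apple', 'Mango', 'Apple'],
    'PRICE': [80, 100, 90, 110],
})

# Group by fruit and aggregate the PRICE column
grouped = df.groupby(by='FRUITS')['PRICE']
summary = grouped.agg(['count', 'sum', 'mean'])

print(summary.loc['Mango', 'sum'])   # 170
print(summary.loc['Apple', 'mean'])  # 105.0
```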
| Command | Execution |
| --- | --- |
| `df.groupby(['Origin']).size().plot.pie()` | Pie chart of the group sizes (assuming an `'Origin'` column) |
| `df.plot.scatter(x='PRICE', y='DISCOUNT')` | Scatter plot |
| `df.plot.bar(x='FRUITS', y=['QUANTITY', 'PRICE', 'DISCOUNT'])` | Bar chart |
| `df['QUANTITY'].plot.hist(bins=3)` | Histogram |
| `df.boxplot(column='PRICE', grid=False)` | Box plot |
Pandas Cheat Sheet for Data Science in Python
Pandas is a powerful and versatile library that allows you to work with data in Python. It offers a range of features and functions that make data analysis fast, easy, and efficient. Whether you are a data scientist, analyst, or engineer, Pandas can help you handle large datasets, perform complex operations, and visualize your results.
This Pandas Cheat Sheet is designed to help you master the basics of Pandas and boost your data skills. It covers the most common and useful commands and methods that you need to know when working with data in Python. You will learn how to create, manipulate, and explore data frames, how to apply various functions and calculations, how to deal with missing values and duplicates, how to merge and reshape data, and much more.
If you are new to Data Science using Python and Pandas, or if you want to refresh your memory, this cheat sheet is a handy reference that you can use anytime. It will save you time and effort by providing you with clear and concise examples of how to use Pandas effectively.