Find All Duplicate Rows in a Pandas Dataframe
Below are the examples by which we can select duplicate rows in a DataFrame:
- Select Duplicate Rows Based on All Columns
- Get List of Duplicate Last Rows Based on All Columns
- Select List Of Duplicate Rows Using Single Columns
- Select List Of Duplicate Rows Using Multiple Columns
- Select Duplicate Rows Using Sort Values
Select Duplicate Rows Based on All Columns
Here, We do not pass any argument, therefore, it takes default values for both the arguments i.e. subset = None and keep = ‘first’.
Python3
# Import pandas library import pandas as pd # List of Tuples employees = [( 'Stuti' , 28 , 'Varanasi' ), ( 'Saumya' , 32 , 'Delhi' ), ( 'Aaditya' , 25 , 'Mumbai' ), ( 'Saumya' , 32 , 'Delhi' ), ( 'Saumya' , 32 , 'Delhi' ), ( 'Saumya' , 32 , 'Mumbai' ), ( 'Aaditya' , 40 , 'Dehradun' ), ( 'Seema' , 32 , 'Delhi' ) ] # Creating a DataFrame object df = pd.DataFrame(employees, columns = [ 'Name' , 'Age' , 'City' ]) # Selecting duplicate rows except first # occurrence based on all columns duplicate = df[df.duplicated()] print ( "Duplicate Rows :" ) # Print the resultant Dataframe duplicate |
Output
Duplicate Rows :
Name Age City
3 Saumya 32 Delhi
4 Saumya 32 Delhi
Get List of Duplicate Last Rows Based on All Columns
If you want to consider all duplicates except the last one then pass keep = ‘last’ as an argument.
Python3
# Import pandas library import pandas as pd # List of Tuples employees = [( 'Stuti' , 28 , 'Varanasi' ), ( 'Saumya' , 32 , 'Delhi' ), ( 'Aaditya' , 25 , 'Mumbai' ), ( 'Saumya' , 32 , 'Delhi' ), ( 'Saumya' , 32 , 'Delhi' ), ( 'Saumya' , 32 , 'Mumbai' ), ( 'Aaditya' , 40 , 'Dehradun' ), ( 'Seema' , 32 , 'Delhi' ) ] # Creating a DataFrame object df = pd.DataFrame(employees, columns = [ 'Name' , 'Age' , 'City' ]) # Selecting duplicate rows except last # occurrence based on all columns. duplicate = df[df.duplicated(keep = 'last' )] print ( "Duplicate Rows :" ) # Print the resultant Dataframe duplicate |
Output
Duplicate Rows :
Name Age City
3 Saumya 32 Delhi
4 Saumya 32 Delhi
Select List Of Duplicate Rows Using Single Columns
If you want to select duplicate rows based only on some selected columns then pass the list of column names in subset as an argument.
Python3
# import pandas library import pandas as pd # List of Tuples employees = [( 'Stuti' , 28 , 'Varanasi' ), ( 'Saumya' , 32 , 'Delhi' ), ( 'Aaditya' , 25 , 'Mumbai' ), ( 'Saumya' , 32 , 'Delhi' ), ( 'Saumya' , 32 , 'Delhi' ), ( 'Saumya' , 32 , 'Mumbai' ), ( 'Aaditya' , 40 , 'Dehradun' ), ( 'Seema' , 32 , 'Delhi' ) ] # Creating a DataFrame object df = pd.DataFrame(employees, columns = [ 'Name' , 'Age' , 'City' ]) # Selecting duplicate rows based # on 'City' column duplicate = df[df.duplicated( 'City' )] print ( "Duplicate Rows based on City :" ) # Print the resultant Dataframe duplicate |
Output
Duplicate Rows based on City :
Name Age City
3 Saumya 32 Delhi
4 Saumya 32 Delhi
5 Saumya 32 Mumbai
7 Saumya 32 Delhi
Select List Of Duplicate Rows Using Multiple Columns
In this example, a pandas DataFrame is created from a list of employee tuples with columns ‘Name,’ ‘Age,’ and ‘City.’ The code identifies and displays duplicate rows based on the ‘Name’ and ‘Age’ columns, highlighting instances where individuals share the same name and age.
Python3
# import pandas library import pandas as pd # List of Tuples employees = [( 'Stuti' , 28 , 'Varanasi' ), ( 'Saumya' , 32 , 'Delhi' ), ( 'Aaditya' , 25 , 'Mumbai' ), ( 'Saumya' , 32 , 'Delhi' ), ( 'Saumya' , 32 , 'Delhi' ), ( 'Saumya' , 32 , 'Mumbai' ), ( 'Aaditya' , 40 , 'Dehradun' ), ( 'Seema' , 32 , 'Delhi' ) ] # Creating a DataFrame object df = pd.DataFrame(employees, columns = [ 'Name' , 'Age' , 'City' ]) # Selecting duplicate rows based # on list of column names duplicate = df[df.duplicated([ 'Name' , 'Age' ])] print ( "Duplicate Rows based on Name and Age :" ) # Print the resultant Dataframe duplicate |
Output
Duplicate Rows based on City :
Name Age City
3 Saumya 32 Delhi
4 Saumya 32 Delhi
5 Saumya 32 Mumbai
Select Duplicate Rows Using Sort Values
In this example, a pandas DataFrame is created from a list of employee tuples, and duplicate rows based on the ‘Name’ and ‘Age’ columns are identified and displayed, with the resulting DataFrame sorted by the ‘Age’ column. The code showcases how to find and organize duplicate entries in a tabular data structure
Python3
import pandas as pd # List of Tuples employees = [( 'Stuti' , 28 , 'Varanasi' ), ( 'Saumya' , 32 , 'Delhi' ), ( 'Aaditya' , 25 , 'Mumbai' ), ( 'Saumya' , 32 , 'Delhi' ), ( 'Saumya' , 32 , 'Delhi' ), ( 'Saumya' , 32 , 'Mumbai' ), ( 'Aaditya' , 40 , 'Dehradun' ), ( 'Seema' , 32 , 'Delhi' ) ] # Creating a DataFrame object df = pd.DataFrame(employees, columns = [ 'Name' , 'Age' , 'City' ]) # Finding and sorting duplicate rows based on 'Name' and 'Age' duplicate_sorted = df[df.duplicated([ 'Name' , 'Age' ], keep = False )].sort_values( 'Age' ) print ( "Duplicate Rows based on Name and Age (sorted):" ) # Print the resultant DataFrame print (duplicate_sorted) |
Output
Duplicate Rows based on Name and Age (sorted):
Name Age City
1 Saumya 32 Delhi
3 Saumya 32 Delhi
4 Saumya 32 Delhi
5 Saumya 32 Mumbai
Find duplicate rows in a Dataframe based on all or selected columns
Duplicating rows in a DataFrame involves creating identical copies of existing rows within a tabular data structure, such as a pandas DataFrame, based on specified conditions or across all columns. This process allows for the replication of data to meet specific analytical or processing requirements. In this article, we will be discussing how to find duplicate rows in a Dataframe based on all or a list of columns. For this, we will use Dataframe.duplicated() method of Pandas.
Creating a Sample Pandas DataFrame
Let’s create a simple Dataframe with a dictionary of lists, say column names are: ‘Name’, ‘Age’, and ‘City’.
Python3
# Import pandas library import pandas as pd # List of Tuples employees = [( 'Stuti' , 28 , 'Varanasi' ), ( 'Saumya' , 32 , 'Delhi' ), ( 'Aaditya' , 25 , 'Mumbai' ), ( 'Saumya' , 32 , 'Delhi' ), ( 'Saumya' , 32 , 'Delhi' ), ( 'Saumya' , 32 , 'Mumbai' ), ( 'Aaditya' , 40 , 'Dehradun' ), ( 'Seema' , 32 , 'Delhi' ) ] # Creating a DataFrame object df = pd.DataFrame(employees, columns = [ 'Name' , 'Age' , 'City' ]) # Print the Dataframe df |
Output
Name Age City
0 Stuti 28 Varanasi
1 Saumya 32 Delhi
2 Aaditya 25 Mumbai
3 Saumya 32 Delhi
4 Saumya 32 Delhi
5 Saumya 32 Mumbai
6 Aaditya 40 Dehradun
7 Seema 32 Delhi