Drop duplicate rows in PySpark DataFrame
In this article, we will drop duplicate rows from a PySpark DataFrame in Python using the distinct() and dropDuplicates() methods.
Let's create a sample DataFrame that contains duplicate rows:
Python3
# importing module
import pyspark

# importing SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession

# creating a SparkSession and giving it an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["1", "sravan", "company 1"],
        ["4", "sridevi", "company 1"]]

# specify column names
columns = ['Employee ID', 'Employee NAME', 'Company']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

print('Actual data in dataframe')
dataframe.show()
Output: a table of all six rows, in which the rows for Employee ID 1 and Employee ID 4 each appear twice.
Method 1: distinct()
distinct() returns a new DataFrame that contains only the unique rows. A row counts as a duplicate only when its values match another row in every column.
Syntax: dataframe.distinct()
where dataframe is the DataFrame created above from the nested list.
Python3
print('distinct data after dropping duplicate rows')

# display distinct data
dataframe.distinct().show()
Output: a table of the four unique rows (row order may vary).
We can combine select() with distinct() to get the distinct combinations of values in particular columns. The result contains only the selected columns.
Syntax: dataframe.select(['column 1', 'column n']).distinct().show()
Python3
# display distinct data in Employee ID and Employee NAME
dataframe.select(['Employee ID', 'Employee NAME']).distinct().show()
Output: a two-column table of the four distinct Employee ID and Employee NAME pairs (row order may vary).
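Method 2: dropDuplicates()
Called without arguments, dropDuplicates() compares all columns and returns the same set of rows as distinct().
Syntax: dataframe.dropDuplicates()
where dataframe is the DataFrame created above from the nested list.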
Python3
# remove duplicate data using the dropDuplicates() function
dataframe.dropDuplicates().show()
Output: a table of the four unique rows (row order may vary).
To remove duplicates based on specific columns, select those columns first and then call dropDuplicates(). The result contains only the selected columns:
Python3
# remove duplicate data using the dropDuplicates() function
# on two columns
dataframe.select(['Employee ID', 'Employee NAME']).dropDuplicates().show()
Output: a two-column table of the four distinct Employee ID and Employee NAME pairs (row order may vary).