How to use Window Function In Python

The window function is used for partitioning the columns in the dataframe

Syntax: Window.partitionBy(‘column_name_group’)

where, column_name_group is the column that contains multiple values for partition

We can partition the data column that contains group values and then use the aggregate functions like min(), max, etc to get the data. In this way, we are going to filter the data from the PySpark DataFrame with where clause.

Syntax: dataframe.withColumn(‘new column’, functions.max(‘column_name’).over(Window.partitionBy(‘column_name_group’))).where(functions.col(‘column_name’) == functions.col(‘new_column_name’))

where,

dataframe is the input dataframe

column_name_group is the column to be partitioned

column_name is to get the values with grouped column

new_column_name is the new filtered column

Example: PySpark program to filter only maximum rows from the dataframe from all departments

Python3

# importing module 
import pyspark 
  
# importing sparksession from pyspark.sql module 
from pyspark.sql import SparkSession 
  
#import functions 
from pyspark.sql import functions as f 
  
# import window module 
from pyspark.sql import Window 
  
# creating sparksession and giving an app name 
spark = SparkSession.builder.appName('sparkdf').getOrCreate() 
  
# list  of student  data 
data = [["1", "sravan", "IT", 45000], 
        ["2", "ojaswi", "CS", 85000], 
        ["3", "rohith", "CS", 41000], 
        ["4", "sridevi", "IT", 56000], 
        ["5", "bobby", "ECE", 45000], 
        ["6", "gayatri", "ECE", 49000], 
        ["7", "gnanesh", "CS", 45000], 
        ["8", "bhanu", "Mech", 21000] 
        ] 
  
# specify column names 
columns = ['ID', 'NAME', 'DEPT', 'FEE'] 
  
# creating a dataframe from the lists of data 
dataframe = spark.createDataFrame(data, columns) 
  
  
# display 
dataframe.withColumn('FEE max', f.max('FEE').over( 
    Window.partitionBy('DEPT'))).where( 
  f.col('FEE') == f.col('FEE max')).show() 

Output:

GroupBy and filter data in PySpark

In this article, we will Group and filter the data in PySpark using Python.

Let’s create the dataframe for demonstration:

Python3

# importing module 
import pyspark 
  
# importing sparksession from pyspark.sql module 
from pyspark.sql import SparkSession 
  
# creating sparksession and giving an app name 
spark = SparkSession.builder.appName('sparkdf').getOrCreate() 
  
# list  of student  data 
data = [["1", "sravan", "IT", 45000], 
        ["2", "ojaswi", "CS", 85000], 
        ["3", "rohith", "CS", 41000], 
        ["4", "sridevi", "IT", 56000], 
        ["5", "bobby", "ECE", 45000], 
        ["6", "gayatri", "ECE", 49000], 
        ["7", "gnanesh", "CS", 45000], 
        ["8", "bhanu", "Mech", 21000] 
        ] 
  
# specify column names 
columns = ['ID', 'NAME', 'DEPT', 'FEE'] 
  
# creating a dataframe from the lists of data 
dataframe = spark.createDataFrame(data, columns) 
  
# display 
dataframe.show() 

Output:

In PySpark, groupBy() is used to collect the identical data into groups on the PySpark DataFrame and perform aggregate functions on the grouped data. We have to use any one of the functions with groupby while using the method