Steps to Call another Custom Python Function from a PySpark UDF
Let us go through a step-by-step process to call another custom Python function from a PySpark UDF.
Step 1: Import the necessary modules
First, import ‘udf’ from the ‘pyspark.sql.functions’ module, which provides tools for working with Spark DataFrames.
from pyspark.sql.functions import udf
Step 2: Start Spark Session
Next, import the required Spark modules and create a Spark session.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
Step 3: Create a Dataframe
The next step is to create a DataFrame on which the operations will be performed in Spark.
data = [("Marry", 25), ("Sunny", 30), ("Ram", 35)]
df = spark.createDataFrame(data, ["name", "age"])
Step 4: Define the custom Python function
Then define the custom Python function that we wish to invoke from the PySpark UDF. Any logic or calculations we need can go in this function; for example, a function that converts a string to uppercase.
def to_uppercase(string):
    return string.upper()
Step 5: Create a PySpark UDF
After creating the custom Python function, use the ‘udf()’ function from the ‘pyspark.sql.functions’ module to construct a PySpark UDF. The ‘udf()’ function receives the custom Python function as an argument. This registers the custom function as a UDF so that it may be applied to DataFrame columns.
to_uppercase_udf = udf(to_uppercase)
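By default, ‘udf()’ treats the return value as a string. If the function returns a different type, the return type can be passed explicitly as the second argument. A minimal sketch, reusing the same ‘to_uppercase’ function:
from pyspark.sql.types import StringType

# Equivalent to the call above, but with the return type stated explicitly
to_uppercase_udf = udf(to_uppercase, StringType())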
Step 6: Apply the UDF to a DataFrame
After creating the PySpark UDF, use the ‘withColumn()’ method to apply it to a DataFrame column. This method adds a new column to the DataFrame, or replaces an existing column of the same name. The UDF is called once for each row of the DataFrame, applying the custom Python function to the designated column and producing the desired result.
df = df.withColumn("name_uppercase", to_uppercase_udf(df["name"]))
Step 7: Display the DataFrame
Finally, use the ‘show()’ method to display the DataFrame and see the changes made to it.
df.show()
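Given the sample data above, the output should look like this:
+-----+---+--------------+
| name|age|name_uppercase|
+-----+---+--------------+
|Marry| 25|         MARRY|
|Sunny| 30|         SUNNY|
|  Ram| 35|           RAM|
+-----+---+--------------+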
By following these steps, we can execute customized calculations and transformations on PySpark DataFrames by calling another custom Python function from a PySpark UDF.
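To make the "another function" part explicit, the function wrapped as a UDF can itself call a separate helper. A minimal sketch, where the helper ‘add_greeting’ and the wrapper ‘greet_upper’ are illustrative names rather than part of the steps above:
def add_greeting(string):
    # Separate custom helper invoked from inside the UDF body
    return "Hello, " + string

def greet_upper(string):
    # The UDF calls the custom helper, then uppercases the result
    return add_greeting(string).upper()

greet_upper_udf = udf(greet_upper)
df = df.withColumn("greeting", greet_upper_udf(df["name"]))
PySpark serializes the wrapped function together with the helper it references and ships both to the executors, so the nested call behaves the same in a cluster as it does locally.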
Calling another custom Python function from a PySpark UDF
PySpark, the Python API for Apache Spark, was created for distributed data processing. It lets users perform complex computations and transformations on large datasets efficiently and at scale. One of PySpark’s main features is User-Defined Functions (UDFs), which let users write their own functions and apply them to Spark DataFrames or RDDs. Using UDFs, PySpark’s capabilities can be extended and customized to meet specific needs. In this article, we learn how to call another custom Python function from a PySpark UDF.