Create new column with function in Spark Dataframe
In this article, we are going to learn how to create a new column with a function in the PySpark data frame in Python.
PySpark is the Python API for Apache Spark, an engine for distributed data processing that provides high-level APIs for working with big data. The DataFrame class is a key component of PySpark, as it lets you manipulate tabular data across a cluster. To follow this article, you need PySpark installed and some familiarity with basic data frame operations.
Define a function:
In this step, we define a function that will be used to create the new column. The function takes an age as input and returns a string indicating whether the person is a “child” or an “adult”. It is then wrapped with udf() so that Spark can apply it to a column.
Python3
from pyspark.sql.functions import udf

# Define a plain Python function
def age_group(age):
    if age < 18:
        return "child"
    else:
        return "adult"

# Wrap age_group as a Spark UDF so it can be applied to a column
# (the default return type is string)
age_group_udf = udf(age_group)
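One caveat with the function above: Spark passes a null cell to a Python UDF as None, and in Python 3 the comparison None < 18 raises a TypeError. The following is a minimal sketch of a null-tolerant variant, assuming that missing ages should simply stay missing. The helper name add_age_group and the explicit StringType() return type are illustrative choices, not part of the original example.

```python
from typing import Optional


def age_group(age: Optional[int]) -> Optional[str]:
    """Label an age as "child" or "adult"; a missing age stays missing."""
    if age is None:
        return None
    return "child" if age < 18 else "adult"


def add_age_group(df):
    """Hypothetical helper: return df with an extra "age_group" column."""
    # Imported here so the plain function above can be tested without Spark
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    # Declare the return type explicitly instead of relying on the default
    age_group_udf = udf(age_group, StringType())
    return df.withColumn("age_group", age_group_udf(df["age"]))
```

Because age_group is an ordinary Python function, it can be checked on a few values (for example 17, 18, and None) before it is ever handed to Spark.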
Create a new column with a function using the withColumn() method in PySpark
In this approach, we add a new column to a data frame by defining a custom function and applying it to the data frame through a UDF. The UDF takes a column of the data frame as input, applies the custom function to each value in it, and returns the results as a new column.
Here are the steps for using the withColumn() method to create a new column called “age_group” in our data frame:
Python3
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

# create a SparkSession
spark = SparkSession.builder.getOrCreate()

# create a list of tuples
data = [("Alice", 1), ("Bob", 2), ("Charlie", 3)]

# create a DataFrame from the list
df = spark.createDataFrame(data, ["name", "age"])

print("Dataframe Before adding new col : ")
# view the DataFrame
df.show()

def age_group(age):
    if age < 18:
        return "child"
    else:
        return "adult"

age_group_udf = udf(age_group)

# create a new column called "age_group"
df = df.withColumn("age_group", age_group_udf(df.age))

print("Dataframe After adding new col : ")
# view the DataFrame with the new column
df.show()
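Output: the first table lists the name and age columns. The second table adds an age_group column, which holds “child” in all three rows because every age in the sample data is below 18.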
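As a mental model only, the effect of withColumn() with a UDF can be imitated in plain Python: the function is called once per row, and its result is appended as an extra field. The sketch below does not use Spark at all and says nothing about how Spark distributes the work. The row ("Dana", 42) is a hypothetical addition, included so that both branches of the function are exercised.

```python
def age_group(age):
    """Same rule as in the example: under 18 is "child", otherwise "adult"."""
    return "child" if age < 18 else "adult"


# Rows from the example, plus one hypothetical adult row
rows = [("Alice", 1), ("Bob", 2), ("Charlie", 3), ("Dana", 42)]

# Conceptually, the UDF is evaluated for each row's age value
# and the result becomes the new third field
with_group = [(name, age, age_group(age)) for name, age in rows]

for row in with_group:
    print(row)
```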
Create a new column with a function using the PySpark UDFs method
In this approach, we add a new column to a data frame by defining a custom function and registering it as a UDF with the spark.udf.register() method, which makes it callable by name inside SQL expressions. We then use the selectExpr() method of the data frame to select the existing columns together with the new column, which is produced by applying the registered UDF to the age column. The code below walks through these steps:
Python3
# Import required modules
from pyspark.sql import SparkSession

# Creating the SparkSession
spark = SparkSession.builder.appName("Creating new column using UDF").getOrCreate()

# Creating the DataFrame
data = [("John", 25), ("Mike", 30), ("Emily", 35)]
columns = ["name", "age"]
df = spark.createDataFrame(data, columns)

print("Dataframe Before adding new col : ")
df.show()

# Define the UDF
def my_udf(age):
    if age < 30:
        return "Young"
    else:
        return "Old"

# Register the UDF as a function
udf_age_group = spark.udf.register("age_group", my_udf)

# Use the UDF in a select statement
df = df.selectExpr("name", "age", "age_group(age) as age_group")

print("Dataframe After adding new col : ")
# Show the DataFrame
df.show()
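Output: the first table lists the name and age columns. The second table adds an age_group column holding “Young” for John (25) and “Old” for Mike (30) and Emily (35).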
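A UDF registered with spark.udf.register() can also be called from a full SQL query, not only from selectExpr(). The following is a minimal sketch under a few assumptions: the helper name add_age_group_with_sql and the view name people are hypothetical, and the explicit StringType() return type is an illustrative choice rather than something the original code specifies.

```python
def my_udf(age):
    """Same rule as in the example: under 30 is "Young", otherwise "Old"."""
    return "Young" if age < 30 else "Old"


def add_age_group_with_sql(spark, df):
    """Hypothetical helper: add an age_group column through a SQL query."""
    # Imported here so my_udf can be tested without a Spark installation
    from pyspark.sql.types import StringType

    # Register the function under the SQL name "age_group"
    spark.udf.register("age_group", my_udf, StringType())

    # Expose the DataFrame to SQL under a temporary view name (an assumption)
    df.createOrReplaceTempView("people")

    return spark.sql(
        "SELECT name, age, age_group(age) AS age_group FROM people"
    )
```

With the session and data frame from the example, add_age_group_with_sql(spark, df).show() would display the same three columns as the selectExpr() version.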