Adding StructType columns to PySpark DataFrames

In this article, we are going to learn how to add StructType columns to PySpark DataFrames in Python.

PySpark is the Python API for Apache Spark. When creating a DataFrame in PySpark, you are not limited to flat columns: you can also create StructType columns, that is, columns whose values are themselves records with named fields. To do this, declare the StructType columns in the schema and then pass that schema when creating the DataFrame.

What is StructType?

StructType is the class used to specify the schema of a DataFrame programmatically. It is also the building block for complex columns: a StructType can be nested inside another StructType, or combined with array and map types. StructType can be imported with the following statement in Python:

from pyspark.sql.types import StructType

A StructType is a collection of StructField objects. StructField is the class that defines a single column: its name, its data type, whether it is nullable, and optional metadata.

Stepwise Implementation to add StructType columns to PySpark DataFrames:

Step 1: First, import the required classes: SparkSession, StructType, StructField, StringType, and IntegerType. SparkSession is used to create the session, StructType defines the structure of the DataFrame (or of a nested column), and StructField defines each individual column. StringType and IntegerType represent string and integer values respectively.

from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql import SparkSession

Step 2: Now, create a Spark session using the getOrCreate() method, which returns the existing session if there is one and creates a new one otherwise.

spark_session = SparkSession.builder.getOrCreate()

Step 3: Then, define the data set as a list of tuples. Each row is one tuple, and the value for a StructType column is itself a nested tuple.

data_set = [((nested_values_1), column_value_1),
            ((nested_values_2), column_value_2),
            ((nested_values_3), column_value_3)]

Step 4: Next, define the structure using the StructType and StructField classes. A StructType column is declared by passing a nested StructType as the data type of a StructField.

schema = StructType([StructField('column_1', StructType(
                    [StructField('nested_column_1', column_type(), True),
                    StructField('nested_column_2', column_type(), True),
                    StructField('nested_column_3', column_type(), True) ])),
                    StructField('column_2', column_type(), True)])

Step 5: Next, create the PySpark DataFrame from the data set and the schema. The StructType columns are added to the DataFrame because they are part of the schema.

df = spark_session.createDataFrame(data = data_set, schema = schema)

Step 6: Finally, display the data frame.

df.show()

Example 1:

In this example, the schema is a StructType with four StructFields: 'Full_Name', 'Date_Of_Birth', 'Gender', and 'Fees'. The 'Full_Name' field is itself a StructType containing three StructFields: 'First_Name', 'Middle_Name', and 'Last_Name'. We define the data set and then create the PySpark DataFrame according to this schema.

Python3

# Python program to add StructType columns to PySpark DataFrames 
  
# Import the libraries SparkSession, StructType, StructField, StringType, IntegerType 
from pyspark.sql import SparkSession 
from pyspark.sql.types import StructType, StructField, StringType, IntegerType 
  
# Create a spark session using getOrCreate() function 
spark_session = SparkSession.builder.getOrCreate() 
  
# Define the data set 
data_set = [(('Vinayak', '', 'Rai'),
             '2000-02-21', 'Male', 13000),
            (('Ria', 'Singh', 'Rajput'),
             '2004-01-06', 'Female', 10000)]
  
# Define the structure for the data frame by adding StructType columns 
schema = StructType([
    StructField('Full_Name', StructType([
        StructField('First_Name', StringType(), True),
        StructField('Middle_Name', StringType(), True),
        StructField('Last_Name', StringType(), True)
    ])),
    StructField('Date_Of_Birth', StringType(), True),
    StructField('Gender', StringType(), True),
    StructField('Fees', IntegerType(), True)
])
  
# Create the Pyspark data frame using createDataFrame function 
df = spark_session.createDataFrame(data = data_set, 
                                   schema = schema) 
  
# Display the data frame having StructType columns 
df.show()

Example 2:

In this example, the schema is a StructType with two StructFields: 'Date_Of_Birth' and 'Age'. The 'Date_Of_Birth' field is itself a StructType containing three StructFields: 'Year', 'Month', and 'Date'. We define the data set and then create the PySpark DataFrame according to this schema.

Python3

# Python program to add StructType columns to PySpark DataFrames 
  
# Import the libraries SparkSession, StructType, 
# StructField, StringType, IntegerType 
from pyspark.sql.types import StructType, StructField, StringType, IntegerType 
from pyspark.sql import SparkSession 
  
# Create a spark session using getOrCreate() function 
spark_session = SparkSession.builder.getOrCreate() 
  
# Define the data set 
data_set = [((2000,2,21),18), 
            ((1998,4,16),24), 
            ((1998,1,11),18), 
            ((2006,3,30),16)] 
  
# Define the structure for the data frame by adding StructType columns 
schema = StructType([
    StructField('Date_Of_Birth', StructType([
        StructField('Year', IntegerType(), True),
        StructField('Month', IntegerType(), True),
        StructField('Date', IntegerType(), True)
    ])),
    StructField('Age', IntegerType(), True)
])
  
# Create the Pyspark data frame using createDataFrame function 
df = spark_session.createDataFrame(data = data_set, 
                                   schema = schema) 
  
# Display the data frame having StructType columns 
df.show()
