Methods to apply a custom schema to a PySpark DataFrame
- Applying custom schema by changing the name.
- Applying custom schema by changing the type.
- Applying custom schema by changing the metadata.
Method 1: Applying custom schema by changing the name
Whenever we create a data frame or load a CSV file, it comes with a predefined (inferred) schema. If that schema does not suit our needs, we can replace it with our own; this is known as applying a custom schema. A custom schema is built from fields, each with a ‘column_name‘ and a ‘column_type‘. In this method, we will see how to apply a customized schema to the data frame by changing the column names in the schema.
Syntax: StructType([StructField(‘column_name_1’, column_type(), Boolean_indication), ...])
Parameters:
- column_name_1, column_name_2: the column names given to the data frame while applying the custom schema.
- column_type: the data type assigned to each column while applying the custom schema.
- Boolean_indication: takes ‘True’ or ‘False’, indicating whether the column can contain null values.
Example:
In this example, we have defined the customized schema with columns ‘Student_Name’ of StringType, ‘Student_Age’ of IntegerType, ‘Student_Subject’ of StringType, ‘Student_Class’ of IntegerType, ‘Student_Fees’ of IntegerType. Then, we loaded the CSV file (link) whose schema is as follows:
Finally, we applied the customized schema to that CSV file by changing the names and displaying the updated schema of the data frame.
Python3
# PySpark - Apply custom schema to a DataFrame by changing names

# Import SparkSession and the schema types
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Create a Spark session using getOrCreate()
spark_session = SparkSession.builder.getOrCreate()

# Define the structure for the data frame
schema = StructType([
    StructField('Student_Name', StringType(), True),
    StructField('Student_Age', IntegerType(), True),
    StructField('Student_Subject', StringType(), True),
    StructField('Student_Class', IntegerType(), True),
    StructField('Student_Fees', IntegerType(), True)
])

# Apply the custom schema while reading the CSV file
df = spark_session.read.format("csv").schema(schema).option(
    "header", True).load("/content/student_data.csv")

# Display the updated schema
df.printSchema()
Output:
root
 |-- Student_Name: string (nullable = true)
 |-- Student_Age: integer (nullable = true)
 |-- Student_Subject: string (nullable = true)
 |-- Student_Class: integer (nullable = true)
 |-- Student_Fees: integer (nullable = true)
Method 2: Applying custom schema by changing the type
As you know, a custom schema has two fields, ‘column_name‘ and ‘column_type‘. In the previous method, we saw how to change the column names in the schema of the data frame; in this method, we will see how to apply a customized schema by changing the column types.
Example:
In this example, we read the CSV file (link), a 5×5 dataset, whose schema is as follows:
Then, we applied a custom schema by changing the type of the column ‘fees‘ from integer to float using the cast() function and printed the updated schema of the data frame.
Python3
# Apply custom schema to a DataFrame by changing column type

# Import SparkSession
from pyspark.sql import SparkSession

# Create a Spark session using getOrCreate()
spark_session = SparkSession.builder.getOrCreate()

# Read the CSV file
data_frame = spark_session.read.csv('/content/student_data.csv',
                                    sep=',', inferSchema=True, header=True)

# Change the type of the 'fees' column to float
data_frame1 = data_frame.withColumn('fees', data_frame['fees'].cast('float'))

# Display the updated schema
data_frame1.printSchema()
Output:
root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- subject: string (nullable = true)
 |-- class: integer (nullable = true)
 |-- fees: float (nullable = true)
Method 3: Applying custom schema by changing the metadata
A custom schema usually has the two fields ‘column_name‘ and ‘column_type‘, but we can also define a third field, ‘metadata‘, which is a small description of the column. In this method, we will see how to apply a customized schema with metadata to the data frame.
Example:
In this example, we defined the customized schema, attaching a short description to each column as metadata:
- ‘Student_Name’ of StringType with metadata ‘Name of the student’
- ‘Student_Age’ of IntegerType with metadata ‘Age of the student’
- ‘Student_Subject’ of StringType with metadata ‘Subject of the student’
- ‘Student_Class’ of IntegerType with metadata ‘Class of the student’
- ‘Student_Fees’ of IntegerType with metadata ‘Fees of the student’
Then, we loaded the CSV file (link) whose schema is as follows:
Finally, we applied the customized schema to that CSV file and displayed the schema of the data frame along with the metadata.
Python3
# PySpark - Apply custom schema to a DataFrame by changing metadata

# Import SparkSession and the schema types
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Create a Spark session using getOrCreate()
spark_session = SparkSession.builder.getOrCreate()

# Define the schema for the data frame, attaching a short
# description to each column through the metadata argument
schema = StructType([
    StructField('Student_Name', StringType(), True,
                metadata={"desc": "Name of the student"}),
    StructField('Student_Age', IntegerType(), True,
                metadata={"desc": "Age of the student"}),
    StructField('Student_Subject', StringType(), True,
                metadata={"desc": "Subject of the student"}),
    StructField('Student_Class', IntegerType(), True,
                metadata={"desc": "Class of the student"}),
    StructField('Student_Fees', IntegerType(), True,
                metadata={"desc": "Fees of the student"})
])

# Apply the custom schema while reading the CSV file
df = spark_session.read.format("csv").schema(schema).option(
    "header", True).load("/content/student_data.csv")

# Display the updated schema of the data frame
df.printSchema()

# Run a loop to display the metadata description of each column
for i in range(len(df.columns)):
    print('Column', i + 1, ':', df.schema.fields[i].metadata["desc"])
Output:
root
 |-- Student_Name: string (nullable = true)
 |-- Student_Age: integer (nullable = true)
 |-- Student_Subject: string (nullable = true)
 |-- Student_Class: integer (nullable = true)
 |-- Student_Fees: integer (nullable = true)

Column 1 : Name of the student
Column 2 : Age of the student
Column 3 : Subject of the student
Column 4 : Class of the student
Column 5 : Fees of the student
PySpark – Apply custom schema to a DataFrame
In this article, we are going to apply a custom schema to a data frame using PySpark in Python.
A PySpark data frame is a distributed collection of rows under named columns. Usually, the schema of a PySpark data frame is inferred from the data itself, but PySpark also lets us customize the schema according to our needs. This is done by defining a new schema and loading it into the respective data frame. Read on to learn about it in detail.
What is Schema?
The structure of the data frame, which we can inspect by calling the printSchema() method on the data frame object, is known as the schema in PySpark. Basically, the schema defines the structure of the data frame: the data type of each column and a boolean indicating whether a column's values can be null. The schema is defined using the StructType class, which is a collection of StructField objects that specify the column name, column type, nullability, and metadata.