A schema is a structured definition of a dataset: it specifies the columns of a table, their names, and their data types. In Spark, the structure of each row in a DataFrame is defined by its schema, and a schema is required for many operations, including filtering, joining, and querying data.
Concepts related to the topic
- StructType: A class that defines the schema of a DataFrame. It holds a list of StructField objects, one for each column.
- StructField: A class that specifies a single field of a DataFrame: its name, its data type, and a nullable flag indicating whether the field may contain null values.
- DataFrame: A distributed collection of data organized into named columns. It is similar to a table in a relational database and can be manipulated with SQL-style operations.
Step 1: Load the necessary libraries and functions and Create a SparkSession object
Python3
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql import Row
spark = SparkSession.builder.appName("Schema").getOrCreate()
spark
SparkSession - in-memory
SparkContext: Spark UI
Version: v3.3.1
Master: local[*]
AppName: Schema
Step 2: Define the schema
Python3
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])
Step 3: Create a list of employee data with 5 rows
Python3
data = [[101, "Sravan", 23],
        [102, "Akshat", 25],
        [103, "Pawan", 25],
        [104, "Gunjan", 24],
        [105, "Ritesh", 26]]
Step 4: Create a DataFrame from the data and the schema, and display it
Python3
df = spark.createDataFrame(data, schema=schema)
df.show()
+---+------+---+
| id| name|age|
+---+------+---+
|101|Sravan| 23|
|102|Akshat| 25|
|103| Pawan| 25|
|104|Gunjan| 24|
|105|Ritesh| 26|
+---+------+---+
Step 5: Print the schema
Python3
df.printSchema()
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
|-- age: integer (nullable = true)
Step 6: Stop the SparkSession
Python3
spark.stop()
Steps needed
- Create a StructType object defining the schema of the DataFrame.
- Create a list of StructField objects representing each column in the DataFrame.
- Create a Row object by passing the values of the columns in the same order as the schema.
- Create a DataFrame from the Row object and the schema using the createDataFrame() function.
Creating a DataFrame with multiple columns of different types using a schema
Python3
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql import Row
spark = SparkSession.builder.appName("example").getOrCreate()
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])
row = Row(id=100, name="Akshat", age=19)
df = spark.createDataFrame([row], schema=schema)
df.show()
df.printSchema()
spark.stop()
+---+------+---+
| id| name|age|
+---+------+---+
|100|Akshat| 19|
+---+------+---+
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
|-- age: integer (nullable = true)