A schema is a structured definition of a dataset: it specifies the columns of a table, their names, and their data types. In Spark, the structure of each row in a DataFrame is defined by its schema, and a schema is required for many operations, including filtering, joining, and querying data.
Concepts related to the topic
- StructType: A class that defines the schema of a DataFrame. It holds a list of StructField objects, one for each column.
- StructField: A class that specifies a single field of a DataFrame: its name, its data type, and a nullable flag indicating whether the field may contain null values.
- DataFrame: A distributed collection of data organized into named columns. It is similar to a table in a relational database and can be manipulated with SQL-style operations.
Step 1: Load the necessary libraries and functions and Create a SparkSession object
Python3
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql import Row
spark = SparkSession.builder.appName("Schema").getOrCreate()
spark
SparkSession - in-memory
SparkContext: Spark UI
Version: v3.3.1
Master: local[*]
AppName: Schema
Step 2: Define the schema
Python3
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])
Step 3: Create a list of employee data with 5 rows
Python3
data = [[101, "Sravan", 23],
        [102, "Akshat", 25],
        [103, "Pawan", 25],
        [104, "Gunjan", 24],
        [105, "Ritesh", 26]]
Step 4: Create a DataFrame from the data and the schema, and display it
Python3
df = spark.createDataFrame(data, schema=schema)
df.show()
+---+------+---+
| id| name|age|
+---+------+---+
|101|Sravan| 23|
|102|Akshat| 25|
|103| Pawan| 25|
|104|Gunjan| 24|
|105|Ritesh| 26|
+---+------+---+
Step 5: Print the schema
Python3
df.printSchema()
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
|-- age: integer (nullable = true)
Step 6: Stop the SparkSession
Python3
spark.stop()
Steps needed
- Create a StructType object defining the schema of the DataFrame.
- Create a list of StructField objects representing each column in the DataFrame.
- Create a Row object by passing the values of the columns in the same order as the schema.
- Create a DataFrame from the Row object and the schema using the createDataFrame() function.
Creating a DataFrame with multiple columns of different types using a schema
Python3
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql import Row
spark = SparkSession.builder.appName("example").getOrCreate()
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])
row = Row(id=100, name="Akshat", age=19)
df = spark.createDataFrame([row], schema=schema)
df.show()
df.printSchema()
spark.stop()
+---+------+---+
| id| name|age|
+---+------+---+
|100|Akshat| 19|
+---+------+---+
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
|-- age: integer (nullable = true)