Rename Nested Field in Spark DataFrame

Rename Field in Spark DataFrame

If a DataFrame has nested columns, renaming a nested field means redefining the structure of the enclosing struct column. First define a schema that carries the new field names, then apply it by casting the column to that schema, using the following code structure:

df.select(col("address").cast(struct_schema)).printSchema()

Create the DataFrame.

Python3

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Create (or reuse) a SparkSession
spark = SparkSession.builder.appName("RenameNestedField").getOrCreate()

# Define the schema for the DataFrame
schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType()),
    StructField("address", StructType([
        StructField("street", StringType()),
        StructField("city", StringType()),
        StructField("zip", IntegerType())
    ]))
])

# Create the DataFrame
data = [("Alice", 25, {"street": "Main St", "city": "Anytown", "zip": 12345}),
        ("Bob", 30, {"street": "Park Ave", "city": "New York", "zip": 56789})]
df = spark.createDataFrame(data, schema)

# Show the DataFrame without truncating the address column
df.show(truncate=False)
# Print the schema
df.printSchema()

Output:

+-----+---+---------------------------+
|name |age|address                    |
+-----+---+---------------------------+
|Alice|25 |{Main St, Anytown, 12345}  |
|Bob  |30 |{Park Ave, New York, 56789}|
+-----+---+---------------------------+

root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- address: struct (nullable = true)
 |    |-- street: string (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- zip: integer (nullable = true)

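To rename the fields, redefine the structure of the struct column: define a new schema that gives each field its new name and data type, then cast the column to it. The cast matches fields by position, so the new schema must list the same number of fields in the same order as the original struct.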

Python3

# Import the libraries
from pyspark.sql.types import IntegerType, StringType, StructField, StructType
from pyspark.sql.functions import col

# Define the schema with the new field names
struct_schema = StructType([
    StructField("Street_name", StringType()),
    StructField("city_name", StringType()),
    StructField("Zip_code", IntegerType())
])
# Apply the schema by casting the address column
df.select(col("address").cast(struct_schema)).printSchema()

Output:

root
 |-- address: struct (nullable = true)
 |    |-- Street_name: string (nullable = true)
 |    |-- city_name: string (nullable = true)
 |    |-- Zip_code: integer (nullable = true)

Rename Nested Field in Spark DataFrame in Python

This article discusses ways to rename columns in a DataFrame, such as withColumnRenamed and select. In Apache Spark, the withColumnRenamed method renames a top-level column: you pass the existing name and the new name, and it returns a new DataFrame with the renamed column. It does not reach inside a struct, so renaming a nested field requires redefining the struct column, for example by casting it to a schema with the new field names.

Required Package

PySpark is the Python library for Spark programming. It lets developers interact with a Spark cluster from Python and run distributed computations on large datasets using the Spark engine, which makes it well suited to large-scale data processing and analysis. You can install PySpark with the following command (the leading ! runs it from a notebook cell; omit it in a regular shell):

!pip install pyspark
