How to use DataFrame.select() in Python

Here we use the select() function, which selects columns from a dataframe. Combined with cast(), it returns a new dataframe in which the chosen columns have a different type.

Syntax: dataframe.select(columns)

where dataframe is the input dataframe and columns are the columns (or column expressions) to select.

Example 1: Casting individual columns.

Let us convert `course_df3` from the schema shown above back to the original schema.

Python

from pyspark.sql.types import StringType, BooleanType, IntegerType

# Keep the first three columns as they are and cast the remaining
# three; alias() makes the name of each cast column explicit
course_df4 = course_df3.select(
    course_df3.Name,
    course_df3.Course_Name,
    course_df3.Duration_Months,
    course_df3.Course_Fees.cast(IntegerType()).alias('Course_Fees'),
    course_df3.Start_Date.cast(StringType()).alias('Start_Date'),
    course_df3.Payment_Done.cast(BooleanType()).alias('Payment_Done'),
)

course_df4.printSchema()

Output:

root
 |-- Name: string (nullable = true)
 |-- Course_Name: string (nullable = true)
 |-- Duration_Months: long (nullable = true)
 |-- Course_Fees: integer (nullable = true)
 |-- Start_Date: string (nullable = true)
 |-- Payment_Done: boolean (nullable = true)

Example 2: Changing multiple columns to the same datatype.

Python

# Change the datatype of every column
# to string type
from pyspark.sql.types import StringType

# cast() is a Column method, so each column is
# looked up by name before it is cast
course_df5 = course_df.select(
    [course_df[c].cast(StringType())
     .alias(c) for c in course_df.columns]
)
course_df5.printSchema()

Output:

root
 |-- Name: string (nullable = true)
 |-- Course_Name: string (nullable = true)
 |-- Duration_Months: string (nullable = true)
 |-- Course_Fees: string (nullable = true)
 |-- Start_Date: string (nullable = true)
 |-- Payment_Done: string (nullable = true)

Example 3: Changing multiple columns to different datatypes.

Let us use `course_df5`, in which every column has type `string`. We will cast each column to the type that suits its contents.

Python

from pyspark.sql.types import ( 
    StringType, BooleanType, IntegerType, FloatType, DateType 
) 
  
coltype_map = { 
    "Name": StringType(), 
    "Course_Name": StringType(), 
    "Duration_Months": IntegerType(), 
    "Course_Fees": FloatType(), 
    "Start_Date": DateType(), 
    "Payment_Done": BooleanType(), 
} 
  
# course_df5 has all the column
# types as string; cast each column
# to the type mapped to its name
course_df6 = course_df5.select(
    [course_df5[c].cast(coltype_map[c])
     .alias(c) for c in course_df5.columns]
)
course_df6.printSchema()

Output:

root
 |-- Name: string (nullable = true)
 |-- Course_Name: string (nullable = true)
 |-- Duration_Months: integer (nullable = true)
 |-- Course_Fees: float (nullable = true)
 |-- Start_Date: date (nullable = true)
 |-- Payment_Done: boolean (nullable = true)

How to Change Column Type in PySpark Dataframe?

In this article, we are going to see how to change the column types of a PySpark dataframe.

Creating dataframe for demonstration:

Python

# Create a spark session 
from pyspark.sql import SparkSession 
spark = SparkSession.builder.appName('SparkExamples').getOrCreate() 
  
# Create a spark dataframe 
columns = ["Name", "Course_Name", 
           "Duration_Months", 
           "Course_Fees", "Start_Date", 
           "Payment_Done"] 
data = [ 
    ("Amit Pathak", "Python", 3, 
     10000, "02-07-2021", True), 
    ("Shikhar Mishra", "Soft skills", 
     2, 8000, "07-10-2021", False), 
    ("Shivani Suvarna", "Accounting", 
     6, 15000, "20-08-2021", True), 
    ("Pooja Jain", "Data Science", 12, 
     60000, "02-12-2021", False), 
] 
course_df = spark.createDataFrame(data).toDF(*columns) 
  
# View the dataframe 
course_df.show() 

Let's see the schema of the dataframe:

Python

# View the column datatypes 
course_df.printSchema()

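Output:

root
 |-- Name: string (nullable = true)
 |-- Course_Name: string (nullable = true)
 |-- Duration_Months: long (nullable = true)
 |-- Course_Fees: long (nullable = true)
 |-- Start_Date: string (nullable = true)
 |-- Payment_Done: boolean (nullable = true)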
