How to use dtypes in Python
Here we use the dtypes attribute together with the startswith() method to get the columns of a particular type.
Syntax: dataframe[[item[0] for item in dataframe.dtypes if item[1].startswith('datatype')]]
where,
- dataframe is the input dataframe
- datatype is the type keyword to match (e.g. int, string, float, double)
- item is a (column name, type string) pair returned by dataframe.dtypes
Finally, we use the collect() method to display the column data.
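To see what the comprehension in the syntax above operates on: dataframe.dtypes returns a list of (column name, type string) pairs. The sketch below filters a hardcoded list that mirrors the ID/NAME/GPA/FEE schema used later in this article; the list itself is illustrative, and on a real DataFrame you would read dataframe.dtypes instead:

```python
# Illustrative stand-in for dataframe.dtypes, which yields
# (column name, type string) pairs for each column.
dtypes = [("ID", "int"), ("NAME", "string"),
          ("GPA", "float"), ("FEE", "double")]

# The same comprehension as in the syntax above, minus the DataFrame:
# keep only the names of columns whose type starts with "int".
int_cols = [item[0] for item in dtypes if item[1].startswith("int")]
print(int_cols)  # -> ['ID']
```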
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# import data field types
from pyspark.sql.types import (StringType, DoubleType,
                               IntegerType, StructType,
                               StructField, FloatType)

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of student data
data = [(1, "sravan", 9.8, 4500.00),
        (2, "ojsawi", 9.2, 6789.00),
        (3, "bobby", 8.9, 988.000)]

# specify column names with data types
columns = StructType([
    StructField("ID", IntegerType(), True),
    StructField("NAME", StringType(), True),
    StructField("GPA", FloatType(), True),
    StructField("FEE", DoubleType(), True),
])

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# select columns that are integer type
print(dataframe[[item[0] for item in dataframe.dtypes
                 if item[1].startswith('int')]].collect())

# select columns that are string type
print(dataframe[[item[0] for item in dataframe.dtypes
                 if item[1].startswith('string')]].collect())

# select columns that are float type
print(dataframe[[item[0] for item in dataframe.dtypes
                 if item[1].startswith('float')]].collect())

# select columns that are double type
print(dataframe[[item[0] for item in dataframe.dtypes
                 if item[1].startswith('double')]].collect())
Output:
[Row(ID=1), Row(ID=2), Row(ID=3)]
[Row(NAME='sravan'), Row(NAME='ojsawi'), Row(NAME='bobby')]
[Row(GPA=9.800000190734863), Row(GPA=9.199999809265137), Row(GPA=8.899999618530273)]
[Row(FEE=4500.0), Row(FEE=6789.0), Row(FEE=988.0)]
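As an alternative sketch (not part of the original example), the same filtered list of names can be passed to DataFrame.select(), which is the more common PySpark idiom for projecting columns. The hardcoded dtypes list below stands in for dataframe.dtypes so the filtering logic can be shown on its own:

```python
# Stand-in for dataframe.dtypes on the ID/NAME/GPA/FEE schema above.
dtypes = [("ID", "int"), ("NAME", "string"),
          ("GPA", "float"), ("FEE", "double")]

# Collect all numeric columns in one pass by matching several keywords.
numeric_cols = [name for name, typ in dtypes
                if typ in ("int", "float", "double")]
print(numeric_cols)  # -> ['ID', 'GPA', 'FEE']

# On a real DataFrame you would then project them with:
# dataframe.select(numeric_cols).show()
```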
PySpark – Select columns by type
In this article, we will discuss how to select columns by type in PySpark using Python.
Let's create a dataframe for demonstration:
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# import data field types
from pyspark.sql.types import (StringType, DoubleType,
                               IntegerType, StructType,
                               StructField, FloatType)

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of student data
data = [(1, "sravan", 9.8, 4500.00),
        (2, "ojsawi", 9.2, 6789.00),
        (3, "bobby", 8.9, 988.000)]

# specify column names with data types
columns = StructType([
    StructField("ID", IntegerType(), True),
    StructField("NAME", StringType(), True),
    StructField("GPA", FloatType(), True),
    StructField("FEE", DoubleType(), True),
])

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# display
dataframe.show()
Output:
We can select columns by type using the following keywords:
- Integer: int
- String : string
- Float: float
- Double: double
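The keyword matching above can be wrapped in a small helper. The function name columns_of_type is my own illustrative choice, not part of the article or the PySpark API; it takes a dtypes-style list of (name, type) pairs so it can be shown without a Spark session, but on a real DataFrame you would pass dataframe.dtypes:

```python
def columns_of_type(dtypes, keyword):
    """Return the names of columns whose type string starts with keyword.

    dtypes is a list of (column name, type string) pairs, the same shape
    that DataFrame.dtypes returns; keyword is one of the type keywords
    listed above ('int', 'string', 'float', 'double').
    """
    return [name for name, typ in dtypes if typ.startswith(keyword)]


# Stand-in for dataframe.dtypes on the article's schema.
dtypes = [("ID", "int"), ("NAME", "string"),
          ("GPA", "float"), ("FEE", "double")]

print(columns_of_type(dtypes, "string"))  # -> ['NAME']
print(columns_of_type(dtypes, "double"))  # -> ['FEE']
```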