How to use dtypes in Python
Here we use the dtypes attribute together with the startswith() method to get the columns of a particular type.
Syntax: dataframe[[item[0] for item in dataframe.dtypes if item[1].startswith('datatype')]]
where,
- dataframe is the input dataframe
- datatype is the type keyword to match (e.g. int, string, float, double)
- item is a (column name, type string) pair returned by dataframe.dtypes
Finally, we use the collect() method to display the column data.
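To see what the comprehension in the syntax above operates on: dataframe.dtypes returns a list of (column name, type string) pairs. The sketch below filters a hardcoded list that mirrors the ID/NAME/GPA/FEE schema used later in this article; the list itself is illustrative, and on a real DataFrame you would read dataframe.dtypes instead:

```python
# Illustrative stand-in for dataframe.dtypes, which yields
# (column name, type string) pairs for each column.
dtypes = [("ID", "int"), ("NAME", "string"),
          ("GPA", "float"), ("FEE", "double")]

# The same comprehension as in the syntax above, minus the DataFrame:
# keep only the names of columns whose type starts with "int".
int_cols = [item[0] for item in dtypes if item[1].startswith("int")]
print(int_cols)  # -> ['ID']
```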
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# import data field types
from pyspark.sql.types import (StringType, DoubleType,
                               IntegerType, StructType,
                               StructField, FloatType)

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of student data
data = [(1, "sravan", 9.8, 4500.00),
        (2, "ojsawi", 9.2, 6789.00),
        (3, "bobby", 8.9, 988.000)]

# specify column names with data types
columns = StructType([
    StructField("ID", IntegerType(), True),
    StructField("NAME", StringType(), True),
    StructField("GPA", FloatType(), True),
    StructField("FEE", DoubleType(), True),
])

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# select columns that are integer type
print(dataframe[[item[0] for item in dataframe.dtypes
                 if item[1].startswith('int')]].collect())

# select columns that are string type
print(dataframe[[item[0] for item in dataframe.dtypes
                 if item[1].startswith('string')]].collect())

# select columns that are float type
print(dataframe[[item[0] for item in dataframe.dtypes
                 if item[1].startswith('float')]].collect())

# select columns that are double type
print(dataframe[[item[0] for item in dataframe.dtypes
                 if item[1].startswith('double')]].collect())
Output:
[Row(ID=1), Row(ID=2), Row(ID=3)]
[Row(NAME='sravan'), Row(NAME='ojsawi'), Row(NAME='bobby')]
[Row(GPA=9.800000190734863), Row(GPA=9.199999809265137), Row(GPA=8.899999618530273)]
[Row(FEE=4500.0), Row(FEE=6789.0), Row(FEE=988.0)]
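As an alternative sketch (not part of the original example), the same filtered list of names can be passed to DataFrame.select(), which is the more common PySpark idiom for projecting columns. The hardcoded dtypes list below stands in for dataframe.dtypes so the filtering logic can be shown on its own:

```python
# Stand-in for dataframe.dtypes on the ID/NAME/GPA/FEE schema above.
dtypes = [("ID", "int"), ("NAME", "string"),
          ("GPA", "float"), ("FEE", "double")]

# Collect all numeric columns in one pass by matching several keywords.
numeric_cols = [name for name, typ in dtypes
                if typ in ("int", "float", "double")]
print(numeric_cols)  # -> ['ID', 'GPA', 'FEE']

# On a real DataFrame you would then project them with:
# dataframe.select(numeric_cols).show()
```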
PySpark – Select columns by type
In this article, we will discuss how to select columns by type in PySpark using Python.
Let's create a dataframe for demonstration:
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# import data field types
from pyspark.sql.types import (StringType, DoubleType,
                               IntegerType, StructType,
                               StructField, FloatType)

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of student data
data = [(1, "sravan", 9.8, 4500.00),
        (2, "ojsawi", 9.2, 6789.00),
        (3, "bobby", 8.9, 988.000)]

# specify column names with data types
columns = StructType([
    StructField("ID", IntegerType(), True),
    StructField("NAME", StringType(), True),
    StructField("GPA", FloatType(), True),
    StructField("FEE", DoubleType(), True),
])

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# display
dataframe.show()
Output:
We can select columns by type using the following keywords:
- Integer: int
- String : string
- Float: float
- Double: double
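The keyword matching above can be wrapped in a small helper. The function name columns_of_type is my own illustrative choice, not part of the article or the PySpark API; it takes a dtypes-style list of (name, type) pairs so it can be shown without a Spark session, but on a real DataFrame you would pass dataframe.dtypes:

```python
def columns_of_type(dtypes, keyword):
    """Return the names of columns whose type string starts with keyword.

    dtypes is a list of (column name, type string) pairs, the same shape
    that DataFrame.dtypes returns; keyword is one of the type keywords
    listed above ('int', 'string', 'float', 'double').
    """
    return [name for name, typ in dtypes if typ.startswith(keyword)]


# Stand-in for dataframe.dtypes on the article's schema.
dtypes = [("ID", "int"), ("NAME", "string"),
          ("GPA", "float"), ("FEE", "double")]

print(columns_of_type(dtypes, "string"))  # -> ['NAME']
print(columns_of_type(dtypes, "double"))  # -> ['FEE']
```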