Create MapType in Spark DataFrame
First, define a map column type with PySpark's MapType() class. Then build the schema using the StructType() and StructField() functions. After that, create a DataFrame using the spark.createDataFrame() method, which takes the data as one of its parameters. In this example, the dataset is a list of tuples. Finally, the printSchema() and show() methods display the schema and the DataFrame as the output.
Python3
# Import the needed modules.
from pyspark.sql import SparkSession
from pyspark.sql.types import MapType, StringType, StructField, StructType

# A standalone MapType with string keys and string values.
# The third argument (valueContainsNull=False) means values may not be null.
mapCol = MapType(StringType(), StringType(), False)

# Define the schema: a string column and a map column.
schema = StructType([
    StructField('identification', StringType(), True),
    StructField('features', MapType(StringType(), StringType()), True)
])

# Create (or reuse) the Spark session.
spark = SparkSession.builder.appName('w3wiki').getOrCreate()

# Sample data: each tuple holds a name and a dictionary of features.
dataDictionary = [
    ('Shivaz', {'pupil': 'black', 'nails': 'white'}),
    ('Shivi', {'pupil': 'brown', 'nails': 'yellow'}),
    ('Shiv', {'pupil': 'green', 'nails': 'white'}),
    ('Shaz', {'pupil': 'grey', 'nails': 'yellow'}),
    ('Shiva', {'pupil': 'blue', 'nails': 'white'})
]

# Create the DataFrame, then print its schema and contents.
df = spark.createDataFrame(data=dataDictionary, schema=schema)
df.printSchema()
df.show(truncate=False)
Output:
How to get keys and values from Map Type column in Spark SQL DataFrame
In PySpark, MapType is a data type (not a function) used to define a DataFrame column that holds key-value pairs, such as a Python dictionary. MapType is comparable to HashMap in Java and to the dictionary in Python. It should not be confused with the map() transformation, which takes a collection and a function as input and returns a new collection as a result.
In PySpark, a map column type is created with the MapType() constructor from pyspark.sql.types (the Java and Scala APIs offer the equivalent createMapType() method on the DataTypes class). The constructor mainly takes two arguments, keyType and valueType, each of which must be an instance of a DataType subclass such as StringType, IntegerType, ArrayType, and many more. valueContainsNull is the third parameter, an optional boolean that defaults to True and signifies whether the map's values may be Null/None. Keys may never be null. To get the keys and values out of a map column, Spark SQL provides functions such as map_keys(), map_values(), and explode(), which are applied to every row of the column.
Features and functionalities of MapType:
- MapType columns are well suited to data transformation because of their flexibility: each row can carry a different set of keys.
- Various transformations, such as addition, multiplication, string concatenation, or others defined for the value's data type, can be applied to the values of a map.
- Operations on MapType columns are parallelized, which signifies that Spark can execute them on multiple threads and partitions to enhance performance on massive collections.
- Output is computed lazily, only when it is needed, which overall saves memory as well as run time.