Union() function in pyspark

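The PySpark union() function combines two DataFrames that have the same structure or schema; to combine more than two, chain the calls. Columns are matched by position, not by name. union() raises an error when the DataFrames have a different number of columns, but if only the column order differs, the rows are still combined and the values end up under the wrong headers.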

Syntax: data_frame1.union(data_frame2)

Where,

data_frame1 and data_frame2 are the dataframes

Example 1:

Python3

# Python program to illustrate the 
# working of union() function 
import pyspark 
from pyspark.sql import SparkSession 
  
spark = SparkSession.builder.appName('w3wiki.com').getOrCreate() 
  
# Creating a dataframe 
data_frame1 = spark.createDataFrame( 
    [("Bhuwanesh", 82.98), ("Harshit", 80.31)], 
    ["Student Name", "Overall Percentage"] 
) 
  
# Creating another dataframe 
data_frame2 = spark.createDataFrame( 
    [("Naveen", 91.123), ("Piyush", 90.51)], 
    ["Student Name", "Overall Percentage"] 
) 
  
# union() 
answer = data_frame1.union(data_frame2) 
  
# Print the result of the union() 
answer.show() 

Output:

+------------+------------------+
|Student Name|Overall Percentage|
+------------+------------------+
|   Bhuwanesh|             82.98|
|     Harshit|             80.31|
|      Naveen|            91.123|
|      Piyush|             90.51|
+------------+------------------+
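Since union() accepts a single DataFrame argument, combining more than two DataFrames means applying it repeatedly. A minimal sketch of one way to do that with functools.reduce is shown below; the names union_all and demo, the app name, and the third sample row's placement in its own DataFrame are assumptions made for illustration, not part of the PySpark API.

```python
from functools import reduce


def union_all(*frames):
    """Fold any number of DataFrames into one with repeated union() calls.

    Hypothetical helper: it only relies on each argument having a
    union() method, which PySpark DataFrames do.
    """
    if not frames:
        raise ValueError("union_all() needs at least one DataFrame")
    # reduce() evaluates frames[0].union(frames[1]).union(frames[2])...
    return reduce(lambda left, right: left.union(right), frames)


def demo():
    # Imported here so union_all() can be reused without starting Spark
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("union_all_demo").getOrCreate()
    columns = ["Student Name", "Overall Percentage"]

    # Three DataFrames with identical column order
    df1 = spark.createDataFrame([("Bhuwanesh", 82.98)], columns)
    df2 = spark.createDataFrame([("Harshit", 80.31)], columns)
    df3 = spark.createDataFrame([("Naveen", 91.123)], columns)

    union_all(df1, df2, df3).show()
```

Calling demo() in an environment with PySpark installed should print the three rows in one table. The same positional rule applies at every step, so all inputs need the same column order.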

Example 2:

In this example, we combine two DataFrames, data_frame1 and data_frame2, whose columns appear in a different order. Because union() matches columns by position rather than by name, the percentages from data_frame2 land under "Student Name" and the names land under "Overall Percentage". Hence, the output is not the desired one.

Python3

# Python program to illustrate the working 
# of union() function 
  
import pyspark 
from pyspark.sql import SparkSession 
  
spark = SparkSession.builder.appName('w3wiki.com').getOrCreate() 
  
# Creating a data frame 
data_frame1 = spark.createDataFrame( 
    [("Bhuwanesh", 82.98), ("Harshit", 80.31)], 
    ["Student Name", "Overall Percentage"] 
) 
  
# Creating another data frame 
data_frame2 = spark.createDataFrame( 
    [(91.123, "Naveen"), (90.51, "Piyush"), (87.67, "Hitesh")], 
    ["Overall Percentage", "Student Name"] 
) 
  
# Union both the dataframes using the union() method 
answer = data_frame1.union(data_frame2) 
  
# Print the combination of both the dataframes 
answer.show() 

Output:

+------------+------------------+
|Student Name|Overall Percentage|
+------------+------------------+
|   Bhuwanesh|             82.98|
|     Harshit|             80.31|
|      91.123|            Naveen|
|       90.51|            Piyush|
|       87.67|            Hitesh|
+------------+------------------+
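One way to avoid the mix-up above while still using union() is to reorder the second DataFrame's columns to match the first before combining them. The sketch below assumes both DataFrames contain exactly the same column names; union_aligned and demo are hypothetical names chosen for this illustration. unionByName(), listed as Method 2, is the built-in alternative that matches columns by name.

```python
def union_aligned(left, right):
    """Reorder right's columns to match left's, then union by position.

    Hypothetical helper: assumes both DataFrames have the same set of
    column names, only in a different order.
    """
    if set(left.columns) != set(right.columns):
        raise ValueError("both DataFrames must have the same column names")
    # select() with left's column list puts right's columns in left's order
    return left.union(right.select(left.columns))


def demo():
    # Imported here so union_aligned() can be reused without starting Spark
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("union_aligned_demo").getOrCreate()

    data_frame1 = spark.createDataFrame(
        [("Bhuwanesh", 82.98), ("Harshit", 80.31)],
        ["Student Name", "Overall Percentage"],
    )
    # Same columns as data_frame1, but in the opposite order
    data_frame2 = spark.createDataFrame(
        [(91.123, "Naveen"), (90.51, "Piyush"), (87.67, "Hitesh")],
        ["Overall Percentage", "Student Name"],
    )

    union_aligned(data_frame1, data_frame2).show()
```

With the columns aligned first, each name and percentage from data_frame2 should appear under its own header when demo() is run.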

How to union multiple dataframe in PySpark?

In this article, we will discuss how to union multiple data frames in PySpark.
