Convert PySpark Row List to Pandas DataFrame

In this article, we will convert a PySpark Row list into a Pandas DataFrame. A Row object represents a single row of a PySpark DataFrame, so a DataFrame can be represented as a Python list of Row objects.

To perform the conversion, use the createDataFrame() method followed by the toPandas() method.
Here is the syntax of the createDataFrame() method:

Syntax: current_session.createDataFrame(data, schema=None, samplingRatio=None, verifySchema=True)
Parameters:
- data: a resilient distributed dataset (RDD), a list, or a pandas.DataFrame holding SQL data representations (e.g. Row, tuple, int, boolean)
- schema: a DataType, a datatype string, or a list of column names for the DataFrame
- samplingRatio -> float: the sample ratio of rows used when inferring the schema
- verifySchema -> bool: whether to check that the data types of the rows match the schema

Returns: a PySpark DataFrame object.
Example:
In this example, we will pass the Row list as data and create a PySpark DataFrame. We will then use the toPandas() method to get a Pandas DataFrame.
Python
# Importing PySpark and, importantly, Row from pyspark.sql
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import Row

# PySpark session
row_pandas_session = SparkSession.builder.appName(
    'row_pandas_session'
).getOrCreate()

# List of sample Row objects
row_object_list = [Row(Topic='Dynamic Programming', Difficulty=10),
                   Row(Topic='Arrays', Difficulty=5),
                   Row(Topic='Sorting', Difficulty=6),
                   Row(Topic='Binary Search', Difficulty=7)]

# Creating a PySpark DataFrame using createDataFrame()
df = row_pandas_session.createDataFrame(row_object_list)

# Printing the Spark DataFrame
df.show()

# Conversion to a Pandas DataFrame
pandas_df = df.toPandas()

# Final result
print(pandas_df)
Output: