Using PySpark Row on DataFrame and RDD

The fields of a Row can be accessed like attributes (row.name) or like dictionary values (row['name']). Row allows you to create row objects using named arguments. A named argument cannot simply be omitted to indicate that a value is missing or None; in that case, you should explicitly set the field to None.
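As a minimal sketch of the two access styles and the explicit None rule (the field names and values here are illustrative):

Python3

from pyspark.sql import Row

# A missing value must be set to None explicitly
person = Row(name='Alice', age=None)

# Attribute-style access
print(person.name)     # Alice

# Dictionary-style access
print(person['age'])   # None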

Changed in version 3.0.0: Rows created from named arguments now preserve the order in which the arguments were entered, instead of being sorted alphabetically by field name. A Row in PySpark is an immutable, dynamically typed object containing a set of key-value pairs, where the keys correspond to the column names of the DataFrame.
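A quick sketch of this ordering behavior, assuming PySpark 3.0 or later:

Python3

from pyspark.sql import Row

# Since 3.0.0, fields keep the argument order instead of being sorted
row = Row(b=1, a=2)
print(row)  # Row(b=1, a=2), not Row(a=2, b=1)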

Rows can be created in a number of ways, including directly instantiating a Row object with a set of values or converting an RDD of tuples into a DataFrame. In PySpark, DataFrames are built on top of RDDs but provide a more structured and streamlined way to manipulate data using SQL-like queries and transformations. In this context, a Row object represents a record in a DataFrame or an element in an RDD of tuples.
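A short sketch of that second path, converting an RDD of Row objects into a DataFrame (the app name and sample data are illustrative):

Python3

from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName('row-demo').getOrCreate()

# Build an RDD of Row objects, then convert it to a DataFrame
rdd = spark.sparkContext.parallelize([
    Row(name='Alice', age=30),
    Row(name='Bob', age=25),
])
df = spark.createDataFrame(rdd)
df.show()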

1. Creating a Row object in PySpark

Approach:

  • Import Row from pyspark.sql
  • Create a row using Row()
  • Access the fields of the row using attribute (dot) notation.

Python3

from pyspark.sql import Row

# Create a Row object with three columns: name, age, and city
row = Row(name='w3wiki', age=25, city='India')

# Access the values of the row using dot notation
print(row.name)
print(row.age)
print(row.city)
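
Running this prints the three field values in order:

Output:

w3wiki
25
India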