How to Convert RDD to DataFrame in Spark Scala?

This article focuses on discussing ways to convert an RDD to a DataFrame in Spark Scala.

Table of Contents

  • RDD and DataFrame in Spark
  • Convert Using createDataFrame Method
  • Conversion Using toDF() Implicit Method
  • Conclusion
  • FAQs

RDD and DataFrame in Spark

RDD and DataFrame are Spark’s two primary abstractions for handling data.

  1. An RDD (Resilient Distributed Dataset) is Spark’s low-level building block for distributed data processing, while a DataFrame organizes data into named columns and supports SQL-style operations.
  2. In real projects, there is often a need to switch between RDDs and DataFrames.

Below is the Scala program to set up a Spark session and create an RDD of employee tuples:

Scala
import org.apache.spark.sql.{SparkSession, Row}
import org.apache.spark.sql.types._
val spark = SparkSession.builder().master("local").appName("RDDExample").getOrCreate()
val sc = spark.sparkContext
val rdd = sc.parallelize(Seq(
  ("Alice", "HR Manager", 40),
  ("Bob", "Software Developer", 35),
  ("Charlie", "Data Scientist", 28)
))

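To verify the contents of the RDD, you can collect it back to the driver and print each record. This is fine for a small example like this one, though collect() should be avoided on large datasets:

Scala
rdd.collect().foreach(println)

Output:

(Alice,HR Manager,40)
(Bob,Software Developer,35)
(Charlie,Data Scientist,28)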

Convert Using createDataFrame Method

The createDataFrame method on SparkSession turns an RDD into a DataFrame. If your RDD holds tuples or case classes, you can call it without specifying a schema (which describes the structure of your data), and Spark will infer the column structure for you. For full control over column names, data types, and nullability, you can instead convert the RDD to an RDD of Row objects and pass an explicit schema.

This way, you can easily work with your data in a DataFrame format without much hassle.

Below is a complete example that supplies an explicit schema:

Scala
import org.apache.spark.sql.{SparkSession, Row}
import org.apache.spark.sql.types._

object RDDToDataFrame {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RDD to DataFrame")
      .master("local[*]")
      .getOrCreate()

    val data = Seq(
      ("John", 30),
      ("Alice", 25),
      ("Bob", 35)
    )

    val rdd = spark.sparkContext.parallelize(data)
    val schema = StructType(
      Seq(
        StructField("Name", StringType, nullable = true),
        StructField("Age", IntegerType, nullable = true)
      )
    )
    // Convert each tuple to a Row so it matches the explicit schema
    val df = spark.createDataFrame(rdd.map(t => Row(t._1, t._2)), schema)

    df.show()
    spark.stop()
  }
}
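
Running this program should print:

Output:

+-----+---+
| Name|Age|
+-----+---+
| John| 30|
|Alice| 25|
|  Bob| 35|
+-----+---+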


Let’s now also create a DataFrame without an explicit schema, letting Spark infer one from the three-column tuple RDD built earlier, and examine the result:

Scala
val dfWithDefaultSchema = spark.createDataFrame(rdd)
dfWithDefaultSchema.printSchema()
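
With the inferred schema, the column names fall back to positional placeholders:

Output:

root
 |-- _1: string (nullable = true)
 |-- _2: string (nullable = true)
 |-- _3: integer (nullable = false)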
  • With an inferred schema, the column names follow a positional template (_1, _2, _3), and the data types Spark guesses for each column may not always be what you want.
  • To make sure the data is organized correctly and to have more control over its structure, it is better to define the layout beforehand. In programming terms, we create a schema (a StructType) that describes each column’s name, type, and nullability.
  • The createDataFrame overload that accepts an explicit schema requires an RDD[Row], so we first need to map our tuple RDD into Row objects. This conversion ensures the data is checked against the schema when the DataFrame is evaluated.

Below is the Code provided:

Scala
import org.apache.spark.rdd.RDD

val rowRDD: RDD[Row] = rdd.map(t => Row(t._1, t._2, t._3))


Next, let’s create the schema object that we need:

Below is the Code provided:

Scala
val schema = new StructType()
  .add(StructField("EmployeeName", StringType, false))
  .add(StructField("Department", StringType, true))
  .add(StructField("Salary", IntegerType, true)) // IntegerType matches the Int values in the sample data

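Defining the schema produces no console output by itself. If you want to inspect it, StructType provides a printTreeString() method, which prints the same tree layout as printSchema():

Scala
schema.printTreeString()

Output:

root
 |-- EmployeeName: string (nullable = false)
 |-- Department: string (nullable = true)
 |-- Salary: integer (nullable = true)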

Let’s invoke the method once more, this time passing in an extra schema parameter:

Scala
import org.apache.spark.sql.DataFrame

val dfWithSchema: DataFrame = spark.createDataFrame(rowRDD, schema)


We will print the schema information once again:
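
Scala
dfWithSchema.printSchema()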

Output:

root
 |-- EmployeeName: string (nullable = false)
 |-- Department: string (nullable = true)
 |-- Salary: integer (nullable = true)

It is evident that the data types are defined correctly and that the columns have appropriate names.

Conversion Using toDF() Implicit Method

Another common way to turn an RDD into a DataFrame is the .toDF() implicit method. Before we start, we need to import the implicit conversions from our SparkSession instance with import spark.implicits._; this is what brings toDF() into scope:

Below is the Code provided:

Scala
import org.apache.spark.sql.{SparkSession, Row}
import org.apache.spark.sql.types._

object RDDToDataFrame {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RDD to DataFrame")
      .master("local[*]")
      .getOrCreate()

    val data = Seq(
      ("John", 30),
      ("Alice", 25),
      ("Bob", 35)
    )

    // Bring toDF() into scope, then convert the RDD rather than the local Seq
    import spark.implicits._
    val rdd = spark.sparkContext.parallelize(data)
    val df = rdd.toDF("Name", "Age")

    df.show()
    spark.stop()
  }
}


With that in place, we are all set to convert our RDD. However, it is important to note that this method only handles RDDs of specific element types: Int, Long, String, or any subclass of scala.Product, which covers tuples and case classes (a case class sketch appears at the end of this section). Since our RDD was constructed from a sequence of tuples, we can utilize our imported implicit method as follows:

Below is the Code provided:

Scala
val dfUsingToDFMethod = rdd.toDF("EmployeeName", "Department", "Salary")


Now, let’s take a peek at the schema of our freshly minted DataFrame:

Below is the Code provided:

Scala
dfUsingToDFMethod.printSchema()


Upon execution, this will display:

Output:

root
 |-- EmployeeName: string (nullable = true)
 |-- Department: string (nullable = true)
 |-- Salary: integer (nullable = false)

This showcases the schema structure of our DataFrame, including the names and data types of its columns.
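
Because case classes are subclasses of scala.Product, the same implicit conversion works for an RDD of case class instances, and the field names become the column names. Below is a minimal sketch; the Employee case class is hypothetical, introduced only for illustration, and import spark.implicits._ must already be in scope:

Scala
// Hypothetical case class used only to illustrate toDF() on a Product type
case class Employee(name: String, department: String, salary: Int)

val employeeRDD = spark.sparkContext.parallelize(Seq(
  Employee("Alice", "HR Manager", 40),
  Employee("Bob", "Software Developer", 35)
))

// No column names are needed: they are taken from the case class fields
val employeeDF = employeeRDD.toDF()
employeeDF.printSchema()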

Conclusion

In this guide, we have looked at different ways to turn an RDD into a DataFrame and what each method requires. If your RDD contains Rows, you can use the createDataFrame method with an explicit schema. If it holds tuples, case classes, or primitive values, the toDF() implicit method offers a simpler conversion.

FAQs related to How to Convert RDD to DataFrame in Spark Scala?

What exactly is an RDD in Spark Scala?

Ans: An RDD, which stands for Resilient Distributed Dataset, is the foundational data structure in Spark Scala: an immutable, fault-tolerant collection of objects distributed across the cluster and processed in parallel.

What is the reason behind converting an RDD to a DataFrame?

Ans: Converting RDDs to DataFrames offers a more organized and efficient approach to handling data in Spark. DataFrames come with a range of operations and optimizations, such as the Catalyst query optimizer, that are not available with RDDs.

How can I change an RDD into a DataFrame in Spark Scala?

Ans: To switch an RDD to a DataFrame in Spark Scala, you have a couple of options. You can employ the createDataFrame method if your RDD contains Rows. Alternatively, you can opt for the toDF() implicit method, which offers a simpler conversion process.

Does converting RDDs to DataFrames have any impact on performance?

Ans: Yes, there are performance implications when converting RDDs to DataFrames. DataFrames are engineered for efficiency, featuring better memory management and execution plans compared to RDDs. This often translates to improved processing speed.

Is it possible to convert any type of RDD to a DataFrame?

Ans: While RDDs of type Row can be directly converted to DataFrames using createDataFrame, converting RDDs of other types might necessitate additional transformations or mapping operations to align with the DataFrame structure.