Understanding Spark

The official definition of Spark on its website is “Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters”.

Let us dive deeper into what this definition means.

  • Multi-language means that Spark applications can be written in several languages. Currently, it supports Python, Scala, R, Java, and SQL.
  • The property of Spark that makes it so popular and useful across different data applications is its distributed nature.
  • Spark processes data by dividing it into smaller chunks, called partitions, and then processing each chunk on a separate node.
  • There are two types of nodes – driver and worker. Each Spark program has exactly one driver node. When a program runs, the driver node manages the worker nodes: it takes care of partitioning the data and sending the operations to be performed to the worker nodes.
  • If one worker node requires data from another worker node, the driver coordinates the communication between them. The driver node manages all the tasks and returns the final result to the user.
  • Compared to its predecessor, Hadoop, Spark runs much faster. The reason is that Spark processes data in memory, while Hadoop writes intermediate results to a storage disk.
  • In addition, Spark applies advanced optimizations such as DAG-based execution planning and lazy evaluation to better optimize the given task.
  • The DAG (Directed Acyclic Graph) is an important component of Spark. Every job is broken down into subtasks, which are arranged into a DAG that captures their dependencies. The MapReduce model used by Hadoop is also a special case of a DAG.
  • However, Spark generalizes the concept and builds a DAG that fits the task at hand, as shown in the sketch after this list.
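
The DAG construction and lazy evaluation described above can be observed directly in code. Below is a minimal sketch in Scala (the application name and the local master URL are placeholder choices, not requirements): transformations such as map and filter only extend the DAG, and no computation happens until an action such as reduce is called.

    import org.apache.spark.sql.SparkSession

    object LazyEvaluationDemo {
      def main(args: Array[String]): Unit = {
        // Local session for illustration; on a cluster the master URL would differ.
        val spark = SparkSession.builder()
          .appName("LazyEvaluationDemo")
          .master("local[*]")
          .getOrCreate()

        // Distribute a range of numbers across the available workers.
        val numbers = spark.sparkContext.parallelize(1 to 1000000)

        // Transformations: these only record nodes in the DAG; no work happens yet.
        val squares = numbers.map(n => n.toLong * n)
        val evens   = squares.filter(_ % 2 == 0)

        // Action: triggers the whole DAG to be scheduled across the worker nodes.
        val total = evens.reduce(_ + _)
        println(s"Sum of even squares: $total")

        spark.stop()
      }
    }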

How to create a Spark session in Scala?

Scala stands for “scalable language”. It was developed in 2003 by Martin Odersky. It is an object-oriented language that also supports the functional programming paradigm. Everything in Scala is an object; for example, values like 1 and 2 can invoke methods such as toString(). Scala is a statically typed language, although unlike other statically typed languages such as C, C++, or Java, it doesn’t require explicit type annotations while writing code: type verification is done at compile time. Static typing allows safe systems to be built by default. Smart built-in checks and actionable error messages, combined with thread-safe data structures and collections, prevent many tricky bugs before the program first runs.
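
A few lines of Scala illustrate these points: values such as 1 are full objects with methods, and types are inferred yet still checked at compile time.

    val n = 1            // the type Int is inferred; no annotation is required
    println(n.toString)  // even an integer literal is an object and can invoke methods

    // val s: String = n // would not compile: the type mismatch is caught at compile time

As for the question in this section’s heading, the standard way to create a Spark session in Scala is through the builder API on SparkSession. A minimal sketch follows; the application name and master URL are placeholder values to adapt to your environment.

    import org.apache.spark.sql.SparkSession

    // getOrCreate() returns the existing session if one is already running,
    // otherwise it builds a new one from the settings configured below.
    val spark = SparkSession.builder()
      .appName("MySparkApp")   // any descriptive name; shown in the Spark UI
      .master("local[*]")      // run locally on all cores; use a cluster URL in production
      .getOrCreate()

    println(spark.version)    // quick sanity check that the session is live

Once created, spark serves as the entry point to DataFrames, Spark SQL, and the underlying SparkContext.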
