Understanding RDD and Spark
Before building an RDD, let's take a brief look at what it is. An RDD is the fundamental data abstraction of Apache Spark. Spark is a framework for developing distributed applications, i.e. code that runs on many machines at the same time, typically to process large datasets for business analysis. RDD stands for Resilient Distributed Dataset: a collection of elements, split into partitions, that can be operated on in parallel. Resilient means that the data can be recovered after a failure such as a node crash or power outage, because Spark tracks the lineage of operations and can recompute lost partitions. Distributed means that a large dataset is broken into smaller chunks that are processed across the cluster. The RDD is now considered a low-level, older API of Spark, as its successors, DataFrame and Dataset, are more optimized and the Dataset API additionally provides type safety for building more reliable code.
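As a minimal sketch of building an RDD, the snippet below parallelizes a local collection into partitions. It assumes Spark is on the classpath (e.g. run via `spark-shell` or `spark-submit`); the `SparkSession` setup and names here are illustrative boilerplate, not taken from the text above.

```scala
import org.apache.spark.sql.SparkSession

object RddSketch {
  def main(args: Array[String]): Unit = {
    // Local session for demonstration; in production the master
    // would point at a real cluster instead of local[*]
    val spark = SparkSession.builder()
      .appName("RddSketch")
      .master("local[*]")
      .getOrCreate()

    // parallelize splits the local Seq into partitions that
    // Spark can operate on in parallel
    val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3, 4, 5), numSlices = 2)

    println(rdd.getNumPartitions) // how many chunks the data was split into

    spark.stop()
  }
}
```

Here `numSlices` controls how many partitions the collection is divided into; by default Spark picks a value based on the available cores.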
How to Print RDD in Scala?
Scala stands for "scalable language". It was created in 2003 by Martin Odersky. It is an object-oriented language that also supports the functional programming paradigm. Everything in Scala is an object; for example, literal values like 1 and 2 can invoke methods like toString(). Scala is statically typed, but unlike other statically typed languages such as C, C++, or Java, it usually doesn't require explicit type annotations in the code: types are inferred and verified at compile time. Static typing helps build safe systems by default, as smart built-in checks and actionable error messages, combined with thread-safe data structures and collections, prevent many tricky bugs before the program first runs.
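The two points above, that every value is an object and that types are inferred at compile time, can be sketched in a few lines of plain Scala (runnable in the Scala REPL; no Spark required):

```scala
val n = 1               // inferred as Int, no annotation needed
val s = n.toString      // even an Int literal is an object with methods
val xs = List(1, 2, 3)  // inferred as List[Int]

// The compiler rejects type errors before the program runs:
// val bad: String = n  // would not compile: type mismatch

println(s"n = $n, s = $s, xs = $xs")
```

Uncommenting the `val bad` line produces a compile-time type-mismatch error, which is the kind of check that catches bugs before execution.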
Table of Contents
- Understanding RDD and Spark
- Building Sample RDD
- How to Print RDD in Scala?
- Conclusion