
Apache Spark is an open-source cluster computing framework adopted across a wide range of industries to process vast volumes of data and extract real-time insights from it. It is one of the most successful projects of the Apache Software Foundation. Before jumping into the tutorial, let's consider a few important things about data and the evolution of Spark.

The concept of data is not new; it has been around for decades. Data growth was slow in the past, but it has accelerated tremendously since the rise of the internet. Enormous amounts of data are generated around the clock, and the technologies that process that data keep evolving along with it.


What is Apache Spark?

Apache Spark is an open-source cluster computing framework designed to fulfill the need for real-time analytics. It builds on the Hadoop MapReduce model and extends its capabilities to additional kinds of computation such as stream processing and interactive queries. Spark has gained popularity largely because of its in-memory computation, which greatly boosts processing speed.

Spark supports a range of workloads such as batch applications, iterative algorithms, streaming, and interactive queries. Because a single engine covers all these tasks, it also reduces the burden of maintaining separate tools.

Data Processing

It was not possible to process such extensive amounts of data with traditional data processing systems, and companies around the world were in desperate need of better mechanisms. Hadoop addressed this problem with its powerful batch processing capability: it stores data in a distributed manner and processes it in parallel. Hadoop has three main components: the reliable data storage layer HDFS, the resource management layer YARN, and the batch processing engine MapReduce.

You may wonder why, when we have such a robust data processing technology, we need other technologies like Spark. Let's get into the Spark tutorial to find answers to that question.

To gain a clear understanding of Spark, we need some prior knowledge of MapReduce, so let's look at MapReduce first.

What is MapReduce?

MapReduce is a programming model suited to processing vast amounts of data. A MapReduce job is divided into two main tasks: Map and Reduce. The Map task takes a set of data and transforms it into another set in which individual elements are broken into tuples (key/value pairs). The Reduce task then receives the output of the Map as its input and combines those tuples into a smaller set of tuples. The main advantage of MapReduce is that it makes it easy to scale data processing over multiple nodes.
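To make the Map and Reduce steps concrete, here is a minimal word-count sketch written against Spark's Scala RDD API; the input file name and the local master setting are illustrative assumptions, not part of the original tutorial.

```scala
// Word count: flatMap/map plays the role of Map (emit (word, 1) pairs),
// and reduceByKey plays the role of Reduce (sum the counts per key).
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    val counts = sc.textFile("input.txt")          // hypothetical input file
      .flatMap(_.split("\\s+"))                    // Map phase: split lines into words
      .map(word => (word, 1))                      // ...and emit key/value tuples
      .reduceByKey(_ + _)                          // Reduce phase: combine values per key

    counts.take(10).foreach(println)
    sc.stop()
  }
}
```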

History of Spark

Before the evolution of Spark, MapReduce was the dominant processing framework. Spark started as a research project in 2009 and was open-sourced in 2010. The motivation behind Spark was to create a cluster computing framework that could support many kinds of cluster-based computation. In 2013, Spark moved to the Apache Software Foundation and has since been adopted by many industries across the world.

Apache Spark Streaming

Apache Spark Streaming is a scalable, fault-tolerant stream processing system that seamlessly supports both streaming and batch workloads. Spark Streaming is an extension of the Spark API that allows data scientists and data engineers to process real-time data received from sources such as Flume, Kafka, and Amazon Kinesis. The processed data can be pushed to databases, file systems, and live dashboards. The main abstraction here is the DStream, which represents a stream of data divided into tiny batches. DStreams are built on RDDs, the core data abstraction of Spark, which allows Spark Streaming to interoperate easily with components like Spark SQL and MLlib.
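As a rough illustration of the DStream abstraction, the sketch below consumes text lines from a TCP socket (the host, port, and batch interval are assumptions chosen for the example) and applies the usual RDD-style operations to each micro-batch.

```scala
// A minimal Spark Streaming sketch: word counts over 5-second micro-batches
// read from a socket (e.g. one started with `nc -lk 9999`).
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(5))      // each DStream batch spans 5 seconds

    val lines  = ssc.socketTextStream("localhost", 9999)   // DStream of incoming lines
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)                  // same operations, applied per micro-batch
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```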

Spark is quite different from other streaming systems, which are either designed only for streaming or expose separate streaming and batch APIs whose computations run on different engines. Spark has become a powerful streaming system because it offers a single execution engine and a single programming model for both.

[ Check out Caching Play in Spark Streaming ]

Apache Spark Features

Vast amounts of data need to be processed immediately to act on the insights residing in them. Organizations across the world are looking for real-time analytics, which is not possible with Hadoop MapReduce. Spark was developed to analyze data on demand and provide instant analytics.

Below are the capabilities and features of Apache Spark that make it special:

  • Speed: Spark works at lightning speed. It can run an application up to 100 times faster in memory and around ten times faster on disk than Hadoop MapReduce. Spark cuts down read/write operations on disk by keeping intermediate data in memory, and it uses partitioning to reduce network traffic while processing data; the result is faster processing.

  • Supports multiple languages: Spark is highly flexible in nature. It supports a range of programming languages such as Java, Scala, and Python, so you can write applications in your preferred language. Spark also ships with more than 80 high-level operators that help with interactive querying.

  • Supports multiple formats: Spark supports many data formats and data sources, including Cassandra, Hive, JSON, and Parquet.

  • Dynamic in nature: Apache Spark makes it easy to develop parallel applications because it offers more than 80 high-level operators.

  • Reusability: Spark lets you reuse the same code for batch processing, for joining streams against historical data, and for running ad-hoc queries in streaming mode.

  • Fault tolerance: Spark is highly fault-tolerant thanks to its core abstraction, the RDD. RDDs are designed to handle the sudden failure of any worker node in the cluster and ensure that no data is lost when a node fails.

  • Real-time stream processing: Spark supports real-time data processing, whereas Hadoop MapReduce can only process data that is already stored; Spark was built to remove that limitation.

  • Support for deeper analysis: In addition to map and reduce, Apache Spark comes with dedicated tools for machine learning (MLlib), streaming data (Spark Streaming), and declarative/interactive queries (Spark SQL), which enable deeper analysis.

  • Cost-effective: Big data processing with Hadoop requires large storage capacity and a data center footprint for replication. Spark needs less storage, which makes it more cost-effective than Hadoop.

  • Hadoop integration: Spark is highly flexible. It can run on the Hadoop YARN cluster manager or run independently, and it is built with the ability to read existing Hadoop data.

Apache Spark Components

Spark's components are what make Apache Spark fast and reliable. These components were developed to eliminate the issues that arose while using MapReduce.

The components below are what make Spark stronger:

  1. Spark Core
  2. Spark Streaming
  3. Spark SQL
  4. GraphX
  5. MLlib (Machine Learning)

Component #1. Spark Core

Spark Core performs essential functions such as memory management, task scheduling, and interacting with storage systems. It is also home to the API that defines resilient distributed datasets (RDDs). An RDD is a collection of items distributed across many nodes that can be manipulated in parallel.

Component #2. Spark Streaming

This component processes live streams of data, such as queues of messages posted by users and the log files generated by production web servers. The DStream in Spark Streaming is a series of RDDs used to process real-time data, and this design keeps the system fault-tolerant and highly scalable.

Component #3. Spark SQL

Spark SQL ships with Spark and helps you work with structured data. It allows data to be queried in two ways: SQL and HQL (Hive Query Language). Spark SQL supports data from a wide variety of sources such as Parquet, JSON, and Hive tables. In addition to providing a SQL interface, Spark SQL lets developers intermix SQL with programming languages such as Python, Scala, and Java, all in one application. This tight integration makes Spark stand out from other open-source data warehouse tools.
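A short Spark SQL sketch follows; the JSON file name and its `name`/`age` fields are assumptions made for illustration.

```scala
// Load semi-structured JSON as a DataFrame, register it as a temporary view,
// and query it with plain SQL.
import org.apache.spark.sql.SparkSession

object SparkSqlExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SparkSqlExample")
      .master("local[*]")
      .getOrCreate()

    val people = spark.read.json("people.json")        // hypothetical input file
    people.createOrReplaceTempView("people")            // expose it to SQL queries

    spark.sql("SELECT name, age FROM people WHERE age >= 18").show()

    spark.stop()
  }
}
```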

Component #4. GraphX

Spark is equipped with a library called GraphX for manipulating graphs and performing graph computations. Like Spark Streaming and Spark SQL, GraphX extends the RDD abstraction, introducing a directed property graph.

To support graph computation, GraphX exposes a set of fundamental operators and an optimized variant of the Pregel API. To make graph analytics easier and more powerful, it also ships with a growing collection of graph algorithms and builders.
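As a small illustration of the property-graph model, the sketch below builds a three-node graph from vertex and edge RDDs and runs PageRank, one of GraphX's built-in algorithms; the vertices and edges are made up for the example.

```scala
// Build a directed property graph from RDDs and run PageRank on it.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object GraphExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("GraphExample").setMaster("local[*]"))

    val vertices = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
    val edges    = sc.parallelize(Seq(
      Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")
    ))

    val graph = Graph(vertices, edges)            // vertex and edge RDDs form the graph
    val ranks = graph.pageRank(0.001).vertices    // built-in graph algorithm

    ranks.collect().foreach(println)
    sc.stop()
  }
}
```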

Component #5. MLlib ( Machine Learning Library)

Apache Spark has a built-in library called MLlib that offers a wide range of machine learning algorithms for classification, clustering, collaborative filtering, and more. It also supports model evaluation and data import, and it provides lower-level primitives such as a gradient descent optimization algorithm.
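Below is a brief sketch using MLlib's RDD-based API to cluster a handful of two-dimensional points with k-means; the data points and the choice of k are illustrative assumptions.

```scala
// Cluster a few points into two groups with MLlib's k-means.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object KMeansExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("KMeansExample").setMaster("local[*]"))

    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)
    )).cache()

    val model = KMeans.train(points, k = 2, maxIterations = 20)
    model.clusterCenters.foreach(println)         // the two learned cluster centers

    sc.stop()
  }
}
```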

Apache Spark’s Run-Time Architecture Components

Spark has three main runtime components:

  • Apache Spark Driver
  • Apache Spark Cluster Manager
  • Apache Spark Executors

Apache Spark Driver

The driver is the central runtime component of Spark. The driver process runs the user code that creates the SparkContext, creates RDDs, and performs transformations and actions. Whenever the Spark shell is launched, a driver program is created, and when the driver terminates, the application is finished.

The driver program divides the Spark application into tasks and schedules them to run on executors. The task scheduler lives inside the driver and distributes the tasks among the workers. A driver has two main responsibilities:

  • Converting the user program into tasks
  • Scheduling tasks on executors

Apache Spark Cluster Manager

The cluster manager is responsible for launching executors and, in some deployment modes, the driver as well. Jobs and actions within a Spark application are scheduled by the Spark scheduler on the cluster manager, by default in a First In First Out (FIFO) manner; scheduling can also be done in a round-robin fashion.

Spark also provides dynamic allocation, which adjusts resources based on the workload: executors are requested only when there is demand and released when idle, so the application is neither starved under heavy load nor left holding unused resources. This option is available on all coarse-grained cluster managers, i.e. YARN, standalone mode, and Mesos coarse-grained mode.
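As a hedged sketch, dynamic allocation is switched on through configuration; the executor bounds below are arbitrary example values, and on YARN or standalone mode the external shuffle service must also be enabled.

```scala
// Enable dynamic executor allocation via Spark configuration.
import org.apache.spark.sql.SparkSession

object DynamicAllocationExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DynamicAllocationExample")
      .config("spark.dynamicAllocation.enabled", "true")    // grow/shrink executors with the workload
      .config("spark.shuffle.service.enabled", "true")      // keeps shuffle data when executors are released
      .config("spark.dynamicAllocation.minExecutors", "1")  // example bounds, tune per workload
      .config("spark.dynamicAllocation.maxExecutors", "10")
      .getOrCreate()

    // ... run jobs as usual; executors are requested and released on demand ...
    spark.stop()
  }
}
```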

[ Related Article: What are Apache Spark Challenges? ]

Apache Spark Executors

The individual tasks in a Spark job run inside Spark executors. Executors are launched once at the start of a Spark application and usually run for the entire lifetime of the application. Executors perform two main roles:

  • Run the tasks that make up the application and return the results to the driver.
  • Provide in-memory storage for RDDs that are cached by the user.


1. Apache Spark Resilient Distributed Datasets

The core data structure of Spark is the RDD. RDD stands for Resilient Distributed Dataset, an immutable distributed collection of objects. The data in an RDD is divided into logical partitions that can be computed on different nodes of the cluster. RDDs can hold any type of object, including Java, Python, or Scala objects and user-defined classes.

You can create RDDs in two ways: by parallelizing an existing collection in the driver program, or by referencing a dataset in an external storage system such as a shared file system, HBase, HDFS, or any data source that offers a Hadoop InputFormat. Spark uses RDDs to achieve faster and more efficient MapReduce-style operations; without RDDs, those operations take far longer because intermediate results must go through disk.

2. Data Sharing using Spark RDD

Processing data with MapReduce is slow because of serialization, replication, and disk I/O. Hadoop applications spend the bulk of their time (often cited as up to 90%) on HDFS read-write operations.

To overcome this problem, researchers developed the Resilient Distributed Dataset (RDD), which supports in-memory computation. An RDD keeps intermediate state in memory as an object that can be shared across jobs, and in-memory data sharing is 10 to 100 times faster than sharing via the network and disk.
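The sketch below shows the idea of in-memory sharing: an RDD is persisted after the first job materializes it, so the second job reuses the cached data instead of re-reading the (hypothetical) input file from disk.

```scala
// Persist a filtered RDD in memory and reuse it across two jobs.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CachingExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CachingExample").setMaster("local[*]"))

    val errors = sc.textFile("events.log")              // hypothetical input file
      .filter(_.contains("ERROR"))
      .persist(StorageLevel.MEMORY_ONLY)                // keep the filtered data in memory

    val total  = errors.count()                                  // first job builds and caches the RDD
    val byHost = errors.map(_.split(" ")(0)).distinct().count()  // second job reuses the cached data

    println(s"error lines: $total, distinct hosts: $byHost")
    sc.stop()
  }
}
```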

3. Ways to Create RDD

There are three ways to create RDDs, illustrated in the sketch after this list:

  1. Parallelized collections: Invoking the parallelize method on an existing collection in the driver program creates a parallelized collection.

  2. External datasets: Calling the textFile method creates a Spark RDD from a file; the method takes the URL of the file and reads it as a collection of lines.

  3. Existing RDDs: Applying a transformation to an existing RDD produces a new RDD.
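A compact sketch of all three creation paths (the HDFS path is an illustrative assumption):

```scala
// Three ways to obtain an RDD: parallelize a collection, read an external
// dataset, or transform an existing RDD.
import org.apache.spark.{SparkConf, SparkContext}

object RddCreation {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RddCreation").setMaster("local[*]"))

    val numbers = sc.parallelize(1 to 100)                  // 1. parallelized collection
    val lines   = sc.textFile("hdfs:///data/input.txt")     // 2. external dataset, one element per line
    val squares = numbers.map(n => n * n)                   // 3. transformation of an existing RDD

    println(squares.sum())
    sc.stop()
  }
}
```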

4. Spark RDD Operations

RDDs support two kinds of operations (a short sketch follows the list):

  • Transformation operations: A transformation creates a new RDD from an existing one.

  • Action operations: An action computes a result and returns it to the driver program or writes it to an external data store.
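The distinction matters because transformations are lazy: nothing executes until an action is called. A minimal sketch:

```scala
// Transformations build a new RDD lazily; actions trigger the computation.
import org.apache.spark.{SparkConf, SparkContext}

object RddOperations {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RddOperations").setMaster("local[*]"))

    val numbers = sc.parallelize(1 to 10)

    // Transformations: each returns a new RDD, but no work happens yet
    val evens   = numbers.filter(_ % 2 == 0)
    val doubled = evens.map(_ * 2)

    // Actions: run the job and return results to the driver
    println(doubled.collect().mkString(", "))   // 4, 8, 12, 16, 20
    println(doubled.count())                    // 5

    sc.stop()
  }
}
```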

5. Features of RDD

There are multiple benefits of using RDDs. Let's discuss some of them:

  • In-memory computation: Data stored in an RDD can be kept in memory for as long as you wish, which lets you quickly detect patterns and analyze huge data sets on the fly.

  • Fault tolerance: The lineage of operations makes it possible to re-compute lost RDD partitions from the original data. This provides high fault tolerance and makes it easy to recover data.

  • Immutability: An RDD can never be modified once created; you can only create a new RDD by transforming an existing one. Immutability is an important factor in achieving consistency.

  • Persistence: Frequently used data can be persisted in memory, so you do not have to go to disk every time you need it, which improves execution speed.

  • No hard limits: There is no fixed limit on how many RDDs you can create and use; you are bounded only by the available disk and memory.

Apache Spark Limitations

There are some limitations associated with Apache Spark. Let's discuss them one by one.

1. No File Management System:

Spark has no file management system of its own, so it has to be integrated with other platforms: to manage files it relies on Hadoop (HDFS) or another storage platform, often in the cloud. This is a notable limitation of Apache Spark.

2. No full Real-Time Data Processing:

Apache Spark does not support fully real-time stream processing. In live streaming, it divides incoming data into small batches represented as Spark RDDs (Resilient Distributed Datasets), processes them with operations such as map, join, and reduce, and then emits the results batch by batch. Because of this design, Spark performs micro-batch processing rather than true record-at-a-time, real-time processing.

3. High Cost:

Keeping data in memory for processing is not cheap. Spark's in-memory processing consumes a large amount of memory, so it needs machines with plenty of RAM, and you may have to buy additional memory to run Spark, which raises the cost.

4. Small File Issue:

An RDD backed by a large number of small files ends up with many small partitions. For efficient processing, the RDD has to be repartitioned into a manageable form, and doing so requires heavy shuffling over the network.

5. Very Few Algorithms:

Spark MLlib is the library that contains Spark's machine learning algorithms, but the number of available algorithms is relatively small, which adds to the limitations of Apache Spark.

6. Latency:

Apache Spark has comparatively high latency, which lowers throughput for streaming workloads. Apache Flink offers lower latency with a high throughput rate, which makes it better suited to some streaming use cases.

7. Manual Optimization:

In Spark, users have to optimize jobs and datasets manually. To control partitioning, you must specify the desired number of partitions yourself, for example when passing a collection to the parallelize method. Building efficient partitions and caches means the whole partitioning task is controlled manually.
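A brief sketch of this manual control, with arbitrary example partition counts:

```scala
// Set the partition count explicitly when creating an RDD, then adjust it
// with coalesce (no full shuffle) or repartition (full shuffle).
import org.apache.spark.{SparkConf, SparkContext}

object PartitionTuning {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PartitionTuning").setMaster("local[*]"))

    val data = sc.parallelize(1 to 1000000, numSlices = 8)  // explicit partition count up front
    println(data.getNumPartitions)                          // 8

    val fewer = data.coalesce(4)       // shrink the number of partitions cheaply
    val more  = data.repartition(16)   // rebalance with a shuffle
    println(fewer.getNumPartitions, more.getNumPartitions)

    sc.stop()
  }
}
```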

8. Window Criteria:

Apache Spark supports only time-based window criteria, not record-based window criteria.

Apache Spark Use cases

Big companies such as Netflix, Uber, and Pinterest, along with many online advertisers, are leveraging the potential of Spark. Let's consider a couple of real-world examples.

Uber: This transportation company collects terabytes of event data every day from millions of mobile devices. Using Spark components, it built a continuous ETL pipeline that converts raw, unstructured data into structured data and analyzes it to extract insights.

Pinterest: This website also uses a Spark-based ETL pipeline to see how users around the world are engaging with pins in real time, which helps it show users the content they are interested in.

Industries That Use Apache Spark

Apache Spark fulfills the evolving need for real-time data analytics. With each passing day the need for data processing grows, and because the majority of data is unstructured, it demands the kind of computation capability that Spark provides. Below are some of the industries that use Spark for real-time analytics.

  • Banks: The banking industry has started adopting Spark extensively to make more informed decisions. Banks use Spark to analyze data such as social profiles, call recordings, emails, complaint logs, and forum discussions. Real-time analytics helps with decisions in areas such as credit assessment, targeted advertising, and customer segmentation, and it also makes fraud detection much simpler.

  • Healthcare: Healthcare providers are adopting Spark to analyze the condition of critical patients. Hospitals sometimes need to coordinate with one another in emergencies such as blood or organ transplants, and providing medical assistance at the right time can be a matter of life or death.

  • Telecommunications: Companies offering services such as video chat, streaming, and calls use Spark for real-time analytics to improve customer engagement and satisfaction. Real-time analysis lets them identify and eliminate the issues that cause customer dissatisfaction.

  • Government: Government agencies use Spark for real-time analytics in departments responsible for national security, especially in areas like the military and police, to identify threats and take appropriate measures.

  • Stock market: Real-time analytics helps stockbrokers predict the movement of their portfolios, and companies use it to estimate the demand for their brands.

Spark Built on Hadoop

There are three ways Spark can be deployed on top of Hadoop:

  • Standalone
  • Hadoop Yarn
  • Spark in MapReduce (SIMR)

1. Standalone

In standalone deployment, Spark sits on top of the Hadoop Distributed File System (HDFS), with space allocated explicitly for HDFS. In this mode, Spark and MapReduce run side by side to cover all the jobs on the cluster.

2. Hadoop Yarn

In this mode, Spark simply runs on YARN without any pre-installation required. This helps integrate Spark into the Hadoop ecosystem, creating an environment where all the other Spark components can run on Hadoop without added complexity.

3. Spark in MapReduce (SIMR)

Earlier, it was very challenging to run Spark on MapReduce clusters that did not have YARN installed, and users often needed permission to install Spark on various subsets of machines, which took time. SIMR was introduced to remove this obstacle.

SIMR allows a user to run Spark on top of Hadoop MapReduce without having Spark or Scala installed on any of the nodes. To run it, all you need is access to HDFS and MapReduce.

Conclusion

Earlier, large-scale computation was a big problem, and Hadoop solved it; in the same way, real-time analytics was not possible until Spark provided the solution. We have covered what Spark is and how it works in real time, along with its features, components, limitations, and use cases. I hope you found this Apache Spark tutorial useful. Happy learning!
