Spark Tutorial – Learn Spark From Experts


Are you in search of the best Spark tutorial? Then you are in the right place. This Apache Spark tutorial for beginners is designed to give you a complete overview of all the concepts associated with Spark. Let's get into the Spark tutorial.

Apache Spark is an open-source cluster computing framework adopted by a wide range of industries to process vast volumes of data and get real-time insights out of it. It is one of the most successful projects of the Apache Software Foundation. Before jumping into the tutorial, let's consider some important things about data and the evolution of Spark.

The concept of data is not a new one; it has been around for decades. Data growth was slow in the past, but it has accelerated tremendously since the invention of the internet. Tons of data is generated round the clock, and the technologies that process this data keep evolving along with it.

Data processing:

It was not possible to process such extensive amounts of data with traditional data processing systems. Companies all around the world were in desperate need of better data processing mechanisms. Finally, Hadoop came up with its powerful batch processing capability to solve this problem. Hadoop stores data in a distributed manner and processes it in parallel. It has three main components: a reliable data storage layer (HDFS), a resource management layer (YARN), and a batch processing engine (MapReduce).

You may wonder: when we have such a robust data processing technology, why do we need other technologies like Spark? Let's get into the Spark tutorial to find answers to your questions.

To gain a clear understanding of Spark, we need some prior knowledge of MapReduce, so let's look into MapReduce first.

What is MapReduce?

MapReduce is an innovative programming model suited to processing vast amounts of data. A MapReduce job is divided into two important tasks: Map and Reduce. The Map task takes a set of data and transforms it into another set of data in which individual elements are broken down into key/value pairs (tuples). The Reduce task then receives the output of the Map as its input and combines those tuples into a smaller set of tuples. The main advantage of MapReduce is that it makes it easy to scale data processing over multiple nodes.
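
To make the model concrete, here is a minimal word-count sketch that follows the map and reduce phases described above. It is written with Spark's Python (PySpark) RDD API for consistency with the rest of this tutorial, and the input path is just a placeholder.

from pyspark import SparkContext

sc = SparkContext("local[*]", "WordCountExample")

# Map phase: split each line into words and emit (word, 1) pairs
# ("input.txt" is a placeholder path)
pairs = sc.textFile("input.txt") \
          .flatMap(lambda line: line.split()) \
          .map(lambda word: (word, 1))

# Reduce phase: combine the pairs for each word into a single count
counts = pairs.reduceByKey(lambda a, b: a + b)

print(counts.collect())
sc.stop()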

Evolution of Spark

Before Spark evolved, MapReduce was the dominant processing framework. Spark started as a research project at UC Berkeley in 2009 and was open-sourced in 2010. The motivation behind Spark was to create a cluster computing framework that could support a variety of cluster-based computations beyond batch processing. In 2013 Spark moved to the Apache Software Foundation, and it has since been adopted by many industries across the world.

What is Apache Spark?

Apache Spark is an open-source cluster computing framework designed to fulfil the need for real-time analytics. It builds on the Hadoop MapReduce model and extends it to additional kinds of computation, such as stream processing and interactive queries. Spark has gained popularity largely because of its in-memory computation, which greatly boosts processing speed.

Spark is capable of supporting different workloads such as iterative algorithms, streaming, batch applications, and interactive queries. Covering all of these workloads in one engine also reduces the burden of maintaining separate tools.

Apache Spark Tutorial – Spark Streaming

Apache Spark Streaming is a highly scalable, fault-tolerant stream processing system that seamlessly supports both streaming and batch workloads. Spark Streaming is an extension of the Spark API that allows data scientists and data engineers to process real-time data received from various sources, including Flume, Kafka, and Amazon Kinesis. The processed data is pushed to databases, file systems, and live dashboards. The main abstraction here is the DStream, which represents a stream of data divided into tiny batches. DStreams are built on RDDs, the core data abstraction of Spark, which allows Spark Streaming to interoperate easily with components like Spark SQL and MLlib.
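
As a minimal illustration, the sketch below builds a DStream from a socket source and counts words in each micro-batch. The host, port, and batch interval are placeholders chosen for the example.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "StreamingExample")
ssc = StreamingContext(sc, batchDuration=5)   # 5-second micro-batches

# DStream of lines received from a socket source (host/port are placeholders)
lines = ssc.socketTextStream("localhost", 9999)

# Each micro-batch is an RDD; count the words in every batch
counts = lines.flatMap(lambda line: line.split()) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)

counts.pprint()   # push results to the console (could be a database or dashboard)

ssc.start()
ssc.awaitTermination()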

Spark is quite different from other streaming systems: other systems are either designed only for streaming, or they expose separate streaming and batch APIs that compile down to different engines. Spark has become a powerful streaming system because it offers a single execution engine and a single programming model for both.

Apache Spark’s run-time Architecture components:

Spark has mainly three runtime components:

    • Apache Spark Driver
    • Apache Spark Cluster Manager
    • Apache Spark Executors

Apache Spark Driver:

The driver is the central runtime component of Spark. The driver runs the user code that creates the SparkContext, creates RDDs, and performs transformations and actions. Whenever the Spark shell is launched, a driver program is created behind the scenes. When the driver terminates, the application is finished.

The driver program divides the Spark application into small tasks and schedules them to run on the executors; the task scheduler built into the driver distributes the tasks to the workers. A driver is responsible for two main functionalities (sketched in the example after this list):

      • Converts the user program into tasks
      • Schedules tasks on the executors
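
The sketch below shows a minimal driver program in PySpark, assuming a local master: it creates the SparkContext, defines an RDD and a transformation, and then triggers an action, at which point the driver schedules tasks on the executors. The numbers are purely illustrative.

from pyspark import SparkContext

# The driver program: creates the SparkContext, builds RDDs,
# and schedules the resulting tasks on the executors
sc = SparkContext("local[*]", "DriverExample")

rdd = sc.parallelize(range(1, 1001))          # RDD defined in the driver
squares = rdd.map(lambda x: x * x)            # transformation (no tasks run yet)
total = squares.reduce(lambda a, b: a + b)    # action: driver splits the work into tasks
print(total)

sc.stop()   # terminating the driver completes the application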

Apache Spark Cluster Manager:

The cluster manager plays an important role in launching executors, and sometimes even the driver. Jobs and actions within a Spark application are scheduled by the Spark scheduler on the cluster manager in a First In First Out (FIFO) manner; scheduling can also be done in a round-robin fashion using the fair scheduler.

Spark also provides dynamic allocation, which lets it adjust resources based on the workload. This frees the application from hoarding resources it is not using: resources are requested only when there is demand. The option applies to all coarse-grained cluster managers, i.e. YARN mode, standalone mode, and Mesos coarse-grained mode.
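
As a rough illustration, dynamic allocation can be enabled through Spark configuration properties when the application is created. The values below are placeholders, the exact settings depend on the cluster, and dynamic allocation typically also requires the external shuffle service.

from pyspark.sql import SparkSession

# Illustrative settings only; tune the values for your own cluster
spark = SparkSession.builder \
    .appName("DynamicAllocationExample") \
    .config("spark.dynamicAllocation.enabled", "true") \
    .config("spark.dynamicAllocation.minExecutors", "1") \
    .config("spark.dynamicAllocation.maxExecutors", "10") \
    .config("spark.shuffle.service.enabled", "true") \
    .getOrCreate()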

Apache Spark Executors:

The individual tasks of a Spark job run inside the Spark executors. Executors are launched at the start of a Spark application and keep running for the entire lifetime of the application. Executors perform two main roles:

      • Execute the tasks that make up the application and send the results to the driver.
      • Provide in-memory storage for RDDs that are cached by the user.

Apache Spark Resilient Distributed Datasets:

The core data structure of Spark is the RDD, which stands for Resilient Distributed Dataset. An RDD is an immutable collection of objects, partitioned logically so that it can be computed on different nodes of the cluster. An RDD can hold any type of object, including Java, Python, or Scala objects and user-defined classes.

You can create RDDs in two basic ways: by parallelizing an existing collection in the driver program, or by referencing a dataset in an external storage system such as a shared file system, HBase, HDFS, or any data source that offers a Hadoop InputFormat. Spark uses RDDs as the means to achieve faster and more efficient MapReduce-style operations; without RDDs, such operations take much longer because intermediate results have to go to disk.

Data Sharing using Spark RDD

Processing data with MapReduce takes a lot of time because of serialization, replication, and disk I/O. Hadoop applications spend almost 90% of their time doing HDFS read-write operations.

To eliminate this problem, researchers developed a solution called the Resilient Distributed Dataset (RDD). RDDs support in-memory computation: the state is stored in memory as an object, and that object can be shared across jobs. In-memory data sharing is 10 to 100 times faster than sharing data over the network and disk.

Ways to create RDD

There are three ways to create RDDs.

1. Parallelized Collections:
Invoking the parallelize method on a collection in the driver program creates a parallelized collection.

2. External datasets
Calling the textFile method creates an RDD from an external dataset. This method takes the URL of a file and reads it as a collection of lines.

3. Existing RDDs
This is another method to create RDDs where you can apply a transformation to existing RDDs.
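
A short PySpark sketch of the three approaches; the file path is a placeholder.

from pyspark import SparkContext

sc = SparkContext("local[*]", "RDDCreationExample")

# 1. Parallelized collection: distribute an existing Python collection
numbers = sc.parallelize([1, 2, 3, 4, 5])

# 2. External dataset: read a file as an RDD of lines
#    ("data.txt" is a placeholder path)
lines = sc.textFile("data.txt")

# 3. Existing RDD: apply a transformation to derive a new RDD
doubled = numbers.map(lambda x: x * 2)

print(doubled.collect())
sc.stop()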

Spark RDD operations:

RDDs support two kinds of operations:

1. Transformation Operations:

In this operation, a new Spark RDD is created from the existing one.

2. Action Operations:

In Apache Spark, an action returns a final result to the driver program or writes it to an external data store.
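
A minimal sketch of the difference: transformations such as filter and map are lazy and only describe a new RDD, while actions such as count and collect actually run the computation and return results to the driver.

from pyspark import SparkContext

sc = SparkContext("local[*]", "OperationsExample")

rdd = sc.parallelize(range(10))

# Transformations are lazy: they define a new RDD but run nothing yet
evens = rdd.filter(lambda x: x % 2 == 0)
squared = evens.map(lambda x: x * x)

# Actions trigger the computation and return a result to the driver
print(squared.count())     # 5
print(squared.collect())   # [0, 4, 16, 36, 64]

sc.stop()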

Features of RDD:

There are multiple benefits of using RDDs. Let’s discuss some of them.

a) In-memory computation:

Whenever you store data in an RDD, it can be kept in memory for as long as you wish. This allows users to detect patterns quickly and analyze huge datasets on the fly.

b) Fault tolerance

The lineage of operations allows Spark to re-compute lost RDD partitions from the original data. This provides a high degree of fault tolerance and makes it easy to recover data.

c) Immutability:

An RDD can never be modified once created. You can only create a new RDD by transforming an existing one; you cannot alter the existing one. Immutability is an important factor in achieving consistency.

d) Persistency:

Frequently used data can be persisted in memory, which means you do not need to go to disk every time you retrieve data you use often. This improves execution speed.
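
A small sketch of persistence: the filtered RDD is cached in memory after the first action, so later actions reuse it instead of recomputing it from the source. The sample data is made up.

from pyspark import SparkContext, StorageLevel

sc = SparkContext("local[*]", "PersistenceExample")

logs = sc.parallelize(["INFO ok", "ERROR disk", "INFO ok", "ERROR net"])
errors = logs.filter(lambda line: line.startswith("ERROR"))

# Keep the frequently used RDD in memory so repeated actions
# do not recompute it from scratch
errors.persist(StorageLevel.MEMORY_ONLY)     # errors.cache() is equivalent

print(errors.count())                                  # first action materializes and caches
print(errors.filter(lambda l: "net" in l).count())     # reuses the cached data

sc.stop()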

e) No limitation:

There is no limit on how many RDDs you can create and use; you can have as many as the available disk and memory allow.

Spark Tutorial – Industries that use Spark.

Apache Spark fulfils the growing need for real-time data analytics. With each passing day, the need for data processing increases. The majority of data lies in an unstructured format and demands the kind of computation capability that Spark provides. Below are some industries that use Spark for real-time analytics.

Banks:

The banking industry has started adopting Spark extensively to make more informed decisions. Banks use Spark to analyse data such as social profiles, call recordings, emails, complaint logs, and forum discussions. Real-time analytics helps the banking sector with decisions such as credit assessment, targeted advertising, and customer segmentation. Fraud detection also becomes much simpler with Spark's real-time analytics.

Healthcare:

Health care is adopting Spark technology to analyse the health condition of critical patients. In some situations, hospitals need to be in touch with each other in times of emergencies such as blood and organ transplantations. Providing medical assistance at the right time is very important because it is a matter of life or death.

Telecommunications :

Companies offering services such as video chat, streaming, and calls use Spark for real-time analytics to improve customer engagement and satisfaction. Through real-time Spark analysis, they can identify the issues that cause customer dissatisfaction and eliminate them.

Government:

Government agencies use Spark to gain real-time analytics in departments that look after national security. Spark technology is used especially in areas like the military and police departments to identify threats and take appropriate measures.

Stock Market:

Real-time analytics helps stockbrokers predict the movement of portfolios. Companies across the world also use these real-time analytics to estimate the demand for their brands.

Spark Built on Hadoop:

There are three ways Spark can be deployed on top of Hadoop:

      • Standalone
      • Hadoop YARN
      • Spark in MapReduce (SIMR)

Standalone:

In standalone deployment, Spark sits on top of the Hadoop Distributed File System (HDFS), and space is allocated for HDFS explicitly. In this mode, Spark and MapReduce run side by side to cover all Spark jobs on the cluster.

Hadoop Yarn:

In this mode, Spark simply runs on YARN without any pre-installation or root access. This helps integrate Spark into the Hadoop ecosystem, so that all other Spark components can run efficiently on Hadoop without added complexity.

Spark in MapReduce (SIMR)

Earlier, it was very challenging to run Spark on MapReduce clusters that did not have YARN installed. Users often needed administrative permission to install Spark on various subsets of machines, which was time-consuming. SIMR was launched to eliminate this obstacle.

SIMR allows a user to run Spark on top of Hadoop MapReduce without having Spark or Scala installed on any of the nodes. To run it, all you need is access to HDFS and MapReduce.

Apache Spark Tutorial – Spark Components:

Spark's components are what make Apache Spark fast and reliable. These components were developed to eliminate the issues that arose while using MapReduce. The following components make Spark stronger:

      • Spark Core
      • Spark Streaming
      • Spark SQL
      • GraphX
      • MLlib (Machine Learning)

Spark Core

Spark Core performs essential functions such as memory management, task scheduling, and interacting with storage systems. It is also home to the API that defines resilient distributed datasets (RDDs). An RDD is a collection of items distributed across many nodes that can be manipulated in parallel.

Spark Streaming

This component processes live streams of data, such as queues of messages posted by users and log files generated by production web servers. The DStream abstraction in Spark Streaming is essentially a series of RDDs, which is what makes processing real-time data possible. Spark Streaming is fault-tolerant and highly scalable.

Spark SQL

Spark SQL comes packaged with Spark and helps in working with structured data. It allows querying data in two ways: SQL and HQL (Hive Query Language). Spark SQL supports data from a wide variety of sources such as Parquet, JSON, and Hive tables. In addition to providing a SQL interface, Spark SQL lets developers intermix SQL queries with programmatic data manipulation in Python, Scala, and Java, all within a single application. This tight integration makes Spark stand out from other open-source data warehouse tools.
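
As a minimal sketch, the example below reads a JSON file into a DataFrame, registers it as a temporary view, and queries it both with SQL and with the DataFrame API. The file name and its schema are assumptions for illustration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()

# "people.json" is a placeholder: one JSON object per line,
# e.g. {"name": "Ann", "age": 34}
people = spark.read.json("people.json")
people.createOrReplaceTempView("people")

# Query the same data with SQL and with the DataFrame API
adults_sql = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults_df = people.filter(people.age >= 18).select("name", "age")

adults_sql.show()
adults_df.show()

spark.stop()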

GraphX

Spark ships with a library called GraphX for manipulating graphs and performing graph-parallel computation. Like Spark Streaming and Spark SQL, GraphX extends the RDD API, in this case with a directed graph abstraction.

To support graph computation, GraphX exposes a set of fundamental operators and an optimized variant of the Pregel API. To make graph analytics easier and more powerful, GraphX also comes with a growing collection of graph algorithms and builders.

MLlib ( Machine Learning Library)

Apache Spark has a built-in library called MLlib which offers a wide range of machine learning algorithms for classification, clustering, collaborative filtering, and more. It also supports model evaluation and data import. In addition, it provides some lower-level primitives such as gradient descent optimization.
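
Here is a minimal sketch of MLlib in use, training a logistic regression classifier on a tiny hand-made dataset. It uses the newer DataFrame-based pyspark.ml API rather than the older RDD-based API, and the data is purely illustrative.

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("MLlibExample").getOrCreate()

# Tiny hand-made training set: a label plus a feature vector per row
train = spark.createDataFrame([
    (0.0, Vectors.dense([0.0, 1.1])),
    (1.0, Vectors.dense([2.0, 1.0])),
    (0.0, Vectors.dense([0.5, 1.3])),
    (1.0, Vectors.dense([2.2, 0.8])),
], ["label", "features"])

# Fit a logistic regression model and apply it back to the training data
lr = LogisticRegression(maxIter=10)
model = lr.fit(train)
model.transform(train).select("label", "prediction").show()

spark.stop()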

Spark Tutorial – Features of Apache Spark:

There is a vast amount of data that needs to be processed immediately to act on the insights residing in it. Organizations across the world are looking for real-time analytics, which is not possible with Hadoop MapReduce. Spark was developed to analyse data on demand and provide instant analytics.

The capabilities mentioned below are what make Spark special:

Speed:

Spark works at lightning speed: it can run applications up to 100 times faster in memory and about 10 times faster on disk than Hadoop MapReduce. Spark reduces read/write operations to disk by keeping intermediate data in memory, and it uses partitioning to decrease network traffic while processing data, so the ultimate result is faster processing.

Supports multiple Languages.

Spark is highly flexible in nature. It supports a range of programming languages such as Java, Scala, and Python, so you can write applications in your preferred language. Spark also provides around 80 high-level operators that help with interactive querying.

Supports multiple formats

Another strength of Spark is that it supports many data formats and sources, including Cassandra, Hive, JSON, and Parquet.

Dynamic in Nature

Apache Spark allows you to develop parallel applications because it contains 80 high-level operators.

Reusability

Spark lets you reuse the same code for batch processing, for joining streams against historical data, and for running ad-hoc queries on the stream state.

Fault tolerance:

Spark is highly fault-tolerant thanks to its core abstraction, the RDD. RDDs are designed to handle the failure of any worker node in the cluster, ensuring that no data is lost when a node fails.

Real-time Stream Processing:

This is one of Spark's best features. Spark allows real-time data processing, whereas Hadoop MapReduce does not: it only processes data that is already stored. The arrival of Spark eliminated this limitation.

Support for deeper analysis:

To provide users with deeper analysis, Apache Spark comes with dedicated tools beyond map and reduce, including machine learning libraries, streaming data support, and declarative/interactive queries.

Cost-effective:

Big data processing with Hadoop requires a large storage capacity and a data centre for replication. Spark needs less storage and is more cost-effective in that regard compared to Hadoop.

Hadoop Integration:

Spark is highly flexible: it can run on the Hadoop YARN cluster manager or independently, and it is built with the ability to read existing Hadoop data.

Spark Tutorial – Limitations:

There are some limitations associated with Apache Spark. Let’s discuss them one by one.

1. No File Management System:

Spark has no file management system of its own, so it has to be integrated with other platforms. To manage files, it relies on Hadoop (HDFS) or other storage platforms such as cloud storage. This is a considerable limitation of Apache Spark.

2. No full Real-Time Data Processing:

Apache Spark does not support true record-by-record real-time stream processing. In live streaming, it divides the incoming data into small batches of RDDs (Resilient Distributed Datasets), applies operations such as map, join, or reduce to process them, and then returns the results in batches again. This makes Spark a micro-batch processing system rather than a fully real-time one.

3. High Cost:

Keeping data in memory for processing is not cheap. For in-memory processing, Spark consumes a lot of memory and requires large amounts of RAM, which adds complexity. Sometimes you have to buy additional memory just to run Spark, which costs more money.

4. Small File Issue:

An RDD often consists of a large number of small files spread across partitions. For efficient processing, the RDD has to be repartitioned into a manageable layout, which requires extensive shuffling over the network.

5. Very Few Algorithms:

Spark MLlib is the library that contains Spark's machine learning algorithms. The number of available algorithms is still relatively small, which adds to the limitations of Apache Spark.

6. Latency:

Apache Spark has comparatively high latency, which lowers throughput. Apache Flink offers lower latency and higher throughput, which makes it better than Spark in this respect.

7. Manual optimization:

In Spark, users have to optimize jobs and datasets manually. To control partitioning, you need to specify the required number of partitions yourself, for example by passing the partition count to the parallelize method. Making efficient use of partitions and caching has to be managed by hand.
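
A small sketch of this manual tuning: the partition count is chosen explicitly when parallelizing, and later adjusted with repartition or coalesce. The numbers are arbitrary.

from pyspark import SparkContext

sc = SparkContext("local[*]", "PartitionExample")

# The number of partitions is chosen by the user when parallelizing
rdd = sc.parallelize(range(1_000_000), numSlices=8)
print(rdd.getNumPartitions())        # 8

# Repartitioning later also has to be requested explicitly
# (repartition shuffles data; coalesce reduces partitions without a full shuffle)
wider = rdd.repartition(16)
narrower = wider.coalesce(4)
print(narrower.getNumPartitions())   # 4

sc.stop()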

8. Window criteria:

Apache Spark supports only time-based window criteria, not record-based window criteria.

Apache Spark Tutorial – Use case:

Big companies around the world, such as Netflix and various online advertisers, are leveraging the potential of Spark technology. Let's consider a couple of real-world examples.

Uber:

This transportation company regularly collects terabytes of data from millions of mobile devices. Using Spark components, Uber built a continuous ETL pipeline that converts raw, unstructured data into structured data and then analyses it to extract insights.

Pinterest:

Pinterest also uses an ETL pipeline to discover, in real time, how users are engaging with pins all over the world, which helps it show customers the content they are interested in.

Conclusion:

Earlier, large-scale computation was a big problem, and Hadoop solved it; in the same fashion, real-time analytics was not possible with Hadoop alone, and Spark emerged as the solution. We have covered what Spark is, how it works in real time, its features, applications, use cases, and more. I hope you found some useful information in this Spark tutorial. Happy learning!