
Apache Spark – Distributed Systems

Part of my course requirements.

Apache Spark™

fast and general engine for large-scale data processing

 

Apache Spark is an open-source, distributed, general-purpose cluster computing framework with a (mostly) in-memory data processing engine. It can do ETL, analytics, machine learning, and graph processing on large volumes of data at rest (batch processing) or in motion (stream processing), with rich, concise, high-level APIs for Scala, Python, Java, R, and SQL.

Figure 1. The Spark Platform

You could also describe Spark as a distributed data processing engine for batch and streaming modes, featuring SQL queries, graph processing, and machine learning.

In contrast to Hadoop’s two-stage, disk-based MapReduce computation engine, Spark’s multi-stage, (mostly) in-memory computing engine runs most computations in memory, and hence usually provides better performance for certain applications, e.g. iterative algorithms or interactive data mining (see “Spark officially sets a new record in large-scale sorting”).
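To make the in-memory point concrete, here is a minimal sketch (in Scala, Spark’s native language) of an iterative job that caches its input once and then re-reads it from memory on every pass; the file path, the iteration count, and the threshold logic are all placeholders, not a real algorithm:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object IterativeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("iterative-sketch").setMaster("local[*]"))

    // Load once, then pin the partitions in memory; "data.txt" is a placeholder.
    val points = sc.textFile("data.txt").map(_.toDouble).cache()

    // Each pass scans the cached partitions instead of re-reading from disk,
    // which is exactly where Spark gains over a chain of disk-based MapReduce jobs.
    var threshold = 0.0
    for (_ <- 1 to 10) {
      // Assumes some values always exceed the threshold; a sketch, not production code.
      threshold = points.filter(_ > threshold).mean()
    }
    println(s"final threshold: $threshold")
    sc.stop()
  }
}
```

Without the `cache()` call, every iteration would re-read `data.txt` from storage, which is roughly what the equivalent MapReduce pipeline has to do.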

Spark aims at speed, ease of use, extensibility, and interactive analytics.

Spark is often called a cluster computing engine or simply an execution engine.

Spark is a distributed platform for executing complex multi-stage applications, like machine learning algorithms and interactive ad hoc queries. Spark provides an efficient abstraction for in-memory cluster computing called the Resilient Distributed Dataset (RDD).
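As a hedged sketch of what working with RDDs looks like (assuming the `sc` SparkContext that spark-shell provides), the values and partition count below are arbitrary:

```scala
// sc is the SparkContext that spark-shell creates for you.
val numbers = sc.parallelize(1 to 1000, numSlices = 8) // an RDD split into 8 partitions

// Transformations are lazy; this line builds a lineage, it does not compute anything.
val evenSquares = numbers.map(n => n * n).filter(_ % 2 == 0)

// An action finally triggers the distributed computation.
println(evenSquares.count())
```

The lineage (the recorded chain of transformations) is what makes the dataset resilient: a lost partition can be recomputed from its ancestors instead of having to be replicated.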

Through its application frameworks, Spark simplifies access to machine learning and predictive analytics at scale.

Spark is mainly written in Scala, but provides developer APIs for languages like Java, Python, and R.

Note
Microsoft’s Mobius project provides a C# API for Spark, “enabling the implementation of Spark driver program and data processing operations in the languages supported in the .NET framework like C# or F#.”

If you have large amounts of data that require low-latency processing that a typical MapReduce program cannot provide, Spark is a viable alternative. Two common drivers are:

  • Access to any data type across any data source.

  • The huge demand for storage and data processing.

The Apache Spark project is an umbrella for SQL (with Datasets), streaming, machine learning (pipelines) and graph processing engines built atop Spark Core. You can run them all in a single application using a consistent API.
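A brief sketch of what “a single application with a consistent API” can look like, using the `SparkSession` entry point introduced in Spark 2.0; the table and column names are made up:

```scala
import org.apache.spark.sql.SparkSession

object UnifiedSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("unified-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // The same engine underneath: a DataFrame built from a local collection...
    val people = Seq(("alice", 34), ("bob", 29)).toDF("name", "age")
    people.createOrReplaceTempView("people")

    // ...queried with SQL, then handed back to the core RDD API.
    val adults = spark.sql("SELECT name FROM people WHERE age > 30")
    adults.rdd.collect().foreach(println)

    spark.stop()
  }
}
```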

Spark runs locally as well as in clusters, on premises or in the cloud. It runs on top of Hadoop YARN or Apache Mesos, standalone, or in the cloud (Amazon EC2 or IBM Bluemix).

Spark can access data from many data sources.

Apache Spark’s Streaming and SQL programming models with MLlib and GraphX make it easier for developers and data scientists to build applications that exploit machine learning and graph analytics.
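As one hedged example on the MLlib side, the Pipeline API chains feature extraction and a model into a single estimator; the toy data and column names below are assumptions:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pipeline-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // A toy labeled dataset; real training data would come from a data source.
    val training = Seq(
      (0L, "spark is fast", 1.0),
      (1L, "hadoop map reduce", 0.0)
    ).toDF("id", "text", "label")

    // Tokenize -> hash into feature vectors -> fit a logistic regression model.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)
    val model = new Pipeline().setStages(Array(tokenizer, hashingTF, lr)).fit(training)

    model.transform(training).select("text", "prediction").show()
    spark.stop()
  }
}
```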

At a high level, any Spark application creates RDDs out of some input, runs (lazy) transformations of these RDDs to some other form (shape), and finally performs actions to collect or store the data. Not much, huh?
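The canonical word count shows that whole lifecycle in a few lines; assuming spark-shell’s `sc`, with a placeholder input path:

```scala
val counts = sc.textFile("input.txt")   // create an RDD from some input (placeholder path)
  .flatMap(_.split("\\s+"))             // transformation: lines -> words (lazy)
  .map(word => (word, 1))               // transformation: word -> (word, 1) (lazy)
  .reduceByKey(_ + _)                   // transformation: sum counts per word (lazy)

counts.take(10).foreach(println)        // action: only now does anything execute
```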

You can look at Spark from a programmer’s, a data engineer’s, or an administrator’s point of view. And to be honest, all three will spend quite a lot of their time with Spark before they can exploit all of its features. Programmers use the language-specific APIs (working at the level of RDDs with transformations and actions), data engineers use higher-level abstractions like the DataFrame or Pipeline APIs or external tools (that connect to Spark), and all of it only runs because administrators have set up Spark clusters for the applications to be deployed to.

It is Spark’s goal to be a general-purpose computing platform with various specialized application frameworks on top of a single unified engine.


Why Spark

Let’s list a few of the many reasons for choosing Spark. The reasons come first; a more technical overview follows.

Easy to Get Started

Spark offers spark-shell, which makes for a very easy head start to writing and running Spark applications on the command line, right on your laptop.
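A first session might look like the sketch below; `README.md` is just a placeholder for any local text file. Inside spark-shell the SparkContext is already available as `sc`:

```scala
// Started with ./bin/spark-shell; sc comes predefined.
val lines = sc.textFile("README.md")           // placeholder file on your laptop
lines.filter(_.contains("Spark")).count()      // how many lines mention Spark?
lines.map(_.length).reduce(_ max _)            // length of the longest line
```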

You could then use the Spark Standalone built-in cluster manager to deploy your Spark applications to a production-grade cluster and run them on a full dataset.
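The same logic, packaged as a self-contained application, is what you would hand to spark-submit. In the hedged sketch below the master URL and input path are placeholders, and in practice the master is usually supplied via spark-submit’s --master flag rather than hard-coded:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ProductionJob {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("production-job")
      .setMaster("spark://master-host:7077") // placeholder Standalone master URL

    val sc = new SparkContext(conf)
    val events = sc.textFile("hdfs:///data/events/*") // placeholder path to the full dataset
    println(s"events: ${events.count()}")
    sc.stop()
  }
}
```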
