What is the Difference between Spark and Hadoop?

Nov. 2, 2023, 4:36 p.m.

In this post I will be covering the differences and similarities between two important, and related, technologies in the field of Big Data: Apache Hadoop and Apache Spark.  I will also touch on Hive as it relates to Hadoop.

 

Apache Hadoop

 

To start with, let us examine what exactly Apache Hadoop is.  Hadoop is a software framework which allows data processing to be distributed across clusters of computers, letting us process data at a far greater scale.  When using Hadoop we are not reliant on any individual piece of hardware; instead, we rely on the framework to distribute the load across a cluster of machines and to detect and handle failures of individual computers.

 

Hadoop solves the problem of dealing with massive amounts of data. It breaks jobs down into smaller workloads which can be handled in parallel by the aforementioned cluster of computers.

 

There are four modules that make up the Hadoop framework:

 

1. Hadoop Distributed File System (HDFS)

As the name implies, this is a file system distributed across the nodes in a cluster.  Nodes (computers) operate on the data that resides in their local storage, so data does not need to be shipped across the network before it can be processed.  This creates a system with very high throughput.

 

2. Yet Another Resource Negotiator (YARN)

YARN is how Hadoop manages the compute resources of the cluster.  It handles the scheduling of jobs and the allocation of resources (CPU and memory) to applications across the Hadoop system.

 

3.  MapReduce

This is a framework, built on top of YARN, which organizes the parallel processing of large data sets.  In the MapReduce model, data is broken down into smaller subsets which are distributed to nodes along with instructions for processing (the map step); the partial results from each node are then combined into the final output (the reduce step).  A minimal word-count sketch follows this list.

 

4.  Hadoop Common

The common utilities that support the other modules.
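
To make the MapReduce model concrete, here is a minimal word-count sketch written in Python for Hadoop Streaming, which lets you supply the map and reduce steps as plain scripts that read stdin and write stdout.  The script names and the paths in the usage note are hypothetical.

#!/usr/bin/env python3
# mapper.py -- the map step: emit "<word>\t1" for every word in the input
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

#!/usr/bin/env python3
# reducer.py -- the reduce step: sum the counts for each word
# (Hadoop sorts the mapper output by key before it reaches the reducer)
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

You would then submit the two scripts with the hadoop-streaming jar, pointing its -input and -output options at HDFS paths and passing mapper.py and reducer.py as the -mapper and -reducer; YARN schedules the resulting map and reduce tasks across the cluster.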

 

Hive

 

Hive is a data warehousing system built on top of Hadoop.  Hive provides the user with a SQL-like interface (HiveQL).  Without Hive you would need to write a complex Hadoop job to return even simple query results; Hive abstracts away that complexity by compiling your queries into the underlying Hadoop jobs for you.
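
As a rough illustration of that SQL-like interface, here is a minimal sketch that queries Hive from Python.  It assumes the third-party PyHive client and a running HiveServer2 endpoint; the host, port, and page_views table are hypothetical.

# A minimal sketch of querying Hive from Python.
# Assumes the third-party PyHive package and a running HiveServer2;
# the connection details and table name are hypothetical.
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000, database="default")
cursor = conn.cursor()

# An ordinary SQL-like (HiveQL) query; Hive turns it into distributed Hadoop work.
cursor.execute("SELECT country, COUNT(*) AS views FROM page_views GROUP BY country")
for country, views in cursor.fetchall():
    print(country, views)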

 

Hadoop has several benefits: its distributed model scales easily to handle large data loads; it is relatively low cost; because Hadoop does not require pre-processing of data before storage, there is a great deal of flexibility in what can be stored; and finally, because Hadoop does not rely on any one piece of hardware, it is a highly resilient system.

 

Apache Spark

 

It follows that we should now examine Apache Spark.  Apache Spark is a more modern technology than Hadoop, but it is in the same family.  Spark is an engine, available in multiple programming languages (Scala, Java, Python, R, and SQL), that facilitates data engineering, data science, and machine learning.

 

Some key features of Spark are: the ability to process data in batches or as a stream; fast execution of SQL queries for dashboards or ad hoc analytics; and the ability to perform exploratory data analysis on huge data sets.  You can develop machine learning code on your local machine and then scale the same code to thousands of nodes on fault-tolerant clusters.
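
As a quick taste of what that looks like in practice, here is a minimal PySpark sketch that starts a local SparkSession and runs a small batch aggregation; the data is made up purely for illustration.

# A minimal PySpark sketch: start a local session and run a batch aggregation.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("spark-intro").getOrCreate()

# A tiny, made-up batch of data; in practice this would come from files or tables.
sales = spark.createDataFrame(
    [("books", 12.0), ("games", 20.0), ("books", 7.5)],
    ["category", "amount"],
)

# The same code runs unchanged on a laptop or on a large cluster.
sales.groupBy("category").sum("amount").show()

spark.stop()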

 

Key components of Apache Spark:

1.  Spark Core

This coordinates the basic functions of Apache Spark: memory management, fault recovery, task scheduling and distribution, and interaction with storage systems.

 

2.  Spark SQL

This is Spark's module for structured data: it lets you query data held in Spark's distributed memory (or in external storage) using either SQL or the DataFrame API.  A short sketch follows this list.

 

3.  Spark Streaming

Handles real-time data by splitting the incoming stream into a continuous series of very small batches (micro-batches), which are then processed by the same engine used for batch work.  A streaming sketch follows this list.

 

4.  Machine Learning Library (MLlib)

MLlib provides scalable implementations of common machine learning algorithms (classification, regression, clustering, and collaborative filtering), along with utilities for building machine learning pipelines.  A small training sketch follows this list.

 

5.  GraphX

A library for graphs and graph-parallel computation, used to process data that is naturally modelled as vertices and edges (social networks, for example).
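
To make Spark SQL concrete, here is a minimal sketch (with made-up data) that registers a DataFrame as a temporary view and queries it with ordinary SQL.

# A minimal Spark SQL sketch: register a DataFrame as a view and query it with SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("spark-sql-demo").getOrCreate()

people = spark.createDataFrame(
    [("Ada", 36), ("Grace", 45), ("Linus", 29)],
    ["name", "age"],
)
people.createOrReplaceTempView("people")

# The query runs on Spark's distributed engine, wherever the data actually lives.
spark.sql("SELECT name FROM people WHERE age > 30").show()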
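
For streaming, here is a sketch using Spark's newer Structured Streaming API: it reads lines from a local socket (the host and port are chosen purely for illustration), counts words, and prints updated counts as each micro-batch arrives.

# A minimal streaming sketch: word counts over a socket source, processed in micro-batches.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.master("local[*]").appName("streaming-demo").getOrCreate()

# The host and port are illustrative; feed the socket with `nc -lk 9999` when testing locally.
lines = (spark.readStream.format("socket")
         .option("host", "localhost").option("port", 9999).load())

words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Print the running counts to the console as each micro-batch is processed.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()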
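
And for MLlib, a minimal training sketch on a tiny, made-up dataset; the same code runs unchanged whether the data fits on a laptop or is spread across a cluster.

# A minimal MLlib sketch: train a logistic regression model on made-up data.
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.master("local[*]").appName("mllib-demo").getOrCreate()

training = spark.createDataFrame(
    [(0.0, Vectors.dense([0.0, 1.1, 0.1])),
     (1.0, Vectors.dense([2.0, 1.0, -1.0])),
     (0.0, Vectors.dense([0.1, 1.3, 0.2])),
     (1.0, Vectors.dense([1.9, 0.8, -0.9]))],
    ["label", "features"],
)

lr = LogisticRegression(maxIter=10, regParam=0.01)
model = lr.fit(training)
model.transform(training).select("label", "prediction").show()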

 

Key Differences between Hadoop and Spark

 

The main difference between Hadoop and Spark is that Hadoop's processing engine, MapReduce, only handles batch processing, whereas Spark is newer and can handle both batch workloads and streaming data.

 

Another thing to note is that Spark does not actually have its own native file system.  A common approach is to run Spark on top of Hadoop's filesystem (HDFS), but you could also use another data storage option.
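
For example, pointing Spark at HDFS is simply a matter of using an hdfs:// path when reading; the namenode address and file path below are hypothetical placeholders for your own cluster.

# A sketch of Spark reading directly from HDFS; host, port, and path are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-on-hdfs").getOrCreate()
events = spark.read.json("hdfs://namenode:8020/data/events/2023/*.json")
events.printSchema()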

 

Spark is also significantly faster than Hadoop MapReduce for many workloads, because Spark keeps intermediate data in RAM and only writes back to external storage once a task completes.  This is not the case with Hadoop, which writes data back to external storage after each processing step.
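
You can see this design in Spark's API: a DataFrame can be explicitly cached in memory and then reused by several computations without going back to external storage.  A minimal sketch, with a hypothetical input path:

# A minimal sketch of in-memory reuse: cache a DataFrame once, query it several times.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("cache-demo").getOrCreate()

# Hypothetical input path; any batch source would do.
logs = spark.read.text("/tmp/app-logs.txt").cache()

# Both of these reuse the cached in-memory copy instead of re-reading from storage.
total = logs.count()
errors = logs.filter(logs.value.contains("ERROR")).count()
print(total, errors)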

 

Hadoop scales more cheaply than Spark because it stores data on hard disks, so to scale up you simply add more nodes with inexpensive disk.  Scaling Spark tends to be more expensive because its in-memory processing benefits from large amounts of RAM.

 

So there you have it:  A complete breakdown of the similarities and differences between two great tools for processing large amounts of data.