Today’s leading technology for Big Data analytics is Hadoop, an open-source framework for reliable, scalable, distributed computing.
Hadoop Stack for Big Data
Apache Hadoop is an open-source software framework for the storage and large-scale processing of data on clusters of commodity hardware. A Hadoop cluster consists of hundreds or thousands of commodity machines with appropriate connectivity and storage capabilities. Storage and processing are distributed across the cluster because the datasets are enormous and cannot be stored on a single node. Crucially, the computation is moved to where the data resides.
Apache Hadoop MapReduce and HDFS were originally derived from Google’s MapReduce and the Google File System (GFS).
Fundamental design decisions, such as reliability, scalability, keeping all the data and applying the schema on read, and moving computation to the data, make Hadoop a leading technology.
Processing
Hadoop started as a simple batch-processing framework. The idea behind Hadoop is to move computation to the data rather than moving data to the computation. In other words, instead of storing data in a distributed fashion and then pulling it back to a central system for analysis, Hadoop ships the computation to the nodes where the data resides and passes only the results back to the master node.
Scalability
Scalability is at the core of the Hadoop architecture. It follows a scale-out approach: more commodity nodes can be added as the requirements for data storage and computation grow.
Reliability
Hardware failures are handled automatically. Whether we consider an individual machine, a rack of machines, or a large cluster or supercomputer, they all fail at some point in time, or some of their components will. These failures are so common that they must be accounted for ahead of time, and the Hadoop framework handles them internally.
Apache Hadoop Framework and its basic modules
- Hadoop Common: It contains libraries and utilities needed by other Hadoop modules.
- Hadoop Distributed File System (HDFS): It is a distributed file system that stores data on commodity machines and provides very high aggregate bandwidth across the entire cluster.
- Hadoop YARN: It is a resource-management platform responsible for managing compute resources in the cluster and scheduling users' applications.
- Hadoop MapReduce: It is a programming model for processing large datasets in parallel across many nodes. In other words, it ensures the processing of data takes place where the data is stored.
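To make these modules concrete, the following minimal Java sketch shows how an application sees them: Hadoop Common supplies the Configuration object, HDFS is addressed through the fs.defaultFS property, and the YARN/MapReduce properties choose where jobs run. The host names and port below are placeholders, not settings from any particular cluster.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ModulesSketch {
    public static void main(String[] args) throws Exception {
        // Hadoop Common: a Configuration object carries the cluster settings.
        Configuration conf = new Configuration();

        // HDFS: the default file system URI (host name and port are placeholders).
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        // YARN + MapReduce: run MapReduce jobs on the YARN resource manager.
        conf.set("mapreduce.framework.name", "yarn");
        conf.set("yarn.resourcemanager.hostname", "resourcemanager.example.com");

        // An HDFS client handle obtained through the Hadoop Common configuration.
        try (FileSystem fs = FileSystem.get(URI.create("hdfs://namenode.example.com:8020"), conf)) {
            System.out.println("Home directory: " + fs.getHomeDirectory());
        }
    }
}
```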

High-Level Architecture of Hadoop
Hadoop uses an asymmetric master/slave communication model in which one process (the master) controls one or more other processes (the slaves). The major pieces of Hadoop are the Hadoop Distributed File System (HDFS) and MapReduce, a parallel-processing framework that maps and reduces data. Both are open source and inspired by technologies developed at Google.

The HDFS layer consists of a NameNode and DataNodes. The NameNode resides on the master node and controls all the DataNodes in the cluster; it also interacts with the MapReduce layer.
The MapReduce layer has two sub-components, the JobTracker and the TaskTracker. TaskTrackers run on every node in the cluster, including the master, whereas the JobTracker runs only on the master node and communicates with the TaskTrackers on the slave nodes. In the map phase, the computation is sent to wherever the data is stored; in the reduce phase, the results of those computations are collected and aggregated for each data item.
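The map and reduce phases described above are expressed as ordinary Java classes. The word-count sketch below is the standard introductory example rather than anything cluster-specific: the mapper runs on the nodes holding the input blocks and emits (word, 1) pairs, and the reducer aggregates the counts collected for each word.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: runs on the nodes holding the input blocks and emits (word, 1) pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reduce phase: aggregates the counts collected for each word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```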
Hadoop Distributed File System (HDFS)
It is a distributed, scalable, and portable file system written in Java for the Hadoop framework. A Hadoop instance typically has a single NameNode (residing on the master node) and a set of DataNodes (residing on both master and slave nodes) that together form the HDFS cluster.
HDFS stores large files, typically ranging from gigabytes to terabytes or even petabytes, across multiple machines. It achieves reliability by replicating data across multiple hosts (slave nodes), and therefore it does not require RAID storage on the hosts. As storage requirements grow, more slave nodes can be added gradually.

At a very high level there are two types of nodes: the name node and the data nodes. The name node sits on the master system and is the node clients communicate with, while the data nodes reside across the cluster. The data nodes receive instructions from the name node and act on them.
In the above diagram, each client communicates with the name node, and the name node redirects it to the nearest data node for its read or write request. External applications and clients talk to the name node, which serves each request and manages the internal data nodes across the cluster. It handles data replication and balancing, and it periodically collects status reports (heartbeats) from the data nodes. A very high-level data flow diagram is shown below.
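The read and write paths can also be seen from the client side through the HDFS FileSystem API. The sketch below assumes a placeholder name node address and file path; the point is that the client contacts the name node for metadata and block locations, while the actual bytes flow to and from the data nodes.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // placeholder address

        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/user/demo/greeting.txt"); // illustrative path

            // Write: the client asks the name node for target data nodes,
            // then streams the bytes directly to those data nodes.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Read: the client asks the name node for block locations,
            // then reads from the nearest data node holding each block.
            try (BufferedReader reader =
                    new BufferedReader(new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
                System.out.println(reader.readLine());
            }
        }
    }
}
```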

A heartbeat is the mechanism through which the name node obtains the status of each individual data node in the cluster. Each data node and its replicas hold some piece of the file data. If a data node or one of its replicas fails, the name node removes it from the cluster and creates a replacement replica as part of the fault-tolerance policy; it then brings that replacement up to date with all the updates the failed data node had received, using a surviving backup data node as the source.
Replication is the process of duplicating a primary data node's content and creating multiple backup copies on other data nodes according to the replication policy.
Balancing is the process by which data is evenly distributed across the cluster.
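Replication and heartbeats are governed by configuration. The sketch below uses the standard HDFS property names (dfs.replication and dfs.heartbeat.interval) with illustrative values; the per-file path is a placeholder.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Replication policy: keep three copies of each block (the usual default).
        conf.setInt("dfs.replication", 3);

        // Heartbeat interval, in seconds, between each data node and the name node.
        conf.setInt("dfs.heartbeat.interval", 3);

        try (FileSystem fs = FileSystem.get(conf)) {
            // Replication can also be adjusted per file after it is written.
            fs.setReplication(new Path("/user/demo/greeting.txt"), (short) 2);
        }
    }
}
```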
The name node, which holds the complete metadata about the data nodes and all updates received from HDFS clients, can be a single point of failure. To reduce this risk, HDFS maintains a secondary name node that is created automatically and kept updated with periodic checkpoints of the primary's metadata; if the primary fails, this checkpointed state is used to bring a replacement name node back up (later Hadoop versions add a standby name node for automatic failover).
MapReduce Engine
The MapReduce engine is used to distribute work across the cluster. A typical MapReduce engine consists of a job tracker, to which client applications submit MapReduce jobs; the job tracker pushes the work out to the available task trackers, and the task trackers use the data nodes for data retrieval and storage.
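A job enters the engine through a small driver program. The sketch below submits the mapper and reducer from the earlier word-count example; the input and output paths are placeholders. On classic MapReduce the job goes to the job tracker, on YARN to the resource manager, but the driver code is the same.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);      // mapper from the earlier sketch
        job.setCombinerClass(WordCountReducer.class);   // optional local aggregation
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Placeholder HDFS paths for job input and output.
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

        // Submitting the job hands it to the job tracker (or YARN resource manager),
        // which schedules the map and reduce tasks on the task trackers / node managers.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```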
