Big Data Computing – Big Data Enabling Technologies

In the earlier note on Big Data Computing – Introduction, we saw how traditional systems cannot handle the storage and processing of big data generated by various sensors. In this note, we will look at the big data enabling technologies at a very high level. As the course progresses, we will cover each technology in detail from a usability point of view.

Big Data

Big Data refers to large and complex datasets that are difficult to process using traditional tools. A recent survey suggests that 80% of the data created in the world is unstructured. The major problem is how to store and process this enormous amount of data. In the sections below, we will look at the various technologies and enabling frameworks used to store and analyze big data, at a very high level.

Apache Hadoop Ecosystem

Apache Hadoop is an open-source software framework for big data. The initial version of Hadoop had two major components: the Hadoop Distributed File System (HDFS) and MapReduce.

HDFS is the storage system of Hadoop; it splits big data and distributes it across the many nodes of a cluster, providing scalability and fault tolerance. MapReduce, in turn, is a programming model that simplifies parallel programming. It works in two phases: in the first phase, the Map function splits the dataset and converts it into key-value pairs (tuples); in the second phase, the Reduce function combines the tuples that share the same key and aggregates their values accordingly.

Later, Hadoop 2.0 introduced Yet Another Resource Negotiator (YARN) between MapReduce and HDFS, which was absent in Hadoop 1.0. YARN provides flexible scheduling and resource management over HDFS.

Hadoop Ecosystem

Hadoop Distributed File System (HDFS)

HDFS is a distributed file system that runs over a cluster of many nodes. It is designed around the concept of scaling out hardware resources: additional nodes or computers can be added on demand as storage requirements grow. These nodes are typically commodity hardware, which need not be powerful and is prone to failure, so HDFS builds in extensive provisioning for fault tolerance. In effect, HDFS presents a single file system over the whole cluster.
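
As a minimal sketch of how a client interacts with this single file system, the Python snippet below uses the third-party hdfs package (an assumption; the NameNode URL, user, and paths are illustrative placeholders):

# A minimal sketch using the third-party `hdfs` Python package (pip install hdfs).
# The NameNode URL, user, and paths below are illustrative placeholders.
from hdfs import InsecureClient

# Connect to the WebHDFS endpoint of the NameNode.
client = InsecureClient('http://namenode-host:9870', user='hadoop')

# Write a small file; HDFS transparently splits large files into blocks
# and replicates each block across several DataNodes for fault tolerance.
client.write('/user/hadoop/hello.txt', data=b'Hello, HDFS!', overwrite=True)

# Read the file back.
with client.read('/user/hadoop/hello.txt') as reader:
    print(reader.read())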

Yet Another Resource Negotiator (YARN)

YARN manages resources through a resource manager for the different applications. Primarily, it is responsible for allocating system resources to the various applications running in a Hadoop cluster and for scheduling tasks to be executed on different cluster nodes. In doing so, it provides flexible scheduling and resource management over HDFS.

YARN works on the principle of master and slave: the resource manager resides on the master node, and a node manager runs on each slave node. Within the slave nodes, resources are allocated as containers. Applications request resources from the resource manager through YARN, and the resource manager communicates with the node managers to get the work done.
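
One way to observe this resource management from the outside is through the ResourceManager's REST API. The sketch below is an assumption-laden illustration (the host and port are placeholders, and the requests package must be installed):

# A minimal sketch that queries the YARN ResourceManager REST API
# (host/port are placeholders; pip install requests).
import requests

RM = 'http://resourcemanager-host:8088'

# Ask the ResourceManager for cluster-wide resource metrics.
metrics = requests.get(f'{RM}/ws/v1/cluster/metrics').json()['clusterMetrics']
print('Total memory (MB):', metrics['totalMB'])
print('Active node managers:', metrics['activeNodes'])

# List the applications YARN is currently scheduling.
apps = requests.get(f'{RM}/ws/v1/cluster/apps').json()['apps']
for app in (apps or {}).get('app', []):
    print(app['id'], app['state'], app['queue'])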

MapReduce

MapReduce is a programming model for big data computations that simplifies parallel programming using two functions, Map and Reduce. In other words, it is a programming model and an associated implementation for processing and generating large datasets. The user specifies a map function that processes a key-value pair to produce a set of intermediate key-value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
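
The classic example is word count. The toy Python sketch below imitates the model in memory (it is plain Python, not the actual Hadoop API; the documents are made up for illustration):

# A self-contained toy illustration of the MapReduce model: count words.
from itertools import groupby
from operator import itemgetter

def map_fn(document):
    # Map phase: emit an intermediate (key, value) pair for each word.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    # Reduce phase: merge all values that share the same key.
    return (word, sum(counts))

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# Map, then shuffle/sort by key (what Hadoop does between the two phases).
intermediate = sorted(pair for doc in documents for pair in map_fn(doc))
result = [reduce_fn(word, [c for _, c in group])
          for word, group in groupby(intermediate, key=itemgetter(0))]
print(result)  # [('brown', 1), ('dog', 1), ('fox', 2), ...]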

Pig & Hive

Pig and Hive run on top of MapReduce (Hive was originally created at Facebook, and Pig at Yahoo). Through Pig and Hive, we can query data stored in HDFS: Hive supports SQL-like queries, whereas Pig supports scripting or dataflow-style queries. Both augment MapReduce, so its complex programming can be simplified.
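
As a minimal sketch of an SQL-like query against Hive, the snippet below uses the third-party PyHive package (an assumption; the HiveServer2 host, username, and the weblogs table are illustrative placeholders):

# A minimal sketch using PyHive (pip install pyhive) to query HiveServer2.
from pyhive import hive

conn = hive.Connection(host='hiveserver-host', port=10000, username='hadoop')
cursor = conn.cursor()

# HiveQL looks like SQL but is compiled into MapReduce (or Tez/Spark) jobs.
cursor.execute('SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page')
for page, hits in cursor.fetchall():
    print(page, hits)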

There are also non-MapReduce applications, such as Giraph, Storm, Spark, and Flink. These applications use YARN and HDFS directly.

Big Data Technologies

Giraph

Giraph is a graph-processing tool that Facebook uses to analyze social networks and other big-graph computations.

Storm/Spark/Flink

Storm, Spark, and Flink provide real-time, in-memory processing of data on top of YARN and HDFS. In-memory processing can be up to 100 times faster than conventional disk-based processing.

We have seen distributed file systems and query-processing frameworks that store and retrieve data efficiently and effectively. Now we will look at the NoSQL database, which does both jobs together. Big data is mostly stored in the form of key-value pairs, and processing-and-storage solutions of this type are called NoSQL databases. Examples of NoSQL databases are Cassandra, MongoDB, and HBase.

NoSQL

Traditional SQL can handle a large amount of structured data effectively. We need NoSQL (Not Only SQL) to handle unstructured data, which it stores column-wise in the form of key-value pairs. NoSQL does not follow any fixed schema: each row can have its own set of column values. This gives better performance when storing massive amounts of unstructured data.

Hive

Hive is a distributed data management layer for Hadoop. It supports an SQL-like query language, HiveQL (HQL), to access big data and is primarily used for data mining. It runs on top of Hadoop.

Apache Spark

Apache Spark runs over HDFS. It is a big data analytics framework originally developed at the University of California, Berkeley, in 2009, and it has since gained a lot of traction both in academia and in industry. It is a lightning-fast cluster computing technology designed for fast computation.
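
A minimal PySpark sketch of the same word count shown earlier, this time on a cluster (the HDFS input path is a placeholder; pyspark must be installed):

# A minimal PySpark word count (pip install pyspark); the path is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('WordCount').getOrCreate()

# RDDs are kept in memory across operations, which is the main source of
# Spark's speed-up over disk-based MapReduce.
counts = (spark.sparkContext.textFile('hdfs:///user/hadoop/input.txt')
          .flatMap(lambda line: line.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))
print(counts.take(10))
spark.stop()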

HBase

HBase is an open-source, distributed database developed by the Apache Software Foundation. It is modeled on Google's Bigtable design, is primarily written in Java, and can store massive amounts of data, from terabytes to petabytes.
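
As a minimal sketch of HBase's column-family data model, the snippet below uses the third-party happybase package (an assumption; the host, table name, and column family are placeholders, and the users table is assumed to exist with a column family cf):

# A minimal sketch using happybase (pip install happybase) over HBase Thrift.
import happybase

connection = happybase.Connection('hbase-host')
table = connection.table('users')

# HBase stores cells as (row key, column family:qualifier) -> value.
table.put(b'user1', {b'cf:name': b'Alice', b'cf:city': b'Pune'})
row = table.row(b'user1')
print(row[b'cf:name'])
connection.close()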

Spark Streaming

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Streaming data can be ingested from HDFS, Kafka, Flume, TCP sockets, Kinesis, and other sources. Spark MLlib functions and GraphX graph-processing algorithms are fully applicable to streaming data.
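
A minimal sketch of a streaming word count over a TCP socket source (the host and port are placeholders; text can be fed in with, for example, nc -lk 9999):

# A minimal Spark Streaming sketch: word counts over a TCP socket stream.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName='StreamingWordCount')
ssc = StreamingContext(sc, batchDuration=5)  # 5-second micro-batches

lines = ssc.socketTextStream('localhost', 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # print each micro-batch's counts

ssc.start()
ssc.awaitTermination()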

Spark MLlib

Spark MLlib is a distributed machine learning framework on top of Spark Core. MLlib is Spark's scalable machine learning library of standard learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction.
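
As a minimal sketch of one of these algorithms, the snippet below trains a logistic-regression classifier on a tiny in-memory DataFrame (all of the data is made up for illustration):

# A minimal MLlib sketch: logistic regression on a toy DataFrame.
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName('MLlibDemo').getOrCreate()

train = spark.createDataFrame(
    [(Vectors.dense([0.0, 1.1]), 0.0),
     (Vectors.dense([2.0, 1.0]), 1.0),
     (Vectors.dense([2.2, 1.5]), 1.0)],
    ['features', 'label'])

# Fit the model and inspect the learned coefficients.
model = LogisticRegression(maxIter=10).fit(train)
print(model.coefficients)
spark.stop()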

NoSQL Databases 

Apache Cassandra

Cassandra is a highly scalable, distributed, high-performance NoSQL database. It is designed to handle massive amounts of data with its distributed architecture. Data is placed on different machines with a replication factor greater than one, which provides high availability and no single point of failure. Moreover, whereas the Hadoop ecosystem has a single master for all data input, processing, and output, Cassandra supports multiple masters.
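
A minimal sketch using the DataStax cassandra-driver (the contact point, keyspace, and table are illustrative placeholders):

# A minimal Cassandra sketch (pip install cassandra-driver).
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])   # any node can act as the coordinator
session = cluster.connect()

# A replication factor > 1 spreads copies of each row across nodes,
# which is what gives high availability and no single point of failure.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
session.execute("CREATE TABLE IF NOT EXISTS demo.users (id int PRIMARY KEY, name text)")
session.execute("INSERT INTO demo.users (id, name) VALUES (1, 'Alice')")
print(session.execute("SELECT * FROM demo.users").one())
cluster.shutdown()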

Zookeeper

ZooKeeper is a highly reliable distributed coordination kernel that can be used for distributed locking, configuration management, leader election, work queues, and so on.
ZooKeeper is a replicated service that holds the metadata of distributed applications. It is a central key-value store through which distributed systems can coordinate. Since it needs to be able to handle the load, ZooKeeper itself runs on many machines.
ZooKeeper thus acts as a coordination service for all of these different projects. It was created at Yahoo to provide centralized management of synchronization and configuration and to ensure high availability. Together, these projects and open-source frameworks for big data computation are called the Hadoop Ecosystem.
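
As a minimal sketch of this coordination role, the snippet below uses the kazoo client (an assumption; the ensemble address, znode paths, and lock identifier are illustrative placeholders):

# A minimal ZooKeeper sketch using kazoo (pip install kazoo).
from kazoo.client import KazooClient

zk = KazooClient(hosts='127.0.0.1:2181')
zk.start()

# Store a small piece of shared configuration in a znode ...
zk.ensure_path('/app/config')
zk.set('/app/config', b'replicas=3')

# ... and use a distributed lock so only one process acts at a time.
lock = zk.Lock('/app/lock', identifier='worker-1')
with lock:
    data, _ = zk.get('/app/config')
    print(data)

zk.stop()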

References

  1. Big Data Computing, by Prof. Rajiv Misra, IIT Patna.
