Hadoop Framework:
Hadoop Definition
Hadoop is an open-source framework that allows big data to be
stored and processed in a distributed environment across clusters of
computers using simple programming models. It is designed to scale up from
single servers to thousands of machines, each offering local computation and
storage.
Hadoop
is a distributed file system and data processing engine that is designed to
handle extremely high volumes of data in any structure.
Technically, Hadoop consists of two key components:
• The first is the Hadoop Distributed File System (HDFS), which permits the high-bandwidth, cluster-based storage essential for Big Data computing.
• The second is a data processing framework called MapReduce. Based on Google's search technology, it distributes, or "maps", large data sets across multiple servers.
• The partial results produced on each server are then aggregated in the so-termed "Reduce" stage.
• The focus is on supporting redundancy, distributed architectures, and parallel processing.
Hadoop Architecture:
The Hadoop framework includes the following four modules:
· Hadoop Common: Java libraries and utilities required by the other Hadoop modules. These libraries provide file system and OS-level abstractions and contain the necessary Java files and scripts required to start Hadoop.
· Hadoop YARN: A framework for job scheduling and cluster resource management.
· Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
· Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
Together, these four modules make up the Hadoop framework.
MapReduce:
Hadoop MapReduce is a software framework for easily writing applications that process vast amounts of data in parallel, in a reliable manner, on large clusters (thousands of nodes) of commodity hardware.
Hadoop MapReduce programs perform two different tasks:
· The Map Task: It takes the input data and converts it into a set of data in which individual elements are broken down into tuples (key/value pairs).
· The Reduce Task: This task takes the output from a map task as
input and combines those data tuples into a smaller set of tuples. The reduce
task is always performed after the map task.
Typically, both the input and the output are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing any failed tasks.
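To make these two tasks concrete, here is a minimal word-count sketch against the Hadoop MapReduce Java API. The class names (WordCountMapper, WordCountReducer) and the whitespace tokenization are illustrative choices for this example, not part of Hadoop itself.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map task: turn each input line into (word, 1) tuples.
public class WordCountMapper
        extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);       // emit (word, 1)
        }
    }
}

// Reduce task: combine the tuples for each word into a single count.
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();                 // add up the 1s emitted for this word
        }
        context.write(key, new IntWritable(sum));
    }
}
```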
JobTracker & TaskTracker
· The MapReduce framework works in a master/slave fashion.
· It consists of a single master JobTracker and one slave TaskTracker per cluster node.
· The master is responsible for resource management, tracking resource consumption/availability, and scheduling the jobs' component tasks on the slaves, monitoring them and re-executing any failed tasks.
· The slave TaskTrackers execute the tasks as directed by the master and periodically provide task-status information to the master.
· The JobTracker is a single point of failure for the Hadoop MapReduce service, which means that if the JobTracker goes down, all running jobs are halted.
Hadoop Distributed File System (HDFS)
The Hadoop Distributed File System is based on the Google File System (GFS) and provides a distributed file system that is designed to run on large clusters (thousands of computers) of small commodity machines in a reliable, fault-tolerant manner.
HDFS uses a master/slave architecture in which the master consists of a single NameNode that manages the file system metadata, and one or more slave DataNodes store the actual data.
A file in an HDFS namespace is split into several blocks, and those blocks are stored in a set of DataNodes. The NameNode determines the mapping of blocks to the DataNodes.
The DataNodes take care of read and write operations with the file system. They also take care of block creation, deletion, and replication based on instructions given by the NameNode.
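As a rough sketch of the client's view of this split, the snippet below uses Hadoop's Java FileSystem API to write and then read a small file; the NameNode address, file path, and class name are made-up placeholders (and fs.defaultFS is the Hadoop 2.x key; older releases used fs.default.name). The client only sees a file path, while the NameNode resolves it to block locations and the DataNodes serve the actual bytes.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS points at the NameNode; the host and port are placeholders.
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/hello.txt");   // hypothetical path

        // Write: the NameNode allocates blocks, the DataNodes store the bytes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello HDFS".getBytes(StandardCharsets.UTF_8));
        }

        // Read: block locations come from the NameNode, data from the DataNodes.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}
```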
HDFS Architecture
HDFS follows a master/slave architecture and has the following elements.
Namenode:
The namenode is the commodity hardware machine that runs the GNU/Linux operating system and the namenode software; the software itself is designed to run on inexpensive commodity hardware.
The system hosting the namenode acts as the master server and performs the following tasks:
- Manages the file system namespace.
- Regulates clients' access to files.
- Executes file system operations such as renaming, closing, and opening files and directories.
Datanode:
The datanode is commodity hardware running the GNU/Linux operating system and the datanode software. Every node (commodity hardware/system) in a cluster runs a datanode. These nodes manage the data storage of their system.
- Datanodes perform read-write operations on the file system, as per client requests.
- They also perform operations such
as block creation, deletion, and replication according to the instructions
of the namenode.
Block:
The user data is stored in the files of HDFS. A file in the file system is divided into one or more segments, which are stored in individual data nodes. These file segments are called blocks.
In other words, the minimum amount of data that HDFS can read or write is called a block. The default block size is 64 MB, but it can be increased as needed through the HDFS configuration.
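For illustration, the block size can also be set programmatically through Hadoop's Java Configuration API. The 128 MB value, file path, and class name below are made up for the example, and the property name (dfs.blocksize here) varies slightly across Hadoop versions (older releases used dfs.block.size).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Override the default block size (here 128 MB) for files created
        // with this configuration; dfs.blocksize is the Hadoop 2.x property name.
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/large-input.dat");  // hypothetical path

        // Report the block size HDFS would use for this path.
        long blockSize = fs.getDefaultBlockSize(file);
        System.out.println("Block size for " + file + ": " + blockSize + " bytes");
    }
}
```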
Features of HDFS
- It is suitable for distributed storage and processing.
- Hadoop provides a command interface to interact with HDFS.
- The built-in servers of the namenode and datanode help users to easily check the status of the cluster.
- Streaming access to file system data.
- HDFS provides file permissions and authentication.
Goals of HDFS:
· Fault detection and recovery: Since HDFS includes a large number of commodity hardware components, failure of components is frequent. Therefore, HDFS should have mechanisms for quick and automatic fault detection and recovery.
· Huge datasets: HDFS should scale to hundreds of nodes per cluster to manage applications that have huge datasets.
· Hardware at data: A requested task can be done efficiently when the computation takes place near the data. Especially where huge datasets are involved, this reduces network traffic and increases throughput.
The Algorithm
· Generally, the MapReduce paradigm is based on sending the computation to where the data resides.
· A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the reduce stage.
o Map stage: The map or mapper’s job is to process the input data. Generally, the input data is in the form of a file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
o Reduce stage : This stage is the
combination of the Shuffle stage and the Reduce stage.
The Reducer’s job is to process the data that comes from the mapper. After
processing, it produces a new set of output, which will be stored in the HDFS.
· During a MapReduce job,
Hadoop sends the Map and Reduce tasks to the appropriate servers in the
cluster.
· The framework manages
all the details of data-passing such as issuing tasks, verifying task
completion, and copying data around the cluster between the nodes.
· Most of the computing takes place on nodes with the data on local disks, which reduces network traffic.
After completion of the
given tasks, the cluster collects and reduces the data to form an appropriate
result, and sends it back to the Hadoop server.
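The following driver sketch, written against the org.apache.hadoop.mapreduce.Job API, shows how little the application itself has to do beyond wiring the stages together and submitting the job. WordCountDriver, the mapper and reducer classes (from the earlier sketch), and the input/output paths are illustrative placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);     // map stage
        job.setReducerClass(WordCountReducer.class);   // reduce stage (after the shuffle)
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input lives in HDFS; the output directory must not already exist.
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

        // Submit and block until the framework has run all map and reduce tasks.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Scheduling, data movement between nodes, and re-execution of failed tasks are all handled by the framework, as described above; the driver only declares what to run and where the data is.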
Inputs and Outputs
The MapReduce framework operates on <key,
value> pairs, that is, the framework views the input to the job as a set of
<key, value> pairs and produces a set of <key, value> pairs as the
output of the job, conceivably of different types.
The key and the value classes must be serializable by the framework and hence need to implement the Writable
interface. Additionally, the key classes have to implement the WritableComparable
interface to facilitate sorting by the framework (a sketch of such a key follows the table below). Input and output types of a
MapReduce job: (Input) <k1, v1> -> map -> <k2, v2> ->
reduce -> <k3, v3> (Output).
|        | Input          | Output          |
| Map    | <k1, v1>       | list (<k2, v2>) |
| Reduce | <k2, list(v2)> | list (<k3, v3>) |
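To make the serialization and sorting requirement concrete, below is a rough sketch of a custom key type implementing WritableComparable (which extends Writable); the YearStationKey name and its fields are invented for illustration. In practice such a key would usually also override hashCode() and equals() so that the default partitioner distributes it consistently.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// A hypothetical composite key (year, stationId) usable as a MapReduce key:
// it can be serialized by the framework and sorted during the shuffle.
public class YearStationKey implements WritableComparable<YearStationKey> {

    private int year;
    private String stationId = "";

    public YearStationKey() { }                       // required no-arg constructor

    public YearStationKey(int year, String stationId) {
        this.year = year;
        this.stationId = stationId;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(year);                           // serialize the fields
        out.writeUTF(stationId);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        year = in.readInt();                          // deserialize in the same order
        stationId = in.readUTF();
    }

    @Override
    public int compareTo(YearStationKey other) {      // defines the sort order
        int cmp = Integer.compare(year, other.year);
        return cmp != 0 ? cmp : stationId.compareTo(other.stationId);
    }
}
```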
Terminology:
· PayLoad - Applications implement the Map and the Reduce functions, and form the core of the job.
· Mapper - Maps the input key/value pairs to a set of intermediate key/value pairs.
· NameNode - Node that manages the Hadoop Distributed File System (HDFS).
· DataNode - Node where data is presented in advance before any processing takes place.
· MasterNode - Node where the JobTracker runs and which accepts job requests from clients.
· SlaveNode - Node where the Map and Reduce programs run.
· JobTracker - Schedules jobs and tracks the assigned jobs to the TaskTracker.
· TaskTracker - Tracks the task and reports status to the JobTracker.
· Job - An execution of a Mapper and Reducer across a dataset.
· Task - An execution of a Mapper or a Reducer on a slice of data.
· Task Attempt - A particular instance of an attempt to execute a task on a SlaveNode.



