Hadoop Framework:
Hadoop Definition
Hadoop is an open-source framework that allows big data to be
stored and processed in a distributed environment across clusters of
computers using simple programming models. It is designed to scale up from
single servers to thousands of machines, each offering local computation and
storage.
Hadoop
is a distributed file system and data processing engine that is designed to
handle extremely high volumes of data in any structure.
Technically, Hadoop consists of two key components:
• The first is the Hadoop Distributed File System (HDFS), which permits the high-bandwidth, cluster-based storage essential for Big Data computing.
• The second is a data processing framework called MapReduce. Based on Google's search technology, it distributes, or "maps", large data sets across multiple servers.
• The partial results produced on each server are then aggregated in the so-termed "Reduce" stage.
• The focus is on supporting redundancy, distributed architectures, and parallel processing.
Hadoop Architecture:
The Hadoop framework includes the following four modules:
· Hadoop Common: Java libraries and utilities required by the other Hadoop modules. These libraries provide file system and OS-level abstractions and contain the necessary Java files and scripts required to start Hadoop.
· Hadoop YARN: A framework for job scheduling and cluster resource management.
· Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
· Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
Together, these four modules make up the Hadoop framework.
MapReduce:
Hadoop MapReduce is a software framework for easily writing applications that process vast amounts of data in parallel, in a reliable manner, on large clusters (thousands of nodes) of commodity hardware.
Hadoop MapReduce programs perform two different tasks:
· The Map Task: It takes the input data and converts it into a set of data in which individual elements are broken down into tuples (key/value pairs).
· The Reduce Task: This task takes the output from a map task as
input and combines those data tuples into a smaller set of tuples. The reduce
task is always performed after the map task.
Typically, both the input and the output are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing any failed tasks.
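To make these two tasks concrete, here is a minimal word-count sketch against the Hadoop MapReduce Java API. The class names (WordCountMapper, WordCountReducer) and the whitespace tokenization are illustrative choices for this example, not part of Hadoop itself.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map task: turn each input line into (word, 1) tuples.
public class WordCountMapper
        extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);       // emit (word, 1)
        }
    }
}

// Reduce task: combine the tuples for each word into a single count.
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();                 // add up the 1s emitted for this word
        }
        context.write(key, new IntWritable(sum));
    }
}
```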
JobTracker & TaskTracker
· The MapReduce framework works in a master/slave fashion.
· It consists of a single master JobTracker and one slave TaskTracker per cluster node.
· The master is responsible for resource management, tracking resource consumption/availability, and scheduling the jobs' component tasks on the slaves, monitoring them and re-executing any failed tasks.
· The slave TaskTrackers execute the tasks as directed by the master and periodically provide task-status information to the master.
· The JobTracker is a single point of failure for the Hadoop MapReduce service, which means that if the JobTracker goes down, all running jobs are halted.
Hadoop Distributed File System (HDFS)
The Hadoop Distributed File System is based on the Google File System (GFS) and provides a distributed file system that is designed to run on large clusters (thousands of computers) of small commodity machines in a reliable, fault-tolerant manner.
HDFS uses a master/slave architecture in which the master consists of a single NameNode that manages the file system metadata, and one or more slave DataNodes store the actual data.
A file in an HDFS namespace is split into several blocks, and those blocks are stored in a set of DataNodes. The NameNode determines the mapping of blocks to the DataNodes.
The DataNodes take care of read and write operations with the file system. They also take care of block creation, deletion, and replication based on instructions given by the NameNode.
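As a rough sketch of the client's view of this split, the snippet below uses Hadoop's Java FileSystem API to write and then read a small file; the NameNode address, file path, and class name are made-up placeholders (and fs.defaultFS is the Hadoop 2.x key; older releases used fs.default.name). The client only sees a file path, while the NameNode resolves it to block locations and the DataNodes serve the actual bytes.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS points at the NameNode; the host and port are placeholders.
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/hello.txt");   // hypothetical path

        // Write: the NameNode allocates blocks, the DataNodes store the bytes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello HDFS".getBytes(StandardCharsets.UTF_8));
        }

        // Read: block locations come from the NameNode, data from the DataNodes.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}
```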
HDFS Architecture
HDFS follows a master/slave architecture and has the following elements.
Namenode:
The namenode is the commodity hardware machine that runs the GNU/Linux operating system and the namenode software; the software itself is designed to run on inexpensive commodity hardware.
The system hosting the namenode acts as the master server and performs the following tasks:
- Manages the file system namespace.
- Regulates clients' access to files.
- Executes file system operations such as renaming, closing, and opening files and directories.
Datanode:
The datanode is commodity hardware running the GNU/Linux operating system and the datanode software. Every node (commodity hardware/system) in a cluster runs a datanode. These nodes manage the data storage of their system.
- Datanodes perform read-write operations on the file system, as per client requests.
- They also perform operations such
as block creation, deletion, and replication according to the instructions
of the namenode.
Block:
The user data is stored in the files of HDFS. A file in the file system is divided into one or more segments, which are stored in individual data nodes. These file segments are called blocks.
In other words, the minimum amount of data that HDFS can read or write is called a block. The default block size is 64 MB, but it can be increased as needed through the HDFS configuration.
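For illustration, the block size can also be set programmatically through Hadoop's Java Configuration API. The 128 MB value, file path, and class name below are made up for the example, and the property name (dfs.blocksize here) varies slightly across Hadoop versions (older releases used dfs.block.size).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Override the default block size (here 128 MB) for files created
        // with this configuration; dfs.blocksize is the Hadoop 2.x property name.
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/large-input.dat");  // hypothetical path

        // Report the block size HDFS would use for this path.
        long blockSize = fs.getDefaultBlockSize(file);
        System.out.println("Block size for " + file + ": " + blockSize + " bytes");
    }
}
```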
Features of HDFS
- It is suitable for distributed storage and processing.
- Hadoop provides a command interface to interact with HDFS.
- The built-in servers of the namenode and datanode help users to easily check the status of the cluster.
- Streaming access to file system data.
- HDFS provides file permissions and authentication.
Goals of HDFS:
· Fault detection and recovery: Since HDFS includes a large number of commodity hardware components, failure of components is frequent. Therefore, HDFS should have mechanisms for quick and automatic fault detection and recovery.
· Huge datasets: HDFS should scale to hundreds of nodes per cluster to manage applications that have huge datasets.
· Hardware at data: A requested task can be done efficiently when the computation takes place near the data. Especially where huge datasets are involved, this reduces network traffic and increases throughput.
The Algorithm
· Generally, the MapReduce paradigm is based on sending the computation to where the data resides.
· A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the reduce stage.
o Map stage: The map or mapper’s job is to process the input data. Generally, the input data is in the form of a file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
o Reduce stage : This stage is the
combination of the Shuffle stage and the Reduce stage.
The Reducer’s job is to process the data that comes from the mapper. After
processing, it produces a new set of output, which will be stored in the HDFS.
· During a MapReduce job,
Hadoop sends the Map and Reduce tasks to the appropriate servers in the
cluster.
· The framework manages
all the details of data-passing such as issuing tasks, verifying task
completion, and copying data around the cluster between the nodes.
· Most of the computing takes place on nodes with the data on local disks, which reduces network traffic.
After completion of the
given tasks, the cluster collects and reduces the data to form an appropriate
result, and sends it back to the Hadoop server.
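The following driver sketch, written against the org.apache.hadoop.mapreduce.Job API, shows how little the application itself has to do beyond wiring the stages together and submitting the job. WordCountDriver, the mapper and reducer classes (from the earlier sketch), and the input/output paths are illustrative placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);     // map stage
        job.setReducerClass(WordCountReducer.class);   // reduce stage (after the shuffle)
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input lives in HDFS; the output directory must not already exist.
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

        // Submit and block until the framework has run all map and reduce tasks.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Scheduling, data movement between nodes, and re-execution of failed tasks are all handled by the framework, as described above; the driver only declares what to run and where the data is.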
Inputs and Outputs
The MapReduce framework operates on <key,
value> pairs, that is, the framework views the input to the job as a set of
<key, value> pairs and produces a set of <key, value> pairs as the
output of the job, conceivably of different types.
The key and the value classes must be serializable by the framework and hence need to implement the Writable
interface. Additionally, the key classes have to implement the WritableComparable
interface to facilitate sorting by the framework (a sketch of such a key follows the table below). Input and output types of a
MapReduce job: (Input) <k1, v1> -> map -> <k2, v2> ->
reduce -> <k3, v3> (Output).
|        | Input          | Output          |
| Map    | <k1, v1>       | list (<k2, v2>) |
| Reduce | <k2, list(v2)> | list (<k3, v3>) |
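To make the serialization and sorting requirement concrete, below is a rough sketch of a custom key type implementing WritableComparable (which extends Writable); the YearStationKey name and its fields are invented for illustration. In practice such a key would usually also override hashCode() and equals() so that the default partitioner distributes it consistently.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// A hypothetical composite key (year, stationId) usable as a MapReduce key:
// it can be serialized by the framework and sorted during the shuffle.
public class YearStationKey implements WritableComparable<YearStationKey> {

    private int year;
    private String stationId = "";

    public YearStationKey() { }                       // required no-arg constructor

    public YearStationKey(int year, String stationId) {
        this.year = year;
        this.stationId = stationId;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(year);                           // serialize the fields
        out.writeUTF(stationId);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        year = in.readInt();                          // deserialize in the same order
        stationId = in.readUTF();
    }

    @Override
    public int compareTo(YearStationKey other) {      // defines the sort order
        int cmp = Integer.compare(year, other.year);
        return cmp != 0 ? cmp : stationId.compareTo(other.stationId);
    }
}
```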
Terminology:
· PayLoad - Applications implement the Map and the Reduce functions, and form the core of the job.
· Mapper - Maps the input key/value pairs to a set of intermediate key/value pairs.
· NameNode - Node that manages the Hadoop Distributed File System (HDFS).
· DataNode - Node where data is presented in advance before any processing takes place.
· MasterNode - Node where the JobTracker runs and which accepts job requests from clients.
· SlaveNode - Node where the Map and Reduce programs run.
· JobTracker - Schedules jobs and tracks the assigned jobs to the TaskTracker.
· TaskTracker - Tracks the task and reports status to the JobTracker.
· Job - An execution of a Mapper and Reducer across a dataset.
· Task - An execution of a Mapper or a Reducer on a slice of data.
· Task Attempt - A particular instance of an attempt to execute a task on a SlaveNode.



