Wednesday, May 6, 2015

Hadoop Cluster modes

There are three different Hadoop cluster modes. They are

1. Standalone mode or Local mode

2. Pseudo-distributed mode

3. Fully distributed mode

This basically indicates how the Hadoop cluster is set up. The minimal setup is a simple stack:

Hadoop Components
JDK
Operating System (UNIX, CentOS etc.)

1. Standalone mode or Local mode

a. This mode is purely used for development purposes

b. Once MapReduce programs are developed and tested, we can migrate them to the PROD environment

c. There will not be any daemons running

d. Hadoop configuration files are not set up, i.e. configuring the name node etc. does not happen here

2. Pseudo-distributed mode

a. This mode is also configured for development purposes

b. All the daemons run on a single machine (Name Node, Data Node, Node Manager, and Resource Manager are all configured)

c. Hadoop configuration files are configured (see the sketch after this list)

3. Fully distributed mode
a. This is the real production environment setup
b. It is configured across multiple machines
c. The machines are configured based on a Master/Slave architecture
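
In practice, the mode a client sees is driven by configuration. As a minimal sketch (assuming a Hadoop client library on the classpath and the usual core-site.xml lookup), the Java snippet below prints fs.defaultFS, which is file:/// in standalone mode and something like hdfs://localhost:9000 in pseudo-distributed mode. The class name ModeCheck is just for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ModeCheck {
    public static void main(String[] args) throws Exception {
        // Loads core-site.xml etc. from the classpath, if present
        Configuration conf = new Configuration();

        // file:///          -> standalone/local mode (no daemons, local file system)
        // hdfs://host:port  -> pseudo- or fully distributed mode
        System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS"));

        FileSystem fs = FileSystem.get(conf);
        System.out.println("File system in use: " + fs.getUri());
    }
}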

Sunday, May 3, 2015

Hadoop Framework

Knowledge shared is knowledge gained :)

Wish you all a happy day, folks. Let us look at the Hadoop architecture in today’s topic.

Hadoop is a framework developed by Apache to handle large volumes of data, i.e. big data.

Have a close look at the picture below.

·       The Hadoop framework is designed based on a Master/Slave architecture

Let me explain the architecture in layman's terms first, and then we will move on to the real terminology used in Hadoop.





Simple explanation

Just imagine a master named Rob who has three slaves named A, B, and C.

·       Rob will note down all of A, B, and C's information in his employee register – where A, B, and C live, how much potential each one possesses, etc.

·       A, B, and C know who the master is after interacting with him

·       Rob will periodically (every hour) check whether A/B/C are performing their duties. If anyone is not replying or is missing, there is something wrong with them

·       Rob is not sure when A, B, or C will leave him. Hence whatever A knows, he orders him to share with B and C. Similarly, B → share with A and C, and C → share with A and B

·       A/B/C, after performing their duties, will update their current amount of work, pending work, etc.

·       Rob as a single person cannot monitor all the employees. He appoints Tom as a resource manager to monitor A/B/C

·       Rob maintains a log of who did what and when

·       If Rob loses his register and log, then it will be very difficult to track A, B, and C's wages, capabilities, past achievements, etc. Hence Bob gets a copy of this information once in a while

Let us map the above scenario to Hadoop terminology

1.    Rob → Master Node or Name Node

2.    A/B/C → Data Nodes or Slave Nodes

3.    Rob maintains A/B/C info → Metadata (FsImage). This has all the information about the cluster's file system

4.    Rob's log → Edit logs, which hold transactional information

5.    Rob checks periodically → Heartbeat (every 3 seconds by default) to make sure a slave is not down

6.    Tom → Resource Manager, the process or daemon that manages slaves A/B/C

7.    Sharing info → Replication factor. This makes sure the data can still be retrieved even if one machine goes down (see the sketch after this list)

8.    Bob → Secondary Name Node. This is a housekeeping node that periodically gets the metadata information from the Name Node

9.    A/B/C performing work → Read/Write operations
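
To make the replication idea concrete, here is a minimal sketch using Hadoop's Java FileSystem API, assuming a running HDFS reachable through the default configuration; the path /user/demo/sample.txt is hypothetical and used only for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file path, for illustration only
        Path file = new Path("/user/demo/sample.txt");

        // How many Data Nodes hold a copy of each block of this file
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Replication factor: " + status.getReplication());

        // Ask the Name Node to keep three copies (A shares with B and C)
        fs.setReplication(file, (short) 3);
    }
}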


Process Explanation

HDFS - Hadoop Distributed File System

                   We know file systems like the UNIX file system, NTFS, and FAT32. Similarly, the Hadoop file system is the type of file system used in the Big Data world. This file system determines

·       How the data is stored

·       How the directory structure is organized

All the Big Data tools developed on top of Hadoop have to comply with the Hadoop file system.
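
To show what "complying with the Hadoop file system" looks like in code, here is a minimal sketch of a directory creation, a write, and a read through Hadoop's Java FileSystem API, assuming a running HDFS; the /user/demo paths are hypothetical.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical directory and file, for illustration only
        Path dir = new Path("/user/demo");
        Path file = new Path(dir, "hello.txt");

        // Directory structure: create a directory, then write a file into it
        fs.mkdirs(dir);
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello HDFS".getBytes(StandardCharsets.UTF_8));
        }

        // Read the file back through the same API
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}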

YARN Processing

                   YARN – Yet Another Resource Negotiator

YARN was introduced in Hadoop 2.0. It is the processing and resource-management layer of the Hadoop framework. Some of its operations (see the sketch after this list) are

·       Tracking the jobs

·       Scheduling the jobs
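
As a small illustration of job tracking, here is a minimal sketch using the YarnClient API, assuming a running Resource Manager and a yarn-site.xml on the classpath; the class name ListYarnApps is just for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class ListYarnApps {
    public static void main(String[] args) throws Exception {
        // Picks up the Resource Manager address from yarn-site.xml
        Configuration conf = new Configuration();

        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Each report is one job/application that YARN is tracking
        for (ApplicationReport app : yarnClient.getApplications()) {
            System.out.println(app.getApplicationId()
                    + "  " + app.getName()
                    + "  " + app.getYarnApplicationState());
        }

        yarnClient.stop();
    }
}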