Saturday, May 16, 2015

Hadoop Ecosystem


 
The Hadoop ecosystem is made up of many components and tools. The tools used in the Hadoop framework are quite different from those used in traditional data analytics.

I will briefly describe each component of the Hadoop ecosystem. Before moving on, I want to give a little insight into the difference between Hadoop 1 and Hadoop 2, because it was Hadoop 2's architecture that allowed a variety of tools to be added to the Hadoop ecosystem.

There is one main difference between Hadoop 1.0 and Hadoop 2.0.

·       In Hadoop 1.0,

o   MapReduce is the only processing framework supported

o   Cluster operations and data operations are both performed by MapReduce

·       In Hadoop 2.0,

o   Many data processing frameworks are supported

o   Cluster operations are handled by YARN

o   Data operations are handled by data processing frameworks such as MapReduce

If you compare Hadoop 1.0 and Hadoop 2.0 closely, you will notice that in Hadoop 2.0 the data processing frameworks are separated from cluster operations.
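To make "data operations" a little more concrete, here is a minimal in-memory sketch of the MapReduce programming model, using the classic word count. It is plain Python and does not use any Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration. On a real cluster the framework runs these phases across many machines, and in Hadoop 2.0 YARN decides where they run.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["Hadoop stores data", "Hadoop processes data"]))
```

The point of the sketch is only the shape of the computation: the map and reduce functions describe what to do with the data, and say nothing about which machine does it.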

 

 

 
 
 
 

 
 
 
Hadoop Ecosystem
Data Loading:
These tools are used to feed big data into the Hadoop cluster
·       Sqoop → Structured data, such as tables from relational databases
·       Flume → Streaming data, such as logs and feeds from Twitter, Facebook, LinkedIn, XML, etc.
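As a rough illustration of what loading a database table with Sqoop looks like, the sketch below only builds the command line for a "sqoop import" and prints it; it does not run anything. The helper name build_sqoop_import, the JDBC URL, the table name and the target directory are all hypothetical. Actually running the command would need a Sqoop installation and a matching JDBC driver.

```python
import shlex


def build_sqoop_import(jdbc_url, table, target_dir, username=None, num_mappers=1):
    """Return the argument list for a 'sqoop import' of one database table."""
    args = [
        "sqoop", "import",
        "--connect", jdbc_url,          # JDBC connection string of the source database
        "--table", table,               # table to copy into Hadoop
        "--target-dir", target_dir,     # HDFS directory that receives the data
        "--num-mappers", str(num_mappers),  # number of parallel map tasks
    ]
    if username:
        args += ["--username", username]
    return args


if __name__ == "__main__":
    # Hypothetical database, table and HDFS path, for illustration only.
    cmd = build_sqoop_import("jdbc:mysql://dbhost/sales", "orders", "/data/orders")
    print(shlex.join(cmd))
```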
Configuration, Synchronization and Coordination across the Cluster
ZooKeeper → This component is used to manage cluster configuration in the Hadoop environment. It also provides synchronization between the machines in the cluster.
Scheduling and Workflow
Oozie → This component is used to create and schedule jobs in a cluster. It is a Java-based web application that lets us define workflows.
Data Processing Tools
          Data Analytics
Pig – A procedural data-flow language (Pig Latin) developed at Yahoo, used to analyze large data sets. It can also perform ETL operations.
Hive – A SQL-like language (HiveQL) developed at Facebook, used mainly as a data warehousing solution.
R – Also used for data analytics.
HBase – A non-relational database that runs on top of HDFS. It is used for data lookups, inserts and updates. Facebook and eBay have used it extensively.
Data Science
Mahout – This tool is used for machine learning.
Cluster operations
YARN – Yet Another Resource Negotiator.
File system
HDFS – Hadoop Distributed File System
 
