The Hadoop ecosystem is made up of various components and tools. The tools used in the Hadoop framework are quite different
from the ones we use in traditional data analytics.
I will briefly describe each component of
the Hadoop ecosystem. Before moving on to the ecosystem itself, I want to give a small insight into the difference between Hadoop 1 and Hadoop 2, because it was Hadoop 2's architecture that allowed a variety of tools to be added to the Hadoop ecosystem.
There is one main difference between Hadoop 1.0 and Hadoop 2.0.
· In Hadoop 1.0,
o MapReduce is the only data processing framework that is supported
o Cluster operations and data operations are both performed by MapReduce
· In Hadoop 2.0,
o Many data processing frameworks are supported
o Cluster operations are handled by YARN
o Data operations are handled by data processing frameworks such as MapReduce
If you compare Hadoop 1.0 and Hadoop 2.0 closely, you will notice that in Hadoop 2.0 the data
processing frameworks are separated from cluster operations.
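To make the MapReduce processing model mentioned above a little more concrete, here is a minimal sketch of a word count written in plain Python. It does not use Hadoop or any Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and everything runs in a single process rather than across a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["Hadoop stores data", "Hadoop processes data"]
    print(word_count(sample))
```

In a real cluster the map and reduce steps would run in parallel on many machines, and in Hadoop 2.0 it is YARN, not the processing framework, that decides where they run.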
Hadoop Ecosystem
Data Loading:
These tools are used to load big data into a Hadoop cluster.
· Sqoop → Loads structured data, such as tables from relational databases.
· Flume → Loads logs and other streaming data, for example feeds from Twitter, Facebook, or LinkedIn, or XML data.
Configuration, Synchronization, and Coordination within a Cluster
ZooKeeper → This component is used to manage the cluster configuration
in the Hadoop environment. It also provides synchronization between the different
machines in the cluster.
Scheduling and Workflow
Oozie → This component is used to create and schedule jobs in a cluster.
It is a Java-based web application that lets us define workflows.
Data Processing Tools
Data Analytics
Pig – A procedural language developed at Yahoo, used to analyze
large data sets. It can also perform ETL operations.
Hive – A SQL-like language developed at Facebook, used mainly as a
data warehousing solution.
R – Also used for data analytics.
HBase – A non-relational database that runs on top of HDFS. It is
used for data lookups, inserts, and updates. Facebook and eBay
have used it extensively.
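As a rough illustration of the lookup, insert, and update pattern that HBase is used for, here is a toy in-memory table written in Python. This is not the HBase API; the class ToyTable, its methods, and the sample row key and column names are all hypothetical, and the only idea it borrows is that each row is addressed by a key and holds named columns.

```python
class ToyTable:
    """In-memory stand-in for a table addressed by row key (not the HBase API)."""

    def __init__(self):
        # Maps row key -> {column name -> value}.
        self._rows = {}

    def put(self, row_key, column, value):
        """Insert or update: writing to an existing cell overwrites it."""
        self._rows.setdefault(row_key, {})[column] = value

    def get(self, row_key, column=None):
        """Look up a whole row (as a copy) or a single cell; None if absent."""
        row = self._rows.get(row_key, {})
        return dict(row) if column is None else row.get(column)


if __name__ == "__main__":
    table = ToyTable()
    table.put("user42", "info:name", "Asha")       # insert
    table.put("user42", "info:city", "Chennai")    # insert
    table.put("user42", "info:city", "Bengaluru")  # update
    print(table.get("user42"))
```

Unlike a relational table, there is no fixed schema here: each row can carry a different set of columns, and every read or write starts from the row key.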
Data Science
Mahout – This tool is used for machine learning.
Cluster operations
YARN – Yet Another Resource Negotiator.
File system
HDFS – Hadoop Distributed File System