Sunday, May 17, 2015

Hadoop Architecture


Hadoop Architecture Explained

In this post I explain the Hadoop 1 and Hadoop 2 architectures:

·        Hadoop 1.0 architecture

·        Hadoop 2.0 architecture

a.     Federation architecture

b.     High availability architecture

c.      Federation architecture with High availability

Hadoop 1.0 architecture

There are two important components of Hadoop architecture. They are

·        HDFS – Hadoop Distributed File System

·        Processing framework – MapReduce

 
 
HDFS
·        Defines how data is stored and distributed across the different data nodes
·        The Name node (master node) holds the metadata
·        The Data nodes are where the actual data resides
·        Data is split into blocks that are stored on different Data nodes and replicated according to the replication factor
·        For a read or write request, the client process contacts the Name node for metadata, and the Data nodes in turn serve the actual data (see the Java sketch after this list)
·        The Name node's metadata is periodically checkpointed to the Secondary Name node
·        The Secondary Name node is not a failover/standby node
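As a rough illustration of the read/write path above, here is a minimal Java sketch using the standard Hadoop FileSystem API (assumptions: the Hadoop client libraries are on the classpath, fs.defaultFS in core-site.xml points at the cluster's Name node, and the file path is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);            // the client asks the Name node for metadata

        Path file = new Path("/user/demo/sample.txt");   // hypothetical path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hdfs");                  // blocks are written to Data nodes and replicated
        }
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());            // Name node returns block locations, Data nodes serve the bytes
        }
        fs.close();
    }
}

Note that the replication factor itself is a cluster/file setting (dfs.replication); the client code does not change when it changes.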
Job Tracker
·        This is the processing unit of the Hadoop system
·        Once a job request is received, the Job tracker schedules the job and monitors it
·        The Job tracker hands work to the Data nodes, where Task trackers run the actual MapReduce tasks (a minimal word-count sketch follows)
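To make the Job tracker / Task tracker flow concrete, here is a minimal sketch of the classic word-count MapReduce job (mapper and reducer only; the class names are just examples, and a small driver class would be needed to submit the job, at which point the Job tracker schedules the map and reduce tasks onto Task trackers):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountSketch {

    // Map phase: runs in Task trackers, emits (word, 1) for every word in a line
    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sums up the counts emitted for each word
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}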
 
Disadvantages of Hadoop 1.0 architecture
·        Could not scale beyond roughly a 4,000-node cluster
·        The Job tracker was overloaded because it handled both scheduling and monitoring of jobs
·        No high-availability mechanism (the Name node was a single point of failure)
 
<Yet to Update Hadoop2.0 architecture>
 
 
 
 
 
 
 
 

Saturday, May 16, 2015

Hadoop Ecosystem


 
There are various components/tools in the Hadoop ecosystem. The tools used in the Hadoop framework are quite different from what we use in traditional data analytics.

·       I will briefly cover each component of the Hadoop ecosystem. Before moving on, I want to give a small insight into the difference between Hadoop 1 and Hadoop 2, because it is Hadoop 2's architecture that allowed a variety of tools to be added to the ecosystem.

The main difference between Hadoop 1.0 and Hadoop 2.0 is the following.

·       In Hadoop 1.0,

o   MapReduce is the only processing framework that is supported

o   Both cluster operations and data operations were handled by this single layer

·       In Hadoop 2.0,

o   Many data processing frameworks can be plugged in

o   Cluster (resource) operations are handled by YARN

o   Data operations are handled by data processing frameworks such as MapReduce, etc.

If you compare Hadoop 1.0 and Hadoop 2.0 closely, you can see that the data processing frameworks are separated from the cluster operations.

 

 

 
 
 
 

 
 
 
Hadoop Eco system
Data Loading:
                             These tools are used to feed big data into the Hadoop cluster
·       Sqoop -> structured data such as relational database tables
·       Flume -> streaming data such as logs and feeds from Twitter, Facebook, LinkedIn, XML, etc.
Configuration, Synchronization, Co-ordination between Clusters
Zookeeper -> used to manage the cluster configuration in the Hadoop environment. It also synchronizes data between the different machines in the cluster
Scheduling and Workflow
Oozie -> used to create and schedule jobs in a cluster. A Java-based web interface which enables us to create workflows
Data Processing Tools
          Data Analytics
Pig – procedural (data-flow) language developed at Yahoo, used to analyze large data sets. It can perform ETL operations
Hive – SQL-like language developed at Facebook, used more for data-warehousing solutions.
R – used for statistical analysis
HBase – non-relational (NoSQL) database on top of HDFS, used for fast data lookups, inserts and updates. Used extensively by Facebook and eBay (a small client sketch follows this list)
Data Science
Mahout – this tool is used for machine learning.
Cluster operations
YARN – Yet Another Resource Negotiator
File system
HDFS – Hadoop Distributed File System
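As a small illustration of the kind of lookups, inserts and updates mentioned for HBase above, here is a rough Java sketch using the HBase 1.x client API (assumptions: the HBase client libraries and an hbase-site.xml are on the classpath; the table name "users" and the column names are made up for the example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml for the ZooKeeper quorum
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {   // hypothetical table

            // Insert or update a row keyed by "user1"
            Put put = new Put(Bytes.toBytes("user1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("sunil"));
            table.put(put);

            // Fast lookup by row key
            Result result = table.get(new Get(Bytes.toBytes("user1")));
            String name = Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name")));
            System.out.println(name);
        }
    }
}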
 

Monday, May 11, 2015

Datastage Introduction - Server Vs Parallel


 

Datastage is an ETL tool

·        Extract, transform and load

·        Earlier the product was simply DataStage -> then Ascential DataStage -> then IBM InfoSphere DataStage in 2008

 

IBM Infosphere has several tools

·        Datastage

·        Quality stage

·        Information Analyzer

·        MDM – Master data management etc

 

Difference between DataStage 7.5 (Ascential) and DataStage 8.0 (IBM InfoSphere)

 

Ascential (7.5) | IBM InfoSphere (8.0)
File-based repository (table definitions, etc.) | Database-based repository (xMETA)
2-tier (Unix server + DataStage) | 3-tier (Unix server + xMETA + DataStage)
Separate Director, Manager, Designer and Administrator clients | Manager is integrated into Designer, so the clients are Director, Designer and Administrator
Unix login is sufficient | DataStage needs a separate user group and access rights
- | Parameter sets were introduced
Surrogate key generator restarts on every run (e.g. 1-100, then 1-100 again) | Enhanced surrogate key generator continues the sequence (1-100, then 101-200)
- | New stages were introduced, e.g. connector stages and an improved Transformer stage

 

 

Director Client

·        Validates, runs, monitors and schedules jobs. We can do the same things from the Designer client, however the Director lets us look at multiple running jobs at a time

 

Administrator Client

·        Creating and managing projects and user access

·        Setting up project-specific parameters

 

Designer client

·        Designing the job

 

Types of jobs

·        Server Jobs

·        Parallel Jobs

·        Sequence Jobs

 

Server Jobs | Parallel Jobs
Uses the BASIC compiler | Uses a C++ compiler; in the background parallel jobs are converted to OSH (Orchestrate Shell), which requires the C++ compiler
Runs on a single node | Runs on multiple nodes
Executes on the DS Server engine | Executes on the DS Parallel engine
Handles smaller volumes of data | Handles huge volumes of data
Processing speed is slow | Processing speed is fast

 

Datastage - Node and APT configuration file



Node – a logical processing unit that represents resources. This helps with load balancing; an optimal number of nodes can be chosen

·        A node is a logical processing unit. Each node in a configuration file is distinguished by a virtual name and defines the number and speed of CPUs, memory availability, page and swap space, network connectivity details, etc.

·        Node information is stored in the APT configuration file.

 

Server job -> single node (ex: a single-lane highway)

Parallel job -> based on the number of nodes, the data is distributed across the nodes (ex: a multi-lane highway). This is called parallelism

 

APT Configuration file

                                It defines the degree of parallelism.

 

4 things to note in the file below (fastname, pools, resource disk, resource scratchdisk):

main_program: APT configuration file: /opt/IBM/InformationServer/Server/Configurations/default.apt

{

                node "node1"

                {

                                fastname "xxxx"  à Physical node name

                                pools "" à In some cases this will be represent for specific functionality – For ex: sort

                                resource disk "/opt/IBM/InformationServer/Server/Datasets" {pools ""} -à Physical storage . All the datasets will be created here

                                resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch" {pools ""} à Temporary location for processing

                }

                node "node2"

                {

                                fastname "xxxx"

                                pools ""

                                resource disk "/opt/IBM/InformationServer/Server/Datasets" {pools ""}

                                resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch" {pools ""}

                }

}

 

Example:

                node "node2"

                {

                                fastname "xxxx"

                                pools "" “sort” à This indicates this node will be exclusively used for sort operation

                                resource disk "/opt/IBM/InformationServer/Server/Datasets" {pools ""}

                                resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch" {pools ""}

                }

 

How does DataStage decide on which processing nodes a stage should run?

1. If a job or stage is not constrained to run on specific nodes, the parallel engine executes the parallel stage on all nodes defined in the default node pool (default behavior).

2. If the stage is constrained, the constrained processing nodes are used when executing the parallel stage.

 
 

Data Quality - Investigate stage (QualityStage)


 

Investigate stage

                                This stage investigates data quality. We can get an idea of what the data looks like after this stage.

 

 

For Example:

 

If you look at the data below and investigate it, we can make the following observations:

 

o   The name is repeated in two instances

o   One phone number has only 3 digits

o   Email ID lengths vary, etc.

 

Basically this will give an idea about the data

 

Name,address,email_address,phone,zip_code

sunil,1165 office park road,sunil@gmail.com,5555555555,50265

kumar,100 pleasant avenue,kumar@gmail.com, 5555555555,50266

deepa,3000 Lake road,deepa@gmail.com, 5555555555,97583

kavin,1567 Mansion place,kavin@gmail.com, 5555555555,61853

sunil,2458 valleywest,chandru@gmail.com, 5555555555,64321

pavi,161  office park road,pavi@gmail.com,223,54210

office,161  office park road,sunilkumar.gunasekaran@mercer.com, 5555555555,54210

 

 

Quick steps

·        Define and read an input file

·        Select the columns for investigation

·        Choose the rule set based on the columns

·        Choose Token report/Pattern report and see the quality of data

 

About the stage

Types of investigation

o   Character discrete investigation – this is the default option. Suppose you would like to investigate how many people's zip codes start with a particular two-character prefix. In this case you choose this option and set a mask like CCXXX, which means the first two characters of the zip code are used and the remaining digits are ignored.

o   Character concatenate investigation – normally the best option when two columns should be investigated together as one concatenated value, e.g. address line 1 and address line 2

o   Word investigation

 

Output

o   Token report

o   Pattern report

About Masks:

Input text = IBM Quality Stage 8

 

If you apply mask C: IBM Quality Stage 8 -> output as it is

If you apply mask T: aaabaaaaaaabaaaaabn -> (a is alpha, b is space, n is number)

If you apply mask C & mask X like cccXcccccccXcccccXc, then the output will be IBMQualityStage8

If you apply mask T & mask X like TTTXTTTTTTTXTTTTTXT, then the output will be aaaaaaaaaaaaaaan

Mask X simply skips the character at that position and gives you the result for the remaining characters of the input.
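To make the mask behaviour concrete, the following is a rough, illustrative Java sketch (plain Java, not QualityStage code) of how the C/T/X masks described above behave when applied to a single value:

public class MaskSketch {

    // Applies a simplified C/T/X column mask, mimicking the behaviour described above:
    // C keeps the character, T keeps only its type (a/n/b), X skips it.
    static String applyMask(String value, String mask) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < value.length() && i < mask.length(); i++) {
            char v = value.charAt(i);
            switch (Character.toUpperCase(mask.charAt(i))) {
                case 'C':
                    out.append(v);                      // keep the actual character
                    break;
                case 'T':                               // keep only the character type
                    if (Character.isDigit(v)) out.append('n');
                    else if (Character.isLetter(v)) out.append('a');
                    else if (v == ' ') out.append('b');
                    else out.append(v);                 // other characters pass through in this sketch
                    break;
                case 'X':
                    break;                              // skip the character entirely
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "IBM Quality Stage 8";
        System.out.println(applyMask(input, "cccXcccccccXcccccXc"));   // IBMQualityStage8
        System.out.println(applyMask(input, "TTTXTTTTTTTXTTTTTXT"));   // aaaaaaaaaaaaaaan
    }
}

Running it reproduces the same patterns shown in the worked examples above.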

C. Displays the actual character and includes it in the frequency count and pattern analysis.

You use the C column mask when you want to inspect the actual values in your columns to make sure there is no false data in a column. For example, 99999 for a postal code or 111111111 for a national identification.

T. Displays the type of character in the frequency count and pattern analysis (a for alphabetic, n for numeric, b for blank).

You use the T column mask when you want to inspect the type of data in a character position such as with telephone numbers as nnn-nnn-nnnn or (nnn)-nnn-nnnn.

X. Skips the character and does not include it in the frequency count or the pattern analysis. It does include it in the sample data.

You use the X column mask when you only want to include the data from the column in the sample but not as a token or part of the token for investigation. For example, you want to investigate the first two characters of a postal code to determine the frequency distribution based on state. You would set the column mask for the postal code to CCXXX. The pattern column of the pattern report displays only the first two characters. The frequency count would be based on the number of records in the file that start with the first two characters of the postal code. In the value column, you would see all five characters of the postal code in the sample.

By default, all characters at each position in the column are set to T (type). For every position in the column, adjust the mask as necessary.