Saturday, May 16, 2015

Hadoop Ecosystem


 
The Hadoop ecosystem contains a variety of components and tools. The tools used in the Hadoop framework are quite different from those we use in traditional data analytics.

·       I will briefly describe each component of the Hadoop ecosystem. Before moving on, I want to give a short insight into the difference between Hadoop 1 and Hadoop 2, because Hadoop 2's architecture is what allowed a variety of tools to be added to the ecosystem.

There is one main difference between Hadoop 1.0 and Hadoop 2.0.

·       In Hadoop 1.0,

o   MapReduce is the only supported framework

o   Both cluster operations and data operations are performed by MapReduce

·       In Hadoop 2.0,

o   Many data processing frameworks were added

o   Cluster operations are handled by YARN

o   Data operations are handled by data processing frameworks such as MapReduce

Comparing Hadoop 1.0 and Hadoop 2.0 closely, you can see that in 2.0 the data processing frameworks are separated from cluster operations.

Hadoop Ecosystem
Data Loading:
These tools are used to feed big data into the Hadoop cluster
·       Sqoop -> Loads structured data such as relational database tables.
·       Flume -> Loads streaming and semi-structured data such as logs, feeds from Twitter, Facebook and LinkedIn, XML etc.
Configuration, Synchronization and Coordination within the Cluster
·       Zookeeper -> This component is used to manage the cluster configuration in the Hadoop environment. It also keeps the different machines in the cluster synchronized.
Scheduling and Workflow
·       Oozie -> This component is used to create and schedule jobs in a cluster. It is a Java-based web application that enables us to create workflows.
Data Processing Tools
Data Analytics
·       Pig – Procedural language developed by Yahoo, used to analyze large data sets. It can perform ETL operations.
·       Hive – SQL-like language developed by Facebook, used as a data warehousing solution.
·       R is also used for data analytics.
·       HBase – Non-relational database that runs on top of HDFS. It is used for data lookups, inserts and updates. Facebook and eBay have used it extensively.
Data Science
·       Mahout – This tool is used for machine learning.
Cluster operations
·       YARN – Yet Another Resource Negotiator.
File system
·       HDFS – Hadoop Distributed File System
 

Monday, May 11, 2015

DataStage Introduction - Server vs Parallel


 

DataStage is an ETL tool

·        ETL stands for extract, transform and load

·        Earlier the product was owned by other vendors -> then by Ascential (Ascential DataStage) -> then by IBM, which acquired Ascential in 2005 and rebranded the product as IBM InfoSphere DataStage in 2008

 

IBM InfoSphere has several tools

·        DataStage

·        QualityStage

·        Information Analyzer

·        MDM – Master Data Management, etc.

 

Difference between version 7.5 (Ascential) and DataStage 8.0 (IBM InfoSphere)

| Ascential (7.5) | IBM InfoSphere (8.0) |
|---|---|
| File-based repository (table definitions etc.) | Database-based repository |
| 2 tier (Unix server + DataStage) | 3 tier (Unix server + xMETA + DataStage) |
| Clients: Director, Manager, Designer and Administrator | Manager is integrated into Designer; clients are Designer, Director and Administrator |
| Unix login is sufficient | DataStage needs a separate user group and access rights |
| | Parameter sets were introduced |
| Surrogate key generator restarts on every run, e.g. 1-100 on the first run and again 1-100 on the next run | Enhanced surrogate key generator continues from the last value, e.g. 1-100 on the first run and 101-200 on the next run |
| | New stages were introduced, such as connector stages and an improved Transformer stage |
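
The surrogate key difference in the comparison above can be illustrated with a small Python sketch. It is not DataStage code: the class name `KeyGenerator`, the `persist` flag and the in-memory `state` dictionary (standing in for a stored key state) are hypothetical names chosen for this example.

```python
class KeyGenerator:
    """Hands out consecutive surrogate keys, optionally remembering the last one."""

    def __init__(self, persist):
        self.persist = persist
        # Hypothetical stand-in for a key state that survives between job runs.
        self.state = {"last_key": 0}

    def run(self, row_count):
        """Simulate one job run that needs `row_count` new keys."""
        start = self.state["last_key"] + 1 if self.persist else 1
        keys = list(range(start, start + row_count))
        if self.persist and keys:
            self.state["last_key"] = keys[-1]
        return keys


old_style = KeyGenerator(persist=False)   # behaves like the 7.5 column
new_style = KeyGenerator(persist=True)    # behaves like the 8.0 column

first_old, second_old = old_style.run(100), old_style.run(100)
first_new, second_new = new_style.run(100), new_style.run(100)

print(first_old[0], first_old[-1], second_old[0], second_old[-1])  # 1 100 1 100
print(first_new[0], first_new[-1], second_new[0], second_new[-1])  # 1 100 101 200
```

Without a remembered state the second run produces the same keys again, which is why the restarted range cannot be used to add new rows to an existing table.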

 

 

Director Client

·        Validates, runs, monitors and schedules jobs. Jobs can also be run from the Designer client, but the Director client lets us look at multiple running jobs at a time

 

Administrator Client

·        Creating and managing users and projects

·        Setting up project-specific parameters

 

Designer client

·        Designing the job

 

Types of jobs

·        Server Jobs

·        Parallel Jobs

·        Sequence Jobs

 

| Server Jobs | Parallel Jobs |
|---|---|
| Use the BASIC compiler | Use a C++ compiler. In the background, parallel jobs are converted to OSH (Orchestrate Shell), which requires a C++ compiler |
| Use a single node | Use multiple nodes |
| Execute on the DS Server Engine | Execute on the DS Parallel Engine |
| Handle less data | Handle huge data volumes |
| Processing speed is slow | Processing speed is fast |

 

Datastage - Node and APT configuration file



Node – A logical processing unit that represents resources. Nodes help with load balancing, and the optimal number of nodes can be chosen for a job.

·        Each node in a configuration file is distinguished by a virtual name and defines the number and speed of CPUs, memory availability, page and swap space, network connectivity details, etc.

·        Node information is stored in the APT configuration file.

 

Server Job -> Runs on a single node (ex: single-lane highway)

Parallel Job -> The data is distributed across the nodes, based on the number of nodes (ex: multi-lane highway). This is called parallelism
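
The highway analogy can be sketched in a few lines of Python. This is only an illustration: round-robin is assumed here as the way rows are spread across nodes (the text above does not name a partitioning method), and `round_robin_partition` is a hypothetical helper, not a DataStage function.

```python
def round_robin_partition(rows, node_names):
    """Deal rows out to the nodes one at a time, like cars choosing lanes in turn."""
    partitions = {name: [] for name in node_names}
    for index, row in enumerate(rows):
        target = node_names[index % len(node_names)]
        partitions[target].append(row)
    return partitions


rows = list(range(1, 11))

# Server job: one lane, every row goes through the same node.
single = round_robin_partition(rows, ["node1"])

# Parallel job: two lanes, each node handles half of the rows.
multi = round_robin_partition(rows, ["node1", "node2"])

print(single)  # {'node1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}
print(multi)   # {'node1': [1, 3, 5, 7, 9], 'node2': [2, 4, 6, 8, 10]}
```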

 

APT Configuration file

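It defines the degree of parallelism: the number of nodes in the file is the number of partitions a parallel job can use.

 

There are 4 things to note in each node entry: fastname, pools, resource disk and resource scratchdisk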

main_program: APT configuration file: /opt/IBM/InformationServer/Server/Configurations/default.apt

{
    node "node1"
    {
        fastname "xxxx"  -> Physical node name
        pools ""  -> In some cases this is used to reserve the node for a specific functionality, for example sort
        resource disk "/opt/IBM/InformationServer/Server/Datasets" {pools ""}  -> Physical storage. All the datasets will be created here
        resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch" {pools ""}  -> Temporary location for processing
    }
    node "node2"
    {
        fastname "xxxx"
        pools ""
        resource disk "/opt/IBM/InformationServer/Server/Datasets" {pools ""}
        resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch" {pools ""}
    }
}
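
To see what the file tells the engine, here is a rough Python sketch that reads the node entries of a configuration like the one above. It is a simplified reader, not the parallel engine's own parser: it assumes one statement per line and no annotations, and `parse_nodes` and `CONFIG_TEXT` are names made up for this example.

```python
import re

# The two-node configuration from above, without the explanatory annotations.
CONFIG_TEXT = '''
{
    node "node1"
    {
        fastname "xxxx"
        pools ""
        resource disk "/opt/IBM/InformationServer/Server/Datasets" {pools ""}
        resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch" {pools ""}
    }
    node "node2"
    {
        fastname "xxxx"
        pools ""
        resource disk "/opt/IBM/InformationServer/Server/Datasets" {pools ""}
        resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch" {pools ""}
    }
}
'''

QUOTED = re.compile(r'"([^"]*)"')


def parse_nodes(text):
    """Collect name, fastname, pools and resources for every node entry."""
    nodes = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        quoted = QUOTED.findall(line)
        if line.startswith("node "):
            nodes.append({"name": quoted[0], "pools": [], "disk": [], "scratchdisk": []})
        elif line.startswith("fastname "):
            nodes[-1]["fastname"] = quoted[0]
        elif line.startswith("pools "):
            nodes[-1]["pools"] = quoted
        elif line.startswith("resource disk "):
            nodes[-1]["disk"].append(quoted[0])
        elif line.startswith("resource scratchdisk "):
            nodes[-1]["scratchdisk"].append(quoted[0])
    return nodes


nodes = parse_nodes(CONFIG_TEXT)
print(len(nodes))                    # 2 -> degree of parallelism
print([n["name"] for n in nodes])    # ['node1', 'node2']
print(nodes[0]["scratchdisk"])       # ['/opt/IBM/InformationServer/Server/Scratch']
```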

 

Example:

    node "node2"
    {
        fastname "xxxx"
        pools "" "sort"  -> This indicates that the node belongs to the "sort" pool in addition to the default pool, so a stage constrained to the "sort" pool will run on it
        resource disk "/opt/IBM/InformationServer/Server/Datasets" {pools ""}
        resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch" {pools ""}
    }

 

How does DataStage decide on which processing nodes a stage should run?

1. If a job or stage is not constrained to run on specific nodes, the parallel engine executes a parallel stage on all nodes defined in the default node pool. (Default behavior)

2. If the job or stage is constrained to a node pool, only the nodes in that pool are chosen when the parallel stage is executed.
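
The two rules can be sketched in Python as follows. This is a simplified model rather than the engine's real logic: nodes are plain dictionaries, the default pool is represented by the empty string as in the configuration file, and `nodes_for_stage` is a hypothetical helper.

```python
def nodes_for_stage(nodes, constraint=None):
    """Return the node names a stage would run on.

    With no constraint, every node in the default pool ("") is used.
    With a constraint, only nodes that list that pool are used.
    """
    pool = "" if constraint is None else constraint
    return [node["name"] for node in nodes if pool in node["pools"]]


nodes = [
    {"name": "node1", "pools": [""]},
    {"name": "node2", "pools": ["", "sort"]},
]

print(nodes_for_stage(nodes))          # ['node1', 'node2']
print(nodes_for_stage(nodes, "sort"))  # ['node2']
```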

 
 

Data Quality - Investigate Stage (QualityStage)


 

Investigate stage

This stage investigates the quality of the data. After running it, we get an idea of what the data looks like.

 

 

For example:

 

If we investigate the data below, we can make the following observations:

 

o   The name sunil appears twice

o   The phone number has only 3 digits in one record

o   The length of the email ID varies, etc.

 

Basically, this gives us an idea about the data

 

Name,address,email_address,phone,zip_code

sunil,1165 office park road,sunil@gmail.com,5555555555,50265

kumar,100 pleasant avenue,kumar@gmail.com, 5555555555,50266

deepa,3000 Lake road,deepa@gmail.com, 5555555555,97583

kavin,1567 Mansion place,kavin@gmail.com, 5555555555,61853

sunil,2458 valleywest,chandru@gmail.com, 5555555555,64321

pavi,161  office park road,pavi@gmail.com,223,54210

office,161  office park road,sunilkumar.gunasekaran@mercer.com, 5555555555,54210
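
The same kind of observations can be reproduced with a short Python sketch. This is not how the Investigate stage works internally; it simply profiles the first six data rows of the sample with the standard library, and the variable names are made up for this example.

```python
import csv
import io
from collections import Counter

# First six data rows of the sample above.
SAMPLE = """Name,address,email_address,phone,zip_code
sunil,1165 office park road,sunil@gmail.com,5555555555,50265
kumar,100 pleasant avenue,kumar@gmail.com, 5555555555,50266
deepa,3000 Lake road,deepa@gmail.com, 5555555555,97583
kavin,1567 Mansion place,kavin@gmail.com, 5555555555,61853
sunil,2458 valleywest,chandru@gmail.com, 5555555555,64321
pavi,161  office park road,pavi@gmail.com,223,54210
"""

# skipinitialspace drops the blank that follows some of the commas.
rows = list(csv.DictReader(io.StringIO(SAMPLE), skipinitialspace=True))

# Names that occur more than once.
name_counts = Counter(row["Name"] for row in rows)
repeated_names = [name for name, count in name_counts.items() if count > 1]

# How many phone numbers have each length.
phone_lengths = Counter(len(row["phone"]) for row in rows)

# Distinct email lengths found in the data.
email_lengths = sorted({len(row["email_address"]) for row in rows})

print(repeated_names)        # ['sunil']
print(dict(phone_lengths))   # {10: 5, 3: 1}
print(email_lengths)         # [14, 15, 17]
```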

 

 

Quick steps

·        Define and read an input file

·        Select the columns for investigation

·        Choose the rule set based on the columns

·        Choose Token report/Pattern report and see the quality of data

 

About the stage

Three types of investigation

o   Character discrete investigation – This is the default option. Suppose you would like to investigate how many people's zip codes start with the same two digits. In this case you can choose this option and apply a mask like CCXXX. This means the first two digits of the zip code will be used and the remaining digits will be ignored.

o   Character concatenate investigation – Normally this is the best option when two columns are investigated together, like address line 1 and address line 2

o   Word investigation

 

Output

o   Token report

o   Pattern report

About Masks:

Input text = IBM Quality Stage 8

 

If you apply mask C to every position: IBM Quality Stage 8 -> the output is the text as it is

If you apply mask T to every position: aaabaaaaaaabaaaaabn -> a is a letter, b is a space, n is a number

If you apply mask C and mask X like CCCXCCCCCCCXCCCCCXC -> the output will be IBMQualityStage8

If you apply mask T and mask X like TTTXTTTTTTTXTTTTTXT -> the output will be aaaaaaaaaaaaaaan

 

Mask X simply skips the character at that position and gives you the result for the remaining characters of the input.

C. Displays the actual character and includes it in the frequency count and pattern analysis.

You use the C column mask when you want to inspect the actual values in your columns to make sure there is no false data in a column. For example, 99999 for a postal code or 111111111 for a national identification.

T. Displays the type of character in the frequency count and pattern analysis. As in the example above, the character types include a for a letter, n for a number and b for a space.

You use the T column mask when you want to inspect the type of data in a character position such as with telephone numbers as nnn-nnn-nnnn or (nnn)-nnn-nnnn.

X. Skips the character and does not include it in the frequency count or the pattern analysis. It does include it in the sample data.

You use the X column mask when you only want to include the data from the column in the sample but not as a token or part of the token for investigation. For example, you want to investigate the first two characters of a postal code to determine the frequency distribution based on state. You would set the column mask for the postal code to CCXXX. The pattern column of the pattern report displays only the first two characters. The frequency count would be based on the number of records in the file that start with the first two characters of the postal code. In the value column, you would see all five characters of the postal code in the sample.

By default, all characters at each position in the column are set to T (type). For every position in the column, adjust the mask as necessary.
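
The behaviour of the three masks can be mimicked with a small Python sketch. It is only an approximation of the rules described above, not QualityStage code: `char_type` and `apply_mask` are hypothetical helpers, and showing any other character as itself is an assumption.

```python
from collections import Counter


def char_type(ch):
    """Map a character to its type: a = letter, n = number, b = space."""
    if ch.isalpha():
        return "a"
    if ch.isdigit():
        return "n"
    if ch == " ":
        return "b"
    return ch  # assumption: any other character is shown as itself


def apply_mask(value, mask):
    """Apply a column mask made of C (keep), T (type) and X (skip) to a value."""
    if len(value) != len(mask):
        raise ValueError("mask must have one letter per character")
    out = []
    for ch, m in zip(value, mask.upper()):
        if m == "C":
            out.append(ch)
        elif m == "T":
            out.append(char_type(ch))
        elif m != "X":
            raise ValueError(f"unknown mask letter: {m}")
    return "".join(out)


text = "IBM Quality Stage 8"
print(apply_mask(text, "C" * len(text)))         # IBM Quality Stage 8
print(apply_mask(text, "T" * len(text)))         # aaabaaaaaaabaaaaabn
print(apply_mask(text, "CCCXCCCCCCCXCCCCCXC"))   # IBMQualityStage8
print(apply_mask(text, "TTTXTTTTTTTXTTTTTXT"))   # aaaaaaaaaaaaaaan

# Frequency of the first two zip code digits, using the CCXXX mask.
zip_codes = ["50265", "50266", "97583", "61853", "64321", "54210", "54210"]
prefix_counts = Counter(apply_mask(z, "CCXXX") for z in zip_codes)
print(dict(prefix_counts))  # {'50': 2, '97': 1, '61': 1, '64': 1, '54': 2}
```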

Dimension - Slowly Changing Dimension Technique


This post explains the concept of the "Slowly Changing Dimension" (SCD), which is widely used in data warehousing environments.

 

Slowly changing Dimension

· Type 0 - The passive method -> Dimension data is left as it is (for ex: state codes)

 

 

· Type 1 - Overwriting the old value -> Dimension data is overwritten if a new record is received

Before change:

| Customer_ID | Customer_Name | Customer_Type |
|---|---|---|
| 1 | Cust_1 | Corporate |

After change:

| Customer_ID | Customer_Name | Customer_Type |
|---|---|---|
| 1 | Cust_1 | Retail |
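
A Type 1 update can be sketched in Python as a plain overwrite. This is a minimal illustration on a list of dictionaries; `apply_type1` is a hypothetical helper, not part of any ETL tool.

```python
def apply_type1(dimension, incoming):
    """Overwrite the matching row in place; no history is kept."""
    for row in dimension:
        if row["Customer_ID"] == incoming["Customer_ID"]:
            row.update(incoming)
            return dimension
    # No existing row for this customer: insert it as a new one.
    dimension.append(dict(incoming))
    return dimension


dimension = [
    {"Customer_ID": 1, "Customer_Name": "Cust_1", "Customer_Type": "Corporate"},
]
apply_type1(
    dimension,
    {"Customer_ID": 1, "Customer_Name": "Cust_1", "Customer_Type": "Retail"},
)
print(dimension)
# [{'Customer_ID': 1, 'Customer_Name': 'Cust_1', 'Customer_Type': 'Retail'}]
```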
 
 
 
 

 

 

· Type 2 - Creating a new additional record

Before change:

| Customer_ID | Customer_Name | Customer_Type | Start_Date | End_Date | Current_Flag |
|---|---|---|---|---|---|
| 1 | Cust_1 | Corporate | 22-07-2010 | 31-12-9999 | Y |

After change:

| Customer_ID | Customer_Name | Customer_Type | Start_Date | End_Date | Current_Flag |
|---|---|---|---|---|---|
| 1 | Cust_1 | Corporate | 22-07-2010 | 17-05-2012 | N |
| 2 | Cust_1 | Retail | 18-05-2012 | 31-12-9999 | Y |
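
The Type 2 change can be sketched in Python like this. It is an illustration under assumptions: the row is matched on Customer_Name because Customer_ID changes between versions in the example, and `apply_type2` and `HIGH_DATE` are names invented for this sketch.

```python
from datetime import date, timedelta

HIGH_DATE = date(9999, 12, 31)  # open-ended End_Date, as in the tables above


def apply_type2(dimension, customer_name, new_type, change_date):
    """Close the current row and add a new current row for the changed value."""
    current = next(
        row for row in dimension
        if row["Customer_Name"] == customer_name and row["Current_Flag"] == "Y"
    )
    current["End_Date"] = change_date - timedelta(days=1)
    current["Current_Flag"] = "N"
    dimension.append({
        "Customer_ID": max(row["Customer_ID"] for row in dimension) + 1,
        "Customer_Name": customer_name,
        "Customer_Type": new_type,
        "Start_Date": change_date,
        "End_Date": HIGH_DATE,
        "Current_Flag": "Y",
    })
    return dimension


dimension = [{
    "Customer_ID": 1, "Customer_Name": "Cust_1", "Customer_Type": "Corporate",
    "Start_Date": date(2010, 7, 22), "End_Date": HIGH_DATE, "Current_Flag": "Y",
}]
apply_type2(dimension, "Cust_1", "Retail", date(2012, 5, 18))

for row in dimension:
    print(row["Customer_ID"], row["Customer_Type"], row["End_Date"], row["Current_Flag"])
# 1 Corporate 2012-05-17 N
# 2 Retail 9999-12-31 Y
```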

 

· Type 3 - Adding a new column

Before change:

| Customer_ID | Customer_Name | Current_Type | Previous_Type |
|---|---|---|---|
| 1 | Cust_1 | Corporate | Corporate |

After change:

| Customer_ID | Customer_Name | Current_Type | Previous_Type |
|---|---|---|---|
| 1 | Cust_1 | Retail | Corporate |
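
In Python, the Type 3 change amounts to shifting the current value into the previous-value column before overwriting it. A minimal sketch, with `apply_type3` as a hypothetical helper:

```python
def apply_type3(row, new_type):
    """Keep only one level of history: the value just before the change."""
    row["Previous_Type"] = row["Current_Type"]
    row["Current_Type"] = new_type
    return row


row = {
    "Customer_ID": 1,
    "Customer_Name": "Cust_1",
    "Current_Type": "Corporate",
    "Previous_Type": "Corporate",
}
apply_type3(row, "Retail")
print(row["Current_Type"], row["Previous_Type"])  # Retail Corporate
```

A second change would overwrite Previous_Type again, so anything older than the last value is lost with this technique.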

 

· Type 4 - Using a historical table

Current table

| Customer_ID | Customer_Name | Customer_Type |
|---|---|---|
| 1 | Cust_1 | Corporate |

Historical table

| Customer_ID | Customer_Name | Customer_Type | Start_Date | End_Date |
|---|---|---|---|---|
| 1 | Cust_1 | Retail | 01-01-2010 | 21-07-2010 |
| 1 | Cust_1 | Other | 22-07-2010 | 17-05-2012 |
| 1 | Cust_1 | Corporate | 18-05-2012 | 31-12-9999 |
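
The two tables can be maintained together as in the following Python sketch. It assumes, as in the example above, that the historical table also holds the currently valid row; `apply_type4` and `HIGH_DATE` are hypothetical names.

```python
from datetime import date, timedelta

HIGH_DATE = date(9999, 12, 31)


def apply_type4(current, history, customer_id, customer_name, new_type, change_date):
    """Overwrite the current table and record every version in the history table."""
    # Close the open history row for this customer, if there is one.
    for row in history:
        if row["Customer_ID"] == customer_id and row["End_Date"] == HIGH_DATE:
            row["End_Date"] = change_date - timedelta(days=1)
    history.append({
        "Customer_ID": customer_id,
        "Customer_Name": customer_name,
        "Customer_Type": new_type,
        "Start_Date": change_date,
        "End_Date": HIGH_DATE,
    })
    # The current table keeps exactly one row per customer.
    current[customer_id] = {"Customer_Name": customer_name, "Customer_Type": new_type}


current, history = {}, []
apply_type4(current, history, 1, "Cust_1", "Retail", date(2010, 1, 1))
apply_type4(current, history, 1, "Cust_1", "Other", date(2010, 7, 22))
apply_type4(current, history, 1, "Cust_1", "Corporate", date(2012, 5, 18))

print(current[1]["Customer_Type"])  # Corporate
for row in history:
    print(row["Customer_Type"], row["Start_Date"], row["End_Date"])
# Retail 2010-01-01 2010-07-21
# Other 2010-07-22 2012-05-17
# Corporate 2012-05-18 9999-12-31
```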

 

· Type 6 - Combine the approaches of types 1, 2 and 3 (1+2+3=6)

| Customer_ID | Customer_Name | Current_Type | Historical_Type | Start_Date | End_Date | Current_Flag |
|---|---|---|---|---|---|---|
| 1 | Cust_1 | Corporate | Retail | 01-01-2010 | 21-07-2010 | N |
| 2 | Cust_1 | Corporate | Other | 22-07-2010 | 17-05-2012 | N |
| 3 | Cust_1 | Corporate | Corporate | 18-05-2012 | 31-12-9999 | Y |
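
How the three techniques combine can be sketched in Python. As before this is an illustration under assumptions: rows are matched on Customer_Name, a new Customer_ID is taken as the next row number, and `apply_type6` and `HIGH_DATE` are hypothetical names.

```python
from datetime import date, timedelta

HIGH_DATE = date(9999, 12, 31)


def apply_type6(dimension, customer_name, new_type, change_date):
    """Add a new row per change and keep Current_Type up to date on every row."""
    for row in dimension:
        if row["Customer_Name"] != customer_name:
            continue
        # Type 1 part: overwrite the current value on all existing versions.
        row["Current_Type"] = new_type
        # Type 2 part: close the row that was current until now.
        if row["Current_Flag"] == "Y":
            row["End_Date"] = change_date - timedelta(days=1)
            row["Current_Flag"] = "N"
    dimension.append({
        "Customer_ID": len(dimension) + 1,
        "Customer_Name": customer_name,
        "Current_Type": new_type,
        # Type 3 part: a separate column keeps the value valid for this row's period.
        "Historical_Type": new_type,
        "Start_Date": change_date,
        "End_Date": HIGH_DATE,
        "Current_Flag": "Y",
    })
    return dimension


dimension = []
apply_type6(dimension, "Cust_1", "Retail", date(2010, 1, 1))
apply_type6(dimension, "Cust_1", "Other", date(2010, 7, 22))
apply_type6(dimension, "Cust_1", "Corporate", date(2012, 5, 18))

for row in dimension:
    print(row["Customer_ID"], row["Current_Type"], row["Historical_Type"],
          row["End_Date"], row["Current_Flag"])
# 1 Corporate Retail 2010-07-21 N
# 2 Corporate Other 2012-05-17 N
# 3 Corporate Corporate 9999-12-31 Y
```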