It is all about data!!! Data Science, Data Analytics, Data Warehouse: Data Quality

Investigate stage

This stage investigates data quality. After running it, you have a good idea of what the data looks like.

For example, investigating the data below leads to the following observations:

o One name appears in two records

o One phone number has only 3 digits

o The length of the email ID varies, and so on

In short, the investigation gives you a quick profile of the data.

Name,address,email_address,phone,zip_code

sunil,1165 office park road,sunil@gmail.com,5555555555,50265

kumar,100 pleasant avenue,kumar@gmail.com, 5555555555,50266

deepa,3000 Lake road,deepa@gmail.com, 5555555555,97583

kavin,1567 Mansion place,kavin@gmail.com, 5555555555,61853

sunil,2458 valleywest,chandru@gmail.com, 5555555555,64321

pavi,161 office park road,pavi@gmail.com,223,54210

office,161 office park road,sunilkumar.gunasekaran@mercer.com, 5555555555,54210
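The same observations can be reproduced outside QualityStage with a few lines of Python. This is only a rough sketch, not what the Investigate stage does internally: it assumes that a valid phone number has 10 digits, and the variable names are made up for illustration.

```python
import csv
import io
from collections import Counter

# The sample records from this post, embedded so the sketch is self-contained.
SAMPLE = """\
Name,address,email_address,phone,zip_code
sunil,1165 office park road,sunil@gmail.com,5555555555,50265
kumar,100 pleasant avenue,kumar@gmail.com, 5555555555,50266
deepa,3000 Lake road,deepa@gmail.com, 5555555555,97583
kavin,1567 Mansion place,kavin@gmail.com, 5555555555,61853
sunil,2458 valleywest,chandru@gmail.com, 5555555555,64321
pavi,161 office park road,pavi@gmail.com,223,54210
office,161 office park road,sunilkumar.gunasekaran@mercer.com, 5555555555,54210
"""

# skipinitialspace drops the stray blank that follows some commas in the sample.
rows = list(csv.DictReader(io.StringIO(SAMPLE), skipinitialspace=True))

# Names that occur in more than one record.
name_counts = Counter(row["Name"] for row in rows)
repeated_names = [name for name, count in name_counts.items() if count > 1]

# Assumption: a valid phone number has exactly 10 digits.
short_phones = [row["phone"] for row in rows if len(row["phone"]) != 10]

# Distinct lengths of the email column, to show that the length varies.
email_lengths = sorted({len(row["email_address"]) for row in rows})

print("Repeated names:", repeated_names)
print("Phones that are not 10 digits:", short_phones)
print("Distinct email lengths:", email_lengths)
```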

Quick steps

· Define and read an input file

· Select the columns for investigation

· Choose the rule set based on the columns

· Choose the token report or the pattern report and review the quality of the data

About the stage

Types of investigation

o Character discrete investigation: This is the default option. Suppose you want to find out how many people's zip codes start with the same two digits. In that case, choose this option and set a mask such as CCXXX. This means the first two characters of the zip code are used and the remaining three are ignored.

o Character concatenate investigation: Usually the best option when two columns are investigated together as one value, such as address line 1 and address line 2.

o Word investigation

Output

o Token report

o Pattern report

About Masks:

Input text = IBM Quality Stage 8

If you apply mask C to every position, the output is IBM Quality Stage 8, unchanged.

If you apply mask T to every position, the output is aaabaaaaaaabaaaaabn, where a is a letter, b is a space, and n is a number.

If you combine mask C and mask X, as in CCCXCCCCCCCXCCCCCXC, the output is IBMQualityStage8.

If you combine mask T and mask X, as in TTTXTTTTTTTXTTTTTXT, the output is aaaaaaaaaaaaaaan.

Mask X simply skips the character at its position and gives you the result for the remaining characters of the input.
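The mask behaviour described above can be imitated in Python to check the four examples. This is a minimal sketch, not QualityStage code: `apply_mask` and `char_type` are hypothetical helper names, the mask is assumed to have exactly one symbol per input character, and characters other than letters, digits and spaces are assumed to pass through unchanged under mask T.

```python
def char_type(ch: str) -> str:
    """Return the type symbol for one character: a, n or b."""
    if ch.isalpha():
        return "a"  # letter
    if ch.isdigit():
        return "n"  # number
    if ch == " ":
        return "b"  # space
    return ch  # assumption: any other character is shown as it is


def apply_mask(text: str, mask: str) -> str:
    """Apply a per-position mask made of C, T and X to text."""
    if len(mask) != len(text):
        raise ValueError("the mask needs one symbol per input character")
    out = []
    for ch, symbol in zip(text, mask.upper()):
        if symbol == "C":
            out.append(ch)  # keep the actual character
        elif symbol == "T":
            out.append(char_type(ch))  # keep only the character type
        elif symbol == "X":
            continue  # skip this position
        else:
            raise ValueError(f"unknown mask symbol: {symbol!r}")
    return "".join(out)


text = "IBM Quality Stage 8"
print(apply_mask(text, "C" * len(text)))
print(apply_mask(text, "T" * len(text)))
print(apply_mask(text, "cccXcccccccXcccccXc"))
print(apply_mask(text, "TTTXTTTTTTTXTTTTTXT"))
```

The four print calls correspond, in order, to the four mask examples for the input IBM Quality Stage 8.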

C. Displays the actual character and includes it in the frequency count and pattern analysis.

You use the C column mask when you want to inspect the actual values in your columns to make sure there is no false data in a column, such as 99999 for a postal code or 111111111 for a national identification number.

T. Displays the type of character in the frequency count and pattern analysis, for example a for a letter, n for a number, and b for a space.

You use the T column mask when you want to inspect the type of data in a character position such as with telephone numbers as nnn-nnn-nnnn or (nnn)-nnn-nnnn.
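To see why the T mask is handy for telephone numbers, here is a small Python sketch that turns each value into its type pattern and counts how often each pattern occurs. The phone values are invented for illustration, `type_pattern` is a hypothetical helper, and passing punctuation through unchanged is an assumption based on the nnn-nnn-nnnn notation.

```python
from collections import Counter


def type_pattern(value: str) -> str:
    """Replace every character with its type: a (letter), n (number), b (space)."""
    out = []
    for ch in value:
        if ch.isalpha():
            out.append("a")
        elif ch.isdigit():
            out.append("n")
        elif ch == " ":
            out.append("b")
        else:
            out.append(ch)  # assumption: punctuation is kept as it is
    return "".join(out)


# Illustrative values only, not taken from a real file.
phones = ["515-555-0100", "(515)-555-0101", "515-555-0102", "5155550103", "223"]

# Frequency of each pattern, most common first.
pattern_counts = Counter(type_pattern(phone) for phone in phones)
for pattern, count in pattern_counts.most_common():
    print(f"{pattern:<16}{count}")
```

A pattern that shows up only once or twice, such as the bare nnn here, is usually the first place to look for bad data.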

X. Skips the character and does not include it in the frequency count or the pattern analysis. It does include it in the sample data.

You use the X column mask when you only want to include the data from the column in the sample but not as a token or part of the token for investigation. For example, you want to investigate the first two characters of a postal code to determine the frequency distribution based on state. You would set the column mask for the postal code to CCXXX. The pattern column of the pattern report displays only the first two characters. The frequency count would be based on the number of records in the file that start with the first two characters of the postal code. In the value column, you would see all five characters of the postal code in the sample.
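The CCXXX example can be sketched in Python as well. This is not the real pattern report, only an approximation of its three columns: the pattern (the characters kept by C), the frequency count, and one full sample value. `pattern_report` is a hypothetical helper that handles only C and X, and the zip codes are the ones from the sample data earlier in this post.

```python
def pattern_report(values, mask):
    """Group values by the characters kept under a C/X mask.

    Returns a dict: pattern -> {"count": frequency, "sample": first full value}.
    """
    report = {}
    for value in values:
        # Keep only the positions marked C; positions marked X are skipped.
        pattern = "".join(ch for ch, symbol in zip(value, mask) if symbol == "C")
        entry = report.setdefault(pattern, {"count": 0, "sample": value})
        entry["count"] += 1
    return report


zip_codes = ["50265", "50266", "97583", "61853", "64321", "54210", "54210"]
report = pattern_report(zip_codes, "CCXXX")

for pattern, entry in report.items():
    print(pattern, entry["count"], entry["sample"])
```

Note that the sample keeps all five characters even though only the first two are used for the pattern and the count.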

By default, every character position in the column is set to T (type). Adjust the mask for each position as necessary.

Monday, May 11, 2015

Data Quality - Investigate stage (QualityStage)
