Investigate stage
This stage investigates the quality of the data. After running it, we get an idea of
what the data looks like.
For example:
If you look at the data below and investigate it, you can make the following observations:
o Name is repeated in two instances
o Phone number has only 3 digits in one occurrence
o Email ID length varies, etc.
Basically, this gives an idea about the data.
Name,address,email_address,phone,zip_code
sunil,1165 office park road,sunil@gmail.com,5555555555,50265
kumar,100 pleasant avenue,kumar@gmail.com, 5555555555,50266
deepa,3000 Lake road,deepa@gmail.com, 5555555555,97583
kavin,1567 Mansion place,kavin@gmail.com, 5555555555,61853
sunil,2458 valleywest,chandru@gmail.com, 5555555555,64321
pavi,161 office park road,pavi@gmail.com,223,54210
office,161 office park road,sunilkumar.gunasekaran@mercer.com, 5555555555,54210
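The observations above can also be reproduced outside the tool with a few lines of Python. This is only an illustrative sketch and not how the Investigate stage works internally: the `investigate` helper, its checks, and the assumption that a valid phone number has 10 digits are all made up for this example. The sample rows are the ones listed above.

```python
import csv
import io
from collections import Counter

# Sample rows from the example above (spacing kept as in the source).
SAMPLE = """Name,address,email_address,phone,zip_code
sunil,1165 office park road,sunil@gmail.com,5555555555,50265
kumar,100 pleasant avenue,kumar@gmail.com, 5555555555,50266
deepa,3000 Lake road,deepa@gmail.com, 5555555555,97583
kavin,1567 Mansion place,kavin@gmail.com, 5555555555,61853
sunil,2458 valleywest,chandru@gmail.com, 5555555555,64321
pavi,161 office park road,pavi@gmail.com,223,54210
office,161 office park road,sunilkumar.gunasekaran@mercer.com, 5555555555,54210
"""


def investigate(text):
    """Hypothetical helper: return a few simple data-quality observations."""
    rows = list(csv.DictReader(io.StringIO(text)))
    name_counts = Counter(row["Name"].strip() for row in rows)
    return {
        # Names that occur in more than one record.
        "repeated_names": [n for n, c in name_counts.items() if c > 1],
        # Assumption: a valid phone number has exactly 10 characters.
        "short_phones": [
            r["phone"].strip() for r in rows if len(r["phone"].strip()) != 10
        ],
        # Distinct email lengths, to show how much the length varies.
        "email_lengths": sorted({len(r["email_address"].strip()) for r in rows}),
    }


report = investigate(SAMPLE)
print("Repeated names:", report["repeated_names"])
print("Suspicious phone numbers:", report["short_phones"])
print("Distinct email lengths:", report["email_lengths"])
```

The Investigate stage reports this kind of information without any hand-written checks; the sketch only makes the three observations concrete.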
Quick steps
· Define and read an input file
· Select the columns for investigation
· Choose the rule set based on the columns
· Choose the Token report or Pattern report and see the quality of the data
About the stage
Types of investigation
o Character discrete investigation: This is the default option. Suppose you would like to
investigate how many people's zip codes start with the same first two digits. In this case you can
choose this option and a mask like CCXXX. This means the first two digits of the zip code are
used and the remaining digits are ignored.
o Character concatenate investigation: Normally the best option when two columns are
investigated together, such as address line 1 and address line 2.
o Word investigation
Output
o Token report
o Pattern report
About Masks:
Input text = IBM Quality Stage 8
If you apply Mask C, the output is IBM Quality Stage 8, as it is.
If you apply Mask T, the output is aaabaaaaaaabaaaaabn, where a is a letter, b is a space and n is a number.
If you apply Mask C and Mask X, like CCCXCCCCCCCXCCCCCXC, the output is IBMQualityStage8.
If you apply Mask T and Mask X, like TTTXTTTTTTTXTTTTTXT, the output is aaaaaaaaaaaaaaan.
Mask X simply skips the character at that position and gives you the result for the
remaining characters of the input.
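The mask behaviour described above can be sketched in Python as follows. The `apply_mask` function is a hypothetical helper written for this article, not part of QualityStage, and its handling of characters other than letters, digits and spaces (they are passed through unchanged) is an assumption.

```python
def apply_mask(text, mask):
    """Hypothetical helper that mimics the C, T and X column masks."""
    if len(text) != len(mask):
        raise ValueError("mask must have one character per input position")
    out = []
    for ch, m in zip(text, mask.upper()):
        if m == "C":
            out.append(ch)              # keep the actual character
        elif m == "T":
            if ch.isalpha():
                out.append("a")         # letter
            elif ch.isdigit():
                out.append("n")         # number
            elif ch == " ":
                out.append("b")         # space
            else:
                out.append(ch)          # assumption: shown as it is
        elif m == "X":
            continue                    # skip this position
        else:
            raise ValueError(f"unknown mask character: {m!r}")
    return "".join(out)


text = "IBM Quality Stage 8"
# C everywhere except the three spaces, which are skipped with X.
c_and_x = "CCC" + "X" + "C" * 7 + "X" + "C" * 5 + "X" + "C"
t_and_x = c_and_x.replace("C", "T")

print(apply_mask(text, "C" * len(text)))   # Mask C only
print(apply_mask(text, "T" * len(text)))   # Mask T only
print(apply_mask(text, c_and_x))           # Mask C with X on the spaces
print(apply_mask(text, t_and_x))           # Mask T with X on the spaces
```

The four calls correspond to the four mask examples above, in the same order.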
C. Displays the actual character and includes it in the frequency
count and pattern analysis.
You use the C column mask when you want to inspect the actual
values in your columns to make sure there is no false data in a column. For
example, 99999 for a postal code or 111111111 for a national identification.
T. Displays the type of character in the frequency count and
pattern analysis. The character types include a for a letter, n for a number and
b for a space, as in the mask example above.
You use the T column mask when you want to inspect the type of
data in a character position such as with telephone numbers as nnn-nnn-nnnn or
(nnn)-nnn-nnnn.
X. Skips the character and does not include it in the frequency
count or the pattern analysis. It does include it in the sample data.
You use the X column mask when you only want to include the data
from the column in the sample but not as a token or part of the token for
investigation. For example, you want to investigate the first two characters of
a postal code to determine the frequency distribution based on state. You would
set the column mask for the postal code to CCXXX. The pattern column of the
pattern report displays only the first two characters. The frequency count
would be based on the number of records in the file that start with the first
two characters of the postal code. In the value column, you would see all five characters
of the postal code in the sample.
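The CCXXX case can be illustrated with a small frequency count. This is a rough sketch only: `prefix_frequency` is a hypothetical helper, the postal codes are taken from the sample data earlier in this article, and the real pattern report is laid out differently.

```python
from collections import Counter

# Postal codes taken from the sample data earlier in this article.
zip_codes = ["50265", "50266", "97583", "61853", "64321", "54210", "54210"]


def prefix_frequency(values, mask="CCXXX"):
    """Hypothetical helper: count values by the characters kept by a C/X mask."""
    counts = Counter()
    samples = {}
    for value in values:
        # Keep only the positions marked C; positions marked X are skipped.
        token = "".join(ch for ch, m in zip(value, mask) if m == "C")
        counts[token] += 1
        # Remember the first full value seen for this token as the sample.
        samples.setdefault(token, value)
    return counts, samples


counts, samples = prefix_frequency(zip_codes)
for token, count in counts.most_common():
    print(token, count, samples[token])
```

Each printed line shows the two-character pattern, how many records share it, and one full five-character postal code as the sample value.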
By default, all characters at each position in the column are set
to T (type). For every position in the column, adjust the mask as necessary: