Remove Duplicates Stage in DataStage

The Remove Duplicates stage is a processing stage. It can have a single input link and a single output link. The Remove Duplicates stage takes a single sorted data set as input, removes all duplicate rows, and writes the results to an output data set.

Input data should be sorted for this stage so that all records of having similar key values will be adjacent. Instead of adding separate ‘Sort stage’, we can perform ‘Link Level Sort’.

Remove Duplicates stage: Properties tab
Key - Specifies the key column for the operation. This property can be repeated to specify multiple key columns.

Remove Duplicates stage: Options category
Duplicate to retain - Specifies which of the duplicate columns encountered to retain. Choose between First and Last. It is set to First by default.

Consider below input data
ID Name
10 Joe
11 Marsh
12 Shawn
10 Joe
10 Roger

Step 1: Design job structure as shown below.




Step 2: Sort the data on ID column in sort stage
Step 3: Double click on Remove duplicate stage and define key = ID and Duplicate to retain = First.
Step 4: Map all the required output column under ‘Output’ tab in Remove duplicate stage.

You will see the below output:
ID Name
10 Joe
11 Marsh
12 Shawn

Duplicate entries will get removed.

How to remove duplicates without using remove duplicate stage in DataStage?
1. Using hash file stage (Specify the keys and check the unique checkbox, Unique Key is not allowed duplicate values)
2. Using a sort stage,set property: ALLOW DUPLICATES :false
2. You can do it at any stage. Just do a hash partion of the input data and check the options stable Sort and Unique.

Comments

Post a Comment

Popular Posts