Remove Duplicates Stage in DataStage

September 04, 2018

Remove Duplicates Stage in DataStage

The Remove Duplicates stage is a processing stage. It can have a single input link and a single output link. The Remove Duplicates stage takes a single sorted data set as input, removes all duplicate rows, and writes the results to an output data set.

Input data should be sorted for this stage so that all records of having similar key values will be adjacent. Instead of adding separate ‘Sort stage’, we can perform ‘Link Level Sort’.

Remove Duplicates stage: Properties tab

Key - Specifies the key column for the operation. This property can be repeated to specify multiple key columns.

Remove Duplicates stage: Options category

Duplicate to retain - Specifies which of the duplicate columns encountered to retain. Choose between First and Last. It is set to First by default.

Consider below input data

ID Name

10 Joe

11 Marsh

12 Shawn

10 Joe

10 Roger

Step 1: Design job structure as shown below.

Step 2: Sort the data on ID column in sort stage

Step 3: Double click on Remove duplicate stage and define key = ID and Duplicate to retain = First.

Step 4: Map all the required output column under ‘Output’ tab in Remove duplicate stage.

You will see the below output:

ID Name

10 Joe

11 Marsh

12 Shawn

Duplicate entries will get removed.

How to remove duplicates without using remove duplicate stage in DataStage?

1. Using hash file stage (Specify the keys and check the unique checkbox, Unique Key is not allowed duplicate values)

2. Using a sort stage,set property: ALLOW DUPLICATES :false

2. You can do it at any stage. Just do a hash partion of the input data and check the options stable Sort and Unique.

Comments

Add comment

Search This Blog

IBM DataStage Tutorial and Guide

Remove Duplicates Stage in DataStage

Comments

Post a Comment

Popular Posts

DataStage Architecture

Change Capture Stage in DataStage