DataStage Job Run-time Architecture
During the start-up phase of a parallel job, several processes are started and several verifications are performed. Here is a list of the most relevant:
1. The Parallel Engine starts the Conductor process along with several other processes, including the job monitor process.
2. The Parallel Engine builds the score (the execution plan) of the job.
3. The Parallel Engine verifies the input and output schema of every stage. The larger the schemas, the longer this step takes.
4. The Parallel Engine verifies that the settings for every stage (operator) are valid. If a job has stages that interact with databases, DataStage connects to each database and, if needed, verifies that the database is configured properly to work with parallel processes.
5. The Parallel Engine writes a copy of the job design to disk.
6. The Parallel Engine connects to (remote) servers and starts Section Leader processes. Communication channels are created between Section Leader processes and the Conductor Process.
7. Section Leaders receive the score (the job plan) from the Conductor and create the Player processes that will run the job. After that, communication channels are opened between Players for record transfer.
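The fan-out in steps 1, 6, and 7 can be pictured as a process tree. Here is a toy Python sketch (an illustration only, not DataStage code) that mimics the topology with ordinary OS processes: one conductor forks one section leader per node, and each section leader forks one player per stage.

    # Toy model of the Conductor -> Section Leader -> Player fan-out.
    # Illustration only, not DataStage code.
    import multiprocessing as mp

    def player(node, stage):
        print(f"player: stage '{stage}' on {node}")

    def section_leader(node, stages):
        # One Section Leader per logical node; it creates one Player per stage.
        players = [mp.Process(target=player, args=(node, s)) for s in stages]
        for p in players:
            p.start()
        for p in players:
            p.join()

    def conductor(nodes, stages):
        # One Conductor per job; it creates one Section Leader per node.
        leaders = [mp.Process(target=section_leader, args=(n, stages)) for n in nodes]
        for l in leaders:
            l.start()
        for l in leaders:
            l.join()

    if __name__ == "__main__":
        # A 2-node configuration and a 3-stage job: 1 + 2 + 6 = 9 processes.
        conductor(["node1", "node2"], ["read", "transform", "write"])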
Job scores are divided into two sections — data sets (partitioning and collecting) and operators (node/operator mapping). Both sections identify sequential or parallel processing.
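Schematically, a dumped score (written to the job log when APT_DUMP_SCORE is set) has roughly the following layout. This is a simplified sketch with placeholders, not verbatim engine output; the ds entries describe partitioning and collecting between operators, and the op entries map operators to nodes:

    main_program: This step has <n> datasets:
    ds0: {op0[1p] (sequential ...)
          ...
          op1[3p] (parallel ...)}
    ...
    It has <m> operators:
    op0[1p] {(sequential ...)
        on nodes (
          node1[op0,p0]
        )}
    ...
    It runs <p> processes on <q> nodes.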
For every job that starts there will be one (1) Conductor process (started on the conductor node), one (1) Section Leader for each node in the configuration file, and one (1) Player process for each stage in the job on each node (the player count is not guaranteed, because the engine may combine operators into a single player). So if you have a job that uses a two (2) node configuration file and has 3 stages, then your job will have:
1 conductor
2 section leaders (2 nodes * 1 section leader per node)
6 player processes (3 stages * 2 nodes)
Your dump score may show that your job will run 9 processes on 2 nodes.
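For reference, a minimal two-node configuration file of the kind assumed in this example might look like the following; the host name and resource paths are illustrative placeholders:

    {
      node "node1"
      {
        fastname "server1"
        pools ""
        resource disk "/opt/IBM/datasets" {pools ""}
        resource scratchdisk "/opt/IBM/scratch" {pools ""}
      }
      node "node2"
      {
        fastname "server1"
        pools ""
        resource disk "/opt/IBM/datasets" {pools ""}
        resource scratchdisk "/opt/IBM/scratch" {pools ""}
      }
    }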
Conductor Node (one per job): The Conductor is the main process; it is used to start up the job, determine resource assignments, and create Section Leader processes on one or more processing nodes. It acts as the single coordinator for status and error messages, and it manages an orderly shutdown when processing completes or when a fatal error occurs. The Conductor runs on the primary (conductor) node.
Section Leaders (one per logical processing node): A Section Leader creates and manages the Player processes that perform the actual job execution. Section Leaders also manage communication between their individual Player processes and the Conductor.
Player processes: These are one or more logical groups of processes that execute the data-flow logic. All Players are created as groups on the same server as their managing Section Leader process.
Example: A job design consists of an input Sequential File stage, a Modify stage, a Filter stage, and an output Sequential File stage. The job runs in an SMP environment with a configuration file that defines 3 nodes. How many osh processes will this job create?
1 Conductor
3 Section Leaders (3 nodes * 1 Section Leader per node)
2 Players for the two Sequential File stages (2 stages * 1 node, since Sequential File stages run sequentially)
6 Players for the Modify and Filter stages (2 stages * 3 nodes)
Total: 1 + 3 + 2 + 6 = 12 osh processes
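As a quick sanity check, the same arithmetic in a few lines of Python (assuming one player per stage per node and no operator combination, which in practice can merge players and lower the total):

    # Process-count arithmetic for the example above.
    nodes = 3
    sequential_players = 2 * 1      # two Sequential File stages, 1 node each
    parallel_players   = 2 * nodes  # Modify and Filter run on every node

    total = 1 + nodes + sequential_players + parallel_players
    #       ^conductor  ^section leaders (one per node)  ^players
    print(total)  # -> 12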