DataStage File Stages

August 22, 2018

File stages are used to read data from and write data to files. The following are some common file stages used in DataStage.

Sequential File Stage

The Sequential File stage is a file stage. It is the most common I/O stage used in a DataStage job. It is used to read data from or write data to one or more flat files. It can have only one input link or one output link. It can also have one reject link.
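The idea of a reject link can be pictured with a small Python sketch. This is only an analogy and not how DataStage is implemented: the function name read_with_rejects and the rule "a row is bad when its column count is wrong" are assumptions made for illustration.

```python
import csv
import io


def read_with_rejects(text, expected_columns):
    """Split delimited rows into (accepted, rejected) by column count."""
    accepted, rejected = [], []
    for row in csv.reader(io.StringIO(text)):
        if len(row) == expected_columns:
            accepted.append(row)   # would flow down the output link
        else:
            rejected.append(row)   # would flow down the reject link
    return accepted, rejected


# Hypothetical sample data: the second row is missing a column.
sample = "1,alice,NY\n2,bob\n3,carol,LA\n"
good, bad = read_with_rejects(sample, expected_columns=3)
print(good)
print(bad)
```

Rows that do not match the expected layout are kept aside instead of failing the whole read, which is the role the reject link plays in a job.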

When handling huge volumes of data, this stage can itself become one of the major bottlenecks, because reading from and writing to it is slow.

Sequential files should be used in the following conditions:

When we are reading a flat file (fixed width or delimited) from a UNIX environment that was FTPed from an external system.

When some UNIX operation has to be done on the file.

Do not use a sequential file for intermediate storage between jobs. It causes performance overhead, because the data must be converted before it is written to and read from a UNIX file.

To speed up reading from the stage, the number of readers per node can be increased (the default value is one).
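A rough way to picture several readers working on one file is to give each reader its own byte range. The Python sketch below illustrates that general technique only; it is an assumption about the idea, not DataStage's actual reader logic, and the names read_range and parallel_read are hypothetical.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_range(path, start, end):
    """Return the lines whose first byte lies in the range [start, end)."""
    lines = []
    with open(path, "rb") as f:
        if start > 0:
            # Step back one byte and consume up to the next newline, so we
            # begin on a line boundary and skip any line owned by the
            # previous reader.
            f.seek(start - 1)
            f.readline()
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            lines.append(line.rstrip(b"\n").decode())
    return lines


def parallel_read(path, readers):
    """Read one file with several readers, each on its own byte range."""
    size = os.path.getsize(path)
    bounds = [(i * size // readers, (i + 1) * size // readers)
              for i in range(readers)]
    with ThreadPoolExecutor(max_workers=readers) as pool:
        parts = pool.map(lambda b: read_range(path, *b), bounds)
    # Concatenate the parts in range order to restore the file order.
    return [line for part in parts for line in part]


# Hypothetical demo file with 25 short rows.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "demo.txt")
    with open(demo, "wb") as f:
        f.write(b"".join(f"row{i}\n".encode() for i in range(25)))
    print(len(parallel_read(demo, readers=4)))
```

Each line is read by exactly one reader, the one whose range contains the first byte of that line, so no row is lost or duplicated at the range boundaries.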

Dataset Stage

The Data Set stage is a file stage that allows reading data from or writing data to a dataset. This stage can have a single input link or a single output link. It does not support a reject link.

It can be configured to operate in sequential mode or parallel mode. DataStage parallel extender jobs use datasets to store the data being operated on in a persistent form. Datasets are operating system files which by convention have the suffix .ds. Datasets are much faster than sequential files. The data is spread across multiple nodes and is referenced by a control file.

Datasets are not ordinary UNIX flat files, and ordinary UNIX operations cannot be performed on them. Using datasets gives good performance in a set of linked jobs. They help achieve end-to-end parallelism by writing data in partitioned form and maintaining the sort order. They also preserve partitions.

A dataset has the following parts:

Descriptor file: contains the metadata and the data location, but NOT the data itself.

Data file(s): contain the data in native format, for example C:/IBM/Information Server / Server/data set/ file.ds

Control file (or header file): resides in the operating system.
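The split between a descriptor and its data files can be sketched in Python. This is an analogy only: the real dataset files use an internal native format, while the sketch uses JSON, and the names write_dataset, read_dataset, and the ".part" file suffix are hypothetical.

```python
import json
import os
import tempfile


def write_dataset(directory, name, columns, partitions):
    """Write one data file per partition plus a descriptor pointing at them."""
    data_paths = []
    for i, rows in enumerate(partitions):
        path = os.path.join(directory, f"{name}.part{i}")
        with open(path, "w") as f:
            json.dump(rows, f)
        data_paths.append(path)
    descriptor = os.path.join(directory, f"{name}.ds")
    with open(descriptor, "w") as f:
        # The descriptor holds metadata and locations, not the rows.
        json.dump({"columns": columns, "data_files": data_paths}, f)
    return descriptor


def read_dataset(descriptor):
    """Follow the descriptor and return the rows, partition by partition."""
    with open(descriptor) as f:
        meta = json.load(f)
    partitions = []
    for path in meta["data_files"]:
        with open(path) as f:
            partitions.append(json.load(f))
    return meta["columns"], partitions


with tempfile.TemporaryDirectory() as tmp:
    ds = write_dataset(tmp, "customers", ["id", "name"],
                       [[[1, "alice"], [3, "carol"]], [[2, "bob"]]])
    print(read_dataset(ds))
```

Because the rows stay in one file per partition, a reader gets the data back in the same partitions in which it was written, which is the property that lets linked jobs avoid repartitioning.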

File set stage

The File Set stage is a file stage. It allows you to read data from or write data to a file set. The stage can have a single input link, a single output link, and a single rejects link.

It executes only in parallel mode. The advantage of using a file set over a sequential file is that it preserves the partitioning scheme. The amount of data that can be stored in each destination data file is limited by the characteristics of the file system and the amount of free disk space available. The number of files created by a file set depends on:

1) The number of processing nodes in the default node pool.

2) The number of disks in the export or default disk pool connected to each processing node in the default node pool.

3) The size of the partitions of the data set.

Lookup file set stage

The Lookup File Set stage is a file stage. It allows you to create a lookup file set or reference one for a lookup. The stage can have a single input link or a single output link. The output link must be a reference link. The stage can be configured to execute in parallel or sequential mode when used with an input link.
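The role of a lookup table on a reference link can be sketched in Python as a keyed table that a stream of rows is matched against. This is a simplified illustration: the names build_lookup and lookup_join are hypothetical, and dropping unmatched rows is an assumption of the sketch, not a statement about the stage's default behavior.

```python
def build_lookup(rows, key):
    """Index reference rows by key, as a lookup table would."""
    return {row[key]: row for row in rows}


def lookup_join(stream, table, key):
    """Enrich each stream row with its matching reference row."""
    for row in stream:
        match = table.get(row[key])
        if match is not None:
            # Assumption: rows without a match are simply dropped here.
            yield {**row, **match}


# Hypothetical reference data and input stream.
countries = build_lookup(
    [{"code": "US", "country": "United States"},
     {"code": "IN", "country": "India"}],
    key="code",
)
orders = [{"order": 1, "code": "IN"}, {"order": 2, "code": "XX"}]
print(list(lookup_join(orders, countries, key="code")))
```

Building the table once and reusing it for every incoming row is what makes a keyed lookup cheap compared with scanning the reference data each time.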

External source stage

The External Source stage is a file stage. It allows you to read data that is output from one or more source programs. The stage calls the program and passes appropriate arguments. The stage can have a single output link, and a single rejects link. It can be configured to execute in parallel or sequential mode.
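Calling a program and treating its output as rows can be sketched in Python with the standard subprocess module. The function name read_from_program and the comma-delimited output are assumptions for illustration; the source program here is just a second Python process that prints two rows.

```python
import subprocess
import sys


def read_from_program(argv):
    """Run a program with its arguments and parse its stdout into rows."""
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return [line.split(",") for line in result.stdout.splitlines()]


# Hypothetical source program: prints two comma-delimited records.
source = [sys.executable, "-c", "print('1,alice'); print('2,bob')"]
rows = read_from_program(source)
print(rows)
```

With check=True, a source program that exits with an error raises an exception instead of silently yielding no rows.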

External Target stage

The External Target stage is a file stage. It allows you to write data to one or more target programs. The stage can have a single input link and a single rejects link. It can be configured to execute in parallel or sequential mode.
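Writing rows to a program can be sketched the same way, by feeding them to the program's standard input. The function name write_to_program and the comma-delimited format are assumptions for illustration; the target program here is a small Python process that counts the lines it receives.

```python
import subprocess
import sys


def write_to_program(argv, rows):
    """Send rows to a program on stdin and return whatever it prints."""
    payload = "".join(",".join(row) + "\n" for row in rows)
    result = subprocess.run(argv, input=payload, capture_output=True,
                            text=True, check=True)
    return result.stdout


# Hypothetical target program: counts the records it is given.
target = [sys.executable, "-c",
          "import sys; print(sum(1 for _ in sys.stdin))"]
print(write_to_program(target, [["1", "alice"], ["2", "bob"]]).strip())
```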

Complex Flat File stage

The Complex Flat File (CFF) stage is a file stage. You can use the stage to read a file or write to a file, but you cannot use the same stage to do both. As a source, the CFF stage can have multiple output links and a single reject link. You can read data from one or more complex flat files, including MVS™ data sets with QSAM and VSAM files. You can also read data from files that contain multiple record types. The source data can contain one or more of the following clauses:

GROUP

REDEFINES

OCCURS

OCCURS DEPENDING ON

CFF source stages run in parallel mode when they are used to read multiple files, but you can configure the stage to run sequentially if it is reading only one file with a single reader.

By using CFF, we can read ASCII or EBCDIC (Extended Binary Coded Decimal Interchange Code) data. We can select the required columns and omit the rest. We can collect the rejects (badly formatted records) by setting the rejects property to "save" (the other options are continue and fail). We can also flatten the arrays found in COBOL files.
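Reading EBCDIC data and flattening an array can be illustrated with a short Python sketch. The record layout is hypothetical (a 5-character NAME followed by a 2-digit SCORE that occurs 3 times, all stored as characters), and cp037 is just one of several EBCDIC code pages; the CFF stage itself does this work from the column definitions.

```python
def parse_record(raw):
    """Decode one fixed-width EBCDIC record into a name and its scores."""
    text = raw.decode("cp037")          # EBCDIC bytes to a Python string
    name = text[0:5].rstrip()
    # SCORE occurs 3 times, 2 characters each, starting at offset 5.
    scores = [int(text[5 + 2 * i: 7 + 2 * i]) for i in range(3)]
    return name, scores


def flatten(name, scores):
    """Turn the repeated SCORE field into separate columns."""
    row = {"NAME": name}
    for i, score in enumerate(scores, start=1):
        row[f"SCORE_{i}"] = score
    return row


# Build a sample EBCDIC record by encoding a known string.
raw = "ALICE908575".encode("cp037")
print(flatten(*parse_record(raw)))
```

Flattening turns one record holding an array into a single row with one column per array element, which is easier for downstream stages to process.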

As a target, the CFF stage can have a single input link and a single reject link. You can write data to one or more complex flat files. You cannot write to MVS data sets or to files that contain multiple record types.
