DataStage – Overview

DataStage is an ETL tool and a component of the IBM Information Platforms Solutions suite and IBM InfoSphere, which is why it is commonly called IBM InfoSphere DataStage. The tool can extract data from dissimilar sources, carry out transformations according to a business's requirements, and load the data into chosen data warehouses. It is widely used for the development and maintenance of data warehouses and data marts.
It was first launched by VMark in the mid-1990s. After IBM acquired DataStage in 2005, it was renamed IBM WebSphere DataStage and later IBM InfoSphere DataStage. DataStage has been available in several editions, including Enterprise Edition (PX), Server Edition, MVS Edition, DataStage for PeopleSoft, and so on. The latest edition is IBM InfoSphere DataStage.

DataStage plays a major role in information management within the Business Intelligence (BI) domain. It provides a GUI (Graphical User Interface) driven environment to carry out the extract, transform, and load work.

The ETL work is carried out through jobs. A DataStage job is an executable unit of work that can be compiled and run individually or as part of a larger data flow stream.

A job is made of various stages that are connected via links. 

Stages serve many purposes: database stages connect to source and target systems, processing stages carry out data transformations, file stages connect to various file systems, and so on.
Links connect the stages in a job and describe the flow of data between them.
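To make the job, stage, and link model concrete, here is a minimal conceptual sketch in Python. This is not DataStage code; the stage names and sample rows are invented purely for illustration of how data flows from a source stage through a processing stage to a target stage.

  # Conceptual sketch only: models a DataStage-style job as stages connected
  # by links. Stage names and records are invented for illustration.

  def source_stage():
      # Plays the role of a file or database source stage: emits rows.
      return [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]

  def transformer_stage(rows):
      # Plays the role of a processing stage: applies a simple transformation.
      return [{**row, "name": row["name"].upper()} for row in rows]

  def target_stage(rows):
      # Plays the role of a target stage: "loads" rows (here, just prints them).
      for row in rows:
          print("loaded:", row)

  # Links are the hand-offs between stages; data flows source -> transform -> target.
  target_stage(transformer_stage(source_stage()))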

What is DataStage

DataStage is an ETL tool that extracts, transforms, and loads data from source to target. The data sources may include sequential files, indexed files, relational databases, external data sources, archives, enterprise applications, and so on. DataStage facilitates business analysis by providing quality data that helps in gaining business intelligence.

DataStage Features 
  • DataStage can extract data from any source and load it into any target.
  • A job developed on one platform can run on any other platform. For example, a job designed for uniprocessor (single-node) processing can run on an SMP machine.
  • Node configuration is a technique for defining logical CPUs; a node is a logical CPU.
  • Partition parallelism is a technique that distributes the data across the nodes based on a chosen partitioning method, such as hash or round robin (see the sketch after this list).
  • It can be used to build and load data warehouse which can operate in batch, real time, or as a Web service.
  • It can handle complex transformations and manage multiple integration processes. 
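As a rough illustration of node configuration and partition parallelism, the Python sketch below hash-partitions rows across a configurable number of logical nodes and processes each partition independently. This is a plain Python approximation of the idea, not the DataStage parallel engine or its configuration file; the node count and sample data are assumptions made for the example.

  # Illustrative sketch of partition parallelism: hash-partition rows across
  # logical "nodes" and process each partition independently.
  from collections import defaultdict

  NUM_NODES = 4  # plays the role of logical CPUs defined by node configuration

  rows = [{"customer_id": i, "amount": i * 10} for i in range(1, 11)]

  # Hash partitioning: rows with the same key always land on the same node.
  partitions = defaultdict(list)
  for row in rows:
      node = hash(row["customer_id"]) % NUM_NODES
      partitions[node].append(row)

  # Each node would process its own partition in parallel; shown sequentially here.
  for node, part in sorted(partitions.items()):
      total = sum(r["amount"] for r in part)
      print(f"node {node}: {len(part)} rows, total amount {total}")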
DataStage ETL work is carried out through jobs. There are mainly three different types of jobs:
  • Server jobs
  • Parallel jobs
  • Sequence jobs
DataStage server jobs run on a single node on the DataStage server engine. Server jobs handle smaller volumes of data and have slower processing capabilities. Server jobs offer a smaller set of components and are compiled into the BASIC language.

DataStage parallel jobs run on multiple nodes on the DataStage parallel engine. Parallel jobs can handle huge volumes of data with high processing speed. Parallel jobs offer a larger set of components and are compiled into OSH (Orchestrate Shell script), except for the Transformer stage, which compiles into C++.

Sequence jobs - For more complex designs, you can build sequence jobs that run multiple jobs in conjunction with one another. By using sequence jobs, you can integrate programming controls, such as branching and looping, into your job workflow.
You specify the control information, such as the different courses of action to take depending on whether a job in the sequence succeeds or fails. After you create a sequence job, you schedule it to run using the InfoSphere DataStage Director client, just as you would a parallel job or server job.
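The branching behaviour of a sequence job can be pictured with a small Python sketch. This is purely illustrative; the job names and the run_job function are hypothetical and are not part of any DataStage API.

  # Illustrative control flow of a sequence job: run one job, then branch on
  # whether it succeeded or failed. Job names and run_job() are hypothetical.
  def run_job(name):
      print(f"running {name}")
      return True  # pretend the job finished with status OK

  if run_job("load_staging"):          # first activity in the sequence
      run_job("build_warehouse")       # success path
  else:
      run_job("send_failure_alert")    # failure path / exception handling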
