DataStage Configuration File
The DataStage configuration file is a master control file (a
text file that sits on the server side) that describes the parallel
system resources and architecture available to jobs. It provides the hardware
configuration for architectures such as SMP (a single machine with
multiple CPUs, shared memory and disk), Grid, Cluster or MPP (multiple
nodes, each with its own CPUs and dedicated memory). DataStage understands the
architecture of the system through this file.
This is one of the biggest strengths of DataStage. If you
change your processing configuration, or move to different servers or a different
platform, you generally do not have to change your jobs, since
all jobs depend on this configuration file for execution. DataStage jobs
determine which node to run a process on, where to store temporary data,
and where to store dataset data, based on the entries provided in the
configuration file. A default configuration file is available whenever
the server is installed.
Configuration files have the extension ".apt". The
main benefit of the configuration file is that it separates software and
hardware configuration from job design, so hardware and
software resources can change without changing a job design. DataStage jobs can point to
different configuration files by using job parameters, which means that a job
can utilize different hardware architectures without being recompiled.
The general form of a configuration file is as follows:
/* commentary */
{
node "node name" {
<node information>
.
.
.
}
.
.
.
}
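As a concrete instance of this general form, here is a minimal sketch of a
single-node file. The node name, host name and directory paths are hypothetical
placeholders, not values from a real installation; the entries inside the node
block are the options described in the next section.

```text
/* Minimal single-node sketch; all names and paths are hypothetical */
{
  node "node1" {
    fastname "etlhost"
    pools ""
    resource disk "/data/datasets" {pools ""}
    resource scratchdisk "/data/scratch" {pools ""}
  }
}
```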
What are the different options a logical node can have in the configuration file?
1. Fastname – The fastname is the physical node name that
stages use to open connections for high volume data transfers. The value of
this option is usually the network name of the host. Typically, you can get this name by
using the Unix command ‘uname -n’.
2. Pools – Names of the pools to which the node is
assigned. Based on the characteristics of the processing nodes you can group
nodes into sets of pools. A pool can be associated with many nodes and a node
can be part of many pools.
A node belongs to the default pool unless you explicitly
specify a pools list for it and omit the default pool name ("") from the list.
3. Resource disk – Specifies the location on
your server where the processing node will write all the data set files. When
DataStage creates a dataset, the file you see does not contain
the actual data. The dataset file points to the place where the
actual data is stored, and that place is specified in
this line.
4. Resource scratchdisk – Specifies the location of temporary files
created during DataStage processes, such as lookups and sorts.
If the node is part of the sort pool then the scratch disk can also be
made part of the sort scratch disk pool. This ensures that the temporary
files created during a sort are stored only in this location. If no such pool is
specified, DataStage checks whether any scratch disk resources
belong to the default scratch disk pool on the nodes that the sort is
specified to run on. If so, that space is used.
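To illustrate how the four options fit together, here is a sketch with two
logical nodes. The first node keeps the default pool and also joins the sort
pool, with one scratch disk reserved for the sort scratch disk pool. The second
node lists only a named pool, so it is left out of the default pool. The host
names, the paths and the pool name "etl" are hypothetical; "sort" is assumed to
be the pool name that the sort described above looks for.

```text
{
  node "node1" {
    /* typically the output of uname -n on this machine */
    fastname "etlhost1"
    /* default pool ("") plus the sort pool */
    pools "" "sort"
    /* where dataset data files are written */
    resource disk "/data/datasets" {pools ""}
    /* scratch space reserved for sort temporary files */
    resource scratchdisk "/fast/sortscratch" {pools "sort"}
    /* scratch space in the default scratch disk pool */
    resource scratchdisk "/data/scratch" {pools ""}
  }
  node "node2" {
    fastname "etlhost2"
    /* "" is omitted, so node2 is not in the default pool */
    pools "etl"
    resource disk "/data/datasets" {pools ""}
    resource scratchdisk "/data/scratch" {pools ""}
  }
}
```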
A sample configuration file is shown below:
{
  node "node1" {
    fastname "node1_css"
    pools "" "node1" "node1_css"
    resource disk "/orch/s0" {}
    resource scratchdisk "/scratch0" {pools "buffer"}
    resource scratchdisk "/scratch1" {}
  }
  node "node2" {
    fastname "node2_css"
    pools "" "node2" "node2_css"
    resource disk "/orch/s0" {}
    resource scratchdisk "/scratch0" {pools "buffer"}
    resource scratchdisk "/scratch1" {}
  }
}
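The sample above gives each node a different fastname, which suits nodes that
live on separate machines. On a single SMP machine, several logical nodes would
typically share one fastname. A minimal sketch under that assumption follows;
the host name "smphost" and the disk paths are hypothetical.

```text
/* Two logical nodes on one SMP host; names and paths are hypothetical */
{
  node "node1" {
    fastname "smphost"
    pools ""
    resource disk "/orch/s0" {}
    resource scratchdisk "/scratch0" {}
  }
  node "node2" {
    fastname "smphost"
    pools ""
    resource disk "/orch/s1" {}
    resource scratchdisk "/scratch1" {}
  }
}
```

Giving each logical node its own disk and scratch directories, as sketched here,
is one way to spread I/O across file systems; whether that helps depends on how
the underlying storage is laid out.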