The ETL Help Guide: Datastage Configuration File

Datastage learns which resources are available to a job from the configuration file. A configuration file specifies which processing, storage, and sorting facilities on your system are used to run a parallel job.

Why use a configuration file -
1) Datastage jobs stay independent of the underlying hardware resources.
2) The configuration file can be changed at any time to suit the user's processing requirements.
3) Multiple configuration files can be maintained and used to run different processes in the same project.

Default configuration file - default.apt. The file a job actually uses is the one named by the APT_CONFIG_FILE environment variable.

Syntax -
/* commentary */
{
    node "node name" {
        <node information>
        .
    }
    .
}

Sample -
{
    node "Nod0" {
        fastname "Nod"
        pools "pool_1" ""
        resource disk "/dir/node0" {pools ""}
        resource scratchdisk "/scratch0" {pools ""}
    }
    node "Nod1" {
        fastname "Nod"
        pools "pool_1" ""
        resource disk "/dir/node1" {pools ""}
        resource scratchdisk "/scratch1" {pools "" "sort"}
    }
    node "Nod2" {
        fastname "Nod"
        pools "pool_1" ""
        resource disk "/dir/node2" {pools ""}
        resource scratchdisk "/scratch2" {pools "" "sort"}
    }
}
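
Because every node entry in the sample follows the same pattern, a file like it can be generated rather than typed by hand. The following is only a minimal sketch: the function name `render_config` and its parameters are hypothetical, the naming scheme ("Nod0", "/dir/node0", "/scratch0") is copied from the sample, and for simplicity every scratch disk is placed in the default pool only.

```python
def render_config(node_count, host="Nod", disk_root="/dir", scratch_root="/scratch"):
    """Return configuration-file text for `node_count` logical nodes on one host.

    Hypothetical helper: names and paths follow the sample's scheme and
    are not required by Datastage.
    """
    if node_count < 1:
        raise ValueError("at least one node is required")
    lines = ["{"]
    for i in range(node_count):
        lines += [
            f'    node "Nod{i}" {{',
            f'        fastname "{host}"',
            '        pools "pool_1" ""',
            f'        resource disk "{disk_root}/node{i}" {{pools ""}}',
            f'        resource scratchdisk "{scratch_root}{i}" {{pools ""}}',
            "    }",
        ]
    lines.append("}")
    return "\n".join(lines) + "\n"


# Print a two-node configuration to inspect it before saving it to a file.
print(render_config(2))
```

Keeping the generator separate from the job makes it easy to maintain several configuration files with different node counts for the same project.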

Logical processing nodes -
A logical processing node is a set of hardware resources (processors, storage space, and temp space) on which your parallel job will run.
The number of processing nodes does not necessarily correspond to the number of CPUs in your system.

fastname –
The fastname is the physical node name (network name) of the machine. On Unix it can usually be obtained with the command 'uname -n'.
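
If the fastname has to be looked up from a script instead of the shell, Python's standard library offers a comparable call. This is a small sketch; the helper name `local_fastname` is hypothetical, and the returned name may or may not be fully qualified depending on how the host is set up.

```python
import platform


def local_fastname():
    """Return this machine's network name, comparable to `uname -n` on Unix."""
    return platform.node()


# Print the value in the form it would take inside a node entry.
print(f'fastname "{local_fastname()}"')
```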

pools –
1) Names of the pools to which the node is assigned. You can group nodes into pools based on the characteristics of the processing nodes.
2) A parallel job, or a specific stage in it, can be constrained to run on a pool (a set of processing nodes). Resources can be pooled as well: in the sample above, the scratch disks of Nod1 and Nod2 are assigned to the "sort" pool.
3) The "sort" scratch disk pool assigns scratch disk space explicitly for the temporary files created by the Sort stage. If no such pool is defined, the Sort stage uses the default scratch disk pool ("").
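
The fallback rule in point 3 can be pictured as a small lookup. The sketch below is an illustration only, not how Datastage implements it: the dictionary `SCRATCH_DISKS` is a hand-written copy of the scratch disk entries from the sample, and `sort_scratch_dirs` is a hypothetical helper.

```python
DEFAULT_POOL = ""  # the default pool is written as an empty string

# Scratch disks copied from the sample: a list of (path, pools) per node.
SCRATCH_DISKS = {
    "Nod0": [("/scratch0", [""])],
    "Nod1": [("/scratch1", ["", "sort"])],
    "Nod2": [("/scratch2", ["", "sort"])],
}


def sort_scratch_dirs(disks):
    """Prefer disks in the "sort" pool; otherwise fall back to the default pool."""
    in_sort_pool = [path for path, pools in disks if "sort" in pools]
    if in_sort_pool:
        return in_sort_pool
    return [path for path, pools in disks if DEFAULT_POOL in pools]


# Show which scratch directories a sort would use on each node.
for node, disks in SCRATCH_DISKS.items():
    print(node, sort_scratch_dirs(disks))
```

Nod0 has no scratch disk in the "sort" pool, so the lookup falls back to its default-pool disk, while Nod1 and Nod2 use their "sort" disks.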

resource - resource resource_type "location" [{pools "disk_pool_name"}]

resource disk - the directory that persistent data is read from and written to. This is where the data files of datasets are stored.

resource scratchdisk - the quoted absolute path name of a directory on a file system where intermediate data is temporarily stored. It is local to the processing node. RDBMS-specific resources (e.g. DB2, INFORMIX, ORACLE) can also be declared as resources.

Optimal parallelization -
Datastage creates one process for every active stage on each processing node. The hardware must be able to support the configured degree of parallelism; otherwise overall system performance goes down, because as the number of nodes increases, resource management (process handling, scheduling, reporting, etc.) itself requires more resources.
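
The rule of one process per active stage per node gives a quick estimate of how the process count grows with the node count. The function below is a rough sketch under that rule only; the name `estimated_processes` is hypothetical, and any additional processes the engine starts for coordination, or any runtime optimizations, are not modeled.

```python
def estimated_processes(active_stages, nodes):
    """Estimate stage processes as active stages multiplied by processing nodes."""
    if active_stages < 0 or nodes < 1:
        raise ValueError("need a non-negative stage count and at least one node")
    return active_stages * nodes


# A job with 5 active stages run against configurations of increasing size.
for nodes in (1, 2, 4, 8):
    print(nodes, "nodes:", estimated_processes(5, nodes), "processes")
```

Comparing the estimate with the number of CPUs available is a simple first check before adding more nodes to a configuration file.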

Resource Pooling -
1. For sorting, aggregation, and other operations that require a large amount of temporary space, select nodes from a pool where more disk space is available.
2. Jobs that exchange large amounts of data should be assigned to nodes where stages communicate through either shared memory (SMP) or a high-speed link (MPP).

Friday, 22 December 2017

Datastage Configuration File - APT_CONFIG_FILE
