ETL processes should be designed so that they are restartable from any point of failure with little or no manual intervention.
Why is this required?
1) Save time - an efficient ETL design does not redo work that has already been done.
2) Save money - minimizing manual work lowers the maintenance cost.
This concept applies to each aspect of ETL, i.e. E (extraction), T (transformation), and L (load).
Extraction - Suppose an extraction job aborts after extracting 60% of the source data into the staging area.
Solution - There are many ways to implement restartability here. Most commonly, extraction is done with an overwrite option, so in case of failure the process can simply be restarted without any manual clean-up. When the data volume is huge, however, overwriting everything is inefficient; an incremental/delta fetch should be used instead, driven by an audit table, a source timestamp, or a process flag.
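A minimal sketch of such a delta fetch, using SQLite for illustration. The table and column names (`src_orders`, `stg_orders`, `etl_audit`, `updated_at`) are assumptions, not a fixed convention; the key idea is that the audit watermark advances only in the same transaction that stages the rows, so a rerun after a failure picks up exactly where the last committed run stopped.

```python
import sqlite3

# Illustrative schema: a source table, a staging table, and an audit
# table holding the last successfully processed source timestamp.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_orders (id INTEGER, amount REAL, updated_at TEXT);
    CREATE TABLE stg_orders (id INTEGER, amount REAL, updated_at TEXT);
    CREATE TABLE etl_audit  (job TEXT PRIMARY KEY, last_ts TEXT);
    INSERT INTO src_orders VALUES (1, 10.0, '2024-01-01'),
                                  (2, 20.0, '2024-01-02'),
                                  (3, 30.0, '2024-01-03');
    INSERT INTO etl_audit VALUES ('orders_extract', '2024-01-01');
""")

def extract_delta(conn, job):
    """Fetch only rows newer than the last successful run; restartable
    because the watermark advances only when the staging load commits."""
    (last_ts,) = conn.execute(
        "SELECT last_ts FROM etl_audit WHERE job = ?", (job,)).fetchone()
    rows = conn.execute(
        "SELECT id, amount, updated_at FROM src_orders WHERE updated_at > ?",
        (last_ts,)).fetchall()
    with conn:  # one transaction: stage rows + advance watermark together
        conn.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)", rows)
        if rows:
            conn.execute("UPDATE etl_audit SET last_ts = ? WHERE job = ?",
                         (max(r[2] for r in rows), job))
    return len(rows)

print(extract_delta(conn, "orders_extract"))  # 2 (rows 2 and 3)
print(extract_delta(conn, "orders_extract"))  # 0 (rerun: nothing left)
```

Because delete/insert of the watermark and the staged rows commit together, an abort mid-extract leaves the watermark untouched and the rerun is safe.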
Transformation - IBM DataStage's Sequencer restartability feature checkpoints the run of each activity; in case of any failure, the Sequencer restarts from the aborted job and skips the jobs that have already run.
Load - Suppose a load job aborts after loading only 10,000 of its designated 20,000 records. If the partially loaded records have to be deleted from the target table manually, the process is not restartable.
In this case restartability can be achieved in several ways -
Pre-SQL: A pre-SQL statement can delete all records that were loaded as part of the current batch id/batch date before the load reruns.
Upsert method: Use an upsert rather than a plain insert. That way existing records are updated and only new records not yet present in the target table are inserted.
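A sketch of the upsert option, using SQLite's `INSERT ... ON CONFLICT` syntax for illustration (the target table name and key column are assumptions):

```python
import sqlite3

# The primary key on id is what lets the database detect "already loaded".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tgt_customers (id INTEGER PRIMARY KEY, name TEXT)")

def upsert(conn, rows):
    """Insert new rows, update existing ones: rerunning after a partial
    load simply overwrites what was already written, with no duplicates."""
    conn.executemany(
        """INSERT INTO tgt_customers (id, name) VALUES (?, ?)
           ON CONFLICT(id) DO UPDATE SET name = excluded.name""", rows)
    conn.commit()

upsert(conn, [(1, "Alice"), (2, "Bob")])
upsert(conn, [(2, "Bobby"), (3, "Carol")])  # overlap with first run is harmless
print(conn.execute("SELECT COUNT(*) FROM tgt_customers").fetchone()[0])  # 3
```

The same pattern exists in most RDBMSs (e.g. `MERGE` in DB2/Oracle/SQL Server), though the exact syntax differs.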
Truncate-load & SQL insert: Load into a temporary table in truncate-and-load mode, then load into the final target table with a SQL insert, which has the atomicity property (e.g. DB2 attach partition).
RDBMS transaction mode: Commit only once all the records are loaded; otherwise roll back.
Processed flag: For records that have been processed and loaded into the target, a Processed_flg field is set to 1 in staging, and the stage extraction picks up only unprocessed records (Processed_flg = 0).
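A sketch of the processed-flag option (again in SQLite; the `stg_items`/`tgt_items`/`Processed_flg` names are illustrative). Each row is moved and flagged in a single transaction, so a restarted run sees only the rows that never made it to the target:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stg_items (id INTEGER, val TEXT, Processed_flg INTEGER DEFAULT 0);
    CREATE TABLE tgt_items (id INTEGER, val TEXT);
    INSERT INTO stg_items (id, val) VALUES (1, 'a'), (2, 'b'), (3, 'c');
""")

def load_pending(conn):
    """Load only rows with Processed_flg = 0, flagging each row in the
    same transaction that writes it to the target."""
    pending = conn.execute(
        "SELECT id, val FROM stg_items WHERE Processed_flg = 0").fetchall()
    for id_, val in pending:
        with conn:  # insert + flag update commit (or roll back) together
            conn.execute("INSERT INTO tgt_items VALUES (?, ?)", (id_, val))
            conn.execute(
                "UPDATE stg_items SET Processed_flg = 1 WHERE id = ?", (id_,))
    return len(pending)

print(load_pending(conn))  # 3 rows moved
print(load_pending(conn))  # 0 - a restart finds nothing left to do
```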
Audit table: For insert-only targets, an audit table can hold the maximum auto-generated surrogate key or the maximum source timestamp, and source extraction can be restricted to records greater than the last processed surrogate key/timestamp.
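Putting two of the options above together, a restartable load can combine a pre-SQL delete of the current batch with a single transaction. The sketch below uses SQLite; `tgt_sales` and `batch_id` are illustrative names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tgt_sales (id INTEGER, amount REAL, batch_id INTEGER)")

def load_batch(conn, batch_id, rows):
    """Rerunning the same batch is safe: the pre-SQL delete removes any
    partial load, and the transaction makes delete + insert atomic."""
    with conn:  # commit only if every statement succeeds, else roll back
        # Pre-SQL: clear whatever this batch may have loaded before aborting.
        conn.execute("DELETE FROM tgt_sales WHERE batch_id = ?", (batch_id,))
        conn.executemany("INSERT INTO tgt_sales VALUES (?, ?, ?)",
                         [(i, amt, batch_id) for i, amt in rows])

load_batch(conn, 42, [(1, 9.99), (2, 19.99)])
load_batch(conn, 42, [(1, 9.99), (2, 19.99)])  # restart: no duplicates
print(conn.execute("SELECT COUNT(*) FROM tgt_sales").fetchone()[0])  # 2
```

Tagging every loaded row with its batch id is what makes the pre-SQL delete precise: it never touches rows committed by earlier batches.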
When not to apply restartability - It all depends on the requirement; the business may well require that, in case of any failure, a process starts again from the beginning instead of from the last checkpoint.