Saturday, 27 August 2016

ETL Recoverability


ETL design should be such that if required, It can be set back to any stable point from current state. For example, when ETL run is completed with some data issues from source files/tables, it’s required to seamlessly put the ETL system back in original state before this run has happened. If the system satisfies above said condition it’s said to be recoverable.

In order to have this recoverability, we need to have audit columns in all our tables so that ETL system can achieve recoverability at any granular level.

· Batch id
· Batch date
· Batch sequence Id
· Source system id
· Process id
· Populated time/Updation time

Let us take a scenario of an ETL where 10 Source systems data has loaded the data into target tables. If there is an issue with a particular source because of which ETL needs to be reset to its previous state and rerun with corrected data. source systems id available in each target table will help deleting of data from all target tables can be done with the help of source system id and batch id. Then rerun can be taken.

To take it even further, ETL recovery can be automated by developing a set of UNIX shell scripts/database procedures to perform this activity in efficient manner.