Building End-to-End Workflow Management System (WMS) Pipelines Using Apache Airflow: Compute, Automate & Orchestrate
#Integration, #Automation, #Machine Learning
Date & Time
Wednesday, November 11, 2020, 9:30 AM - 10:00 AM
Anubhav Kohli

Data warehouses are growing rapidly, driving a shift from on-premises systems to cloud-based data warehouses and unstructured data lakes. Generating insights from this data requires running complex algorithms in parallel. With this ever-increasing volume of data, it is a challenge for a data manager to maintain a robust workflow management system using existing methods. Apache Airflow is an open-source computational orchestrator that allows users to author, schedule and monitor workflows. A workflow is represented as a Directed Acyclic Graph (DAG), which is a collection of tasks. These tasks are the compute steps, which can run in sequence (task dependency) or in parallel (branching). Operators, which can also wrap code written in other programming languages, allow flexible development of DAGs. The web server enables easy management and visualization of the code. With Docker, Airflow can be set up as a cluster using Kubernetes. Using Google Cloud Composer on Google Cloud Platform, we have set up multiple end-to-end Airflow pipelines. The pipelines connect to third-party data warehouses and extract all the required data. The extracted data is stored in the cloud, making use of its unlimited storage capability, and is then consolidated and transformed so that it is ready for consumption. It is then ingested into Cloud Storage through different RESTful APIs. Internal retry mechanisms have been set up for known exceptions, along with automatic load balancing to make use of the computational capacity of the cloud environment. Alert mechanisms notify the data managers when a job completes and report any unknown errors. Scheduled to run every fortnight, the DAGs perform ETL operations to ingest nearly 50 million records across different data types, including Field, Well, Wellbore, Completions, Perforations, Documents and Log files. The orchestration has resulted in a 64% reduction in overall man-days of effort, and up to 95% in some cases, and the resulting pipelines are highly reusable.
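
The ideas in the abstract (tasks ordered by their dependencies, retries for known exceptions, and alerts on completion or on unknown errors) can be illustrated with a minimal plain-Python sketch. This is not Airflow's API and not the authors' implementation; Airflow provides its own scheduler and retry and alerting mechanisms. The names run_pipeline, KnownError and alert are hypothetical, and the sketch assumes Python 3.9 or later for the standard-library graphlib module.

```python
from graphlib import TopologicalSorter


class KnownError(Exception):
    """Hypothetical stand-in for an exception the pipeline knows how to retry."""


def run_pipeline(tasks, dependencies, max_retries=2, alert=print):
    """Run callables in dependency order, retrying known errors only.

    tasks: maps a task name to a zero-argument callable.
    dependencies: maps a task name to the set of tasks it depends on.
    Returns the task names in the order they completed.
    """
    completed = []
    # static_order() yields each task only after all of its upstream tasks.
    for name in TopologicalSorter(dependencies).static_order():
        for attempt in range(max_retries + 1):
            try:
                tasks[name]()
                completed.append(name)
                break
            except KnownError:
                # Known exceptions are retried until the budget runs out.
                if attempt == max_retries:
                    alert(f"{name}: retries exhausted")
                    raise
            except Exception as exc:
                # Unknown errors are reported immediately, without a retry.
                alert(f"{name}: unknown error: {exc!r}")
                raise
    alert("pipeline completed")
    return completed


# Hypothetical three-step ETL chain; the first extract attempt fails once.
calls = {"extract": 0}


def extract():
    calls["extract"] += 1
    if calls["extract"] == 1:
        raise KnownError("transient failure")


messages = []
order = run_pipeline(
    tasks={"extract": extract, "transform": lambda: None, "ingest": lambda: None},
    dependencies={"transform": {"extract"}, "ingest": {"transform"}},
    alert=messages.append,
)
```

Here the failed extract attempt is retried silently, the three steps run in dependency order, and the only alert raised is the completion notice. In an Airflow deployment the same roles would be filled by the DAG definition, per-task retry settings and notification hooks rather than by hand-written loops.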