Toughest Challenges – 2

Time flies. Looking back at the first one I wrote, almost 5 years back, I feel it served its purpose. Just looking at my resume and what I had done in the past in terms of roles and responsibilities doesn't give the complete picture. It's always best to have your accomplishments and the things you achieved jotted down in the time-tested STAR fashion.

The circumstances of this post are definitely not dire, but given there have been two shifts since my last post, I realized it's best I put down a brain dump. Without further ado, here it goes.

Woolworths Group

It was complete happenstance that I got the job at Woolies. In this day and age, I am not sure my profile would have been picked up; I will write a completely new post on that. This was my first taste of the world of cloud platforms. Up until then, my profile had predominantly featured Alteryx, SQL, Python and databases like Amazon Redshift and Postgres.

My job at Woolies was mostly in the world of Airflow and SQL, with a little sprinkling of Cloud Functions and Cloud Build thrown in. Some projects were challenging, some were gruelling to work on (there was a bunch of SQL scripts, each containing 1000 lines of code, all getting executed via DAGs), and some were outright contentious (where you begin to wonder if you ever had a voice). With all that said, everything is an experience. Here it goes..

  • Config File Generation
    • Challenge – Tableau dashboards were refreshed via Tableau Data Sources (TDS) that were generated via Airflow DAGs. The file generation process involved maintaining a config file for each data source to be published, in the form of a nested JSON file: essentially a bunch of metadata attributes with the dashboard name and dashboard link, plus a subsection holding the column names and data types.

      Any change to the schema, say adding or removing a column or changing a data type, had to be applied to the JSON file manually. This process was error prone, and as the number of tables to maintain grew, so did the risk of the manually edited JSON drifting out of sync with the table-level changes.

      I once made the mistake of applying column changes to the wrong file, which led to a late-night debugging call with at least 8 people online weeding out the root cause. It felt bad, and it was then that I decided this needed to change.
    • Action Taken – I set up a Google Sheet containing just the header attributes and the name of each table whose schema details were to be generated, and created an external table on top of it. I then built an Airflow DAG that iterates through each entry of the Google Sheet, generates the required JSON using the INFORMATION_SCHEMA.TABLES data and the fixed attributes, and dumps the file to the desired destination (a rough sketch of this pattern follows the list).
    • Result – The process became robust, fully automated and scalable. Whatever change comes through (removing a column, adding a new one, etc.) is now handled automatically.
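I haven't included the actual DAG code here, but a minimal sketch of the pattern would look something like the below. The project, dataset, bucket and column names are made-up placeholders, and I read the column metadata from INFORMATION_SCHEMA.COLUMNS in this sketch; the real pipeline's names and queries were different.

```python
# Minimal sketch only. Project, table, bucket and sheet column names are
# hypothetical placeholders, not the ones used in the original pipeline.
import json
from datetime import datetime

from airflow.decorators import dag, task
from google.cloud import bigquery, storage

PROJECT = "my-project"              # assumption
CONFIG_TABLE = "meta.tds_config"    # external table over the Google Sheet (assumption)
DEST_BUCKET = "tds-config-files"    # assumption


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def generate_tds_configs():

    @task
    def build_configs():
        bq = bigquery.Client(project=PROJECT)
        gcs = storage.Client(project=PROJECT)

        # Each sheet row carries the fixed metadata plus the table to introspect.
        rows = bq.query(f"SELECT * FROM `{CONFIG_TABLE}`").result()
        for row in rows:
            dataset, table = row["dataset_name"], row["table_name"]

            # Pull current column names and data types from INFORMATION_SCHEMA.
            cols = bq.query(
                f"""
                SELECT column_name, data_type
                FROM `{PROJECT}.{dataset}.INFORMATION_SCHEMA.COLUMNS`
                WHERE table_name = @table
                """,
                job_config=bigquery.QueryJobConfig(
                    query_parameters=[
                        bigquery.ScalarQueryParameter("table", "STRING", table)
                    ]
                ),
            ).result()

            # Assemble the nested JSON config: fixed metadata plus column subsection.
            config = {
                "dashboard_name": row["dashboard_name"],
                "dashboard_link": row["dashboard_link"],
                "columns": [
                    {"name": c.column_name, "type": c.data_type} for c in cols
                ],
            }

            # Drop the generated JSON in the destination bucket.
            blob = gcs.bucket(DEST_BUCKET).blob(f"{table}.json")
            blob.upload_from_string(json.dumps(config, indent=2))

    build_configs()


generate_tds_configs()
```

The nice part of this setup is that registering a new dashboard is just a new row in the sheet, and schema changes need no manual edits either, since the columns are read from INFORMATION_SCHEMA at run time.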

Automating Data Pipelines in Airflow – Dynamic DAGs

Quite often, across the projects I have worked on, I see disparate data pipelines doing the same tasks but with different operators and different methodologies for achieving the same objective. This can happen because people joining the project aren't aware of similar work done before, or because they try to come up with their own approach. It's not an ideal way of working, and where possible it's good to have a framework set up and have the team implement it, with end-to-end documentation to guide them.

In the course of my work, I came across a robust solution to this, albeit with an extra layer of indirection, i.e. nesting two levels down, which I will explain further. At first, it was very hard for me to absorb what was going on. Now, with some time on my hands and the chance to look at it again from a fresh perspective, I can decode it better and appreciate the work done. This is my attempt to make it easier for people to read and implement for their own use case.
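Before getting into the layered version, it helps to see the basic dynamic DAG pattern on its own. The sketch below is a generic illustration, not the framework from that project: the config list, table names and load callable are all placeholders. One Python file reads a list of configs and registers one DAG per entry.

```python
# Generic dynamic-DAG sketch; table names and the load callable are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# In a real setup this would come from a YAML/JSON config file or a database.
TABLE_CONFIGS = [
    {"table": "sales", "schedule": "@daily"},
    {"table": "inventory", "schedule": "@hourly"},
]


def load_table(table_name: str) -> None:
    """Placeholder for the actual load logic (e.g. GCS -> BigQuery)."""
    print(f"loading {table_name}")


def build_dag(cfg: dict) -> DAG:
    """Build one DAG from a single config entry."""
    with DAG(
        dag_id=f"load_{cfg['table']}",
        schedule=cfg["schedule"],
        start_date=datetime(2024, 1, 1),
        catchup=False,
    ) as dag:
        PythonOperator(
            task_id="load",
            python_callable=load_table,
            op_kwargs={"table_name": cfg["table"]},
        )
    return dag


# Register one DAG per config entry at module level, which is how the
# Airflow scheduler discovers them.
for cfg in TABLE_CONFIGS:
    globals()[f"load_{cfg['table']}"] = build_dag(cfg)
```

The key trick is the `globals()` assignment at the end: the scheduler only picks up DAG objects it finds at module level, so each generated DAG has to be bound to a unique module-level name.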

Business Problem

Let’s say we need to perform data loads for a data warehousing project in GCP. The typical flow for a table would be as shown below –
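The flow diagram that goes here isn't part of this excerpt, so purely as a placeholder, here is one common shape a per-table load takes on GCP: a file landed in GCS is staged into BigQuery and then merged into the target table. The bucket, dataset and table names are invented, and the actual flow described in the full post may well differ.

```python
# Placeholder sketch of a common per-table load flow on GCP (an assumption,
# not necessarily the flow the original post describes). Names are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
    GCSToBigQueryOperator,
)
from airflow.providers.google.cloud.operators.bigquery import (
    BigQueryInsertJobOperator,
)

with DAG(
    dag_id="load_sales_daily",
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    # 1. Load the landed file from GCS into a staging table.
    stage = GCSToBigQueryOperator(
        task_id="gcs_to_staging",
        bucket="landing-bucket",
        source_objects=["sales/{{ ds }}/*.csv"],
        destination_project_dataset_table="my-project.staging.sales",
        write_disposition="WRITE_TRUNCATE",
        skip_leading_rows=1,
    )

    # 2. Merge the staging data into the target warehouse table.
    merge = BigQueryInsertJobOperator(
        task_id="merge_into_target",
        configuration={
            "query": {
                "query": """
                    MERGE `my-project.warehouse.sales` t
                    USING `my-project.staging.sales` s
                    ON t.sale_id = s.sale_id
                    WHEN MATCHED THEN UPDATE SET t.amount = s.amount
                    WHEN NOT MATCHED THEN INSERT ROW
                """,
                "useLegacySql": False,
            }
        },
    )

    stage >> merge
```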
