Quite often I see that in many projects that I worked with, there are some disparate data pipelines doing the same tasks but with different operators, different methodology in achieving the same objective. This could be due to different people joining the project not being aware of similar work done before or people trying to come with their own approach. It’s not an ideal way of working and where possible it would be good to have a framework set and have the team implement it with end to end documentation to guide them.
In course of my work, I came across a solution wherein they had implemented a robust solution to achieve this albeit with extra layering i.e. nesting at two levels down which I will explain further. At first, it was very hard for me to absorb what was going on. Now, with some time in my hands and being able to relook at it from fresh perspective I am able to decode it better and appreciate the work done. This is my attempt to make it easier for people to read and implement it for their own use case.
Let’s say we need to perform data loads for a data warehousing project in GCP. The typical flow for a table would be as shown below –Read More »