November 2022 – Karthik's BI Musings

Quite often I see that in many projects that I worked with, there are some disparate data pipelines doing the same tasks but with different operators, different methodology in achieving the same objective. This could be due to different people joining the project not being aware of similar work done before or people trying to come with their own approach. It’s not an ideal way of working and where possible it would be good to have a framework set and have the team implement it with end to end documentation to guide them.

In course of my work, I came across a solution wherein they had implemented a robust solution to achieve this albeit with extra layering i.e. nesting at two levels down which I will explain further. At first, it was very hard for me to absorb what was going on. Now, with some time in my hands and being able to relook at it from fresh perspective I am able to decode it better and appreciate the work done. This is my attempt to make it easier for people to read and implement it for their own use case.

Business Problem

Let’s say we need to perform data loads for a data warehousing project in GCP. The typical flow for a table would be as shown below –

I had an interesting use case as part of my work wherein we needed to generate Nested JSON out of table definitions (i.e. from the INFORMATION_SCHEMA.COLUMNS) that was then used by other system for further processing. Any changes being done on the schema, say adding or removal of column or data type change, we were doing this manually to the JSON file. This process was error prone. As the number of tables to maintain increased, the vulnerability of manually modifying the JSON as and when changes being done at the table level happened.

Looking for simpler solution when I looked around google BigQuery had some ready made operators up to the task namely –

ARRAY_AGG – This returns an ARRAY of expression values specified and can also work with aggregation.
STRUCT – Constructs a container of ordered fields i.e. like a list in python. Returns an ARRAY object.

Let me illustrate the use case with an example. Say we have following dataset (data taken from here)-

Karthik's BI Musings

Month: November 2022

Automating Data Pipelines in Airflow – Dynamic DAG’s

Generate Nested JSON using BigQuery