Automate BigQuery Scripts with Cloud Functions and Secret Manager

Having worked with Google Cloud Composer for quite a long time, when someone says "pipeline" my brain automatically goes to one and only one thing – DAGs. With such a wide variety of pluggable options available, it feels like manna. It is similar to the tools of yore such as Informatica and SSIS, but with less GUI and more extensibility. It is only in my current job that I am discovering how much more Google has to offer out of the box, often at a much lower cost, and I am loving the autonomy I am given to come up with solutions.

That’s not to say it’s the wild west out here, with everyone left to do whatever they want. There are some established patterns, listed below, which I have worked on –

  • Airflow DAGs running dbt models – Very costly. Essentially, all these jobs do is run a KubernetesPodOperator, which in turn runs dbt commands. Effective if there are a lot of ad-hoc jobs being run across the project. I am planning to write a complete post about it, especially about dbt (which I am REALLY excited about); a minimal sketch of the pattern follows this list.
  • Databricks – Mostly for ML models, but there is quite a good number of legacy jobs running ETL workflows. Again, a pricey option for simple workflows. I loved working on it and seeing the flexibility it provides and the ease of debugging.
  • Airflow DAGs – Regular Airflow jobs with all the elements of ETL (very few of these), built with plug-and-play operators.
  • Cloud Functions – Using a Cloud Functions trigger pattern to deploy BigQuery objects (the main meat of this post).
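
To make the first pattern concrete, here is a minimal sketch of an Airflow DAG whose only task launches a Kubernetes pod that runs dbt. The DAG id, image, namespace and dbt selection are placeholders rather than anything from the projects above, and the import path and `schedule` argument assume a recent Airflow 2.x with the cncf.kubernetes provider installed.

```python
# Minimal sketch: a DAG that only launches a Kubernetes pod running dbt.
# All names below are placeholders; the import path varies by provider version.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG(
    dag_id="dbt_daily_run",              # hypothetical DAG name
    schedule="0 6 * * *",                # Airflow 2.4+ argument name
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    run_dbt_models = KubernetesPodOperator(
        task_id="run_dbt_models",
        name="dbt-run",
        namespace="data-pipelines",      # placeholder namespace
        image="europe-docker.pkg.dev/my-project/dbt/dbt-bigquery:latest",  # placeholder image
        cmds=["dbt"],
        arguments=["run", "--target", "prod", "--select", "tag:daily"],
        get_logs=True,
    )
```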

In addition to the above, there are some more options such as Cloud Run with Cloud Scheduler / Workflows and Vertex AI using dbt. I haven’t really worked with them yet, but I will be discovering and learning about them.

With Cloud Functions, as Google states, you just write your code and let Google handle all the operational infrastructure. With the 2nd gen release, there are a great many ways to trigger Cloud Functions and integrate them with a multitude of services across the Google Cloud Platform suite.
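
The full walkthrough comes later in the post; as a flavour of the pattern, a minimal sketch of an HTTP-triggered function that pulls a service-account key from Secret Manager and then runs a BigQuery script might look like the following. The project id, secret id, dataset and table names are all placeholders, not the ones used in the actual solution.

```python
# Sketch of an HTTP-triggered Cloud Function: fetch credentials from
# Secret Manager, then run a BigQuery DDL script. All ids are placeholders.
import json

import functions_framework
from google.cloud import bigquery, secretmanager
from google.oauth2 import service_account


def _get_secret(project_id: str, secret_id: str, version: str = "latest") -> str:
    """Return the payload of a secret version from Secret Manager."""
    client = secretmanager.SecretManagerServiceClient()
    name = f"projects/{project_id}/secrets/{secret_id}/versions/{version}"
    response = client.access_secret_version(request={"name": name})
    return response.payload.data.decode("utf-8")


@functions_framework.http
def deploy_bq_objects(request):
    """Entry point: deploy a (hypothetical) BigQuery object using stored credentials."""
    project_id = "my-gcp-project"                                   # placeholder project
    sa_info = json.loads(_get_secret(project_id, "bq-deployer-sa-key"))  # placeholder secret id
    credentials = service_account.Credentials.from_service_account_info(sa_info)

    bq_client = bigquery.Client(project=project_id, credentials=credentials)
    query = """
        CREATE TABLE IF NOT EXISTS analytics.customer_stg (
            customer_id INT64,
            loaded_at   TIMESTAMP
        )
    """
    bq_client.query(query).result()      # wait for the script to finish
    return "BigQuery objects deployed", 200
```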

Read More »

Automating Data Pipelines in Airflow – Dynamic DAGs

Quite often, in projects I have worked on, I see disparate data pipelines doing the same tasks but with different operators and different methodologies for achieving the same objective. This can happen because people joining the project are not aware of similar work done before, or because they want to come up with their own approach. It’s not an ideal way of working; where possible, it is better to set up a framework and have the team implement it, with end-to-end documentation to guide them.

In the course of my work, I came across a project that had implemented a robust solution to achieve this, albeit with extra layering, i.e. nesting two levels down, which I will explain further on. At first, it was very hard for me to absorb what was going on. Now, with some time on my hands and the chance to look at it again from a fresh perspective, I am able to decode it better and appreciate the work done. This is my attempt to make it easier for people to read and implement it for their own use case.
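
Before getting into the business problem, it helps to recall what the basic, single-level dynamic DAG pattern looks like: one config drives many near-identical DAGs generated in a loop and registered via `globals()`. The sketch below is only that generic idea with hypothetical table names and a stand-in load function, not the two-level nested framework described later in the post; it assumes Airflow 2.4+ where `schedule` replaces `schedule_interval`.

```python
# Generic dynamic-DAG sketch: one config, many generated DAGs.
# Table names and the load callable are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical per-table config; in practice this could come from YAML/JSON.
TABLE_CONFIG = {
    "customers": {"schedule": "0 5 * * *"},
    "orders": {"schedule": "30 5 * * *"},
}


def load_table(table_name: str, **_):
    """Stand-in for the real load logic."""
    print(f"Loading {table_name} ...")


for table, cfg in TABLE_CONFIG.items():
    dag_id = f"load_{table}"
    with DAG(
        dag_id=dag_id,
        schedule=cfg["schedule"],
        start_date=datetime(2024, 1, 1),
        catchup=False,
    ) as dag:
        PythonOperator(
            task_id=f"load_{table}",
            python_callable=load_table,
            op_kwargs={"table_name": table},
        )
    # Register the generated DAG at module level so the scheduler picks it up.
    globals()[dag_id] = dag
```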

Business Problem

Let’s say we need to perform data loads for a data warehousing project in GCP. The typical flow for a table would be as shown below –

Read More »