Toughest Challenges – 2

March 19, 2025 Karthik V

Time flies. Looking back at the first one I wrote, almost 5 years ago, I feel it served its purpose. My resume, with its list of past roles and responsibilities, doesn’t give the complete picture. It’s always best to have your accomplishments and achievements jotted down in the time-tested STAR fashion.

The circumstances of this post definitely aren’t dire, but given there have been two job shifts since my last post, I realized it’s best to put down a brain dump. Without further ado, here it goes.

Woolworths Group

I got the job at Woolies by complete happenstance. In this day and age, I am not sure my profile would have been picked up. I will be writing a complete new post on it. This was my first taste of the world of cloud platforms. Up until then, my profile predominantly featured Alteryx, SQL, Python and databases like Amazon Redshift and PostgreSQL.

My job at Woolies was mostly in the world of Airflow and SQL, with a little sprinkling of Cloud Functions and Cloud Build thrown in. Some projects were challenging, some were gruelling to work on (there were a bunch of SQL scripts, each containing 1000 lines of code, all getting executed via DAGs), and some were outright contentious (where you begin to wonder if you ever had a voice). With all that said, everything is an experience. Here it goes.

Config File Generation
- Challenge – Tableau dashboards were refreshed via Tableau Data Source (.tds) files, which were generated by Airflow DAGs. Generating a file required a config file for the data source to be published, in the form of a nested JSON file: a set of metadata attributes such as dashboard name and dashboard link, plus a subsection listing the column names and data types.
  
  Whenever the schema changed, say a column was added or removed or a data type changed, we updated the JSON file manually. This process was error prone, and as the number of tables to maintain increased, so did the risk of a mistake each time a table-level change had to be mirrored by hand.
  
  I made that mistake once: I updated the wrong file with the column changes, which led to a late-night debugging call with at least 8 people online to weed out the root cause. It felt bad, and it was then I decided this needed to change.
- Action Taken – I set up a Google Sheet that contained just the header attributes and the name of the table whose schema details were to be generated, and created an external table on top of it. I then created an Airflow DAG that iterates through each entry of the Google Sheet, generates the required JSON from the INFORMATION_SCHEMA.TABLES data and the fixed attributes, and dumps the file in the desired destination.
- Result – The process became robust, fully automated and scalable. Whatever the change (removing a column, adding a new one, etc.), it is handled automatically.
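The generation step described above can be sketched roughly as follows. This is a minimal illustration, not the actual DAG: the attribute names (`dashboard_name`, `dashboard_link`, `table_name`), the sample data and the `schema_lookup` dictionary are all hypothetical stand-ins for the sheet entries and the schema metadata that would be queried from BigQuery.

```python
import json
import tempfile
from pathlib import Path


def build_config(entry, columns):
    """Merge the fixed header attributes with column metadata into one nested dict."""
    return {
        "dashboard_name": entry["dashboard_name"],
        "dashboard_link": entry["dashboard_link"],
        "table_name": entry["table_name"],
        "columns": [{"name": name, "type": dtype} for name, dtype in columns],
    }


def write_configs(sheet_entries, schema_lookup, out_dir):
    """Write one JSON config per sheet entry and return the paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for entry in sheet_entries:
        columns = schema_lookup[entry["table_name"]]
        path = out_dir / f"{entry['table_name']}.json"
        path.write_text(json.dumps(build_config(entry, columns), indent=2))
        written.append(path)
    return written


# Hypothetical rows, standing in for the Google Sheet entries.
sheet_entries = [
    {
        "dashboard_name": "Weekly Sales",
        "dashboard_link": "https://example.com/weekly-sales",
        "table_name": "weekly_sales",
    }
]

# Hypothetical schema metadata, standing in for the result of a schema query.
schema_lookup = {
    "weekly_sales": [("store_id", "INT64"), ("sales_amount", "NUMERIC")],
}

output_dir = tempfile.mkdtemp()
paths = write_configs(sheet_entries, schema_lookup, output_dir)
print(paths[0].read_text())
```

Because the column list is read at generation time rather than typed by hand, a schema change only requires re-running the job, not editing any JSON.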

BigQuery TableReference Initialization Error: Too Many Arguments provided

August 26, 2023 Karthik V

I tend to agree now with my friend’s recent observation that the Google Cloud documentation is a wreck and doesn’t provide proper examples for some of the basic things in its SDK.

I needed to use the TableReference class. When you search for it, you land on the page Class TableReference, where it’s not clear how to use it.

So here is the example code I was using:

from google.cloud import bigquery

# Create a client object.
client = bigquery.Client()

# Get the project ID.
project_id = "my-project-id"

# Get the dataset ID.
dataset_id = "my-dataset"

# Get the table ID.
table_id = "my-table"

# Create a table reference (this call raises the TypeError shown below).
table_ref = bigquery.TableReference(project_id, dataset_id, table_id)

print(table_ref)

This is the error I was getting

TypeError: TableReference.__init__() takes 3 positional arguments but 4 were given

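The fix is pretty simple. The constructor accepts only a dataset reference and a table ID, so build the reference from a fully qualified table ID string instead.

# Create a table reference using from_string (note the f-string prefix)
table_ref = bigquery.TableReference.from_string(f"{project_id}.{dataset_id}.{table_id}")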

Automate BigQuery Scripts with Cloud Functions and Secret Manager

June 1, 2023 Karthik V

Having worked with Google Cloud Composer for quite long, when someone says pipeline my brain automatically goes to one and only one thing – DAG. With such a wide variety of pluggable options available, it feels like manna. It is similar to the tools of yore such as Informatica and SSIS, but with less GUI and more extensibility. Only in my current job am I discovering how much more Google has to offer out of the box, often as much cheaper options, and I am loving the autonomy I am given to come up with solutions.

That’s not to say it’s the wild wild west out here, with everyone left to do what they want. There are some established patterns that I have worked on, listed below –

- Airflow DAGs running dbt models – Very costly. Essentially, all these jobs do is run a Kubernetes Pod operator, which in turn runs dbt commands. Effective if there are a lot of ad-hoc jobs being run across the project. I am planning to write a complete post about it, especially about dbt (which I am REALLY excited about).
- Databricks – Mostly for ML models, but there are a good number of legacy jobs doing ETL workflows. Again, a very pricey option for simple workflows. I loved working on it and seeing the flexibility it provides and the ease of debugging.
- Airflow DAGs – Regular Airflow jobs with all elements of ETL (very few jobs), built with plug-and-play operators.
- Cloud Functions – Using the Cloud Functions trigger pattern to deploy BQ objects (the main meat of this post).

In addition to the above there are some more, such as Cloud Run with Cloud Scheduler / Workflows and Vertex AI using dbt. I haven’t really worked on them yet but will be discovering and learning about them.

Cloud Functions, as Google states, lets you just write your code while Google handles all the operational infrastructure. With the release of version 2, there are many ways to trigger and orchestrate Cloud Functions and to integrate them with services across the entire Google Cloud Platform suite.

Efficient Schema Extraction in Databricks with Pyspark: A Step-by-Step Guide

April 4, 2023 (updated June 1, 2023) Karthik V

Extracting schema information from Databricks would seem to be very simple, wouldn’t it? I mean, there is INFORMATION_SCHEMA to use. Unfortunately, that is only available with Unity Catalog, i.e. a Databricks metastore that can share data across multiple Azure Databricks workspaces, and no such catalog exists in our project. The best recourse is to use the PySpark SQL libraries.

Problem

In one of my recent tasks, I had to update a large SQL query comprising 10+ Databricks tables, adding some attributes and replacing some existing ones. The only problem was that most of the attributes had no table aliases, so there was no way to know which attribute came from which table. It’s one of the most irritating and annoying things that bothers me to no end and feel like…
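One way to approach the problem described above, sketched under assumptions: collect the column list of every table in the query, then build a reverse index from column name to the tables that contain it. The table names and columns below are hypothetical; in a Databricks notebook the lists could be gathered with something like `spark.table(name).columns` rather than being hard-coded.

```python
from collections import defaultdict


def build_column_index(table_columns):
    """Map each lower-cased column name to the list of tables that contain it."""
    index = defaultdict(list)
    for table, columns in table_columns.items():
        for column in columns:
            index[column.lower()].append(table)
    return dict(index)


def locate(attributes, index):
    """Return the candidate source tables for each unaliased attribute."""
    return {attr: index.get(attr.lower(), []) for attr in attributes}


# Hypothetical column lists, standing in for what the metastore would report.
table_columns = {
    "sales.orders": ["order_id", "customer_id", "order_date"],
    "sales.customers": ["customer_id", "customer_name"],
}

index = build_column_index(table_columns)
result = locate(["order_date", "customer_id", "discount"], index)

for attr, tables in result.items():
    print(attr, "->", tables or "not found")
```

Attributes that map to exactly one table can be aliased straight away; those that map to several still need a manual check against the join logic.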

Solution

Automating Data Pipelines in Airflow – Dynamic DAG’s

November 25, 2022 Karthik V

Quite often, in the projects I have worked on, I see disparate data pipelines doing the same tasks but with different operators and different methodologies for achieving the same objective. This could be due to different people joining the project without being aware of similar work done before, or people trying to come up with their own approach. It’s not an ideal way of working; where possible, it is better to set up a framework and have the team implement it, with end-to-end documentation to guide them.

In the course of my work, I came across a robust solution that achieves this, albeit with extra layering, i.e. nesting two levels down, which I will explain further. At first, it was very hard for me to absorb what was going on. Now, with some time on my hands and a fresh perspective, I am able to decode it better and appreciate the work done. This is my attempt to make it easier for people to read and implement it for their own use case.

Business Problem

Let’s say we need to perform data loads for a data warehousing project in GCP. The typical flow for a table would be as shown below –

Generate Nested JSON using BigQuery

November 19, 2022 (updated November 25, 2022) Karthik V

I had an interesting use case at work wherein we needed to generate nested JSON out of table definitions (i.e. from INFORMATION_SCHEMA.COLUMNS), which was then used by another system for further processing. Whenever the schema changed, say a column was added or removed or a data type changed, we updated the JSON file manually. This process was error prone, and as the number of tables to maintain increased, so did the risk of a mistake each time a table-level change had to be mirrored by hand.

Looking around for a simpler solution, I found that Google BigQuery has some ready-made functions up to the task, namely –

- ARRAY_AGG – An aggregate function that returns an ARRAY of the values of the specified expression across the rows in each group.
- STRUCT – Constructs a container of ordered, optionally named fields, i.e. like a named tuple in Python. Returns a STRUCT value.

Let me illustrate the use case with an example. Say we have the following dataset (data taken from here) –
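To make the behaviour of the two functions concrete, here is a small Python emulation, offered as a rough analogy rather than BigQuery code. Each output dictionary plays the role of a STRUCT, and collecting them per table mirrors what `ARRAY_AGG(STRUCT(column_name, data_type))` with `GROUP BY table_name` would produce. The sample rows are hypothetical.

```python
import json
from itertools import groupby
from operator import itemgetter

# Hypothetical rows, shaped like a result from INFORMATION_SCHEMA.COLUMNS.
rows = [
    {"table_name": "heroes", "column_name": "name", "data_type": "STRING"},
    {"table_name": "heroes", "column_name": "height", "data_type": "FLOAT64"},
    {"table_name": "publishers", "column_name": "publisher", "data_type": "STRING"},
]


def nest_columns(rows):
    """Group flat column rows into one nested record per table."""
    # groupby needs its input sorted by the grouping key; sorted() is stable,
    # so the original column order within each table is preserved.
    ordered = sorted(rows, key=itemgetter("table_name"))
    return [
        {
            "table_name": table,
            # Each dict is the STRUCT; the list is the ARRAY_AGG result.
            "columns": [
                {"column_name": r["column_name"], "data_type": r["data_type"]}
                for r in group
            ],
        }
        for table, group in groupby(ordered, key=itemgetter("table_name"))
    ]


nested = nest_columns(rows)
print(json.dumps(nested, indent=2))
```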

*Marvel and DC Superhero character info*

Dynamically read Zip file contents using Alteryx

September 7, 2020 (updated April 9, 2021) Karthik V

I had an interesting business problem to solve and wanted to share how it can be achieved.

Business Problem

On a daily basis, a zip file containing various flat files is dropped at a file location. The contents of the flat files are to be read and extracted. All the files have the same metadata.

File names within the zip file are dynamic.

Solution

Alteryx provides an out-of-the-box Input tool for working with zip files. All one needs to do is drag and drop the zip file on to the canvas, and the tool itself will pop up asking which files are to be extracted, as shown below.

Pop-up image when dragging a Zip file on to the canvas.

Toughest Challenges

August 21, 2020 (updated April 9, 2021) Karthik V

In the quest for the next big break or opportunity out there, I have come to expect these interview questions, and I have decided to blog about them.

Background –
After months of relentlessly applying for jobs with a mix and match of the skill sets that I am eligible for, and going through multiple ghost callings (I believe that’s what it is called when recruiters spend around 20 to 30 minutes inquiring everything about you, explaining the job needs, setting the pay expectations and raising false hopes, and then you never hear anything back even after sending out multiple mails or messages), I was finally fortunate to have a call set up with an actual person in an actual company.

Unfortunately, the interview didn’t pan out well. The call was first about getting to know the role and the team I would be working with, and then went straight to the actual interview. The first question I was asked was about my ‘Toughest challenge’ to date. It caught me a bit off-guard, as I was expecting some technical questions to start with before settling down on behavioural ones and finally ending with the expectations for the next meeting. I did my best to explain the things I have done, but I guess it just was not good enough.

Why really?

- This will act not only as a reference for myself but also as a reminder of the other ‘tough challenges’ that lie ahead and that I need to face in the future.
- Second, and the foremost reason: writing this down has been very cathartic.
- Not EVERYTHING can be put on a resume, so here it is then.

Without further ado, let me get to the Toughest Challenge question. Questions like this need to be addressed using the STAR method.

Mantras to live by

August 20, 2020 (updated April 9, 2021) Karthik V

I am inspired to write this blog after seeing a post in my company’s Workplace about wellbeing during COVID times and communicating effectively. As a working professional there are some ‘mantras’ that I live by, and here they go

General

- Ownership of a Production issue – If you are tasked with an issue that requires urgent attention in production, ensure that you own it from start to finish. This means assigning a proper timeline to start with and, upon delivery, ensuring the fix is tested in UAT, goes to Pre-Prod and is finally deployed to Prod. Keep your eyes and ears on it throughout this process through regular follow-ups, and verify it after it reaches production.
- Look out for opportunities – Quite often, apart from your regular work, you may come across tasks that are done by following a set sequence of steps in a methodical manner. Usually the people doing them spend a considerable amount of time outside their normal routine to accomplish this (sometimes a week or more!). That should be the first sign of an opportunity to seize and automate things.
- Commitment to task – Never commit to anything upfront. If anyone comes to you with a request to deliver something urgently, take some time to pause and analyse it first. Only after proper analysis give a timeline for when it can be done. When giving an estimate, consider the time for the actual build, Unit Testing, Regression Testing (if needed), Design document update, Peer Review (a must) and Rework.
- Meetings – Always have an agenda for a meeting and circulate it beforehand, as it provides context for participants. Be mindful of time, and do not book a meeting after hours just because one person is unavailable during working hours. If you feel a phone call would be easier, then by all means give a ring and get it done fast.
Technical Front –
- When solutioning a problem ask yourself the following questions –
  - How critical is this problem?
  - Are there any more areas where such problem exists? If so what can be done about it?
  - At whole component level, is there something I can do to make the solution better?
  - Don’t be afraid to loosen things up and go one step further if you feel you can deliver a more robust and stable solution. Pitch for it if you are confident that you can deliver within the time frame.
- Actual Fix –
  - Provide proper code comments in the code and even in the Fix Details so that anyone can understand what has gone into it.
  - Do thorough formatting of the code to make it more readable.
  - Look for extensibility and scalability of the solution (how does it impact Asset)
  - Ability for the solution to withstand large volumes (Query Plan analysis, statistics check etc.)

AWS – Developer Associate Certification learnings

July 29, 2020 Karthik V

It’s been almost 2 weeks since I passed the certification exam and I wanted to pen down the high level details of all the components that I have studied to pass the AWS Developer Associate exam.

First off, I would like to thank Stephane Maarek and his wonderful Udemy course – Ultimate AWS Certified Developer Associate, without which I am not sure I could have even inched past the preliminary set pieces.

Background
Like every Software Developer worth their salt, my fascination with cloud technologies began a few years back. With the help of Pluralsight courses, I started off my learning. The course, as usual, was of excellent calibre, but one teeny tiny detail that wasn’t mentioned was the need to monitor the bill. I was of the opinion that 750 hrs of free tier would last a lifetime.

I drifted off the course for a while and forgot to turn off the EC2 instances, and voila! One fine day, I saw a bill of AUD $160 in my mailbox. I immediately contacted the support centre and had the account suspended.

It really scared me off for a while, and I put off learning about it for quite a bit of time.

Motivation – I
On and off after that experience I just dabbled with S3 storage and static websites, trying to programmatically load some images using the Amazon SDK. As part of my Udacity Nanodegree experience, I worked on small ETL batch jobs using Python modules, first loading the data on to S3 and then on to Redshift as the final destination. The whole program left me with a bad taste, though, with one of the worst support systems and sub-par course quality, although I managed to create some portfolio projects.

As I started off my job search, I realised it’s hard to convince people that I am well acquainted with AWS technologies and know how to work with them. Though I don’t directly work on AWS in my current role, I am quite aware that the Cloudera offering we have is deployed across a multitude of EC2 clusters and that we are not using the out-of-the-box EMR provided by Amazon.

Additionally, I have quite often been asked if I at least have a certification.

Motivation – II
When I started searching for the certification offerings from AWS, I realised the one I really want to take is AWS Certified Data Analytics – Specialty, as I aim to become a Big Data Engineer/Developer. That certification explores the whole gamut of technologies that one can utilise as part of data analytics on Big Data –

- Collection (Kinesis, Database Migration Service (DMS))
- Storage (S3, DynamoDB)
- Processing (Glue, Lambda, Hive, Spark, Hue, HBase)
- Analysis (Redshift, Athena)
- Visualisation (QuickSight)
- Security (STS, KMS)

The ones highlighted are those I have worked with or am working with. AWS mandates that I need to have an Associate certificate before I can attempt a Specialty certificate. I chose ‘Certified Developer – Associate’ out of the three options. Asking around among friends and colleagues, I could see that the Udemy course was a strong first choice, followed by an ACloudGuru subscription. I took the former. It was an intense 4 weeks of preparation that ultimately bore results. So, without much further ado, here is a recap of the whole suite of products that I have learnt

#	Product Name	Description
1	IAM (Identity and Access Management)	Access management forms the heart and soul of the AWS ecosystem. It has a global view, and all permissions are governed by policies written in JSON. Governance is organised around three entities: Users, Groups and Roles.
2	EC2 – Elastic Compute Cloud	EC2 provides virtual servers in the cloud. AWS offers a whole gamut of choices along 5 distinct characteristics: RAM, CPU, I/O, Network, GPU. Additionally you can have different launch types: On-Demand Instances (short workloads); Reserved Instances (minimum 1 year); Spot Instances (short workloads, less reliable, you can be kicked off the instance); Dedicated Instances (exclusive access to the hardware, not shared with anyone); Dedicated Hosts (booking of an entire physical server, control over instance placement etc.).
3	ELB – Elastic Load Balancer	Load balancers are servers that forward internet traffic to multiple EC2 servers and essentially spread the load across downstream instances. Three types of load balancers are available: Classic Load Balancer, Application Load Balancer (v2) and Network Load Balancer (v2).
4	ASG – Auto Scaling Group	The purpose of an ASG is to scale out (add EC2 instances) to match increased load or scale in to match decreased load. Goes hand in hand with ELBs. The trigger for scaling can be CPU, network or even custom metrics. Various types of scaling can be done: step scaling, scheduled scaling etc.
5	EBS – Elastic Block Store; Instance Store; EFS – Elastic File System	EBS is a network drive you can attach to EC2 instances while they run; it retains data in case the instance crashes. Volumes are locked to an AZ. Depending on need, various types of storage are available (from large to small, high latency to low latency etc.). An EBS volume can be attached to only one EC2 instance. Instance Store, unlike EBS, is like a USB drive attached to EC2: available directly from the machine, but on the flip side you will lose all the data if the instance crashes. Elastic File System is highly scalable, expensive storage that is available across multiple AZs. EFS can be attached to multiple EC2 instances.
6	RDS – Relational Database Service; Aurora; ElastiCache	RDS is the managed database service from the AWS stable that provides automated provisioning, continuous backup, read replicas, auto-scaling (both vertically and horizontally), OS patching and so on. Aurora is AWS’s own database engine, akin to RDS on steroids, i.e. up to 5 times more performant. ElastiCache is similar to RDS, but for in-memory data stores. It gives the ability to cache requests and reduce the hits going to the DB. Remember, on the cloud every read/write counts in the cost. Two types: Redis (backup and restore features) and Memcached (non-persistent).
7	Route 53	A service akin to traffic police redirecting road traffic. Redirection can be done at domain level (CNAME) or to another Amazon resource (Alias). Various types of routing are available: Simple, Multi Value, Geolocation, Failover and Weighted.
8	VPC – Virtual Private Cloud	VPC isn’t extensively asked about for Developer Associate, so high-level knowledge should suffice. It’s a private network to deploy resources in, within which public and private subnets can be set up. NAT Gateways and Internet Gateways are used to communicate with the internet.
9	Amazon S3 – Simple Storage Service; Athena	S3 is one of the major building blocks of AWS: a virtually unlimited storage layer for a wide variety of data. Data is stored in buckets (directories). Versioning can be enabled. One of the most interesting things I found is the range of storage classes, from General Purpose to Glacier Deep Archive (min 180 days storage). Athena is a serverless service to perform analytics directly against S3 files.
10	CloudFront	Content Delivery Network to improve read performance, DDoS protection etc. Provides Global Edge Networks; great for static content that must be available everywhere
11	ECS – Elastic Container Service; Fargate	Container management service for Docker workloads. ECS clusters are logical groupings of EC2 instances. Fargate provides serverless management of container services, providing high scalability without manual intervention.
12	Elastic Beanstalk	Developer-centric view of deploying applications on AWS. Has three main components: Application, Application Version and Environment name (dev, test, prod etc.). Provides highly flexible deployment modes: All-At-Once; Rolling; Rolling with Additional Batches; Immutable. Its CLI capabilities can be used to manage it entirely via code.
13	AWS CICD	DevOps on AWS can be done using these components, which together provide CI/CD: CodeCommit, CodePipeline, CodeBuild, CodeDeploy.
14	CloudFormation	Infrastructure as Code. I absolutely LOVE this feature. It’s just mind-blowing in every sense. It’s a declarative way of outlining AWS infrastructure. Create a template of the infrastructure that you desire; it’s then just a matter of creating and removing infrastructure at the click of a button. I will be focusing more on this from now on to enrich my learning.
	Monitoring – CloudWatch; X-Ray; CloudTrail	All applications send logs to CloudWatch. Alarms can be set for notification in case of unexpected metrics. X-Ray provides automated trace analysis and a central service map visualisation, with request tracking across distributed systems. CloudTrail audits API calls made by users, services and the AWS console; useful to detect unauthorized calls or the root cause of changes.
20	AWS Integration & Messages – SQS; SNS; Kinesis	SQS is a highly scalable queue service in which consumers poll for data and a message is deleted after being read. SNS pushes messages to subscribers (up to 10M subscribers) and integrates easily with SQS for the fan-out pattern. Kinesis is used for streaming data, where the data gets distributed across multiple shards. Data is read-only, which makes it possible to run multiple analyses on it.

Karthik's BI Musings

Author: Karthik V