Automate BigQuery Scripts with Cloud Functions and Secret Manager

Having worked with Google Cloud Composer for quite a long time, when someone says ‘pipeline’ my brain automatically goes to one and only one thing – DAGs. With such a wide variety of pluggable options available, it feels like manna. It is similar in kind to the tools of yore – Informatica, SSIS etc. – but less GUI and more extensible. It is only in my current job that I am discovering how much more Google has to offer out of the box, at much cheaper price points, and I am loving the autonomy I am given to come up with solutions.

That’s not to say it’s the wild west out here with everyone left to do what they want. There are some established patterns, listed below, which I have worked on –

  • Airflow DAGs running dbt models – Very costly. Essentially, all the jobs do here is run a KubernetesPodOperator which in turn runs dbt commands (a rough sketch follows this list). Effective if there are a lot of ad-hoc jobs running across the project. I am planning a complete post about it, especially about dbt (which I am REALLY excited about).
  • Databricks – Mostly for ML models, but there are a good number of legacy jobs doing ETL workflows. Again, a very pricey option for simple workflows. I loved working on it for the flexibility it provides and the ease of debugging.
  • Airflow DAGs – Regular Airflow jobs covering all elements of ETL (very few jobs) with plug-and-play operators.
  • Cloud Functions – Using the Cloud Functions trigger pattern to deploy BQ objects (the main meat of this post)
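
To make the first pattern concrete, here is a rough sketch of what those DAG tasks boil down to. This is illustrative only – the image, namespace and model selector are placeholders rather than our actual setup, and the provider import path varies by Airflow version:

from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    dag_id="dbt_daily_run",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # The task does nothing itself; it launches a pod from a dbt image
    # and the pod runs the dbt command.
    run_dbt = KubernetesPodOperator(
        task_id="dbt_run",
        name="dbt-run",
        namespace="airflow",                      # placeholder namespace
        image="my-registry/dbt:latest",           # placeholder image
        cmds=["dbt"],
        arguments=["run", "--select", "my_model"],
        get_logs=True,
    )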

In addition to the above there are some more, such as Cloud Run with Cloud Scheduler / Workflows, and Vertex AI using dbt. I haven’t really worked on them yet but will be discovering and learning about them.

With Cloud Functions, as Google states, you just write your code and let Google handle all the operational infrastructure. With the release of version 2, there are literally hundreds of ways to trigger a Cloud Function and integrate it with a multitude of services from the entire Google Cloud Platform suite.
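
To give a flavour of the pattern before the detailed walkthrough, here is a minimal sketch: an HTTP-triggered gen 2 function that pulls a secret from Secret Manager and runs a BigQuery script. The project, secret and procedure names are hypothetical:

import functions_framework
from google.cloud import bigquery, secretmanager


@functions_framework.http
def run_bq_script(request):
    # Fetch a secret (e.g. config the script needs) from Secret Manager;
    # shown here only to illustrate the retrieval pattern.
    sm = secretmanager.SecretManagerServiceClient()
    secret = sm.access_secret_version(
        request={"name": "projects/my-project/secrets/bq-config/versions/latest"}
    ).payload.data.decode("utf-8")

    # Run the BigQuery script (here, a stored procedure) and wait for it
    client = bigquery.Client()
    job = client.query("CALL my_dataset.refresh_reporting_tables();")
    job.result()
    return f"Job {job.job_id} completed", 200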


Efficient Schema Extraction in Databricks with PySpark: A Step-by-Step Guide

Extracting schema information from Databricks would seem to be a very simple task, wouldn’t it? I mean, there is INFORMATION_SCHEMA to use. Unfortunately, that is only applicable to Unity Catalog, i.e. a Databricks metastore that can share data across multiple Azure Databricks workspaces – and no such catalog exists in our project. The best recourse is to use the Python Spark SQL libraries.

Problem

In one of my recent tasks, I had to update a large SQL query comprising 10+ Databricks tables, adding new attributes and replacing some existing ones. The only problem was that most of the attributes had no aliases, so there was no way to know which attribute came from which table. It’s one of the most irritating things that bothers me to no end and makes me feel like…

Solution
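
The core of the approach is a loop over the catalog using the PySpark catalog API. A minimal sketch, assuming the ambient SparkSession of a Databricks notebook and a hypothetical database name:

# Walk every table in a database and collect (table, column, type) rows
tables = spark.catalog.listTables("my_database")

schema_rows = []
for t in tables:
    for col in spark.catalog.listColumns(t.name, dbName="my_database"):
        schema_rows.append((t.name, col.name, col.dataType))

# A DataFrame makes it easy to search which table a column belongs to
schema_df = spark.createDataFrame(
    schema_rows, ["table_name", "column_name", "data_type"]
)
display(schema_df)  # Databricks notebook display helper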


Does experience matter?

Experience Meme | TaylorMadeMarketing©

As my current contract takes its last breath and I begin to search for opportunities, this one question has really been doing the rounds in my mind. Let me elaborate on it before people jump the gun: does experience in a particular technology really matter, if you already have knowledge of related subject matter?

For over a decade after starting my career, I worked with the MSBI suite of technologies, with PowerShell and C# thrown in. It wasn’t until I came to Sydney to work for a different client (LINK Group) that I had the opportunity to work with a different set of tools, such as Informatica and Oracle. Things were going well.

As the work came through, I adapted to the new tools and learned to work with them. It didn’t take long to upskill. It was then that I got a lucky break: Vinay Sammineni of Cognitivo Consulting and his partner Alan Hsiao took up my CV (via a mutual friend) and saw that I had very good data warehousing and SQL skills.

They phoned and asked if I was interested in working for Amaysim, who were looking for a data analyst with skills in Alteryx, Tableau and Amazon Redshift, and added that they had already set up an interview. This was quite a shock, and I clearly remember asking them multiple times whether they had actually gone through my CV, as I had no experience in any of those tools up to that point.

The interview with Jacquie went well; the questions mostly focused on standard SQL and my past work experience – the challenges I had faced, dealing with demanding managers and so on. She then went over the tools the company was using, was quite blasé about me not knowing them, and said I shouldn’t have any difficulty getting acclimatised.

That was how my new path began to take shape. I honestly can’t thank the folks at Amaysim, and Vinay and Alan, enough for believing in my ability to transfer my existing knowledge and use it to develop new skills.

As I start applying for Data Engineering roles, I hope I come across a company that can see my capability to carry over my existing breadth of knowledge, rather than judging me solely on the exact tech skills I have worked with.

So who do you really work for?

Yesterday I attended a Town Hall meeting of Genpact, through whom I currently contract. It was a very interesting meeting, with about 70-odd people attending. One of the things that pleasantly surprised me was the strength of non-Indian representation. Genpact is a mid-tier IT firm based out of India, and having worked for such consultancy firms before, I expected a large Indian diaspora with the occasional Aussie.

That was very much the case at the first company I worked for here in Australia, contracting for LINK Group as a MindTree employee. At the annual parties back then, I could count the number of Aussies in the whole pack on my fingers. So it was quite a sight to see a change.

The one difference I could see right away is that whereas MindTree had a predominant presence of developers, here I saw the opposite. To me, this can only mean that expansion is yet to happen for the company. Two main speakers gave an update on the company’s outlook: how it fared against last year, and the exciting new clients bagged this year. The company looks to be going strong, with a really good performance outlook.

Right at the end, the meeting veered into the employee feedback they had gathered, what it meant to them, how they were going to address it and so on. It was at this point that I kind of zoned out.

It reminded me of my time as a salaried full-time employee, when talks like this used to invigorate a sense of belonging in me. Ever since I started contracting, events like this have put things in a different perspective.

After the meeting, I stayed on for a while to have a casual chat with the folks there. It was then that I got asked – so who do you really work for?

That’s when, for the first time, I got a sense of liberation. Being on contract is, in a way, being on your own. In the truest sense the answer would be that I work for the client, as that is my primary responsibility. The question, though, is much broader than that. In the absence of allegiance, for whom am I really working?

Ever since the day I started working, my first and foremost dedication has been to the quality of the work I deliver. It has to be flawless, easily scalable, extensible and, most importantly, well-documented. It is the work itself that gives me the utmost satisfaction. The one thing I have consciously decided to focus on now is improving my technical knowledge and making inroads into the big data engineering space.

Day #2 – Data Modeling

Day #2 of my course involved getting an overview of data modelling. It started off with basic introductions to relational and cloud databases. The course per se touched only on basic terms and was a bit underwhelming with respect to the intro to the PostgreSQL database.

The Cassandra database is the next hurdle to cross and work on.

Day #1 – Getting the engine started

Yesterday marked my first day in the Data Engineer Nanodegree course offered by Udacity. After thinking a lot about how best to equip myself and enrich my knowledge of the world of Big Data, and taking steps towards it, this course came along covering THE essential things I wanted to learn – cloud data warehouses, Spark and data lakes.

What further sealed the deal is that I am working on a project where we are using Spark and a data lake as well, though they are handled by a separate team. My involvement so far has been limited to writing Impala queries, creating data structures, testing the Sqoop queries and occasional query tuning – looking at the logs to understand which partitioning works better. I reasoned that doing this course would give me better ammo to pitch myself into the Data Lake team. Time will tell (fingers crossed).

I have been longing for an opportunity to pivot my career from traditional BI to Data Engineering on big data platforms. Here is a course that not only promises to teach the nitty-gritty of being a Data Engineer through properly structured, methodical teaching, but also helps with shaping my career via services like resume editing and LinkedIn page setup. Long way to go for that.

So here is what Day 1 (yesterday) felt like – absolutely wonderful!
In the first few videos I really got to know what ‘Data Engineer’ actually means, what the other titles mean and how they stack up.

What resonated with me a lot was this article, one of the reading materials. It spoke volumes to me, as this was exactly the path I had followed all through my career: I started off writing ETL packages in SSIS on traditional OLTP/OLAP databases, designing cubes off them, and designing and developing reports on top.

All that stopped about three years ago, and it was only a year ago that I came off it completely. I now work on data sources which are disparate in nature or built on the data lake. This is a brand new world for me and I am loving every part of it. The challenges are different, more exciting, and there is SO much more to be done.

Looking at how data has proliferated and how traditional RDBMS technologies are no longer sufficient to cater to the growing needs of business, I am happy to see the organic growth in myself. Of course, the forces that shaped me to be where I am today are largely the work done in BI, but stepping into the new future I need more ammo.

Coming back to the course, I started off with data modelling basics and an intro to PostgreSQL.

The next post will be more structured. The purpose of these #-tagged posts is to motivate myself to read every day and share my thoughts on my learning.

Alteryx App – Adding (Select All) to Dropdown

One of the most common requirements when creating Alteryx Apps involving a dropdown is to have (Select All) as one of the values. This value, if you have not inferred by now, would not be part of the data source but something we add to it. Basically, I am trying to simulate what Excel does when ‘Sort & Filter’ is enabled, as seen below –

[Screenshot: Marvel Movies Box Office Report.xlsx – Excel]

In this post I am going to demonstrate how to add this value to the dropdown and how it is then consumed when filtering the data source. The post has the following sections –

  1. Get Dummy Data
  2. Add (Select All) to Dropdown tool
  3. Filter dataset using the ‘(Select All)’

1. Get Dummy Data

The test data I am going to use for this post is the highest-grossing Marvel movies, taken from this link – Marvel Comics Movies at Box Office. I copied the first table and stored it on my local disk. In my Alteryx App, I brought this data in with an Input Data tool, then used a DateTime tool and a Filter to get a consistent date format.

Sample Data is shown below –

2. Create Filter with (Select All) in it

As the next step I am going to create one Dropdown filter – ‘Studio’. For this post I am choosing the option ‘Manually set values (Name:Value – one per line)’ and have entered the values by hand. As you can see, I have kept ‘(Select All)’ as the first entry.

[Screenshot: Alteryx Designer – Test Workflow – Blog.yxwz]

If the data for the filter is coming via a connected tool, ensure the source data has the manual ‘(Select All)’ text entry joined in (e.g. via a Union). Likewise, if it’s coming from an external source, add this entry there. Essentially, because this value doesn’t exist in the data, we need to add it ourselves.

3. Filter dataset using the ‘(Select All)’

We have identified the column on which we need to filter and created the values for the filter. It is now time to actually ‘filter’. I have dragged in a Filter tool. Repetitive, isn’t it? I will stop ‘filtering’ (pardon the pun) going forward. Put the following entry in the ‘Custom Filter’ option, as shown below –

[Screenshot: Alteryx Designer – Test Workflow – Blog.yxwz]

Here is what we need to do next –
1. Connect the ‘search’ icon of the ‘Studio’ dropdown to the lightning bolt of the ‘Filter on Studio’ tool. An ‘Update Value’ action pops up in between.
2. Click on that action. Under the ‘Value or Attribute to Update:’ section, click on “Expression – value = ….”
3. Tick the checkbox ‘Replace a specific string:’ and set the value to <studio> (without quotes), as can be seen below –

[Screenshot: Alteryx Designer – Test Workflow – Blog.yxwz]

That’s it. Essentially, the logic here is an OR expression comparing the placeholder value to ‘(Select All)’ – something like [Studio] = '<studio>' OR '(Select All)' = '<studio>'. When a specific studio is selected, the select-all comparison is false and the first condition filters as usual; when ‘(Select All)’ is selected, the second condition is true for every row, so the filter lets all the data flow through.

The same logic can be used for filtering data in, say, a Dynamic Input tool with data coming from a SQL data source. The code for that would look like this –

select *
from   really_awesome_table
where (
        that_parameter = '<this_value>'
        or
        '(Select All)' = '<this_value>'
      )

To test it out, I have put in a Summarize tool to group by ‘Studio’ and ‘Total Movies’. Let’s run the workflow. Here is what the wizard looks like –

[Screenshot: Alteryx Designer – Select All – Filter implementation.yxwz]
Here is what the output looks like, executed with (Select All) and then with an individual filter –



IntelliJ – New Project – Scala not appearing

Have you come across the problem of not being able to select ‘Scala’ in the ‘New Project’ dialog in IntelliJ, as shown below? (Psst – Scala is present in my screenshot, I know. For imagination’s sake, let’s say you don’t see it, capisce?)

IntelliJ-New Project-Window.png

The first thing Google turns up is advice to install the ‘Scala plugin’.

Let’s say you do have the plugin installed and enabled, as shown below, but you still don’t find Scala in the ‘New Project’ dialog –
IntelliJ-Plugins-Scala-Enabled.png
This usually happens when you have upgraded IntelliJ, or re-installed a different version at a different time, and have imported the old settings. The best way to get past this is to navigate to the folder C:\Users\<username>\ and delete all the folders starting with .IdeaICxxxx, as highlighted below –
IntelliJ-User Folder-Delete.png
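
If you would rather script that clean-up (and eyeball the list before deleting anything), here is a small Python sketch of the same step – it only touches folders matching the .IdeaIC* pattern in your user profile:

import shutil
from pathlib import Path

home = Path.home()  # e.g. C:\Users\<username>
stale = [p for p in home.glob(".IdeaIC*") if p.is_dir()]

for folder in stale:
    print(f"Found old IntelliJ settings folder: {folder}")

# Deleting is permanent - confirm before pulling the trigger
if stale and input("Delete all of the above? [y/N] ").lower() == "y":
    for folder in stale:
        shutil.rmtree(folder)
        print(f"Deleted {folder}")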

That’s it. Open up IntelliJ again and all should be good to go. Of course, this time around you will need to set up your JDK and Scala libraries again. Follow this blog – link

 

Alteryx InboundNamedPipe::ReadFile: Not enough bytes read error – Postgres

Yet another strange day. I kept getting the following error –

Error – The Designer x64 reported: InboundNamedPipe::ReadFile: Not enough bytes read. The pipe has been ended

I tried several things – clearing temp files, restarting Alteryx, restarting my system multiple times, logging off and on again – all to no avail. It just kept failing.

What added to the confusion is that it was happening only for a few tables, whose volumes were much smaller. From the same database, Table 1 with about ~20M records extracted successfully, whereas Table 2 with only ~3.5M ran into this problem. It seemed really strange. The good old internet hadn’t turned up anything useful, and neither had the Alteryx forums. All the info I could find was that the workflow was erroring out at some stage.

In my case, though, the workflow had nothing in between the ‘Input Tool’ pulling in the data and the ‘Output Tool’. Still it was failing.

Anyway, I took a break, thought for a while and said to myself – “When was the last time you had such a strange error with the Alteryx and Postgres combo?” Oh yeah, right here.

Hmm, why not try the same fix? I sure did, and you know what – it just goddarn works! Don’t even ask me how or why (my guess: a stray non-UTF-8 character was killing the data stream mid-transfer, taking the pipe down with it). So gents and ladies, here is the fix –

Go to the ‘Pre SQL Statement’ of your Alteryx ‘Input Tool’ and insert the following –

set client_encoding to 'UTF-8'

As simple as that! Thank me later.
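
For what it’s worth, if you ever hit the same encoding issue pulling from Postgres in Python rather than Alteryx, the equivalent knob exists there too – a sketch using psycopg2, with a placeholder DSN and query:

import psycopg2

# Same fix, applied on the client connection instead of a Pre SQL statement
conn = psycopg2.connect("dbname=mydb user=me host=localhost")
conn.set_client_encoding("UTF8")

with conn.cursor() as cur:
    cur.execute("SELECT count(*) FROM some_table")
    print(cur.fetchone())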

Alteryx – Extract data from various excel files with various tab names

It was one of those challenging tasks that had to be done, and I felt it was worth blogging about as well.

Task –
Extract data from a limited range of a sheet in various Excel files. Each file has multiple tabs, and the tab names differ from file to file.

For instance, let’s say I have three Excel files with the following name format – MonthlyReport_<<Month>>.xlsx – where Month is Dec, Jan or Feb.
The data in my ‘Dec’ file is as follows –
1_srcExcel_Dec

Nice, well-formatted data. Let’s look at Jan’s data –
1_srcExcel_Jan

Uh oh! Somebody left a comment in column E. Now let’s look at Feb’s data –
1_srcExcel_Feb
Aargh… another person left a comment, this time in column D.
Now my task is to get the data from columns A and B only, across all the Excel files with their differently named tabs.

Solution –
The first step is to get all the sheet names from each of the Excel files. To achieve this we need a macro, and the steps are listed below –

  1. Open a new workflow. Drag in an ‘Input Data’ tool, connect it to one of the Excel files, choose the option <List of Sheet Names>, and set ‘Output File Name as Field’ to ‘Full Path’ as shown below. Setting these two options results in two output fields – FileName and Sheet Names – 2_ListExcelSheet_InputData
  2. Add a ‘Formula’ tool. Overwrite ‘FileName’ with the query we need, as shown below. The purpose of the query is to replace ‘<List of Sheet Names>’ with the actual sheet name plus the range of data to be fetched. If we don’t specify the range and just stick to the sheet name, the workflow will fail because the data is non-uniform. This is important to take note of. I am pasting the query for easy reference – keep an eye on the backtick characters –
    Replace([Filename], "<List of Sheet Names>", "Select * from `"+[Sheet Names]+"$A1:B100000`")
    3_ListExcelSheet_ForumlaTool
  3. Drag in a ‘Macro Output’ and connect it to the ‘Formula’ tool.
  4. Drag a ‘Control Parameter’ onto the workflow. Connect the ‘search’ icon of the Control Parameter to the ‘lightning’ icon of the ‘Input Data’ tool. Doing so, an ‘Update Value’ action appears in between; just leave it as is. By default it selects ‘File – value’, as shown below. The purpose of the Control Parameter is to create a placeholder through which multiple Excel files can be passed. If you are coming from an SSIS background, this is the equivalent of a ‘Foreach Loop’ container where the full file path is passed as a parameter.
    4_ListExcelSheet_UpdateValue
  5. Save the workflow as ‘List of sheets Macro.yxmc’; the full workflow looks as shown below. (Note – I have added annotations for the Control Parameter, Input Data tool and Formula tool) – 5_ListExcelSheets_Completed.png

Now we have a macro which gives the full path of each Excel file along with the query needed to get the data. Next, we need another macro where we can feed in that query and get the data. Remember, any time we need looping in Alteryx, a macro is the way to go. This first macro only got the sheet names from each Excel file and rewrote the file path with the query.

The second macro is fairly simple –

  1. Open a new workflow. Drag in an ‘Input Data’ tool and connect it to one of the source Excel files; in my case I connected it to the ‘Feb’ file. When you connect to the file, here is what the ‘Choose Table or Specify Query’ window looks like –
    6_ObtainDataFromExcel_ChooseTable.png
  2. Do not choose any sheet name; instead click on ‘SQL Editor’, paste the following query and click ‘OK’ – SELECT * FROM `February$A1:B100000` –
    as shown below – 7_ObtainDataFromExcel_SQLEditor
  3. Drag in a ‘Filter’ tool and connect it to the ‘Input Data’ tool. In the properties, select the column ‘Id’ and set the dropdown to ‘Is Not Null’. If you look at the query in the step above, I gave an arbitrary end row of B100000; the data may or may not extend that far, so we need to filter out the empty rows.
  4. Drag in a ‘Macro Output’ and connect it to the ‘Filter’ tool above.
  5. Drag in a ‘Control Parameter’ and connect its ‘search’ icon to the ‘lightning’ icon of the ‘Input Data’ tool. An ‘Update Value’ action appears in between; you don’t need to do anything – by default it updates ‘File – value’, as shown below – 8_ObtainDataFromExcel_InputData.png
  6. Save the workflow with a name like ‘Obtain limited range from excel.yxmc’. Here is how it looks – 9_ObtainDataFromExcel_Overall

Okay, so we now have two macros: one that gets the sheet names and rewrites the full file path with the requisite query, and another that gets the data from the Excel file. We now need a workflow to call these two macros and finish the job.
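
Purely to illustrate the logic (this is a sketch, not the Alteryx implementation – the folder path is hypothetical, while the file pattern and the ‘Summary’ sheet come from this walkthrough):

from pathlib import Path

import pandas as pd

frames = []
for path in Path("C:/reports").glob("MonthlyReport_*.xlsx"):
    for sheet in pd.ExcelFile(path).sheet_names:        # macro 1: list the sheet names
        if sheet == "Summary":                          # same filter as step 5 below
            continue
        # macro 2: pull only the A:B range from the sheet
        df = pd.read_excel(path, sheet_name=sheet, usecols="A:B")
        frames.append(df.dropna(subset=[df.columns[0]]))  # the 'Id Is Not Null' filter

combined = pd.concat(frames, ignore_index=True)
print(combined.head())

Back in Alteryx, here are the steps for the driver workflow –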

  1. Open a new workflow. Drag in the Directory tool and point it to the path where the source files reside. Set the file specification to ‘*.xlsx’ so that we only pick up the Excel files.
  2. Drag in a ‘Formula’ tool and connect it to the Directory tool. Add a new column, name it say ‘FullPath_SheetNames’, and give it the following value – [FullPath]+"|<List of Sheet Names>".
  3. Right-click on a blank space in the workflow and go to ‘Insert’ -> ‘Macro’ -> ‘List of Sheets’, as shown below – 10_ObtainDataFromExcel_InsertFirstMacro.png
  4. Connect the macro to the ‘Formula’ tool. In the macro’s ‘Properties’ box, set the ‘Choose list of sheets input field’ to ‘FullPath_SheetNames’. (Note – if you have not added any annotations in the macro, it will appear as ‘Control Parameter Input’.)
  5. Drag in a ‘Filter’ tool and connect it to the macro above. Set the filter to [Sheet Names] != "Summary". I am doing this because each of our files has two sheets, and the sheet we are interested in is not the ‘Summary’ one. I could have put just one sheet in each file, but I wanted to show how, if need be, you can use the sheets separately for different purposes.
  6. Right-click on a blank space in the workflow, go to ‘Insert’ -> ‘Macro’ -> ‘Obtain limited range from excel’, and connect it to the ‘Filter’ tool.
  7. In the properties of the second macro, set the input field to ‘FileName’.
  8. Drag in a ‘Browse’ tool to look at the output. (This is where you would typically plug in your Output Data tool in a real-world scenario.)

11_ExtractData_Overview

After running the workflow, here is the output from Browse –
12_ExtractData_Output

That’s it. For my task I had to go through about 12 different Excel files with various tab names, and the out-of-the-box tools Alteryx provides were powerful enough to get the job done.

Just falling in love with it.