Dynamically read Zip file contents using Alteryx

I had an interesting business problem to solve and wanted to share how it can be achieved.

Business Problem

On a daily basis, a zip file containing various flat files is dropped at a file location. The contents of the flat files are to be read and extracted. All the files have the same metadata.

File names within the zip file are dynamic.

Solution

Alteryx provides an out-of-the-box Input tool for working with zip files. All one needs to do is drag and drop the zip file onto the canvas, and the tool itself will pop up asking which files are to be extracted, as shown below.

Pop-up image when dragging a Zip file on to the canvas

However, information on what steps are needed to make this process dynamic is pretty hard to find. Of course, one can rig up a Control Parameter on top and have the zip file path updated, but that's about it; nothing much can be done after that.

Let’s first look at the sample zip file named students_20200902.zip and the files within

Contents of the zip file
Sample data within a text file

My solution involves using Python entirely. Working on this, I was reminded of similar tweaks I used when working with SSIS: whenever something could not be done using the existing Data Flow Tasks (DFT), the Script Task would come to the rescue, as the whole programming logic can be embedded within it to get the task done.

As can be seen below, this is how the end workflow looks:
Text Input – contains the full file path of the zip file.
Formula Tool – adds a TempPath field whose value equals the engine's TempFilePath constant ([Engine.TempFilePath]).
Python – contains the full logic for extracting the data from the files within.

Alteryx Workflow Overview

Let's now focus on the Python tool. Here is what the tool does at a high level –
1. Use Python's ZipFile module to unzip the files to the Temp location.
2. Use the pandas library's pd.read_csv and pd.concat functions to load all the data into a dataframe, which is then sent to the output.
3. Delete all the extracted files from the Temp location.

Listed below is the Python Code –

#################################
# List all non-standard packages to be imported by your 
# script here (only missing packages will be installed)
from ayx import Package
#Package.installPackages(['pandas','numpy'])


#################################
from ayx import Alteryx
from csv import reader
import pandas as pd
import os
from zipfile import ZipFile


#################################
# read all the inputs
data = Alteryx.read('#1')

# the input comes through as a one-row record set, so pull the single
# value out of each column with .iloc[0]
zip_file_path = data['Path'].iloc[0]
temp_file_path = data['TempPath'].iloc[0]

#declare variables
extensions = ['.csv']


#################################
# Create a ZipFile Object and load *.zip in it
with ZipFile(zip_file_path, 'r') as zipObj:
   # Extract all the contents of the zip file into the TempPath directory
   zipObj.extractall(temp_file_path)


#################################
# variable to hold list of paths for the extracted csv files
file_list = []

#loop through all the files in the directory
for root, dirnames, filenames in os.walk(temp_file_path):
    for file in filenames:
        #get file extension and name
        fname, fext = os.path.splitext(file)

        #parse the file only if the extension is of desired extensions
        if fext in extensions:
            #open the file and read contents
            filepath = os.path.join(root, file)
            file_list.append(filepath)
            
# get the combined data
combined_csv = pd.concat( [ pd.read_csv(f) for f in file_list ])

# delete the files post processing
filenames = os.listdir(temp_file_path)
for filename in filenames:
    if filename.endswith(".csv"):
        os.remove(os.path.join(temp_file_path, filename))


#################################
Alteryx.write(combined_csv, 1)

When you drag in a Python tool, the Jupyter interface opens up when you click on the 'Configuration' pane. By default it is set to 'Interactive' mode, as shown below –

Once the code is working as expected, do remember to open this tool again and set the mode to 'Production'. This speeds up execution, as it skips the intermediary outputs that we use for debugging the code.

Preview of Alteryx Python Configuration tool

That's about it. The best thing about this code is that it is file-agnostic, i.e. it doesn't matter if the names or even the metadata of the files within change; the solution still works.

Since read_csv has a multitude of options catering to all types of needs, e.g. delimiter type, header row etc., the solution can be made flexible enough to handle many types of files.
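
As an illustration, here is a hedged variation of the read step above, assuming pipe-delimited files with no header row; the delimiter, header setting and column names are assumptions for demonstration only and would need to match the actual feed.

# hypothetical variation: pipe-delimited files without a header row
combined_csv = pd.concat(
    [
        pd.read_csv(
            f,
            sep='|',            # assumed delimiter of the source files
            header=None,        # assume the files carry no header row
            names=['student_id', 'student_name', 'enrolled_on'],  # illustrative column names
            dtype=str           # keep everything as text; cast downstream if needed
        )
        for f in file_list
    ],
    ignore_index=True
)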

Here is the final output after running the job.

Should Thanos be with the other lot?

Toughest Challenges

In the quest for the next big break or opportunity out there, I have come to expect these interview questions, so I have decided to blog about them.

Phase 1 has ended. Phase 2 is up ahead.


Background –
After months of relentlessly applying for jobs with a mix and match of the skill sets I am eligible for, and going through multiple ghost callings (I believe that's what it is called when recruiters spend around 20 to 30 minutes inquiring about everything about you, explaining the job needs, setting the pay expectations and raising false hopes, and then you never hear anything back even after sending multiple mails or messages), I was finally fortunate to have a call set up with an actual person at an actual company.

Unfortunately, the interview didn't pan out well. It started with getting to know about the current role and the team I would be working with, and then went straight into the actual interview. The first question I was asked was about my 'toughest challenge' to date. It caught me a bit off-guard, as I was expecting some technical questions to start with before settling into behavioural ones and finally ending with expectations for the next meeting. I did my best to explain the things I have done, but I guess it just was not good enough.

Why really?

  • This will act not only as a reference for myself but also as a reminder of the other 'tough challenges' that lie ahead and that I will need to face in the future.
  • The second, and foremost, reason is that writing this down has been very cathartic.
  • Not EVERYTHING can be put on a resume, so here it is then.

Without further ado, let me get to the Toughest Challenge question. Questions like this need to be addressed via the STAR system.

Toughest Challenges Faced –

Accenture
My stint at Accenture was the most challenging and most gratifying time of my professional career. I wore several hats during my 5 years there – starting off as a Software Engineer, then BI Developer, BI Team Lead and Operations Team Lead (BI). There were numerous times when I had to stay back at the office for a whole day or more to ensure smooth operations. As I write this, it feels as if it was only yesterday that I left Accenture, even though it has been 5 years.

Here are a few of the most challenging times and how I handled them –

  • Data Migration –
    • Challenge – The project was being upgraded from version 1 to version 2, which necessitated a data migration from the Operational Data Store (ODS) to the Data Warehouse (DWH). The nightly job at that point was designed to migrate about 1M records between the ODS and DWH in about 4-6 hours each day. The task was to migrate about 90M records.
    • Action Taken – I re-architected the ETL jobs that load the main fact tables end to end to come up with a scalable, efficient and highly performant solution. This involved introducing cached lookups, replacing row-by-row updates with bulk updates, removing blocking transformations, creating indexed views etc. I also added an audit table to track the progress of each batch.
    • Result – The data migration was successfully achieved in a 12-hour timeframe, as opposed to the original design that would have taken 240+ hours. Additionally, some of the design changes introduced were retained for the normal daily runs thereafter.
  • BI Data Feed
    • Challenge – A BI Data Feed solution had to be implemented, and that required someone familiar with C#.Net and BI technologies to develop a command-line application ingesting data from 4 different SQL Server instances comprising 60+ tables. The business logic also dictated numerous complex rules that had to be incorporated.
    • Action Taken – Upskilled myself to write an efficient multi-threaded C#.Net application, using SMO.Net API calls and BCP commands, and combined that with dynamic SQL and SSIS to build the needed solution.
    • Result – The solution was delivered well ahead of time, adhering to the business specifications.

MindTree –

It was through this company that I first landed here in Sydney, Australia, where I worked as a Technology Lead for LINK Group for over 2.5 years. The best thing about the role, I would say, was the complete autonomy offered to perform the tasks needed with whatever tools were available at hand.

Here are a few of the most challenging times and how I handled them while working here –

  • Data import monitoring –
    • Challenge – Data imports were regularly done over varying periods of time using the out-of-the-box TrimPort and HP TRIM. There was no proper way to track the imports and estimate job completion times, resulting in unnecessary waiting for the data to complete and an inability to report on progress.
    • Action Taken – Developed a Power BI dashboard to report batch import progress by monitoring the data in SQL Server, and created an SSIS package that continually refreshed the data at regular intervals.
    • Result – Created a holistic, pluggable data monitoring solution that is flexible enough for future monitoring needs and gives end-users the ability to properly plan their imports.

Macquarie Group –

Work at Macquarie Group has been the most joyous. It is a company that, I would say, puts technology at the forefront of whatever it does.

Here are a few of the most challenging times and how I handled them while working here –

  • Historical data migration failure handling –
    • Challenge – A production release got halted when an issue was identified with one of the data migration scripts for historical data pertaining to 18 reporting months. Reproducing the issue in the non-Prod environment was not possible, as it did not have the requisite data. Upstream flows would be hugely impacted in the absence of this change, and the release window for implementing it was short.
    • Action Taken – I broke down the analysis into multiple phases: first testing the column counts, then validating the schema, and finally identifying the root cause of the problem. Developed a unified script handling all the discrepancies, and performed data sampling by running it across the whole dataset to ensure it worked.
    • Result – Was able to deliver the solution within two hours of the issue being identified, after going through a proper peer-review process, attaching test evidence and obtaining business sign-off.
  • Setting up of new Environment –
    • Challenge – The application team was living on dangerous ground with only one environment to work with, i.e. Production. All the workflows (over 40) and dependent objects were being run from one drive with no safety net. Periodic backups were taken, but it was not an efficient system. A UAT process had to be set up.
    • Action Taken – I broke down the requirements into categories of varying complexity. Coordinated with multiple teams in setting up a new environment, creating a new repository for version control (Bitbucket), and performing unit testing and system testing on the UAT branch. Integrated Alteryx with Python to develop a localisation workflow that could refresh the underlying data for an environment.
    • Result – A robust new system was in place for the team to work with, without having to take multiple backups each time they had to work on a change. The end-to-end process, right from raising a Jira to merging the code, was properly documented in a workflow detailing each step.
  • Performance enhancement –
    • Challenge – A regulatory report containing complex business logic needed to be replicated with data on the data hub. The data pertained to movements across two successive reporting periods, and the challenge was to perform cross-currency conversion across 15+ reporting currencies using Power BI. Incorporating the business logic within Power BI led to slow-performing reports that did not scale with the data volumes. The task was to create a scalable solution that also improved response times.
    • Action Taken – The solution was tackled by moving all the logic onto the data hub. Utilising Impala queries on the hub – cross-currency conversion, lookup data inclusion, and pivots combined with cross joins – the whole business logic was replicated. A further decision was taken to write the output to a flat table rather than computing it directly in Power BI.
    • Result – A highly performant report was generated by using DAX queries and pulling in the transformed data from the flat table. Report response time was brought down from 300+ seconds to under 20 seconds.

Amaysim –

Amaysim was the first product company that I worked for. One of the things that blew my mind was the orientation classes planned in the first week of joining the company. I mean, who even does that these days? There was a series of workshops where different people chipped in and explained the company culture (even the CEO).

As I listed here, it was a role I was accepted for because of their belief in my technical capabilities rather than the tools themselves, and this led me to work on technologies such as Alteryx, AWS Redshift, PostgreSQL, Tableau etc.

  • Report Generation –
    • Challenge – The finance team was tasked with generating revenue reports for NBN customers. This entailed following a documented series of steps that involved manual extraction of data, cleaning up discrepancies, reconciling the data, and repeating this process over multiple use cases.
    • Action Taken – Developed an Alteryx workflow that hooked into the data right at the source, embedded the business logic within the workflow, and produced a Tableau Data Extract (TDE) as output, with no manual intervention and the data refreshed on a schedule.
    • Result – A Tableau dashboard that the business could use directly for reporting and analytics, saving hours of effort every month.

Mantras to live by

I was inspired to write this blog after seeing a post on my company's Workplace talking about wellbeing during COVID times and effective communication. As a working professional there are some 'mantras' that I live by, and here they are.

General

  • Ownership of a Production issue – If you are tasked with an issue that requires urgent attention for a production need, ensure that you own it right from start to finish. This means assigning a proper timeline to start with, and upon delivery ensuring it gets tested in UAT, goes to Pre-Prod and is finally deployed to Prod. Keep your eyes and ears on it all throughout this process through regular follow-ups, and verify it post-production.
  • Look out for opportunities – Quite often, apart from your regular work, you may come across tasks that are being done by following a set process with a certain sequence of steps in a methodical manner. Usually the people doing this spend a considerable amount of time outside their normal routine accomplishing it (sometimes a week or more!). That should be the first sign of an opportunity to seize and automate things.
  • Commitment to a task – Never commit to anything upfront. If anyone comes to you with a request to deliver something urgently, take some time to pause and analyse it first. Only after proper analysis give a timeline for when it can be done. When giving an estimate, consider the time for the actual build, unit testing, regression testing (if needed), design document updates, peer review (a must) and rework.
  • Meetings – Always have an agenda for a meeting and circulate it beforehand, as it provides context for participants. Be mindful of time, and do not book a meeting after-hours just because one person's calendar is unavailable. If you feel a phone call would be easier, then by all means give them a ring and get it done fast.
  • Technical Front –
    • When designing a solution to a problem, ask yourself the following questions –
      • How critical is this problem?
      • Are there any more areas where such a problem exists? If so, what can be done about it?
      • At the whole-component level, is there something I can do to make the solution better?
      • Don’t be afraid to loosen things up and go one step further if you feel you can deliver a more robust and stable solution. Pitch for it if you are confident that you can deliver within the time frame.
    • Actual Fix – 
      • Provide proper comments in the code, and even in the Fix Details, so that anyone can understand what has gone into it.
      • Do thorough formatting of the code to make it more readable. 
      • Look for extensibility and scalability of the solution (how does it impact Asset) 
      • Ability of the solution to withstand large volumes (query plan analysis, statistics check etc.)

Mail Etiquette –

  • Signatures – During my initial years, it was such a fanciful thing; I used to put my heart into trying to come up with unique, non-repetitive, colour-formatted signatures, images etc., mostly for bragging rights. It's only later that I saw the futility of it. Anything non-standard is just a plain distraction. Here is how to tackle it:
    • Check the organization's internal pages for any signature etiquette. 90% of the time there is one, and it's a straightforward process of just using it.
    • In cases where there isn't one, the only things needed should be – Name, Role, Company Address, Official Mail Id (very important), Extension. That's about it. Avoid putting in images.
  • Mail Acknowledgement – Always acknowledge a mail that is sent to you seeking advice or a reply. If it will take time, let the sender know that you will get back to them after a certain time. If that time elapses and you haven't been able to accomplish the task, send another mail explaining why. The essence is that you should not keep people in the dark when they are waiting for a response.
  • To/CC/BCC know-how – Always put the person you are addressing in 'To' and the rest in CC. If there is a group of 10 people in 'To' and you intend to reply only to one or two of them, then everyone else should go in CC. Make this a practice. I have seen people create rules specific to being mentioned in 'To'/'Cc'. Remember, CC means it is an FYI; by putting unintended people in 'To' you are just wasting their time.
  • Mail Content – Remember that your content should be tailored to the recipient. If you are writing a mail to, say, senior leaders or non-technical people, then explain the issue with as little technical detail as possible and connect with them from the customer-impact point of view. As a general guideline you can structure your mail with blocks such as – Issue Observed (job is failing/job taking a long time etc.), Customer Impact (delayed reports, process blockage etc.), Root Cause (if you have found it, state it; otherwise say it is still being investigated), Workaround/Solution Proposed. In this manner you structure your message properly and succinctly.
  • In continuation of the above, when seeking clarification, make it a point to let the team know that you have done some homework prior to reaching out. This means explaining to them the thought process that led to the current logic, and only then seeking information. Remember, half-sent information causes to-and-fro exchanges that can cost you a day or more.
  • Manager and Above – When you send a mail to a manager or above, wait for 10 to 15 minutes (don't go anywhere from your desk) to see if they are seeking any clarification. They are the busiest people, and if they are looking for more info you save them time by being right there to provide it. Also, never be afraid of grammatical errors or of overthinking the level of detail. Everybody makes mistakes; own them and improve on them. Trust me, they don't mind as long as they understand what you are trying to convey.

Personal Front-

  • Folks, love everything you do. Only then will you be able to deliver results.
  • Never look for personal gains. Derive satisfaction from the work you deliver. Remember, your bread and butter and your daily life are provided by this job that you have. Respect it. Rewards and recognition will automatically follow.
  • When it comes to discussing things/issues, keep in mind that every piece of work you do affects someone and all your efforts are addressing a real problem.
  • Learn, learn and learn. Keep learning. Aim to set aside at least 1 hour per week for learning to enhance your career – it could be technical or non-technical. Once you learn something, aim to deliver one session within the project team by preparing a deck and presenting it. This goes a long way in improving your presentation and oral skills.
  • Remember to invest in yourself – Of late, many of the training sites demand a hefty annual fee which may look daunting at first (say $400 – $600), but ration it per month and you will see that you can afford it.
  • Don't be a frog in a well. Have a periodic connect with managers or senior managers to know what their future plans for you are and what career opportunities exist.

AWS – Developer Associate Certification learnings

It's been almost 2 weeks since I passed the certification exam, and I wanted to pen down the high-level details of all the components that I studied to pass the AWS Developer Associate exam.

First off, I would like to thank Stephane Maarek and his wonderful Udemy course – Ultimate AWS Certified Developer Associate – without which I am not sure I could have even inched past the preliminary set pieces.

Background
Like every Software Developer worth their salt, my fascination with cloud technologies began a few years back. I started off my learning with the help of Pluralsight courses. The course, as usual, was of excellent calibre, but one teeny tiny detail that wasn't mentioned was the need to monitor the bill. I was of the opinion that 750 hours of free tier would last a lifetime.

I drifted off the course for a while and forgot to turn off the EC2 instances, and voila! One fine day I saw a bill of AUD $160 in my mailbox. I immediately contacted the support centre and had the account suspended.

It really scared me off for a while, and I put off learning about it for quite a bit of time.

Motivation – I
On and off after that experience, I just dabbled with S3 storage and static websites, trying to programmatically load some images using the AWS SDK. As part of my Udacity Nanodegree experience, I worked on small ETL batch jobs using Python modules, first loading the data onto S3 and then onto Redshift as the final destination. The whole program, though, left me with a bad taste – one of the worst support systems and sub-par course quality – though I managed to create some portfolio projects.

As I started off my job search, I realised it's hard to convince people that I am well acquainted with AWS technologies and know how to work with them. Though I don't directly work on it in my current role, I am quite aware that the Cloudera offering we have is deployed across a multitude of EC2 instances and that we are not using the out-of-the-box EMR provided by Amazon.

Additionally, I have quite often been asked if I at least have a certification.

Motivation – II
When I started searching through the certification offerings from AWS, I realised the one I really want to take is AWS Certified Data Analytics – Speciality, as I aim to become a Big Data Engineer/Developer. That certification explores the whole gamut of technologies one can utilise for data analytics on big data –

Collection (Kinesis, Database Migration Services (DMS))
Storage (S3, DynamoDB)
Processing (Glue, Lambda, Hive, Spark, Hue, HBase)
Analysis (Redshift, Athena)
Visualisation (QuickSight)
Security (STS, KMS)

The ones highlighted are those I have worked with or am working with. AWS mandates that I hold an Associate certificate before I can attempt a Speciality certificate, and I chose 'Certified Developer – Associate' out of the three options. Asking around among friends and colleagues, I could see that the Udemy course was a strong first choice, followed by an ACloudGuru subscription; I took the former. It was an intense 4-week preparation that ultimately bore results. So, without much further ado, here is the recap of the whole suite of products that I have learnt.

1. IAM (Identity and Access Management) – Access management forms the heart and soul of the AWS ecosystem. It has a global view, and all permissions are governed by Policies (written in JSON format). Governance is accorded in three segments (Users, Groups, Roles).

2. EC2 (Elastic Compute Cloud) – EC2 is akin to virtual servers in the cloud. AWS provides a whole gamut of choices along 5 distinct characteristics – RAM, CPU, I/O, Network, GPU. Additionally, you can have different launch types too –
On Demand Instances – short workloads
Reserved Instances – minimum 1 year
Spot Instances – short workloads, less reliable, you can be kicked off the instance
Dedicated Instances – exclusive access to the hardware, not shared with anyone
Dedicated Hosts – booking of an entire physical server, control over instance placement etc.

3. ELB (Elastic Load Balancer) – Load balancers are servers that forward internet traffic to multiple EC2 instances, essentially spreading the load across downstream instances. Three types of load balancer are present –
Classic Load Balancer
Application Load Balancer (v2)
Network Load Balancer (v2)

4. ASG (Auto Scaling Group) – The purpose of an ASG is to scale out (add EC2 instances) to match increased load, or scale in to match decreased load. Goes hand in hand with ELBs. The trigger for scaling can be CPU, network or even custom metrics. Various types of scaling can be done – step scaling, scheduled scaling etc.

5. EBS (Elastic Block Store), Instance Store, EFS (Elastic File System) – EBS is a network drive you can attach to an EC2 instance when it runs; it retains data in case the instance crashes. EBS volumes are locked to an AZ. Depending on need, various types of storage are available (from large to small, high latency to low latency etc.). An EBS volume can be attached to only one EC2 instance.
Instance Store, unlike EBS, is like a USB drive attached to the EC2 instance: available directly from the machine, but on the flip side you lose all the data if the instance crashes.
Elastic File System is a highly scalable, more expensive storage that is available across multiple AZs. EFS can be attached to multiple EC2 instances.

6. RDS (Relational Database Service), Aurora, ElastiCache – RDS is the managed database service from the AWS stable that provides automated provisioning, continuous backups, read replicas, auto-scaling (both vertical and horizontal), OS patching and so on.
Aurora is AWS's own database engine (with a serverless option), akin to AWS RDS on steroids, i.e. up to 5 times more performant.
ElastiCache provides managed in-memory databases that sit in front of RDS. It gives the ability to cache requests and reduce the hits going to the DB – remember, on the cloud every read/write counts towards the cost. Two types –
Redis – backup and restore features
Memcached – non-persistent

7. Route 53 – A service akin to traffic police redirecting road traffic. Redirection can be done at the domain level (CNAME) or to another Amazon resource (Alias). Various types of routing are available –
Simple
Multi-Value Routing
Geolocation Routing
Failover Routing
Weighted Routing

8. VPC (Virtual Private Cloud) – VPC isn't extensively asked about in the Developer Associate exam, so high-level knowledge should suffice. It's a private network in which to deploy resources, and public and private subnets can be set up within it. NAT Gateways and Internet Gateways are used to communicate with the internet.

9. Amazon S3 (Simple Storage Service), Athena – S3 is one of the major building blocks of AWS: an infinite storage layer to store a wide variety of data. Data is stored in buckets (directories), and versioning can be enabled. One of the most interesting things I found is the range of storage classes, from General Purpose all the way to Glacier Deep Archive (minimum 180 days of storage).
Athena is a serverless service to perform analytics directly against S3 files.

10. CloudFront – A Content Delivery Network to improve read performance, provide DDoS protection etc. It uses a global network of edge locations and is great for static content that must be available everywhere.

11. ECS (Elastic Container Service), Fargate – ECS is the container management service for Docker workloads; ECS clusters are logical groupings of EC2 instances.
Fargate provides serverless management of container services, giving high scalability without manual intervention.

12. Elastic Beanstalk – A developer-centric way of deploying applications on AWS. It has three main components – application, application version and environment name (dev, test, prod) etc.
It provides highly flexible deployment modes – All-at-Once, Rolling, Rolling with Additional Batches, Immutable.
You can make use of the CLI to manage everything via code.

13. AWS CI/CD – DevOps on AWS can be done using these components, which together provide CI/CD –
CodeCommit – hosted source control repositories
CodePipeline – orchestrates the stages of the release pipeline
CodeBuild – builds and tests the code
CodeDeploy – deploys the built artefacts to instances or services

14. CloudFormation – Infrastructure as Code. I absolutely LOVE this feature; it's just mind-blowing in every sense. It's a declarative way of outlining AWS infrastructure: create a template of the infrastructure you desire, and it's then just a matter of creating and removing that infrastructure at the click of a button. I will be focusing more on this from now on to enrich my learning.

15. Monitoring – CloudWatch, X-Ray, CloudTrail – All applications send their logs to CloudWatch, and alarms can be set for notification in case of unexpected metrics.
X-Ray provides automated trace analysis and a central service map visualisation, tracking requests across distributed systems.
CloudTrail audits API calls made by users/services/the AWS console; it is useful for detecting unauthorised calls or the root cause of changes.

16. AWS Integration & Messaging – SQS, SNS, Kinesis – SQS is a highly scalable queueing service where consumers poll for data and messages are deleted after being read.
SNS pushes messages to subscribers (up to 10M of them) and integrates easily with SQS for the fan-out pattern.
Kinesis is used for streaming data, where the data gets distributed across multiple shards. The data is immutable (read-only), which gives the ability to run multiple analyses over the same stream.
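
To make the SQS polling model above concrete, here is a minimal boto3 sketch; the queue name and region are assumptions for illustration, not something from the course.

import boto3

# assumption: illustrative region and queue name
sqs = boto3.client('sqs', region_name='ap-southeast-2')
queue_url = sqs.create_queue(QueueName='demo-queue')['QueueUrl']

# producer: push a message onto the queue
sqs.send_message(QueueUrl=queue_url, MessageBody='daily file has arrived')

# consumer: poll for messages (long polling, waits up to 10 seconds)
resp = sqs.receive_message(QueueUrl=queue_url,
                           MaxNumberOfMessages=1,
                           WaitTimeSeconds=10)

for msg in resp.get('Messages', []):
    print('processing:', msg['Body'])
    # SQS does not remove a message automatically; the consumer must delete it
    # after successful processing, otherwise it becomes visible again
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg['ReceiptHandle'])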

Alteryx – Get SheetNames from Excel file

Once the base macro is set, we now need another macro that can spit out the sheet names from a given Excel file. Perform the following steps to create this macro –
1. Drag in an 'Input Data' tool and connect it to any of the existing Excel files, say 'Movies.xlsx'. On connection, choose the option 'Import only the list of sheet names' (as shown below). Additionally, in the configuration pane of the tool, set 'Output File Name as Field' to 'Full Path'.

2. Drag in a ‘Formula’ tool and create a new field ‘FullPath’ with the following formula
TrimRight([FileName], '<List of Sheet Names>') + "'" + [Sheet Names] + "$'"
3. Drag in a ‘Select’ tool and deselect everything but the ‘FullPath’ field.
4. Drag in a ‘Macro Output’.
Here is how it should look –

Run the workflow and ensure full path is being shown –
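
As an aside, for anyone who wants to sanity-check the sheet list outside Alteryx, here is a minimal Python sketch of the same idea using pandas; the file name comes from the example above, while the path and the need for an Excel engine such as openpyxl are assumptions.

import pandas as pd

# assumption: the workbook sits in the current working directory
excel_path = 'Movies.xlsx'

# list every sheet name in the workbook
sheet_names = pd.ExcelFile(excel_path).sheet_names

# build the same <full path>|||'SheetName$' strings the macro constructs,
# so each one could be fed straight into an Input Data tool
full_paths = [f"{excel_path}|||'{name}$'" for name in sheet_names]
print(full_paths)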

Does experience matter?

Experience Meme | TaylorMadeMarketing©

As my current contract takes its last breath and I begin to search for opportunities, this one question has really been doing the rounds in my mind. Let me elaborate on that question before people jump the gun: does experience in a particular technology really matter if you already have knowledge of the related subject matter?

For over a decade since starting my career, I had worked with the MSBI suite of technologies, with PowerShell and C# thrown in. It wasn't until I came here to Sydney to work for a different client (LINK Group) that I had the opportunity of working with different sets of tools such as Informatica and Oracle. Things were going well.

As the work came through, I adapted to learning the new tools and working with them. It didn't take long to upskill. It was then that I got a lucky break: Vinay Sammineni of Cognitivo Consulting and his partner Alan Hsiao got hold of my CV (from a mutual friend) and saw that I had very good data warehousing and SQL skills.

They phoned me and asked if I was interested in working for Amaysim, who were looking for a data analyst with the required skills being Alteryx, Tableau and Amazon Redshift, and further stated that they had already set up an interview with them. This was quite a shock, and I clearly remember asking them multiple times if they had gone through my CV, as I didn't have any experience in any of those tools up to that point.

The interview with Jacquie went well, and the questions were mostly focused on standard SQL and my past work experience – the challenges I had faced, dealing with demanding managers etc. She then went on about the tools the company was using; she was quite blasé about me not knowing them and said I shouldn't find any difficulty getting acclimatised to them.

That was how my new path began to take shape. I honestly can't thank the folks at Amaysim, and also Vinay and Alan, enough for believing in me and my ability to transition my existing knowledge and use it to develop new skills.

As I start applying for Data Engineering roles, I hope I come across a company that can see my capability to transition my existing breadth of knowledge, rather than focusing purely on the exact tech skills I have worked with.

Alteryx – Max Date from multiple excel files with varying schemas

As the title states, the requirement is simple – extract the max date from multiple Excel files with varying schemas. I had a task where the requirement was to go through about 30-odd Excel files and pick the max date from each of them for monitoring purposes. The files do not necessarily have the same schemas, i.e. the same number of columns or even the same sheet name.

Just looking at the requirement, I knew this was a cakewalk for Alteryx. You see, once you get a small taste of how a macro works, you begin to realise the incredible power of being able to customise every single component of the workflow – input tool, output tool, filter, join. Anything, you just name it, it will do it for you. It puts those tiny black thingies on the controls and hallelujah! you are set.

Alteryx Designer x64 - Macros_configuration.png

I did some quick mental maths and bam, I knew I could rig this up effortlessly.

Without any further ado, let's get into this with an example. Like every other person out there I consume a lot, I mean a LOT, of audio/video content, be it movies, series or podcasts. So I have three files to keep track of these –

1. Movies Seen.xlsx containing the following data-

Alteryx Designer x64 - Blog_Excel_Movies_Sample

2. Series Seen.xlsx containing the following data –

Alteryx Designer x64 - Blog_Excel_Series_Sample

3. Podcasts Heard.xlsx containing the following data –

Alteryx Designer x64 - Blog_Excel_Podcasts_Sample

As can be seen, all the files above have different column names and different sheet names too. My goal is to find out when I last saw a movie, watched a series or heard a podcast.

Here is the expected result –

Solution –

My first step was to create a lookup file – media_config_data.csv which looks like below –

Alteryx Designer x64 - Blog_Config_Data

We will make use of this later. First, let's create a simple workflow that gives the max date from one of the files above, i.e. Movies Seen.xlsx. Here is a screenshot of the base workflow –

The workflow does the following things –
1. Use 'Input Data' and connect to the Excel file.
2. Add a 'Formula' tool, create a new column 'Max Date' and set it to the 'Last Watched On' field.
3. Add a 'Select' tool and deselect all the columns except the 'Max Date' and 'File Name' columns.
4. Add a 'Summarize' tool and take the Max of 'Max Date', outputting it as 'Last Max Date'.
5. Generate a dummy output.

Ensure that the workflow runs without any errors. The next step is to create a 'macro' out of it. Some of the 'comments' in the above screenshot should give an idea of what is going to happen.

Step 1 – Macro to obtain Max of Date column

Perform the following steps to create the macro. This is where the actual magic happens –
1. Drag in three 'Control Parameters' onto the canvas and set the labels as –
Excel Name, Sheet Name, Date Column Name.
2. Connect the 'Excel Name' control parameter's magnifying glass anchor to the lightning bolt anchor of the 'Input Data' tool. An 'Update Value' action tool will pop up.
3. On selecting the 'Update Value' tool, perform the following actions as shown below –

4. Repeat steps (2) and (3) for the 'Sheet Name' control parameter, this time setting 'Replace a specific string:' to 'Movies'.
5. Now connect the 'Date Column Name' control parameter to the lightning bolt anchor of the 'Formula' tool and perform the actions as seen below –

6. Lastly remove the ‘Output’ tool and connect ‘Macro output’.
7. Save the file as ‘Obtain Max Date from Excel File’.

Here is how the macro would look once done –

Step 2 – Workflow to process the files
This workflow is where we use the Directory tool to obtain all the Excel files (file name and full file path), join that up with the config file, and obtain the three variables that the above macro needs –
1. Excel Name
2. Sheet Name
3. Date Column Name
Here is how the workflow would look –

On running the workflow, we get the desired output as shown below –
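
For readers who want to see the same logic outside Alteryx, here is a minimal, hedged Python sketch of what the batch macro effectively does. The config column names are assumptions that mirror the three control parameters above, and pandas would need an Excel engine such as openpyxl installed.

import pandas as pd

# assumption: one config row per workbook, with columns matching the three control parameters
config = pd.read_csv('media_config_data.csv')

results = []
for _, row in config.iterrows():
    # read only the sheet named in the config for this workbook
    df = pd.read_excel(row['Excel Name'], sheet_name=row['Sheet Name'])
    # take the max of the date column named in the config
    max_date = pd.to_datetime(df[row['Date Column Name']]).max()
    results.append({'File Name': row['Excel Name'], 'Last Max Date': max_date})

print(pd.DataFrame(results))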

So who do you really work for?

Yesterday I attended a Town Hall meeting of Genpact, whom I currently contract with. It was a very interesting meeting, with about 70-odd people attending. One of the things that pleasantly surprised me was the strength of the non-Indian representation. Genpact is one of the mid-tier IT firms based out of India, and having worked for such consultancy firms before, I expected a large Indian diaspora with the occasional Aussie.

This was very much the case in the first company I worked for here in Australia, contracting for LINK Group as a MindTree employee. At the annual parties we had back then, I could count on my fingers the number of Aussies in the whole pack. So it was quite a sight to see a change.

The one difference I could see right away is that whereas at MindTree there was a predominant presence of developers, here I could see the opposite. This would only mean, at least to me, that expansion is yet to happen for the company. There were two main speakers who gave an update on the company's outlook, how it fared compared to last year, and the exciting new clients they bagged this year. The company looks to be heading in a very strong direction with a really good performance outlook.

Right at the end, the meeting veered into the employee feedback they had obtained, what it meant to them, how they were going to address it etc. It was at this point that I kind of zoned out.

It reminded me of all the times I have been a salaried full-time employee, when talks like this used to invigorate a sense of belonging in me. Ever since I have been contracting, events like this put things in a different perspective.

After the meeting, I stayed on for a while just to have a casual chat with some of the folks there. It was during that time I got asked – so who do you really work for?

That's when, for the first time, I got a sense of liberation. Being on contract is, in a way, being on your own. In the truest sense the answer would be that I work for the client, as that is my primary responsibility. The question, though, is much broader than that. In the absence of allegiance, for whom am I really working?

Ever since the day I started working, my first and foremost dedication has been to the quality of the work I deliver. It has to be flawless, easily scalable, extensible and, most importantly, well-documented. It is the work that gives me the utmost satisfaction. The one thing I have consciously decided to focus on now is improving my technical knowledge and gaining inroads into the big data engineering space.

Day #2 – Data Modeling

Day #2 of my course involved getting an overview of data modelling. The course started off with basic introductions to relational and cloud databases. The course per se touched only on basic terms and was a bit underwhelming with respect to the intro to the PostgreSQL database.

Cassandra database is the next hurdle to cross and work on.

Day #1 – Getting the engine started

Yesterday marked my first day in the Data Engineer Nanodegree course offered by Udacity. After thinking a lot about how to best equip myself, enrich my knowledge of the world of Big Data and take steps towards it, this course came along covering THE essential things that I wanted to learn – cloud data warehouses, Spark and data lakes.

What further sealed the deal was that I am working on a project where we are using Spark and a data lake as well; however, that part is being handled by a separate team. My involvement so far has been limited to writing Impala queries, creating data structures, testing the Sqoop jobs and occasionally tuning queries by looking at the logs to understand which partitioning is better. I reasoned that doing this course would give me better ammo to pitch myself to get into the Data Lake team. Time will tell (fingers crossed).

I have been longing for an opportunity to pivot my career from traditional BI to Data Engineering on big data platforms. Here is a course that not only promises to teach the nitty-gritty of being a Data Engineer in a properly structured, methodical way, but also to help with shaping my career via services like resume editing and LinkedIn profile setup. Long way to go for that.

So here is what my Day 1 (yesterday) felt like – absolutely wonderful!
In the first few videos I really got to know what a Data Engineer actually is, what the other titles mean and how they stack up.

What resonated with me a lot was this article, which was one of the reading materials. It spoke volumes to me, as this was exactly the path I had been following all through my career. I started off writing ETL packages via SSIS on traditional OLTP/OLAP databases, designing cubes off them, and designing and developing reports based on them.

All of that stopped about 3 years ago, and it was only a year ago that I moved completely off it. I am now working on data sources that are disparate in nature or are built on the data lake. This is a brand new world for me and I am loving every part of it. The challenges are different, more exciting, and there is SO much more to be done.

Looking at how data has proliferated and how traditional RDBMS technologies are no longer sufficient to cater to the growing needs of business, I am happy to see the organic growth in myself. Of course, the forces that have shaped me to be where I am today are largely due to the work done in BI, but stepping into the new future I need more ammo.

Coming back to the course, I started off with Data Modelling basics and some intro into PostgreSQL.

The next post will be more structured. The purpose of these posts with the # tag is to motivate myself to read every day and share my thoughts on my learning.