Yesterday marked my first day in the Data Engineer Nanodegree course offered by Udacity. After thinking a lot on how to best equip myself and enrich my knowledge in the world of Big Data and taking the steps towards it, this course came along talking about THE essential things that I wanted to learn – Cloud data warehouses, Spark and Data Lakes.
What further sealed the deal was I am working on a project where we are using Spark and Data Lake as well. However, it is being handled by a separate team. My involvement so far has been to the extent of writing Impala queries, creating data structure, testing the sqoop queries and occasionally query tuning by looking at the logs to understand which partitioning is better. I reasoned that doing this course will give me a better ammo to pitch myself to get into the Data Lake team. Time will tell (fingers crossed)
I have been longing for an opportunity to pivot my career from the traditional BI to Data Engineering on Big Data Platforms. Here is a course that not only promises to teach the nitty gritties of being a Data Engineer with a proper structured methodical teaching but also help with shaping up my career via services like resume editing and LinkedIn page setup. Long way to go for that.
So here is what my Day 1 (yesterday) felt like so far- Absolutely wonderful!
In the first few videos I have really gotten to know what Data Engineer really means and what other titles actually mean and how they stack up.
What resonated me a lot was this article that was one of the materials to read up. It spoke volumes to me as this was exactly the path I had been following all through my career. I started off writing ETL packages via SSIS on traditional OLTP – OLAP databases, designing cubes off of it, designing and developing reports based on it.
All these have stopped about 3 years ago and it was only a year ago, I am completely off it. I am now working on data sources which are disparate in nature or are built on the Data Lake. This is a brand new world for me and am loving every part of it. The challenges are different, more exciting and there is SO much more to be done.
Looking at the evolution of how data has proliferated and how the traditional RDBMS technologies are not sufficient to cater the growing needs of business, I am happy to see the organic growth in me. Of course, to be where I am today, the forces that have shaped me are largely due to the work done in BI but stepping into new future I need more ammo.
Coming back to the course, I started off with Data Modelling basics and some intro into PostgreSQL.
Next post would be more structured. The purpose of this post with # tag is to motivate myself to read every day and share my thoughts on my learning.
Yet another strange day to be in. I kept getting the following error –
Error – The Designer x64 reported: InboundNamedPipe::ReadFile: Not enough bytes read. The pipe has been ended¶
I tried several methods – clearing of temp files, restarting alteryx, restarting my system multiple times, logging on and logging off etc to no avail. It just kept failing.
What added to the confusion is, it was happening only for few tables whose volumes are much smaller in size. I mean to say Table 1 with about ~20M records from the same database, we were able to extract successfully whereas from Table 2 with only ~3.5M we were facing this problem. Seemed really strange. The good old internet hadn’t turned up with anything useful and neither the Alteryx Forums. All the info that I got was that during one stage of workflow , it was running into error.
The case for me though is in my workflow apart from the ‘Input Tool’ from where I was pulling in the data and the ‘Output Tool’, there was nothing in between. Still it was failing.
Anyway I just took a break, thought for a while and said to myself – “When was the last time you had such strange error with Alteryx and Postgres combo?” Oh yeah, right here.
Hmm, why not try the same fix? I sure did and you know what! It just god darn works! Don’t even ask me how or why, it works. So gents and ladies, here is the fix you gotta do –
Go to the, Pre SQL Statement of your Alteryx ‘Input Tool’ and insert the following –
set client_encoding to ‘UTF-8’
As simple as that! Thank me later.
This is one of the most troublesome errors that I had encountered for long time and finally at last I found a way to get past it.
I kept getting the following error when trying to fetch data from my PostgreSQL source. This error was an arbitrary one which used to pop based on the data within. Here is how the standard error message is shown in Messages window of Alteryx –
Error: Input Data (11): Error SQLExecute: ERROR: character with byte sequence 0xe2 0x80 0xad in encoding “UTF8” has no equivalent in encoding “WIN1252”;
Error while executing the query
The byte sequence that I highlighted corresponds to blank space. I tried trimming out blank spaces in my data. I then got another error this time indicating another character. The first workaround to this problem that I applied was to create one more ODBC data source with a Unicode driver. Now that resulted in another problem altogether. All the date format of the input data had changed to text.
Here is the actual fix that helped me. Reading online I did see that one would need to change the setting from WIN1252 to UTF8. Now that I didn’t want to do at database level and I wasn’t sure where to put it. It is after bit of tinkering I see putting that in the Pre-Execute SQL Statement does the trick as shown below –
One of the tasks that I had needed me to have a proxy table of dates using which I had to achieve other business logic. Using PostgreSQL and recursive CTE I was able to achieve this.
Here is the code –
--cte to get range of numbers, which in this case is 1 to 90
with recursive ints (n) AS (
select n+1 from ints where n+1<= 90
--cte to get date ranges by using interval function and
--multiplying the ranges
select current_date - interval '90days' + interval '1day' * n as dt
--the query below will now get you last 90 days date ranges
select dt from cte_dates</pre>
The point to note is you can multiply the intervals that you specify using the good old multiplier.